[OpenAFS-devel] Re: OpenAFS Licensing Update Discussion

Thu, 24 Jun 2021 19:40:49 -0500

On Tue, 15 Jun 2021 21:04:41 -0400
Jeffrey E Altman <jaltman@auristor.com> wrote:

> In order to narrow the scope of work the contributors that are known to
> have contributed tens or hundreds of commits to the necessary source
> files will be approached to execute CLAs first.  

I've been messing around with semi-automating some of the process of
identifying/categorizing relevant files and commits.

First, to identify which source files are included in libafs on Linux, I
have this branch/commit:
<https://github.com/adeason/openafs/commits/adeason/linux-libafs-printdeps>
This could be done manually, of course, but I wanted to see if we could
get such a list from the build system itself (since of course it knows
what's going in).

When building the Linux kernel module, that commit prints out the source
files that were built, and the header files we depend on (calculated
from gcc -MMD). The commit message includes example output from a
typical amd64 system; I don't know if we need to bother checking whether
that list changes on different arches / versions, but it should be easy
to do so. However, you do need a somewhat newish GNU make to run it;
3.82 on a CentOS 7 box I had is not new enough (but 4.2.1 from another
box is).

That's also not completely automatic, since it doesn't print the full
path of the relevant source file, and we rename some and move them
around. This just provides a list to start from; it's not hard to figure
out the rest by hand.
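
For anyone who wants to post-process that kind of dependency output,
here is a rough Python sketch of reading the Makefile-style rules that
gcc -MMD writes. It is not taken from the branch above; parse_depfile
and the sample file names are made up for illustration, and it assumes
paths contain no spaces or colons.

```python
def parse_depfile(text):
    """Parse Makefile-style dependency rules ("target: dep dep ...").

    Returns a dict mapping each target to its list of dependencies.
    """
    deps = {}
    # Join backslash-continued lines into one logical line per rule.
    logical = text.replace("\\\n", " ")
    for line in logical.splitlines():
        if ":" not in line:
            continue
        target, _, rest = line.partition(":")
        deps.setdefault(target.strip(), []).extend(rest.split())
    return deps


# Illustrative input only; not real output from the build.
example = "afs_call.o: afs_call.c afs/param.h \\\n  afs/sysincludes.h\n"
deps = parse_depfile(example)
headers = sorted(d for ds in deps.values() for d in ds if d.endswith(".h"))
print(headers)
```

The result still only has whatever paths the compiler was given, so
mapping them back to locations in the source tree remains a manual step.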

I can submit that to gerrit if people prefer. As it exists now, the
logic is unconditional so it shouldn't be included in the actual openafs
tree. Maybe it could be done conditionally for a special build target or
make var? Or maybe that's not worth the effort.

Next, there's the process of identifying commits that touch those files.
I have a script here to assign metadata "tags" to the relevant commits,
so we can categorize them. These aren't git tags; for example, a commit
can be tagged with 'linux-kernel' to say it's for the linux kernel
module, or 'license:ibm' to show that it touches an IPL-licensed file.
These "tags" are just arbitrary strings that we can use with various
conventions to convey meaning.

The tool is here: <https://github.com/adeason/openafs-mtag>. That
contains the scripting itself and the commit tag-based metadata. It can
live somewhere else if needed, but having it in a git repo seems helpful
to track progress. I've committed both the tool's sources and the
resulting data it generates. Some comments in mtag.py show example
usage.

This tool assigns various 'tags' based on what file a commit touches,
who the author is, etc. The idea is that you specify the "rules" of what
tags are used in mtag.yaml, and the tool spits out another yaml file
(commits.yaml) that says what commits have what tags. Then you can
specify what tags to ignore (e.g. ignore the 'tiny' tag to ignore
commits that are too small to be copyrightable, ignore commits authored
by me once I've signed a CLA, etc).
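
To make the idea concrete, here is a minimal Python sketch of rule-based
tagging followed by an "ignore" filter. This is not the mtag.py code or
the mtag.yaml format; PATH_RULES, tags_for_commit, remaining, the commit
ids, and the path-to-tag mapping are all hypothetical and say nothing
about the actual licensing of any file.

```python
from fnmatch import fnmatch

# Hypothetical rules: glob on a touched path -> tags to assign.
PATH_RULES = [
    ("src/afs/LINUX/*", ["linux-kernel"]),
    ("src/rx/*", ["linux-kernel", "license:ibm"]),
]


def tags_for_commit(paths, added_lines, tiny_limit=3):
    """Collect tags for one commit from the paths it touches."""
    tags = set()
    for path in paths:
        for pattern, rule_tags in PATH_RULES:
            if fnmatch(path, pattern):
                tags.update(rule_tags)
    if added_lines <= tiny_limit:
        tags.add("tiny")
    return tags


def remaining(commits, ignore):
    """Return ids of commits that carry none of the ignored tags."""
    return [cid for cid, tags in commits.items() if not (tags & ignore)]


commits = {
    "aaa111": tags_for_commit(["src/afs/LINUX/osi_vnodeops.c"], added_lines=2),
    "bbb222": tags_for_commit(["src/rx/rx.c"], added_lines=40),
}
print(remaining(commits, ignore={"tiny"}))
```

Here only the second commit is left once 'tiny' is ignored; adding an
author-based tag to the ignore set would work the same way.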

The data is output in YAML, but it could easily go into a CSV or
something else if people wanted; I just picked something. It's written
in Python, but most of the heavy lifting is done by 'git log' (ignoring
whitespace changes and following renames), so it's not too slow. You
don't need to know Python to use it; you do need to interact with YAML,
but it should be pretty intuitive.

I've filled in some info for each of the relevant source files used for
the linux kernel module, and some author-related tags, sometimes
guessing at some details. I haven't really gone through the individual
commits to judge what can be ignored; I've just included a single
manually-tagged commit as an example.

The current cutoff for "is this commit too small to be copyrightable" is
"does this commit add 3 lines or fewer to the files we care about". The
calculation of this is the normal 'git log -w --numstat' logic, so a
changed line is 1 added line (and 1 removed line). The limit of '3
lines' is arbitrary; I don't know if there's a standard or something to
reference for the chosen limit.
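
As a sketch of that cutoff (assumed helper names, not the code in
mtag.py), the following Python counts added lines from the tab-separated
"added, deleted, path" lines that 'git log --numstat' prints for one
commit, looking only at a given set of relevant paths. The sample input
is made up, and rename entries (paths shown as "old => new") are not
handled.

```python
def added_lines(numstat_text, relevant):
    """Sum the 'added' column for relevant paths in numstat output."""
    total = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and commit header lines
        added, _deleted, path = parts
        # numstat prints "-" instead of counts for binary files.
        if path in relevant and added != "-":
            total += int(added)
    return total


def is_tiny(numstat_text, relevant, limit=3):
    """True if the commit adds 'limit' lines or fewer to relevant files."""
    return added_lines(numstat_text, relevant) <= limit


# Made-up numstat output: two lines changed in a file we care about.
sample = "2\t2\tsrc/afs/afs_call.c\n10\t0\tdoc/notes.txt\n"
relevant = {"src/afs/afs_call.c"}
print(added_lines(sample, relevant), is_tiny(sample, relevant))
```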

Some other things that I've noticed as I've been doing this:

- Some commits are authored by someone who is a known employee of some
  company or other organization, so the company owns the copyright and
  not the person. Sometimes it's not clear whether this is the case;
  e.g. I don't know if an @mit.edu commit is by someone who did it as an
  employee of MIT, or by someone who is just a student or whatever. I've
  tried tagging every commit where such an org looks like it could be
  relevant with 'org:domain.tld'.

- iirc, works created by employees of the US federal government are
  generally public domain. I wasn't sure if that means everything
  submitted by @anl.gov, @*.mil, etc doesn't need a CLA and can be
  ignored, or if they need some special inspection or what.

- Some files have a copyright notice stating they are copyright IBM and
  under the IPL, but the commit history suggests that it's not accurate.
  (e.g. it was submitted years after openafs 1.0, by a non-IBM person.)
  I've tagged these as "license:ibm_maybe" to be examined later.

- Since we have rxgen-generated source in the linux kernel module, I
  flagged all of the rxgen sources as being relevant. I don't know if
  that's needed (or how much this matters). Similarly, I also included
  src/config/Makefile.version-NOCML.in; normally I don't look at
  Makefiles, but that one has the logic for generating
  AFS_component_version_number.c.
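
For the 'org:domain.tld' tags mentioned in the list above, one simple
way to derive a candidate tag from an author address looks like the
Python sketch below. The function name is hypothetical, and using the
full domain after the '@' (instead of collapsing subdomains) is an
assumption; deciding whether the org actually owns the copyright still
needs a human.

```python
def org_tag(email):
    """Return 'org:<domain>' for an author email, or None if malformed."""
    if "@" not in email:
        return None
    domain = email.rsplit("@", 1)[1].strip().lower()
    return f"org:{domain}" if domain else None


print(org_tag("someone@MIT.EDU"))
```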

-- 
Andrew Deason
adeason@dson.org