Calculating authorship with git and awk

A few days ago, Chris asked on our Intridea instance if anyone knew how to ask git how much code was written by a certain author. This kind of request came up fairly regularly at my previous job, and as far as I know there isn’t a built-in git command for it, so I took a few minutes and came up with something.

First of all, “git log --numstat” will give us output like this:
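The original listing is not reproduced here, so what follows is a hand-written sample of the general shape of that output. The hashes, author names, and paths are hypothetical. Real git prints full 40-character hashes and separates the numstat fields with tabs rather than spaces.

```shell
# Hypothetical sample of `git log --numstat` output (not taken from a real repo).
# Each numstat line reads: lines inserted, lines deleted, path of the file.
cat > sample.log <<'EOF'
commit 1111111
Author: alice <alice@example.com>
Date:   Mon Jan 5 10:00:00 2009 -0500

    Add widget model

10      2       app/models/widget.rb
3       0       README

commit 2222222
Author: bob <bob@example.com>
Date:   Tue Jan 6 11:00:00 2009 -0500

    Vendor the framework

200     0       framework/core.rb
5       5       app/models/widget.rb

commit 3333333
Author: alice <alice@example.com>
Date:   Wed Jan 7 12:00:00 2009 -0500

    Fix widget bug

1       1       app/models/widget.rb
EOF

cat sample.log
```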

This output is very strongly patterned. When I see strongly patterned output in the console, the first thing I think of is awk. Write-ups for awk abound (my personal favorite is here), but the basics are simple: you define a series of [pattern, action] pairs. When the awk script is run, each line of input is compared to the patterns in turn, and when a pattern matches, the associated action is executed.

Let’s look at the evolution of this “score authorship” script as an example. Save this as “git_score.awk” and then (from inside a git repo) run “git log --numstat | awk -f /path/to/git_score.awk”
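The first version of the script did not survive in this copy of the post, so here is a minimal sketch of what a commit-counting script along these lines might look like. The array name “commits” and the sample log are my assumptions, and the sketch anchors the pattern as /^Author:/ so that only lines beginning with “Author:” match.

```shell
# Sketch of a first version: count commits per author.
cat > git_score.awk <<'EOF'
# Each commit has one "Author:" line; $2 is the word right after "Author:".
/^Author:/ { commits[$2]++ }

# Once all input is read, print one line per author.
END { for (author in commits) print author, commits[author] }
EOF

# Hypothetical stand-in for real `git log --numstat` output.
cat > sample.log <<'EOF'
commit 1111111
Author: alice <alice@example.com>
Date:   Mon Jan 5 10:00:00 2009 -0500

    Add widget model

10      2       app/models/widget.rb
3       0       README

commit 2222222
Author: bob <bob@example.com>
Date:   Tue Jan 6 11:00:00 2009 -0500

    Vendor the framework

200     0       framework/core.rb
5       5       app/models/widget.rb

commit 3333333
Author: alice <alice@example.com>
Date:   Wed Jan 7 12:00:00 2009 -0500

    Fix widget bug

1       1       app/models/widget.rb
EOF

# awk visits array keys in an unspecified order, so sort for stable output.
awk -f git_score.awk sample.log | sort
```

On the sample above this prints “alice 2” and “bob 1”. Against a real repository, replace the sample file with the piped output of git log.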

Regular-expression patterns are delimited by forward slashes, and actions are surrounded by curly braces. The special pattern END matches once the end of input has been reached. The pattern /Author:/ matches any line containing “Author:” (write /^Author:/ to match only lines that begin with it), and awk automatically sets the variable $2 to the second whitespace-separated word on that line. Array elements in awk are autovivified (they spring into existence when referenced) and start out empty, which counts as zero in arithmetic. Also, awk, like PHP, uses associative arrays throughout (in ruby/perl parlance, hashes and arrays are the same type). Running the script should give you a listing of the authors who contributed to the repo and a count of the commits each author made.

Convinced that this would work, I continued:
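The second version is also missing from this copy, so the following is a sketch of how the change counting could be added. The variable “author” and the arrays “ins” and “del” are names I picked for illustration, and the sample log is hypothetical.

```shell
# Sketch of a second version: commits plus inserted/deleted lines per author.
cat > git_score.awk <<'EOF'
# Remember whose commit we are in, and count the commit.
/^Author:/ { author = $2; commits[author]++ }

# numstat lines start with a digit: $1 = insertions, $2 = deletions.
/^[0-9]/ { ins[author] += $1; del[author] += $2 }

# "+ 0" forces a numeric 0 for authors with no counted changes.
END {
    for (a in commits)
        print a, commits[a], ins[a] + 0, del[a] + 0
}
EOF

# Hypothetical stand-in for real `git log --numstat` output.
cat > sample.log <<'EOF'
commit 1111111
Author: alice <alice@example.com>
Date:   Mon Jan 5 10:00:00 2009 -0500

    Add widget model

10      2       app/models/widget.rb
3       0       README

commit 2222222
Author: bob <bob@example.com>
Date:   Tue Jan 6 11:00:00 2009 -0500

    Vendor the framework

200     0       framework/core.rb
5       5       app/models/widget.rb

commit 3333333
Author: alice <alice@example.com>
Date:   Wed Jan 7 12:00:00 2009 -0500

    Fix widget bug

1       1       app/models/widget.rb
EOF

# Columns: author, commits, insertions, deletions.
awk -f git_score.awk sample.log | sort
```

Binary files show up in numstat output with “-” in place of the counts, so the digit pattern skips them without any extra work.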

Now we count changes by adding a pattern to match any lines that begin with a digit.

These are the numstat lines, whose fields are lines inserted, lines deleted, and the path of the changed file. We add the first two fields to a running tally of changes per author using two more autovivified arrays, and print the totals when we finish.

At this point, Chris mentioned that he’d like to ignore counting contributions to files in the framework directory. That’s accomplished easily enough, and while we’re at it, let’s count files as well as insertions/deletions.
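A sketch of that step, under the same assumptions as before (hypothetical array names and sample log). The exclusion here tests the path field with $3 ~ /^framework\//, which is one possible way to write it and may differ from the pattern the original script used.

```shell
# Sketch of a third version: skip framework/, and count files as well.
cat > git_score.awk <<'EOF'
/^Author:/ { author = $2; commits[author]++ }

# Exclusion rule: numstat lines whose path is under framework/ are dropped.
/^[0-9]/ && $3 ~ /^framework\// { next }

# Inclusion rule: every remaining numstat line is one changed file.
/^[0-9]/ { files[author]++; ins[author] += $1; del[author] += $2 }

END {
    for (a in commits)
        print a, commits[a], files[a] + 0, ins[a] + 0, del[a] + 0
}
EOF

# Hypothetical stand-in for real `git log --numstat` output.
cat > sample.log <<'EOF'
commit 1111111
Author: alice <alice@example.com>
Date:   Mon Jan 5 10:00:00 2009 -0500

    Add widget model

10      2       app/models/widget.rb
3       0       README

commit 2222222
Author: bob <bob@example.com>
Date:   Tue Jan 6 11:00:00 2009 -0500

    Vendor the framework

200     0       framework/core.rb
5       5       app/models/widget.rb

commit 3333333
Author: alice <alice@example.com>
Date:   Wed Jan 7 12:00:00 2009 -0500

    Fix widget bug

1       1       app/models/widget.rb
EOF

# Columns: author, commits, files, insertions, deletions.
awk -f git_score.awk sample.log | sort
```

With the 200-line framework file excluded, bob drops from 205 insertions to 5.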

It’s important that the exclusion pattern come BEFORE the inclusion pattern. The “next” action will skip matching any further patterns and immediately start over on the next line of input. If we also wanted to skip files in vendor, we could use alternation in the pattern so that it matches either directory.
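The widened pattern itself is missing from this copy. One way to write it, assuming the exclusion tests the path field as a regular expression, is with alternation. The three input lines are hypothetical.

```shell
# Hypothetical numstat lines; only the path outside both directories survives.
printf '%s\n' \
  '10 2 app/models/widget.rb' \
  '200 0 framework/core.rb' \
  '50 0 vendor/plugins/foo.rb' |
awk '
  # Skip numstat lines whose path starts with framework/ or vendor/.
  /^[0-9]/ && $3 ~ /^(framework|vendor)\// { next }
  # Anything that reaches this rule would be counted.
  /^[0-9]/ { print $3 }
'
```

Swapping the two rules would let the second one see every line before “next” had a chance to skip it, which is why the order matters.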

One more problem: Chris asked for percentages, and at this point we’re outputting counts. This may not be the neatest way to convert between the two, but it is an easy one (unless you have an author named “tot”!)
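The remark about an author named “tot” suggests that the totals were kept in the same arrays under that key. The sketch below assumes exactly that. The helper pct() and the output format are my own additions, and the sample log is hypothetical.

```shell
# Sketch of a percentage version: totals live under the reserved key "tot".
cat > git_score.awk <<'EOF'
# Percentage helper; returns 0 rather than dividing by zero.
function pct(n, total) { return total ? 100 * n / total : 0 }

/^Author:/ { author = $2; commits[author]++; commits["tot"]++ }

/^[0-9]/ && $3 ~ /^framework\// { next }

/^[0-9]/ {
    files[author]++;   files["tot"]++
    ins[author] += $1; ins["tot"] += $1
    del[author] += $2; del["tot"] += $2
}

END {
    for (a in commits) {
        if (a == "tot") continue   # the totals row is not an author
        printf "%s commits=%.1f%% files=%.1f%% insertions=%.1f%% deletions=%.1f%%\n",
            a, pct(commits[a], commits["tot"]), pct(files[a], files["tot"]),
            pct(ins[a], ins["tot"]), pct(del[a], del["tot"])
    }
}
EOF

# Hypothetical stand-in for real `git log --numstat` output.
cat > sample.log <<'EOF'
commit 1111111
Author: alice <alice@example.com>
Date:   Mon Jan 5 10:00:00 2009 -0500

    Add widget model

10      2       app/models/widget.rb
3       0       README

commit 2222222
Author: bob <bob@example.com>
Date:   Tue Jan 6 11:00:00 2009 -0500

    Vendor the framework

200     0       framework/core.rb
5       5       app/models/widget.rb

commit 3333333
Author: alice <alice@example.com>
Date:   Wed Jan 7 12:00:00 2009 -0500

    Fix widget bug

1       1       app/models/widget.rb
EOF

awk -f git_score.awk sample.log | sort
```

Keeping the totals in separate scalar variables would avoid the name clash with an author called “tot”, at the cost of a few more lines.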

I showed this to Chris (elapsed time ~30 minutes). He was satisfied, but Simon jokingly asked for a more machine-friendly output (CSV or YAML). I considered switching to ruby (the .to_yaml method was tempting) but decided that it wasn’t necessary yet, since YAML is such a simple markup.
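A sketch of a YAML-emitting version, built on the same assumptions as the earlier sketches (totals under the key “tot”, a hypothetical pct() helper, hypothetical sample log). It writes author names as bare YAML keys, which is only safe for simple names; names containing colons or other special characters would need quoting.

```shell
# Sketch of a final version: one YAML mapping of percentages per author.
cat > git_score.awk <<'EOF'
function pct(n, total) { return total ? 100 * n / total : 0 }

/^Author:/ { author = $2; commits[author]++; commits["tot"]++ }

/^[0-9]/ && $3 ~ /^framework\// { next }

/^[0-9]/ {
    files[author]++;   files["tot"]++
    ins[author] += $1; ins["tot"] += $1
    del[author] += $2; del["tot"] += $2
}

END {
    for (a in commits) {
        if (a == "tot") continue   # skip the totals row
        printf "%s:\n", a
        printf "  commits: %.1f\n",    pct(commits[a], commits["tot"])
        printf "  files: %.1f\n",      pct(files[a], files["tot"])
        printf "  insertions: %.1f\n", pct(ins[a], ins["tot"])
        printf "  deletions: %.1f\n",  pct(del[a], del["tot"])
    }
}
EOF

# Hypothetical stand-in for real `git log --numstat` output.
cat > sample.log <<'EOF'
commit 1111111
Author: alice <alice@example.com>
Date:   Mon Jan 5 10:00:00 2009 -0500

    Add widget model

10      2       app/models/widget.rb
3       0       README

commit 2222222
Author: bob <bob@example.com>
Date:   Tue Jan 6 11:00:00 2009 -0500

    Vendor the framework

200     0       framework/core.rb
5       5       app/models/widget.rb

commit 3333333
Author: alice <alice@example.com>
Date:   Wed Jan 7 12:00:00 2009 -0500

    Fix widget bug

1       1       app/models/widget.rb
EOF

# The order of the author blocks is unspecified; YAML mappings do not care.
awk -f git_score.awk sample.log
```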

There you have it. Authorship percentages for a git repository. Just save the above to a file and run it as “git log --numstat | awk -f /path/to/the/file/you/just/saved”. Or create an alias in your .shell_rc file.