Calculating authorship with git and awk

A few days ago, Chris asked on our Intridea instance of Present.ly if anyone knew how to ask git how much code was written by a certain author. This kind of request came up fairly regularly at my previous job, and as far as I know, there isn’t a built-in git command for it, so I took a few minutes and came up with something.

First of all, “git log --numstat” will give us output like this:

    commit c82826f6616a68b84303176c8a702e389dfc7be5
    Author: paul <paul@afcfe4b9-ce34-4591-b728-90b6288d594e>
    Date:   Tue Feb 10 22:18:34 2009 +0000

        Refactor for increased speed

    53      41      app/controllers/degrees_controller.rb
    1       1       app/views/degrees/demographics.xml.builder
    2       14      spec/controllers/degrees_controller_spec.rb
    79      53      spec/views/degrees/demographics_xml_builder_spec.rb

    commit 93bf620d210afde1835133f0feeb5ba8a5076d3c
    Author: paul <paul@afcfe4b9-ce34-4591-b728-90b6288d594e>
    Date:   Tue Feb 10 22:16:01 2009 +0000

        Improving specs for degrees

    1       1       app/controllers/degrees_controller.rb
    3       1       db/schema.rb
    44      0       spec/controllers/degrees_controller_spec.rb

This output is very strongly patterned. When I see strongly patterned output in the console, the first thing I think of is awk. Write-ups for awk abound (my personal favorite is here) but the basics are simple: you define a series of [pattern, action] pairs. When the awk script is run, each line of input is compared to the patterns in turn, and whenever a pattern matches, the associated action is executed.
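To make the [pattern, action] idea concrete before touching git, here is a minimal sketch run against three lines of made-up input (the fruit names and counts are purely illustrative, not from any real data):

```shell
# Two pattern/action pairs plus END, fed three lines of made-up input.
out=$(printf 'apple 3\nbanana 5\napple 4\n' | awk '
/^apple/  { apples  += $2 }   # runs only for lines starting with "apple"
/^banana/ { bananas += $2 }   # runs only for lines starting with "banana"
END       { print "apples:", apples, "bananas:", bananas }
')
echo "$out"
```

Each input line is tested against both patterns, and the END action runs once after the last line.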

Let’s look at the evolution of this “score authorship” script as an example. Save this as “git_score.awk” and then (from inside a git repo) run “git log --numstat | awk -f /path/to/git_score.awk”

    /^Author:/ {
       author           = $2
       commits[author] += 1
    }

    END {
       for (author in commits) {
          print author, commits[author]
       }
    }

Regular-expression patterns are delimited by forward slashes, and actions are surrounded by curly braces. The special pattern END matches once the end of input has been reached. The pattern /^Author:/ matches the lines beginning with “Author:”, and the variable $2 is automatically set by awk to the second whitespace-separated word on that line (so a multi-word author name is cut down to its first word). Arrays in awk are autovivified (elements spring into existence when referenced), and an unset element behaves as zero in arithmetic. Also, awk, like PHP, uses associative arrays throughout (in ruby/perl parlance, hashes and arrays are the same type). Running the above file should give you a listing of authors who contributed to the repo and a count of the number of commits each author made.
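If autovivification is new to you, this tiny sketch shows it in isolation. The array name “tally” and the keys are made up for the demonstration:

```shell
out=$(awk 'BEGIN {
   # No declaration needed: referencing tally["paul"] creates it, empty.
   tally["paul"] += 1          # empty is treated as 0 in arithmetic
   tally["paul"] += 1
   # An element that was never assigned reads as zero or as an empty string.
   print tally["paul"], tally["nobody"] + 0, "[" tally["nobody"] "]"
}')
echo "$out"
```

This is why the script can say “commits[author] += 1” without ever initializing anything.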

Convinced that this would work, I continued:

    /^Author:/ {
       author           = $2
       commits[author] += 1
    }

    /^[0-9]/ {
       more[author] += $1
       less[author] += $2
    }

    END {
       for (author in commits) {
          print author, "inserted", more[author], "and deleted", less[author], "lines over", commits[author], "commits"
       }
    }

Now we count changes by adding a pattern to match any lines that begin with a digit.

    1       1       app/controllers/degrees_controller.rb
    3       1       db/schema.rb
    44      0       spec/controllers/degrees_controller_spec.rb

Lines like the above hold three fields: lines inserted, lines deleted, and the path of the file changed. We add those to a running tally of changes per author using two more autovivified arrays, printing the totals when we finish.
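Two details of the numstat lines are easy to miss on screen: git separates the three columns with tabs, and for binary files it prints “-” in place of both counts. A rough sketch with made-up paths shows that awk’s default field splitting copes with the tabs and that the /^[0-9]/ pattern quietly skips the binary entry:

```shell
# Made-up numstat lines: two text files and one binary file.
out=$(printf '53\t41\tapp/a.rb\n-\t-\timg/logo.png\n2\t14\tspec/a_spec.rb\n' | awk '
/^[0-9]/ { added += $1; deleted += $2; files += 1 }   # the "-" line never matches
END      { print added, deleted, files }
')
echo "$out"
```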

At this point, Chris mentioned that he’d like to ignore counting contributions to files in the framework directory. That’s accomplished easily enough, and while we’re at it, let’s count files as well as insertions/deletions.

    /^Author:/ {
       author           = $2
       commits[author] += 1
    }

    # git separates the numstat columns with tabs, so match spaces or tabs
    /^[0-9]+[ \t]+[0-9]+[ \t]+framework/ { next }

    /^[0-9]/ {
       more[author] += $1
       less[author] += $2
       file[author] += 1
    }

    END {
       for (author in commits) {
          print author, "inserted", more[author], "and deleted", less[author], "lines over", file[author], "files"
       }
    }

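It’s important that the exclusion pattern come BEFORE the inclusion pattern. The “next” action skips matching any further patterns and immediately starts over on the next line of input. If we wanted to also skip files in vendor, we could make the pattern

    /^[0-9]+[ \t]+[0-9]+[ \t]+(framework|vendor)/

The parentheses matter: alternation binds more loosely than concatenation, so without them the pattern would also match any line that merely contains “vendor”.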
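Here is a small sketch of both points, the “next” short-circuit and the way alternation groups, using three made-up numstat lines. The variable names “grouped” and “ungrouped” are just labels for this demonstration:

```shell
# Made-up numstat lines: one in framework, one in vendor, one that only mentions vendor.
sample='5\t1\tframework/core.rb\n3\t2\tvendor/lib.rb\n7\t0\tapp/vendor_report.rb\n'

# Alternation grouped: only paths that start with framework or vendor are skipped.
grouped=$(printf "$sample" | awk '
/^[0-9]+[ \t]+[0-9]+[ \t]+(framework|vendor)/ { next }
/^[0-9]/ { print $3 }')

# Alternation ungrouped: "vendor" anywhere on the line triggers the skip.
ungrouped=$(printf "$sample" | awk '
/^[0-9]+[ \t]+[0-9]+[ \t]+framework|vendor/ { next }
/^[0-9]/ { print $3 }')

echo "grouped:   $grouped"
echo "ungrouped: $ungrouped"
```

With the grouped pattern, app/vendor_report.rb survives; with the ungrouped one, it is skipped along with the real vendor file.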

One more problem: Chris asked for percentages, and at this point we’re outputting counts. This may not be the neatest way to convert between the two, but it is an easy one (unless you have an author named “tot”!):

    /^Author:/ {
       author           = $2
       commits[author] += 1
       commits["tot"]  += 1
    }

    /^[0-9].*framework/ { next }

    /^[0-9]/ {
       more[author] += $1
       less[author] += $2
       file[author] += 1

       more["tot"]  += $1
       less["tot"]  += $2
       file["tot"]  += 1
    }

    END {
       for (author in commits) {
          if (author != "tot") {
             more[author]    = more[author] / more["tot"] * 100
             less[author]    = less[author] / less["tot"] * 100
             file[author]    = file[author] / file["tot"] * 100
             commits[author] = commits[author] / commits["tot"] * 100

             printf "%s added %.0f%% and removed %.0f%% of the lines accounting for %.0f%% of the files changed and %.0f%% of the commits\n", author, more[author], less[author], file[author], commits[author]
          }
       }
    }
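If the “tot” sentinel bothers you, one possible variation (a sketch, not what I actually ran) keeps the totals in plain variables and wraps the division in a small helper so that a repo with zero deletions doesn’t trip a division-by-zero error. The names “pct”, “total_more”, “total_less” and “total_commits” are my own inventions here, and the input is made up:

```shell
out=$(printf 'Author: paul <p@example.com>\n3\t0\ta.rb\nAuthor: chris <c@example.com>\n1\t0\tb.rb\n' | awk '
# Percentage helper: returns 0 instead of dividing by an empty total.
function pct(part, whole) { return whole ? part / whole * 100 : 0 }

/^Author:/ { author = $2; commits[author] += 1; total_commits += 1 }

/^[0-9]/ {
   more[author] += $1; total_more += $1
   less[author] += $2; total_less += $2
}

END {
   # Totals live in scalars, so every key in commits is a real author.
   for (a in commits)
      printf "%s %.0f%% %.0f%% %.0f%%\n", a, pct(more[a], total_more), pct(less[a], total_less), pct(commits[a], total_commits)
}' | sort)
echo "$out"
```

The trailing sort is only there because awk does not promise any particular order for “for (a in commits)”.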

I showed this to Chris (elapsed time ~30 minutes). He was satisfied, but Simon jokingly asked for a more machine-friendly output (CSV or YAML). I considered switching to ruby (the .to_yaml method was tempting) but decided that it wasn’t necessary yet, since YAML is such a simple markup.

    /^Author:/ {
       author           = $2
       commits[author] += 1
       commits["tot"]  += 1
    }

    # parentheses keep the alternation from matching "framework" anywhere on the line
    /^[0-9]+[ \t]+[0-9]+[ \t]+(vendor|framework)/ { next }

    /^[0-9]/ {
       more[author] += $1
       less[author] += $2
       file[author] += 1

       more["tot"]  += $1
       less["tot"]  += $2
       file["tot"]  += 1
    }

    END {
       for (author in commits) {
          if (author != "tot") {
             more[author]    = more[author] / more["tot"] * 100
             less[author]    = less[author] / less["tot"] * 100
             file[author]    = file[author] / file["tot"] * 100
             commits[author] = commits[author] / commits["tot"] * 100

             printf "%s:\n  insertions: %.0f%%\n  deletions: %.0f%%\n  files: %.0f%%\n  commits: %.0f%%\n", author, more[author], less[author], file[author], commits[author]
          }
       }
    }

There you have it. Authorship percentages for a git repository. Just save the above to a file and run it as “git log --numstat | awk -f /path/to/the/file/you/just/saved”. Or create an alias in your .shell_rc file.
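For the alias route, a shell function is a little more flexible than a plain alias because it can pass arguments through to git. This is only a sketch: the function name “git_score”, the variable “GIT_SCORE_AWK” and the default path are all hypothetical, so point them at wherever you saved the script.

```shell
# Hypothetical location of the saved awk script; adjust to taste.
GIT_SCORE_AWK="${GIT_SCORE_AWK:-$HOME/bin/git_score.awk}"

# Extra arguments go straight to git log, e.g. a revision range or --since.
git_score() {
   git log --numstat "$@" | awk -f "$GIT_SCORE_AWK"
}
```

Once that is sourced from your rc file, “git_score” scores the whole history of the current repo, and something like “git_score --since=2009-01-01” narrows the window.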