Why your research project needs build automation

Software build tools automate the compilation of source code into executable binaries. (For example, if you’ve ever compiled a Linux package from source, you’ve likely used Make.)

Like software packages, research projects are large collections of code that are executed in sequence to produce output. Your research code has a first step (download raw data) and a last step (generate paper PDF). Its input-output structure is a directed graph, called a dependency graph: each step’s outputs are inputs to later steps.

The simplest build approach for a Stata user is a “master” do file. If a project involves steps A through Z, this master file executes A, B, …, Y, and Z in order. But the “run everything” approach is inefficient: if you edit Y, you only need to rerun Y and Z; you don’t need to run A through X again. Software build tools work out which steps need rerunning for you. They can be applied to all of your research code.
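To make the “edit Y, rerun only Y and Z” logic concrete, here is a minimal Python sketch. It assumes a hypothetical project whose 26 steps form a simple chain (A feeds B, B feeds C, and so on); the function name and the graph layout are mine, for illustration only, and this is not how any particular build tool is implemented.

```python
from collections import deque


def steps_to_rerun(downstream, edited):
    """Return the edited step plus every step that depends on it,
    directly or indirectly."""
    to_run = {edited}
    queue = deque([edited])
    while queue:
        step = queue.popleft()
        for child in downstream.get(step, []):
            if child not in to_run:
                to_run.add(child)
                queue.append(child)
    return to_run


# Hypothetical linear project: A feeds B, B feeds C, ..., Y feeds Z.
steps = [chr(code) for code in range(ord("A"), ord("Z") + 1)]
downstream = {a: [b] for a, b in zip(steps, steps[1:])}

print(sorted(steps_to_rerun(downstream, "Y")))  # ['Y', 'Z']
```

Editing A would return all 26 steps, which is the only case where the master-file approach does no unnecessary work.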

Build tools use a dependency graph and information about file changes (e.g., timestamps) to produce output using all and only the necessary steps. Build automation is valuable for any non-trivial research project, and it can be particularly valuable for big data. If you need to process data for 100 cities, you shouldn’t manually track which cities are up-to-date and which need to be rerun with the latest code. Define the dependencies and let the build tool track everything.
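The timestamp comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not Make’s actual implementation, and the file layout (raw/CITY.csv, output/CITY.csv, and a script named clean_city.py) is a hypothetical one chosen for the example.

```python
import os


def is_stale(target, prerequisites):
    """True if the target is missing or older than any prerequisite.

    Raises FileNotFoundError if the target exists but a prerequisite
    is missing.
    """
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_time for p in prerequisites)


def stale_cities(cities, script="clean_city.py"):
    """List the cities whose output needs to be rebuilt.

    Assumes (hypothetically) that each city's output depends on its
    raw file and on the shared processing script.
    """
    return [
        city
        for city in cities
        if is_stale(f"output/{city}.csv", [f"raw/{city}.csv", script])
    ]
```

Because every city lists the script as a prerequisite, editing the script marks all 100 cities as stale, while updating one city’s raw file marks only that city.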

Make is an old, widely used build tool. It is available on nearly every Linux machine (e.g., it’s available inside the Census RDCs). For Mac users, Make is included in Apple’s command-line developer tools. I use Make, but there are other build tools. Gentzkow and Shapiro use SCons (a Python-based tool). If all of your code is Stata, you could try the project package written by Robert Picard, though I haven’t tried it myself.

A Makefile consists of a dependency graph and a recipe for each graph node. Define dependencies by writing a target before the colon and that target’s prerequisites after the colon. The next line, which must begin with a tab, gives the recipe that translates those inputs into output. Make can execute any recipe you can write on the command line.
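One way to see what Make does with these rules is to mimic them in Python. The sketch below is not Make syntax and not Make’s implementation; it stores each rule as a target mapped to its prerequisites and recipe, then rebuilds a target only when it is missing or older than a prerequisite. The file names and commands in RULES are hypothetical.

```python
import os
import subprocess


def run_in_shell(recipe):
    """Run a recipe the way you would type it on the command line."""
    subprocess.run(recipe, shell=True, check=True)


def build(target, rules, run=run_in_shell):
    """Bring a target up to date, handling its prerequisites first."""
    if target not in rules:
        # A file with no rule is a raw input; it must already exist.
        if not os.path.exists(target):
            raise FileNotFoundError(f"no rule to make {target}")
        return
    prerequisites, recipe = rules[target]
    for prerequisite in prerequisites:
        build(prerequisite, rules, run)
    if not os.path.exists(target) or any(
        os.path.getmtime(p) > os.path.getmtime(target) for p in prerequisites
    ):
        run(recipe)


# Hypothetical rules: target -> (prerequisites, recipe).
# The first entry corresponds to this Makefile rule
# (the recipe line starts with a tab):
#   clean.csv: raw.csv clean.py
#   	python clean.py
RULES = {
    "clean.csv": (["raw.csv", "clean.py"], "python clean.py"),
    "paper.pdf": (["clean.csv", "paper.tex"], "pdflatex paper.tex"),
}
```

Calling build("paper.pdf", RULES) in a directory containing those files would run only the recipes whose targets are out of date, which is roughly what typing make paper.pdf does.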

I have written much more about Make and Makefiles in Section A.3 of my project template. Here are four introductions to Make, listed in the order that I suggest reading them: