
Spatial Economics for Granular Settings

Economists studying spatial connections are excited about a growing body of increasingly fine spatial data. We’re no longer limited to country- or city-level aggregates. For example, many folks now leverage satellite data, so that their unit of observation is a pixel, which can be as small as 30 meters wide. See Donaldson and Storeygard’s “The View from Above: Applications of Satellite Data in Economics”. Standard administrative data sources like the LEHD publish neighborhood-to-neighborhood commuting matrices. And now “digital exhaust” extracted from the web and smartphones offers a glimpse of behavior not even measured in traditional data sources. Dave Donaldson’s keynote address on “The benefits of new data for measuring the benefits of new transportation infrastructure” at the Urban Economics Association meetings in October highlighted a number of such exciting developments (ship-level port flows, ride-level taxi data, credit-card transactions, etc.).

But finer and finer data are not a free lunch. Big datasets bring computational burdens, of course, but, more importantly, our theoretical tools need to keep up with the data we’re leveraging. Most models of the spatial distribution of economic activity assume that the number of people per place is reasonably large. For example, theoretical results describing space as continuous formally assume a “regular” geography so that every location has positive population. But the US isn’t regular, in that it has plenty of “empty” land: more than 80% of the US population lives on only 3% of its land area. Conventional estimation procedures aren’t necessarily designed for sparse datasets. It’s an open question how well these tools will do when applied to empirical settings that don’t quite satisfy their assumptions.

Felix Tintelnot and I examine one aspect of this challenge in our new paper, “Spatial Economics for Granular Settings”. We look at commuting flows, which are described by a gravity equation in quantitative spatial models. It turns out that the empirical settings we often study are granular: the number of decision-makers is small relative to the number of economic outcomes. For example, there are 4.6 million possible residence-workplace pairings in New York City, but only 2.5 million people who live and work in the city. Applying the law of large numbers may not work well when a model has more parameters than people.

Felix and I introduce a model of a “granular” spatial economy. “Granular” just means that we assume a finite number of individuals rather than an uncountably infinite continuum. This distinction may seem minor, but estimated parameters and counterfactual predictions turn out to be quite sensitive to how one handles the granular features of the data. We contrast the conventional approach and the granular approach by examining these models’ predictions for changes in commuting flows associated with tract-level employment booms in New York City. When we regress observed changes on predicted changes, our granular model does pretty well (slope about one, intercept about zero). The calibrated-shares approach (trade folks may know this as “exact hat algebra”), which perfectly fits the pre-event data, does not do very well: in more than half of the 78 employment-boom events, its predicted changes are negatively correlated with the observed changes in commuting flows.

The calibrated-shares procedure’s failure to perform well out of sample despite perfectly fitting the in-sample observations may not surprise those who have played around with machine learning. The fundamental concern with applying a continuum model to a granular setting can be illustrated by the finite-sample properties of the multinomial distribution. Suppose that a lottery allocates I independently and identically distributed balls across N urns. An econometrician wants to infer the probability that any ball i is allocated to urn n from the observed data. With infinitely many balls, the observed share of balls in urn n would reveal this probability. In a finite sample, the realized share may differ greatly from the underlying probability. The figure below depicts the ratio of the realized share to the underlying probability for one urn when I balls are distributed uniformly across 10 urns. A procedure that equates observed shares and modeled probabilities needs this ratio to be one. As the histograms reveal, the realized ratio can be far from one even when there are two orders of magnitude more balls than urns. Unfortunately, in many empirical settings in which spatial models are calibrated to match observed shares, the number of balls (commuters) and the number of urns (residence-workplace pairs) are roughly the same. The red histogram suggests that shares and probabilities will often differ substantially in these settings.

Figure: histograms of the realized share divided by the underlying probability when I balls are distributed uniformly across 10 urns.
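For readers who want to replicate the flavor of this exercise, here is a minimal Julia sketch (my own illustration, not the paper’s code; the function name, random seed, and 10,000 Monte Carlo draws are arbitrary choices):

using Random, Statistics

# Allocate I balls uniformly at random across N urns many times and record, for urn 1,
# the realized share of balls divided by the underlying probability 1/N.
function share_probability_ratios(I::Int, N::Int; draws::Int = 10_000, rng = MersenneTwister(1234))
    p = 1 / N
    ratios = Vector{Float64}(undef, draws)
    for d in 1:draws
        urns = rand(rng, 1:N, I)           # one lottery: each ball independently picks an urn
        share = count(==(1), urns) / I     # realized share of balls in urn 1
        ratios[d] = share / p              # equals one only if the share reveals the probability
    end
    return ratios
end

# With 10 balls (about as many balls as urns), the ratio is frequently far from one;
# with 1,000 balls, it concentrates near one.
for I in (10, 1_000)
    r = share_probability_ratios(I, 10)
    println("I = $I balls: 90% of realized ratios lie between ",
            round(quantile(r, 0.05), digits = 2), " and ", round(quantile(r, 0.95), digits = 2))
end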

Granularity is also a reason for economists to be cautious about their counterfactual exercises. In a granular world, equilibrium outcomes depend in part on the idiosyncratic components of individuals’ choices. Thus, the confidence intervals reported for counterfactual outcomes ought to incorporate uncertainty due to granularity in addition to the usual statistical uncertainty that accompanies estimated parameter values.

See the paper for more details on the theoretical model, estimation procedure, and event-study results. We’re excited about the growing body of fine spatial data used to study economic outcomes for regions, cities, and neighborhoods. Our quantitative model is designed precisely for these applications.

Why your research project needs build automation

Software build tools automate compiling source code into executable binaries. (For example, if you’ve ever installed a Linux package from source, you’ve likely used Make.)

Like software packages, research projects are large collections of code that are executed in sequence to produce output. Your research code has a first step (download raw data) and a last step (generate the paper PDF). Its input-output structure forms a directed acyclic graph (a dependency graph).

The simplest build approach for a Stata user is a “master” do file. If a project involves A through Z, this master file executes A, B, …, Y, and Z in order. But the “run everything” approach is inefficient: if you edit Y, you only need to run Y and Z; you don’t need to run A through X again. Software build tools automate this selective re-execution for you, and they can be applied to all of your research code.

Build tools use a dependency graph and information about file changes (e.g., timestamps) to produce output by running all and only the necessary steps. Build automation is valuable for any non-trivial research project, and it is particularly valuable for big data. If you need to process data for 100 cities, you shouldn’t manually track which cities are up-to-date and which need to run the latest code. Define the dependencies and let the build tool track everything.

Make is an old, widely used build tool. It should be available on every Linux box by default (e.g., it’s available inside the Census RDCs). For Mac users, Make is included in OS X’s developer tools. I use Make. There are other build tools. Gentzkow and Shapiro use SCons (a Python-based tool). If all of your code is Stata, you could try the project package written by Robert Picard, though I haven’t tried it myself.

A Makefile consists of a dependency graph and a recipe for each graph node. Define dependencies by writing a target before the colon and that target’s prerequisites after the colon. The next line gives the recipe that translates those inputs into output. Make can execute any recipe you can write on the command line.
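As a concrete illustration, here is a minimal Makefile sketch for a stylized Stata-plus-LaTeX project. All of the file and script names are hypothetical, each recipe line must begin with a tab, and the pattern rule at the end shows how the 100-city bookkeeping mentioned above collapses to a few lines:

# Hypothetical project: download raw data, clean it, make a figure, compile the paper.
# Each rule reads "target: prerequisites", followed by a tab-indented recipe.

paper.pdf: paper.tex figure1.pdf
	pdflatex paper.tex

figure1.pdf: figures.do clean_data.dta
	stata -b do figures.do

clean_data.dta: clean.do raw_data.csv
	stata -b do clean.do

raw_data.csv: download.do
	stata -b do download.do

# A pattern rule covers many similar targets at once, e.g. one cleaned file per city.
# "$<" is the first prerequisite (the raw city file) and "$@" is the target.
CITY_RAW   := $(wildcard raw/*.csv)
CITY_CLEAN := $(patsubst raw/%.csv,clean/%.dta,$(CITY_RAW))

cities: $(CITY_CLEAN)

clean/%.dta: raw/%.csv clean_city.do
	stata -b do clean_city.do $< $@

In this sketch, editing clean.do would lead Make to rerun only the cleaning, figure, and PDF steps; the download step would be left alone.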

I have written much more about Make and Makefiles in Section A.3 of my project template. Here are four introductions to Make, listed in the order that I suggest reading them:

Research resources that I recommend

While advising PhD students, I find myself repeatedly suggesting the same tools and tricks. Since these are general-purpose technologies, the following list of resources that I regularly recommend to my students might interest others as well. Going forward, I’ll update this webpage, not this blog post.

Presenting

Coding

Writing

The job market

  • One year before you’ll be on the market, read John Cawley’s very comprehensive Guide and Advice For Economists on the US Junior Academic Job Market. The process will be more coherent and less intimidating if you see the big picture from the beginning.
  • Give a full draft of your paper to your advisors in June. Sharing something in September is too late.

Why I encourage econ PhD students to learn Julia

Julia is a scientific computing language that an increasing number of economists are adopting (e.g., Tom Sargent, the NY FRB). It is a close substitute for Matlab, and the cost of switching from Matlab to Julia is modest: Julia syntax is quite similar to Matlab syntax once you change array references from parentheses to square brackets (e.g., “A(2, 2)” in Matlab is “A[2, 2]” in Julia and most other languages), though there are important differences. Julia also competes with Python, R, and C++, among other languages, as a computational tool.
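To give a flavor of how close the two languages look, here is a toy Julia snippet (my own illustration, not drawn from any of the projects mentioned in this post):

# Julia feels much like Matlab: 1-based indexing and column-major arrays,
# but elements are accessed with square brackets and elementwise operators are dotted.
A = [1.0 2.0; 3.0 4.0]    # a 2x2 matrix literal, as in Matlab
x = A[2, 2]               # Matlab's A(2, 2)
B = A .^ 2                # elementwise square (Matlab's A.^2)
f(y) = y^2 + 1            # a one-line function definition
println(f(x) + sum(B))    # prints 47.0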

I am now encouraging students to try Julia, which recently released version 1.0. I first installed Julia in the spring of 2016, when it was version 0.4. Julia’s advantages are that it is modern, elegant, open source, and often faster than Matlab. Its downside is that it is a young language, so its syntax is evolving, its user community is smaller, and some features are still in development.

A proper computer scientist would discuss Julia’s computational advantages in terms of concepts like multiple dispatch and typing of variables. For an unsophisticated economist like me, the proof of the pudding is in the eating. My story is quite similar to that of Bradley Setzler, whose structural model took more than 24 hours to solve in Python but only 15 minutes in Julia. After hearing two of my computationally savvy Booth colleagues praise Julia, I tried it out when doing the numerical simulations in our “A Spatial Knowledge Economy” paper. I took my Matlab code, made a few modest syntax changes, and found that my Julia code solved for equilibrium in only one-sixth of the time that my Matlab code did. My code was likely inefficient in both cases, but that speed improvement persuaded me to use Julia for that project.

For a proper comparison of computational performance, you should look at papers by S. Boragan Aruoba and Jesus Fernandez-Villaverde and by Jon Danielsson and Jia Rong Fan. Aruoba and Fernandez-Villaverde have solved the stochastic neoclassical growth model in a dozen languages. Their 2018 update says “C++ is the fastest alternative, Julia offers a great balance of speed and ease of use, and Python is too slow.” Danielsson and Fan compared Matlab, R, Julia, and Python when implementing financial risk forecasting methods. While you should read their rich comparison, a brief summary of their assessment is that Julia excels in language features and speed but has considerable room for improvement in terms of data handling and libraries.

While I like Julia a lot, it is a young language, and that comes at a cost. In March, I had to painfully convert a couple of research projects written in Julia 0.5 to version 0.6 after an upgrade of GitHub’s security standards meant that Julia 0.5 users could no longer easily install packages. My computations were fine, of course, but a replication package that required artisanally installed packages in a no-longer-supported environment wouldn’t have been very helpful to anyone else. I hope that Julia’s 1.0 release means that those who adopt the language now are less likely to face such growing pains, though it might be a couple of months before most packages support 1.0.

At this point, you probably should not use Julia for data cleaning. To be brief, Danielsson and Fan say that Julia is the worst of the four languages they considered for data handling. In our “How Segregated is Urban Consumption?” code, we did our data cleaning in Stata and our computation in Julia. Similarly, Michael Stepner’s health inequality code relies on Julia rather than Stata for a computation-intensive step, and Tom Wollmann split his JMP code between Stata and Julia. For now, I think most users would tell you to use Julia for computation, not data prep. (Caveat: I haven’t tried the JuliaDB package yet.)

If you want to get started in Julia, I found the “Lectures in Quantitative Economics” introduction to Julia by Tom Sargent and John Stachurski very helpful. Also look at Bradley Setzler’s Julia economics tutorials.

Trade economists might be interested in the Julia package FixedEffectModels.jl. It claims to be an order of magnitude faster than Stata when estimating two-way high-dimensional fixed-effects models, a bread-and-butter specification for gravity regressions. I plan to ask PhD students to explore these issues this fall and will report back after learning more.
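For the curious, here is a hedged sketch of what such a regression looks like with FixedEffectModels.jl’s current interface (the package’s syntax has changed across versions, and the data frame below is fabricated purely for illustration):

using DataFrames, FixedEffectModels

# Fake bilateral data: trade flows, distances, and origin/destination identifiers.
n = 1_000
df = DataFrame(origin = rand(1:50, n),
               destination = rand(1:50, n),
               distance = 100 .+ 5_000 .* rand(n),
               trade = 1 .+ 100 .* rand(n))
df.log_trade = log.(df.trade)
df.log_distance = log.(df.distance)

# Gravity regression with two-way high-dimensional fixed effects,
# clustering standard errors by origin.
result = reg(df, @formula(log_trade ~ log_distance + fe(origin) + fe(destination)),
             Vcov.cluster(:origin))
println(result)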