Examining Historical Growth I: Basic trends

Posted by Ezra Glenn on April 11, 2012
Data, Missions, Shape Your Neighborhood, Simulation

The nature of predictions

To paraphrase John Allen Paulos, author of A Mathematician Reads the Newspaper, all expert predictions can be essentially restated in one of two ways: “Things will continue roughly as they have been until something changes”; and its corollary, “Things will change after an indeterminate period of stability.” Although these statements are both true and absurd, they contain a kernel of wisdom: simply assuming a relative degree of stability and painting a picture of the future based on current trends is the first step of scenario planning. The trick, of course, is to never completely forget the “other shoe” of Paulos’s statement: as the disclaimer states on all investment offerings, “Past performance is not a guarantee of future results”; at some point in the future our present trends will no longer accurately describe where we are headed. (We will deal with this as well, with a few “safety valves.”)

From the second stage of the Rational Planning Paradigm (covered in the background sections of the book) we should have gathered information on both past and present circumstances related to our planning effort. If we are looking at housing production, we might have data on annual numbers of building permits and new subdivision approvals, mortgage rates, and housing prices; if we are looking at public transportation we might need monthly ridership numbers, information on fare changes, population and employment figures, and even data on past weather patterns or changes in vehicle ownership and gas prices. The first step of projection, therefore, is to gather relevant information and get it into a form that you can use.

Since we will be thinking about changes over time in order to project a trend into the future, we’ll need to make sure that our data has time as an element: a series of data points with one observation for each point or period of time is known as a time series. The exact units of time are not important—they could be days, months, years, decades, or something different—but it is customary (and important) to obtain data where points are regularly spaced at even intervals.1 Essentially, time series data is a special case of multivariate data in which we treat time itself as an additional variable and look for relationships as it changes. Luckily, R has some excellent functions and packages for dealing with time-series data, which we will cover below in passing. For starters, however, let’s consider a simple example, to start to think about what goes into projections.
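As a quick taste of that time-series support, here is a minimal sketch using base R's ts() function. The counts are made-up numbers for illustration only, and the names counts and counts.ts are placeholders; the point is simply that a ts object carries its own regularly spaced time index.

```r
# Hypothetical counts, one per decade (made-up numbers, illustration only)
counts <- c(100, 120, 150, 190, 240)

# Build a time series: first observation in 1950, one observation every 10 years
counts.ts <- ts(counts, start = 1950, deltat = 10)

# The time index is stored with the data, so it never has to be tracked separately
time(counts.ts)

# Subset by date rather than by position
recent <- window(counts.ts, start = 1970)
recent
```

Because the spacing is fixed when the series is created, ts() assumes exactly the kind of evenly spaced observations described above.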

Population Growth in Houston: Plotting a time series in R

Over the past 100 years, the City of Houston has been transformed from a fairly small city of 44,000 people (just barely making the Census Bureau’s list of the top 100 American cities in 1900) to a large metropolis with a population nearing 2,000,000 by the century’s end (earning a solid slot as the fourth largest city in the country for the past three decades2). Here are the population figures for this period:

Population of Houston, 1900-2000 (U.S. Census Bureau)

Year   Population   Rank
1900       44,633   (top 100)
1910       78,800   68
1920      138,276   45
1930      292,352   26
1940      384,514   21
1950      596,163   14
1960      938,219   7
1970    1,232,802   6
1980    1,595,138   5
1990    1,630,553   4
2000    1,953,631   4

The official count from the 2010 Census is now available, but before we look at it, let’s think about projections.3 To get started, let’s fire up R and enter this data like so:4

> year=seq(1900, 2000, 10)
> population=scan()
1: 44633
2: 78800
3: 138276
4: 292352
5: 384514
6: 596163
7: 938219
8: 1232802
9: 1595138
10: 1630553
11: 1953631
12: 
Read 11 items
> plot(population ~ year, type="o")

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-houston_pop1.jpg

The graph produced by that final plot() command gives us a pretty good “eyeball” sense of how population has changed in Houston over the past century. In our next mission, we’ll move from plotting history to making predictions for the future.

Technical Coda: Plots in R

As a technical appendix for those learning R as we go, note that by default, plot uses the vector names (here year and population) as labels for the x- and y-axes, and places “tick marks” at intelligent intervals, which are nice touches. Note also the use of the type="o" option, which tells R to plot both points and lines in an “overplotted” format; see help(plot) for additional options like this.
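For readers who prefer a script to an interactive session, here is a minimal sketch of the same plot with the data entered via c() and the labels set by hand. The xlab, ylab, and main arguments are standard plot() options; the label text itself is just an illustrative choice, not part of the original example.

```r
# Same figures as entered with scan() earlier in the post
year <- seq(1900, 2000, 10)
population <- c(44633, 78800, 138276, 292352, 384514, 596163,
                938219, 1232802, 1595138, 1630553, 1953631)

# type = "o" overplots points and lines; the labels below are illustrative
plot(population ~ year, type = "o",
     xlab = "Census year", ylab = "Population",
     main = "Population of Houston, 1900-2000")
```

Leaving out xlab and ylab falls back to the default behavior described above, with the vector names used as axis labels.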

Also note how plot can accept input in R’s “model formula” format to determine which variables or vectors to plot (as opposed to x= and y= format). Although this way of doing things may seem strange at first, it’s a good idea to become familiar with specification in model format: population ~ year, which basically means “population modeled on year”. Once you are comfortable with model format, you’ll find it helps simplify some more complex plotting and analysis operations: the general format used for any multivariate “model-like” object (whether for plotting or for analyzing in some other way) is to specify the “response” variable first (the thing being modeled or predicted), followed by one or more “predictor” variables: y ~ x or (for more complicated formulas) y ~ a + b + c. (For more uses of model formulas, see the examples in help(lm).)
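To make the formula idea concrete, here is a small sketch. It shows that a formula is an ordinary R object that can be stored and inspected, and that the same formula can be handed to both plot() and lm(). The names f and fit are placeholders, and the straight-line fit is only there to show the formula being reused; it is not offered as a sensible projection method for this data.

```r
# Same figures as entered with scan() earlier in the post
year <- seq(1900, 2000, 10)
population <- c(44633, 78800, 138276, 292352, 384514, 596163,
                938219, 1232802, 1595138, 1630553, 1953631)

# A formula can be stored in a variable like any other object
f <- population ~ year
class(f)      # "formula"
all.vars(f)   # "population" "year"

# The stored formula works wherever a formula is accepted
plot(f, type = "o")   # same picture as plot(population ~ year, type = "o")
fit <- lm(f)          # response on the left, predictor on the right
coef(fit)             # one coefficient for the intercept, one for year
```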

Footnotes:

1 I know, I know, technically neither months nor years are exactly equal intervals, but this is typically overlooked in the interest of common sense and interpretability.

2 In 1990 Houston passed Philadelphia, which had previously been listed as one of the four largest cities in the country since the first Census in 1790.

3 The idea of reserving some data to use later to verify our results is a good one; all too often analysts use every last shred of data in developing a methodology, and then have to wait to actually test it against new data. By reserving some of the numbers for this purpose we have instant access to the data we need to check our predictions, and we also guard against “over-fitting” our model to the existing data.

4 You can also download the data here and use houston.pop=read.csv("houston_population.csv") to import it: houston_population.csv.
