Examining Historical Growth II: Using the past to predict the future

Posted by Ezra Glenn on April 12, 2012
Data, Missions, Shape Your Neighborhood, Simulation

In our previous mission we plotted population numbers in Houston for 1900–2000, to start to understand the growth trend for that city. Now, what if we didn’t have access to the latest Census figures, and we wanted to try to guess Houston’s population for 2010, using nothing but the data from 1900–2000?

One place to start would be with the 2000 population (1,953,631) and adjust it a bit based on historical trends. With 100 year’s worth of data, we can do this in R with a simple call to some vector math.1

> attach(houston.pop) # optional, see footnote
> population[11]      # don't forget: 11, not 10, data points
[1] 1953631
> annual.increase=(population[11]-population[1])/100   # watch the parentheses!
> population[11]+10*annual.increase
[1] 2144531

Remember that we actually have eleven data points, since we have both 1900 and 2000, so we need to specify population[11] as our endpoint. But since there are only ten decade intervals, we divide by 100 to get the annual increase. Adding ten times this increase to the 2000 population, we get an estimate for 2010 of 2,144,531. (Bonus question: based on this estimated annual increase, in what year would Houston have passed the two-million mark?2)

We can uses the lines() function to add this trend line to a plot. (We will first re-draw the graph and adjust the xlim= and ylim= options to give a little more room for the projections.)

> plot(population ~ year, type="o", xlim=c(1900,2050), 
> lines(x=c(1900,2100),y=c(population[1],
         population[11]+100*annual.increase), col=4, lty=2)


As you can see from the graph, the trend line does a reasonable job capturing the general growth over the century, but other than for three points (1900, 1980, and 2000) it is pretty imprecise. (Also, if you check the math, you will see that we are always guaranteed to hit the first and last of the series, so maybe it’s not as great as it seems…) The problem seems to be that the points do not really lie in a straight line to start with: the growth takes a while to get going, and then really picks up after 1950 or so. (In fact, if it weren’t for the slower growth throughout the 1980s, we might look at this as exponential growth — more on that later.) What if we ignored the first half of the century, and formed our estimate from the points from 1950–2000?

> annual.increase.hundred=annual.increase
> annual.increase.fifty=(population[11]-population[6])/50
> plot(population ~ year, type="o", xlim=c(1900,2050), ylim=c(0,3000000))
> lines(x=c(1950,2100),y=c(population[6],
           population[11]+100*annual.increase.fifty), col=4, lty=2)
> lines(x=c(1900,2100),y=c(population[1],
           population[11]+100*annual.increase.hundred), col=3, lty=3)
> legend(x="bottomright" , inset=.02, col=c(1,4,3) , 
           lty=c(1,2,3) , 
           legend=c("Census Data","50-Year Trend","100-Year Trend"))
> population[11]+10*annual.increase.fifty
[1] 2225125


The first line saves our old annual.increase value so we can continue to use it for comparison purposes, while the second line creates a new, parallel variable based on the 50-year annual increase. After calling our plot(), the lines() functions allow us to draw additional lines onto the existing plot; by varying col= and lty= options we can easily distinguish the two lines as well. The legend() function adds a legend at the points specified by the options given; see help(legend) for more on how this works. Finally, the new call to the formula population[11]+10*annual.increase.fifty recalculates the 2010 population estimate based on the revised annual increase—and arrives at a figure almost exactly in line with the city’s official estimate!

We could also use the R‘s vector math ability to easily create an entire series of population estimates based on both of these trends, and easily compare them.

> fifty.year.trend=population[11]+annual.increase.fifty*seq(0,10)
> hundred.year.trend=population[11]+annual.increase.hundred*seq(0,10)
> pop.estimates=data.frame(row.names=seq(2000,2100,10),
> pop.estimates
     fifty.year.trend hundred.year.trend
2000          1953631            1953631
2010          1980780            1972721
2020          2007930            1991811
2030          2035079            2010901
2040          2062228            2029991
2050          2089378            2049081
2060          2116527            2068171
2070          2143677            2087261
2080          2170826            2106351
2090          2197975            2125441
2100          2225125            2144531
> pop.estimates$difference=fifty.year.trend-hundred.year.trend
> pop.estimates[3]
2000       0.00
2010    8059.38
2020   16118.76
2030   24178.14
2040   32237.52
2050   40296.90
2060   48356.28
2070   56415.66
2080   64475.04
2090   72534.42
2100   80593.80

As you can see, the two estimates are pretty close for the early decades of the 2000s, but by the mid century they are starting to diverge; by 2100 the difference is over 80,000 people, the size of a small city (although in relative terms the two are still within about four percentage points of each other.

In our next mission, we’ll see how these estimates compare with the actual Census count from 2010, learn some more sophisticated tools for forecasting in R, and then discuss some problems with predictions based on historical trends.


1 Note: if you loaded the Houston population data from the website with read.csv(), you might want to attach(houston.pop) first, to make it easier to access the column variables without the houston.pop$ preface. (But don’t forget to detach(houston.pop) at the end of the session.

2 Answer: it wouldn’t be a bonus question if the answer was just right here — go figure it out!

Tags: , , , , ,

Leave a Reply