In our previous mission we plotted population numbers in Houston for 1900–2000, to start to understand the growth trend for that city. Now, what if we didn’t have access to the latest Census figures, and we wanted to try to guess Houston’s population for 2010, using nothing but the data from 1900–2000?
One place to start would be with the 2000 population (1,953,631) and adjust it a bit based on historical trends. With 100 year’s worth of data, we can do this in
R with a simple call to some vector math.1
> attach(houston.pop) # optional, see footnote > population # don't forget: 11, not 10, data points  1953631 > annual.increase=(population-population)/100 # watch the parentheses! > population+10*annual.increase  2144531 >
Remember that we actually have eleven data points, since we have both 1900 and 2000, so we need to specify
population as our endpoint. But since there are only ten decade intervals, we divide by 100 to get the annual increase. Adding ten times this increase to the 2000 population, we get an estimate for 2010 of 2,144,531. (Bonus question: based on this estimated annual increase, in what year would Houston have passed the two-million mark?2)
We can uses the
lines() function to add this trend line to a plot. (We will first re-draw the graph and adjust the
ylim= options to give a little more room for the projections.)
> plot(population ~ year, type="o", xlim=c(1900,2050), ylim=c(0,3000000)) > lines(x=c(1900,2100),y=c(population, population+100*annual.increase), col=4, lty=2) >
As you can see from the graph, the trend line does a reasonable job capturing the general growth over the century, but other than for three points (1900, 1980, and 2000) it is pretty imprecise. (Also, if you check the math, you will see that we are always guaranteed to hit the first and last of the series, so maybe it’s not as great as it seems…) The problem seems to be that the points do not really lie in a straight line to start with: the growth takes a while to get going, and then really picks up after 1950 or so. (In fact, if it weren’t for the slower growth throughout the 1980s, we might look at this as exponential growth — more on that later.) What if we ignored the first half of the century, and formed our estimate from the points from 1950–2000?
> annual.increase.hundred=annual.increase > annual.increase.fifty=(population-population)/50 > plot(population ~ year, type="o", xlim=c(1900,2050), ylim=c(0,3000000)) > lines(x=c(1950,2100),y=c(population, population+100*annual.increase.fifty), col=4, lty=2) > lines(x=c(1900,2100),y=c(population, population+100*annual.increase.hundred), col=3, lty=3) > legend(x="bottomright" , inset=.02, col=c(1,4,3) , lty=c(1,2,3) , legend=c("Census Data","50-Year Trend","100-Year Trend")) > population+10*annual.increase.fifty  2225125 >
The first line saves our old
annual.increase value so we can continue to use it for comparison purposes, while the second line creates a new, parallel variable based on the 50-year annual increase. After calling our
lines() functions allow us to draw additional lines onto the existing plot; by varying
lty= options we can easily distinguish the two lines as well. The
legend() function adds a legend at the points specified by the options given; see
help(legend) for more on how this works. Finally, the new call to the formula
population+10*annual.increase.fifty recalculates the 2010 population estimate based on the revised annual increase—and arrives at a figure almost exactly in line with the city’s official estimate!
We could also use the
R‘s vector math ability to easily create an entire series of population estimates based on both of these trends, and easily compare them.
> fifty.year.trend=population+annual.increase.fifty*seq(0,10) > hundred.year.trend=population+annual.increase.hundred*seq(0,10) > pop.estimates=data.frame(row.names=seq(2000,2100,10), fifty.year.trend,hundred.year.trend) > pop.estimates fifty.year.trend hundred.year.trend 2000 1953631 1953631 2010 1980780 1972721 2020 2007930 1991811 2030 2035079 2010901 2040 2062228 2029991 2050 2089378 2049081 2060 2116527 2068171 2070 2143677 2087261 2080 2170826 2106351 2090 2197975 2125441 2100 2225125 2144531 > > pop.estimates$difference=fifty.year.trend-hundred.year.trend > pop.estimates difference 2000 0.00 2010 8059.38 2020 16118.76 2030 24178.14 2040 32237.52 2050 40296.90 2060 48356.28 2070 56415.66 2080 64475.04 2090 72534.42 2100 80593.80 >
As you can see, the two estimates are pretty close for the early decades of the 2000s, but by the mid century they are starting to diverge; by 2100 the difference is over 80,000 people, the size of a small city (although in relative terms the two are still within about four percentage points of each other.
In our next mission, we’ll see how these estimates compare with the actual Census count from 2010, learn some more sophisticated tools for forecasting in
R, and then discuss some problems with predictions based on historical trends.
1 Note: if you loaded the Houston population data from the website with
read.csv(), you might want to
attach(houston.pop) first, to make it easier to access the column variables without the
houston.pop$ preface. (But don’t forget to
detach(houston.pop) at the end of the session.
2 Answer: it wouldn’t be a bonus question if the answer was just right here — go figure it out!