Examining Historical Growth III: The forecast() package

Posted by Ezra Glenn on April 21, 2012
Data, Missions, Shape Your Neighborhood, Simulation

In our last mission we used R to plot a trend-line for population growth in Houston, based on historical data from the past century. Depending on which of two different methods we used, we arrived at an estimate for the city’s 2010 population of 2,144,531 (based on the 100-year growth trend for the city) or 2,225,125 (based on the steeper growth trend of the past fifty years). Looking now at the official Census count for 2010, it turns out that our guesses are close, but both of too high: the actual reported figure for 2010 is 2,099,451.

It would have been surprising to have guessed perfectly based on nothing other than a linear trend — and the fact that we came as close as we did speaks well of this sort of “back of the envelope” projection technique (at least for the case of steady-growth). But there was a lot of information contained in those data points that we essentially ignored: our two trendlines were really based on nothing more than a start and an end point.

A more sophisticated set of tools for making projections — which may be able to extract some extra meaning from the variation contained in the data — is provided in R by the excellent forecast package, developed by Rob Hyndman of the Monash University in Australia. To access these added functions, you’ll need to install it:

> install.packages(forecast)
> library(forecast)

Time-series in R: an object with class

Although R is perfectly happy to help you analyze and plot time series data organized in vectors and dataframes, it actually has a specialized object class for this sort of thing, created with the ts() function. Remember: R is an “object-oriented” language. Every object (a variable, a dataframe, a function, a time series) is associated with a certain class, which helps the language figure out how to manage and interact with them. To find the class of an object, use the class() functions:

> a=c(1,2)
> class(a)
[1] "numeric"
> a=TRUE
> class(a)
[1] "logical"
> class(plot)
[1] "function"
> a=ts(1)
> class(a)
[1] "ts"
> 

As options, ts() expects the data itself (usually given as a first, unnamed option, but of course you can put it anywhere in the parentheses and specifically call it with the data= option), as well as some information about how the time series is structured: the time coding of the start and (optional) end points, as well as either the frequency of the series (the number of data points between time periods) or its inverse, deltat (the number of time periods between data points).

Using the historical population dataset for Houston from the previous two missions,1 we can create a new timeseries object for this data:

> houston.ts=ts(data=population, start=1900, end=2000, frequency=1/10)
> houston.ts
Time Series:
Start = 1900 
End = 2000 
Frequency = 0.1 
 [1]   44633   78800  138276  292352  384514  596163  938219 1232802 1595138
[10] 1630553 1953631
> 

You can summarize and plot time series objects just like other data types; one of the nice features about R‘s object-oriented nature is that functions (like print(), plot(), or summary()) recognize the class of objects they are passed and tend to react accordingly.

> summary(houston.ts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  44630  215300  596200  807700 1414000 1954000 
> plot(houston.ts, type="o")
> diff(houston.ts)
Time Series:
Start = 1910 
End = 2000 
Frequency = 0.1 
 [1]  34167  59476 154076  92162 211649 342056 294583 362336  35415 323078
> plot(diff(houston.ts), ylab="Growth in Previous Decade", type="h", lwd=9)
> plot(diff(houston.ts)/houston.ts[1:10], 
           ylab="Percent Growth in Previous Decade", type="h", lwd=9)
> 

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-houston_pop_d.jpg

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-houston_pop_e.jpg

For the first two functions, R treats the time series like any other vector. With the call to diff() we begin to treat the time series as something interesting, where we can to plot growth from one decade to the next (in absolute and then in percentage terms); in essence, we are plotting a crude derivative of our population graph.

The forceast package

Beyond these tricks, the real power of time series objects lies in other functions specifically developed to work with them (including those provided by the forecast package). When it comes to projecting future values from historical trends, the workhorse of these tools is the forecast() function, which accepts a time series object and outputs yet another type of object, of the forecast class.

A small work-around

The current configuration of the forecast package2 has a little trouble dealing with time-series data with a frequency of less than one (such as decennial data, where unit of time is nominally one year, but this represents only 1/10th of our interval). There is a good chance that in the very near-future a new version of the forecast package will be released that will make this step unnecessary, but for now we need a small workaround: rather than thinking of our data measured in years, let’s recast it as something measured in decades, so that the frequency will be 1, not .1 (technically speaking, this is actually closer to the truth anyways). So we can redefine houston.ts as follows:

> houston.ts=ts(data=population, start=190, end=200, frequency=1)
> houston.ts
Time Series:
Start = 190
End = 200
Frequency = 1 
 [1]   44633   78800  138276  292352  384514  596163  938219 1232802 1595138
[10] 1630553 1953631
> 

That is, we have a time series that starts at the 190th decade, and continues to measure population for a total of ten more decades. Everything is exactly the same as before, except for the units of time are measured in decades.

Back to forecasting

When passed a ts object (and possibly some other optional arguments — see ?forecast), the forecast function will return a new class of object (which we tie to the name houston.future here) that looks a lot like a dataframe when printed.It is actually a complicated list; type str(houston.future) to see this. By default, the class instructs print() to only print certain elements.3

By default it will forecast out for ten time units (which, in our case, would be ten decades, not just ten years):

> houston.future=forecast(houston.ts)
> class(houston.future)
[1] "forecast"
> houston.future
> forecast(houston.ts)
    Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
201        2141479 1970080 2312877 1879348 2403610
202        2338520 2105658 2571382 1982388 2694652
203        2535561 2254354 2816768 2105492 2965630
204        2732602 2410212 3054992 2239549 3225655
205        2929643 2570757 3288529 2380775 3478512
206        3126684 2734679 3518689 2527164 3726204
207        3323725 2901183 3746268 2677502 3969949
208        3520766 3069743 3971790 2830986 4210547
209        3717807 3239992 4195623 2987052 4448563
210        3914849 3411661 4418036 3145289 4684408
>
> plot(houston.future, type="o" , 
         ylim=c(0,3000000), main="Forecast of Growth")

As you can see, houston.future contains point-forecasts for the next ten time periods (this can be changed by specifying the h= option to forecast), as well as high and low estimates at various confidence levels (which can also be specified through the level= option; the default is level=c(80,95) , but you can pass it any vector of desired levels). When plotted, a special forecast method for the plot function is used, which causes these high and low estimates appear as shaded (or colored) bands around the projected trend line.

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-houston_forecast.jpg

“Detroit, We Have a Problem…”

Houston’s growth, however, has been relatively well-behaved in this respect. If instead we look at, say, Detroit, we get a really different picture, and it clearly makes all the difference in the world whether we look at a 50- or a 100-year trend, and how seriously we take forecasts based on historic trends.

> detroit.pop=read.csv("./detroit_population.csv")
> annual.increase.det=
       (detroit.pop$population[11]-detroit.pop$population[1])/100
> annual.increase.det.100=annual.increase.det
> annual.increase.det.50=
       (detroit.pop$population[11]-detroit.pop$population[6])/50
> plot(detroit.pop, type="o", xlim=c(1900,2050), 
       ylim=c(0,2000000), 
       main="Detroit Population\n50- and 100-Year Trends")
> lines(x=c(1950,2100), 
       y=c(detroit.pop$population[6],
           detroit.pop$population[11]+100*annual.increase.det.50), 
       col=3, lty=3)
> lines(x=c(1900,2100),
       y=c(detroit.pop$population[1],
           detroit.pop$population[11]+100*annual.increase.det.100), 
       col=4, lty=2)
> legend(x="bottomright" , inset=.02, cex=.66, col=c(1,4,3), 
       lty=c(1,2,3) , legend=c("Census Data","100-Year Trend","50-Year Trend"))
>  

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-detroit_population.jpg

One could also make a good argument that the population loss is finally stopping, or at least slowing, and that a 10- or 20-year trend might be a better indicator; in 2009 the Census Bureau seemed to believe this, offering an official estimate of 910,200, not too much below the 2000 figure of 951,270. But remember that estimates can be off in this direction as well: in 2010, just one year later, the official count was found to be 713,777 — a 25% drop over the decade! Either way, the point stands: before making estimates, think about which historical trends you think are most relevant, and proceed from there.

The Detroit example provides another sobering lesson: when making projections, it may not be obvious when past trends are completely irrelevant in terms of describing the future. To planners in 1950, the growth rates in Detroit would probably have seemed a lot like the way the Houston numbers seem to us today: a city with steady growth over the past century, nearing the two-million mark; to see how well this fits with the data, look at the graph below, which compares the Houston numbers with those from Detroit adjusted by 50 years.

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-houston_detroit1.jpg

Statistical Coda: “Don’t Get Cocky, Kid…”

Be extremely careful in presenting, describing, and interpreting the “high” and “low” estimates generated by forecast(); they are not “confidence intervals” or “+/- errors” in the classic statistical sense. Like the trend line itself, they are predicated on the idea that past trends represent future states: assuming that /the variability of the data shown in the past is representative of how much this type of data jumps around its trend/, then they are good confidence bands — but be aware that you have now introduced two assumptions into your projections (one regarding your faith in the general trend, and one concerning the representativeness of the observed historical variation).

Footnotes:

1 See this mission to load the data.

2 Version 3.20, at the time of writing.

3 It is actually a complicated list — enter str(houston.future) to see this. By default, the class instructs print() to only print certain elements.

Tags: , , , , ,

Leave a Reply