In our last mission we used R to plot a trend-line for population growth in Houston, based on historical data from the past century. Depending on which of two different methods we used, we arrived at an estimate for the city’s 2010 population of 2,144,531 (based on the 100-year growth trend for the city) or 2,225,125 (based on the steeper growth trend of the past fifty years). Looking now at the official Census count for 2010, it turns out that our guesses are close, but both too high: the actual reported figure for 2010 is 2,099,451.
It would have been surprising to have guessed perfectly based on nothing other than a linear trend, and the fact that we came as close as we did speaks well of this sort of “back of the envelope” projection technique (at least in the case of steady growth). But there was a lot of information contained in those data points that we essentially ignored: our two trend-lines were really based on nothing more than a start and an end point.
A more sophisticated set of tools for making projections (tools which may be able to extract some extra meaning from the variation contained in the data) is provided in R by the excellent forecast package, developed by Rob Hyndman of Monash University in Australia. To access these added functions, you’ll need to install and load the package:
> install.packages("forecast")
> library(forecast)
Time series in R: an object with class

While R is perfectly happy to help you analyze and plot time series data organized in vectors and dataframes, it also has a specialized object class for this sort of thing, created with the ts() function. Remember: R is an “object-oriented” language. Every object (a variable, a dataframe, a function, a time series) is associated with a certain class, which helps the language figure out how to manage and interact with it. To find the class of an object, use the class() function:
> a=c(1,2)
> class(a)
[1] "numeric"
> a=TRUE
> class(a)
[1] "logical"
> class(plot)
[1] "function"
> a=ts(1)
> class(a)
[1] "ts"
ts() expects the data itself (usually given as a first, unnamed option, though of course you can put it anywhere in the parentheses and specifically call it with the data= option), as well as some information about how the time series is structured: the time coding of the start and (optional) end points, as well as either the frequency of the series (the number of data points per time period) or its inverse, deltat (the number of time periods between data points).
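To see that frequency= and deltat= really are interchangeable inverses, here is a quick sketch (with made-up numbers, not the census data) showing that specifying one or the other produces the identical object:

```r
x <- c(10, 20, 30, 40)

# one observation per 10 time units...
a <- ts(x, start=1900, frequency=1/10)

# ...is the same as a gap of 10 time units between observations
b <- ts(x, start=1900, deltat=10)

identical(a, b)   # TRUE
time(a)           # the observation times: 1900, 1910, 1920, 1930
```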
Using the historical population dataset for Houston from the previous two missions,1 we can create a new time series object for this data:
> houston.ts=ts(data=population, start=1900, end=2000, frequency=1/10)
> houston.ts
Time Series:
Start = 1900
End = 2000
Frequency = 0.1
 [1]   44633   78800  138276  292352  384514  596163  938219 1232802 1595138
[10] 1630553 1953631
You can summarize and plot time series objects just like other data types; one of the nice features of R’s object-oriented nature is that functions (like summary()) recognize the class of the objects they are passed and tend to react accordingly.
> summary(houston.ts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  44630  215300  596200  807700 1414000 1954000
> plot(houston.ts, type="o")
> diff(houston.ts)
Time Series:
Start = 1910
End = 2000
Frequency = 0.1
 [1]  34167  59476 154076  92162 211649 342056 294583 362336  35415 323078
> plot(diff(houston.ts), ylab="Growth in Previous Decade", type="h", lwd=9)
> plot(diff(houston.ts)/houston.ts[1:10], ylab="Percent Growth in Previous Decade", type="h", lwd=9)
For the first two functions, R treats the time series like any other vector. With the call to diff() we begin to treat the time series as something more interesting, plotting growth from one decade to the next (in absolute and then in percentage terms); in essence, we are plotting a crude derivative of our population graph.
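The same decade-over-decade figures can be reproduced directly from the raw population vector (the values printed above), which makes it clear what diff() and the division are doing; here head(population, -1), which drops the last element, plays the role of houston.ts[1:10]:

```r
# Houston decennial population, 1900-2000, as printed above
population <- c(44633, 78800, 138276, 292352, 384514, 596163,
                938219, 1232802, 1595138, 1630553, 1953631)

# absolute change in each decade: population[t+1] - population[t]
abs.growth <- diff(population)

# change relative to the population at the start of each decade
pct.growth <- diff(population) / head(population, -1)

abs.growth[1]   # 34167, the 1900-1910 increase
pct.growth[1]   # roughly 0.77, i.e. 77% growth in the first decade
```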
Beyond these tricks, the real power of time series objects lies in other functions specifically developed to work with them (including those provided by the forecast package). When it comes to projecting future values from historical trends, the workhorse of these tools is the forecast() function, which accepts a time series object and outputs yet another type of object, of the forecast class.
A small work-around
The current configuration of the forecast package2 has a little trouble dealing with time-series data with a frequency of less than one (such as decennial data, where the unit of time is nominally one year, but this represents only 1/10th of our interval). There is a good chance that in the very near future a new version of the forecast package will be released that will make this step unnecessary, but for now we need a small workaround: rather than thinking of our data as measured in years, let’s recast it as something measured in decades, so that the frequency will be 1, not .1 (technically speaking, this is actually closer to the truth anyway). So we can redefine houston.ts as follows:
> houston.ts=ts(data=population, start=190, end=200, frequency=1)
> houston.ts
Time Series:
Start = 190
End = 200
Frequency = 1
 [1]   44633   78800  138276  292352  384514  596163  938219 1232802 1595138
[10] 1630553 1953631
That is, we have a time series that starts at the 190th decade and continues to measure population for ten more decades. Everything is exactly the same as before, except that the units of time are now measured in decades.
Back to forecasting
When passed a ts object (and possibly some other optional arguments; see ?forecast for the full list), the forecast() function will return a new class of object (which we tie to the name houston.future here) that looks a lot like a dataframe when printed. It is actually a more complicated object; try str(houston.future) to see this. By default, the class instructs print() to only print certain elements.3
By default it will forecast out for ten time units (which, in our case, would be ten decades, not just ten years):
> houston.future=forecast(houston.ts)
> class(houston.future)
[1] "forecast"
> houston.future
    Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
201        2141479 1970080 2312877 1879348 2403610
202        2338520 2105658 2571382 1982388 2694652
203        2535561 2254354 2816768 2105492 2965630
204        2732602 2410212 3054992 2239549 3225655
205        2929643 2570757 3288529 2380775 3478512
206        3126684 2734679 3518689 2527164 3726204
207        3323725 2901183 3746268 2677502 3969949
208        3520766 3069743 3971790 2830986 4210547
209        3717807 3239992 4195623 2987052 4448563
210        3914849 3411661 4418036 3145289 4684408
> plot(houston.future, type="o", ylim=c(0,3000000), main="Forecast of Growth")
As you can see, houston.future contains point forecasts for the next ten time periods (this can be changed by specifying the h= option to forecast()), as well as high and low estimates at various confidence levels (which can also be specified through the level= option; the default is level=c(80,95), but you can pass it any vector of desired levels). When plotted, a special forecast method for the plot() function is used, which causes these high and low estimates to appear as shaded (or colored) bands around the projected trend line.
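For example (assuming the forecast package is installed and houston.ts is the decade-based series defined above), here is how a shorter projection with custom bands might look; the $mean, $lower, and $upper components pull the individual pieces out of the returned object:

```r
library(forecast)

# the decade-based houston.ts from earlier
population <- c(44633, 78800, 138276, 292352, 384514, 596163,
                938219, 1232802, 1595138, 1630553, 1953631)
houston.ts <- ts(population, start=190, end=200, frequency=1)

# five decades ahead, with 50% and 90% bands instead of the defaults
houston.short <- forecast(houston.ts, h=5, level=c(50, 90))

houston.short$mean    # the point forecasts
houston.short$lower   # lower bounds, one column per confidence level
houston.short$upper   # upper bounds
```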
“Detroit, We Have a Problem…”
Houston’s growth, however, has been relatively well-behaved in this respect. If instead we look at, say, Detroit, we get a very different picture, and it clearly makes all the difference in the world whether we look at a 50- or a 100-year trend, and how seriously we take forecasts based on historic trends.
> detroit.pop=read.csv("./detroit_population.csv")
> annual.increase.det=(detroit.pop$population[11]-detroit.pop$population[1])/100
> annual.increase.det.100=annual.increase.det
> annual.increase.det.50=(detroit.pop$population[11]-detroit.pop$population[6])/50
> plot(detroit.pop, type="o", xlim=c(1900,2050), ylim=c(0,2000000), main="Detroit Population\n50- and 100-Year Trends")
> lines(x=c(1950,2100), y=c(detroit.pop$population[6], detroit.pop$population[6]+150*annual.increase.det.50), col=3, lty=3)
> lines(x=c(1900,2100), y=c(detroit.pop$population[1], detroit.pop$population[1]+200*annual.increase.det.100), col=4, lty=2)
> legend(x="bottomright", inset=.02, cex=.66, col=c(1,4,3), lty=c(1,2,3), legend=c("Census Data","100-Year Trend","50-Year Trend"))
One could also make a good argument that the population loss is finally stopping, or at least slowing, and that a 10- or 20-year trend might be a better indicator; in 2009 the Census Bureau seemed to believe this, offering an official estimate of 910,200, not too much below the 2000 figure of 951,270. But remember that estimates can be off in this direction as well: in 2010, just one year later, the official count was found to be 713,777 — a 25% drop over the decade! Either way, the point stands: before making estimates, think about which historical trends you think are most relevant, and proceed from there.
The Detroit example provides another sobering lesson: when making projections, it may not be obvious when past trends are completely irrelevant in terms of describing the future. To planners in 1950, the growth rates in Detroit would probably have seemed a lot like the way the Houston numbers seem to us today: a city with steady growth over the past century, nearing the two-million mark; to see how well this fits with the data, look at the graph below, which compares the Houston numbers with those from Detroit adjusted by 50 years.
Statistical Coda: “Don’t Get Cocky, Kid…”
Be extremely careful in presenting, describing, and interpreting the “high” and “low” estimates generated by forecast(); they are not “confidence intervals” or “+/- errors” in the classic statistical sense. Like the trend line itself, they are predicated on the idea that past trends represent future states: assuming that the variability the data showed in the past is representative of how much this type of data jumps around its trend, then they are good confidence bands. But be aware that you have now introduced two assumptions into your projections (one regarding your faith in the general trend, and one concerning the representativeness of the observed historical variation).