The nature of predictions
To paraphrase John Allen Paulos, author of A Mathematician Reads the Newspaper, all expert predictions can be essentially restated in one of two ways: “Things will continue roughly as they have been until something changes”; and its corollary, “Things will change after an indeterminate period of stability.” Although these statements are both true and absurd, they contain a kernel of wisdom: simply assuming a relative degree of stability and painting a picture of the future based on current trends is the first step of scenario planning. The trick, of course, is to never completely forget the “other shoe” of Paulos’s statement: as the disclaimer states on all investment offerings, “Past performance is not a guarantee of future results”; at some point in the future our present trends will no longer accurately describe where we are headed. (We will deal with this as well, with a few “safety valves.”)
From the second stage of the Rational Planning Paradigm (covered in the background sections of the book) we should have gathered information on both past and present circumstances related to our planning effort. If we are looking at housing production, we might have data on annual numbers of building permits and new subdivision approvals, mortgage rates, and housing prices; if we are looking at public transportation, we might need monthly ridership numbers, information on fare changes, population and employment figures, and even data on past weather patterns or changes in vehicle ownership and gas prices. The first step of projection, therefore, is to gather relevant information and get it into a form that you can use.
Since we will be thinking about changes over time in order to project a trend into the future, we’ll need to make sure that our data has time as an element: a series of data points with one observation for each point or period of time is known as a time series. The exact units of time are not important (they could be days, months, years, decades, or something different), but it is customary (and important) to obtain data where points are regularly spaced at even intervals.1 Essentially, time series data is a special case of multivariate data in which we treat time itself as an additional variable and look for relationships as it changes. Luckily, R has some excellent functions and packages for dealing with time-series data, which we will cover below in passing. For starters, however, let’s consider a simple example to begin thinking about what goes into projections.
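As a quick illustration of what a regularly spaced time series looks like in R, the sketch below inspects AirPassengers, a monthly series that ships with base R's datasets package. It is not part of this chapter's data and is used here only as a convenient stand-in; class(), start(), frequency(), and time() are standard base R functions that work on "ts" objects.

```r
# AirPassengers is a built-in monthly time series (class "ts").
# It stands in here for any regularly spaced series; it is not the
# chapter's own data.
data(AirPassengers)

class(AirPassengers)      # the object's class: "ts"
start(AirPassengers)      # first period as c(year, month)
frequency(AirPassengers)  # observations per unit of time (12 = monthly)

# time() returns the time index of every observation. For a ts object
# the gaps between successive time points are all the same size,
# which is the "even intervals" property described above.
gaps <- diff(time(AirPassengers))
evenly_spaced <- isTRUE(all.equal(min(gaps), max(gaps)))
evenly_spaced
```

Because a "ts" object stores only a start, an end, and a frequency rather than a full column of dates, even spacing is built into the data structure itself.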
Population Growth in Houston: Plotting a time series in R
Over the past 100 years, the City of Houston has been transformed from a fairly small city of about 44,600 people (just barely making the Census Bureau’s list of the top 100 American cities in 1900) to a large metropolis with a population nearing 2,000,000 by the century’s end (earning a solid slot as the fourth largest city in the country for the past three decades2). Here are the population figures for this period:
Population of Houston, 1900-2000 (U.S. Census Bureau)
> year=seq(1900, 2000, 10)
> population=scan()
1: 44633
2: 78800
3: 138276
4: 292352
5: 384514
6: 596163
7: 938219
8: 1232802
9: 1595138
10: 1630553
11: 1953631
12:
Read 11 items
> plot(population ~ year, type="o")
The graph produced by that final plot() command gives us a pretty good “eyeball” sense of how population has changed in Houston over the past century. In our next mission, we’ll move from plotting history to making predictions for the future.
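The scan() session above is interactive, so it cannot be pasted into a script as-is. A minimal non-interactive sketch of the same data entry follows, using c() instead of scan(); the figures are the Census values listed above. The ts() call and the decade-over-decade growth calculation are additions for illustration, and pop_ts and growth are names chosen here rather than taken from the text.

```r
# Same Census figures as above, entered with c() so the script runs
# without interactive input.
year <- seq(1900, 2000, 10)
population <- c(44633, 78800, 138276, 292352, 384514, 596163,
                938219, 1232802, 1595138, 1630553, 1953631)

# Optional: store the series as a ts object. deltat = 10 declares one
# observation every 10 years, starting in 1900.
pop_ts <- ts(population, start = 1900, deltat = 10)

# Proportional change from each census to the next
# (10 values for 11 observations).
growth <- diff(population) / head(population, -1)
names(growth) <- paste(head(year, -1), year[-1], sep = "-")
round(growth, 2)

# Same plot as before: points joined by lines.
plot(population ~ year, type = "o")
```

Looking at the growth values alongside the plot is one way to put numbers on the “eyeball” impression: the curve's steepness in any decade corresponds to that decade's entry in growth.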
Technical Coda: Plots in R
As a technical appendix for those learning R as we go, note that by default, plot uses the vector names (here year and population) as labels for the x- and y-axes, and places “tick marks” at sensible intervals, which are nice touches. (Note the use of the type="o" option, which tells R to plot both points and lines in an “overplotted” format; see help(plot) for additional options like this.)
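To see the defaults described above, and how to override them, the following sketch draws the Houston plot twice: once with the default labels and once with explicit axis labels and a title. xlab, ylab, and main are standard arguments to plot(); the label text itself is only a suggestion.

```r
year <- seq(1900, 2000, 10)
population <- c(44633, 78800, 138276, 292352, 384514, 596163,
                938219, 1232802, 1595138, 1630553, 1953631)

# Default labels come from the variable names: "year" and "population".
plot(population ~ year, type = "o")

# Explicit labels and a title. type = "b" also draws points and lines,
# but leaves small gaps around each point instead of overplotting.
plot(population ~ year, type = "b",
     xlab = "Census year", ylab = "Population",
     main = "Population of Houston, 1900-2000")
```

Other values of type include "p" (points only) and "l" (lines only).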
Also note how plot can accept input in R’s “model formula” format to determine which variables or vectors to plot (as opposed to the x=, y= argument format). Although this way of doing things may seem strange at first, it’s a good idea to become familiar with specification in model format: population ~ year, which basically means “population modeled on year”. Once you are comfortable with model format, you’ll find it helps simplify some more complex plotting and analysis operations: the general format used for any multivariate “model-like” object (whether for plotting or for analyzing in some other way) is to specify the “response” variable first (the thing being modeled or predicted), followed by one or more “predictor” variables: y ~ x or (for more complicated formulas) y ~ a + b + c. (For more uses of model formulas, see the examples in
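As a small sketch of the formula notation just described, note that a formula is an ordinary R object: it can be stored, inspected, and passed to plot() together with a data frame. The data frame name houston and the variable f are chosen here for illustration only.

```r
houston <- data.frame(
  year = seq(1900, 2000, 10),
  population = c(44633, 78800, 138276, 292352, 384514, 596163,
                 938219, 1232802, 1595138, 1630553, 1953631)
)

# A formula can be stored in a variable like any other object.
f <- population ~ year
class(f)     # "formula"
all.vars(f)  # response first, then predictor: "population" "year"

# With data =, the names in the formula are looked up as columns.
plot(f, data = houston, type = "o")

# Equivalent call giving x and y directly (note the order: x, then y).
plot(houston$year, houston$population, type = "o")

# A formula with several predictors just lists more names after the ~.
all.vars(y ~ a + b + c)  # "y" "a" "b" "c"
```

The two plot() calls draw the same points; the formula version reads in "response ~ predictor" order, while the positional version takes the horizontal variable first.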
1 I know, I know, technically neither months nor years are exactly equal intervals, but this is typically overlooked in the interest of common sense and interpretability.
2 In 1990 Houston passed Philadelphia, which had previously been listed as one of the four largest cities in the country since the first Census in 1790.
3 The idea of reserving some data to use later to verify our results is a good one; all too often analysts use every last shred of data in developing a methodology, and then have to wait to actually test it against new data. By reserving some of the numbers for this purpose we have instant access to the data we need to check our predictions, and we also guard against “over-fitting” our model to the existing data.