Pitfalls of Working with Time-Series Data

Posted by Ezra Glenn on April 24, 2012
Data, Missions, Shape Your Neighborhood, Simulation

In addition to the general caution against using past data for projecting future conditions (and the need for equally spaced time intervals mentioned above), the particulars of time series data require additional attention to some special issues.

Inflation and Constant Dollars

Any time series that deals with dollars (or yen, pounds sterling, wampum, or other forms of currency) must confront the fact that the value of money changes over time. If you are simply making a time series showing the shrinking value of the dollar, that’s fine — it’s what you want to show — but if you want to show something else (say, changes in wages or home prices), then you will need to correct your data to some common base. Usually this is done by starting with a base year (often the start or end of the series, or the “current” year) and adjusting values based on changes to some official inflation statistic (e.g., the consumer price index).1
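The adjustment itself is a simple rescaling: multiply each nominal value by the ratio of the base-year index to that year's index. Here is a minimal sketch in Python, using only the standard library. The function name `to_constant_dollars` and every number in it are invented for illustration; they are not real CPI or wage figures.

```python
def to_constant_dollars(nominal, index, base_year):
    """Rescale nominal values so every year is expressed in base_year dollars."""
    base = index[base_year]
    return {year: value * base / index[year] for year, value in nominal.items()}


# Hypothetical numbers, invented for illustration only (not real statistics).
price_index = {2000: 100.0, 2005: 125.0, 2010: 150.0}
median_wage = {2000: 30000.0, 2005: 36000.0, 2010: 42000.0}

real_wage = to_constant_dollars(median_wage, price_index, base_year=2010)
for year in sorted(real_wage):
    print(year, round(real_wage[year], 2))
```

In this made-up series the nominal wage rises every period, but once everything is expressed in 2010 dollars the wage actually falls, which is the kind of reversal that an unadjusted plot would hide.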

Growth and Change to the Underlying Population

Over time — especially over long periods — the population of a place can change quite a lot, both in terms of overall numbers and the demographic components. As with inflation, this may be precisely the change that you are interested in observing and predicting (as in the first examples in this chapter), but at times it can introduce a spurious or intervening variable into your analysis.

As an obvious example of the sort of thinking that has tripped up more than one official indicator, consider the growth in automobile fatalities on U.S. highways: according to the U.S. Department of Transportation, 36,399 people died on the highways in 1960, and by 2005 this number had grown to 43,510. That is an alarming increase of nearly 20%, until one realizes that the number of people, cars, and highways in the country has grown as well. In this case, dividing the raw number of fatalities by the number of “vehicle miles traveled” is a common way to arrive at a standard fatality rate to track across the decades. The graph below depicts time series plots of the raw numbers (top) and the rates (bottom).

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-highway_deaths.jpg

Two Plots of Highway Fatalities
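The arithmetic behind the two plots can be sketched in a few lines of Python. The fatality counts are the ones quoted above; the vehicle-miles-traveled figures are round placeholder values assumed for illustration, not official statistics, and should be replaced with the published series before drawing any conclusions. The helper names are hypothetical.

```python
def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


def rate_per_100m_vmt(deaths, vmt_billions):
    """Deaths per 100 million vehicle miles traveled.

    One billion miles is ten units of 100 million miles.
    """
    return deaths / (vmt_billions * 10.0)


# Fatality counts as quoted in the text.
deaths = {1960: 36399, 2005: 43510}

# ASSUMPTION: round placeholder values in billions of miles, for illustration only.
vmt_billions = {1960: 700.0, 2005: 3000.0}

rates = {year: rate_per_100m_vmt(deaths[year], vmt_billions[year]) for year in deaths}

print("raw change (%):", round(percent_change(deaths[1960], deaths[2005]), 1))
print("rate change (%):", round(percent_change(rates[1960], rates[2005]), 1))
```

The raw counts rise by about 19.5%, while under the placeholder denominators the rate falls sharply. The direction of the second result depends entirely on the denominator, which is why the choice of denominator deserves as much scrutiny as the numerator.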

This example may seem obvious, but there are other more subtle ways for changes to the overall population or the phenomenon being observed to creep in unnoticed, and it is not always clear or simple to correct for them. Consider the three time series graphs below, which depict data from the OFDA/CRED International Disaster Database. The top graph shows the number of major natural disasters over the period shown, while the middle and bottom graphs attempt to estimate the total damages (in millions of US$) and total annual deaths, respectively.

http://eglenn.scripts.mit.edu/citystate/wp-content/uploads/2012/04/wpid-disasters.jpg

Damages from Natural Disasters: should there be a denominator?

While the graphs do seem to be saying something about the growing risk (and increasingly erratic nature) of disasters since 1970, it is not entirely clear how to put this data into perspective. For starters, we should adjust the “damages” figures for inflation (see “Inflation and Constant Dollars,” above), but there seems to be something else at work as well. The population of the world has more than doubled over this time period, and all those people have settled a lot more of the earth. Are disasters becoming more frequent, severe, or deadly, or are we just more likely to be putting people and things in harm’s way? Were there “events” in 1970 that would not have been considered “disasters,” simply because no one was living in the blast zone at the time? (“If a tree falls in the forest and no one has built a cabin under it, can we still file an insurance claim…?”) The analysis of time series data is rife with just these sorts of problems; it’s one of the reasons that it can be so fun, but also so challenging.

Cyclical Trends

In addition to showing general trends over long periods, time series data will often also capture more fine-grained patterns of predictable variation: weekday vs. weekend data across a number of weeks in commuting patterns (or illnesses, or movie attendance, or whatever), for example, or seasonal cycles over a year. Some periods may be much longer and easier to forget about: for example, think about how voter turnout varies from year to year based on whether the election includes a vote for president (every four years), senators (twice every six years), the House of Representatives (every other year), or local offices only (in many locations, every other year in off years from national elections); these factors may create all sorts of periodic variation against a backdrop of general decline (or growth) in voter turnout.

Fortunately for us, R has some great tools to explore and address cyclical patterns in time series data, and the forecast() function described in “Examining Historical Growth III: The forecast() package” is generally able to notice such trends and act accordingly. (In a future mission, we’ll try this out with some data on urban crime from the city of Boston.)
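To make the idea of a cycle riding on top of a trend concrete, here is a deliberately naive sketch in Python (standard library only). It averages the observations at each position within the cycle and subtracts that pattern from the series. This is not how the forecast tools in R work internally; it is only a seasonal-means adjustment, and it assumes the series covers a whole number of cycles. The function names and the ridership counts are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def seasonal_means(values, period):
    """Average of the observations at each position within the cycle."""
    buckets = defaultdict(list)
    for i, v in enumerate(values):
        buckets[i % period].append(v)
    return [mean(buckets[p]) for p in range(period)]


def deseasonalize(values, period):
    """Subtract each position's deviation from the overall mean."""
    overall = mean(values)
    cycle = seasonal_means(values, period)
    return [v - (cycle[i % period] - overall) for i, v in enumerate(values)]


# Hypothetical daily counts for three weeks (Monday through Sunday), invented.
riders = [100, 102, 101, 103, 99, 60, 55,
          104, 106, 105, 107, 103, 64, 59,
          108, 110, 109, 111, 107, 68, 63]

adjusted = deseasonalize(riders, period=7)
for week in range(3):
    print([round(x, 1) for x in adjusted[week * 7:(week + 1) * 7]])
```

In the raw numbers the weekend dip dominates everything else. Once the weekly pattern is removed, each week is flat and the only thing left is the week-to-week growth, which is the underlying trend the cycle was obscuring.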

Other Sticky Problems

Beyond the issues outlined here, time series data can present other problems. There can be gaps in the series, or inconsistencies in the ways data was collected; both problems can occur in any data set, but are more likely in those that are gathered over a period of many years. In addition to underlying populations changing (see above), definitions for the very thing being measured may change. The official definition used to measure the country’s unemployment rate, for example, has changed so significantly over the years as to make comparisons between periods difficult and unreliable.
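Gaps, at least, are easy to check for before any analysis begins. The following Python sketch lists the periods that an equally spaced series ought to contain but does not. The function name `find_gaps` and the list of observed years are hypothetical, and the sketch assumes integer periods (such as years) with a known, fixed step.

```python
def find_gaps(periods, step=1):
    """Return the expected periods that are absent from the series."""
    present = set(periods)
    expected = range(min(periods), max(periods) + 1, step)
    return [p for p in expected if p not in present]


# Hypothetical years for which observations exist, invented for illustration.
observed = [1990, 1991, 1992, 1994, 1995, 1998]

print(find_gaps(observed))  # [1993, 1996, 1997]
```

Knowing where the gaps fall is only the first step; whether to interpolate, drop the affected span, or find another source is a judgment call that depends on the data.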

Finally, when measurement periods are too spaced out (as, for example, with the Decennial Census), a lot may happen in the intervening years that never makes it into the data. Being mindful of all of these issues, and thinking creatively about how to address or minimize their effects on your analysis, will go a long way towards improving the knowledge base you use for your planning work.

Footnotes:

1 Of course, the choice of this index may introduce additional issues of reliability, availability, and hidden biases…
