
Examining Historical Growth II: Using the past to predict the future

Posted by Ezra Glenn on April 12, 2012
Data, Missions, Shape Your Neighborhood, Simulation

In our previous mission we plotted population numbers in Houston for 1900–2000, to start to understand the growth trend for that city. Now, what if we didn’t have access to the latest Census figures, and we wanted to try to guess Houston’s population for 2010, using nothing but the data from 1900–2000?

One place to start would be with the 2000 population (1,953,631) and adjust it a bit based on historical trends. With 100 years’ worth of data, we can do this in R with some simple vector math.[1]

> attach(houston.pop) # optional, see footnote
> population[11]      # don't forget: 11, not 10, data points
[1] 1953631
> annual.increase=(population[11]-population[1])/100   # watch the parentheses!
> population[11]+10*annual.increase
[1] 2144531
> 

Remember that we actually have eleven data points, since we have both 1900 and 2000, so we need to specify population[11] as our endpoint. But since there are only ten decade intervals (one hundred years), we divide the total change by 100 to get the annual increase. Adding ten times this increase to the 2000 population, we get an estimate for 2010 of 2,144,531. (Bonus question: based on this estimated annual increase, in what year would Houston have passed the two-million mark?[2])
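
One quick way to check your answer to the bonus question, reusing the objects from the session above and counting forward from the 2000 figure:

> (2000000-population[11])/annual.increase  # roughly 2.4 years past 2000, i.e. during 2002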



Examining Historical Growth I: Basic trends

Posted by Ezra Glenn on April 11, 2012
Data, Missions, Shape Your Neighborhood, Simulation

The nature of predictions

To paraphrase John Allen Paulos, author of A Mathematician Reads the Newspaper, all expert predictions can be essentially restated in one of two ways: “Things will continue roughly as they have been until something changes”; and its corollary, “Things will change after an indeterminate period of stability.” Although these statements are both true and absurd, they contain a kernel of wisdom: simply assuming a relative degree of stability and painting a picture of the future based on current trends is the first step of scenario planning. The trick, of course, is to never completely forget the “other shoe” of Paulos’s statement: as the disclaimer states on all investment offerings, “Past performance is not a guarantee of future results”; at some point in the future our present trends will no longer accurately describe where we are headed. (We will deal with this as well, with a few “safety valves.”)

From the second stage of the Rational Planning Paradigm (covered in the background sections of the book) we should have gathered information on both past and present circumstances related to our planning effort. If we are looking at housing production, we might have data on annual numbers of building permits and new subdivision approvals, mortgage rates, and housing prices; if we are looking at public transportation we might need monthly ridership numbers, information on fare changes, population and employment figures, and even data on past weather patterns or changes in vehicle ownership and gas prices. The first step of projection, therefore, is to gather relevant information and get it into a form that you can use.

Since we will be thinking about changes over time in order to project a trend into the future, we’ll need to make sure that our data has time as an element: a series of data points with one observation for each point or period of time is known as a time series. The exact units of time are not important—they could be days, months, years, decades, or something different—but it is customary (and important) to obtain data where points are regularly spaced at even intervals.[1] Essentially, time-series data is a special case of multivariate data in which we treat time itself as an additional variable and look for relationships as it changes. Luckily, R has some excellent functions and packages for dealing with time-series data, which we will cover below in passing. For starters, however, let’s consider a simple example, to start to think about what goes into projections.
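
To make this concrete, here is a minimal sketch of storing a decennial series as a time series in base R, using the ts() function; the population values below are made-up placeholders, not real data:

pop=c(50000, 61000, 74000, 90000, 110000, 134000,
      163000, 199000, 243000, 296000, 361000)  # placeholder values only
pop.ts=ts(pop, start=1900, deltat=10)  # one observation every ten years
time(pop.ts)  # 1900, 1910, ..., 2000
plot(pop.ts)  # a quick look at the trend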


acs Package at Upcoming Conference: UseR! 2012

Posted by Ezra Glenn on April 09, 2012
Census, Code, Self-promotion

I’m happy to report that I’ll be giving a paper on my acs package at the 8th annual useR! conference, coming June 12-15 to Vanderbilt University in Nashville, TN. The paper is titled “Estimates with Errors and Errors with Estimates: Using the R acs Package for Analysis of American Community Survey Data.” Here’s the abstract:


"Estimates with Errors and Errors with Estimates: Using the R acs
Package for Analysis of American Community Survey Data"
Ezra Haber Glenn

Over the past decade, the U.S. Census Bureau has implemented the
American Community Survey (ACS) as a replacement for its traditional
decennial "long-form" survey.  Last year—for the first time
ever—ACS data was made available at the census tract and block group
level for the entire nation, representing geographies small enough to
be useful to local planners; in the future these estimates will be
updated on a yearly basis, providing much more current data than was
ever available in the past.  Although the ACS represents a bold
strategy with great promise for government planners, policy-makers,
and other advocates working at the neighborhood scale, it will require
them to become comfortable with statistical techniques and concerns
that they have traditionally been able to avoid.

To help with this challenge the author has been working with
local-level planners to determine the most common problems associated
with using ACS data, and has implemented these functions as a package
in R.  The package—currently hosted on CRAN in version 0.8—defines
a new "acs" class object (containing estimates, standard errors, and
metadata for tables from the ACS), with methods to deal appropriately
with common tasks (e.g., combining subgroups or geographies,
mathematical operations on estimates, tests of significance, plots of
confidence intervals, etc.).

This paper will present both the use and the internal structure of the
package, with discussion of additional lines of development.

Hope to see you all there!


Constantly Improving: acs development versions

Posted by Ezra Glenn on March 29, 2012
Census, Code

As noted elsewhere here on CityState, I’ve developed a package for working with data from the American Community Survey in the R statistical computing language. The most recent official version of the package is 0.8, which can be found on CRAN. Since the package is still in active development, I’ve decided to provide development snapshots here, for users who are looking to work with the latest code as I develop it.

I’m hoping that the next major release will be version 1.0, due out sometime this spring. As I work towards that, here is version 0.8.1, which can be considered the first “snapshot” headed toward this release.

acs_0.8.1.tar.gz

To install, simply download, start R, and type:

> install.packages("path/to/file/acs_0.8.1.tar.gz", repos=NULL, type="source")
> library(acs)

Updates include:

  • read.acs can now accept either a csv or a zip file downloaded directly from the FactFinder site, and it does a much better job (a) guessing how many rows to skip, (b) figuring out how to generate intelligent variable names for the columns, and (c) dealing with arcane non-numeric symbols used by FactFinder for some estimates and margins of error.
  • plot now includes a true.min= option, which allows you to specify whether you want to allow error bars to span into negative values (true.min=T, the default) or to bound them at zero (true.min=F, or some other numeric value). This seemed necessary because it looks silly to say “The number of children who speak Spanish in this tract is 15, plus or minus 80…”; at the same time, if the variable turns out to be something like the difference between the income of males and the income of females in the geography, a negative value may make a lot of sense, and should be plotted as such. (Both changes are sketched briefly after this list.)
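
By way of illustration, here is a short hedged sketch of both changes in use; the file name is a placeholder, and my.data is assumed to hold a single variable for several geographies:

my.data=read.acs("some_data.zip")  # zip files now work, as do csv files
plot(my.data)              # default true.min=T: error bars may dip below zero
plot(my.data, true.min=F)  # bound the error bars at zero instead
plot(my.data, true.min=10) # or bound them at any other numeric value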


acs Package Updated: version 0.8 now on CRAN

Posted by Ezra Glenn on March 18, 2012
Census, Code

I’ve just released a new version of my acs package for working with the U.S. Census American Community Survey data in R, available on CRAN. The current version 0.8 includes all the original version 0.6 code, plus a whole lot more features and fixes. Some highlights:

  • An improved read.acs function for importing data downloaded from the Census American FactFinder site.
  • rbind and cbind functions to help create larger acs objects from smaller ones (see the short sketch below the list).
  • A new sum method to aggregate rows or columns of ACS data, dealing correctly with both estimates and standard errors.
  • A new apply method to allow users to apply virtually any function to each row or column of an acs data object.
  • A snazzy new plot method, capable of plotting both density plots (for estimates of a single geography and variable) and multiple estimates with error bars (for estimates of the same variable over multiple geographies, or vice versa).

  • New functions to adjust the nominal values of currency from different years, for comparing dollar figures from one survey with another. (See currency.convert and currency.year in the documentation.)
  • A new tract-level dataset from the ACS for Lawrence, MA, with dollar value currency estimates (useful to show off the aforementioned new currency conversion functions).
  • A new prompt method to serve as a helper function when changing geographic rownames or variable column names.
  • Improved documentation on the acs class and all of these various new functions and methods, with examples.
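
For instance, here is a minimal hedged sketch of the combining functions, with placeholder file names:

north=read.acs("north_tracts.csv")  # two separate FactFinder downloads
south=read.acs("south_tracts.csv")
my.area=rbind(north, south)         # stack the geographies into one acs object
# cbind() would instead combine different variables for the same geographies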

With this package, once you’ve found and downloaded your data from FactFinder, you can read it into R with a single command, aggregate multiple tracts into a neighborhood with another, generate a table of estimates and confidence intervals for your neighborhood with a third command, and produce a print-ready plot of your data (complete with error bars for the margins of error) with a fourth:

my.data=read.acs("some_data.csv")  # 1: import the FactFinder download
my.neighborhood=apply(my.data, FUN="sum", MARGIN=1, agg.term="My.Neighborhood")  # 2: aggregate tracts
confint(my.neighborhood, conf.level=.95)  # 3: estimates and confidence intervals
plot(my.neighborhood, col="blue", err.col="violet", pch=16)  # 4: plot with error bars

Already this package has come a long way, in large part thanks to the input of R users, so please check it out and let me know what you think — and how I can make it better.
