We’re pleased to announce the creation of a new mailing list for the acs.R package. The “acs” package allows users to download, manipulate, analyze, and visualize data from the American Community Survey in R; the “acs-r” e-mail list allows members to keep in touch and share information about the package, including updates from the development team concerning improvements, user questions and help requests, worked examples, and more. To register, visit http://mailman.mit.edu/mailman/listinfo/acs-r.
A very nice user wrote the following in an email to me about the latest version of the acs.R package:
> Thanks for providing such a wonderful package in R. I'm having difficulty defining a geo at the block group level. Would you mind sharing an example with me?
I responded via email, but thought that my answer — which took the form of a short worked-example — might be helpful to others, so I am posting it here as well. Here’s what I said:
To showcase how the package can create new census geographies based on units like block groups, let’s look in my home state of Massachusetts, in Middlesex County. If I wanted to get info on all the block groups for tract 387201 (see footnote 1), I could create a new geo like this:
> my.tract=geo.make(state="MA", county="Middlesex", tract=387201, block.group="*", check=T)
Testing geography item 1: Tract 387201, Blockgroup *, Middlesex County, Massachusetts .... OK.
>
(This might be a useful first step, especially if I didn’t know how many block groups there were in the tract, or what they were called. Also, note that check=T is not required, but can often help ensure you are dealing with valid geos.)
If I then wanted to get very basic info on these block groups – say, table number B01003 (Total Population), I could type:
> total.pop=acs.fetch(geo=my.tract, table.number="B01003")
> total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
              B01003_001
Block Group 1 2681 +/- 319
Block Group 2 952 +/- 213
Block Group 3 1010 +/- 156
Block Group 4 938 +/- 214
>
Here we can see that block.group="*" has yielded the four actual block groups for the tract.
Now, if instead of wanting all of them, we only wanted the first two, we could just type:
> my.bgs=geo.make(state="MA", county="Middlesex", tract=387201, block.group=1:2, check=T)
Testing geography item 1: Tract 387201, Blockgroup 1, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 387201, Blockgroup 2, Middlesex County, Massachusetts .... OK.
>
And then:
> bg.total.pop=acs.fetch(geo=my.bgs, table.number="B01003")
> bg.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
              B01003_001
Block Group 1 2681 +/- 319
Block Group 2 952 +/- 213
>
Now, if we wanted to add in some blockgroups from tract 387100 (a.k.a. “tract 3871”; but remember, we need those trailing zeroes), say blockgroups 2 and 3, we could enter:
> my.bgs=my.bgs+geo.make(state="MA", county="Middlesex", tract=387100, block.group=2:3, check=T)
Testing geography item 1: Tract 387100, Blockgroup 2, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 387100, Blockgroup 3, Middlesex County, Massachusetts .... OK.
And then:
> new.total.pop=acs.fetch(geo=my.bgs, table.number="B01003")
> new.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
              B01003_001
Block Group 1 2681 +/- 319
Block Group 2 952 +/- 213
Block Group 2 827 +/- 171
Block Group 3 1821 +/- 236
>
Note that the short rownames can be confusing, as in this example, but if you type:
> geography(new.total.pop)
NAME state county tract blockgroup
1 Block Group 1 25 17 387201 1
2 Block Group 2 25 17 387201 2
3 Block Group 2 25 17 387100 2
4 Block Group 3 25 17 387100 3
>
you can see that the two entries for “Block Group 2” are actually in different tracts. (Also note: you can combine block groups and other levels of geography, all in a single geo object…)
And now, to show off the coolest part! Let’s say I don’t just want to get data on the four blockgroups, but I want to combine them into a single new geographic entity. Before downloading, I could simply say:
> combine(my.bgs)=T
> combine.term(my.bgs)="Select Blockgroups"
> new.total.pop=acs.fetch(geo=my.bgs, table.number="B01003")
> new.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
                   B01003_001
Select Blockgroups 6281 +/- 481.733328720362
>
And see – voila! – it sums the estimates and deals with the margins of error, so you don’t need to get your hands dirty with square roots and standard errors and all that messy stuff.
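For readers curious about what is happening behind the scenes, the combined figure can be reproduced by hand in base R. This is only a sketch of the usual approach for summing ACS estimates (add the estimates, then take the square root of the sum of the squared margins of error). The variable names are illustrative, and the package itself may organize the calculation differently, for instance by working with standard errors rather than margins of error.

```r
# Estimates and 90% margins of error for the four block groups,
# copied from the uncombined acs.fetch() output shown earlier.
estimates <- c(2681, 952, 827, 1821)
moes      <- c(319, 213, 171, 236)

# Sum the estimates; combine the margins of error as a root sum of squares.
combined.estimate <- sum(estimates)
combined.moe      <- sqrt(sum(moes^2))

cat(combined.estimate, "+/-", combined.moe, "\n")
```

The result agrees with the 6281 +/- 481.73 reported for the combined geography.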
You can even create interesting nested geo.sets, where some of the lower levels are combined, like this:
> combine.term(my.bgs)="Select Blockgroups, Tracts 387100 and 387201"
> more.bgs=c(my.bgs, geo.make(state="MA", county="Middlesex", tract=370300, block.group=1:2, check=T), geo.make(state="MA", county="Middlesex", tract=370400, block.group=1:3, combine=T, combine.term="Select Blockgroups, Tract 3703", check=T))
Testing geography item 1: Tract 370300, Blockgroup 1, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 370300, Blockgroup 2, Middlesex County, Massachusetts .... OK.
Testing geography item 1: Tract 370400, Blockgroup 1, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 370400, Blockgroup 2, Middlesex County, Massachusetts .... OK.
Testing geography item 3: Tract 370400, Blockgroup 3, Middlesex County, Massachusetts .... OK.
> more.total.pop=acs.fetch(geo=more.bgs, table.number="B01003")
> more.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
                                             B01003_001
Select Blockgroups, Tracts 387100 and 387201 6281 +/- 481.733328720362
Block Group 1 315 +/- 132
Block Group 2 1460 +/- 358
Select Blockgroups, Tract 3703 2594 +/- 487.719181496894
>
In closing: I hope this helps, and be sure to contact me if you have other questions/problems about using the package.
Footnotes:
1 Note that tracts are often referred to in a strange “four-digit+decimal extension” shorthand, so “tract 387201” may also be known as “tract 3872.01”. When working with this package, be careful to always use six-digit tract numbers without the decimal point. If the tract number seems to be only four digits long, add two extra “trailing” zeroes at the end.
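If you have a list of tract labels in the decimal shorthand, the conversion described in this footnote can be scripted. The helper below is a hypothetical convenience function, not part of the acs package; it simply multiplies by 100 and rounds, which both drops the decimal point and supplies the trailing zeroes.

```r
# Hypothetical helper (NOT provided by the acs package): convert a tract
# label in "four-digit + decimal extension" shorthand to six-digit form.
tract.six.digit <- function(label) {
  # round() guards against floating-point noise after multiplying by 100
  as.integer(round(as.numeric(label) * 100))
}

tract.six.digit("3872.01")  # 387201
tract.six.digit("3871")     # 387100
```

The resulting numbers can then be passed as the tract argument to geo.make(), as in the examples above.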
It’s been a while since I last updated the acs.R package, but as noted here, I’ll be using CityState to provide updates and test-versions of the package prior to uploading to CRAN. I’m happy to report that we now have a near-final package of version 1.0.
The most significant improvements to the package (beyond those mentioned previously) are the following:
- The package is now capable of downloading data directly from the new Census American Community Survey API and importing into R (with proper statistical treatment of estimates and error, variable and geographic relabeling, and more), all through a single “acs.fetch()” function;
- The package includes a new “geo.make()” function to allow users to create their own custom geographies for organizing and downloading data; and
- The package provides two special “lookup” tools to help users filter through all the existing Census geographies (with the “geo.lookup()” function) and tables (with the “acs.lookup()” function) to find exactly what they want. These functions return new R “lookup” objects which can be saved, manipulated, and passed to acs.fetch() for downloading data.
I want to thank the very kind folks at the Puget Sound Regional Council, who have been supporting the development of this package (in exchange for some special attention to scripts and functions they really want to include for themselves and their member communities). They continue to provide excellent help and advice “from the trenches” as we refine the package.
If you’re interested in trying out the new version, you can download it below, along with a brief set of “Introductory Notes” written for the team at PSRC. (Users may also want to check out the manual for the previous version of the package and this article from 2011 on the package.)
Previous missions have demonstrated a whole lot of things you can do with census data. Here are a few of the problems you can get yourself into.
Census Geography Pitfalls
- Unequal tracts: Despite what you may think, not all Census tracts (and their composite block groups and blocks) are created equal. The Census Bureau tries to structure their geography so that all tracts will be approximately the same size (about 4,000 people), but in practice there is a pretty large range (between 1,500 and 8,000 people per tract). If you are looking at raw numbers (counts of any sort), be sure to think about the overall population—it’s the denominator you’ll need to put the figures in perspective; conversely, if you are looking at percentages, remember that a small percentage of a large tract could actually be more people than a large percentage of a very small one.
- Overlapping districts: Unfortunately, although the formal “pyramid” of Census geography is well-structured—building from block to block group to tract and so on up—our political and cultural divisions are not always so straightforward: cities sometimes spread across county lines, metropolitan areas may even cross state lines, and legislative districts have become a gerrymandered mess that would drive any rational cartographer to drink. As a result, there may be times when Census geographers have been forced to choose between a strictly “nested” geography that ignores higher-order political elements, and one with intermediate levels that do not fit neatly within each other.
- Confusing or ambiguous place names: Partially related to the previous point, and partially due to the general orneriness of the culture (or perhaps the species), there are often times when the same name will occur in multiple places in Census geography. The name “New York” refers to a state, a metro region, a city, a county (strangely, one that is smaller than its city), and even an avenue in Atlantic City (or on the Monopoly Board). Luckily, once you get down to the level of census tracts and below, you enter the realm of pure-and-orderly numbers, and can largely avoid this trap (they are even sometimes referred to as “logical record numbers,” or LOGRECNO), although it’s a lot less fun to say “AR census tract 9803 block group 3” when you could be saying “Goobertown, Arkansas”.
- Changing boundaries: Occasionally the Census Bureau needs to redraw the lines for some particular location—perhaps a city has annexed new land, or a large county has been split by an act of the state legislature. In these situations, you may see a sharp rise (or drop) in the counts from one Census to the next. For example, according to the 2000 census, the city of Bigfork, Montana had 1,421 people; in the 2010 census, this figure had grown to 4,270—a seeming tripling of the population. However, upon closer scrutiny, it turns out that most of this increase was the result of a change in census boundaries. (These situations may also exacerbate some of the previous problems.)
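To make the “unequal tracts” point concrete, here is a small sketch in R. The figures are hypothetical and chosen only for illustration (they are not real Census data): one tract near the top of the size range and one at the bottom, each with a different share of residents in some category of interest.

```r
# Hypothetical figures, for illustration only (not real Census data).
tracts <- data.frame(
  tract      = c("large tract", "small tract"),
  population = c(8000, 1500),
  percent    = c(5, 20)    # share of residents in some category
)

# Convert each percentage back into a head count.
tracts$count <- tracts$population * tracts$percent / 100
tracts
```

Here the small percentage in the large tract works out to more people than the large percentage in the small one, which is why it helps to keep both the count and the denominator in view.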
In our last mission we used R to plot a trend-line for population growth in Houston, based on historical data from the past century. Depending on which of two different methods we used, we arrived at an estimate for the city’s 2010 population of 2,144,531 (based on the 100-year growth trend for the city) or 2,225,125 (based on the steeper growth trend of the past fifty years). Looking now at the official Census count for 2010, it turns out that our guesses are close, but both too high: the actual reported figure for 2010 is 2,099,451.
It would have been surprising to have guessed perfectly based on nothing other than a linear trend — and the fact that we came as close as we did speaks well of this sort of “back of the envelope” projection technique (at least for the case of steady-growth). But there was a lot of information contained in those data points that we essentially ignored: our two trendlines were really based on nothing more than a start and an end point.
A more sophisticated set of tools for making projections, which may be able to extract some extra meaning from the variation contained in the data, is provided in R by the excellent forecast package, developed by Rob Hyndman of Monash University in Australia. To access these added functions, you’ll need to install and load it:
> install.packages("forecast")
> library(forecast)
Time-series in R: an object with class
Although R is perfectly happy to help you analyze and plot time series data organized in vectors and dataframes, it actually has a specialized object class for this sort of thing, created with the ts() function. Remember: R is an “object-oriented” language. Every object (a variable, a dataframe, a function, a time series) is associated with a certain class, which helps the language figure out how to manage and interact with it. To find the class of an object, use the class() function:
> a=c(1,2)
> class(a)
[1] "numeric"
> a=TRUE
> class(a)
[1] "logical"
> class(plot)
[1] "function"
> a=ts(1)
> class(a)
[1] "ts"
>
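As a slightly fuller illustration of the ts() function, the sketch below builds a time series with one observation per decade. The values are hypothetical placeholders rather than real population counts; the point is only to show how the start and deltat arguments attach a time index to an ordinary numeric vector.

```r
# Hypothetical values, one per decade from 1900 to 2000 (eleven points).
values <- c(10, 14, 19, 25, 33, 42, 54, 67, 82, 98, 115)

# start = first time point; deltat = spacing between observations (in years).
decades <- ts(values, start = 1900, deltat = 10)

class(decades)   # the object now carries the "ts" class
time(decades)    # the time index attached to each observation
```

Because the object knows its own time index, functions such as plot() and window() can use it without a separate column of years.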
In our previous mission we plotted population numbers in Houston for 1900–2000, to start to understand the growth trend for that city. Now, what if we didn’t have access to the latest Census figures, and we wanted to try to guess Houston’s population for 2010, using nothing but the data from 1900–2000?
One place to start would be with the 2000 population (1,953,631) and adjust it a bit based on historical trends. With 100 years’ worth of data, we can do this in R with some simple vector math (see footnote 1).
> attach(houston.pop) # optional, see footnote
> population[11] # don't forget: 11, not 10, data points
[1] 1953631
> annual.increase=(population[11]-population[1])/100 # watch the parentheses!
> population[11]+10*annual.increase
[1] 2144531
>
Remember that we actually have eleven data points, since we have both 1900 and 2000, so we need to specify population[11] as our endpoint. But since those eleven points span only ten decade intervals (100 years), we divide by 100 to get the annual increase. Adding ten times this increase to the 2000 population, we get an estimate for 2010 of 2,144,531. (Bonus question: based on this estimated annual increase, in what year would Houston have passed the two-million mark? See footnote 2.)
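For anyone who wants to check their answer to the bonus question, here is one way to set it up using only the two figures quoted above. This is a sketch under a stated assumption: the annual increase is recovered from the printed (rounded) 2010 estimate rather than from the original houston.pop data, so it may differ very slightly from the value R computed in the session.

```r
pop.2000 <- 1953631   # 2000 population, from the text
est.2010 <- 2144531   # projected 2010 population, from the text

# Assumption: back out the annual increase from the two printed figures.
annual.increase <- (est.2010 - pop.2000) / 10

# Years after 2000 needed to reach two million at that constant rate.
years.to.two.million <- (2000000 - pop.2000) / annual.increase
2000 + years.to.two.million
```

At roughly 19,000 additional people per year, the linear trend crosses two million between two and three years after the 2000 count.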
The nature of predictions
To paraphrase John Allen Paulos, author of A Mathematician Reads the Newspaper, all expert predictions can be essentially restated in one of two ways: “Things will continue roughly as they have been until something changes”; and its corollary, “Things will change after an indeterminate period of stability.” Although these statements are both true and absurd, they contain a kernel of wisdom: simply assuming a relative degree of stability and painting a picture of the future based on current trends is the first step of scenario planning. The trick, of course, is to never completely forget the “other shoe” of Paulos’s statement: as the disclaimer states on all investment offerings, “Past performance is not a guarantee of future results”; at some point in the future our present trends will no longer accurately describe where we are headed. (We will deal with this as well, with a few “safety valves.”)
From the second stage of the Rational Planning Paradigm (covered in the background sections of the book) we should have gathered information on both past and present circumstances related to our planning effort. If we are looking at housing production, we might have data on annual numbers of building permits and new subdivision approvals, mortgage rates, and housing prices; if we are looking at public transportation we might need monthly ridership numbers, information on fare changes, population and employment figures, and even data on past weather patterns or changes in vehicle ownership and gas prices. The first step of projection, therefore, is to gather relevant information and get it into a form that you can use.
Since we will be thinking about changes over time in order to project a trend into the future, we’ll need to make sure that our data has time as an element: a series of data points with one observation for each point or period of time is known as a time series. The exact units of time are not important (they could be days, months, years, decades, or something different), but it is customary (and important) to obtain data where points are regularly spaced at even intervals (see footnote 1). Essentially, time series data is a special case of multivariate data in which we treat time itself as an additional variable and look for relationships as it changes. Luckily, R has some excellent functions and packages for dealing with time-series data, which we will cover below in passing. For starters, however, let’s consider a simple example, to start to think about what goes into projections.
I’m happy to report that I’ll be giving a paper on my acs package at the 8th annual useR! conference, coming June 12-15 to Vanderbilt University in Nashville, TN. The paper is titled “Estimates with Errors and Errors with Estimates: Using the R acs Package for Analysis of American Community Survey Data.” Here’s the abstract:
“Estimates with Errors and Errors with Estimates: Using the R acs Package for Analysis of American Community Survey Data”
Ezra Haber Glenn
Over the past decade, the U.S. Census Bureau has implemented the American Community Survey (ACS) as a replacement for its traditional decennial “long-form” survey. Last year, for the first time ever, ACS data was made available at the census tract and block group level for the entire nation, representing geographies small enough to be useful to local planners; in the future these estimates will be updated on a yearly basis, providing much more current data than was ever available in the past. Although the ACS represents a bold strategy with great promise for government planners, policy-makers, and other advocates working at the neighborhood scale, it will require them to become comfortable with statistical techniques and concerns that they have traditionally been able to avoid. To help with this challenge the author has been working with local-level planners to determine the most common problems associated with using ACS data, and has implemented these functions as a package in R. The package (currently hosted on CRAN in version 0.8) defines a new “acs” class object (containing estimates, standard errors, and metadata for tables from the ACS), with methods to deal appropriately with common tasks (e.g., combining subgroups or geographies, mathematical operations on estimates, tests of significance, plots of confidence intervals, etc.). This paper will present both the use and the internal structure of the package, with discussion of additional lines of development.
Hope to see you all there!
In a previous mission (see Finding Obama in the smallest Census geography) we delved down to see what data was available at the level of individual blocks. Unfortunately, as we noted there, the Census doesn’t provide a whole lot of useful data at the block level, since the results will exclude sample data from the SF3 “long form” (or, post-2000, the American Community Survey). If we want to know more about a neighborhood we will need to think in slightly larger geographies, and seek data at the tract level or higher.
For this mission, we’ll be zooming in to the Park Slope neighborhood in Brooklyn, and gathering data on income, race, education, and the breakdown of owners and renters for a single census tract. Since it’s often helpful to be able to view data like this in the context of the surrounding neighborhood, subsequent missions will explore ways to make comparisons with this sort of data, either to other tracts or to larger geographies.
But for starters, our target: although defining the exact edges of a neighborhood is never easy – especially ones in dense, diverse areas, where even residents disagree over terminology and the continual processes of gentrification, urban decline, migration, and other demographic shifts continually redefine the categories – most observers would agree that the neighborhood extends roughly north and west from Bartel Pritchard Square, at the lower corner of Prospect Park, with both 15th Street and Prospect Park itself providing something of an “edge.” Since edges are often exciting places to observe change, we will select an address along 15th Street, near the corner of 5th Avenue.
The most basic unit of the U.S. Census is the individual household — that’s who fills out the surveys – but the Census won’t report data at the household level: in order to deliver on its promise of privacy and confidentiality (and thereby ensure our willingness to be enumerated), the Census always aggregates data before releasing it. This is important, and should become something of a mantra for would-be data analysts: all Census data is summary data. That said, we can still learn quite a lot at these micro-geographies, especially when we know what we are looking for.
Finding Barack
As an example of how to work with the building blocks of Census summary data – the individual “blocks” – let’s go back a bit in time and look at a very particular neighborhood in Chicago. At the time of the 2000 Census, Barack Obama (later President) was serving as an Illinois state senator, living at 5429 S. Harper Avenue in Chicago. Starting with just an address, you can easily find how it fits into the census geography on the “American FactFinder” site: just visit the main Census site, click the menu-bar for Data, and select the link for American FactFinder.
