We’re pleased to announce the creation of a new mailing list for the acs.R package. The “acs” package allows users to download, manipulate, analyze, and visualize data from the American Community Survey in R; the “acs-r” e-mail list allows members to keep in touch and share information about the package, including updates from the development team concerning improvements, user questions and help requests, worked examples, and more. To register, visit http://mailman.mit.edu/mailman/listinfo/acs-r.
A very nice user wrote the following in an email to me about the latest version of the acs.R package:
> Thanks for providing such a wonderful package in R. I'm having difficulty defining a geo at the block group level. Would you mind sharing an example with me?
I responded via email, but thought that my answer — which took the form of a short worked-example — might be helpful to others, so I am posting it here as well. Here’s what I said:
To showcase how the package can create new census geographies based on stuff like blockgroups, let’s look in my home state of Massachusetts, in Middlesex County. If I wanted to get info on all the block groups for tract 387201,[1] I could create a new geo like this:
> my.tract=geo.make(state="MA", county="Middlesex", tract=387201, block.group="*", check=T)
Testing geography item 1: Tract 387201, Blockgroup *, Middlesex County, Massachusetts .... OK.
(This might be a useful first step, especially if I didn’t know how many block groups there were in the tract, or what they were called. Also, note that check=T is not required, but can often help ensure you are dealing with valid geos.)
If I then wanted to get very basic info on these block groups – say, table number B01003 (Total Population) – I could type:
> total.pop=acs.fetch(geo=my.tract, table.number="B01003")
> total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
              B01003_001
Block Group 1 2681 +/- 319
Block Group 2 952 +/- 213
Block Group 3 1010 +/- 156
Block Group 4 938 +/- 214
Here we can see that the block.group="*" has yielded the actual four block groups for the tract.
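The header of that output mentions 90% confidence intervals and points to confint() for other levels. As a rough illustration of what changing the level involves, here is a base-R sketch of the usual normal-approximation conversion, using Block Group 1 from the output above. The variable names are made up, and this is not necessarily how the package's own confint() method is implemented; for real work, use confint() on the acs object.

```r
# Block Group 1 from the output above: estimate and 90% margin of error
est   <- 2681
moe90 <- 319

# The Census Bureau publishes ACS margins of error at the 90% level,
# which corresponds to 1.645 standard errors
se <- moe90 / 1.645

# Rescale to a 95% interval using the normal quantile (about 1.96)
moe95 <- qnorm(0.975) * se

print(c(lower = est - moe95, upper = est + moe95))
```

The 95% interval comes out wider than the 90% one, as you would expect.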
Now, if instead of wanting all of them, we only wanted the first two, we could just type:
> my.bgs=geo.make(state="MA", county="Middlesex", tract=387201, block.group=1:2, check=T)
Testing geography item 1: Tract 387201, Blockgroup 1, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 387201, Blockgroup 2, Middlesex County, Massachusetts .... OK.
> bg.total.pop=acs.fetch(geo=my.bgs, table.number="B01003")
> bg.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
              B01003_001
Block Group 1 2681 +/- 319
Block Group 2 952 +/- 213
Now, if we wanted to add in some blockgroups from tract 387100 (a.k.a. “tract 3871”, but remember: we need those trailing zeroes) – say, blockgroups 2 and 3 – we could enter:
> my.bgs=my.bgs+geo.make(state="MA", county="Middlesex", tract=387100, block.group=2:3, check=T)
Testing geography item 1: Tract 387100, Blockgroup 2, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 387100, Blockgroup 3, Middlesex County, Massachusetts .... OK.
> new.total.pop=acs.fetch(geo=my.bgs, table.number="B01003")
> new.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
              B01003_001
Block Group 1 2681 +/- 319
Block Group 2 952 +/- 213
Block Group 2 827 +/- 171
Block Group 3 1821 +/- 236
Note that the short rownames can be confusing – as in this example – but if you type:
> geography(new.total.pop)
           NAME state county  tract blockgroup
1 Block Group 1    25     17 387201          1
2 Block Group 2    25     17 387201          2
3 Block Group 2    25     17 387100          2
4 Block Group 3    25     17 387100          3
you can see that the two entries for “Block Group 2” are actually in different tracts. (Also note: you can combine block groups and other levels of geography, all in a single geo object…)
And now, to show off the coolest part! Let’s say I don’t just want to get data on the four blockgroups, but I want to combine them into a single new geographic entity. Before downloading, I could simply say:
> combine(my.bgs)=T
> combine.term(my.bgs)="Select Blockgroups"
> new.total.pop=acs.fetch(geo=my.bgs, table.number="B01003")
> new.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
                   B01003_001
Select Blockgroups 6281 +/- 481.733328720362
And see – voila! – it sums the estimates and deals with the margins of error, so you don’t need to get your hands dirty with square roots and standard errors and all that messy stuff.
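For the curious, the arithmetic behind that combined figure can be reproduced in a few lines of base R. This is only a sketch of the standard root-sum-of-squares approximation for the margin of error of a sum, applied to the four block group estimates printed earlier. It is not the package's internal code, and the variable names are made up for illustration.

```r
# Estimates and 90% margins of error for the four block groups shown above
est <- c(2681, 952, 827, 1821)
moe <- c(319, 213, 171, 236)

# The combined estimate is the plain sum of the parts
combined.est <- sum(est)

# Approximate margin of error of a sum:
# square root of the sum of the squared margins of error
combined.moe <- sqrt(sum(moe^2))

cat(combined.est, "+/-", combined.moe, "\n")
```

The result agrees with the 6281 +/- 481.73 that acs.fetch() reported for the combined geography.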
You can even create interesting nested geo.sets, where some of the lower levels are combined, like this:
> combine.term(my.bgs)="Select Blockgroups, Tracts 387100 and 387201"
> more.bgs=c(my.bgs,
    geo.make(state="MA", county="Middlesex", tract=370300, block.group=1:2, check=T),
    geo.make(state="MA", county="Middlesex", tract=370400, block.group=1:3, combine=T, combine.term="Select Blockgroups, Tract 3703", check=T))
Testing geography item 1: Tract 370300, Blockgroup 1, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 370300, Blockgroup 2, Middlesex County, Massachusetts .... OK.
Testing geography item 1: Tract 370400, Blockgroup 1, Middlesex County, Massachusetts .... OK.
Testing geography item 2: Tract 370400, Blockgroup 2, Middlesex County, Massachusetts .... OK.
Testing geography item 3: Tract 370400, Blockgroup 3, Middlesex County, Massachusetts .... OK.
> more.total.pop=acs.fetch(geo=more.bgs, table.number="B01003")
> more.total.pop
ACS DATA:
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
                                             B01003_001
Select Blockgroups, Tracts 387100 and 387201 6281 +/- 481.733328720362
Block Group 1 315 +/- 132
Block Group 2 1460 +/- 358
Select Blockgroups, Tract 3703 2594 +/- 487.719181496894
In closing: I hope this helps, and be sure to contact me if you have other questions/problems about using the package.
[1] Note that tracts are often referred to in a strange “four-digit+decimal extension” shorthand, so “tract 387201” may also be known as “tract 3872.01”. When working with this package, always use six-digit tract numbers without the decimal point. If the tract number seems to be only four digits long, add two extra “trailing” zeroes at the end.
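If you have a lot of tract numbers in the shorthand form, the conversion is easy to script. The helper below is hypothetical (it is not part of the acs package, and the name tract.to.six is made up); it simply sketches the rule described in the note.

```r
# Hypothetical helper (not part of the acs package): convert the
# "four-digit + decimal extension" shorthand into a six-digit tract number
tract.to.six <- function(x) {
  # Multiplying by 100 shifts the two decimal digits into place;
  # round() guards against floating-point error before converting
  as.integer(round(as.numeric(x) * 100))
}

print(tract.to.six("3872.01"))  # tract with a decimal extension
print(tract.to.six("3871"))     # four-digit tract: gains two trailing zeroes
```

Only pass it the shorthand form: a tract number that is already six digits long would be multiplied again and come out wrong.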
It’s been a while since I last updated the acs.R package, but as noted here, I’ll be using CityState to provide updates and test-versions of the package prior to uploading to CRAN. I’m happy to report that we now have a near-final package of version 1.0.
The most significant improvements to the package (beyond those mentioned previously) are the following:
- The package is now capable of downloading data directly from the new Census American Community Survey API and importing into R (with proper statistical treatment of estimates and error, variable and geographic relabeling, and more), all through a single “acs.fetch()” function;
- The package includes a new “geo.make()” function to allow users to create their own custom geographies for organizing and downloading data; and
- The package provides two special “lookup” tools to help users filter through all the existing Census geographies (with the “geo.lookup()” function) and tables (with the “acs.lookup()” function) to find exactly what they want. These functions return new R “lookup” objects which can be saved, manipulated, and passed to acs.fetch() for downloading data.
I want to thank the very kind folks at the Puget Sound Regional Council, who have been supporting the development of this package (in exchange for some special attention to scripts and functions they really want to include for themselves and their member communities). They continue to provide excellent help and advice “from the trenches” as we refine the package.
If you’re interested in trying out the new version, you can download it below, along with a brief set of “Introductory Notes” written for the team at PSRC. (Users may also want to check out the manual for the previous version of the package and this article from 2011 on the package.)
Previous missions have demonstrated a whole lot of things you can do with census data. Here are a few of the problems you can get yourself into.
Census Geography Pitfalls
- Unequal tracts: Despite what you may think, not all Census tracts (and their composite block groups and blocks) are created equal. The Census Bureau tries to structure their geography so that all tracts will be approximately the same size (about 4,000 people), but in practice there is a pretty large range (between 1,500 and 8,000 people per tract). If you are looking at raw numbers (counts of any sort), be sure to think about the overall population—it’s the denominator you’ll need to put the figures in perspective; conversely, if you are looking at percentages, remember that a small percentage of a large tract could actually be more people than a large percentage of a very small one.
- Overlapping districts: Unfortunately, although the formal “pyramid” of Census geography is well-structured—building from block to block group to tract and so on up—our political and cultural divisions are not always so straightforward: cities sometimes spread across county lines, metropolitan areas may even cross state lines, and legislative districts have become a gerrymandered mess that would drive any rational cartographer to drink. As a result, there may be times when Census geographers have been forced to choose between a strictly “nested” geography that ignores higher-order political elements, and one with intermediate levels that do not fit neatly within each other.
- Confusing or ambiguous place names: Partially related to the previous point, and partially due to the general orneriness of the culture (or perhaps the species), there are often times when the same name will occur in multiple places in Census geography. The name “New York” refers to a state, a metro region, a city, a county (strangely, one that is smaller than its city), and even an avenue in Atlantic City (or on the Monopoly Board). Luckily, once you get down to the level of census tracts and below, you enter the realm of pure-and-orderly numbers, and can largely avoid this trap (they are even sometimes referred to as “logical record numbers,” or LOGRECNO), although it’s a lot less fun to say “AR census tract 9803 block group 3” when you could be saying “Goobertown, Arkansas”.
- Changing boundaries: Occasionally the Census Bureau needs to redraw the lines for some particular location—perhaps a city has annexed new land, or a large county has been split by an act of the state legislature. In these situations, you may see a sharp rise (or drop) in the counts from one Census to the next. For example, according to the 2000 census, the city of Bigfork, Montana had 1,421 people; in the 2010 census, this figure had grown to 4,270—a seeming tripling of the population. However, upon closer scrutiny, it turns out that most of this increase was the result of a change in census boundaries. (These situations may also exacerbate some of the previous problems.)
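To make the first pitfall (unequal tracts) concrete, here is a tiny sketch with made-up numbers. The two tracts and their percentages are invented for illustration and are not real Census figures; the populations are simply the low and high ends of the range mentioned above.

```r
# Made-up tracts for illustration only (not real Census figures)
tracts <- data.frame(
  name       = c("Large tract", "Small tract"),
  population = c(8000, 1500),
  pct        = c(5, 20)   # percent of residents in some group
)

# Convert each percentage back into a head count
tracts$count <- tracts$population * tracts$pct / 100

print(tracts)
```

Here the tract with the much smaller percentage still contains more people in the group, which is why it helps to keep the denominator in view.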
I’m happy to report that I’ll be giving a paper on my acs package at the 8th annual useR! conference, coming June 12-15 to Vanderbilt University in Nashville, TN. The paper is titled “Estimates with Errors and Errors with Estimates: Using the R acs Package for Analysis of American Community Survey Data.” Here’s the abstract:
“Estimates with Errors and Errors with Estimates: Using the R acs Package for Analysis of American Community Survey Data”
Ezra Haber Glenn
Over the past decade, the U.S. Census Bureau has implemented the American Community Survey (ACS) as a replacement for its traditional decennial “long-form” survey. Last year, for the first time ever, ACS data was made available at the census tract and block group level for the entire nation, representing geographies small enough to be useful to local planners; in the future these estimates will be updated on a yearly basis, providing much more current data than was ever available in the past. Although the ACS represents a bold strategy with great promise for government planners, policy-makers, and other advocates working at the neighborhood scale, it will require them to become comfortable with statistical techniques and concerns that they have traditionally been able to avoid. To help with this challenge the author has been working with local-level planners to determine the most common problems associated with using ACS data, and has implemented these functions as a package in R. The package (currently hosted on CRAN in version 0.8) defines a new “acs” class object (containing estimates, standard errors, and metadata for tables from the ACS), with methods to deal appropriately with common tasks (e.g., combining subgroups or geographies, mathematical operations on estimates, tests of significance, plots of confidence intervals, etc.). This paper will present both the use and the internal structure of the package, with discussion of additional lines of development.
Hope to see you all there!
In a previous mission (see Finding Obama in the smallest Census geography) we delved down to see what data was available at the level of individual blocks. Unfortunately, as we noted there, the Census doesn’t provide a whole lot of useful data at the block-level, since the results will exclude sample data from the SF3 “long form” (or, post-2000, the American Community Survey). If we want to know more about a neighborhood we will need to think in slightly larger geographies, and seek data at the tract-level or higher.
For this mission, we’ll be zooming in to the Park Slope neighborhood of Brooklyn, and gathering data on income, race, education, and the breakdown of owners and renters for a single census tract. Since it’s often helpful to be able to view data like this in the context of the surrounding neighborhood, subsequent missions will explore ways to make comparisons with this sort of data, either to other tracts or to larger geographies.
But for starters, our target: although defining the exact edges of a neighborhood is never easy – especially ones in dense, diverse areas, where even residents disagree over terminology and the continual processes of gentrification, urban decline, migration, and other demographic shifts continually redefine the categories – most observers would agree that the neighborhood extends roughly north and west from Bartel Pritchard Square, at the lower corner of Prospect Park, with both 15th Street and Prospect Park itself providing something of an “edge.” Since edges are often exciting places to observe change, we will select an address along 15th Street, near the corner of 5th Avenue.
As noted elsewhere here on CityState, I’ve developed a package for working with data from the American Community Survey in the R statistical computing language. The most recent official version of the package is 0.8, which can be found on CRAN. Since the package is still in active development, I’ve decided to provide development snapshots here, for users who are looking to work with the latest code as I develop it.
I’m hoping that the next major release will be version 1.0, due out sometime this spring. As I work towards that, here is version 0.8.1, which can be considered the first “snapshot” headed toward this release.
To install, simply download, start R, and type:
> install.packages("path/to/file/acs_0.8.1.tar.gz", repos=NULL, type="source")
> library(acs)
- read.acs can now accept either a csv or a zip file downloaded directly from the FactFinder site, and it does a much better job (a) guessing how many rows to skip, (b) figuring out how to generate intelligent variable names for the columns, and (c) dealing with arcane non-numeric symbols used by FactFinder for some estimates and margins of error.
- plot now includes a true.min= option, which allows you to specify whether you want to allow error bars to span into negative values (true.min=T, the default), or to bound them at zero (true.min=F – or some other numeric value). This seemed necessary because it looks silly to say “The number of children who speak Spanish in this tract is 15, plus or minus 80…” At the same time, if the variable turns out to be something like the difference between the income of Males and the income of Females in the geography, a negative value may make a lot of sense, and should be plotted as such.
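To see what bounding at zero means for the “15, plus or minus 80” case, here is a base-R sketch. It only illustrates the arithmetic of truncating the lower end of an error bar; it is not the plot method’s actual code, and the variable names are made up.

```r
# Illustration only: how bounding at zero changes the lower end of an error bar
est <- 15   # estimated count
moe <- 80   # margin of error

lower.raw     <- est - moe           # unbounded: can go negative
lower.bounded <- max(est - moe, 0)   # truncated at zero
upper         <- est + moe

print(c(raw = lower.raw, bounded = lower.bounded, upper = upper))
```

For a variable that can legitimately be negative (such as a difference between two incomes), you would keep the unbounded lower end instead.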
I’ve just released a new version of my acs package for working with the U.S. Census American Community Survey data in R, available on CRAN. The current version 0.8 includes all the original version 0.6 code, plus a whole lot more features and fixes. Some highlights:
- An improved read.acs function for importing data downloaded from the Census American FactFinder site.
- cbind functions to help create larger acs objects from smaller ones.
- A new sum method to aggregate rows or columns of ACS data, dealing correctly with both estimates and standard errors.
- A new apply method to allow users to apply virtually any function to each row or column of an acs data object.
- A snazzy new plot method, capable of plotting both density plots (for estimates of a single geography and variable) and multiple estimates with error bars (for estimates of the same variable over multiple geographies, or vice versa). See sample plots below.
- New functions to deal with adjusting the nominal values of currency from different years for the purpose of comparing between one survey and another. (See currency.year in the documentation.)
- A new tract-level dataset from the ACS for Lawrence, MA, with dollar value currency estimates (useful to show off the aforementioned new currency conversion functions).
- A new prompt method to serve as a helper function when changing geographic rownames or variable column names.
- Improved documentation on the acs class and all of these various new functions and methods, with examples.
With this package, once you’ve found and downloaded your data from FactFinder, you can read it into R with a single command, aggregate multiple tracts into a neighborhood with another, generate a table of estimates and confidence intervals for your neighborhood with a third command, and produce a print-ready plot of your data (complete with error bars for the margins of error) with a fourth:
my.data=read.acs("some_data.csv")
my.neighborhood=apply(my.data, FUN="sum", MARGIN=1, agg.term="My.Neighborhood")
confint(my.neighborhood, conf.level=.95)
plot(my.neighborhood, col="blue", err.col="violet", pch=16)
Already this package has come a long way, in large part thanks to the input of R users, so please check it out and let me know what you think — and how I can make it better.
On March 14, 2012, I’ll be working again with the Mel King Institute for Community Building to offer a half-day training in “Making Use of Local Census Data.” We designed the class for planners and community development practitioners working at the neighborhood-scale, and we’ll talk about ways to access the latest data from the U.S. Census American Community Survey (and how to use it responsibly).
Unlike earlier versions of the training, we’ll be working exclusively with the New American Factfinder (previously discussed in this post) to download data. We’ve also moved the class to one of MIT’s computer labs, and added an hour at the end as a “clinic,” so participants will get some hands-on time to dig up data on their own community.
For more information about the Mel King Institute, or to register for the training, see this page. See you there!
Update: a few weeks ago I posted this article calling attention to yet more delays in the unrolling of the long-awaited Supplemental Poverty Measure from the Census Bureau. As it turns out, they have recently announced that this new index is in fact ready for prime time (see, for example, this press release).
More analysis and thoughts later after I am able to take a look, but I wanted to file something quick just to acknowledge the effort to get something out there.