Monthly Archives: July 2013

Using acs.R to create choropleth maps

Posted by Ezra Glenn on July 15, 2013
Census, Code, Maps / No Comments

Some time ago, FlowingData issued a challenge to create choropleth maps with open source tools, resulting in some nice little scripts using R’s ggplot2 and maps packages — all nicely covered on the Revolution Analytics blog. Some users have recently asked whether the acs package can be used to create similar maps, and the answer (of course) is yes. Here’s how.

For starters, to manage expectations, keep in mind that the map_data() function, which actually generates the geographic data you need to plot maps, does not currently provide boundary data for very many Census geographies — so sadly a lot of the new expanded support for various Census summary levels built into the acs package can’t be used. What you can do, however, is very easily plot state and county maps, which is what we’ll showcase below.

Secondly, the statistician in me feels compelled to point out one problem with using the acs package for these sorts of maps: choropleth maps are really fun to use to quickly show off the geographic distribution of some key statistic, but there is a price: in order to plot the data, we are really limited to a single number for each polygon. As a result, we can plot estimates from the ACS, but only if we are willing to ignore the margins of errors, and pretend that the data is less “fuzzy” than it really is. Given that the whole point of the acs package was to call attention to standard errors, and provide tools to work with them, it seems sort of counter-intuitive to use the package in this way — but given the ease of downloading ACS through the acs.fetch() function, it may still be a good (although slightly irresponsible) use of the package.

Given all that, I’ll step back down off my high horse, and show you how to make some maps, drawing heavily on the scripts provided by Hadley Wickham for using the ggplot2 package. For this example, let’s look at the percentage of people who take public transportation to work in the U.S., by county.

For starters, we’ll need to install and load the required packages:

> install.packages("acs")
> install.packages("ggplot2")
> install.packages("maps")
> library(acs)
> library(ggplot2)
> library(maps)

If you haven’t already obtained and installed an API key from the Census, you’ll need to do that as well — see ?api.key.install or check section 3.3 in the user guide.

Next, we use the map_data function to create some map boundary files to use later.

# load the boundary data for all counties
> county.df=map_data("county")
# rename fields for later merge
> names(county.df)[5:6]=c("state","county")
> state.df=map_data("state")

Turning to the acs package, we create a new geo.set consisting of all the tracts for all the state, and in a single call to acs.fetch download Census table B08301, which contains data on Mean of Transportation to Work. (If we didn’t know the table number we needed, we could use acs.lookup() to search for likely candidates, or even pass search strings directly to the acs.fetch() function.)

# wow! a single geo.set to hold all the counties...?
> us.county=geo.make(state="*", county="*")
# .. and a single command to fetch the data...!
> us.transport=acs.fetch(geography=us.county, 
     table.number="B08301", col.names="pretty")

The data we’ve fetched includes estimates of raw numbers of workers, not percentages, so we’ll need to do some division. Since we are interested in a proportion (and not a ratio), we to use the divide.acs function, not just “/”.1 For our dataset, the 10th column is the number of workers taking public transportation to work, and the 1st column is the total number of workers in the county. (See acs.colnames(us.transport) to verify this.) After we complete the division, we extract the estimates into a new data.frame, along with state and county names for each. (We need to do a little string manipulation to make these fields match with those from the map_data function above.)

> us.pub.trans=divide.acs(numerator=us.transport[,10], 
     denominator=us.transport[,1], method="proportion")
> pub.trans.est=data.frame(county=geography(us.pub.trans)[[1]], 
     percent.pub.trans=as.numeric(estimate(us.pub.trans)))
# this next step is all for Louisiana!
> pub.trans.est$county=gsub("Parish", "County", pub.trans.est$county)
# clean up county names and find the states
> pub.trans.est$state=tolower(gsub("^.*County, ", "", pub.trans.est$county))
> pub.trans.est$county=tolower(gsub(" County,.*", "", pub.trans.est$county))

Next, following Wickham’s script, we merge the boundaries with the data into a new data.frame (called choropleth) and reorder and recode it for out map levels.

> choropleth=merge(county.df, pub.trans.est, by=c("state","county"))
> choropleth=choropleth[order(choropleth$order), ]
> choropleth$pub.trans.rate.d=cut(choropleth$percent.pub.trans, 
     breaks=c(0,.01,.02,.03,.04,.05,.1,1), include.lowest=T)

And voila – a single call to ggplot and we have our map!

> ggplot(choropleth, aes(long, lat, group = group)) +
     geom_polygon(aes(fill = pub.trans.rate.d), colour = "white", size = 0.2) + 
     geom_polygon(data = state.df, colour = "white", fill = NA) +
     scale_fill_brewer(palette = "Purples")

./acs_scripts/choropleth_county_pub_trans.jpg

(For those who would like to see the entire script in all its efficient 14-line glory, I’ve pasted it below for easy cutting and pasting.)

install.packages("acs")
install.packages("ggplot2")
install.packages("maps")
library(acs)
library(ggplot2)
library(maps)
county.df=map_data("county")
names(county.df)[5:6]=c("state","county")
state.df=map_data("state")
us.county=geo.make(state="*", county="*")
us.transport=acs.fetch(geography=us.county, 
     table.number="B08301", col.names="pretty")
us.pub.trans=divide.acs(numerator=us.transport[,10], 
     denominator=us.transport[,1], method="proportion")
pub.trans.est=data.frame(county=geography(us.pub.trans)[[1]], 
     percent.pub.trans=as.numeric(estimate(us.pub.trans)))
pub.trans.est$county=gsub("Parish", "County", pub.trans.est$county)
pub.trans.est$state=tolower(gsub("^.*County, ", "", pub.trans.est$county))
pub.trans.est$county=tolower(gsub(" County,.*", "", pub.trans.est$county))
choropleth=merge(county.df, pub.trans.est, by=c("state","county"))
choropleth=choropleth[order(choropleth$order), ]
choropleth$pub.trans.rate.d=cut(choropleth$percent.pub.trans, 
     breaks=c(0,.01,.02,.03,.04,.05,.1,1), include.lowest=T)
ggplot(choropleth, aes(long, lat, group = group)) +
     geom_polygon(aes(fill = pub.trans.rate.d), colour = "white", size = 0.2) + 
     geom_polygon(data = state.df, colour = "white", fill = NA) +
     scale_fill_brewer(palette = "Purples")

wpid-choropleth\_county\_pub\_trans1.jpg

Footnotes:

1 Technically, since we are going to ignore the standard errors in our map, this could just be a standard division using “/”, but we might later want to look at the margins of error, etc. (For more on this issue, see ?divide.acs.)

Tags: , , , , ,

acs.R version 1.1: PUMAs and Zip Codes and MSAs, Oh My!

Posted by Ezra Glenn on July 14, 2013
Census, Code, Self-promotion / No Comments

Development continues on the acs package for R, with the latest update (version 1.1) now officially available on the CRAN repository. If you’ve already installed the package in the past, you can easily update with the update.packages() command; if you’ve never installed it, you can just as easily install it for the first time, by simply typing install.packages(“acs”). In either case, be sure to load the library after installing by typing library(acs), and install (or re-install) an API key with api.key.install() — see the documentation and the latest version of the acs user guide (which still references version 1.0).

Beyond improvements described in a previous post about version 1.0, the most significant change in the latest version is support for many more different combinations of census geography via the geo.make function. As described in the manual and on-line help, users can now specify options to create user-defined geographies composed of combinations of states, counties, county subdivisions, tracts, places, blockgroups (all available in the previous version), plus many more: public use microdata areas (PUMAs), metropolitan statistical areas (MSAs), combined statistical areas (CSAs), zip code tabulation areas, census regions and divisions, congressional district and state legislative districts (both upper and lower chambers), American Indian Areas, state school districts (of various types), New England County and Town Areas (NECTAs), and census urban areas. These geographies can be combined to create 25 different census summary levels, which can then even be bundled together to make even more complex geo.sets.

Once created and saved, these new user-defined geo.sets can be fed into the existing acs.fetch function to immediately download data from the ACS for these areas, combining them as desired in the process (and handling all those pesky estimates and margins of error in statistically-appropriate ways.)

We encourage you to update to the latest version and begin to explore the full power of the census data now available through the Census American Community Survey API. (And be sure to subscribe to the acs.R user group mailing list to be informed of future improvements.

Tags: , , , , ,

acs.R example: downloading all the tracts in a county or state

Posted by Ezra Glenn on July 03, 2013
Census, Code / No Comments

An acs.R user asks:

 
> How do I use acs to download all the census tracts? is there
> some handy command to do that?

Here’s some help:

All the tracts in a single county

You can’t automatically download all the tracts for the whole country (or even for an entire state) in a single step (but see below for ways to do this). If you just need all the tracts in a single county, it’s really simple — just use the “*” wildcard for the tract number when creating your geo.set.

The example below creates a geo.set for all the tracts in Middlesex County, Massachusetts, and then downloads data from ACS table B01003 on Total Population for them.

> my.tracts=geo.make(state="MA", county="Middlesex", tract="*") 
> acs.fetch(geography=my.tracts, table.number="B01003")

All the tracts in a state

If you happen to have a vector list of the names (or FIPS codes) of all the counties in a given state (or the ones you want), you could do something like this to get all the tracts in each of them:

> all.tracts=geo.make(state="MA", county=list.of.counties, 
  tract="*")
> acs.fetch(geography=all.tracts, table.number="B01003")

As an added bonus, if you don’t happen to have a list of counties, but want to use the package to get one, you could do something like this:

> mass=acs.fetch(geography=geo.make(state=25, county="*"), 
  table.number="B01003")

#  mass is now a new acs object with data for each county in
#  Massachusetts.  The "geography" function returns a dataframe of the
#  geographic metadata, which includes FIPS codes as the third
#  column.  So you can use it like this:

> all.tracts=geo.make(state="MA", 
  county=as.numeric(geography(mass)[[3]]), 
  tract="*", check=T)
> acs.fetch(geography=all.tracts, table.number="B01003")

All the tracts in the entire country

In theory, you could even use this to get all the tracts from all the 3,225 counties in the country:

> all.counties=acs.fetch(geography=geo.make(state="*", county="*"),
  table.number="B01003")
> all.tracts=geo.make(state=as.numeric(geography(all.counties)[[2]]),,
  county=as.numeric(geography(all.counties)[[3]]), tract="*", check=T)

Unfortunately (or perhaps fortunately), this is just too much for R to download without changing some of the internal variables that limit this sort of thing — if you try, R will complain with “Error: evaluation nested too deeply: infinite recursion…” To prove to yourself that it works, you could limit the number of counties to just the first 250, and try that — it will get you from Autauga County, Alabama to Bent County, Colorado.

> some.counties=all.counties[1:250]
> some.tracts=geo.make(state=as.numeric(geography(some.counties)[[2]]), 
  county=as.numeric(geography(some.counties)[[3]]), tract="*", check=T)
> lots.of.data=acs.fetch(geography=some.tracts, table.number="B01003")

This is really a lot of data — on my machine, this took about 18 seconds, resulting in a new acs object containing population data on 11,872 different tracts. I haven’t checked to see what the upper limits are, but I imagine it wouldn’t take much to figure out a way to get tract-level data from all 3,225 counties. (But remember: with great power comes great responsibility — don’t be too rough on downloading stuff from the Census, even if it is free and easy.)

Using the built-in FIPS data

An alternative approach to these last two examples would be to use the FIPS datasets that we’ve built-in to the acs.R package. For example, the “fips.county” dataset includes the names of each county, by state. Feed this (or part of this) to your geo.make command and you can do all sorts of neat things.

> head(fips.county)
  State State.ANSI County.ANSI    County.Name ANSI.Cl
1    AL          1           1 Autauga County      H1
2    AL          1           3 Baldwin County      H1
3    AL          1           5 Barbour County      H1
4    AL          1           7    Bibb County      H1
5    AL          1           9  Blount County      H1
6    AL          1          11 Bullock County      H1
> 

So instead of the last block above, you could do something like this:

> random.counties=sample(x=3225,size=20, replace=F)
> some.tracts=geo.make(state=fips.county[random.counties,1], 
  county=fips.county[random.counties,3], tract="*", check=T)
Testing geography item 1: Tract *, Ponce Municipio, Puerto Rico .... OK.
Testing geography item 2: Tract *, Alleghany County, North Carolina .... OK.
Testing geography item 3: Tract *, Wayne County, Pennsylvania .... OK.
Testing geography item 4: Tract *, Comerio Municipio, Puerto Rico .... OK.
Testing geography item 5: Tract *, Lafayette County, Wisconsin .... OK.
Testing geography item 6: Tract *, Hartford County, Connecticut .... OK.
Testing geography item 7: Tract *, Real County, Texas .... OK.
Testing geography item 8: Tract *, Costilla County, Colorado .... OK.
Testing geography item 9: Tract *, Sarpy County, Nebraska .... OK.
Testing geography item 10: Tract *, McLennan County, Texas .... OK.
Testing geography item 11: Tract *, Donley County, Texas .... OK.
Testing geography item 12: Tract *, McIntosh County, Georgia .... OK.
Testing geography item 13: Tract *, Chilton County, Alabama .... OK.
Testing geography item 14: Tract *, Richland County, Montana .... OK.
Testing geography item 15: Tract *, Mitchell County, Kansas .... OK.
Testing geography item 16: Tract *, Muscogee County, Georgia .... OK.
Testing geography item 17: Tract *, Martin County, Indiana .... OK.
Testing geography item 18: Tract *, Naguabo Municipio, Puerto Rico .... OK.
Testing geography item 19: Tract *, Aguas Buenas Municipio, Puerto Rico .... OK.
Testing geography item 20: Tract *, Washington County, Arkansas .... OK.

> # you may get different counties in your random set
>
> acs.fetch(geography=some.tracts, table.number="B01003")

Which will return population data from all the tracts in a random set of 20 counties.

Tags: , , , ,