If you have been using the acs package to create custom geo.sets which combine existing census geographies (i.e., geo.sets with “combine=T”), please read on.
It has come to my attention that some users working with custom combined geo.sets may be introducing errors into their data if they attempt to combine census variables dealing with medians, percentages, or similar derived summary data.
Most of the data available through the package (via the ACS and the Decennial Census APIs) comes in the form of raw counts – numbers of people, households, commuters, etc. When a geo.set includes multiple elements and “combine=T”, the package will fetch the data, and then combine the geographies by (1) adding the estimates and (2) calculating the standard errors of these aggregate estimates. This procedure is absolutely proper for count-data, but it is not appropriate for median incomes (or median ages, or mean incomes, or mean travel times, or derived percentages, etc.).
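To make the two steps concrete, here is a minimal sketch in base R of what combining count data amounts to. The tract counts and standard errors are made up, and the root-sum-of-squares rule for the standard error is an assumption (a common approximation for independent estimates); the package's own handling may differ in its details.

```r
# Hypothetical tract-level counts (e.g., households) and their standard
# errors; all numbers are invented for illustration.
tract.estimates <- c(1200, 950, 1430)
tract.se <- c(85, 70, 110)

# (1) Counts are additive, so the combined estimate is simply the sum.
combined.estimate <- sum(tract.estimates)

# (2) Assumed approximation: treat the tract estimates as independent and
# take the square root of the summed squared standard errors.
combined.se <- sqrt(sum(tract.se^2))

combined.estimate  # 3580
combined.se        # roughly 155.6
```

Adding makes sense here because every household is counted exactly once. Nothing comparable holds for a median or a mean.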
For example, if you attempt to aggregate three tracts with median incomes of $25,000, $35,000, and $50,000 into a single neighborhood, acs.fetch will return a neighborhood with an “aggregate” median income of $110,000: wrong.
A quick demonstration:
> all.us=geo.make(state=fips.state[1:51,2], combine=T)
> median.income=acs.fetch(geography=all.us, table.number="B06011", endyear=2014, span=1)
> median.income
Try this and you’ll see that the country’s “median income” is $1,394,002…
In the package’s defense, there really isn’t a proper way to aggregate median incomes like this. Since medians – or means, or percentages – are derived from underlying data, they are really “summaries,” and without at least some more info about the underlying data you can’t always properly combine them. So, in the example above, we know that the median income for the neighborhood is somewhere between $25,000 and $50,000, but not really where. We can take a median of the medians ($35,000), or a mean of the medians ($110,000 / 3 = $36,667), but these are just guesses as well: without knowing how many observations there were in each tract and what their incomes were, we simply can’t calculate it. (This is why I didn’t think it would be an issue – but now I’m thinking at least a stronger warning somewhere would be a good idea, hence this post and some new language I’ll add to the guidance docs.)
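The point can be checked in a few lines of base R. This is only a sketch: the three tract medians come from the example above, but the household incomes in the two hypothetical neighborhoods (mostly.low and mostly.high) are invented, chosen so that both have exactly the same tract medians.

```r
# Medians of the three tracts from the example above.
tract.medians <- c(25000, 35000, 50000)

summed <- sum(tract.medians)            # 110000: what adding the estimates gives
med.of.meds <- median(tract.medians)    # 35000: median of the medians
mean.of.meds <- mean(tract.medians)     # 36666.67: mean of the medians

# Two hypothetical neighborhoods; each list element holds one tract's
# household incomes (all invented).
mostly.low <- list(rep(25000, 7), 35000, 50000)
mostly.high <- list(25000, 35000, rep(50000, 7))

# Both neighborhoods reproduce the same three tract medians...
low.tract.medians <- sapply(mostly.low, median)
high.tract.medians <- sapply(mostly.high, median)

# ...but pooling the underlying incomes gives different neighborhood medians.
low.pooled <- median(unlist(mostly.low))    # 25000
high.pooled <- median(unlist(mostly.high))  # 50000
```

The tract medians are identical in both cases, yet the pooled median lands at either end of the possible range, so no formula that sees only the three medians can recover it.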
Please note that this issue only occurs when users create geo.sets with multiple elements and then combine them (by setting “combine=T” in the geo.set) before passing them to acs.fetch to download data. As long as you are not combining multiple tracts, counties, blockgroups, etc., the package is still fine for fetching and working with median incomes, percentages, and the like. (But be careful: your own code may slip in similar mistakes if you combine this sort of data yourself.)
Please pass this info on to your colleagues who may be using the package, and be sure to check your code if it (a) deals with combined geo.sets and (b) downloads non-count data. If you have any questions or concerns, by all means contact me (mailto:firstname.lastname@example.org) and I’ll be happy to discuss more.
Thanks, and sorry if this wasn’t clear.