Using acs.R for a t-test

Posted by Ezra Glenn on November 06, 2013

I’ve been asked to provide a very quick example of using the acs.R package to conduct a t-test of significance when comparing ACS data from two different geographical areas—so here goes: a quick example.

Let’s look at the number of school-age children in different towns on and around Martha’s Vineyard. Dukes County has seven towns (the six on the island itself, plus Gosnold on the nearby Elizabeth Islands), and luckily (like all New England towns) each is represented as a separate “county subdivision.” So we can get some quick age demographics for all seven of them in just a few commands:

> library(acs)      # geo.make(), acs.fetch(), and the acs class
> library(stringr)  # str_replace()
> towns=geo.make(state="MA", county="Dukes", county.subdivision="*")
> towns.pop.sex=acs.fetch(geography=towns, table.number="B01001",
  col.name="pretty")
> # one more step just to shorten geography -- just town names
> geography(towns.pop.sex)[,1]=str_replace(
      geography(towns.pop.sex)[,1]," town.*","")

If you look at the column names using acs.colnames(towns.pop.sex), you will see that we are most interested in columns 4-6 (male children age 5-17) and 28-30 (female children, same ages). We also might need column 1 (the total population for the town), for the purpose of calculating percentages later on.

We can aggregate the total number of children in each town using apply and sum; the acs package takes care of all the standard errors for us:

> kids.by.town=apply(X=towns.pop.sex[,c(4:6,28:30)], 
     FUN=sum, MARGIN=2, agg.term="kids")
> kids.by.town
ACS DATA: 
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
             kids               
Aquinnah     138 +/- 81.2342292386652
Chilmark     144 +/- 56.0178542966437
Edgartown    379 +/- 146.741268905513
Gosnold      3 +/- 212.464114617034  
Oak Bluffs   896 +/- 245.293701509028
Tisbury      565 +/- 182.148291235466
West Tisbury 320 +/- 105.437185091409
> 
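
For anyone wondering what “taking care of the standard errors” involves: the Census Bureau’s published approximation for the margin of error of a sum is the square root of the sum of the squared component margins. Here is a minimal base-R sketch of that formula. The function name moe_of_sum and the component values are made up for illustration (they are not the real B01001 margins), and I am only assuming the package does something along these lines internally.

```r
# Root-sum-of-squares approximation for the margin of error of a sum.
# moe_of_sum is a hypothetical helper, not part of the acs package.
moe_of_sum <- function(moes) sqrt(sum(moes^2))

# Made-up 90% margins of error for six age/sex cells (NOT real ACS values)
component_moes <- c(30, 40, 25, 35, 20, 45)
total_moe <- moe_of_sum(component_moes)

# Convert a 90% margin of error to a standard error (z = 1.645)
total_se <- total_moe / 1.645
```

Note that the combined margin is larger than any single component but smaller than their simple total, which is why summed estimates are often relatively more reliable than the cells they are built from.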

Looking at some of those numbers, it appears that some towns (say, Oak Bluffs or Tisbury) have a lot of school kids (relatively speaking, at least; keep in mind these are all small towns), whereas others (Gosnold or Aquinnah) have far fewer, in absolute numbers. But, asks the statistician (or the taxpayer): is this difference statistically significant? Is it possible, for example, that Tisbury might actually have fewer children than, say, Edgartown? (This sort of question might come up if we were allocating funds for school construction or expansion, for example, although in reality with such small numbers we’d just do an actual census.) This calls for a t-test.

Now, despite its impressive-sounding name, a t-test is really nothing more than a procedure to compare the magnitude of a difference in two estimates, relative to the standard error of the difference, measured against the benchmark of a t-table or a normal curve. If you want to compute this yourself, you can do it with just a calculator: I’d recommend you look at Appendix 4 of the excellent Census Compass Guide for Users of ACS Data for State and Local Governments, which includes the formulas for making these sorts of comparisons. But why bother? The acs package takes care of most of this for you anytime you subtract one estimate from another.
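
For the curious, here is roughly what the calculator version looks like in base R, using the Tisbury and Edgartown figures from the printout above. This is a sketch under two assumptions: that the two estimates are independent, and that the printed 90% margins of error convert to standard errors with the usual factor of 1.645. The variable names are mine, not the package’s.

```r
# Estimates and 90% margins of error copied from the kids.by.town printout
est_tisbury   <- 565; moe_tisbury   <- 182.148291235466
est_edgartown <- 379; moe_edgartown <- 146.741268905513

z90 <- 1.645  # factor used for 90% margins of error

# Margin of error -> standard error
se_tisbury   <- moe_tisbury / z90
se_edgartown <- moe_edgartown / z90

# Standard error of a difference between two independent estimates
se_diff <- sqrt(se_tisbury^2 + se_edgartown^2)

# Test statistic: the difference measured in units of its standard error
t_stat <- (est_tisbury - est_edgartown) / se_diff
t_stat  # about 1.31
```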

So, to compare the number of school age children in, say, Tisbury and Edgartown, we simply subtract one from the other:

> diff.tisb.edgar=kids.by.town[6]-kids.by.town[3]
> estimate(diff.tisb.edgar)/standard.error(diff.tisb.edgar)
                            kids
( Tisbury - Edgartown ) 1.308102
> 

This last value (1.308102) is the t-statistic for this difference. By consulting a table of critical values, we’d learn that the difference between the estimated number of children in these two towns is not statistically significant at the 95% confidence level (1.308 falls short of the two-tailed critical value of 1.96): that is, from these data, it is conceivable that the two towns have the same number of children, or even that Edgartown has more. (If we didn’t happen to have a table of critical values, we could use R’s built-in pnorm and qnorm functions to evaluate our results.)
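
As a quick sketch of that last suggestion, here is how qnorm and pnorm can stand in for the printed table. It assumes a two-tailed test against the normal curve, and it simply types in the statistic reported above rather than pulling it from the acs object.

```r
t_stat <- 1.308102                       # Tisbury - Edgartown, from above

crit_95 <- qnorm(0.975)                  # two-tailed critical value, about 1.96
p_value <- 2 * (1 - pnorm(abs(t_stat)))  # two-tailed p-value, roughly 0.19

significant <- abs(t_stat) > crit_95     # FALSE: not significant at 95%
```

A p-value of about 0.19 means a gap this large would turn up fairly often from sampling error alone, which matches the conclusion from the table.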

We could also use these sums, along with the total population column from our original dataset, to ask a more complicated question: let’s turn the numbers of children into the percentages of children in each town:

> kids.by.town.perc=divide.acs(num=kids.by.town, 
  den=towns.pop.sex[,1], method="proportion")
Warning message:
In .acs.divider(num = numerator, den = denominator, proportion = T,  :
  ** using formula for PROPORTIONS, which assumes that numerator 
     is a SUBSET of denominator **
> kids.by.town.perc
ACS DATA: 
 2007 -- 2011 ;
  Estimates w/90% confidence intervals;
  for different intervals, see confint()
             ( kids / Sex by Age:  Total:  )          
Aquinnah     0.296137339055794 +/- 0.123876862068712  
Chilmark     0.179775280898876 +/- 0.0514611383164102 
Edgartown    0.0939514129895885 +/- 0.0363744426332717
Gosnold      0.0163934426229508 +/- 1.16097871550958  
Oak Bluffs   0.201393571589121 +/- 0.0551247386992141 
Tisbury      0.144353602452734 +/- 0.046532896606136  
West Tisbury 0.127693535514765 +/- 0.039617555642514  
> 

Since this division is a “proportion” type operation, be sure to use the special divide.acs function, and not just “kids.by.town/towns.pop.sex[,1]”, which assumes “ratio” type division, where the numerator is not part of the denominator. (Try it and see the warning.)
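
To see why the distinction matters, here is a sketch of the two standard-error approximations the Census Bureau publishes for derived estimates. The function names se_ratio and se_proportion are hypothetical helpers of my own, the inputs are invented, and I am not claiming this is line-for-line what divide.acs does internally.

```r
# x, y: numerator and denominator estimates; se_x, se_y: their standard errors.

# Ratio: numerator is NOT a subset of the denominator (errors add)
se_ratio <- function(x, se_x, y, se_y) {
  sqrt(se_x^2 + (x / y)^2 * se_y^2) / y
}

# Proportion: numerator IS a subset of the denominator (errors partly cancel)
se_proportion <- function(x, se_x, y, se_y) {
  under_root <- se_x^2 - (x / y)^2 * se_y^2
  # If the term under the root is negative, the Census guidance is to
  # fall back on the ratio formula instead.
  if (under_root < 0) return(se_ratio(x, se_x, y, se_y))
  sqrt(under_root) / y
}

# Hypothetical inputs, not taken from the tables in this post
se_prop_example  <- se_proportion(x = 150, se_x = 40, y = 1000, se_y = 100)
se_ratio_example <- se_ratio(x = 150, se_x = 40, y = 1000, se_y = 100)
```

With the same inputs, the proportion formula gives the smaller standard error, so using plain ratio division on a true proportion would overstate the uncertainty.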

To run a t-test on this, we again simply subtract one estimate from another, and divide by the standard error of the difference—it really doesn’t matter whether the estimates we are comparing are straight from the ACS tables, or the result of our summing across columns or geographies, or even more complicated operations (like making percentages). So, for example: the ACS estimates indicate that over 29% of Aquinnah’s population is school-age children; but let’s see whether we are really confident that this figure for Aquinnah is higher than the number for Oak Bluffs:

> diff.aqui.oak.perc=kids.by.town.perc[1]-kids.by.town.perc[5]
> estimate(diff.aqui.oak.perc)/standard.error(diff.aqui.oak.perc)
                          ( kids / Sex by Age:  Total:  )
( Aquinnah - Oak Bluffs )                         1.14946
>  

Again: lower than our required critical score, so not significant! (We need a score above 1.96, in absolute value, to be significant at the 95% level.)

But before you give up hope entirely on the ACS, keep in mind that these are small towns, with large margins of error. When working with larger geographies—big cities, entire counties—you are much more likely to see statistical significance, even in small differences in estimates.
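
A rough way to see this: hold the Aquinnah and Oak Bluffs difference fixed and imagine the underlying sample growing. The sketch below leans on the textbook simplification that a standard error shrinks with the square root of the sample size, which is only an approximation for a complex survey like the ACS; the multipliers are arbitrary.

```r
# Difference in proportions and its t-statistic, from the output above
diff_est <- 0.296137339055794 - 0.201393571589121
t_now    <- 1.14946
se_now   <- diff_est / t_now   # implied standard error of the difference

# Hypothetical sample-size multipliers (1 = the sample we actually have)
size_mult <- c(1, 2, 4, 8)

# Assumption: standard error scales with 1 / sqrt(sample size)
t_scaled <- diff_est / (se_now / sqrt(size_mult))

data.frame(size_mult, t_scaled, significant = t_scaled > qnorm(0.975))
```

Under that assumption, the same difference stays below 1.96 at twice the sample size but clears it at four times.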

In closing, I should note that we hope to include a t-test() function in the next version of the package to make all this even easier—stay tuned!
