Appendix B

Supplemental Material for the Segregation Chapter

The 2000 Census

Decennial census data by zip code for the 2000 census were downloaded using the American FactFinder tool on the Census Bureau’s website. Zip codes for Puerto Rico, the Virgin Islands, and military installations were deleted, leaving data for 31,720 zip codes.

Calculation of the Zip Code Centile Scores

The centile score is based on the sum of standardized scores for a zip code’s percentage of adults with college educations and its median family income, weighted by population. This would be a simple matter of creating standardized scores and weighting the sum of those scores by population, except for a complication: The centile was to represent where individuals within a zip code fit within the national population of individuals, not where the zip code as a whole fit within the national set of zip codes.

Standardized Scores

Standardized scores provide a way to compare apples and oranges. For example, suppose you want to know who is taller relative to their reference groups: a 5’4” female gymnast or a 6’10” player in the National Basketball Association. You need to place the gymnast and the basketball player in the distribution of heights of their respective groups. The way you do that is by a simple arithmetic formula, z = (X – M)/S, where z is the standardized score, X is the value for the individual, M is the mean of the group, and S is the standard deviation of the group. Wikipedia has a straightforward explanation of what a standard deviation is.
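The formula can be made concrete with a short Python sketch. The group means and standard deviations used here are hypothetical placeholders chosen only to show the arithmetic; they are not measured values for gymnasts or NBA players.

```python
def z_score(x, mean, sd):
    """Standardized score z = (X - M) / S."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

# Heights in inches. The means and SDs below are HYPOTHETICAL illustrations.
gymnast_z = z_score(64, mean=61.0, sd=2.0)  # a 5'4" gymnast
player_z = z_score(82, mean=79.0, sd=3.0)   # a 6'10" NBA player
```

Under these made-up group statistics the gymnast's z is 1.5 and the basketball player's is 1.0, so the gymnast would be the taller of the two relative to her reference group, even though she is eighteen inches shorter in absolute terms.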

The creation of the zip code index variable, centile, began with a Stata database with a line for each zip code. The variables in the database were the percentage of persons in the zip code with a BA (pbabin), the median family income of persons in that zip code (medianinc) in thousands of 2010 dollars, and the population ages 25 and older in that zip code (pop25).

The database of the nation’s zip codes was expanded to one line for every ten persons ages 25 and older using Stata’s EXPAND command (so that, for example, a zip code with 1,000 persons ages 25 and older had 100 lines in the expanded database), resulting in a database of 18,216,898 lines. Each of these lines included the two indicators in the index, pbabin and medianinc, for the zip code in which the individual lived. Standardized scores were computed for both indicators. The index is the sum of the two standardized scores. The RANK function in Stata calculated ranks from low to high, with the highest ranks signifying the highest combined levels of education and income in that zip code. Thus the centile score consists of the rank of the index score divided by the total sample of the population ages 25 and older, then multiplied by 100 so that it ranges from 0 to 99.
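The steps in this paragraph can be sketched in Python. This is an illustration under stated assumptions, not the original Stata code: the function name, the three sample records, the use of the population standard deviation, and the rule that tied lines share their average rank are all assumptions. Zip codes that happen to have identical index values are not merged into one tie group in this sketch.

```python
from statistics import fmean, pstdev

def centile_scores(zips, scale=10):
    """Return {zip code: centile} for a list of zip-code records.

    Each record is a dict with keys 'zip', 'pbabin', 'medianinc', 'pop25'.
    """
    # One expanded line per `scale` adults, mirroring the expansion step.
    rows = [z for z in zips for _ in range(z["pop25"] // scale)]
    total = len(rows)

    def standardized(key):
        # Standardize over the expanded (person-level) lines.
        values = [r[key] for r in rows]
        mean, sd = fmean(values), pstdev(values)  # population SD (assumption)
        return {z["zip"]: (z[key] - mean) / sd for z in zips}

    z_ed, z_inc = standardized("pbabin"), standardized("medianinc")
    index = {z["zip"]: z_ed[z["zip"]] + z_inc[z["zip"]] for z in zips}

    # Rank expanded lines from low to high. All lines of one zip code tie,
    # so each receives the average of the ranks they span (assumed tie rule).
    centiles, below = {}, 0
    for z in sorted(zips, key=lambda rec: index[rec["zip"]]):
        n = z["pop25"] // scale
        if n:
            avg_rank = below + (n + 1) / 2
            centiles[z["zip"]] = 100 * avg_rank / total
            below += n
    return centiles

# HYPOTHETICAL three-zip-code example with made-up values.
sample = [
    {"zip": "A", "pbabin": 10.0, "medianinc": 35.0, "pop25": 1000},
    {"zip": "B", "pbabin": 30.0, "medianinc": 60.0, "pop25": 2000},
    {"zip": "C", "pbabin": 70.0, "medianinc": 150.0, "pop25": 1000},
]
result = centile_scores(sample)
```

Because the ranking is done over expanded lines rather than over zip codes, the large middle zip code B occupies half of the distribution, which is the point of the expansion: the score locates individuals within the national population of individuals.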

I prepared two versions of the centile score, one of which used the actual median income and the other of which used the logged value of median income, which reduces the value of extremely high medians. An examination of the two versions, which had a correlation of 0.998, revealed that the version using actual median income gave greater weight to education than to minor changes in income for zip codes that were in the bottom half of the distribution, which seemed to me to be a more realistic representation of the relative importance of the two at low levels of both. Given the focus of the book, the more important question was whether the two versions produced materially different scores at the top. They did not. With only a few exceptions, the two versions of centile differed by less than 2 percentage points. I chose the version using actual median income, which gave more interpretable results at the low end.

The SuperZips consist of all zip codes with centile scores of 95 or higher.

Linking Zip Codes with the Political Ideology of the Congressional Representative

The database I employed to link congressional districts with zip codes was the Congressional District Database sold by zipinfo.com. Zip codes that fell into more than one district were assigned to the district that contained a majority of the zip+4 codings, which take the breakdown of zip codes to the block level.
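The majority rule can be sketched as follows. The function name and the district labels are hypothetical, and the sketch makes no claim about how the zipinfo.com database is laid out. The text does not say what was done when no district held more than half of the zip+4 codings, so the sketch simply returns None in that case.

```python
from collections import Counter

def assign_district(zip4_districts):
    """Pick the district holding a majority of a zip code's zip+4 codings.

    zip4_districts: one district label per zip+4 coding in the zip code.
    Returns None if there are no codings or no district exceeds half
    (a case the text does not address).
    """
    if not zip4_districts:
        return None
    district, count = Counter(zip4_districts).most_common(1)[0]
    return district if count * 2 > len(zip4_districts) else None

# HYPOTHETICAL labels: two of three zip+4 codings fall in district "D1".
example = assign_district(["D1", "D1", "D2"])
```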

As the measure of the political orientation of a congressional district, I averaged the liberal quotient calculated annually by the Americans for Democratic Action (ADA) for each congressperson for the 108th through 111th Congresses (those elected in 2002, 2004, 2006, and 2008). I used the ratings for only one year of each Congress (2004, 2005, 2007, and 2009), since the correlation within the two years of a Congress is close to perfect.
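The averaging step amounts to the following, shown as a minimal sketch. The function name and the example ratings are hypothetical, and averaging over whichever of the four years are present is an assumption about how a missing rating would be handled.

```python
from statistics import fmean

RATING_YEARS = (2004, 2005, 2007, 2009)  # one year per Congress, per the text

def district_liberal_quotient(ratings_by_year):
    """Average the ADA liberal quotient over the rating years available."""
    scores = [ratings_by_year[y] for y in RATING_YEARS if y in ratings_by_year]
    return fmean(scores) if scores else None

# HYPOTHETICAL ratings for one district's representatives.
example = district_liberal_quotient({2004: 90, 2005: 95, 2007: 100, 2009: 85})
```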

Census Tracts in the 1960 Census

Census tract data for the 1960 census were taken from the Elizabeth Mullen Bogue file (hereafter “Bogue”), available from the ICPSR. The data comprised 175 metropolitan areas that included 104,010,696 people out of the total resident population (all ages) in the 1960 census of 179,323,175. The population not included in the database was exclusively rural or lived in towns that were not part of metropolitan areas.

The Bogue file does not include the Census Bureau’s calculation of median income, but I was able to replicate the census values through the standard formula for computing medians from grouped income, median = l + h ((n/2 – cf)/f), where l is the lower limit of the median class (the interval within which the median must lie), n is the total number of cases, cf is the cumulative number of cases in intervals prior to the median class, f is the number of cases in the median class, and h is the width of the median class (e.g., if the median class represents people with incomes of $5,000–$5,999, the width is 1,000).
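The grouped-median formula translates directly into code. The sketch below is a generic Python implementation of that formula with a hypothetical tract; the function name and the counts are illustrative, and it does not handle an open-ended top interval such as "$25,000+," which has no width.

```python
def grouped_median(intervals):
    """Median from grouped data: l + h * ((n/2 - cf) / f).

    intervals: list of (lower_limit, width, count), sorted by lower_limit.
    """
    n = sum(count for _, _, count in intervals)
    if n == 0:
        raise ValueError("no cases")
    cf = 0  # cumulative cases in intervals before the current one
    for lower, width, f in intervals:
        if f > 0 and cf + f >= n / 2:
            # This interval is the median class.
            return lower + width * ((n / 2 - cf) / f)
        cf += f

# HYPOTHETICAL tract: 100 families across four $1,000-wide income intervals.
tract = [(3000, 1000, 20), (4000, 1000, 20), (5000, 1000, 40), (6000, 1000, 20)]
median_income = grouped_median(tract)
```

Here n/2 is 50, the first two intervals hold 40 cases, and the median class is $5,000–$5,999 with 40 cases, so the median is 5,000 + 1,000 × (10/40) = 5,250.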

For the twenty-three census tracts with a median income higher than the top code of $25,000, the census reports simply “$25,000+.” From the 1 percent sample of the 1960 census provided through IPUMS, I knew that if the distribution of incomes beyond $25,000 followed the same logarithmic trend as exhibited for incomes of $15,000–$24,999, I could expect half of those above $25,000 to make $28,000 or less. But I also knew that the number of those with incomes greater than $25,000 was almost three times as large as I would have predicted knowing the distribution from $15,000 through $24,999. I used $50,000 as my estimate of the point at which half of the $25,000+ population would be reached. This is probably too high, but given the thrust of my argument, which stresses the separation of the new upper class in 2000 compared to the high-income population in 1960, it is better to err on the high side.

The Alumni Sample

The elite schools keep careful track of their alumni for fund-raising purposes, which means that their periodic anniversary reports and alumni directories have close to 100 percent data on the whereabouts of their living alumni. Using the anniversary report of my own Harvard class (1965) and volumes provided by friends and colleagues, I recorded the zip codes of the home addresses for alumni from Harvard, Princeton, Yale, and Wesleyan in the following classes and years to which the home zip codes apply:

Harvard/Radcliffe. Classes/zip code years: 1965/1990, 1968/1993, 1990/2010

Princeton. Classes: 1980, 1981, 1982, 1985, 1987, 1989, 1990, 1991; zip code year: 2009

Yale. Classes/zip code years: 1964/1989, 1970/2000, 1979/2004

Wesleyan. Classes: randomly selected graduates from 1970 to 1979; zip code year: 1996

For persons who were at a typical age for college graduation, 22, the zip codes apply to their home residence at the ages of 40–52 for the HPY sample and 39–48 for the Wesleyan sample.
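The age ranges follow from the class and zip code years listed above, as this small Python check shows. It assumes, as the text does, graduation at age 22; the variable and function names are illustrative.

```python
GRADUATION_AGE = 22  # typical age at college graduation, per the text

def age_at_zip_year(class_year, zip_year):
    """Age in the zip code year for someone who graduated at 22."""
    return GRADUATION_AGE + (zip_year - class_year)

princeton_classes = (1980, 1981, 1982, 1985, 1987, 1989, 1990, 1991)
hpy = (
    [(1965, 1990), (1968, 1993), (1990, 2010)]      # Harvard/Radcliffe
    + [(c, 2009) for c in princeton_classes]        # Princeton
    + [(1964, 1989), (1970, 2000), (1979, 2004)]    # Yale
)
wesleyan = [(c, 1996) for c in range(1970, 1980)]   # classes 1970 to 1979

hpy_ages = [age_at_zip_year(c, z) for c, z in hpy]
wes_ages = [age_at_zip_year(c, z) for c, z in wesleyan]
hpy_range = (min(hpy_ages), max(hpy_ages))
wes_range = (min(wes_ages), max(wes_ages))
```

The resulting ranges are (40, 52) for the HPY sample and (39, 48) for the Wesleyan sample, matching the figures in the text.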

Table B.1 shows the sample sizes and the means and standard deviations of the centile scores, by school.

TABLE B.1. BASIC STATISTICS FOR THE ALUMNI ZIP CODE SAMPLES

                          Centile scores
School          N         Mean    Standard deviation
Harvard         3,499     84.0    21.2
Princeton       8,049     84.7    20.9
Yale            2,769     82.8    21.3
Wesleyan        1,588     79.9    22.4
Total          15,905     83.7    21.2

The mean centile scores of the zip codes for the three iconic schools were remarkably close, and Wesleyan wasn’t far behind. The overall percentage of HPY graduates living in SuperZips was 43.6, with Yale having a slightly lower percentage of 40.9 compared to 43.9 percent for Harvard and 44.4 percent for Princeton. The higher percentages for Harvard and Princeton are plausibly attributable to a hometown effect: the Boston area was more attractive to Harvard graduates, and the Princeton area to Princeton graduates, than the New Haven area was to Yale graduates. Since the zip codes from Cambridge westward out of Boston, and the zip codes surrounding Princeton, are dense with SuperZips, the tendency of graduates to stay near their college gave an upward push to the overall mean for Harvard and Princeton that Yale did not share. The proportion of Wesleyan graduates living in SuperZips was 31.5 percent.
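The totals can be recovered from the school-level figures by weighting each school by its sample size, as in this Python sketch. The variable names are illustrative. Because the published school-level values are rounded to one decimal place, the recomputed totals can be expected to agree with the reported ones only to that precision.

```python
# (N, mean centile) by school, from Table B.1.
table_b1 = {
    "Harvard": (3499, 84.0),
    "Princeton": (8049, 84.7),
    "Yale": (2769, 82.8),
    "Wesleyan": (1588, 79.9),
}
# Percentage of graduates living in SuperZips, from the text.
superzip_pct = {"Harvard": 43.9, "Princeton": 44.4, "Yale": 40.9}

# Sample-size-weighted mean centile across all four schools.
total_n = sum(n for n, _ in table_b1.values())
pooled_mean = sum(n * mean for n, mean in table_b1.values()) / total_n

# Sample-size-weighted SuperZip percentage for Harvard, Princeton, and Yale.
hpy_n = sum(table_b1[school][0] for school in superzip_pct)
hpy_superzip_pct = (
    sum(table_b1[school][0] * pct for school, pct in superzip_pct.items()) / hpy_n
)
```

The sample sizes sum to 15,905, the weighted mean centile rounds to 83.7, and the weighted HPY SuperZip percentage rounds to 43.6, all consistent with the reported totals.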