Research has shifted toward administrative data for three reasons:

1. Administrative records offer much larger sample sizes for full populations, which support more compelling research designs and research into important but relatively rare events.

2. Administrative files often have an inherent longitudinal structure that enables researchers to follow individuals over time and address policy questions.

3. Administrative data are less likely than survey data to suffer from high and rising rates of nonresponse, attrition, and underreporting. (HUD, 2013: 3)

Harnessing the power of these data through web-based information systems and geospatial analysis and matching these data with survey and administrative data from other agencies will provide the foundation for the next generation of evidence-based policymaking.

One particularly important area for investigation is the use of AR for improving the Census Bureau MAF. Improving the MAF—the basis for all Census Bureau household survey samples—will yield benefits to all such surveys and to the next decennial census. Under an agreement with the Census Bureau, the U.S. Postal Service (USPS) already provides a copy of its Delivery Sequence File (DSF) twice a year, and each DSF is used to update the MAF. The Census Bureau is investigating the use of National Change of Address files for improving the MAF.

Two other key components of MAF updates associated with the decennial census (address canvassing to determine ground truth, and local updates) can be brought further into the digital age. Efforts under way through the Census Bureau Geographic Support Systems Initiative will establish links to counties and large cities that can provide periodic electronic updates to their address files. The Census Bureau currently has no plans to run the Local Update of Census Addresses program as an ongoing program rather than a once-a-decade program. True partnership between the Census Bureau and state and local governments to improve the address list should be a two-way street.

Through an interagency agreement with USPS, HUD receives counts of total and vacant business and residential addresses in the United States at the ZIP+4 geographic level. HUD uses these data for a variety of purposes, including researching neighborhood change, tracking disaster recovery, gauging the foreclosure crisis, analyzing housing markets, and measuring the effect of HUD funding on communities. HUD also makes the vacancy data available at the census tract level to government and nonprofit organizations through a permitted-user sublicense agreement.

HUD collects information on the tenants in HUD-subsidized housing in its Public and Indian Housing Information Center (PIC) system and its Tenant Rental Assistance Certification System (TRACS).

Local program administrators use form HUD-50058 to submit data to the PIC system and form HUD-50059 to provide HUD with tenant data for TRACS. PIC data contain longitudinal information on families living in public housing or receiving tenant-based housing vouchers, whereas TRACS data contain longitudinal information on families living in project-based Section 8 housing. HUD uses these data in several ways and provides them for research purposes to other government agencies that promise confidentiality protection.5 Mast provides the following information about PIC.

The PIC system has quarterly entries for each family receiving HUD rental assistance starting in 1995. Data are available on income, rent, and a large number of other household and PHA [public housing agency] characteristics. … The PIC data system is transaction based. The most common transactions are (1) admissions, (2) annual [reexaminations], (3) interim [reexaminations] due to changes in eligibility factors such as income or family size, (4) moves, and (5) exits from the program. The system captures the most recent transaction at the end of each quarter. If multiple transactions for a household occur during a quarter, only the most recent is available. If no transaction occurs during a quarter, the family’s entry is a duplicate of the entry for the previous quarter.

Rent contracts are effective for 1 year and most households have only one transaction per year. Therefore, most changes are made annually, not quarterly. (Mast, 2012: 60)

The HUD Office of Policy Development and Research produces annual tabulations from the PIC/TRACS data called Picture of Subsidized Households (the most recent is for 2009). As the website notes, “Picture of Subsidized Households describes the nearly 5 million households living in HUD-subsidized housing in the United States [providing] characteristics of assisted housing units and residents, summarized at the national, state, public housing agency (PHA), project, census tract, county, Core-Based Statistical Area and city levels as downloadable files.”6 A 5-percent sample of the microdata is available to qualified researchers. In addition, as mentioned previously, the 2011 AHS collected data from a supplementary sample of HUD-subsidized units selected from PIC/TRACS.

5 For examples of research using the HUD-PIC extract file, see Lubell, Shroder, and Steffen (2003); Mills et al. (2006); Olsen et al. (2005); Shroder (2002); and Tatian and Snow (2005).

6 Quoted from http://www.huduser.org/portal/datasets/picture/yearlydata.html#download-tab.
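The quarterly snapshot rule that Mast describes for PIC (keep only the most recent transaction in a quarter, and repeat the previous entry when a quarter has no transaction) can be illustrated with a minimal sketch. The function names, the tuple layout, and the sample family are hypothetical and chosen only for illustration; they do not reflect the actual PIC file format.

```python
from datetime import date


def quarter_of(d):
    """Return (year, quarter) for a date."""
    return (d.year, (d.month - 1) // 3 + 1)


def next_quarter(q):
    """Return the (year, quarter) that follows q."""
    year, qtr = q
    return (year, qtr + 1) if qtr < 4 else (year + 1, 1)


def quarterly_snapshots(transactions, first_quarter, last_quarter):
    """Build one entry per quarter from a family's dated transactions.

    transactions: list of (date, action) tuples, in any order.
    Returns a list of ((year, quarter), action) pairs. Quarters before
    the family's first transaction are omitted.
    """
    latest = {}
    for when, action in sorted(transactions):
        # Later transactions in the same quarter overwrite earlier ones.
        latest[quarter_of(when)] = action

    snapshots = []
    current = None
    q = first_quarter
    while q <= last_quarter:
        if q in latest:
            current = latest[q]
        if current is not None:
            # With no new transaction, the previous entry is duplicated.
            snapshots.append((q, current))
        q = next_quarter(q)
    return snapshots


# Hypothetical family: two transactions in 2001 Q1, none until 2002 Q1.
family = [
    (date(2001, 2, 10), "admission"),
    (date(2001, 3, 5), "interim reexamination"),
    (date(2002, 2, 1), "annual reexamination"),
]
snapshots = quarterly_snapshots(family, (2001, 1), (2002, 1))
for quarter, action in snapshots:
    print(quarter, action)
```

In this example the admission is not visible in the snapshots because an interim reexamination followed it in the same quarter, and the 2001 Q1 entry is carried forward unchanged through 2001 Q4. Researchers counting events from such a file would therefore undercount within-quarter transactions.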

The Federal Financial Institutions Examination Council (FFIEC) collects data from lending institutions related to the enforcement of mortgage regulations. The Home Mortgage Disclosure Act (HMDA) was enacted by Congress in 1975 and was implemented by Federal Reserve Board Regulation C. On July 21, 2011, the rule-writing authority of Regulation C was transferred to the Consumer Financial Protection Bureau. Regulation C requires lending institutions to report public loan data to assist in:

• Determining whether financial institutions are serving the housing needs of their communities.

• Siting local public-sector investments so as to attract private investment to areas where it is needed.

• Identifying possible discriminatory lending patterns.

HMDA initially required reporting of the geographic location of originated and purchased home loans. In 1989, Congress expanded HMDA data to include information about denied home loan applications and the race, sex, and income of applicants and borrowers. In 2002, the Federal Reserve Board amended the HMDA regulations to require lenders to report price data for certain higher priced home mortgage loans and other new data. For each transaction, with some exceptions, the lender reports data about:

• The loan (or application), such as the type and amount of the loan made (or applied for) and, in limited circumstances, its price.

• The disposition of the application, such as whether it was denied or resulted in a loan origination.

• The property to which the loan relates, such as its type (single-family or multifamily) and location (including the census tract).

• The applicant’s ethnicity, race, gender, and income.

• The sale of the loan (if applicable).
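As a rough illustration of how transaction-level records of this kind support the purposes listed earlier, the sketch below tabulates denial rates by census tract and by applicant group. The record layout, field names, and sample values are hypothetical simplifications; actual HMDA files use their own field definitions and more disposition categories than the two shown here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LoanRecord:
    """Simplified, hypothetical stand-in for one reported transaction."""
    loan_amount: int        # dollars
    disposition: str        # "originated" or "denied" in this sketch
    census_tract: str
    applicant_race: str
    applicant_income: int   # dollars


def denial_rates(records, key):
    """Return {group: share of applications denied}, grouped by key(record)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [denied, total]
    for record in records:
        group = key(record)
        counts[group][1] += 1
        if record.disposition == "denied":
            counts[group][0] += 1
    return {group: denied / total for group, (denied, total) in counts.items()}


# Fabricated example records, for illustration only.
records = [
    LoanRecord(150000, "originated", "tract-A", "group-1", 60000),
    LoanRecord(120000, "denied", "tract-A", "group-2", 45000),
    LoanRecord(200000, "originated", "tract-B", "group-1", 80000),
    LoanRecord(180000, "originated", "tract-B", "group-2", 75000),
]

by_tract = denial_rates(records, key=lambda r: r.census_tract)
by_race = denial_rates(records, key=lambda r: r.applicant_race)
print(by_tract)
print(by_race)
```

Raw differences in denial rates are only a starting point; a careful analysis would also account for income, loan amount, and other reported characteristics.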

Regulation C applies to certain financial institutions, including banks, savings associations, credit unions, and other mortgage-lending institutions. FFIEC also collects similar data from private mortgage insurance companies on a voluntary basis and is responsible for administering the regulations to implement the Community Reinvestment Act of 1977,7 “intended to encourage depository institutions to help meet the credit needs of the communities in which they operate.”8

Several countries maintain housing registers (lists of all housing units and their characteristics) that can form the basis for housing analysis. For example, Denmark established its first housing register in the 1880s for the city of Copenhagen. As Christensen noted:

The [Danish] Building and Housing Register (BBR) was established in 1977. Since 1981, BBR has been updated annually by the municipalities. Before 1981, data on housing conditions were collected as part of nationwide census of all households in Denmark that took place every fifth year. The first nationwide census including housing information took place in 1955. BBR consists of national data concerning building and housing. The purpose of the register is to describe the total housing stock and individuals’ housing conditions and is used for administrative purposes. … There are good opportunities to carry out research on Danish housing conditions. The key data in BBR are of high quality and go back in time so longitudinal analyses can be executed. Furthermore, BBR can be matched with other registers so it is possible to make detailed analyses of tenant composition over time. In particular, analyses that compare individuals over time living in different segments of the housing market, e.g. ownership, social housing sector, and private sector, provide unique knowledge of individuals’ living and housing conditions. (Christensen, 2011: 106, 108)

No U.S. housing register exists, however. The closest approximation is the MAF, which is confidential under federal law.9 Even so, under Title 13 the MAF can be accessed for research that also benefits the Census Bureau (through its network of Research Data Centers). The MAF contains little information other than the address and associated census geography, but it can be linked to many Census Bureau household surveys.

The public property records in the United States that are the basis for property taxes are also potential data sources. Because these records are assembled at the municipal level, however, they are of varying quality, such as might result from delays in reassessment. Companies such as Zillow aggregate these records to offer services to the public for specific addresses. Researchers may be able to access these records for their own research.

Promising Techniques for Creating Additional Data Sources

While the data in AR datasets are interesting and useful, their value can be enhanced for research purposes.

Linking

One method that can enhance the value of existing data is to link datasets together. In this section, I describe a recent effort (Andersson et al., 2013, in which I participated) that linked together decennial census data, unemployment insurance AR on earnings, and HUD administrative data on subsidies to create a new database for housing research.

Andersson et al. (2013) and ongoing research focus on a difficult research issue: analyzing how children’s housing affects their earnings in early adulthood. Andersson et al. developed a frame of households and children from the internal version of the 2000 decennial census. The short form provided a set of demographic variables that can be used to control for observable characteristics of parents and children. It also provided the residential location of households in 2000, which Andersson et al. linked to neighborhood characteristic variables (aggregates of the long-form data to the block group and census tract levels). Next, they used person identifiers developed at the Census Bureau to link the parents and children to HUD-PIC, the administrative data file of housing assistance recipients described previously. The HUD-PIC file covered 1997 through 2005; it was used to identify each year a parent or child was in subsidized housing and whether they were in public housing or received a housing voucher enabling them to live in private-sector housing.

Finally, Andersson et al. used the unique person identifiers to link the children in the sample to earnings records for 2008 through 2010 (and parents to their income for the entire period). The Census Bureau Longitudinal Employer-Household Dynamics (LEHD) dataset provides earnings records for more than 130 million workers each quarter from the mid-2000s onward.10 Those records provided a measure of labor market outcomes for 1.8 million children who were ages 13 to 18 in 2000 in low-income families, a sample size sufficient to present results disaggregated by race and Hispanic origin, gender, and housing subsidy program, while controlling for neighborhood conditions such as poverty level. When the initial analysis is complete, analysis with the file can be expanded to other topics, such as residential mobility and intergenerational earnings mobility.

9 Code of Federal Regulations, Title 13.
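The three-way linkage described above (a census frame, housing assistance records, and earnings records joined on a common person identifier) can be sketched in a few lines. All identifiers, field names, and values below are fabricated for illustration, and the sketch ignores the confidentiality protections and identifier-assignment steps that the actual work requires.

```python
# Hypothetical census frame: person_id -> characteristics in 2000.
children = {
    "p1": {"age_2000": 14, "tract": "T1"},
    "p2": {"age_2000": 16, "tract": "T2"},
    "p3": {"age_2000": 13, "tract": "T1"},
}

# Hypothetical housing assistance records: (person_id, year, program).
housing = [
    ("p1", 1998, "public housing"),
    ("p1", 1999, "public housing"),
    ("p2", 2001, "voucher"),
    ("p9", 2000, "voucher"),  # not in the frame, so it is dropped
]

# Hypothetical earnings records: (person_id, year, annual earnings).
earnings = [
    ("p1", 2008, 18000),
    ("p1", 2009, 21000),
    ("p2", 2009, 15000),
]


def link(children, housing, earnings):
    """Attach assistance years and earnings to each person in the frame."""
    linked = {
        pid: dict(record, years_by_program={}, earnings={})
        for pid, record in children.items()
    }
    for pid, year, program in housing:
        if pid in linked:  # keep only people in the census frame
            linked[pid]["years_by_program"].setdefault(program, []).append(year)
    for pid, year, amount in earnings:
        if pid in linked:
            linked[pid]["earnings"][year] = amount
    return linked


linked = link(children, housing, earnings)
for pid, record in linked.items():
    print(pid, record)
```

Starting from the census frame means that children who never received assistance (such as the hypothetical "p3") remain in the file and can serve as a comparison group.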

Synthetic Data

One key problem with using the American Housing Survey for housing analysis is the relatively small sample sizes in any one location (metropolitan area), though the sample sizes appear adequate for national analysis. One key problem with using the American Community Survey (ACS) for housing analysis is the relatively few questions asked about housing and neighborhood physical, social, and economic characteristics. Is there any way to combine the strengths of the two surveys to enhance the data available for housing analysis?

Recent work by Reiter and others suggests it is possible to create a (partially) synthetic dataset that combines AHS and ACS using exact matches and modeling.11 Synthetic datasets are created based on multiple draws from a derived joint distribution of variables; that distribution is based on observed data relationships. Fully synthetic datasets create all variables this way, whereas partially synthetic datasets retain survey observations for some variables and impute other variables.
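To make the idea of a partially synthetic dataset concrete, the following minimal sketch retains one observed variable and imputes a second variable several times from a model fit on a smaller donor file. It is not the method used by Reiter and others: the simple linear model, the variable names (rent, quality), and the data are all assumptions made for illustration, and the sketch covers only the modeling step, not exact matching. A fuller treatment would also draw the model parameters themselves rather than holding them fixed.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y on x; returns (intercept, slope, residual sd)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = (sum(e * e for e in residuals) / (len(xs) - 2)) ** 0.5
    return intercept, slope, resid_sd


def synthesize(recipients, donors, draws, seed=0):
    """Return `draws` datasets: observed rent kept, quality imputed."""
    rng = random.Random(seed)
    intercept, slope, sd = fit_line(
        [d["rent"] for d in donors], [d["quality"] for d in donors]
    )
    datasets = []
    for _ in range(draws):
        datasets.append([
            {
                "rent": r["rent"],  # retained survey observation
                "quality": rng.gauss(intercept + slope * r["rent"], sd),  # imputed
            }
            for r in recipients
        ])
    return datasets


# Hypothetical donor file (small sample, detailed questions).
donors = [
    {"rent": 500, "quality": 3.1},
    {"rent": 700, "quality": 3.9},
    {"rent": 900, "quality": 5.2},
    {"rent": 1100, "quality": 5.8},
    {"rent": 1300, "quality": 7.0},
]
# Hypothetical recipient file (large sample, few housing questions).
recipients = [{"rent": 600}, {"rent": 1000}, {"rent": 1200}]

datasets = synthesize(recipients, donors, draws=5)
for row in datasets[0]:
    print(row)
```

Producing several datasets rather than one lets an analyst see how much the results vary across draws, which reflects the uncertainty introduced by imputation.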

LEHD is a partnership between the Census Bureau and all 50 states and the District of Columbia; it produces public use data tabulations (Quarterly Workforce Indicators and an interactive web-based commuting analysis tool, OnTheMap) that are widely used by state and local governments. At its core are two AR files provided by states on a quarterly basis: (1) unemployment insurance (UI) wage records, giving the earnings of each worker at each employer;

