2021 | Original Paper | Book Chapter

1. Introduction: Stories, Data and Statistics

Written by: Matthew J. Holian

Published in: Data and the American Dream

Publisher: Springer International Publishing

Abstract

This chapter describes the American Community Survey (ACS) and how to use microdata from it to calculate descriptive statistics and make inferences about cause-and-effect relationships in society. It introduces the core statistical technique of regression. The chapter emphasizes an intuitive understanding of techniques and concepts, and defines and clarifies dozens of key terms used in econometric research. Questions for Review at the end of the chapter, on topics including sample weighting and inflation adjustments, illustrate empirical best practices for readers who are either beginning in econometrics or experienced but looking to add a valuable new data source to their repertoire.

IPUMS also constructs its own variables, such as those describing family interrelationships. I have written a blog post that contains a link to all the person and household variables the Census produced in 2015 for PUMA 068511. The variable names and coding values that appear in this book reflect IPUMS conventions, but it can be instructive to see how the same data look when distributed by Census.gov: http://mattholian.blogspot.com/2019/09/downloading-census-micro-data-ipums-or.html.

These maps were created using an open-source software program called QGIS. This book is in part a guide to R software, but R is not the only open-source software I used in writing it. In Appendix A, I discuss a third program, LaTeX, which I used for word processing. Here, I want to provide some guidance on cartography. Making maps with computer software is part of a field called Geographic Information Systems (GIS). An excellent commercial GIS package is ArcGIS, but the free, open-source QGIS program is also easy to use. You can download the software and find training manuals at www.qgis.org. More resources are at www.qgistutorials.com. Once you have GIS software, you need input files known as “shapefiles” or map “layers.” Download these here: https://usa.ipums.org/usa/volii/tgeotools.shtml.

It is possible to download some of these statistics from data.census.gov. While this is a good place to find commonly used statistics like average income, you won’t find specialized statistics there, like average lawyer earnings by college major. To find these, users have to calculate them themselves from the public microdata, as I have done here.

Angrist and Pischke (2014, p. 10) discuss selection bias in the context of an equation, which I adopt as a definition of the term: Selection Bias = (Difference in Means) − (Average Causal Effect). Here, selection bias is the entire gap between what we observe ($37,024, the difference in means) and the true impact of the treatment, which generally is unknown, but could be measured in an ideal randomized experiment. There’s a lot of jargon in econometrics, some of it unfortunate, and some of it necessary to discuss nuanced concepts. Take, for example, the treatment effect. The effect of the economics curriculum likely varies across people. The average of the individual effects is known as the average causal effect.
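Rearranged, the definition above says that the observed difference in means is the sum of the average causal effect and selection bias. The lines below restate this with the $37,024 gap reported here; ACE is shorthand introduced only for this sketch, and its value remains unknown.

```latex
\begin{align*}
\text{Selection Bias} &= \text{Difference in Means} - \text{ACE} \\
\text{Difference in Means} &= \text{ACE} + \text{Selection Bias} \\
\$37{,}024 &= \text{ACE} + \text{Selection Bias}
\end{align*}
```

Read this way, the larger the share of the $37,024 gap that reflects selection, the smaller the average causal effect must be.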

If their intention behind randomly assigning students to majors was to study the effect of major on income, we would call it an actual experiment (or a randomized experiment, or maybe a field experiment) but not a natural experiment. See Dunning (2012) for further discussion of natural experiments. Bleemer and Mehta (2021) use a grade point average policy at UC Santa Cruz, in a technique called regression discontinuity, to study the causal effect of the economics curriculum on earnings.

In 1886, Francis Galton found that children of very tall parents tend to be shorter than their parents, and he described this as “regression to the mean.” The statistical technique he developed to study this phenomenon used an equation that has since become known as a “regression equation.” See also Bailey (2017, Chapter 3, footnote 2) and Angrist and Pischke (2014, pp. 79–81) on the history of the term regression.

The description of the estimation subsample is sometimes referred to as a model’s “data rules.” Note that here the estimation subsample includes persons with all undergraduate majors, not just economics and electrical engineering. Determining the estimation subsample a researcher used is often a major challenge in replicating a study, but it is the critical first step. In the file script2.R on this book’s webpage, one line of code defines the estimation subsample for the Winters (2016) replication: subset2w = subset(ACSmaster, OCC1990==178 & EDUCD>114 & AGE>29 & AGE<62 & YEAR>2008 & YEAR<2014). Here, subset2w is the name I gave the resulting data set (or “data frame” in R language), and subset() is an R function that creates a smaller data frame from a larger one. The larger data frame, ACSmaster, is the IPUMS extract with 61 variables from 14 survey years that I discuss in Appendix A; subset2w is a much smaller data frame that contains only lawyers surveyed in certain years with certain other characteristics. ACSmaster is a large file. It is nowhere near as large as it would be if the extract included all variables and all samples available from IPUMS, but it is large enough that every statistic presented in this book is estimated on a subset of it.
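To see how these data rules filter rows, the following minimal sketch applies the same subset() conditions to a small made-up data frame. The six rows are hypothetical values invented for illustration; the real ACSmaster extract is not reproduced here, and only the four variables used in the conditions are included.

```r
# Hypothetical toy stand-in for ACSmaster (the real extract has 61 variables).
# Every value below is made up purely to exercise the filter conditions.
ACSmaster <- data.frame(
  OCC1990 = c(178, 178, 178,  95, 178, 178),
  EDUCD   = c(115, 116, 101, 115, 115, 116),
  AGE     = c( 35,  29,  45,  40,  61,  62),
  YEAR    = c(2010, 2011, 2012, 2013, 2013, 2014)
)

# Same data rules as the line from script2.R quoted in the text:
# occupation code 178, education code above 114,
# ages 30 to 61, and survey years 2009 to 2013.
subset2w <- subset(ACSmaster,
                   OCC1990 == 178 & EDUCD > 114 &
                   AGE > 29 & AGE < 62 &
                   YEAR > 2008 & YEAR < 2014)

# Number of observations left in the estimation subsample
print(nrow(subset2w))
```

In this toy data only the first and fifth rows satisfy every condition; each of the other rows fails at least one rule (age, education code, occupation code, or survey year). Checking the row count after each rule is a quick way to compare a replication sample against the one reported in the original study.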

Selection bias was defined in footnote 4 as the entire gap between the difference in means and the true average causal effect. In the context of regression control, we often discuss a different concept, omitted variable bias (OVB), which is the gap between the estimated values of the short and long coefficients. Specifically here, OVB = $32,650 − $25,206 = $7,444. Unless our long regression includes all possible factors that meet the two OVB conditions (they influence the dependent variable and are correlated with the independent variable of interest), OVB will be less in magnitude than selection bias. Review Question 8 presents an interesting relationship between coefficients from various regressions and the two OVB conditions.

The gender of a child does seem to be a more important factor for households on the margin between three- and four-bedroom homes. Analysis carried out in the file script3.R finds that 54% of households in three-bedroom homes are same-gender child households, while only 47% of households in four-bedroom homes have children of the same gender.

The OVB equation requires estimating a so-called auxiliary regression, the equation for which is: $FEMALE_{i}=\pi _{0} + \pi _{1} ECON_{i} + u_{i}.$ We call OLS models with binary dependent variables like this one linear probability models, and we interpret fitted values from them as predicted probabilities. The right-hand side of the OVB equation highlights that, if we omit a variable in a regression equation that is (1) highly correlated with the main variable of interest (we see this in the auxiliary regression on $\pi _{1}$) and (2) an important determinant of the dependent variable when it is included (we see this in the long regression on $\gamma $), the estimated coefficient in the bivariate regression will suffer from OVB. See also Angrist and Pischke (2014, p. 71) and Bailey (2017, Section 5.2).
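The OVB equation referred to here can be written out as follows. This is a sketch of the standard textbook form; the symbols $\beta^{s}$ and $\beta^{l}$ for the short and long coefficients on ECON, the outcome name $Y_{i}$, the intercepts, and the error terms are notation assumed for illustration. The identity is exact only when the long regression adds FEMALE alone to the short regression and both are estimated on the same sample.

```latex
\begin{align*}
\text{Short:}     \quad & Y_{i} = \alpha^{s} + \beta^{s}\, ECON_{i} + e^{s}_{i} \\
\text{Long:}      \quad & Y_{i} = \alpha^{l} + \beta^{l}\, ECON_{i} + \gamma\, FEMALE_{i} + e^{l}_{i} \\
\text{Auxiliary:} \quad & FEMALE_{i} = \pi_{0} + \pi_{1}\, ECON_{i} + u_{i} \\
\text{OVB:}       \quad & \beta^{s} - \beta^{l} = \pi_{1} \times \gamma
\end{align*}
```

The product on the right-hand side is zero whenever either $\pi_{1} = 0$ or $\gamma = 0$, which is why both conditions must hold for the short coefficient to suffer from OVB.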

Angrist, Joshua D., and Jörn-Steffen Pischke. Mastering ’metrics: The path from cause to effect. Princeton University Press, 2014.

Bailey, Michael A. Real econometrics: The right tools to answer important questions. Oxford University Press, 2017.

Bleemer, Zachary, and Aashish Mehta. “Will studying economics make you rich? A regression discontinuity analysis of the returns to college major.” American Economic Journal: Applied Economics. Forthcoming, 2021.

Dunning, Thad. Natural experiments in the social sciences: A design-based approach. Cambridge University Press, 2012.

Gerring, John. “Mere description.” British Journal of Political Science (2012): 721–746.

Grimmer, Justin. “We are all social scientists now: How big data, machine learning, and causal inference work together.” PS, Political Science & Politics 48, no. 1 (2015): 80.

Holian, Matthew J. “The impact of urban form on vehicle ownership.” Economics Letters 186 (2020): 108763.

Huang, Eddie. Fresh off the boat: A memoir. Spiegel & Grau, 2013.

Winters, J. V. “Is economics a good major for future lawyers? Evidence from earnings data.” The Journal of Economic Education 47, no. 2 (2016): 187–191.

Title: Introduction: Stories, Data and Statistics
Written by: Matthew J. Holian
Publisher: Springer International Publishing
Book: Data and the American Dream
Print ISBN: 978-3-030-64261-7

Electronic ISBN: 978-3-030-64262-4

Copyright year: 2021
DOI: https://doi.org/10.1007/978-3-030-64262-4_1