2022 | Original Paper | Book chapter

1. A First Look at Data

Written by: Maurits Kaptein, Edwin van den Heuvel

Published in: Statistics for Data Scientists

Publisher: Springer International Publishing

Abstract

For data scientists, the most important use of statistics will be in making sense of data. In this first chapter we therefore start immediately by examining, describing, and visualizing data. Throughout the chapter we use a dataset called face-data.csv; this dataset, like all the other datasets used in this book, is described in more detail in the preface and can be downloaded at http://www.nth-iteration.com/statistics-for-data-scientist. We also introduce R, a free and publicly available statistical software package that we use for our calculations and graphics. You can download R at https://www.r-project.org.

Footnotes
1
A number of the chapters in this book contain additional materials that are positioned directly after the assignments. These materials are not essential to understand the material, but they provide additional background.
 
2
In R you can always type ?function_name to get to the help page of a function. You should replace function_name by the name of the function you want to see more information about.
 
3
.sav files are data files from the statistical package SPSS.
 
4
As is true for many programming languages, R is continuously updated. Interested readers can find a discussion of the change to the default of the stringsAsFactors argument at https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/.
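As a minimal sketch of what this argument controls (the toy columns below are hypothetical and not taken from the book's datasets), the same strings are stored as a character vector or as a factor depending on its value:

```r
# Hypothetical toy data; column names are illustrative only.
df_chr <- data.frame(id = 1:3, group = c("a", "b", "a"),
                     stringsAsFactors = FALSE)
df_fct <- data.frame(id = 1:3, group = c("a", "b", "a"),
                     stringsAsFactors = TRUE)

class(df_chr$group)   # "character"
class(df_fct$group)   # "factor"
levels(df_fct$group)  # "a" "b"
```

Setting the argument explicitly, as done here, makes code behave the same regardless of which default the installed R version uses.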
 
5
Note that the read.csv function loads all the data into RAM; be aware that this might not be feasible for large datasets.
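One way to get a feel for this, sketched under the assumption that a preview of the first rows is enough, is to limit the number of rows read with the nrows argument and to check the in-memory size of the result. The temporary file and its columns are made up for illustration:

```r
# Write a small hypothetical CSV file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, score = (1:1000) / 10),
          tmp, row.names = FALSE)

preview <- read.csv(tmp, nrows = 5)  # read only the first 5 data rows
full    <- read.csv(tmp)             # read the whole file into RAM

nrow(preview)                        # 5
nrow(full)                           # 1000
format(object.size(full), units = "Kb")  # memory used by the full object
```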
 
6
To prevent messy output, the R console stops printing at some point; even so, the output is largely uninformative.
 
7
In this book we do not provide a comprehensive overview of R; we provide what you need to know to follow the book. A short introduction can be found in Ippel (2016), while for a more thorough overview we recommend Crawley (2012).
 
8
While it is convenient to think of a data.frame as a generalization of a matrix object, it technically isn’t. The data.frame is “a list of factors, vectors, and matrices with all of these having the same length (equal number of rows in matrices). Additionally, a data frame also has names attributes for labelling of variables and also row name attributes for the labelling of cases.”
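The list nature of a data.frame can be checked directly. The following is a small sketch on a made-up data.frame (the names x and y are illustrative):

```r
# Hypothetical data.frame with one numeric and one character column.
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

is.list(df)    # TRUE: a data.frame is stored as a list of columns
is.matrix(df)  # FALSE: it is not a matrix
typeof(df)     # "list"
length(df)     # 2, the number of columns (list elements)
names(df)      # "x" "y", the variable labels
row.names(df)  # "1" "2" "3", the case labels
```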
 
9
In R, the command 1:9 would actually suffice to create the vector; however, we stick to using the function c() explicitly when creating vectors.
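A short sketch comparing the two ways of creating the vector (the variable names are illustrative). The values agree, although R stores them with different types:

```r
v_colon <- 1:9                          # sequence operator
v_c     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) # explicit use of c()

all(v_colon == v_c)      # TRUE: element-wise the values are equal
typeof(v_colon)          # "integer"
typeof(v_c)              # "double"
identical(v_colon, v_c)  # FALSE: the storage types differ
```

For the calculations in this chapter the difference in storage type does not matter.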
 
10
In practice, continuous data does not exist: we always record data with a finite number of digits, so the property that there is a value in between any two values is lost. Thus data is essentially always discrete.
 
11
Data imputation is a field in its own right (see, e.g., Vidotto et al. 2015); we will not discuss this topic further in this book. However, a good introduction is provided by Baguley and Andrews (2016).
 
12
The cumulative frequency makes more sense for ordinal data than for nominal data, since ordinal data can be ordered in size, which is not possible for nominal data.
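As an illustration, here is a minimal sketch with made-up ordinal responses (the level names are hypothetical). Because the levels have a natural order, the running total can be read as "this level or lower":

```r
# Hypothetical ordinal variable with levels ordered low < medium < high.
responses <- factor(c("low", "high", "medium", "low", "medium", "medium"),
                    levels = c("low", "medium", "high"), ordered = TRUE)

freq     <- table(responses)  # frequency per level, in level order
cum_freq <- cumsum(freq)      # running total over the ordered levels

freq      # low 2, medium 3, high 1
cum_freq  # low 2, medium 5, high 6
```

For a nominal variable the same code would run, but the result would depend on an arbitrary ordering of the categories.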
 
13
Not all readers will be familiar with summation notation; let \(x_1, x_2, \dots , x_j\) be a set of numbers, then \(\sum _{k=1}^j x_k = x_1 + x_2 + \dots + x_j\).
 
14
It is good practice to execute parts of a complicated line of code like this separately so that you understand each command (in this case match, tabulate, and which.max).
 
15
Finally, note that the smallest most frequently occurring value is reported; it is an interesting exercise to change this function such that it returns all the modes if multiple modes exist.
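The following is a sketch of how such a function could be built from match, tabulate, and which.max, together with one possible variant that returns all modes. It is an assumed reconstruction, not the book's verbatim code, and the function names get_mode and get_all_modes are hypothetical:

```r
# Assumed reconstruction: returns the smallest most frequent value.
get_mode <- function(x) {
  ux     <- sort(unique(x))         # candidate values, smallest first
  counts <- tabulate(match(x, ux))  # how often each candidate occurs
  ux[which.max(counts)]             # which.max picks the first maximum
}

# One possible variant: return every value that attains the maximum count.
get_all_modes <- function(x) {
  ux     <- sort(unique(x))
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}

x <- c(3, 1, 3, 1, 2)
match(x, sort(unique(x)))  # 3 1 3 1 2: position of each value among candidates
get_mode(x)                # 1
get_all_modes(x)           # 1 3
```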
 
16
Yes, the second quartile is equal to the median.
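This can be verified on a small made-up vector, assuming R's default quantile settings:

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)  # hypothetical data

q <- quantile(x, probs = c(0.25, 0.5, 0.75))  # the three quartiles
q[["50%"]]  # 6, the second quartile
median(x)   # 6
```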
 
17
Note that the average of the standardized values is equal to \(\frac{1}{n}\sum _{i=1}^{n}z_i=\frac{1}{n}\sum _{i=1}^{n}x_i/s-\bar{x}/s=0\) and that the sample variance of the standardized values is equal to \(\frac{1}{n-1}\sum _{i=1}^{n}\left( z_i-0\right) ^2=\frac{1}{n-1}\sum _{i=1}^{n}\left( x_i-\bar{x}\right) ^2/s^2=s^2/s^2=1\).
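A quick numerical check of both properties, sketched on a made-up vector (results hold up to floating-point rounding):

```r
x <- c(2, 4, 4, 5, 7, 9)     # hypothetical data

z <- (x - mean(x)) / sd(x)   # standardized values z_i = (x_i - mean) / s

mean(z)  # 0, up to rounding error
var(z)   # 1, up to rounding error
```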
 
18
Each argument provided to a function is of a certain type, for example a vector or a data.frame as we discussed before. R uses this type information to determine what type of plot to produce.
 
19
The bins don’t always correspond to exactly the number you put in, because of the way R runs its algorithm to break up the data, but it gives you generally what you want. If you want more control over the exact breakpoints between bins, you can be more precise with the breaks option and give it a vector of breakpoints.
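The difference between a suggested number of bins and an explicit vector of breakpoints can be sketched as follows. The data are made up, and plot = FALSE is used only so that the example returns the bin information without drawing a figure:

```r
x <- c(1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.0, 4.4, 5.9, 7.6)  # hypothetical data

# A single number is only a suggestion; R chooses "pretty" breakpoints.
h_suggested <- hist(x, breaks = 4, plot = FALSE)
h_suggested$breaks  # the breakpoints R actually used

# A vector of breakpoints is used exactly as given.
h_exact <- hist(x, breaks = c(0, 2, 4, 6, 8), plot = FALSE)
h_exact$breaks      # 0 2 4 6 8
h_exact$counts      # 1 6 2 1
```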
 
20
The interpretation of densities is different from the interpretation of histograms.
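One way to see the difference, sketched on made-up data with deliberately unequal bin widths: counts give the number of observations per bin, whereas density heights are scaled so that the total area of the bars equals 1.

```r
x <- c(1.2, 2.5, 2.7, 3.1, 3.3, 3.8, 4.0, 4.4, 5.9, 7.6)  # hypothetical data

h <- hist(x, breaks = c(0, 2, 4, 8), plot = FALSE)  # unequal bin widths

h$counts                         # 1 6 3: observations per bin
h$density                        # 0.05 0.30 0.075: count / (n * bin width)
sum(h$density * diff(h$breaks))  # 1: total area under the bars
```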
 
21
This makes sense: (a) people often do not respond in exactly the same way, and (b) the value for dim2 differs as well.
 
22
Obviously, you can use operators such as + and *.
 
23
We will discuss the command rnorm in more detail in Sect. 4.8.1.
 
References
T. Baguley, M. Andrews, Handling missing data. Modern Statistical Methods for HCI (Springer, Berlin, 2016), pp. 57–82
M.J. Crawley, The R Book (Wiley, Hoboken, 2012)
L. Ippel, Getting started with [r]; a brief introduction. Modern Statistical Methods for HCI (Springer, Berlin, 2016), pp. 19–35
M.C. Kaptein, R. Van Emden, D. Iannuzzi, Tracking the decoy: maximizing the decoy effect through sequential experimentation. Palgrave Commun. 2(1), 1–9 (2016)
G. Norman, Likert scales, levels of measurement and the “laws” of statistics. Adv. Health Sci. Educ. 15(5), 625–632 (2010)
T. Rahlf, Data Visualisation with R: 100 Examples (Springer, Berlin, 2017)
D. Vidotto, J.K. Vermunt, M.C. Kaptein, Multiple imputation of missing categorical data using latent class models: state of the art. Psychol. Test Assess. Model. 57(4), 542 (2015)
J. Young, J. Wessnitzer, Descriptive statistics, graphs, and visualisation. Modern Statistical Methods for HCI (Springer, Berlin, 2016), pp. 37–56
Metadata
Title
A First Look at Data
Written by
Maurits Kaptein
Edwin van den Heuvel
Copyright year
2022
DOI
https://doi.org/10.1007/978-3-030-10531-0_1