
2022 | OriginalPaper | Book chapter

4. Random Variables and Distributions

Written by: Maurits Kaptein, Edwin van den Heuvel

Published in: Statistics for Data Scientists

Publisher: Springer International Publishing


Abstract

In the first chapter we discussed the calculation of some statistics that could be useful to summarize the observed data. In Chap. 2 we explained sampling approaches for the proper collection of data from populations. We demonstrated, using the appropriate statistics, how we may extend our conclusions beyond our sample to our population. Probability sampling required reasoning with probabilities, and we provided a more detailed description of this topic in Chap. 3. The topic of probability seems distant from the type of data that we looked at in the first chapter, but we did show how probability is related to measures of effect size for binary data. We will continue discussing real-world data in this chapter, but to do so we will need to make one more theoretical step. We will need to go from distinct events to dealing with more abstract random variables. This allows us to extend our theory on probability to other types of data without restricting it to specific events (i.e., binary data). Thus, this chapter will introduce random variables so that we can talk about continuous and discrete data. Random variables are directly related to the data that we collect from the population, a relationship we explore in depth. Subsequently we will discuss the distributions of random variables. Distributions relate probabilities to outcomes of random variables. We will discuss distributions for discrete random variables and for continuous random variables separately. In each case we will introduce several well-known distributions. In both cases we will also discuss properties of the random variables: we will explain their expected value, variance, and moments. These properties provide summaries of the population. They are closely related to the mean, variance, skewness, and kurtosis we discussed in Chaps. 1 and 2. However, we will only close the circle, from data to theory and back to data, in the next chapter.


Footnotes
1
Although the subscript notation that we introduce here is often used, in some contexts the notation \(f( \cdot | \boldsymbol{\theta })\) is preferred to make explicit that the distribution function is conditional on the parameters (for example in Chap. 8 of this book). In the current chapter we will, however, use the subscript notation.
 
2
Although in general we like to denote the parameters of a PDF with \(\boldsymbol{\theta }=(\theta _1, \theta _2,\ldots ,\theta _m)^T\), for many specific PDFs other notation is used. For the normal PDF we should have used \(\boldsymbol{\theta }=(\theta _1,\theta _2)^T\), with \(\theta _1=\mu \) and \(\theta _2=\sigma \), but \(\mu \) and \(\sigma \) are more common in the literature.
 
3
Here we assume the existence of an infinite population of measurement errors having the normal PDF from which one error e is randomly sampled when we conduct one measurement of the quantity. This error is then added to the true value \(\eta \) to obtain a measurement \(x=\eta +e\) of the quantity or a reading of the unit.
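The measurement model in this footnote can be sketched as a small simulation. This is only an illustration in Python (not the book's own code); the true value `eta`, the error standard deviation `sigma`, the seed, and the helper `measure` are all hypothetical choices made for the example.

```python
import random
import statistics


def measure(eta, sigma, rng):
    """Return one reading x = eta + e, where the error e is drawn from a
    normal PDF with mean 0 and standard deviation sigma (assumed values)."""
    e = rng.gauss(0.0, sigma)
    return eta + e


rng = random.Random(1)    # fixed seed so the sketch is reproducible
eta, sigma = 120.0, 5.0   # hypothetical true value and error standard deviation

# Repeating the measurement many times mimics sampling errors from the
# (conceptually infinite) population of measurement errors.
readings = [measure(eta, sigma, rng) for _ in range(10_000)]
mean_reading = statistics.fmean(readings)
sd_reading = statistics.stdev(readings)
```

Under these assumptions the average reading should lie close to `eta` and the spread of the readings close to `sigma`, because the errors have mean zero.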
 
4
Note that we use the same notation \(f_{\mu , \sigma }\) for the normal PDF and lognormal PDF. This does not mean that the normal and lognormal PDFs are equal, but we did not want to use a different letter every time we introduce a new PDF. We believe that this does not lead to confusion, since we always mention which PDF we refer to.
 
5
Now we know, with all our knowledge of probability and statistics, that a summary computed from the observations, such as the arithmetic average, is in most cases better than just selecting one of them.
 
6
There is actually, and perhaps surprisingly, quite an active debate surrounding the definition of a random variable. A definition that is more mathematical but might still be accessible is the following: “A random variable is a mapping (i.e., a function) from events to the real number line”. This definition allows us to mathematically link the material in Chap. 3—where we discussed events—to the material presented in this chapter. However, this definition is sometimes perceived as confusing as it does not contain any reference to random processes or outcomes.
 
7
In the analysis of life tables it is much more common to calculate the probability of surviving beyond a specific age x, i.e., \(\Pr (X>x)\), which is of course equal to \(1-\Pr (X\le x)\), as we discussed in Chap. 3.
 
8
Discrete does not always mean that we observe values in \(\mathbb {N}\). For instance, grades on a data science test may take values in \(\{1, 1.5, 2.0, 2.5, \ldots , 9.0, 9.5, 10\}\). Thus, it would be more rigorous to say that a discrete random variable X takes its values in the set \(\{x_0, x_1, x_2, \ldots , x_k, \ldots \}\), with \(x_{k}\) an element of the real line (\(x_{k}\in \mathbb {R}\)) and with an ordering of the values \(x_{0}<x_{1}<x_{2}<\cdots \). However, in many practical settings we can map this set to a subset of \(\mathbb {N}\) or to the whole set \(\mathbb {N}\).
 
9
In the more general setting, the probability can be defined as \(P\left( X=x_{k}\right) =p_{k}\).
 
10
If the set is \(\{x_0, x_1, x_2, \ldots , x_k, \ldots \}\), with \(x_{0}<x_{1}<x_{2}<\cdots \), then the CDF is defined as \(F(x)=\sum _{k=0}^{m_x} f(x_k)\), with \(m_x\) the largest value for k that satisfies \(x_k\le x\).
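A minimal sketch of this general CDF, written in Python for illustration: the helper `make_cdf` is a hypothetical name, and the uniform probabilities placed on the grade values from footnote 8 are an assumption made purely for the example, not data from the book.

```python
from bisect import bisect_right
from itertools import accumulate


def make_cdf(values, probs):
    """Build F(x) = sum of f(x_k) over all support points x_k <= x.

    `values` must be sorted ascending (x_0 < x_1 < ...), and `probs[k]`
    is the probability f(x_k) attached to `values[k]`.
    """
    cumulative = list(accumulate(probs))

    def cdf(x):
        # bisect_right counts the support points x_k with x_k <= x,
        # so the largest qualifying index is m_x = count - 1.
        count = bisect_right(values, x)
        return cumulative[count - 1] if count > 0 else 0.0

    return cdf


# Grades 1, 1.5, 2.0, ..., 10 (19 values), with a hypothetical uniform PMF.
grades = [1 + 0.5 * k for k in range(19)]
probs = [1 / 19] * 19
F = make_cdf(grades, probs)
```

For example, `F(5.7)` sums the probabilities of the ten grades from 1 up to 5.5, and `F(x)` is 0 for any `x` below the smallest support point.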
 
11
Many more distribution functions are known and often used and studied; we present only a small selection.
 
12
Note that the -norm functions in R (such as rnorm) take the standard deviation, not the variance, as their argument. You can always type ?rnorm to see the exact arguments.
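The same convention appears in Python's standard library, shown here only as an illustrative parallel to the R functions: `statistics.NormalDist` is parameterized by the standard deviation `sigma`, so a variance has to be converted with a square root first. The variance value 4.0 is an arbitrary example.

```python
import math
from statistics import NormalDist

variance = 4.0  # arbitrary example value

# Correct: pass the standard deviation, i.e. the square root of the variance.
dist = NormalDist(mu=0.0, sigma=math.sqrt(variance))

# Common mistake: passing the variance itself gives a distribution whose
# variance is 16, not 4.
wrong = NormalDist(mu=0.0, sigma=variance)
```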
 
13
Yes, you are correct, practice is more complicated since a man and a woman may share a household and therefore their weights may be related.
 
Metadata
Title
Random Variables and Distributions
Written by
Maurits Kaptein
Edwin van den Heuvel
Copyright year
2022
DOI
https://doi.org/10.1007/978-3-030-10531-0_4