2020 | OriginalPaper | Book chapter

3. Databases in R

Written by: Alfonso Zamora Saiz, Carlos Quesada González, Lluís Hurtado Gil, Diego Mondéjar Ruiz

Published in: An Introduction to Data Analysis in R

Publisher: Springer International Publishing

Abstract

Prior to any data analysis, it is fundamental to be able to handle different sources and formats of information, such as files or web sites, and it is equally important to understand how to transform and manipulate all kinds of data so as to prepare everything correctly for a statistical analysis. This chapter is divided into two parts. The first deals with the diversity of environments for data sources, ranging from the import of structured data and the use of APIs to the more advanced use of scraping tools for cases in which the data is not prepared to be downloaded explicitly. The second discusses advanced features for transforming raw data into ready-to-analyze tables, with a special focus on the exceptionally fast data.table.

Footnotes
1
Distributed computing is a model in which components of a software system are located on different networked computers, as in cloud computing.
 
2
Researchers around the world keep improving the implementation of specific algorithms to take full advantage of the very nature of distributed computation.
 
3
Technically speaking, RStudio automatically saves the session when closing so the work is actually preserved, but it is not stored as an external editable file, just as a snapshot of the session.
 
4
The comma is used as a decimal separator in many countries in the world, see https://en.wikipedia.org/w/index.php?title=Decimal_separator&oldid=932234568.
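As a minimal sketch of what this implies when importing data, base R's read.csv2() assumes a semicolon as field separator and a comma as decimal separator. The inline text below is a made-up example, not data from the chapter.

```r
# Hypothetical semicolon-separated data using the comma as decimal separator
raw_text <- "city;temperature\nMadrid;21,5\nSeville;27,3"

# read.csv2() defaults to sep = ";" and dec = ","
temps <- read.csv2(text = raw_text)

# The same import with the general function and explicit arguments
temps_explicit <- read.table(text = raw_text, header = TRUE, sep = ";", dec = ",")

# The temperature column is parsed as numeric rather than as text
str(temps)
```

If the decimal separator is not declared, a value such as 21,5 is read as text, so any later arithmetic on that column fails.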
 
5
Using different libraries will be a constant throughout the book. Whenever a new package is introduced, it is understood that it should be installed first, even if this is not specified. See Sect. 2.1.
 
6
Here the f in fread and fwrite stands for fast.
 
7
As of 2019 the community seems to agree that the package jsonlite is a bit better than rjson or RJSONIO.
 
17
PHP originally stood for Personal Home Page and now stands for PHP: Hypertext Preprocessor. It is a scripting language for web pages.
 
18
Often these are the so-called RESTful APIs, chosen for their advantages and ease of use.
 
19
URL stands for Uniform Resource Locator and corresponds to what is usually called web address or link.
 
20
All data available in this subsection has been kindly provided by The OpenSky Network, https://opensky-network.org. See also the original OpenSky paper [10].
 
24
Make sure to run this code while matches are being played; otherwise, an empty file will be generated.
 
27
Tickers are not provided in the API documentation, but can be easily found at www.quandl.com or with a web search.
 
35
TheSportsDB is quite a stable website, but the tags might change with time. Should this happen, the reader can download the snapshot of the website as it was when the book was written from our data repository at https://github.com/DataAR/Data-Analysis-in-R/tree/master/webs. Then replace https://www.thesportsdb.com/season.php?l=4387&s=1920 by the path to the downloaded .html file on your computer.
 
36
Any other match can also be clicked to obtain the equivalent information, but with a different part of the code highlighted.
 
37
The aim here is not to learn HTML, only to be able to extract information using tags. For more information on tags, check https://www.w3schools.com/html/html_elements.asp.
 
38
These strings control spacing in the .html source. For example, \n corresponds to a new line and \t to a tab.
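A short base R sketch of how such control characters can be cleaned from scraped text follows. The strings are invented for illustration and are not taken from the chapter's web pages.

```r
# Hypothetical scraped string padded with newline and tab characters
scraped <- "\n\t\tTeam A\n\t"

# trimws() strips leading and trailing whitespace, including \n and \t
clean <- trimws(scraped)

# gsub() replaces runs of \n and \t inside a string with a single space
inner <- gsub("[\n\t]+", " ", "Team\n\tA")

print(clean)
print(inner)
```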
 
39
As before, Goodreads is a very stable website, but the tags might change with time. Should this happen, the reader can download the snapshot of the website as it was when the book was written from our repository at https://github.com/DataAR/Data-Analysis-in-R/tree/master/webs. Then replace "https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century" by the path to the downloaded .html file on your computer.
 
41
The icao24 is a permanent hexadecimal code that identifies every aircraft.
 
43
They can also be coerced with as.data.table() but the command setDT() is faster and uses less memory.
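The difference can be sketched as follows, assuming the data.table package is installed. The data frame scores is a made-up example. as.data.table() builds a new object, while setDT() converts the existing one by reference, which is why it avoids the copy.

```r
library(data.table)  # assumed to be installed

# Hypothetical data frame used only for illustration
scores <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))

# as.data.table() returns a new data.table and leaves `scores` untouched
scores_copy <- as.data.table(scores)
was_data_table <- is.data.table(scores)   # still a plain data.frame here

# setDT() converts `scores` itself, by reference, without copying the columns
setDT(scores)

class(scores)
```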
 
44
Not having row names might be confusing at first but it is really an advantage. Either the row name is just the index which is irrelevant information and there is no need to keep it, or it is meaningful information and then it should be treated as any other variable, with the same column status.
 
45
Type ?swiss in the console for details.
 
46
The variable Education represents the percentage of military draftees who got an education beyond primary school.
 
47
In this case, the thresholds for the variables have been chosen arbitrarily, so they might be misleading. The analyst's prior knowledge of the subject should guide this kind of choice.
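For illustration only, a threshold-based selection on the built-in swiss dataset can be sketched in base R as below. The cut-off values are hypothetical and are not the ones used in the chapter, which works with data.table syntax.

```r
# swiss ships with base R (package datasets)
data(swiss)

# Arbitrary, illustrative thresholds (assumptions, not the chapter's values)
education_min <- 10
fertility_max <- 70

# Keep the provinces above the education threshold and below the fertility one
selected <- swiss[swiss$Education > education_min &
                  swiss$Fertility < fertility_max, ]

nrow(selected)
```

Changing either threshold changes which provinces are kept, which is why the choice should rest on subject knowledge rather than convenience.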
 
48
Recall from Sect. 2.2.1 that NULL stands for the empty object.
 
49
Recall from Sect. 2.2.1 the meaning of the NA object, which reserves a place in tables for entries that are not available.
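The contrast between the two objects can be seen in a small base R sketch with made-up vectors: NA occupies a position in a vector, whereas NULL disappears when combined with other values.

```r
# NA reserves a position: the vector still has three elements
x <- c(4.2, NA, 7.1)
length(x)
is.na(x)

# NA propagates through computations unless it is explicitly removed
mean(x)
mean(x, na.rm = TRUE)

# NULL is the empty object: it vanishes when combined with other values
y <- c(4.2, NULL, 7.1)
length(y)
length(NULL)
```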
 
50
The ranking is open to user ratings and the displayed table might change with time.
 
51
The three variables are not the same, and in some cases they have very different values, but only the last one is kept for the sake of simplicity.
 
52
This date in Unix timestamp is 1548028800.
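A Unix timestamp counts the seconds elapsed since 1970-01-01 00:00:00 UTC. A minimal base R sketch of the conversion in both directions, assuming the UTC time zone, is:

```r
# Unix timestamp quoted in the footnote
ts <- 1548028800

# Convert seconds since the epoch into a date-time in UTC
date_utc <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
format(date_utc, "%Y-%m-%d %H:%M:%S")   # "2019-01-21 00:00:00"

# Convert back from the date-time to the number of seconds
back <- as.numeric(date_utc)
```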
 
53
In fact, more preprocessing is possible (see the Exercises), but this is enough to start the analysis with confidence.
 
54
SQL is a language designed for database handling, but not for data analysis. Given a database, it constructs an ad hoc predefined hierarchy for it. Since this structure is created specifically for that database, accessing data or editing single fields is very fast. However, the rigid structure suffers if big changes are made to the database or if large-scale editing is the goal, two major drawbacks in modern data analysis.
 
References
4.
EMC Education Services. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Wiley, New York, USA, 2015.
5.
García, S., Luengo, J. and Herrera, F. Data Preprocessing in Data Mining. Springer Publishing Company, Incorporated, New York, USA, 2014.
6.
Gibbons, A. and Rytter, W. Efficient Parallel Algorithms. Cambridge University Press, Cambridge, USA, 1988.
8.
Pyle, D. Data Preparation for Data Mining. Morgan Kaufmann Publishers Inc., California, USA, 1999.
10.
Schäfer, M., Strohmeier, M., Lenders, V., Martinovic, I. and Wilhelm, M. Bringing up OpenSky: A large-scale ADS-B sensor network for research. Proceedings of the 13th IEEE/ACM International Symposium on Information Processing in Sensor Networks (IPSN), pages 83–94, 2014.
11.
Sedgewick, R. Algorithms in C++, Parts 1–4: Fundamentals, Data Structures, Sorting, Searching. Addison Wesley Professional, Massachusetts, USA, 1999.
12.
Wickham, H. R Packages: Organize, Test, Document, and Share Your Code. O’Reilly Media, California, USA, 2015.
13.
Wickham, H. and Grolemund, G. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc., California, USA, 2017.
Metadata
Title
Databases in R
Written by
Alfonso Zamora Saiz
Carlos Quesada González
Lluís Hurtado Gil
Diego Mondéjar Ruiz
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-48997-7_3
