
2017 | Original Paper | Book Chapter

Towards Automatic Data Format Transformations: Data Wrangling at Scale

Written by: Alex Bogatu, Norman W. Paton, Alvaro A. A. Fernandes

Published in: Data Analytics

Publisher: Springer International Publishing


Abstract

Data wrangling is the process whereby data is cleaned and integrated for analysis. Even with tool support, it is typically a labour-intensive process. One aspect of data wrangling involves carrying out format transformations on attribute values, for example so that names or phone numbers are represented consistently. Recent research has developed techniques for synthesising format transformation programs from examples of the source and target representations. This is valuable, but still requires a user to provide suitable examples, which may be challenging in applications with huge data sets or numerous data sources. In this paper we investigate the automatic discovery of examples that can be used to synthesise format transformation programs. In particular, we propose an approach to identifying candidate data examples and validating the transformations that are synthesised from them. The approach is evaluated empirically using data sets from open government data.


Footnotes
2
The complexity of Algorithm 1 is \(\mathcal{O}(nm)\), where \(n\) is the number of attributes of S and \(m\) is the number of attributes of T. This is due to the cross product between the columns of the two data sets (i.e. the two for loops at the beginning of the algorithm). We do not analyse here the complexity of the other algorithms used in our experiments, as this has been done in the original papers. Nor do we examine the impact of input size on the overall solution. In our experiments, the run-time of Algorithm 1, which covers example generation alone, did not exceed one second for any of the data sets used.
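The cross product behind the \(\mathcal{O}(nm)\) bound can be illustrated with a minimal Python sketch. This is not Algorithm 1 itself, which is not reproduced on this page; it only shows the two nested loops over attributes. The function name `count_pairwise_checks`, the predicate `similar_shape` and the sample columns are all hypothetical and chosen purely for illustration.

```python
def has_digit(values):
    """True if any value in the column contains a digit."""
    return any(ch.isdigit() for v in values for ch in v)


def similar_shape(s_values, t_values):
    """Placeholder pairing test (an assumption, not the paper's criterion):
    both columns contain digits, or neither does."""
    return has_digit(s_values) == has_digit(t_values)


def count_pairwise_checks(source_columns, target_columns, is_candidate):
    """Compare every source attribute with every target attribute.

    Returns the accepted (source, target) name pairs and the number of
    predicate calls, which is always n * m for n source and m target attributes.
    """
    calls = 0
    pairs = []
    for s_name, s_values in source_columns.items():      # n iterations
        for t_name, t_values in target_columns.items():  # m iterations each
            calls += 1
            if is_candidate(s_values, t_values):
                pairs.append((s_name, t_name))
    return pairs, calls


# Hypothetical source data set S (n = 3) and target data set T (m = 2).
S = {
    "name": ["A. Smith", "B. Jones"],
    "phone": ["0161 123", "0161 456"],
    "city": ["Leeds", "York"],
}
T = {
    "full_name": ["Smith, A.", "Jones, B."],
    "tel": ["+44 161 123", "+44 161 456"],
}

pairs, calls = count_pairwise_checks(S, T, similar_shape)
print(calls)  # number of column comparisons, here 3 * 2
print(pairs)
```

The number of comparisons depends only on the attribute counts, not on the number of rows. The cost of each individual comparison is not specified in the footnote and is treated here as a black box.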
 
References
1. Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: ICDE 2013, pp. 458–469 (2013)
2. Fan, W.: Dependencies revisited for improving data quality. In: PODS 2008, pp. 159–170, 9–11 June 2008
3. Fan, W.: Data quality: from theory to practice. SIGMOD Rec. 44(3), 7–18 (2015)
4. Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. VLDB J. 21(2), 213–238 (2012)
5. Furche, T., Gottlob, G., Libkin, L., Orsi, G., Paton, N.W.: Data wrangling for big data: challenges and opportunities. In: EDBT, pp. 473–478 (2016)
6. Gulwani, S.: Automating string processing in spreadsheets using input-output examples. In: POPL, pp. 317–330 (2011)
7. Heer, J., Hellerstein, J.M., Kandel, S.: Predictive interaction for data transformation. In: CIDR 2015, 4–7 January 2015
8. Kandel, S., Paepcke, A., Hellerstein, J., Heer, J.: Wrangler: interactive visual specification of data transformation scripts. In: CHI, pp. 3363–3372 (2011)
9. Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In: SIGMOD Conference, pp. 821–833. ACM (2016)
10. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
11. Raman, V., Hellerstein, J.M.: Potter’s wheel: an interactive data cleaning system. In: VLDB 2001, pp. 381–390, 11–14 September 2001
12. Singh, R.: BlinkFill: semi-supervised programming by example for syntactic string transformations. PVLDB 9(10), 816–827 (2016)
13. Jia, X., Fan, W., Geerts, F., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. TODS 33(1), 6:1–6:48 (2008)
14. Wu, B., Knoblock, C.A.: An iterative approach to synthesize data transformation programs. In: IJCAI 2015, pp. 1726–1732, 25–31 July 2015
15. Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. PVLDB 4(5), 279–289 (2011)
Metadata
Title
Towards Automatic Data Format Transformations: Data Wrangling at Scale
Written by
Alex Bogatu
Norman W. Paton
Alvaro A. A. Fernandes
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-60795-5_4
