Skip to main content
Erschienen in: The VLDB Journal 4/2019

05.01.2019 | Regular Paper

Parametric schema inference for massive JSON datasets

verfasst von: Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani

Erschienen in: The VLDB Journal | Ausgabe 4/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In recent years, JSON established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences as well: Data analysts and programmers cannot exploit a schema for a reliable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric as the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report about an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
We are here ignoring empty structural types, which are record types where one mandatory field has type \(\emptyset \), since they are never inferred, and we could even forbid them in the syntax.
 
9
The inferred types for each dataset are reported in [19].
 
10
We may be more formal, as follows: Consider n keys and a space where every point is a set of shapes, that is, a set of subsets of https://static-content.springer.com/image/art%3A10.1007%2Fs00778-018-0532-7/MediaObjects/778_2018_532_IEq445_HTML.gif . In this setting, every \({\mathcal {L}}\)-reduced type exactly indicates one point of a space whose size is \(2^{2^n}\); hence, each \({\mathcal {L}}\)-reduced type brings exactly the same amount of information: \(2^n\) bits. On the other side, a \({\mathcal {K}}\)-reduced type is, in general, compatible with many different points in this space; hence, it brings a lower number of bits, which depends on the number of optional keys, and may be computed for each specific \({\mathcal {K}}\)-reduced type. We may compare this number with \(2^n\) in order to mathematically quantify the information gain. We do not pursue this avenue because this model embeds the unrealistic idea that every distribution of shapes has the same probability, and because we do not believe that this model, although mathematically coherent, is a useful model of the information needs of the data analyst.
 
11
In the case of VK, we multiplied the original datasets 4, 6, ..., 20 times to reach a minimum size of 100 GB as the largest size.
 
Literatur
2.
Zurück zum Zitat Baazizi, M.A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: EDBT ’17 (2017) Baazizi, M.A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: EDBT ’17 (2017)
3.
Zurück zum Zitat Baazizi, M.A., Bidoit, N., Colazzo, D., Malla, N., Sahakyan, M.: Projection for XML update optimization. In: EDBT ’11, pp. 307–318 (2011) Baazizi, M.A., Bidoit, N., Colazzo, D., Malla, N., Sahakyan, M.: Projection for XML update optimization. In: EDBT ’11, pp. 307–318 (2011)
4.
Zurück zum Zitat Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: DBPL ’17 (2017) Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: DBPL ’17 (2017)
6.
Zurück zum Zitat Benzaken, V., Castagna, G., Colazzo, D., Nguyên, K.: Type-based XML projection. In: VLDB ’06, pp. 271–282 (2006) Benzaken, V., Castagna, G., Colazzo, D., Nguyên, K.: Type-based XML projection. In: VLDB ’06, pp. 271–282 (2006)
7.
Zurück zum Zitat Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB ‘06, pp. 115–126 (2006) Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB ‘06, pp. 115–126 (2006)
8.
Zurück zum Zitat Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011) Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011)
9.
Zurück zum Zitat Bonetta, D., Brantner, M.: Fad.js: fast JSON data access using JIT-based speculative optimizations. PVLDB 10(12), 1778–1789 (2017) Bonetta, D., Brantner, M.: Fad.js: fast JSON data access using JIT-based speculative optimizations. PVLDB 10(12), 1778–1789 (2017)
10.
Zurück zum Zitat Bourhis, P., Reutter, J.L., Suárez, F., Vrgoc, D.: JSON: data model, query languages and schema specification. In: PODS ’17, pp. 123–135 (2017) Bourhis, P., Reutter, J.L., Suárez, F., Vrgoc, D.: JSON: data model, query languages and schema specification. In: PODS ’17, pp. 123–135 (2017)
12.
Zurück zum Zitat Cebiric, S., Goasdoué, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015) Cebiric, S., Goasdoué, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015)
13.
Zurück zum Zitat Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. In: DBPL ‘13 (2013) Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. In: DBPL ‘13 (2013)
14.
Zurück zum Zitat Colazzo, D., Ghelli, G., Sartiani, C.: Typing massive JSON datasets. In: XLDI ’12, Affiliated with ICFP (2012) Colazzo, D., Ghelli, G., Sartiani, C.: Typing massive JSON datasets. In: XLDI ’12, Affiliated with ICFP (2012)
15.
Zurück zum Zitat DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Özcan, F., Koutrika, G., Madden, S. (eds.) SIGMOD ’16, pp. 295–310. ACM (2016) DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Özcan, F., Koutrika, G., Madden, S. (eds.) SIGMOD ’16, pp. 295–310. ACM (2016)
16.
Zurück zum Zitat Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)MathSciNetCrossRefMATH Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)MathSciNetCrossRefMATH
17.
Zurück zum Zitat Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: SIGMOD ’00, pp. 165–176 (2000) Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: SIGMOD ’00, pp. 165–176 (2000)
18.
Zurück zum Zitat Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB’97, pp. 436–445 (1997) Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB’97, pp. 436–445 (1997)
23.
Zurück zum Zitat Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J., Kossmann, D.: Mison: a fast JSON parser for data analytics. PVLDB 10(10), 1118–1129 (2017) Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J., Kossmann, D.: Mison: a fast JSON parser for data analytics. PVLDB 10(10), 1118–1129 (2017)
24.
Zurück zum Zitat Liu, Z.H., Hammerschmidt, B., McMahon, D.: JSON data management: supporting schema-less development in RDBMS. In: SIGMOD ’14, pp. 1247–1258 (2014) Liu, Z.H., Hammerschmidt, B., McMahon, D.: JSON data management: supporting schema-less development in RDBMS. In: SIGMOD ’14, pp. 1247–1258 (2014)
25.
Zurück zum Zitat Lohrey, M., Maneth, S., Reh, C.P.: Compression of unordered XML trees. In: ICDT’07, pp. 18:1–18:17 (2017) Lohrey, M., Maneth, S., Reh, C.P.: Compression of unordered XML trees. In: ICDT’07, pp. 18:1–18:17 (2017)
26.
Zurück zum Zitat McHugh, J., Widom, J.: Query optimization for XML. In: VLDB ’99, pp. 315–326. Morgan Kaufmann Publishers Inc. (1999) McHugh, J., Widom, J.: Query optimization for XML. In: VLDB ’99, pp. 315–326. Morgan Kaufmann Publishers Inc. (1999)
27.
Zurück zum Zitat Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)CrossRef Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)CrossRef
28.
Zurück zum Zitat Nestorov, S., Abiteboul, S., Motwani, R.: Inferring structure in semistructured data. SIGMOD Rec. 26(4), 39–43 (1997)CrossRef Nestorov, S., Abiteboul, S., Motwani, R.: Inferring structure in semistructured data. SIGMOD Rec. 26(4), 39–43 (1997)CrossRef
29.
Zurück zum Zitat Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: SIGMOD ’98, pp. 295–306 (1998) Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: SIGMOD ’98, pp. 295–306 (1998)
30.
Zurück zum Zitat Pezoa, F., Reutter, J.L., Suarez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In: WWW ’16, pp. 263–273 (2016) Pezoa, F., Reutter, J.L., Suarez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In: WWW ’16, pp. 263–273 (2016)
31.
Zurück zum Zitat Scherzinger, S., de Almeida, E.C., Cerqueus, T., de Almeida, L.B., Holanda, P.: Finding and fixing type mismatches in the evolution of object-nosql mappings. In: Proceedings of the Workshops of the EDBT/ICDT 2016 (2016) Scherzinger, S., de Almeida, E.C., Cerqueus, T., de Almeida, L.B., Holanda, P.: Finding and fixing type mismatches in the evolution of object-nosql mappings. In: Proceedings of the Workshops of the EDBT/ICDT 2016 (2016)
36.
Zurück zum Zitat Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wangz, C.: Schema management for document stores. Proc. VLDB Endow. 8(9), 922–933 (2015)CrossRef Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wangz, C.: Schema management for document stores. Proc. VLDB Endow. 8(9), 922–933 (2015)CrossRef
Metadaten
Titel
Parametric schema inference for massive JSON datasets
verfasst von
Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Publikationsdatum
05.01.2019
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 4/2019
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-018-0532-7

Weitere Artikel der Ausgabe 4/2019

The VLDB Journal 4/2019 Zur Ausgabe

Premium Partner