
The VLDB Journal


05.01.2019 | Regular Paper

Parametric schema inference for massive JSON datasets

By: Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani

Published in: The VLDB Journal | Issue 4/2019


Abstract

In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a dependable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
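The idea of a parametric, parallelizable inference can be illustrated with a minimal sketch. It is not the authors' Spark implementation: the type encoding, the function names (`infer`, `fuse`, `merge`, `class_key`, `show`) and the way the parameter is passed are assumptions made for illustration. Each JSON value is first mapped to a type, and types are then combined pairwise by a fusion operator; the parameter `eq` decides which record types get merged: `"K"` merges all record types into one (concise), while `"L"` merges only record types that have the same set of keys (precise). Because the fusion in this sketch is commutative and associative, the reduction could be distributed, for example with a map followed by a reduce.

```python
import json
from functools import reduce

# Assumed encoding: a type is a frozenset of alternatives (a union). Each alternative is
#   ("Null",), ("Bool",), ("Num",), ("Str",)
#   ("Arr", element_type)                         element_type is itself a union
#   ("Rec", ((key, field_type, optional), ...))   fields sorted by key

def class_key(alt, eq):
    """Alternatives with the same class key are merged; eq selects the equivalence."""
    if alt[0] == "Rec" and eq == "L":
        return ("Rec", tuple(k for k, _, _ in alt[1]))  # L: same set of keys
    return (alt[0],)                                    # K: same kind

def merge(a, b, eq):
    """Merge two alternatives that belong to the same equivalence class."""
    if a[0] == "Arr":
        return ("Arr", fuse(a[1], b[1], eq))
    if a[0] == "Rec":
        fa = {k: (t, o) for k, t, o in a[1]}
        fb = {k: (t, o) for k, t, o in b[1]}
        fields = []
        for k in sorted(fa.keys() | fb.keys()):
            if k in fa and k in fb:   # key on both sides: fuse the field types
                fields.append((k, fuse(fa[k][0], fb[k][0], eq), fa[k][1] or fb[k][1]))
            else:                     # key on one side only: it becomes optional
                t, _ = fa.get(k) or fb.get(k)
                fields.append((k, t, True))
        return ("Rec", tuple(fields))
    return a                          # basic types: nothing to merge

def fuse(t1, t2, eq):
    """Reduce step: union of two types, merging equivalent alternatives."""
    merged = {}
    for alt in t1 | t2:
        key = class_key(alt, eq)
        merged[key] = merge(merged[key], alt, eq) if key in merged else alt
    return frozenset(merged.values())

def infer(value, eq):
    """Map step: the type of a single JSON value."""
    if value is None:
        return frozenset({("Null",)})
    if isinstance(value, bool):       # must precede the number test: bool is an int
        return frozenset({("Bool",)})
    if isinstance(value, (int, float)):
        return frozenset({("Num",)})
    if isinstance(value, str):
        return frozenset({("Str",)})
    if isinstance(value, list):
        elem = reduce(lambda a, b: fuse(a, b, eq), (infer(v, eq) for v in value), frozenset())
        return frozenset({("Arr", elem)})
    fields = tuple((k, infer(v, eq), False) for k, v in sorted(value.items()))
    return frozenset({("Rec", fields)})

def infer_schema(lines, eq):
    return reduce(lambda a, b: fuse(a, b, eq), (infer(json.loads(l), eq) for l in lines))

def show(t):
    """Render a type; '?' marks optional fields, '+' separates union alternatives."""
    def alt(a):
        if a[0] == "Arr":
            return "[" + show(a[1]) + "]"
        if a[0] == "Rec":
            return "{" + ", ".join(f"{k}{'?' if o else ''}: {show(ft)}" for k, ft, o in a[1]) + "}"
        return a[0]
    return " + ".join(sorted(alt(a) for a in t))

docs = ['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}', '{"a": null, "c": true}']
print(show(infer_schema(docs, "K")))  # {a: Null + Num, b?: Str, c?: Bool}
print(show(infer_schema(docs, "L")))  # {a: Null, c: Bool} + {a: Num, b: Str}
```

On this toy input, the `"K"` setting yields a single record type with optional fields, whereas the `"L"` setting keeps one record type per distinct set of keys, which is the precision versus conciseness trade-off the abstract refers to.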


Here we ignore empty structural types, that is, record types where one mandatory field has type \(\emptyset \): they are never inferred, and we could even forbid them in the syntax.

https://developer.nytimes.com.

https://dumps.wikimedia.org/wikidatawiki/entities/.

https://www.kaggle.com/borisch/russian-election-2018-twitter.

https://www.kaggle.com/borisch/russian-election-2018-vkcom-user-activity/feed.

https://vk.com/dev/streaming_api_docs_2.

https://core.ac.uk.

https://core.ac.uk/services#dump-structure.

The inferred types for each dataset are reported in [19].

We may be more formal, as follows: consider \(n\) keys and a space where every point is a set of shapes, that is, a set of subsets of the set of these \(n\) keys. In this setting, every \({\mathcal {L}}\)-reduced type indicates exactly one point of a space whose size is \(2^{2^n}\); hence, each \({\mathcal {L}}\)-reduced type carries exactly the same amount of information: \(2^n\) bits. On the other hand, a \({\mathcal {K}}\)-reduced type is, in general, compatible with many different points in this space; hence, it carries fewer bits, a number that depends on the number of optional keys and can be computed for each specific \({\mathcal {K}}\)-reduced type. We may compare this number with \(2^n\) in order to quantify the information gain mathematically. We do not pursue this avenue because this model embeds the unrealistic idea that every distribution of shapes has the same probability, and because we do not believe that this model, although mathematically coherent, is a useful model of the information needs of the data analyst.
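The counting argument in this note can be checked by brute force for small \(n\). The sketch below is only an illustration under stated assumptions: the helper names (`powerset`, `all_points`, `compatible_points`) are hypothetical, and the notion of a point being "compatible" with a \({\mathcal {K}}\)-reduced type is read here as "a non-empty set of shapes whose common keys are exactly the mandatory keys and whose keys, taken together, are exactly the mandatory plus optional keys". The note itself does not spell out this definition.

```python
from itertools import chain, combinations
from math import log2

def powerset(items):
    """All subsets of items, each as a frozenset."""
    items = list(items)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

def all_points(keys):
    """Every point of the space: a set of shapes, each shape being a subset of keys."""
    return powerset(powerset(keys))

def compatible_points(keys, mandatory, optional):
    """Points matching a record type with the given mandatory and optional keys
    (assumed reading: intersection of shapes == mandatory, union == mandatory | optional)."""
    m = frozenset(mandatory)
    u = m | frozenset(optional)
    return [p for p in all_points(keys)
            if p and frozenset.intersection(*p) == m and frozenset.union(*p) == u]

n_keys = "abc"                                   # n = 3 keys
total = len(all_points(n_keys))                  # 2 ** (2 ** 3) = 256 points
print(total, log2(total))                        # 256 8.0, i.e. 2 ** n bits
k_points = compatible_points(n_keys, "a", "bc")  # type with a mandatory, b and c optional
print(len(k_points))                             # 7 compatible points
print(log2(total / len(k_points)))               # bits carried under the uniform model
```

With three keys, a type that pins down a single point carries \(\log_2 256 = 8 = 2^3\) bits, while the record type with one mandatory and two optional keys leaves 7 points open and therefore carries fewer bits under the uniform model that the note describes and rejects.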

In the case of VK, we replicated the original dataset 4, 6, ..., 20 times, so that the largest version reaches a size of at least 100 GB.

1. Apache Spark. http://spark.apache.org

2. Baazizi, M.-A., Ben Lahmar, H., Colazzo, D., Ghelli, G., Sartiani, C.: Schema inference for massive JSON datasets. In: EDBT ’17 (2017)

3. Baazizi, M.-A., Bidoit, N., Colazzo, D., Malla, N., Sahakyan, M.: Projection for XML update optimization. In: EDBT ’11, pp. 307–318 (2011)

4. Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Counting types for massive JSON datasets. In: DBPL ’17 (2017)

5. Baazizi, M.-A., Colazzo, D., Ghelli, G., Sartiani, C.: Proofs for parametric schema inference for massive JSON datasets. Working paper or preprint (2018). https://hal.archives-ouvertes.fr/hal-01960464/

6. Benzaken, V., Castagna, G., Colazzo, D., Nguyên, K.: Type-based XML projection. In: VLDB ’06, pp. 271–282 (2006)

7. Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: VLDB ’06, pp. 115–126 (2006)

8. Beyer, K.S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M.Y., Kanne, C., Özcan, F., Shekita, E.J.: Jaql: a scripting language for large scale semistructured data analysis. PVLDB 4(12), 1272–1283 (2011)

9. Bonetta, D., Brantner, M.: Fad.js: fast JSON data access using JIT-based speculative optimizations. PVLDB 10(12), 1778–1789 (2017)

10. Bourhis, P., Reutter, J.L., Suárez, F., Vrgoč, D.: JSON: data model, query languages and schema specification. In: PODS ’17, pp. 123–135 (2017)

11. Bray, T.: The JavaScript object notation (JSON) data interchange format (2014). https://tools.ietf.org/html/rfc7159

12. Cebiric, S., Goasdoué, F., Manolescu, I.: Query-oriented summarization of RDF graphs. PVLDB 8(12), 2012–2015 (2015)

13. Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. In: DBPL ’13 (2013)

14. Colazzo, D., Ghelli, G., Sartiani, C.: Typing massive JSON datasets. In: XLDI ’12, Affiliated with ICFP (2012)

15. DiScala, M., Abadi, D.J.: Automatic generation of normalized relational schemas from nested key-value data. In: Özcan, F., Koutrika, G., Madden, S. (eds.) SIGMOD ’16, pp. 295–310. ACM (2016)

16. Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)

17. Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: SIGMOD ’00, pp. 165–176 (2000)

18. Goldman, R., Widom, J.: Dataguides: enabling query formulation and optimization in semistructured databases. In: VLDB ’97, pp. 436–445 (1997)

19. http://webia.lip6.fr/~baazizi/rs/js/vj18

20. JSON schema definition language. http://jsoniq.org/docs/JSound/html-single/

21. JSON schema language. http://json-schema.org

22. 3T Software Labs: Studio 3T (2017). https://studio3t.com

23. Li, Y., Katsipoulakis, N.R., Chandramouli, B., Goldstein, J., Kossmann, D.: Mison: a fast JSON parser for data analytics. PVLDB 10(10), 1118–1129 (2017)

24. Liu, Z.H., Hammerschmidt, B., McMahon, D.: JSON data management: supporting schema-less development in RDBMS. In: SIGMOD ’14, pp. 1247–1258 (2014)

25. Lohrey, M., Maneth, S., Reh, C.P.: Compression of unordered XML trees. In: ICDT ’17, pp. 18:1–18:17 (2017)

26. McHugh, J., Widom, J.: Query optimization for XML. In: VLDB ’99, pp. 315–326. Morgan Kaufmann Publishers Inc. (1999)

27. Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Technol. 5(4), 660–704 (2005)

28. Nestorov, S., Abiteboul, S., Motwani, R.: Inferring structure in semistructured data. SIGMOD Rec. 26(4), 39–43 (1997)

29. Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: SIGMOD ’98, pp. 295–306 (1998)

30. Pezoa, F., Reutter, J.L., Suárez, F., Ugarte, M., Vrgoč, D.: Foundations of JSON Schema. In: WWW ’16, pp. 263–273 (2016)

31. Scherzinger, S., de Almeida, E.C., Cerqueus, T., de Almeida, L.B., Holanda, P.: Finding and fixing type mismatches in the evolution of object-NoSQL mappings. In: Proceedings of the Workshops of the EDBT/ICDT 2016 (2016)

32. Schmidt, P.: mongodb-schema (2017). https://github.com/mongodb-js/mongodb-schema

33. scrapinghub: Skinfer (2015). https://github.com/scrapinghub/skinfer

34. Spark dataframe. https://spark.apache.org/docs/latest/sql-programming-guide.html

35. The JSON Query Language. http://www.jsoniq.org

36. Wang, L., Zhang, S., Shi, J., Jiao, L., Hassanzadeh, O., Zou, J., Wang, C.: Schema management for document stores. Proc. VLDB Endow. 8(9), 922–933 (2015)

Title: Parametric schema inference for massive JSON datasets
Authors: Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Publication date: 05.01.2019
Publisher: Springer Berlin Heidelberg
Published in: The VLDB Journal / Issue 4/2019
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-018-0532-7
