Abstract
Document stores, which offer the convenience of a schema-less interface, are widely used by developers of mobile and cloud applications. However, the simplicity that developers gain paradoxically makes data management more complex, because no schema is available. In this paper, we present a schema management framework for document stores. The framework discovers the schemas of JSON records, persists them in a repository, and supports queries and schema summarization. The major technical challenge comes from the varied structures of records, which result from the schema-less data model and from schema evolution. In the discovery phase, we apply a canonical-form-based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, the eSiBu-Tree, to store schemas and support queries. To present a single summarized representation of the heterogeneous schemas in records, we introduce the concept of a "skeleton" and propose to use it as a relaxed form of the schema that captures a small set of core attributes. Finally, extensive experiments on real data sets demonstrate the efficiency of the proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios illustrate the effectiveness of skeletons in these applications.
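To make the idea of grouping records by a canonical form more concrete, the following is a minimal Python sketch. It is not the paper's algorithm and does not implement the eSiBu-Tree or the equivalent sub-tree method; it only illustrates how an order-independent schema description lets records with the same structure fall into one group. The function names `canonical_schema` and `group_by_schema`, the type labels, and the treatment of arrays as a set of element schemas are all assumptions made for this illustration.

```python
from collections import defaultdict


def canonical_schema(value):
    """Return a hashable, order-independent description of a JSON value's structure.

    Hypothetical helper: the labels and the array handling are illustrative
    assumptions, not definitions taken from the paper.
    """
    if isinstance(value, dict):
        # Visit attributes in sorted name order so that the order in which
        # keys appear in a record does not affect the result.
        return ("object", tuple((k, canonical_schema(value[k])) for k in sorted(value)))
    if isinstance(value, list):
        # Assumption: an array is described by the distinct schemas of its elements.
        element_schemas = {canonical_schema(v) for v in value}
        return ("array", tuple(sorted(element_schemas, key=repr)))
    if value is None:
        return ("null",)
    if isinstance(value, bool):  # checked before numbers: bool is a subclass of int
        return ("boolean",)
    if isinstance(value, (int, float)):
        return ("number",)
    return ("string",)


def group_by_schema(records):
    """Map each canonical schema to the list of indices of records that share it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[canonical_schema(record)].append(index)
    return dict(groups)


records = [
    {"a": 1, "b": {"c": "x"}},
    {"b": {"c": "y"}, "a": 2},  # same structure as the first record, different key order
    {"a": 1},                   # different structure
]
groups = group_by_schema(records)
```

In this sketch the first two records land in the same group because their canonical forms are identical, while the third record forms its own group. A real schema repository would also need to store and query these forms efficiently, which is the role the abstract assigns to the eSiBu-Tree.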
Index Terms
- Schema management for document stores