research-article

Schema management for document stores

Published: 01 May 2015

Abstract

Document stores that provide the efficiency of a schema-less interface are widely used by developers in mobile and cloud applications. However, the simplicity that developers gain paradoxically makes data management more complex, because no schema is available. In this paper, we present a schema management framework for document stores. The framework discovers schemas of JSON records, persists them in a repository, and supports queries and schema summarization. The major technical challenge comes from the varied record structures caused by the schema-less data model and schema evolution. In the discovery phase, we apply a canonical-form-based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, eSiBu-Tree, to store schemas and support queries. To present a single summarized representation of the heterogeneous schemas in records, we introduce the concept of a "skeleton" and propose to use it as a relaxed form of the schema that captures a small set of core attributes. Finally, extensive experiments on real data sets demonstrate the efficiency of our proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios illustrate the effectiveness of skeletons in these applications.
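To make the idea of grouping records by equivalent schema more concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a simple canonical form in which attribute names are sorted and leaf values are replaced by their JSON type; the function names `canonical_schema` and `group_by_schema` are hypothetical, and the eSiBu-Tree and the equivalent sub-tree optimization are not modeled here.

```python
from collections import defaultdict


def canonical_schema(value):
    """Return a hashable canonical form of a JSON value's structure.

    Assumption (not from the paper): two records share a schema when they
    have the same attribute names and value types, regardless of key order.
    """
    if isinstance(value, dict):
        # Sort attribute names so key order does not affect the result.
        return ("object", tuple((k, canonical_schema(value[k])) for k in sorted(value)))
    if isinstance(value, list):
        # Represent an array by the set of distinct element schemas it contains.
        elems = {canonical_schema(v) for v in value}
        return ("array", tuple(sorted(elems, key=repr)))
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "null"


def group_by_schema(records):
    """Map each distinct canonical schema to the indices of records that have it."""
    groups = defaultdict(list)
    for i, record in enumerate(records):
        groups[canonical_schema(record)].append(i)
    return dict(groups)


records = [
    {"name": "a", "age": 1},
    {"age": 2, "name": "b"},          # same schema, different key order
    {"name": "c", "tags": ["x"]},     # different attribute set
]
groups = group_by_schema(records)
print(len(groups))  # 2
```

Hashing a canonical form keeps grouping to a single pass over the records. A real system would additionally need to store the discovered schemas and answer queries over them, which is the role the abstract assigns to the eSiBu-Tree.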



Published in: Proceedings of the VLDB Endowment, Volume 8, Issue 9, May 2015, 76 pages
Publisher: VLDB Endowment
