research-article

Schema management for document stores

Authors:
Lanjun Wang, IBM Research - China
Shuo Zhang, IBM Research - China
Juwei Shi, IBM Research - China
Limei Jiao, IBM Research - China
Oktie Hassanzadeh, IBM T.J. Watson Research Center
Jia Zou, Tsinghua University
Chen Wang, Tsinghua University

Proceedings of the VLDB Endowment, Volume 8, Issue 9, pp. 922–933. https://doi.org/10.14778/2777598.2777601

Published: 01 May 2015

Abstract

Document stores, which offer the convenience of a schema-less interface, are widely used by developers of mobile and cloud applications. However, the simplicity that developers gain comes at a price: without a schema, data management becomes more complex. In this paper, we present a schema management framework for document stores. This framework discovers schemas of JSON records and persists them in a repository, and it also supports queries and schema summarization. The major technical challenge comes from the varied structures of records, which result from the schema-less data model and schema evolution. In the discovery phase, we apply a canonical form based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, eSiBu-Tree, to store schemas and support queries. To present a single summarized representation for heterogeneous schemas in records, we introduce the concept of a "skeleton" and propose to use it as a relaxed form of the schema that captures a small set of core attributes. Finally, extensive experiments on real data sets demonstrate the efficiency of our proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios illustrate the effectiveness of skeletons in these applications.
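
To make the idea of grouping equivalent schemas by canonical form more concrete, the following is a minimal Python sketch. It is not the paper's algorithm and does not implement the eSiBu-Tree; the function names `canonical_form` and `group_by_schema`, the sample records, and the decision to ignore leaf value types are all assumptions made for illustration. The sketch reduces each JSON record to an order-independent description of its attribute tree and groups records whose descriptions are identical.

```python
import json
from collections import defaultdict


def canonical_form(value):
    """Return a hashable, order-independent description of a JSON value's structure.

    Hypothetical helper: leaf types are ignored, which is an assumption
    of this sketch rather than something stated in the abstract.
    """
    if isinstance(value, dict):
        # Sort attributes by name so that key order does not affect the result.
        return ("object", tuple((key, canonical_form(value[key])) for key in sorted(value)))
    if isinstance(value, list):
        # Collapse an array to the distinct element structures it contains.
        forms = {canonical_form(item) for item in value}
        return ("array", tuple(sorted(forms, key=repr)))
    # Strings, numbers, booleans and null are all treated as plain leaves.
    return ("value",)


def group_by_schema(records):
    """Map each distinct canonical form to the indexes of the records that share it."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[canonical_form(record)].append(index)
    return dict(groups)


# Illustrative records: the first two differ only in attribute order.
records = [json.loads(text) for text in (
    '{"id": 1, "user": {"name": "ann", "city": "Oslo"}}',
    '{"user": {"city": "Rome", "name": "bob"}, "id": 2}',
    '{"id": 3, "user": {"name": "cy"}}',
)]

for form, indexes in group_by_schema(records).items():
    print(indexes, form)
```

In this sketch the first two records fall into one group because sorting attribute names removes the effect of key order, while the third record forms its own group because its nested `user` object lacks the `city` attribute. A dictionary keyed by canonical form is only the simplest possible store; the abstract indicates that the paper uses a dedicated tree structure for storing and querying schemas.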

Index Terms

Schema management for document stores
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
    2. Database management system engines
  2. Information storage systems
    1. Record storage systems

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 9
May 2015
76 pages
ISSN: 2150-8097
Publisher
VLDB Endowment