research-article

Data Profiling with Metanome

Authors:
Thorsten Papenbrock, Hasso-Plattner-Institut, Potsdam, Germany
Tanja Bergmann, Hasso-Plattner-Institut, Potsdam, Germany
Moritz Finke, Hasso-Plattner-Institut, Potsdam, Germany
Jakob Zwiener, Hasso-Plattner-Institut, Potsdam, Germany
Felix Naumann, Hasso-Plattner-Institut, Potsdam, Germany

Proceedings of the VLDB Endowment, Volume 8, Issue 12, pp. 1860–1863. https://doi.org/10.14778/2824032.2824086

Published: 01 August 2015

Abstract

Data profiling is the discipline of discovering metadata about given datasets. The metadata themselves serve a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they usually lack automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies.

We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome's goal is to provide novel profiling algorithms from research, to perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the metadata sets, which are at times large.
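
To make the three kinds of complex metadata named in the abstract concrete, the following is a minimal sketch of what it means for a unique column combination, a functional dependency, and a unary inclusion dependency to hold on a small relation. It is not Metanome's API and not one of the discovery algorithms the platform ships; the toy tables and the function names `is_unique`, `holds_fd`, and `holds_ind` are hypothetical and chosen only for illustration.

```python
# Toy relations (hypothetical data): each row is a dict from column name to value.
rows = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
orders = [{"customer_id": 1}, {"customer_id": 3}]


def is_unique(rows, columns):
    """Unique column combination: no two rows agree on all given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """Functional dependency lhs -> rhs: equal lhs values imply equal rhs values."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this lhs key
        # and returns the stored value on every later occurrence.
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def holds_ind(dep_rows, dep_col, ref_rows, ref_col):
    """Unary inclusion dependency: every dep_col value also occurs in ref_col."""
    return {r[dep_col] for r in dep_rows} <= {r[ref_col] for r in ref_rows}


print(is_unique(rows, ["id"]))                      # id identifies each row
print(holds_fd(rows, ["zip"], "city"))              # zip determines city
print(holds_ind(orders, "customer_id", rows, "id")) # orders reference existing ids
```

Validating a single candidate like this is cheap. Discovery is the hard part, because the number of candidate column combinations grows exponentially with the number of columns, which is why specialized algorithms are needed.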

Index Terms

Data Profiling with Metanome
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097
Publisher
VLDB Endowment