Abstract
Data profiling is the discipline of discovering metadata about given datasets. This metadata serves a variety of use cases, such as data integration, data cleansing, and query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they usually lack automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies.
We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome's goal is to provide novel profiling algorithms from research, to perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the metadata sets, which are at times large.
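To make the three kinds of complex metadata named above concrete, the following minimal sketch checks a functional dependency, a unique column combination, and a unary inclusion dependency on a toy relation by brute force. It is an illustration only and is not taken from Metanome or from any of its algorithms; the table contents and the function names (`holds_fd`, `is_unique`, `holds_ind`, `minimal_uccs`) are assumptions made for this example.

```python
from itertools import combinations

# Toy relations; column names and values are made up for illustration.
persons = [
    {"id": 1, "zip": "14482", "city": "Potsdam"},
    {"id": 2, "zip": "14482", "city": "Potsdam"},
    {"id": 3, "zip": "10115", "city": "Berlin"},
]
orders = [{"customer_id": 1}, {"customer_id": 3}]


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def is_unique(rows, columns):
    """Return True if no two rows agree on all of the given columns."""
    projected = [tuple(row[c] for c in columns) for row in rows]
    return len(set(projected)) == len(projected)


def holds_ind(dependent_rows, dep_col, referenced_rows, ref_col):
    """Return True if every dep_col value also occurs in ref_col."""
    referenced = {row[ref_col] for row in referenced_rows}
    return all(row[dep_col] in referenced for row in dependent_rows)


def minimal_uccs(rows, columns):
    """Enumerate minimal unique column combinations by exhaustive search."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of an already discovered unique combination.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


fd_zip_city = holds_fd(persons, ["zip"], "city")
fd_city_id = holds_fd(persons, ["city"], "id")
ind_orders = holds_ind(orders, "customer_id", persons, "id")
uccs = minimal_uccs(persons, ["id", "zip", "city"])
```

Checking a single candidate is cheap, but the number of candidate column combinations grows exponentially with the number of columns, which is one reason exhaustive search of this kind does not scale and dedicated discovery algorithms are needed.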