skip to main content
research-article

Data profiling with metanome

Published:01 August 2015Publication History
Skip Abstract Section

Abstract

Data profiling is the discipline of discovering metadata about given datasets. The metadata itself serve a variety of use cases, such as data integration, data cleansing, or query optimization. Due to the importance of data profiling in practice, many tools have emerged that support data scientists and IT professionals in this task. These tools provide good support for profiling statistics that are easy to compute, but they are usually lacking automatic and efficient discovery of complex statistics, such as inclusion dependencies, unique column combinations, or functional dependencies.

We present Metanome, an extensible profiling platform that incorporates many state-of-the-art profiling algorithms. While Metanome is able to calculate simple profiling statistics in relational data, its focus lies on the automatic discovery of complex metadata. Metanome's goal is to provide novel profiling algorithms from research, perform comparative evaluations, and to support developers in building and testing new algorithms. In addition, Metanome is able to rank profiling results according to various metrics and to visualize the, at times, large metadata sets.

References

  1. Z. Abedjan, P. Schulze, and F. Naumann. DFD: Efficient functional dependency discovery. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 949--958, 2014. Google ScholarGoogle Scholar
  2. J. Bauckmann, U. Leser, and F. Naumann. Efficiently computing inclusion dependencies for schema discovery. In ICDE Workshops, page 2, 2006. Google ScholarGoogle Scholar
  3. P. A. Flach and I. Savnik. Database dependency discovery: a machine learning approach. AI Communications, 12(3):139--160, 1999. Google ScholarGoogle Scholar
  4. A. Heise, J.-A. Quiané-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann. Scalable discovery of unique column combinations. Proceedings of the VLDB Endowment, 7(4):301--312, 2013. Google ScholarGoogle Scholar
  5. Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100--111, 1999.Google ScholarGoogle Scholar
  6. S. Lopes, J.-M. Petit, and L. Lakhal. Efficient discovery of functional dependencies and Armstrong relations. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 350--364, 2000. Google ScholarGoogle Scholar
  7. F. D. Marchi, S. Lopes, and J.-M. Petit. Unary and n-ary inclusion dependency discovery in relational databases. Journal of Intelligent Information Systems (JIIS), 32(1):53--73, 2009. Google ScholarGoogle Scholar
  8. N. Novelli and R. Cicchetti. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of the International Conference on Database Theory (ICDT), pages 189--203, 2001. Google ScholarGoogle Scholar
  9. T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. Proceedings of the VLDB Endowment, 8(10), 2015. Google ScholarGoogle Scholar
  10. T. Papenbrock, S. Kruse, J.-A. Quiané-Ruiz, and F. Naumann. Divide & conquer-based inclusion dependency discovery. Proceedings of the VLDB Endowment, 8(7):774--785, 2015. Google ScholarGoogle Scholar
  11. C. Wyss, C. Giannella, and E. Robertson. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In Proceedings of the International Conference of Data Warehousing and Knowledge Discovery (DaWaK), pages 101--110, 2001. Google ScholarGoogle Scholar
  12. H. Yao and H. J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 16(2):197--219, 2008. Google ScholarGoogle Scholar

Index Terms

  1. Data profiling with metanome
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Proceedings of the VLDB Endowment
        Proceedings of the VLDB Endowment  Volume 8, Issue 12
        Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
        August 2015
        728 pages

        Publisher

        VLDB Endowment

        Publication History

        • Published: 1 August 2015
        Published in pvldb Volume 8, Issue 12

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader