Data profiling revisited

Author:
Felix Naumann

Qatar Computing Research Institute (QCRI), Doha, Qatar


ACM SIGMOD Record, Volume 42, Issue 4, December 2013, pp. 40–49, https://doi.org/10.1145/2590989.2590995

Published: 28 February 2014

Abstract

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
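
The typical scenario described above can be illustrated with a minimal sketch. It is not taken from the article or from any profiling tool: the function names `profile_columns` and `holds_fd`, the sample rows, and the choice to treat `None` as a missing value are all assumptions made for illustration. The sketch scans a small table once per column to derive completeness, uniqueness, observed value types, and key candidates, and then tests one candidate functional dependency.

```python
def profile_columns(rows, columns):
    """Single-column profiling over a list of dict rows (hypothetical helper).

    completeness: share of rows with a non-missing value (None = missing).
    uniqueness:   distinct non-missing values / non-missing values.
    """
    n = len(rows)
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        distinct = set(non_null)
        profile[col] = {
            "completeness": len(non_null) / n if n else 0.0,
            "uniqueness": len(distinct) / len(non_null) if non_null else 0.0,
            "types": sorted({type(v).__name__ for v in non_null}),
            # A column is a key candidate if it is complete and duplicate-free.
            "is_key_candidate": bool(n) and len(non_null) == n and len(distinct) == n,
        }
    return profile


def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in rows.

    The dependency is violated as soon as two rows agree on all lhs columns
    but differ on the rhs column.
    """
    seen = {}
    for row in rows:
        key = tuple(row.get(c) for c in lhs)
        value = row.get(rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


# Illustrative data only; not from the article.
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin", "phone": None},
    {"id": 2, "zip": "10115", "city": "Berlin", "phone": "555-0101"},
    {"id": 3, "zip": "14482", "city": "Potsdam", "phone": "555-0102"},
    {"id": 4, "zip": "14482", "city": "Potsdam", "phone": None},
]

profile = profile_columns(rows, ["id", "zip", "city", "phone"])
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_phone = holds_fd(rows, ["city"], "phone")
```

This sketch checks a single, given dependency. Discovering all keys or dependencies requires searching a space of column combinations that grows exponentially with the number of columns, which is why the discovery tasks named in the abstract call for dedicated algorithms.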

Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.

Index Terms

Data profiling revisited
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Document representation

Published in
ACM SIGMOD Record Volume 42, Issue 4
December 2013
73 pages
ISSN: 0163-5808
DOI: 10.1145/2590989
Editors:
Ioana Manolescu (INRIA Saclay)
Denilson Barbosa (University of Alberta)
Pablo Barceló (Universidad de Chile)
Vanessa Braganholo (Universidade Federal Fluminense)
Marco Brambilla (Politecnico di Milano)
Chee Yong Chan (National University of Singapore)
Rada Chirkova (North Carolina State University)
Anish Das Sarma (Google Research)
Glenn Paulley (Conestoga College)
Alkis Simitsis (HP Labs)
Nesime Tatbul (ETH Zurich)
Marianne Winslett (University of Illinois)
Copyright © 2014 Author
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- column