Abstract
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
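As an illustration only, the single-column metadata named in the abstract (value patterns, completeness, uniqueness) can be derived in one scan of a column. The sketch below is not taken from the article: the names `value_pattern` and `profile_column`, the pattern alphabet (`A`, `a`, `9`), and the choice to treat `None` as a missing value are all assumptions made for this example.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Generalize a string: uppercase -> 'A', lowercase -> 'a', digit -> '9'.

    Hypothetical pattern alphabet chosen for this sketch; other characters
    are kept unchanged.
    """
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def profile_column(values: list) -> dict:
    """Compute simple single-column metadata in one pass over the values.

    Assumption: None marks a missing value.
    """
    non_null = [v for v in values if v is not None]
    distinct = len(set(non_null))
    patterns = Counter(value_pattern(str(v)) for v in non_null)
    return {
        # Fraction of rows that carry a value.
        "completeness": len(non_null) / len(values) if values else 0.0,
        # Number of different non-missing values.
        "distinct": distinct,
        # True if no non-missing value repeats (a key candidate).
        "is_unique": distinct == len(non_null),
        # Frequency of each generalized value pattern.
        "patterns": dict(patterns),
    }


if __name__ == "__main__":
    column = ["AB12", "CD34", None, "AB12"]
    print(profile_column(column))
```

Multi-column metadata such as keys, foreign keys, and functional dependencies needs to compare combinations of columns and is not covered by this sketch.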