Data profiling revisited

Published: 28 February 2014

Abstract

Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
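As a rough illustration of the single-column and dependency metadata mentioned above, the following is a minimal sketch and is not taken from the article. The function names `profile_column` and `holds_fd`, the sample rows, and the convention that `None` marks a missing value are all assumptions made for the example; real profiling tools use far more efficient algorithms.

```python
from collections import Counter


def profile_column(values):
    """Basic single-column metadata; None is treated as a missing value (assumption)."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    distinct = len(set(non_null))
    return {
        # Fraction of rows that carry a value.
        "completeness": len(non_null) / total if total else 0.0,
        # Fraction of non-missing values that are distinct (1.0 suggests a key candidate).
        "uniqueness": distinct / len(non_null) if non_null else 0.0,
        # Observed Python types, as a stand-in for data type detection.
        "types": dict(Counter(type(v).__name__ for v in non_null)),
    }


def holds_fd(rows, lhs, rhs):
    """Naive check of whether the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Remember the first rhs value per lhs key; any later mismatch violates the FD.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample table, represented as a list of dicts.
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10115", "city": "Berlin"},
    {"id": 3, "zip": "14482", "city": "Potsdam"},
    {"id": 4, "zip": None, "city": "Potsdam"},
]

id_profile = profile_column([r["id"] for r in rows])
zip_profile = profile_column([r["zip"] for r in rows])
zip_determines_city = holds_fd(rows, ["zip"], "city")
city_determines_zip = holds_fd(rows, ["city"], "zip")
```

In this sample, `id` is complete and unique, `zip` has one missing value, and `zip -> city` holds while `city -> zip` does not. The dependency check scans every row once per candidate, so it only conveys the idea; discovering all dependencies over all column combinations is the hard part that the cited algorithms address.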

Data profiling deserves a fresh look for two reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, more and more data beyond the traditional relational databases are being created and beg to be profiled. The article proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
