skip to main content
research-article

Differential dependencies: Reasoning and discovery

Authors Info & Claims
Published:26 August 2011Publication History
Skip Abstract Section

Abstract

The importance of difference semantics (e.g., “similar” or “dissimilar”) has been recently recognized for declaring dependencies among various types of data, such as numerical values or text values. We propose a novel form of Differential Dependencies (dds), which specifies constraints on difference, called differential functions, instead of identification functions in traditional dependency notations like functional dependencies. Informally, a differential dependency states that if two tuples have distances on attributes X agreeing with a certain differential function, then their distances on attributes Y should also agree with the corresponding differential function on Y. For example, [date(≤ 7)]→[price(< 100)] states that the price difference of any two days within a week length should be no greater than 100 dollars. Such differential dependencies are useful in various applications, for example, violation detection, data partition, query optimization, record linkage, etc.

In this article, we first address several theoretical issues of differential dependencies, including formal definitions of dds and differential keys, subsumption order relation of differential functions, implication of dds, closure of a differential function, a sound and complete inference system, and minimal cover for dds. Then, we investigate a practical problem, that is, how to discover dds and differential keys from a given dataset. Due to the intrinsic hardness, we develop several pruning methods to improve the discovery efficiency in practice. Finally, through an extensive experimental evaluation on real datasets, we demonstrate the discovery performance and the effectiveness of dds in several real applications.

References

  1. Abiteboul, S., Hull, R., and Vianu, V. 1995. Foundations of Databases. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arenas, M., Bertossi, L. E., and Chomicki, J. 1999. Consistent query answers in inconsistent databases. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). 68--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Armstrong, W. W. 1974. Dependency structures of data base relationships. In Proceedings of the IFIP Congress. 580--583.Google ScholarGoogle Scholar
  4. Aspvall, B., Plass, M. F., and Tarjan, R. E. 1979. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Inf. Process. Lett. 8, 3, 121--123.Google ScholarGoogle ScholarCross RefCross Ref
  5. Bitton, D., Millman, J., and Torgersen, S. 1989. A feasibility and performance study of dependency inference. In Proceedings of the International Conference on Data Engineering (ICDE'89). 635--641. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bohannon, P., Fan, W., Geert S. F., Jia, X., and Kementsietsidis, A. 2007. Conditional functional dependencies for data cleaning. In Proceedings of the International Conference on Data Engineering (ICDE'07). 746--755.Google ScholarGoogle Scholar
  7. Bravo, L., Fan, W., Geerts, F., and Ma, S. 2008. Increasing the expressivity of conditional functional dependencies without extra complexity. In Proceedings of the International Conference on Data Engineering (ICDE'08). 516--525. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Bravo, L., Fan, W., and Ma, S. 2007. Extending dependencies with conditions. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 243--254. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Chakravarthy, U. S., Grant, J., and Minker, J. 1990. Logic-Based approach to semantic query optimization. ACM Trans. Datab. Syst. 15, 2, 162--207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chaudhuri, S., Sarma, A. D., Ganti, V., and Kaushik, R. 2007. Leveraging aggregate constraints for deduplication. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 437--448. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Chazelle, B. and Guibas, L. J. 1986a. Fractional cascading: I. A data structuring technique. Algorithmica 1, 2, 133--162.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Chazelle, B. and Guibas, L. J. 1986b. Fractional cascading: Ii. Applications. Algorithmica 1, 2, 163--191.Google ScholarGoogle ScholarCross RefCross Ref
  13. Chiang, F. and Miller, R. J. 2008. Discovering data quality rules. Proc. VLDB 1, 1, 1166--1177. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chomicki, J. and Marcinkowski, J. 2005. On the computational complexity of minimal-change integrity maintenance in relational databases. In Inconsistency Tolerance. 119--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Cong, G., Fan, W., Geerts, F., Jia, X., and Ma, S. 2007. Improving data quality: Consistency and accuracy. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 315--326. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. De Berg, M., Cheong, O., Van Kreveld, M., and Overmars, M. 2008. Computational Geometry: Algorithms and Applications 3rd Ed. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Dong, J. and Hull, R. 1982. Applying approximate order dependency to reduce indexing space. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 119--127. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. el Magarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Engin. 19, 1, 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Fan, W. 2008. Dependencies revisited for improving data quality. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'08). 159--170. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Fan, W., Geerts, F., Jia, X., and Kementsietsidis, A. 2008a. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Datab. Syst. 33, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Fan, W., Geerts, F., Lakshmanan, L. V. S., and Xiong, M. 2009a. Discovering conditional functional dependencies. In Proceedings of the International Conference on Data Engineering (ICDE'09). 1231--1234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Fan, W., Li, J., Jia, X., and Ma, S. 2009b. Reasoning about record matching rules. In Proceedings of the International Conference on Very Large Databases (VLDB'09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Fan, W., Ma, S., Hu, Y., Liu, J., and Wu, Y. 2008b. Propagating functional dependencies with conditions. Proc. VLDB 1, 1, 391--407. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Flach, P. A. and Savnik, I. 1999. Database dependency discovery: A machine learning approach. Artif. Intell. Comm. 12, 3, 139--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Franklin, M. J., Halevy, A. Y., and Maier, D. 2005. From databases to dataspaces: A new abstraction for information management. SIGMOD Rec. 34, 4, 27--33. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Garey, M. R. and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Ginsburg, S. and Hull, R. 1983a. Order dependency in the relational model. Theor. Comput. Sci. 26, 149--195.Google ScholarGoogle ScholarCross RefCross Ref
  28. Ginsburg, S. and Hull, R. 1983b. Sort sets in the relational model. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'83). 332--339. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Ginsburg, S. and Hull, R. 1986. Sort sets in the relational model. J. ACM 33, 3, 465--488. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Golab, L., Karloff, H. J., Korn, F., Saha, A., and Srivastava, D. 2009. Sequential dependencies. Proc. VLDB 2, 1, 574--585. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Golab, L., Karloff, H. J., Korn, F., Srivastava, D., and Yu, B. 2008. On generating near-optimal tableaux for conditional functional dependencies. Proc. VLDB 1, 1, 376--390. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a dbms for approximate string processing. IEEE Data Engin. Bull. 24, 4, 28--34.Google ScholarGoogle Scholar
  33. Hernandez, M. A. and Stolfo, S. J. 1995. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 127--138. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the International Conference on Data Engineering (ICDE'98). 392--401. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H. 1999. Tane: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2, 100--111.Google ScholarGoogle ScholarCross RefCross Ref
  36. Ilyas, I. F., Markl, V., Haas, P. J., Brown, P., and Aboulnaga, A. 2004. Cords: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 647--658. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations, R. E. Miller and J. W. Thatcher Eds., Plenum Press, 85--103.Google ScholarGoogle Scholar
  38. Kimura, H., Huo, G., Rasin, A., Madden, S., and Zdonik, S. B. 2009. Correlation maps: A compressed access method for exploiting soft functional dependencies. Proc. VLDB 2, 1, 1222--1233. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Kivinen, J. and Mannila, H. 1995. Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149, 1, 129--149. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Korn, F., Muthukrishnan, S., and Zhu, Y. 2003. Checks and balances: Monitoring data quality problems in network traffic databases. In Proceedings of the International Conference on Very Large Databases (VLDB'03). 536--547. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Koudas, N., Saha, A., Srivastava, D., and Venkatasubramanian, S. 2009. Metric functional dependencies. In Proceedings of the International Conference on Data Engineering (ICDE'09). 1275--1278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Kramer, S. and Pfahringer, B. 1996. Efficient search for strong partial determinations. In Proceedings of the International SIGKDD Conference on Knowledge Discovery and Data Mining. 371--374.Google ScholarGoogle Scholar
  43. Levene, M. and Loizou, G. 1999. A Guided Tour of Relational Databases and Beyond. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Levy, A. Y. and Sagiv, Y. 1995. Semantic query optimization in datalog programs. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'95). 163--173. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Mannila, H. and Rai Ha, K.-J. 1987. Dependency inference. In Proceedings of the International Conference on Very Large Databases (VLDB'87). 155--158. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Mannila, H. and Rai Ha, K.-J. 1989. Practical algorithms for finding prime attributes and testing normal forms. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'89). 128--133. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Mannila, H. and Rai Ha, K.-J. 1992. Design of Relational Databases. Addison-Wesley. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Mannila, H. and Rai Ha, K.-J. 1994. Algorithms for inferring functional dependencies from relations. Data Knowl. Engin. 12, 1, 83--99. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Ng, W. 1999. Ordered functional dependencies in relational databases. Inf. Syst. 24, 7, 535--554. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Ng, W. 2001. An extension of the relational data model to incorporate ordered domains. ACM Trans. Datab. Syst. 26, 3, 344--383. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Papadimitriou, C. H. and Steiglitz, K. 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Schlimmer, J. C. 1993. Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning. In Proceedings of the International Conference on Machine Learning (ICML'93). 284--290.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Wijsen, J. 1998. Reasoning about qualitative trends in databases. Inf. Syst. 23, 7, 463--487. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Wijsen, J. 2001. Trends in databases: Reasoning and mining. IEEE Trans. Knowl. Data Engin. 13, 3, 426--438. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Wyss, C. M., Giannella, C., and Robertson, E. L. 2001. Fastfds: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances - extended abstract. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK'01). 101--110. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Differential dependencies: Reasoning and discovery

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 36, Issue 3
      August 2011
      207 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/2000824
      Issue’s Table of Contents

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 August 2011
      • Accepted: 1 March 2011
      • Revised: 1 January 2011
      • Received: 1 August 2010
      Published in tods Volume 36, Issue 3

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader