Abstract
The importance of difference semantics (e.g., “similar” or “dissimilar”) has been recently recognized for declaring dependencies among various types of data, such as numerical values or text values. We propose a novel form of Differential Dependencies (dds), which specifies constraints on difference, called differential functions, instead of identification functions in traditional dependency notations like functional dependencies. Informally, a differential dependency states that if two tuples have distances on attributes X agreeing with a certain differential function, then their distances on attributes Y should also agree with the corresponding differential function on Y. For example, [date(≤ 7)]→[price(< 100)] states that the price difference of any two days within a week length should be no greater than 100 dollars. Such differential dependencies are useful in various applications, for example, violation detection, data partition, query optimization, record linkage, etc.
In this article, we first address several theoretical issues of differential dependencies, including formal definitions of dds and differential keys, subsumption order relation of differential functions, implication of dds, closure of a differential function, a sound and complete inference system, and minimal cover for dds. Then, we investigate a practical problem, that is, how to discover dds and differential keys from a given dataset. Due to the intrinsic hardness, we develop several pruning methods to improve the discovery efficiency in practice. Finally, through an extensive experimental evaluation on real datasets, we demonstrate the discovery performance and the effectiveness of dds in several real applications.
- Abiteboul, S., Hull, R., and Vianu, V. 1995. Foundations of Databases. Addison-Wesley. Google ScholarDigital Library
- Arenas, M., Bertossi, L. E., and Chomicki, J. 1999. Consistent query answers in inconsistent databases. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'99). 68--79. Google ScholarDigital Library
- Armstrong, W. W. 1974. Dependency structures of data base relationships. In Proceedings of the IFIP Congress. 580--583.Google Scholar
- Aspvall, B., Plass, M. F., and Tarjan, R. E. 1979. A linear-time algorithm for testing the truth of certain quantified boolean formulas. Inf. Process. Lett. 8, 3, 121--123.Google ScholarCross Ref
- Bitton, D., Millman, J., and Torgersen, S. 1989. A feasibility and performance study of dependency inference. In Proceedings of the International Conference on Data Engineering (ICDE'89). 635--641. Google ScholarDigital Library
- Bohannon, P., Fan, W., Geert S. F., Jia, X., and Kementsietsidis, A. 2007. Conditional functional dependencies for data cleaning. In Proceedings of the International Conference on Data Engineering (ICDE'07). 746--755.Google Scholar
- Bravo, L., Fan, W., Geerts, F., and Ma, S. 2008. Increasing the expressivity of conditional functional dependencies without extra complexity. In Proceedings of the International Conference on Data Engineering (ICDE'08). 516--525. Google ScholarDigital Library
- Bravo, L., Fan, W., and Ma, S. 2007. Extending dependencies with conditions. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 243--254. Google ScholarDigital Library
- Chakravarthy, U. S., Grant, J., and Minker, J. 1990. Logic-Based approach to semantic query optimization. ACM Trans. Datab. Syst. 15, 2, 162--207. Google ScholarDigital Library
- Chaudhuri, S., Sarma, A. D., Ganti, V., and Kaushik, R. 2007. Leveraging aggregate constraints for deduplication. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 437--448. Google ScholarDigital Library
- Chazelle, B. and Guibas, L. J. 1986a. Fractional cascading: I. A data structuring technique. Algorithmica 1, 2, 133--162.Google ScholarDigital Library
- Chazelle, B. and Guibas, L. J. 1986b. Fractional cascading: Ii. Applications. Algorithmica 1, 2, 163--191.Google ScholarCross Ref
- Chiang, F. and Miller, R. J. 2008. Discovering data quality rules. Proc. VLDB 1, 1, 1166--1177. Google ScholarDigital Library
- Chomicki, J. and Marcinkowski, J. 2005. On the computational complexity of minimal-change integrity maintenance in relational databases. In Inconsistency Tolerance. 119--150. Google ScholarDigital Library
- Cong, G., Fan, W., Geerts, F., Jia, X., and Ma, S. 2007. Improving data quality: Consistency and accuracy. In Proceedings of the International Conference on Very Large Databases (VLDB'07). 315--326. Google ScholarDigital Library
- De Berg, M., Cheong, O., Van Kreveld, M., and Overmars, M. 2008. Computational Geometry: Algorithms and Applications 3rd Ed. Springer. Google ScholarDigital Library
- Dong, J. and Hull, R. 1982. Applying approximate order dependency to reduce indexing space. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 119--127. Google ScholarDigital Library
- el Magarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Engin. 19, 1, 1--16. Google ScholarDigital Library
- Fan, W. 2008. Dependencies revisited for improving data quality. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'08). 159--170. Google ScholarDigital Library
- Fan, W., Geerts, F., Jia, X., and Kementsietsidis, A. 2008a. Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Datab. Syst. 33, 2. Google ScholarDigital Library
- Fan, W., Geerts, F., Lakshmanan, L. V. S., and Xiong, M. 2009a. Discovering conditional functional dependencies. In Proceedings of the International Conference on Data Engineering (ICDE'09). 1231--1234. Google ScholarDigital Library
- Fan, W., Li, J., Jia, X., and Ma, S. 2009b. Reasoning about record matching rules. In Proceedings of the International Conference on Very Large Databases (VLDB'09). Google ScholarDigital Library
- Fan, W., Ma, S., Hu, Y., Liu, J., and Wu, Y. 2008b. Propagating functional dependencies with conditions. Proc. VLDB 1, 1, 391--407. Google ScholarDigital Library
- Flach, P. A. and Savnik, I. 1999. Database dependency discovery: A machine learning approach. Artif. Intell. Comm. 12, 3, 139--160. Google ScholarDigital Library
- Franklin, M. J., Halevy, A. Y., and Maier, D. 2005. From databases to dataspaces: A new abstraction for information management. SIGMOD Rec. 34, 4, 27--33. Google ScholarDigital Library
- Garey, M. R. and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman. Google ScholarDigital Library
- Ginsburg, S. and Hull, R. 1983a. Order dependency in the relational model. Theor. Comput. Sci. 26, 149--195.Google ScholarCross Ref
- Ginsburg, S. and Hull, R. 1983b. Sort sets in the relational model. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'83). 332--339. Google ScholarDigital Library
- Ginsburg, S. and Hull, R. 1986. Sort sets in the relational model. J. ACM 33, 3, 465--488. Google ScholarDigital Library
- Golab, L., Karloff, H. J., Korn, F., Saha, A., and Srivastava, D. 2009. Sequential dependencies. Proc. VLDB 2, 1, 574--585. Google ScholarDigital Library
- Golab, L., Karloff, H. J., Korn, F., Srivastava, D., and Yu, B. 2008. On generating near-optimal tableaux for conditional functional dependencies. Proc. VLDB 1, 1, 376--390. Google ScholarDigital Library
- Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., Pietarinen, L., and Srivastava, D. 2001. Using q-grams in a dbms for approximate string processing. IEEE Data Engin. Bull. 24, 4, 28--34.Google Scholar
- Hernandez, M. A. and Stolfo, S. J. 1995. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 127--138. Google ScholarDigital Library
- Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the International Conference on Data Engineering (ICDE'98). 392--401. Google ScholarDigital Library
- Huhtala, Y., Karkkainen, J., Porkka, P., and Toivonen, H. 1999. Tane: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2, 100--111.Google ScholarCross Ref
- Ilyas, I. F., Markl, V., Haas, P. J., Brown, P., and Aboulnaga, A. 2004. Cords: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 647--658. Google ScholarDigital Library
- Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations, R. E. Miller and J. W. Thatcher Eds., Plenum Press, 85--103.Google Scholar
- Kimura, H., Huo, G., Rasin, A., Madden, S., and Zdonik, S. B. 2009. Correlation maps: A compressed access method for exploiting soft functional dependencies. Proc. VLDB 2, 1, 1222--1233. Google ScholarDigital Library
- Kivinen, J. and Mannila, H. 1995. Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149, 1, 129--149. Google ScholarDigital Library
- Korn, F., Muthukrishnan, S., and Zhu, Y. 2003. Checks and balances: Monitoring data quality problems in network traffic databases. In Proceedings of the International Conference on Very Large Databases (VLDB'03). 536--547. Google ScholarDigital Library
- Koudas, N., Saha, A., Srivastava, D., and Venkatasubramanian, S. 2009. Metric functional dependencies. In Proceedings of the International Conference on Data Engineering (ICDE'09). 1275--1278. Google ScholarDigital Library
- Kramer, S. and Pfahringer, B. 1996. Efficient search for strong partial determinations. In Proceedings of the International SIGKDD Conference on Knowledge Discovery and Data Mining. 371--374.Google Scholar
- Levene, M. and Loizou, G. 1999. A Guided Tour of Relational Databases and Beyond. Springer. Google ScholarDigital Library
- Levy, A. Y. and Sagiv, Y. 1995. Semantic query optimization in datalog programs. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'95). 163--173. Google ScholarDigital Library
- Mannila, H. and Rai Ha, K.-J. 1987. Dependency inference. In Proceedings of the International Conference on Very Large Databases (VLDB'87). 155--158. Google ScholarDigital Library
- Mannila, H. and Rai Ha, K.-J. 1989. Practical algorithms for finding prime attributes and testing normal forms. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'89). 128--133. Google ScholarDigital Library
- Mannila, H. and Rai Ha, K.-J. 1992. Design of Relational Databases. Addison-Wesley. Google ScholarDigital Library
- Mannila, H. and Rai Ha, K.-J. 1994. Algorithms for inferring functional dependencies from relations. Data Knowl. Engin. 12, 1, 83--99. Google ScholarDigital Library
- Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarDigital Library
- Ng, W. 1999. Ordered functional dependencies in relational databases. Inf. Syst. 24, 7, 535--554. Google ScholarDigital Library
- Ng, W. 2001. An extension of the relational data model to incorporate ordered domains. ACM Trans. Datab. Syst. 26, 3, 344--383. Google ScholarDigital Library
- Papadimitriou, C. H. and Steiglitz, K. 1982. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall. Google ScholarDigital Library
- Schlimmer, J. C. 1993. Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning. In Proceedings of the International Conference on Machine Learning (ICML'93). 284--290.Google ScholarDigital Library
- Wijsen, J. 1998. Reasoning about qualitative trends in databases. Inf. Syst. 23, 7, 463--487. Google ScholarDigital Library
- Wijsen, J. 2001. Trends in databases: Reasoning and mining. IEEE Trans. Knowl. Data Engin. 13, 3, 426--438. Google ScholarDigital Library
- Wyss, C. M., Giannella, C., and Robertson, E. L. 2001. Fastfds: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances - extended abstract. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK'01). 101--110. Google ScholarDigital Library
Index Terms
- Differential dependencies: Reasoning and discovery
Recommendations
Response to “Differential Dependencies Revisited”
Invited Paper from ICDT 2014, Invited Paper from EDBT 2015, Regular Papers and Technical CorrespondenceA recent article [Vincent et al. 2015] concerns the correctness of several results in reasoning about differential dependencies (dds), originally reported in Song and Chen [2011]. The major concern by Vincent et al. [2015] roots from assuming a type of ...
Discovering branching and fractional dependencies in databases
The discovery of dependencies between attributes in databases is an important problem in data mining, and can be applied to facilitate future decision-making. In the present paper some properties of the branching dependencies are examined. We define a ...
Discovering Graph Differential Dependencies
Databases Theory and ApplicationsAbstractGraph differential dependencies (GDDs) are a novel class of integrity constraints in property graphs for capturing and expressing the semantics of difference in graph data. They are more expressive, and subsume other graph dependencies; and thus, ...
Comments