ABSTRACT
Temporal record matching recognizes that if the entities represented by the records change over time, approaches that use temporal information may do better than approaches that do not. Any such temporal matching method relies at its heart on a temporal model that captures information about how entities evolve. In their pioneering work, Li et al. used an efficiently computable model that simply tries to predict whether an attribute is expected to change over a given time interval. In our work, we propose and evaluate a more detailed model that focuses on the probability that a given attribute value reappears over time. The intuition here is that an entity might change its attribute value in a way that depends on its past values. In addition, our model considers sets of records (rather than simply pairs of records) to improve robustness and accuracy. Experimental results show that the resulting approach improves both accuracy and resistance to noise while incurring minimal overhead.
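To make the idea of value reappearance concrete, the following is a minimal sketch and not the estimator used in the paper. It assumes each entity's history is a list of `(time, value)` observations for a single attribute, and it estimates, for each time gap, the fraction of observation pairs from the same entity that carry the same value. The function name `reappearance_probability` and the sample data are hypothetical.

```python
from collections import defaultdict


def reappearance_probability(histories):
    """Estimate, per time gap, how often a value seen at time t is seen again at t + gap.

    `histories` is a list of entity histories; each history is a list of
    (time, value) observations for one attribute (an assumed input format).
    """
    same = defaultdict(int)   # pairs with equal values, keyed by gap
    total = defaultdict(int)  # all pairs, keyed by gap
    for history in histories:
        ordered = sorted(history)
        for i, (t_i, v_i) in enumerate(ordered):
            for t_j, v_j in ordered[i + 1:]:
                gap = t_j - t_i
                total[gap] += 1
                if v_i == v_j:
                    same[gap] += 1
    return {gap: same[gap] / total[gap] for gap in total}


# Hypothetical data: the first entity moves from "A" to "B" and back to "A".
histories = [
    [(2000, "A"), (2002, "B"), (2004, "A")],
    [(2001, "X"), (2003, "X")],
]
probs = reappearance_probability(histories)
```

In this toy data the value "A" returns after an intermediate change, so the estimate for a gap of 4 is 1.0 while the estimate for a gap of 2 is 1/3. A model that only asks whether an attribute changed over an interval would not distinguish a return to an earlier value from a move to a new one.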
- Academic patenting in Europe (APE-INV). http://www.esf-ape-inv.eu/.
- The DBLP computer science bibliography. http://www.informatik.uni-trier.de/ley/db/.
- Fec-standardizer - an experiment to standardize individual donor names in campaign finance data. https://github.com/cjdd3b/fec-standardizer.
- Twitter - an online social networking and microblogging service. https://twitter.com/.
- E. Cohen and M. Strauss. Maintaining time-decaying stream aggregates. Journal of Algorithms, 59(1):19-36, 2006.
- G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward decay: A practical time decay model for streaming systems. In IEEE 25th International Conference on Data Engineering (ICDE), pages 138-149. IEEE, 2009.
- P. Domingos. Multi-relational record linkage. In Proc. of the KDD-2004 Workshop on Multi-Relational Data Mining. KDD, 2004.
- A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1-16, 2007.
- I. Fellegi and A. Sunter. A theory for record linkage. Journal of the American Statistical Association, pages 1183-1210, 1969.
- O. Hassanzadeh, F. Chiang, H. Lee, and R. Miller. Framework for evaluating clustering algorithms in duplicate detection. Proceedings of the VLDB Endowment, 2(1):1282-1293, 2009.
- P. Jaccard. Distribution de la Flore Alpine: dans le Bassin des Dranses et dans quelques régions voisines. Rouge, 1901.
- N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 802-803. ACM, 2006.
- V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, pages 707-710, 1966.
- P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking temporal records. Proceedings of the VLDB Endowment, 4(11):956-967, 2011.
- G. Ozsoyoglu and R. Snodgrass. Temporal and real-time databases: A survey. IEEE Transactions on Knowledge and Data Engineering, 7(4):513-532, 1995.
- J. Roddick and M. Spiliopoulou. A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4):750-767, 2002.
- D. Wang and M. A. Arbib. Complex temporal sequence learning based on short-term memory. Proceedings of the IEEE, 78(9):1536-1543, 1990.
- W. Winkler. Methods for record linkage and Bayesian networks. Technical report, Statistical Research Division, US Census Bureau, Washington, DC, 2002.
- M. Yakout, A. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi. Behavior based record linkage. Proceedings of the VLDB Endowment, 3(1-2):439-448, 2010.
Index Terms
- Modeling entity evolution for temporal record matching