Abstract
The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution also known as duplication detection or record linkage is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability.
- N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. In ACM Symp. on Theory of Computing (STOC), pages 684--693, 2005. Google ScholarDigital Library
- P. Andritsos. Scalable Clustering of Categorical Data And Applications. PhD thesis, University of Toronto, Toronto, Canada, September 2004. Google ScholarDigital Library
- A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 918--929, 2006. Google ScholarDigital Library
- J. A. Aslam, E. Pelekhov, and D. Rus. The Star Clustering Algorithm For Static And Dynamic Information Organization. Journal of Graph Algorithms and Applications, 8(1):95--129, 2004.Google ScholarCross Ref
- N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1--3):89--113, 2004. Google ScholarDigital Library
- N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking Stable Clusters In The Blogosphere. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 806--817, Vienna, Austria, 2007. Google ScholarDigital Library
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In Int'l World Wide Web Conference (WWW), pages 131--140, Banff, Canada, 2007. Google ScholarDigital Library
- I. Bhattacharya and L. Getoor. A Latent Dirichlet Model for Unsupervised Entity Resolution. In Proc. of the SIAM International Conference on Data Mining (SDM), pages 47--58, Bethesda, MD, USA, 2006.Google ScholarCross Ref
- I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, 29(2):4--12, 2006.Google Scholar
- U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms, pages 568--579. Springer-Verlag, 2003.Google Scholar
- S. Brohee and J. van Helden. Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks. BMC Bioinformatics, 7:488+, 2006.Google ScholarCross Ref
- M. Charikar, V. Guruswami, and A. Wirth. Clustering with Qualitative Information. J. Comput. Syst. Sci., 71(3):360--383, 2005. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Motwani. Robust Identification of Fuzzy Duplicates. In IEEE Proc. of the Int'l Conf. on Data Eng., pages 865--876, Washington, DC, USA, 2005. Google ScholarDigital Library
- D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.Google Scholar
- F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, and E. Upfal. Finding Near Neighbors Through Cluster Pruning. In Proc. of the ACM Symp. on Principles of Database Systems (PODS), pages 103--112, Beijing, China, 2007. Google ScholarDigital Library
- W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73--78, Acapulco, Mexico, 2003.Google Scholar
- T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill and MIT Press, 1990. Google ScholarDigital Library
- W. H. Day and H. Edelsbrunner. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification, 1(1):7--24, 1984.Google ScholarCross Ref
- E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation Clustering In General Weighted Graphs. Theor. Comput. Sci., 361(2):172--187, 2006. Google ScholarDigital Library
- E. A. Dinic. Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl, 11:1277--1280, 1970.Google Scholar
- J. Edmonds and R. M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM, 19(2):248--264, 1972. Google ScholarDigital Library
- P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vision, 59(2):167--181, 2004. Google ScholarDigital Library
- M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A Survey of Kernel and Spectral Methods for Clustering. Pattern Recognition, 41(1):176--190, 2008. Google ScholarDigital Library
- G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4):385--408, 2004.Google ScholarCross Ref
- L. Ford and D. Fulkerson. Maximal Flow Through a Network. Canadian J. Math, 8:399--404, 1956.Google ScholarCross Ref
- D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 721--732, 2005. Google ScholarDigital Library
- A. V. Goldberg and R. E. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 35(4):921--940, 1988. Google ScholarDigital Library
- M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. journal, 17(2--3):107--145, 2001. Google ScholarDigital Library
- O. Hassanzadeh. Benchmarking Declarative Approximate Selection Predicates. Master's thesis, University of Toronto, February 2007.Google Scholar
- O. Hassanzadeh and R. J. Miller. Creating Probabilistic Databases from Duplicated Data. Technical Report CSRG-568, University of Toronto, To appear in The VLDB Journal, Accepted on 26 June 2009. Google ScholarDigital Library
- O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of Approximate String Joins Using Grams. In Proc. of the International Workshop on Quality in Databases (QDB), pages 11--18, Vienna, Austria, 2007.Google Scholar
- T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In Proc. of the Int'l Workshop on the Web and Databases (WebDB), pages 129--134, Dallas, Texas, USA, 2000.Google Scholar
- M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998. Google ScholarDigital Library
- L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res., 9(11):1106--1115, 1999.Google ScholarCross Ref
- A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. Google ScholarDigital Library
- A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264--323, 1999. Google ScholarDigital Library
- R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497--515, 2004. Google ScholarDigital Library
- Y. J. Kim and J. M. Patel. A Framework for Protein Structure Classification and Identification of Novel Protein Structures. BMC Bioinformatics, 7:456+, 2006.Google ScholarCross Ref
- A. D. King. Graph Clustering with Restricted Neighbourhood Search. Master's thesis, University of Toronto, 2004.Google Scholar
- J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge Univ. Press, 2007. Google ScholarDigital Library
- C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 303--314, Vienna, Austria, 2007. Google ScholarDigital Library
- A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning, pages 727--734, San Francisco, CA, USA, 2000. Google ScholarDigital Library
- S. Sarawagi and A. Kirpal. Efficient Set Joins On Similarity Predicates. In ACM SIGMOD Int'l Conf. on the Mgmt. of Data, pages 743--754, Paris, France, 2004. Google ScholarDigital Library
- N. Slonim. The Information Bottleneck: Theory And Applications. PhD thesis, The Hebrew University, 2003.Google Scholar
- C. Swamy. Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 526--527, New Orleans, Louisiana, USA, 2004. Google ScholarDigital Library
- C. Umans. Hardness of Approximating Sigma2p Minimization Problems. In Symp. on Foundations of Computer Science (FOCS), pages 465--474, 1999. Google ScholarDigital Library
- S. van Dongen. Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht, 2000.Google Scholar
- J. A. Whitney. Graph Clustering With Overlap. Master's thesis, University of Toronto, 2006.Google Scholar
- D. T. Wijaya and S. Bressan. Ricochet: A Family of Unconstrained Algorithms for Graph Clustering. In Proc. of the Int'l Conf. on Database Systems for Advanced Applications (DASFAA), pages 153--167, Brisbane, Australia, 2009. Google ScholarDigital Library
- R. Xu and I. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645--678, 2005. Google ScholarDigital Library
- J. Zupan. Clustering of Large Data Sets. Research Studies Press, 1982.Google Scholar
Index Terms
- Framework for evaluating clustering algorithms in duplicate detection
Recommendations
Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection
ON THE HORIZON, CHALLENGE PAPER, REGULAR PAPERS, and EXPERIENCE PAPERDuplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records ...
Duplicate Record Detection: A Survey
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription ...
Evaluating indeterministic duplicate detection results
SUM'12: Proceedings of the 6th international conference on Scalable Uncertainty ManagementDuplicate detection is an important process for cleaning or integrating data. Since real-life data is often polluted, detecting duplicates usually comes along with uncertainty. To handle duplicate uncertainty in an appropriate way, indeterministic ...
Comments