skip to main content
research-article

Framework for evaluating clustering algorithms in duplicate detection

Published:01 August 2009Publication History
Skip Abstract Section

Abstract

The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution also known as duplication detection or record linkage is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability.

References

  1. N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. In ACM Symp. on Theory of Computing (STOC), pages 684--693, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. P. Andritsos. Scalable Clustering of Categorical Data And Applications. PhD thesis, University of Toronto, Toronto, Canada, September 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. A. Aslam, E. Pelekhov, and D. Rus. The Star Clustering Algorithm For Static And Dynamic Information Organization. Journal of Graph Algorithms and Applications, 8(1):95--129, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  5. N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1--3):89--113, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking Stable Clusters In The Blogosphere. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 806--817, Vienna, Austria, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In Int'l World Wide Web Conference (WWW), pages 131--140, Banff, Canada, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. I. Bhattacharya and L. Getoor. A Latent Dirichlet Model for Unsupervised Entity Resolution. In Proc. of the SIAM International Conference on Data Mining (SDM), pages 47--58, Bethesda, MD, USA, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  9. I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, 29(2):4--12, 2006.Google ScholarGoogle Scholar
  10. U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms, pages 568--579. Springer-Verlag, 2003.Google ScholarGoogle Scholar
  11. S. Brohee and J. van Helden. Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks. BMC Bioinformatics, 7:488+, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  12. M. Charikar, V. Guruswami, and A. Wirth. Clustering with Qualitative Information. J. Comput. Syst. Sci., 71(3):360--383, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Chaudhuri, V. Ganti, and R. Motwani. Robust Identification of Fuzzy Duplicates. In IEEE Proc. of the Int'l Conf. on Data Eng., pages 865--876, Washington, DC, USA, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.Google ScholarGoogle Scholar
  15. F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, and E. Upfal. Finding Near Neighbors Through Cluster Pruning. In Proc. of the ACM Symp. on Principles of Database Systems (PODS), pages 103--112, Beijing, China, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73--78, Acapulco, Mexico, 2003.Google ScholarGoogle Scholar
  17. T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill and MIT Press, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. W. H. Day and H. Edelsbrunner. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification, 1(1):7--24, 1984.Google ScholarGoogle ScholarCross RefCross Ref
  19. E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation Clustering In General Weighted Graphs. Theor. Comput. Sci., 361(2):172--187, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. E. A. Dinic. Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl, 11:1277--1280, 1970.Google ScholarGoogle Scholar
  21. J. Edmonds and R. M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM, 19(2):248--264, 1972. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vision, 59(2):167--181, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A Survey of Kernel and Spectral Methods for Clustering. Pattern Recognition, 41(1):176--190, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4):385--408, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  25. L. Ford and D. Fulkerson. Maximal Flow Through a Network. Canadian J. Math, 8:399--404, 1956.Google ScholarGoogle ScholarCross RefCross Ref
  26. D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 721--732, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. V. Goldberg and R. E. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 35(4):921--940, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. journal, 17(2--3):107--145, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. O. Hassanzadeh. Benchmarking Declarative Approximate Selection Predicates. Master's thesis, University of Toronto, February 2007.Google ScholarGoogle Scholar
  30. O. Hassanzadeh and R. J. Miller. Creating Probabilistic Databases from Duplicated Data. Technical Report CSRG-568, University of Toronto, To appear in The VLDB Journal, Accepted on 26 June 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of Approximate String Joins Using Grams. In Proc. of the International Workshop on Quality in Databases (QDB), pages 11--18, Vienna, Austria, 2007.Google ScholarGoogle Scholar
  32. T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In Proc. of the Int'l Workshop on the Web and Databases (WebDB), pages 129--134, Dallas, Texas, USA, 2000.Google ScholarGoogle Scholar
  33. M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res., 9(11):1106--1115, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  35. A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264--323, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497--515, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Y. J. Kim and J. M. Patel. A Framework for Protein Structure Classification and Identification of Novel Protein Structures. BMC Bioinformatics, 7:456+, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  39. A. D. King. Graph Clustering with Restricted Neighbourhood Search. Master's thesis, University of Toronto, 2004.Google ScholarGoogle Scholar
  40. J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge Univ. Press, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 303--314, Vienna, Austria, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning, pages 727--734, San Francisco, CA, USA, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. S. Sarawagi and A. Kirpal. Efficient Set Joins On Similarity Predicates. In ACM SIGMOD Int'l Conf. on the Mgmt. of Data, pages 743--754, Paris, France, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. N. Slonim. The Information Bottleneck: Theory And Applications. PhD thesis, The Hebrew University, 2003.Google ScholarGoogle Scholar
  45. C. Swamy. Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 526--527, New Orleans, Louisiana, USA, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. C. Umans. Hardness of Approximating Sigma2p Minimization Problems. In Symp. on Foundations of Computer Science (FOCS), pages 465--474, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. S. van Dongen. Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht, 2000.Google ScholarGoogle Scholar
  48. J. A. Whitney. Graph Clustering With Overlap. Master's thesis, University of Toronto, 2006.Google ScholarGoogle Scholar
  49. D. T. Wijaya and S. Bressan. Ricochet: A Family of Unconstrained Algorithms for Graph Clustering. In Proc. of the Int'l Conf. on Database Systems for Advanced Applications (DASFAA), pages 153--167, Brisbane, Australia, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. R. Xu and I. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645--678, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. J. Zupan. Clustering of Large Data Sets. Research Studies Press, 1982.Google ScholarGoogle Scholar

Index Terms

  1. Framework for evaluating clustering algorithms in duplicate detection

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader