research-article

Framework for evaluating clustering algorithms in duplicate detection

Authors:
Oktie Hassanzadeh

University of Toronto

University of Toronto
View Profile

,
Fei Chiang

University of Toronto

University of Toronto
View Profile

,
Hyun Chul Lee

Thoora Inc.

Thoora Inc.
View Profile

,
Renée J. Miller

University of Toronto

University of Toronto
View Profile

Proceedings of the VLDB Endowment Volume 2 Issue 1pp 1282–1293https://doi.org/10.14778/1687627.1687771

Published:01 August 2009Publication History

Proceedings of the VLDB Endowment

Abstract

The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution also known as duplication detection or record linkage is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection, perform extremely well in terms of both accuracy and scalability.

References

N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. In ACM Symp. on Theory of Computing (STOC), pages 684--693, 2005. Google ScholarDigital Library
P. Andritsos. Scalable Clustering of Categorical Data And Applications. PhD thesis, University of Toronto, Toronto, Canada, September 2004. Google ScholarDigital Library
A. Arasu, V. Ganti, and R. Kaushik. Efficient Exact Set-Similarity Joins. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 918--929, 2006. Google ScholarDigital Library
J. A. Aslam, E. Pelekhov, and D. Rus. The Star Clustering Algorithm For Static And Dynamic Information Organization. Journal of Graph Algorithms and Applications, 8(1):95--129, 2004.Google ScholarCross Ref
N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Machine Learning, 56(1--3):89--113, 2004. Google ScholarDigital Library
N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking Stable Clusters In The Blogosphere. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 806--817, Vienna, Austria, 2007. Google ScholarDigital Library
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In Int'l World Wide Web Conference (WWW), pages 131--140, Banff, Canada, 2007. Google ScholarDigital Library
I. Bhattacharya and L. Getoor. A Latent Dirichlet Model for Unsupervised Entity Resolution. In Proc. of the SIAM International Conference on Data Mining (SDM), pages 47--58, Bethesda, MD, USA, 2006.Google ScholarCross Ref
I. Bhattacharya and L. Getoor. Collective Entity Resolution in Relational Data. IEEE Data Engineering Bulletin, 29(2):4--12, 2006.Google Scholar
U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In The 11th Europ. Symp. Algorithms, pages 568--579. Springer-Verlag, 2003.Google Scholar
S. Brohee and J. van Helden. Evaluation of Clustering Algorithms for Protein-Protein Interaction Networks. BMC Bioinformatics, 7:488+, 2006.Google ScholarCross Ref
M. Charikar, V. Guruswami, and A. Wirth. Clustering with Qualitative Information. J. Comput. Syst. Sci., 71(3):360--383, 2005. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Motwani. Robust Identification of Fuzzy Duplicates. In IEEE Proc. of the Int'l Conf. on Data Eng., pages 865--876, Washington, DC, USA, 2005. Google ScholarDigital Library
D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.Google Scholar
F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, and E. Upfal. Finding Near Neighbors Through Cluster Pruning. In Proc. of the ACM Symp. on Principles of Database Systems (PODS), pages 103--112, Beijing, China, 2007. Google ScholarDigital Library
W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proc. of IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pages 73--78, Acapulco, Mexico, 2003.Google Scholar
T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw Hill and MIT Press, 1990. Google ScholarDigital Library
W. H. Day and H. Edelsbrunner. Efficient Algorithms for Agglomerative Hierarchical Clustering Methods. Journal of Classification, 1(1):7--24, 1984.Google ScholarCross Ref
E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica. Correlation Clustering In General Weighted Graphs. Theor. Comput. Sci., 361(2):172--187, 2006. Google ScholarDigital Library
E. A. Dinic. Algorithm for Solution of a Problem Of Maximum Flow in Networks with Power Estimation. Soviet Math. Dokl, 11:1277--1280, 1970.Google Scholar
J. Edmonds and R. M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM, 19(2):248--264, 1972. Google ScholarDigital Library
P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vision, 59(2):167--181, 2004. Google ScholarDigital Library
M. Filippone, F. Camastra, F. Masulli, and S. Rovetta. A Survey of Kernel and Spectral Methods for Clustering. Pattern Recognition, 41(1):176--190, 2008. Google ScholarDigital Library
G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph Clustering and Minimum Cut Trees. Internet Mathematics, 1(4):385--408, 2004.Google ScholarCross Ref
L. Ford and D. Fulkerson. Maximal Flow Through a Network. Canadian J. Math, 8:399--404, 1956.Google ScholarCross Ref
D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 721--732, 2005. Google ScholarDigital Library
A. V. Goldberg and R. E. Tarjan. A New Approach to the Maximum-Flow Problem. Journal of the ACM, 35(4):921--940, 1988. Google ScholarDigital Library
M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. journal, 17(2--3):107--145, 2001. Google ScholarDigital Library
O. Hassanzadeh. Benchmarking Declarative Approximate Selection Predicates. Master's thesis, University of Toronto, February 2007.Google Scholar
O. Hassanzadeh and R. J. Miller. Creating Probabilistic Databases from Duplicated Data. Technical Report CSRG-568, University of Toronto, To appear in The VLDB Journal, Accepted on 26 June 2009. Google ScholarDigital Library
O. Hassanzadeh, M. Sadoghi, and R. J. Miller. Accuracy of Approximate String Joins Using Grams. In Proc. of the International Workshop on Quality in Databases (QDB), pages 11--18, Vienna, Austria, 2007.Google Scholar
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In Proc. of the Int'l Workshop on the Web and Databases (WebDB), pages 129--134, Dallas, Texas, USA, 2000.Google Scholar
M. A. Hernández and S. J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998. Google ScholarDigital Library
L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res., 9(11):1106--1115, 1999.Google ScholarCross Ref
A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. Google ScholarDigital Library
A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264--323, 1999. Google ScholarDigital Library
R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497--515, 2004. Google ScholarDigital Library
Y. J. Kim and J. M. Patel. A Framework for Protein Structure Classification and Identification of Novel Protein Structures. BMC Bioinformatics, 7:456+, 2006.Google ScholarCross Ref
A. D. King. Graph Clustering with Restricted Neighbourhood Search. Master's thesis, University of Toronto, 2004.Google Scholar
J. Kogan. Introduction to Clustering Large and High-Dimensional Data. Cambridge Univ. Press, 2007. Google ScholarDigital Library
C. Li, B. Wang, and X. Yang. VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), pages 303--314, Vienna, Austria, 2007. Google ScholarDigital Library
A. M. D. Pelleg. X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters. In Proc. of the Int'l Conf. on Machine Learning, pages 727--734, San Francisco, CA, USA, 2000. Google ScholarDigital Library
S. Sarawagi and A. Kirpal. Efficient Set Joins On Similarity Predicates. In ACM SIGMOD Int'l Conf. on the Mgmt. of Data, pages 743--754, Paris, France, 2004. Google ScholarDigital Library
N. Slonim. The Information Bottleneck: Theory And Applications. PhD thesis, The Hebrew University, 2003.Google Scholar
C. Swamy. Correlation Clustering: Maximizing Agreements Via Semidefinite Programming. In Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 526--527, New Orleans, Louisiana, USA, 2004. Google ScholarDigital Library
C. Umans. Hardness of Approximating Sigma₂^p Minimization Problems. In Symp. on Foundations of Computer Science (FOCS), pages 465--474, 1999. Google ScholarDigital Library
S. van Dongen. Graph Clustering By Flow Simulation. PhD thesis, University of Utrecht, 2000.Google Scholar
J. A. Whitney. Graph Clustering With Overlap. Master's thesis, University of Toronto, 2006.Google Scholar
D. T. Wijaya and S. Bressan. Ricochet: A Family of Unconstrained Algorithms for Graph Clustering. In Proc. of the Int'l Conf. on Database Systems for Advanced Applications (DASFAA), pages 153--167, Brisbane, Australia, 2009. Google ScholarDigital Library
R. Xu and I. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645--678, 2005. Google ScholarDigital Library
J. Zupan. Clustering of Large Data Sets. Research Studies Press, 1982.Google Scholar

Index Terms

Framework for evaluating clustering algorithms in duplicate detection
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Document representation
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection
ON THE HORIZON, CHALLENGE PAPER, REGULAR PAPERS, and EXPERIENCE PAPER

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records ...
Read More
Duplicate Record Detection: A Survey

Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription ...
Read More
Evaluating indeterministic duplicate detection results
SUM'12: Proceedings of the 6th international conference on Scalable Uncertainty Management

Duplicate detection is an important process for cleaning or integrating data. Since real-life data is often polluted, detecting duplicates usually comes along with uncertainty. To handle duplicate uncertainty in an appropriate way, indeterministic ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 2, Issue 1
August 2009
1293 pages
ISSN:2150-8097
Editors:
Serge Abiteboul,
Tova Milo,
Jignesh Patel,
Philippe Rigaux
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2009
Published in pvldb Volume 2, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 67
  Total Citations
  View Citations
- 854
  Total Downloads
- Downloads (Last 12 months)45
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Framework for evaluating clustering algorithms in duplicate detection

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection

Duplicate Record Detection: A Survey

Evaluating indeterministic duplicate detection results

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Framework for evaluating clustering algorithms in duplicate detection

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection

Duplicate Record Detection: A Survey

Evaluating indeterministic duplicate detection results

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media