Efficient clustering of high-dimensional data sets with application to reference matching

Authors:
Andrew McCallum
WhizBang! Labs - Research, 4616 Henry Street, Pittsburgh, PA and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Kamal Nigam
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Lyle H. Ungar
Computer and Info. Science, University of Pennsylvania, Philadelphia, PA

KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2000, Pages 169–178, https://doi.org/10.1145/347090.347123

Published: 01 August 2000

Index Terms

Efficient clustering of high-dimensional data sets with application to reference matching
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2000
537 pages
ISBN:1581132336
DOI:10.1145/347090
Chairmen:
Raghu Ramakrishnan, Univ. of Wisconsin
Sal Stolfo, Columbia Univ., New York, NY
Roberto Bayardo, IBM Almaden Research Center, San Jose, CA
Ismail Parsa, Epsilon
Copyright © 2000 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States