DOI: 10.1145/347090.347123

Efficient clustering of high-dimensional data sets with application to reference matching

Published: 1 August 2000

References

  1. H. Akaike. On entropy maximization principle. Applications of Statistics, pages 27-41, 1977.
  2. M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
  3. P. S. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 4th International Conf. on Knowledge Discovery and Data Mining (KDD-98). AAAI Press, August 1998.
  4. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183-1210, 1969.
  5. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software, 3(3):209-226, 1977.
  6. C. L. Giles, K. D. Bollacker, and S. Lawrence. CiteSeer: An automatic citation indexing system. In Digital Libraries 98 - Third ACM Conference on Digital Libraries, 1998.
  7. M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD, May 1995.
  8. H. Hirsh. Integrating multiple sources of information in text classification using WHIRL. In Snowbird Learning Conference, April 2000.
  9. J. Hylton. Identifying and merging related bibliographic records. MIT LCS Masters Thesis, 1996.
  10. B. Kilss and W. Alvey, editors. Record Linkage Techniques-1985, 1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.fcsm.gov/.
  11. A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 2000. To appear.
  12. A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
  13. A. Monge and C. Elkan. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
  14. A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997.
  15. A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 11, 1999.
  16. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954-959, 1959.
  17. S. Omohundro. Five balltree construction algorithms. Technical report 89-063, International Computer Science Institute, Berkeley, California, 1989.
  18. K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, 1998.
  19. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
  20. M. Sankaran, S. Suresh, M. Wong, and D. Nesamoney. Method for incremental aggregation of dynamically increasing database data sets. U.S. Patent 5,794,246, 1998.
  21. D. Sankoff and J. B. Kruskal. Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
  22. J. W. Tukey and J. O. Pedersen. Method and apparatus for information access employing overlapping clusters. U.S. Patent 5,787,422, 1998.
  23. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.

Published in

KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2000
537 pages
ISBN: 1581132336
DOI: 10.1145/347090

Copyright © 2000 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States