- 1. H. Akaike. On entropy maximization principle. Applications of Statistics, pages 27-41, 1977.
- 2. M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
- 3. P. S. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 4th International Conf. on Knowledge Discovery and Data Mining (KDD-98). AAAI Press, August 1998.
- 4. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183-1210, 1969.
- 5. J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software, 3(3):209-226, 1977.
- 6. C. L. Giles, K. D. Bollacker, and S. Lawrence. CiteSeer: An automatic citation indexing system. In Digital Libraries 98 - Third ACM Conference on Digital Libraries, 1998.
- 7. M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD, May 1995.
- 8. H. Hirsh. Integrating multiple sources of information in text classification using WHIRL. In Snowbird Learning Conference, April 2000.
- 9. J. Hylton. Identifying and merging related bibliographic records. MIT LCS Masters Thesis, 1996.
- 10. B. Kilss and W. Alvey, editors. Record Linkage Techniques-1985, 1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.fcsm.gov/.
- 11. A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 2000. To appear.
- 12. A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
- 13. A. Monge and C. Elkan. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
- 14. A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997.
- 15. A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Advances in Neural Information Processing Systems 11, 1999.
- 16. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954-959, 1959.
- 17. S. Omohundro. Five balltree construction algorithms. Technical report 89-063, International Computer Science Institute, Berkeley, California, 1989.
- 18. K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, 1998.
- 19. G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
- 20. M. Sankaran, S. Suresh, M. Wong, and D. Nesamoney. Method for incremental aggregation of dynamically increasing database data sets. U.S. Patent 5,794,246, 1998.
- 21. D. Sankoff and J. B. Kruskal. Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
- 22. J. W. Tukey and J. O. Pedersen. Method and apparatus for information access employing overlapping clusters. U.S. Patent 5,787,422, 1998.
- 23. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103-114, 1996.