ABSTRACT
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.
- C. C. Aggarwal, S. C. Gates, and P. S. Yu. On the merits of building categorization systems by supervised clustering. In Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 352--356, 1999.]] Google ScholarDigital Library
- D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 1998.]] Google ScholarDigital Library
- D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the world wide web using WebACE. AI Review), 11:365--391, 1999.]] Google ScholarDigital Library
- D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems (accepted for publication), 1999.]] Google ScholarDigital Library
- P. Cheeseman and J. Stutz. Baysian classification (autoclass): Theory and results. In U. Fayyad, G. Piatetsky-Shapiro, P. Smith, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153--180. AAAI/MIT Press, 1996.]] Google ScholarDigital Library
- D. Cutting, J. Pedersen, D. Karger, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR, pages pages 318--329, Copenhagen, 1992.]] Google ScholarDigital Library
- I. Dhillon and D. Modha. Concept decomposition for large sparse text data using clustering. Technical Report Research Report RJ 10147, IBM Almadan Research Center, 1999.]]Google Scholar
- C. Ding, X. He, H. Zha, M. Gu, and H. Simon. Spectral min-max cut for graph partitioning and data clustering. Technical Report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, 2001.]]Google Scholar
- R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2001.]] Google ScholarDigital Library
- S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of 1998 ACM-SIGMOD Int. Conf. on Management of Data, 1998.]] Google ScholarDigital Library
- S. Guha, R. Rastogi, and K. Shim. ROCK: a robust clustering algorithm for categorical attributes. In Proc. of the 15th Int'l Conf. on Data Eng., 1999.]] Google ScholarDigital Library
- E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploartion. In Proc. of the 2nd International Conference on Autonomous Agents, May 1998.]] Google ScholarDigital Library
- E. Han, G. Karypis, V. Kumar, and B. Mobasher. Hypergraph based clustering in high dimensional data sets: A summary of results. Bulletin of the Technical Committee on Data Engineering, 21(1), 1998.]]Google Scholar
- A. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.]] Google ScholarDigital Library
- G. Karypis and E. Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Technical Report TR-00-016, Department of Computer Science, University of Minnesota, Minneapolis, 2000. Available on the WWW at URL http://www.cs.umn.edu/~karypis.]]Google ScholarCross Ref
- G. Karypis, E. Han, and V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68--75, 1999.]] Google ScholarDigital Library
- B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86--101, 1967.]]Google ScholarCross Ref
- B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 16--22, 1999.]] Google ScholarDigital Library
- D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 1999.]]Google Scholar
- J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Symp. Math. Statist, Prob., pages 281--297, 1967.]]Google Scholar
- J. Moore, E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher. Web page categorization and feature selection using association rule and principal component clustering. In 7th Workshop on Information Technologies and Systems, Dec. 1997.]]Google Scholar
- R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. of the 20th VLDB Conference, pages 144--155, Santiago, Chile, 1994.]] Google ScholarDigital Library
- M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.]]Google ScholarCross Ref
- G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.]] Google ScholarDigital Library
- P. H. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.]]Google Scholar
- M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.]]Google Scholar
- A. Strehl and J. Ghosh. Scalable approach to balanced, high-dimensional clustering of market-baskets. In Proceedings of HiPC, 2000.]] Google ScholarDigital Library
- S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.]] Google ScholarDigital Library
- TREC. Text REtrieval conference. http://trec.nist.gov, 1999.]]Google Scholar
- Yahoo! Yahoo! http://www.yahoo.com.]]Google Scholar
- K. Zahn. Graph-tehoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, (C-20):68--86, 1971.]]Google ScholarDigital Library
- Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01--40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001. Available on the WWW at http://cs.umn.edu/~karypis/publications.]]Google Scholar
Index Terms
- Evaluation of hierarchical clustering algorithms for document datasets
Recommendations
Hierarchical Clustering Algorithms for Document Datasets
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering ...
Eliminating Error Accumulation in Hierarchical Clustering Algorithms
EIDWT '13: Proceedings of the 2013 Fourth International Conference on Emerging Intelligent Data and Web TechnologiesHierarchical agglomerative clustering treats given data as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data. However, if two ...
Semi-supervised Hierarchical Clustering
ICDM '11: Proceedings of the 2011 IEEE 11th International Conference on Data MiningSemi-supervised clustering (i.e., clustering with knowledge-based constraints) has emerged as an important variant of the traditional clustering paradigms. However, most existing semi-supervised clustering algorithms are designed for partitional ...
Comments