Article

Evaluation of hierarchical clustering algorithms for document datasets

Authors:
Ying Zhao

University of Minnesota, Minneapolis, MN

University of Minnesota, Minneapolis, MN
View Profile

,
George Karypis

University of Minnesota, Minneapolis, MN

University of Minnesota, Minneapolis, MN
View Profile

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge managementNovember 2002Pages 515–524https://doi.org/10.1145/584792.584877

Published:04 November 2002Publication History

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management

Pages 515–524

ABSTRACT

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.

References

C. C. Aggarwal, S. C. Gates, and P. S. Yu. On the merits of building categorization systems by supervised clustering. In Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 352--356, 1999.]] Google ScholarDigital Library
D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 1998.]] Google ScholarDigital Library
D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the world wide web using WebACE. AI Review), 11:365--391, 1999.]] Google ScholarDigital Library
D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems (accepted for publication), 1999.]] Google ScholarDigital Library
P. Cheeseman and J. Stutz. Baysian classification (autoclass): Theory and results. In U. Fayyad, G. Piatetsky-Shapiro, P. Smith, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153--180. AAAI/MIT Press, 1996.]] Google ScholarDigital Library
D. Cutting, J. Pedersen, D. Karger, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR, pages pages 318--329, Copenhagen, 1992.]] Google ScholarDigital Library
I. Dhillon and D. Modha. Concept decomposition for large sparse text data using clustering. Technical Report Research Report RJ 10147, IBM Almadan Research Center, 1999.]]Google Scholar
C. Ding, X. He, H. Zha, M. Gu, and H. Simon. Spectral min-max cut for graph partitioning and data clustering. Technical Report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, 2001.]]Google Scholar
R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2001.]] Google ScholarDigital Library
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of 1998 ACM-SIGMOD Int. Conf. on Management of Data, 1998.]] Google ScholarDigital Library
S. Guha, R. Rastogi, and K. Shim. ROCK: a robust clustering algorithm for categorical attributes. In Proc. of the 15th Int'l Conf. on Data Eng., 1999.]] Google ScholarDigital Library
E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploartion. In Proc. of the 2nd International Conference on Autonomous Agents, May 1998.]] Google ScholarDigital Library
E. Han, G. Karypis, V. Kumar, and B. Mobasher. Hypergraph based clustering in high dimensional data sets: A summary of results. Bulletin of the Technical Committee on Data Engineering, 21(1), 1998.]]Google Scholar
A. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.]] Google ScholarDigital Library
G. Karypis and E. Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Technical Report TR-00-016, Department of Computer Science, University of Minnesota, Minneapolis, 2000. Available on the WWW at URL http://www.cs.umn.edu/~karypis.]]Google ScholarCross Ref
G. Karypis, E. Han, and V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68--75, 1999.]] Google ScholarDigital Library
B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86--101, 1967.]]Google ScholarCross Ref
B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 16--22, 1999.]] Google ScholarDigital Library
D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 1999.]]Google Scholar
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Symp. Math. Statist, Prob., pages 281--297, 1967.]]Google Scholar
J. Moore, E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher. Web page categorization and feature selection using association rule and principal component clustering. In 7th Workshop on Information Technologies and Systems, Dec. 1997.]]Google Scholar
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. of the 20th VLDB Conference, pages 144--155, Santiago, Chile, 1994.]] Google ScholarDigital Library
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.]]Google ScholarCross Ref
G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.]] Google ScholarDigital Library
P. H. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.]]Google Scholar
M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.]]Google Scholar
A. Strehl and J. Ghosh. Scalable approach to balanced, high-dimensional clustering of market-baskets. In Proceedings of HiPC, 2000.]] Google ScholarDigital Library
S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.]] Google ScholarDigital Library
TREC. Text REtrieval conference. http://trec.nist.gov, 1999.]]Google Scholar
Yahoo! Yahoo! http://www.yahoo.com.]]Google Scholar
K. Zahn. Graph-tehoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, (C-20):68--86, 1971.]]Google ScholarDigital Library
Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01--40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001. Available on the WWW at http://cs.umn.edu/~karypis/publications.]]Google Scholar

Index Terms

Evaluation of hierarchical clustering algorithms for document datasets
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Hierarchical Clustering Algorithms for Document Datasets

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, clustering ...
Read More
Eliminating Error Accumulation in Hierarchical Clustering Algorithms
EIDWT '13: Proceedings of the 2013 Fourth International Conference on Emerging Intelligent Data and Web Technologies

Hierarchical agglomerative clustering treats given data as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all data. However, if two ...
Read More
Semi-supervised Hierarchical Clustering
ICDM '11: Proceedings of the 2011 IEEE 11th International Conference on Data Mining

Semi-supervised clustering (i.e., clustering with knowledge-based constraints) has emerged as an important variant of the traditional clustering paradigms. However, most existing semi-supervised clustering algorithms are designed for partitional ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
November 2002
704 pages
ISBN:1581134924
DOI:10.1145/584792
General Chair:
Charles Nicholas
University of Maryland Baltimore County
,
Program Chairs:
David Grossman
Illinois Institute of Technology
,
Konstantinos Kalpakis
University of Maryland Baltimore County
,
Sajda Qureshi
Erasmus University, Rotterdam
,
Han van Dissel
Erasmus University, Rotterdam
,
Len Seligman
The MITRE Corporation
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 November 2002
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
agglomerative clustering
hierarchical clustering
partitional clustering
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 244
  Total Citations
  View Citations
- 4,874
  Total Downloads
- Downloads (Last 12 months)198
- Downloads (Last 6 weeks)37
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Evaluation of hierarchical clustering algorithms for document datasets

CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Hierarchical Clustering Algorithms for Document Datasets

Eliminating Error Accumulation in Hierarchical Clustering Algorithms

Semi-supervised Hierarchical Clustering