skip to main content
10.1145/584792.584877acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

Evaluation of hierarchical clustering algorithms for document datasets

Published:04 November 2002Publication History

ABSTRACT

Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.In this paper we evaluate different partitional and agglomerative approaches for hierarchical clustering. Our experimental evaluation showed that partitional algorithms always lead to better clustering solutions than agglomerative algorithms, which suggests that partitional clustering algorithms are well-suited for clustering large document datasets due to not only their relatively low computational requirements, but also comparable or even better clustering performance. We present a new class of clustering algorithms called constrained agglomerative algorithms that combine the features of both partitional and agglomerative algorithms. Our experimental results showed that they consistently lead to better hierarchical solutions than agglomerative or partitional algorithms alone.

References

  1. C. C. Aggarwal, S. C. Gates, and P. S. Yu. On the merits of building categorization systems by supervised clustering. In Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 352--356, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Document categorization and query generation on the world wide web using WebACE. AI Review), 11:365--391, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Boley, M. Gini, R. Gross, E. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems (accepted for publication), 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Cheeseman and J. Stutz. Baysian classification (autoclass): Theory and results. In U. Fayyad, G. Piatetsky-Shapiro, P. Smith, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153--180. AAAI/MIT Press, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Cutting, J. Pedersen, D. Karger, and J. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the ACM SIGIR, pages pages 318--329, Copenhagen, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. I. Dhillon and D. Modha. Concept decomposition for large sparse text data using clustering. Technical Report Research Report RJ 10147, IBM Almadan Research Center, 1999.]]Google ScholarGoogle Scholar
  8. C. Ding, X. He, H. Zha, M. Gu, and H. Simon. Spectral min-max cut for graph partitioning and data clustering. Technical Report TR-2001-XX, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, 2001.]]Google ScholarGoogle Scholar
  9. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. of 1998 ACM-SIGMOD Int. Conf. on Management of Data, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Guha, R. Rastogi, and K. Shim. ROCK: a robust clustering algorithm for categorical attributes. In Proc. of the 15th Int'l Conf. on Data Eng., 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploartion. In Proc. of the 2nd International Conference on Autonomous Agents, May 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. E. Han, G. Karypis, V. Kumar, and B. Mobasher. Hypergraph based clustering in high dimensional data sets: A summary of results. Bulletin of the Technical Committee on Data Engineering, 21(1), 1998.]]Google ScholarGoogle Scholar
  14. A. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. G. Karypis and E. Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Technical Report TR-00-016, Department of Computer Science, University of Minnesota, Minneapolis, 2000. Available on the WWW at URL http://www.cs.umn.edu/~karypis.]]Google ScholarGoogle ScholarCross RefCross Ref
  16. G. Karypis, E. Han, and V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68--75, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86--101, 1967.]]Google ScholarGoogle ScholarCross RefCross Ref
  18. B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proc. of the Fifth ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pages 16--22, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. D. Lewis. Reuters-21578 text categorization test collection distribution 1.0. http://www.research.att.com/~lewis, 1999.]]Google ScholarGoogle Scholar
  20. J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Symp. Math. Statist, Prob., pages 281--297, 1967.]]Google ScholarGoogle Scholar
  21. J. Moore, E. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, and B. Mobasher. Web page categorization and feature selection using association rule and principal component clustering. In 7th Workshop on Information Technologies and Systems, Dec. 1997.]]Google ScholarGoogle Scholar
  22. R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. of the 20th VLDB Conference, pages 144--155, Santiago, Chile, 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130--137, 1980.]]Google ScholarGoogle ScholarCross RefCross Ref
  24. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. P. H. Sneath and R. R. Sokal. Numerical Taxonomy. Freeman, London, UK, 1973.]]Google ScholarGoogle Scholar
  26. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.]]Google ScholarGoogle Scholar
  27. A. Strehl and J. Ghosh. Scalable approach to balanced, high-dimensional clustering of market-baskets. In Proceedings of HiPC, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. TREC. Text REtrieval conference. http://trec.nist.gov, 1999.]]Google ScholarGoogle Scholar
  30. Yahoo! Yahoo! http://www.yahoo.com.]]Google ScholarGoogle Scholar
  31. K. Zahn. Graph-tehoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, (C-20):68--86, 1971.]]Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01--40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001. Available on the WWW at http://cs.umn.edu/~karypis/publications.]]Google ScholarGoogle Scholar

Index Terms

  1. Evaluation of hierarchical clustering algorithms for document datasets

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
        November 2002
        704 pages
        ISBN:1581134924
        DOI:10.1145/584792

        Copyright © 2002 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 November 2002

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        Overall Acceptance Rate1,861of8,427submissions,22%

        Upcoming Conference

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader