Published in: Discover Computing 6/2007

1 December 2007

A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Author: Diego Reforgiato Recupero


Abstract

Text document clustering provides an effective and intuitive navigation mechanism for organizing a large number of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonymy and antonymy. Our unsupervised method addresses both problems by using ANNIE and WordNet lexical categories and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we chose bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.
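As a rough illustration of the idea in the abstract, the sketch below maps each document to a short vector of lexical-category counts and then clusters those vectors with a simple bisecting k-means. The `CATEGORY_OF` table is a hypothetical hand-made stand-in for a WordNet lookup (its labels only mimic WordNet lexicographer file names). ANNIE and the Multipole tree are not modeled, and the seeding rule and the choice to split the largest cluster first are assumptions made for this example rather than details taken from the paper.

```python
# Hypothetical toy lookup standing in for WordNet lexical categories.
CATEGORY_OF = {
    "dog": "noun.animal", "cat": "noun.animal", "horse": "noun.animal",
    "bread": "noun.food", "cheese": "noun.food", "soup": "noun.food",
    "run": "verb.motion", "walk": "verb.motion",
}
CATEGORIES = sorted(set(CATEGORY_OF.values()))


def to_vector(text):
    """Count how many words of the text fall into each lexical category."""
    counts = dict.fromkeys(CATEGORIES, 0)
    for word in text.lower().split():
        category = CATEGORY_OF.get(word)
        if category is not None:
            counts[category] += 1
    return [float(counts[c]) for c in CATEGORIES]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def two_means(members, vectors, iterations=20):
    """Split the index list `members` into two groups with plain 2-means."""
    points = [vectors[i] for i in members]
    centroid = mean(points)
    # Assumed deterministic seeding: the point farthest from the centroid,
    # then the point farthest from that first seed.
    first = max(points, key=lambda p: sq_dist(p, centroid))
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for i in members:
            d0 = sq_dist(vectors[i], centers[0])
            d1 = sq_dist(vectors[i], centers[1])
            (left if d0 <= d1 else right).append(i)
        if not left or not right:
            # Degenerate case (e.g. identical points): arbitrary even split.
            half = len(members) // 2
            return members[:half], members[half:]
        centers = [mean([vectors[i] for i in left]),
                   mean([vectors[i] for i in right])]
    return left, right


def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:  # nothing left to split
            clusters.append(largest)
            break
        clusters.extend(two_means(largest, vectors))
    return clusters


if __name__ == "__main__":
    docs = ["dog cat horse", "cat dog dog", "bread cheese soup",
            "soup bread", "run walk run", "walk run"]
    vectors = [to_vector(d) for d in docs]
    for cluster in bisecting_kmeans(vectors, 3):
        print([docs[i] for i in sorted(cluster)])
```

Each document here becomes a vector with three components instead of one component per distinct word, which is the dimensionality reduction the abstract describes. Synonyms or related words that share a category also contribute to the same component.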


Footnotes
1
The pronouns recognized by ANNIE as valid are replaced with the entities they refer to; such words have been removed from \(\mathcal{D}\).
 
Metadata
Title
A new unsupervised method for document clustering by using WordNet lexical and conceptual relations
Author
Diego Reforgiato Recupero
Publication date
1 December 2007
Publisher
Springer Netherlands
Published in
Discover Computing, Issue 6/2007
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-007-9035-7
