Discover Computing

01.12.2007

A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Author: Diego Reforgiato Recupero

Published in: Discover Computing | Issue 6/2007

Abstract

Text document clustering provides an effective and intuitive navigation mechanism for organizing a large number of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonymy and antonymy. Our unsupervised method solves both problems by using ANNIE and WordNet lexical categories and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.
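The idea in the abstract can be illustrated with a minimal sketch, which is not the author's implementation. It assumes a tiny hand-written lookup (`TOY_CATEGORIES`, hypothetical) in place of real WordNet and ANNIE queries, maps each document to a vector with one axis per lexical category, and then runs a plain bisecting k-means. All function names are illustrative, the farthest-point choice of the second seed is an assumption made here for robustness, and the Multipole tree is not shown.

```python
import random
from collections import Counter

# Hypothetical toy lookup standing in for WordNet lexical categories.
# A real pipeline would query WordNet (and ANNIE for named entities).
TOY_CATEGORIES = {
    "dog": "noun.animal", "cat": "noun.animal", "horse": "noun.animal",
    "car": "noun.artifact", "engine": "noun.artifact", "wheel": "noun.artifact",
}
CATEGORY_AXES = sorted(set(TOY_CATEGORIES.values()))


def to_vector(text):
    """Return the share of known words that fall into each lexical category."""
    counts = Counter(TOY_CATEGORIES[w] for w in text.lower().split()
                     if w in TOY_CATEGORIES)
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in CATEGORY_AXES]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sse(vectors, members):
    """Sum of squared distances of a cluster's members to its centroid."""
    center = centroid([vectors[i] for i in members])
    return sum(sq_dist(vectors[i], center) for i in members)


def two_means(vectors, members, rng, iterations=20):
    """k-means with k=2 on the given member indices; returns two index lists."""
    first = rng.choice(members)
    # Assumption of this sketch: seed the second center with the farthest point.
    second = max(members, key=lambda i: sq_dist(vectors[i], vectors[first]))
    centers = [vectors[first], vectors[second]]
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for i in members:
            d0 = sq_dist(vectors[i], centers[0])
            d1 = sq_dist(vectors[i], centers[1])
            halves[0 if d0 <= d1 else 1].append(i)
        if not halves[0] or not halves[1]:
            break
        centers = [centroid([vectors[i] for i in h]) for h in halves]
    return halves


def bisecting_kmeans(vectors, k, seed=0, trials=5):
    """Repeatedly split the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        splittable = [n for n, c in enumerate(clusters) if len(c) > 1]
        if not splittable:
            break
        target = clusters.pop(max(splittable, key=lambda n: sse(vectors, clusters[n])))
        best = None
        for _ in range(trials):
            left, right = two_means(vectors, target, rng)
            if left and right:
                score = sse(vectors, left) + sse(vectors, right)
                if best is None or score < best[0]:
                    best = (score, left, right)
        if best is None:  # the cluster could not be split (identical points)
            clusters.append(target)
            break
        clusters.extend(best[1:])
    return clusters


docs = [
    "the dog chased the cat",
    "a horse and a dog",
    "the car engine failed",
    "wheel and engine of the car",
    "the dog sat in the car with a cat",
]
vectors = [to_vector(d) for d in docs]
clusters = bisecting_kmeans(vectors, k=2)
print(sorted(clusters))  # document indices grouped by cluster
```

With only two category axes, the five toy documents become two-dimensional vectors, which is the kind of dimensionality reduction the abstract argues makes common clustering algorithms practical. In a realistic setting the number of axes would be the number of lexical categories used, not the size of the vocabulary.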

The pronouns recognized by ANNIE as valid are replaced with the entity they refer to; such words have been removed from \(\mathcal{D}\).

http://www.daviddlewis.com/resources/testcollections/reuters21578/

ftp://ftp.cs.cornell.edu/pub/smart

http://www.cnn.com, http://www.nytimes.com, http://www.usatoday.com

Title: A new unsupervised method for document clustering by using WordNet lexical and conceptual relations
Author: Diego Reforgiato Recupero
Publication date: 01.12.2007
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 6/2007
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-007-9035-7
