Using the Web as corpus for self-training text categorization

Authors: Rafael Guzmán-Cabrera, Manuel Montes-y-Gómez, Paolo Rosso, Luis Villaseñor-Pineda

Published in: Discover Computing | Issue 3/2009 (1 June 2009)

Abstract

Most current methods for automatic text categorization are based on supervised learning techniques and therefore require a large number of training instances to construct an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization that automatically extracts unlabeled examples from the Web and applies an enriched self-training approach to construct the classifier. Although the method is language independent, it is most pertinent for scenarios where large sets of labeled resources do not exist, as may be the case for several application domains in non-English languages such as Spanish. The method was evaluated experimentally on three different tasks and in two different languages. The results demonstrate its applicability and usefulness.
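To make the idea of self-training concrete, the following is a minimal sketch of a generic self-training loop, not the authors' implementation. It assumes a small multinomial Naive Bayes classifier written from scratch, whitespace tokenization, and a fixed confidence threshold; the function names (`train_nb`, `predict_proba`, `self_train`) and the parameters `threshold` and `max_rounds` are hypothetical. Retrieving the unlabeled examples from the Web and the paper's specific enrichment of self-training are not shown.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model (counts only; smoothing is applied at prediction)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # assumption: whitespace tokenization
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return normalized class probabilities using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    peak = max(log_scores.values())  # subtract the maximum for numerical stability
    exp = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Iteratively move confidently classified unlabeled texts into the training set."""
    docs = [text for text, _ in labeled]
    labels = [label for _, label in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)
        confident, remaining = [], []
        for text in pool:
            probs = predict_proba(model, text)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                confident.append((text, best))
            else:
                remaining.append(text)
        if not confident:  # stop when no new example is confident enough
            break
        for text, label in confident:
            docs.append(text)
            labels.append(label)
        pool = remaining
    return train_nb(docs, labels)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    labeled = [
        ("flood river rain damage", "disaster"),
        ("earthquake damage rescue", "disaster"),
        ("match goal team win", "sports"),
        ("team player score goal", "sports"),
    ]
    unlabeled = ["rain flood rescue", "goal player win", "damage team"]
    model = self_train(labeled, unlabeled)
    print(predict_proba(model, "flood rescue"))
```

The confidence threshold is the main design choice in a loop like this: a higher value adds fewer but more reliable examples per round, while a lower value grows the training set faster at the risk of reinforcing early mistakes.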

The Nora project (www.noraproject.org) and the Monk project (www.monkproject.org) are two research efforts related to this kind of task.

Given that each unlabeled example is downloaded from the Web using a set of automatically defined class queries, each of them has a default category or web-based label.

http://www.google.com/apis.

http://ccc.inaoep.mx/~mmontesg/resources/Desastres.sgm.

http://www.reforma.com.

http://www.daviddlewis.com/resources/testcollections/reuters21578.

http://ccc.inaoep.mx/~mmontesg/resources/Poetas.sgm.

Title: Using the Web as corpus for self-training text categorization
Authors: Rafael Guzmán-Cabrera, Manuel Montes-y-Gómez, Paolo Rosso, Luis Villaseñor-Pineda
Publication date: 1 June 2009
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-008-9083-7