Published in: Discover Computing 3/2009

1 June 2009

Using the Web as corpus for self-training text categorization

Authors: Rafael Guzmán-Cabrera, Manuel Montes-y-Gómez, Paolo Rosso, Luis Villaseñor-Pineda


Abstract

Most current methods for automatic text categorization rely on supervised learning and therefore require a large number of training instances to construct an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization that automatically extracts unlabeled examples from the Web and applies an enriched self-training approach to construct the classifier. Although language independent, the method is especially pertinent in scenarios where large sets of labeled resources do not exist, as may be the case for several application domains in non-English languages such as Spanish. The method was evaluated experimentally on three different tasks and in two different languages. The results demonstrate the applicability and usefulness of the proposed method.
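
The abstract names the technique but not its details, so the following is only a minimal sketch of a generic self-training loop, not the authors' implementation. It assumes a tiny naive Bayes classifier, a confidence threshold, and an acceptance rule under which a Web example is added to the training set only when the classifier's prediction agrees with the web-based label that each downloaded example carries (footnote 2). The names `NaiveBayes`, `self_train`, `threshold` and `max_iter`, as well as the agreement rule itself, are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, doc):
        total = sum(self.prior.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / total)
            for w in tokenize(doc):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.counts[c][w] + 1) / denom)
            log_scores[c] = score
        # Convert log scores to normalized probabilities.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def self_train(labeled, web_examples, threshold=0.6, max_iter=5):
    """labeled: [(text, label)]; web_examples: [(text, web_label)].

    Assumed acceptance rule: keep a Web example only if the predicted class
    equals its web-based label and the prediction is confident enough.
    """
    train = list(labeled)
    pool = list(web_examples)
    model = NaiveBayes().fit(*zip(*train))
    for _ in range(max_iter):
        accepted, remaining = [], []
        for text, web_label in pool:
            probs = model.predict_proba(text)
            predicted = max(probs, key=probs.get)
            if predicted == web_label and probs[predicted] >= threshold:
                accepted.append((text, predicted))
            else:
                remaining.append((text, web_label))
        if not accepted:
            break  # nothing new to learn from; stop iterating
        train.extend(accepted)
        pool = remaining
        model = NaiveBayes().fit(*zip(*train))  # retrain on the enlarged set
    return model, train


if __name__ == "__main__":
    labeled = [("goal match football team", "sports"),
               ("election vote parliament government", "politics")]
    web = [("football team wins match", "sports"),
           ("government calls election", "politics"),
           ("parliament vote football", "sports")]  # noisy web-based label
    model, train = self_train(labeled, web)
    print(len(train), model.predict_proba("football match"))
```

In this toy run the third Web example keeps its noisy label out of the training set because the classifier disagrees with it, which is the kind of filtering the agreement rule is meant to illustrate.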


Footnotes
1
The Nora project (www.noraproject.org) and the Monk project (www.monkproject.org) are two research efforts related to this kind of task.
 
2
Given that each unlabeled example is downloaded from the Web using a set of automatically defined class queries, each of them has a default category or web-based label.
 
Metadata
Title
Using the Web as corpus for self-training text categorization
Authors
Rafael Guzmán-Cabrera
Manuel Montes-y-Gómez
Paolo Rosso
Luis Villaseñor-Pineda
Publication date
1 June 2009
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-008-9083-7
