2019 | Original Paper | Book Chapter

Categorizing Emails Using Machine Learning with Textual Features

Authors: Haoran Zhang, Jagadish Rangrej, Saad Rais, Michael Hillmer, Frank Rudzicz, Kamil Malikov

Published in: Advances in Artificial Intelligence

Publisher: Springer International Publishing

Abstract

We developed an application that automatically assigns emails received in a generic request inbox to one of fourteen predefined topic categories. To build it, we compared several classifiers on a dataset of 8,841 emails collected from this inbox over three years. The algorithms ranged from linear classifiers operating on n-gram features to deep learning techniques such as CNNs and LSTMs. The best-performing individual model was a logistic regression classifier using n-grams with TF-IDF weights, which achieved 90.9% accuracy. The traditional models outperformed the deep learning models on this dataset, likely in part because the dataset is small, and also because this classification task may not require the ordered sequence representation of tokens that deep learning models provide. We ultimately selected a bagged voting model that combines the top eight models; it reached 92.7% accuracy, surpassing every individual model.
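
The two ideas the abstract leans on, n-gram features with TF-IDF weights and combining several models by voting, can be illustrated with a minimal sketch. It uses only the Python standard library and makes several assumptions that are not from the paper: the example emails and labels are hypothetical, n-grams are word unigrams and bigrams, and the weight is raw term count times `ln(N / df) + 1`, which is one common TF-IDF variant (the abstract does not state the exact scheme). Only the voting step is shown; bootstrap resampling and classifier training are omitted.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, n_max=2):
    """Compute one {ngram: weight} dict per document.

    Assumed variant: tf is the raw count, idf is ln(N / df) + 1.
    """
    counts_per_doc = [Counter(ngrams(d.lower().split(), n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for counts in counts_per_doc for g in counts)
    n_docs = len(docs)
    return [
        {g: tf * (math.log(n_docs / df[g]) + 1.0) for g, tf in counts.items()}
        for counts in counts_per_doc
    ]


def majority_vote(predictions):
    """Hard voting: return the label predicted by the most models.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical emails, for illustration only.
docs = [
    "please reset my password",
    "invoice for data request",
    "reset my account password",
]
weights = tfidf(docs)

# Hypothetical predictions from three already-trained models for one email.
label = majority_vote(["billing", "access", "billing"])
```

In this sketch an n-gram that occurs in a single email, such as `invoice`, receives a higher weight than one shared across emails, such as `reset my`, which is the property that makes TF-IDF features useful to a linear classifier. A production system would more likely rely on an existing library than on hand-written feature code.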

Metadata
Title: Categorizing Emails Using Machine Learning with Textual Features
Authors: Haoran Zhang, Jagadish Rangrej, Saad Rais, Michael Hillmer, Frank Rudzicz, Kamil Malikov
Copyright year: 2019
DOI: https://doi.org/10.1007/978-3-030-18305-9_1
