Published in: Knowledge and Information Systems 3/2017

13 May 2016 | Regular Paper

An effective and interpretable method for document classification

Authors: Ngo Van Linh, Nguyen Kim Anh, Khoat Than, Chien Nguyen Dang

Abstract

As the number of documents has grown rapidly in recent years, automatic text categorization has become an increasingly important and fundamental task in information retrieval and text mining. Accuracy and interpretability are two important aspects of a text classifier. While accuracy measures a classifier's ability to correctly classify unseen data, interpretability is its ability to be understood by humans and to provide reasons why each data instance is assigned to a label. This paper proposes an interpretable classification method that exploits the Dirichlet process mixture model of von Mises–Fisher distributions for directional data. By using the label information of the training data explicitly and automatically determining the number of topics for each class, the method learns topics that are coherent, relevant and discriminative. These topics help to interpret as well as distinguish classes. Our experimental results show the advantages of our approach in terms of separability, interpretability and effectiveness in classifying datasets with high dimensionality and complex distributions. Our method is highly competitive with state-of-the-art approaches.
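
For readers unfamiliar with the distribution named in the abstract, the standard von Mises–Fisher density on the unit hypersphere is recalled below. This is the textbook definition, not a formula quoted from the paper; the symbols x, \mu, \kappa, d and the normalizer C_d are notation chosen here for illustration.

```latex
% Textbook von Mises-Fisher density on the unit sphere in R^d.
% Notation (x, \mu, \kappa, d, C_d) is assumed here, not taken from the paper.
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},
\qquad
\lVert x \rVert = \lVert \mu \rVert = 1,\ \kappa > 0,
```

where I_{\nu} denotes the modified Bessel function of the first kind of order \nu. In a text setting, x would typically be a unit-length document vector (for example an L2-normalized term-weight vector, which is an assumption about preprocessing), \mu is the mean direction of a component and \kappa its concentration. A larger \kappa places more mass close to \mu, so in a mixture of such components each \mu can plausibly be read as the direction of one topic.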

Metadata
Title
An effective and interpretable method for document classification
Authors
Ngo Van Linh
Nguyen Kim Anh
Khoat Than
Chien Nguyen Dang
Publication date
13 May 2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 3/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-016-0956-6
