2018 | OriginalPaper | Chapter

Document Categorization Using Graph Structuring

Authors: Sandipan Sarma, Punyajoy Saha, Jaya Sil

Published in: Advanced Computational and Communication Paradigms

Publisher: Springer Singapore

Abstract

This paper proposes a document classification model that uses a feature learning approach (Coates, Demystifying unsupervised feature learning, 2012) [5] based on the semantics of the documents. In the learning phase, a novel procedure builds a basic vocabulary (BV) of nouns for each document class. The classification phase looks up unique words in the BVs; when a word is found, the sentence containing it becomes a basic sentence (BS). A tree built from the unique words of the BS is inserted into the forest of the respective class. Words associated with the children are then used to continue growing the tree until no new node is generated. Finally, the test document is assigned to the class whose forest holds a clearly dominant percentage of sentences. The proposed algorithm is compared with various feature-based classification models, and satisfactory performance is observed.
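The classification phase described in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation: the abstract does not define how words are "associated" or what counts as "clearly dominant", so the sketch assumes that two sentences are linked when they share a word, and that a class is dominant when its share of sentences exceeds the runner-up by a hypothetical `margin`. The names `grow_forest`, `classify`, and `margin` are illustrative only, and each sentence is assumed to be already reduced to a set of nouns.

```python
from typing import Dict, List, Optional, Set, Tuple


def grow_forest(sentences: List[Set[str]], vocabulary: Set[str]) -> Set[int]:
    """Return the indices of sentences absorbed into one class's forest.

    Assumption: a sentence is "basic" if it shares a word with the class
    vocabulary, and a further sentence joins the forest if it shares a
    word with any sentence already in the forest.
    """
    # Seed the forest with the basic sentences.
    in_forest = {i for i, words in enumerate(sentences) if words & vocabulary}
    known_words: Set[str] = set()
    for i in in_forest:
        known_words |= sentences[i]

    changed = True
    while changed:  # stop once a full pass adds no new node
        changed = False
        for i, words in enumerate(sentences):
            if i not in in_forest and words & known_words:
                in_forest.add(i)
                known_words |= words
                changed = True
    return in_forest


def classify(
    sentences: List[Set[str]],
    vocabularies: Dict[str, Set[str]],
    margin: float = 0.2,  # hypothetical dominance threshold
) -> Tuple[Optional[str], Dict[str, float]]:
    """Pick the class whose forest covers a clearly larger share of sentences."""
    total = max(len(sentences), 1)  # avoid division by zero on empty input
    coverage = {
        label: len(grow_forest(sentences, vocab)) / total
        for label, vocab in vocabularies.items()
    }
    ranked = sorted(coverage, key=lambda label: coverage[label], reverse=True)
    if not ranked:
        return None, coverage
    best = ranked[0]
    runner_up = coverage[ranked[1]] if len(ranked) > 1 else 0.0
    if coverage[best] - runner_up >= margin:
        return best, coverage
    return None, coverage  # no class is clearly dominant


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    document = [
        {"striker", "goal"},
        {"striker", "injury"},
        {"coach", "injury"},
        {"market", "shares"},
    ]
    vocabularies = {"sports": {"goal", "match"}, "finance": {"market", "bank"}}
    print(classify(document, vocabularies))
```

In the toy document, only the first sentence contains a word from the hypothetical sports vocabulary, but the second and third are pulled into the same forest through the shared words "striker" and "injury". A real pipeline would also need noun extraction and the paper's own BV construction, neither of which is specified in the abstract.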

Literature
1.
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
3. Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring Semantic Similarity Between Words Using Web Search Engines, Semantic Web (2007)
4. Bolshakov, I.A., Gelbukh, A.: Two methods of evaluation of semantic similarity. In: Kedad, Z., Lammari, N., Métais, E., Meziane, F., Rezgui, Y. (eds.) Natural Language Processing and Information Systems. NLDB 2007. Lecture Notes in Computer Science, vol. 4592 (2007)
5. Coates, A.: Demystifying Unsupervised Feature Learning (2012)
6. Cunningham, D.G.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of 23rd International Conference on Machine learning (ICML’06), pp. 377–384. ACM Press (2006)
7. Dao, N.T., Simpson, T.: Measuring Similarity Between Sentences
8. Fawcett, T.: ROC Graphs: Notes and Practical Considerations for Data Mining Researchers. Intelligent Enterprise Technologies Laboratory, HP Laboratories, Palo Alto (2003)
9. Gomaa, W.H., Fahmy, A.A.: A survey of text similarity approaches. Int. J. Comput. Appl. 68(13) (2013)
10. Gretarsson, B., et al.: Topicnets: visual analysis of large text corpora with topic modeling. ACM Trans. Intell. Syst. Technol. (TIST) 3(2), 23 (2012)
11. Lafferty, J.D., Blei, D.M.: Correlated topic models. In: Advances in Neural Information Processing Systems (2006)
12. Li, C.H., Park, S.C.: Artificial Neural Network for Document Classification Using Latent Semantics Indexing. IEEE, Joenju, South Korea (2007)
13. Ramaswamy, S.: Multiclass Text Classification: A Decision Tree Based SVM Approach. University of California, Berkeley
14. Trstenjaka, B., Mikac, S.: KNN with TF-IDF based framework for text categorization. Science Direct 69, 1356–1364 (2014)
15. Tsuda, K., Saigo, H.: Graph classification. In: Managing and Mining Graph Data, pp. 337–363 (2010)
Metadata
Title: Document Categorization Using Graph Structuring
Authors: Sandipan Sarma, Punyajoy Saha, Jaya Sil
Copyright Year: 2018
Publisher: Springer Singapore
DOI: https://doi.org/10.1007/978-981-10-8237-5_47