
2021 | OriginalPaper | Chapter

Scaled Document Clustering and Word Cloud-Based Summarization on Hindi Corpus

Authors: Prafulla B. Bafna, Jatinderkumar R. Saini

Published in: Progress in Advanced Computing and Intelligent Engineering

Publisher: Springer Singapore


Abstract

Managing large collections of text documents is a critical task that underpins applications ranging from information retrieval to the clustering of search engine results. Multilingual support on websites has made Hindi a major language in today's digital domain. This work addresses document management for a large corpus through document clustering and summarization of the resulting clusters. The objective is to overcome the scalability problem in managing documents and to summarize the Hindi corpus by extracting tokens. The proposed approach improves on scalability and maintains consistent cluster quality as the dataset grows incrementally. Most past and contemporary research has targeted document management for English corpora. Hindi corpora have mostly been used to study stemming, single-document summarization, and classifier design. Applying unsupervised learning to a Hindi corpus to summarize multiple documents through a word cloud remains an unexplored area. Technically, the work applies TF-IDF, cosine-based document similarity measures, and cluster dendrograms, along with various other Natural Language Processing (NLP) steps. Entropy and precision are used to evaluate experiments carried out on different live and available (previously tested) datasets, and the results demonstrate the robustness of the proposed approach for Hindi corpora.
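The techniques named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, average linkage, the four toy documents, and the topic labels are all assumptions made for illustration. It computes TF-IDF vectors, merges documents bottom-up by cosine similarity (the merge order is what a dendrogram would draw), ranks the tokens a word cloud would show largest, and scores the clusters with size-weighted entropy.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document (whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption; the chapter's exact formula is not given).
        vectors.append({t: (c / len(toks)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering on cosine similarity."""
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the most similar pair of clusters.
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def top_tokens(vectors, cluster, k=2):
    """Tokens with the highest summed TF-IDF weight in a cluster."""
    totals = Counter()
    for i in cluster:
        totals.update(vectors[i])  # adds the weights token by token
    return [tok for tok, _ in totals.most_common(k)]

def cluster_entropy(clusters, labels):
    """Size-weighted entropy of true labels per cluster (0.0 means pure)."""
    total = sum(len(c) for c in clusters)
    score = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        score += (len(c) / total) * h
    return score

# Hypothetical toy corpus: two cricket documents, two weather documents.
docs = [
    "क्रिकेट मैच भारत जीत",
    "भारत क्रिकेट टीम मैच",
    "बारिश मौसम बादल",
    "मौसम बारिश तूफान",
]
labels = ["sports", "sports", "weather", "weather"]  # hypothetical ground truth

vectors = tfidf_vectors(docs)
clusters = agglomerative(vectors, n_clusters=2)
for c in clusters:
    print(sorted(c), top_tokens(vectors, c))
print("entropy:", cluster_entropy(clusters, labels))
```

On a real corpus the whitespace split would be replaced by Hindi-specific preprocessing such as stop-word removal and stemming, and the pairwise similarity matrix, which grows quadratically with the number of documents, is the part a scalable design has to address.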


Metadata
Title
Scaled Document Clustering and Word Cloud-Based Summarization on Hindi Corpus
Authors
Prafulla B. Bafna
Jatinderkumar R. Saini
Copyright Year
2021
Publisher
Springer Singapore
DOI
https://doi.org/10.1007/978-981-15-6353-9_36