2021 | OriginalPaper | Chapter

Scaled Document Clustering and Word Cloud-Based Summarization on Hindi Corpus

Authors : Prafulla B. Bafna, Jatinderkumar R. Saini

Published in: Progress in Advanced Computing and Intelligent Engineering

Publisher: Springer Singapore

Abstract

Managing a large number of textual documents is a critical task that supports many applications, ranging from information retrieval to clustering search engine results. The multilingual support offered by websites has made Hindi a major language in the digital domain. This work focuses on document management through document clustering for a large corpus and summarization of the resulting clusters. The objective is to overcome the scalability problem in managing documents and to summarize the Hindi corpus by extracting tokens. The approach improves scalability and maintains consistent cluster quality as the dataset grows incrementally. Most past and contemporary research has targeted document management for English corpora. Hindi corpora have mostly been used to explore stemming, single-document summarization, and classifier design. Applying unsupervised learning to a Hindi corpus to summarize multiple documents through word clouds remains unexplored. Technically, the work applies TF-IDF, cosine-based document similarity measures, and cluster dendrograms, along with various other Natural Language Processing (NLP) activities. Entropy and precision are used to evaluate the experiments carried out on different live and available/tested datasets, and the results demonstrate the robustness of the proposed approach for Hindi corpora.
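
The pipeline named in the abstract (TF-IDF weighting, cosine similarity, hierarchical clustering, and word-cloud summarization) can be illustrated with a minimal sketch. This is not the authors' implementation. The whitespace tokenization, the length-normalized TF with logarithmic IDF, the average-linkage merge rule, the toy documents, and all function names (`tfidf`, `cosine`, `agglomerate`, `cloud_weights`) are assumptions made for illustration only. The sequence of merges performed by `agglomerate` is what a cluster dendrogram would record.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering that stops at k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        # Mean pairwise cosine similarity between two clusters.
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [(p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        x, y = max(pairs, key=lambda pq: link(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return [sorted(c) for c in clusters]


def cloud_weights(docs, cluster, top=3):
    """Most frequent terms in a cluster; these would set font sizes in a word cloud."""
    counts = Counter(term for i in cluster for term in docs[i])
    return counts.most_common(top)


# Toy corpus (invented for this sketch): two cricket documents, two weather documents.
docs = [
    "क्रिकेट मैच भारत जीत".split(),
    "भारत क्रिकेट टीम मैच".split(),
    "बारिश मौसम तापमान".split(),
    "मौसम बारिश बाढ़".split(),
]
clusters = agglomerate(tfidf(docs), k=2)
for cluster in clusters:
    print(cluster, cloud_weights(docs, cluster))
```

A real Hindi pipeline would typically add stop-word removal and stemming before weighting, and the pairwise similarity search above is quadratic in the number of clusters at every step, so it does not address the scalability concern the abstract raises.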

Title: Scaled Document Clustering and Word Cloud-Based Summarization on Hindi Corpus
Authors: Prafulla B. Bafna, Jatinderkumar R. Saini
Publisher: Springer Singapore
Book: Progress in Advanced Computing and Intelligent Engineering
Print ISBN: 978-981-15-6352-2
Electronic ISBN: 978-981-15-6353-9
Copyright Year: 2021
DOI: https://doi.org/10.1007/978-981-15-6353-9_36
