2018 | OriginalPaper | Chapter

Social Choice Theory Based Domain Specific Hindi Stop Words List Construction and Its Application in Text Mining

Authors : Ruby Rani, D. K. Lobiyal

Published in: Intelligent Human Computer Interaction

Publisher: Springer International Publishing


Abstract

In this paper, we attempt to create domain-specific Hindi stop word lists using statistical and knowledge-based techniques applied to textual corpora prepared for different domains. To remove the ranking bias of each individual technique, Borda’s rule, a vote-ranking method, is employed to construct an unbiased stop word list. We also propose a novel approach called netting ranked performance evaluation (NRPE) to evaluate the prepared stop word lists, in which stop words are removed in leading and trailing fashion based on the ascending and descending order of terms. Further, using combined band net (CBN) performance, we demonstrate the ability of each technique to identify candidate stop words, followed by the selection of features for text mining models. The experimental results show that a technique that selects good features for classification or clustering does not necessarily find good stop words. The results also show that the final Borda lists give normalized performance compared with the individual techniques. This approach guarantees candidate stop word removal, least information dissipation and text mining model performance enhancement.
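As an illustration of the rank aggregation step named in the abstract, the following is a minimal sketch of a standard Borda count over several ranked term lists. It is not the authors' implementation: the function name `borda_aggregate`, the three technique names, the sample terms, and the tie-breaking rule are all assumptions made for this example, and the chapter may score or normalize ranks differently.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine several ranked term lists with a plain Borda count.

    Each ranking lists terms from most to least stop-word-like.
    A term at position i in a list of n terms earns n - 1 - i points;
    a term missing from a list earns no points from that list.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, term in enumerate(ranking):
            scores[term] += n - 1 - position
    # Highest total first; ties broken alphabetically (an assumption,
    # chosen only to make the output deterministic).
    return sorted(scores, key=lambda term: (-scores[term], term))


# Hypothetical rankings produced by three different techniques.
by_frequency = ["का", "है", "में", "की"]
by_entropy = ["है", "का", "की", "में"]
by_idf = ["का", "में", "है", "की"]

combined = borda_aggregate([by_frequency, by_entropy, by_idf])
print(combined)
```

Because every technique contributes points by rank position alone, no single technique's scoring scale dominates the combined list, which is the sense in which a Borda-style aggregation can reduce the bias of any individual ranking.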

Metadata
Title
Social Choice Theory Based Domain Specific Hindi Stop Words List Construction and Its Application in Text Mining
Authors
Ruby Rani
D. K. Lobiyal
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-04021-5_12