Published in: International Journal of Data Science and Analytics 4/2018

15-03-2018 | Applications

An effective approach for semantic-based clustering and topic-based ranking of web documents

Author: Rajendra Kumar Roul



Abstract

On the large, dynamic and ever-expanding web, extracting the information a user query asks for is a significant problem for search engines. Clustering and ranking are two important techniques that can help address it. This study proposes a combined approach of semantic-based clustering and topic-based ranking of web documents. The proposed clustering approach combines latent semantic indexing (LSI) with the min-cut algorithm. To make the clustering more effective, a new feature selection method, called clustering-based feature selection, has been developed; it focuses on finding the feature set that captures the crux of the documents in the corpus without deteriorating the outcome of the construction process. While LSI completely overcomes the constraint of synonymy, the min-cut algorithm helps to generate efficient clusters at each stage of the clustering process. The number of clusters to form is decided using the silhouette coefficient, a measure that incorporates both the cohesion and the separation of clusters. To rank the documents in each semantic cluster, the proposed approach transforms the text into topics using latent Dirichlet allocation and then runs the inverted indexing technique on those topics. The 20-Newsgroups and DMOZ datasets are used for the experimental work, and the results show that the clustering approach performs better than traditional clustering approaches and that the ranking approach is promising.
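
The silhouette coefficient mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `silhouette` and `pick_best`, the toy 2-D points (standing in for documents already projected into a reduced space) and the candidate labelings are all assumptions made for illustration. For each point, `a` is the mean distance to the other members of its own cluster (cohesion), `b` is the smallest mean distance to any other cluster (separation), and the score is `(b - a) / max(a, b)`.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette(points, labels):
    """Mean silhouette coefficient of a labeling (hypothetical helper)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # Cohesion: mean distance to the other members of p's own cluster.
        # dist(p, p) == 0, so summing over the whole cluster is harmless.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation: smallest mean distance to the members of another cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def pick_best(points, candidates):
    """Return the name of the candidate labeling with the highest silhouette."""
    return max(candidates, key=lambda name: silhouette(points, candidates[name]))


# Toy data: two visibly separated groups of "documents" (assumed, not from the paper).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
candidates = {
    "natural split": [0, 0, 1, 1],
    "mixed split": [0, 1, 0, 1],
    "three clusters": [0, 0, 1, 2],
}
best = pick_best(points, candidates)
```

A score close to 1 indicates compact, well-separated clusters, and a negative score indicates points that sit closer to another cluster than to their own. Comparing the mean score across candidate clusterings is one common way to choose the number of clusters.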


Metadata
Title
An effective approach for semantic-based clustering and topic-based ranking of web documents
Author
Rajendra Kumar Roul
Publication date
15-03-2018
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics / Issue 4/2018
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-018-0112-3
