International Journal of Data Science and Analytics

15-03-2018 | Applications

An effective approach for semantic-based clustering and topic-based ranking of web documents

Author: Rajendra Kumar Roul

Published in: International Journal of Data Science and Analytics | Issue 4/2018

Abstract

On the large, dynamic and ever-expanding web, extracting the information a user query asks for is a significant problem for search engines. Clustering and ranking are two important techniques that can help address it. This study proposes a combined approach of semantic-based clustering and topic-based ranking of web documents. The proposed clustering approach combines latent semantic indexing (LSI) with the min-cut algorithm. To make the clustering more effective, a new feature selection method called clustering-based feature selection has been developed; it focuses on finding the feature set that captures the crux of the documents in the corpus without degrading the outcome of the construction process. While LSI completely overcomes the constraint of synonymy, the min-cut algorithm helps to generate efficient clusters at each stage of the clustering process. The number of clusters to form is decided using the silhouette coefficient, a measure that incorporates both the cohesion and the separation of clusters. To rank the documents in each semantic cluster, the proposed approach transforms the text into topics using latent Dirichlet allocation (LDA) and then runs the inverted indexing technique on those topics. The 20-Newsgroups and DMOZ datasets are used for the experiments, and the results show that the clustering approach performs better than the traditional clustering approaches and that the ranking approach is promising.
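The abstract states that the silhouette coefficient decides how many clusters to form. The following is a minimal sketch of that idea only, not the paper's implementation: it uses the standard silhouette definition with Euclidean distance, and the toy points, the candidate labelings, and the names `silhouette_score` and `candidate_labelings` are all assumptions made for illustration. In the paper's setting the points would presumably be document vectors and the labelings would come from the LSI and min-cut stage.

```python
from math import dist          # Euclidean distance, Python 3.8+
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (standard definition)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention: a singleton cluster scores 0
            continue
        # Cohesion a: mean distance from p to the other members of its cluster
        # (dist(p, p) == 0, so dividing by len(own) - 1 excludes p itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Separation b: smallest mean distance from p to any other cluster.
        b = min(mean(dist(p, q) for q in members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Hypothetical toy data: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Hypothetical labelings for two candidate cluster counts.
candidate_labelings = {
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}

# Keep the cluster count whose labeling has the highest mean silhouette.
best_k = max(candidate_labelings,
             key=lambda k: silhouette_score(points, candidate_labelings[k]))
```

A labeling scores close to 1 when clusters are compact and well separated, close to 0 when they overlap, and below 0 when points sit nearer to another cluster than to their own, which is why the maximum is a reasonable selection rule under these assumptions.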

Title: An effective approach for semantic-based clustering and topic-based ranking of web documents
Author: Rajendra Kumar Roul
Publication date: 15-03-2018
Publisher: Springer International Publishing
Published in: International Journal of Data Science and Analytics / Issue 4/2018
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI: https://doi.org/10.1007/s41060-018-0112-3
