2016 | OriginalPaper | Chapter

Retracted: Clustering of Wikipedia Texts Based on Keywords

Authors: Jalalaldin Gharibi Karyak, Fardin Yazdanpanah Sisakht, Sadrollah Abbasi

Published in: Computational Science and Its Applications – ICCSA 2016

Publisher: Springer International Publishing

Abstract

The paper presents an application of spectral clustering algorithms to grouping Wikipedia search results. Its main contribution is a representation method for Wikipedia articles that combines words and links and is used to categorize search results in this repository. We evaluate the proposed approach with Principal Component Analysis and show, on test data, how using a cosine transformation to create the combined representations influences data variability. On sample test datasets we also show how the combined representation improves data separation, which in turn improves the overall categorization results. We review the main spectral clustering methods and compare them using external validation criteria with standard clustering quality measures. A discussion of the descriptiveness of the evaluation measures, together with experiments on the test datasets, allows us to select the one spectral clustering algorithm that is implemented in our system. We briefly describe the architecture of the system, which groups online Wikipedia articles retrieved for specified keywords. Using the system, we show how clustering increases information retrieval effectiveness for the Wikipedia data repository.
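The abstract does not spell out how words and links are combined, so the following is only a minimal sketch under stated assumptions: each article is represented by a sparse word-count vector and a sparse link vector, the two cosine similarities are blended with a hypothetical weight `alpha`, and purity stands in for the external validation criteria. The function names, the toy articles, and the choice of purity are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(a, b, alpha=0.5):
    """Blend word-based and link-based cosine similarity.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    word_sim = cosine(a["words"], b["words"])
    link_sim = cosine(a["links"], b["links"])
    return alpha * word_sim + (1.0 - alpha) * link_sim


def purity(cluster_labels, true_labels):
    """External validation: share of items that belong to the majority
    true class of the cluster they were assigned to."""
    per_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    return majority_total / len(true_labels)


# Invented toy articles for an ambiguous keyword.
articles = [
    {"words": {"jaguar": 2, "cat": 3}, "links": {"Felidae": 1, "Predator": 1}},
    {"words": {"jaguar": 1, "cat": 1, "prey": 2}, "links": {"Felidae": 1}},
    {"words": {"jaguar": 2, "car": 3}, "links": {"Automobile": 1}},
]

# Pairwise similarity matrix over the combined representation.
sim = [[combined_similarity(a, b) for b in articles] for a in articles]
```

A matrix such as `sim` is the kind of affinity input a spectral clustering algorithm operates on. Once cluster labels are obtained, `purity(cluster_labels, true_labels)` compares them against known categories, which is one simple example of an external validation measure.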

Metadata
Title
Retracted: Clustering of Wikipedia Texts Based on Keywords
Authors
Jalalaldin Gharibi Karyak
Fardin Yazdanpanah Sisakht
Sadrollah Abbasi
Copyright Year
2016
DOI
https://doi.org/10.1007/978-3-319-42092-9_39
