2017 | OriginalPaper | Book chapter

Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection

Written by: Hennadii Dobrovolskyi, Nataliya Keberle, Olga Todoriko

Published in: Knowledge Engineering and Semantic Web

Publisher: Springer International Publishing

Abstract

The paper presents an application of a probabilistic topic model (PTM) to citation network collection. The snowball sampling method is moderated by using the PTM to select the most relevant papers. The PTM used in the paper is modified to handle collections of short texts. It is constructed from the titles of the seed paper collection combined with the papers obtained through unrestricted snowball sampling. The objective of the research is to propose and experimentally verify an approach that applies a PTM of short text documents to improve citation network collection. A preliminary analysis has shown that the method is robust: variations in the seed paper collection do not affect the subset of most influential papers in the collected citation network.
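The idea of moderating snowball sampling with a relevance filter can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy citation data, and the threshold are all hypothetical, and the keyword-overlap score is only a stand-in for the relevance that a short-text PTM would assign to a title.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set


def controlled_snowball(
    seeds: Iterable[str],
    neighbours: Dict[str, List[str]],
    titles: Dict[str, str],
    relevance: Callable[[str], float],
    threshold: float,
    max_depth: int,
) -> Set[str]:
    """Breadth-first snowball sampling that only follows relevant papers."""
    start = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    collected: Set[str] = set(start)
    queue = deque((paper, 0) for paper in start)
    while queue:
        paper, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the allowed number of hops
        for other in neighbours.get(paper, []):
            if other in collected:
                continue
            # Keep a candidate only if its title scores at or above the threshold.
            if relevance(titles.get(other, "")) >= threshold:
                collected.add(other)
                queue.append((other, depth + 1))
    return collected


def keyword_relevance(keywords: Set[str]) -> Callable[[str], float]:
    """Stand-in scorer (an assumption): share of title words that are keywords."""
    def score(title: str) -> float:
        words = set(title.lower().split())
        return len(words & keywords) / len(words) if words else 0.0
    return score


# Hypothetical toy data: paper id -> linked papers, paper id -> title.
neighbours = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
titles = {
    "A": "topic model for short texts",
    "B": "biterm topic model",
    "C": "protein folding dynamics",
    "D": "short texts topic coherence",
    "E": "topic model survey",
}
scorer = keyword_relevance({"topic", "model", "short", "texts"})

sampled = controlled_snowball(["A"], neighbours, titles, scorer, 0.5, 2)
unrestricted = controlled_snowball(["A"], neighbours, titles, scorer, 0.0, 2)
print(sorted(sampled))       # ['A', 'B', 'D']
print(sorted(unrestricted))  # ['A', 'B', 'C', 'D', 'E']
```

In this toy run the off-topic paper C is rejected, so the sampler never follows its links and paper E is not reached, whereas the unrestricted run with a threshold of 0.0 collects everything within two hops.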

Metadata
Title
Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection
Written by
Hennadii Dobrovolskyi
Nataliya Keberle
Olga Todoriko
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-69548-8_7