2019 | OriginalPaper | Chapter

On Convergence of Controlled Snowball Sampling for Scientific Abstracts Collection

Abstract

This paper presents evidence for the convergence of controlled snowball sampling iterations applied to collecting seminal papers in a selected research domain. The iterations start with seed paper selection, plain snowball sampling, and probabilistic topic modelling; greedy controlled snowball sampling and analysis of the collected citation network are then performed in rotation until the list of seminal papers becomes stable. The topic model is built on word-word co-occurrence probabilities, combining sparse symmetric nonnegative matrix factorization with principal component approximation. Experiments show that the number of topics in the model is determined in a natural way and that the Kullback-Leibler (KL) divergence provides an upper bound on the cosine similarity calculated from keywords assigned by the publication authors. Several citation networks are collected and analysed. The analysis shows that all of the networks are “small worlds”, and therefore the observed saturation of controlled snowball sampling can provide the complete set of publications in the domains of interest. Experiments with KL divergence, symmetric KL divergence, and Jensen-Shannon divergence show that KL divergence produces a less connected citation network but provides better convergence of the snowball iterations. Multiple runs of the sampling confirm the hypothesis that the set of seminal publications is stable with respect to variations of the seed papers. The modified main path analysis makes it possible to distinguish the seminal papers, including new publications that follow the mainstream of research. A comparison of different ranking criteria shows that Search Path Count provides better lists of seminal papers than citation index, PageRank, and indegree.
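The three divergences compared in the abstract can be illustrated with a minimal Python sketch. It assumes each paper is represented by a discrete topic distribution; the abstract does not say which distributions the authors compare, so the vectors `paper_a` and `paper_b`, the function names, and the choice of defining symmetric KL divergence as the sum of the two directed divergences (rather than their average) are all illustrative assumptions, not the chapter's implementation.

```python
import math

def kl_divergence(p, q):
    """Directed KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetrised KL, taken here as the sum of both directions (an assumed convention)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic distributions of two papers over three topics.
paper_a = [0.7, 0.2, 0.1]
paper_b = [0.1, 0.3, 0.6]

print("KL(a || b):  ", kl_divergence(paper_a, paper_b))
print("KL(b || a):  ", kl_divergence(paper_b, paper_a))
print("symmetric KL:", symmetric_kl(paper_a, paper_b))
print("JS:          ", jensen_shannon(paper_a, paper_b))
```

Plain KL divergence is asymmetric, so KL(a || b) and KL(b || a) generally differ, while the other two measures give the same value in either direction. That asymmetry is one plausible reason the choice of divergence can change which papers a controlled snowball step accepts, although the abstract itself reports only the observed effect on connectivity and convergence.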

Metadata
Title
On Convergence of Controlled Snowball Sampling for Scientific Abstracts Collection
Authors
Hennadii Dobrovolskyi
Nataliya Keberle
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-13929-2_2
