Skip to main content
Erschienen in:
Buchtitelbild

2019 | OriginalPaper | Buchkapitel

Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities

verfasst von : Janus Wawrzinek, José María González Pinto, Philipp Markiewka, Wolf-Tilo Balke

Erschienen in: Data Integration in the Life Sciences

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

The exponential increase of scientific publications in the bio-medical field challenges access to scientific information, which primarily is encoded by semantic relationships between medical entities, such as active ingredients, diseases, or genes. Neural language models, such as Word2Vec, offer new ways of automatically learning semantically meaningful entity relationships even from large text corpora. They offer high scalability and deliver better accuracy than comparable approaches. Still, first the models have to be tuned by testing different training parameters. Arguably, the most critical parameter is the number of training dimensions for the neural network training and testing individually different numbers of dimensions is time-consuming. It usually takes hours or even days per training iteration on large corpora. In this paper we show a more efficient way to determine the optimal number of dimensions concerning quality measures such as precision/recall. We show that the quality of results gained using simpler and easier to compute scaling approaches like MDS or PCA correlates strongly with the expected quality when using the same number of Word2Vec training dimensions. This has even more impact if after initial Word2Vec training only a limited number of entities and their respective relations are of interest.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
2.
Zurück zum Zitat Wang, Z.Y., Zhang, H.Y.: Rational drug repositioning by medical genetics. Nat. Biotechnol. 31(12), 1080 (2013)CrossRef Wang, Z.Y., Zhang, H.Y.: Rational drug repositioning by medical genetics. Nat. Biotechnol. 31(12), 1080 (2013)CrossRef
3.
Zurück zum Zitat Abdelaziz, I., Fokoue, A., Hassanzadeh, O., Zhang, P., Sadoghi, M.: Large-scale structural and textual similarity-based mining of knowledge graph to predict drug–drug interactions. Web Semant.: Sci., Serv. Agents World Wide Web 44, 104–117 (2017)CrossRef Abdelaziz, I., Fokoue, A., Hassanzadeh, O., Zhang, P., Sadoghi, M.: Large-scale structural and textual similarity-based mining of knowledge graph to predict drug–drug interactions. Web Semant.: Sci., Serv. Agents World Wide Web 44, 104–117 (2017)CrossRef
4.
Zurück zum Zitat Leser, U., Hakenberg, J.: What makes a gene name? Named entity recognition in the biomedical literature. Brief. Bioinform. 6(4), 357–369 (2005)CrossRef Leser, U., Hakenberg, J.: What makes a gene name? Named entity recognition in the biomedical literature. Brief. Bioinform. 6(4), 357–369 (2005)CrossRef
5.
Zurück zum Zitat Lotfi Shahreza, M., Ghadiri, N., Mousavi, S.R., Varshosaz, J., Green, J.R.: A review of network-based approaches to drug repositioning. Brief. Bioinform. bbx017 (2017) Lotfi Shahreza, M., Ghadiri, N., Mousavi, S.R., Varshosaz, J., Green, J.R.: A review of network-based approaches to drug repositioning. Brief. Bioinform. bbx017 (2017)
6.
Zurück zum Zitat Dudley, J.T., Deshpande, T., Butte, A.J.: Exploiting drug–disease relationships for computational drug repositioning. Brief. Bioinform. 12(4), 303–311 (2011)CrossRef Dudley, J.T., Deshpande, T., Butte, A.J.: Exploiting drug–disease relationships for computational drug repositioning. Brief. Bioinform. 12(4), 303–311 (2011)CrossRef
7.
Zurück zum Zitat Willett, P., Barnard, J.M., Downs, G.M.: Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38(6), 983–996 (1998)CrossRef Willett, P., Barnard, J.M., Downs, G.M.: Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38(6), 983–996 (1998)CrossRef
8.
Zurück zum Zitat Ngo, D.L., et al.: Application of word embedding to drug repositioning. J. Biomed. Sci. Eng. 9(01), 7 (2016)CrossRef Ngo, D.L., et al.: Application of word embedding to drug repositioning. J. Biomed. Sci. Eng. 9(01), 7 (2016)CrossRef
9.
Zurück zum Zitat Lengerich, B.J., Maas, A.L., Potts, C.: Retrofitting Distributional Embeddings to Knowledge Graphs with Functional Relations. arXiv preprint arXiv:1708.00112 (2017) Lengerich, B.J., Maas, A.L., Potts, C.: Retrofitting Distributional Embeddings to Knowledge Graphs with Functional Relations. arXiv preprint arXiv:​1708.​00112 (2017)
10.
Zurück zum Zitat Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 238–247 (2014) Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 238–247 (2014)
11.
Zurück zum Zitat Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751 (2013) Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751 (2013)
12.
Zurück zum Zitat Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: NIPS (2013) Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: NIPS (2013)
13.
Zurück zum Zitat Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing, pp. 2177–2185 (2014) Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing, pp. 2177–2185 (2014)
14.
Zurück zum Zitat Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
15.
Zurück zum Zitat Bengio, Y., Courville, A., Vincent, P., Collobert, R., Weston, J., et al.: Natural language processing (almost) from scratch. IEEE Trans. Pattern Anal. Mach. Intell. 35, 384–394 (2014) Bengio, Y., Courville, A., Vincent, P., Collobert, R., Weston, J., et al.: Natural language processing (almost) from scratch. IEEE Trans. Pattern Anal. Mach. Intell. 35, 384–394 (2014)
16.
Zurück zum Zitat Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification, vol. 2, pp. 427–431 (2016). Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Valencia, Spain, 3–7 April 2017 Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification, vol. 2, pp. 427–431 (2016). Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Valencia, Spain, 3–7 April 2017
18.
Zurück zum Zitat Weinberg, S.L.: An introduction to multidimensional scaling. Meas. Eval. Couns. Dev. 24, 12–36 (1991) Weinberg, S.L.: An introduction to multidimensional scaling. Meas. Eval. Couns. Dev. 24, 12–36 (1991)
19.
Zurück zum Zitat Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetMATH Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetMATH
20.
Zurück zum Zitat Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)CrossRef Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)CrossRef
21.
Zurück zum Zitat Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016, pp. 1489–1501 (2016) Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016, pp. 1489–1501 (2016)
22.
Zurück zum Zitat Altman, D.G., Bland, J.M.: Measurement in medicine: the analysis of method comparison studies. Statistician 32, 307–317 (1983)CrossRef Altman, D.G., Bland, J.M.: Measurement in medicine: the analysis of method comparison studies. Statistician 32, 307–317 (1983)CrossRef
23.
Zurück zum Zitat Schönemann, P.H.: A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1–10 (1966)MathSciNetCrossRef Schönemann, P.H.: A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1–10 (1966)MathSciNetCrossRef
24.
Zurück zum Zitat Jessop, D.M., Adams, S.E., Willighagen, E.L., Hawizy, L., Murray-Rust, P.: OSCAR4: a flexible architecture for chemical text-mining. J. Cheminformatics 3(1), 41 (2011)CrossRef Jessop, D.M., Adams, S.E., Willighagen, E.L., Hawizy, L., Murray-Rust, P.: OSCAR4: a flexible architecture for chemical text-mining. J. Cheminformatics 3(1), 41 (2011)CrossRef
25.
Zurück zum Zitat Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)CrossRef Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)CrossRef
26.
Zurück zum Zitat Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems, pp. 2177–2185 (2014) Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems, pp. 2177–2185 (2014)
27.
Zurück zum Zitat Gittens, A., Achlioptas, D., Mahoney, M.W.: Skip-gram - zipf + uniform = vector additivity. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 69–76 (2017) Gittens, A., Achlioptas, D., Mahoney, M.W.: Skip-gram - zipf + uniform = vector additivity. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 69–76 (2017)
28.
Zurück zum Zitat Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X., Chen, E.: Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In: IJCAI International Joint Conference on Artificial Intelligence, pp. 3650–3656 (2015) Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X., Chen, E.: Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In: IJCAI International Joint Conference on Artificial Intelligence, pp. 3650–3656 (2015)
29.
Zurück zum Zitat Canese, K.: PubMed relevance sort. NLM Tech. Bull 394, e2 (2013) Canese, K.: PubMed relevance sort. NLM Tech. Bull 394, e2 (2013)
Metadaten
Titel
Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities
verfasst von
Janus Wawrzinek
José María González Pinto
Philipp Markiewka
Wolf-Tilo Balke
Copyright-Jahr
2019
DOI
https://doi.org/10.1007/978-3-030-06016-9_1