nach oben

Erschienen in:

2019 | OriginalPaper | Buchkapitel

Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities

verfasst von : Janus Wawrzinek, José María González Pinto, Philipp Markiewka, Wolf-Tilo Balke

Erschienen in: Data Integration in the Life Sciences

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

The exponential increase of scientific publications in the bio-medical field challenges access to scientific information, which primarily is encoded by semantic relationships between medical entities, such as active ingredients, diseases, or genes. Neural language models, such as Word2Vec, offer new ways of automatically learning semantically meaningful entity relationships even from large text corpora. They offer high scalability and deliver better accuracy than comparable approaches. Still, first the models have to be tuned by testing different training parameters. Arguably, the most critical parameter is the number of training dimensions for the neural network training and testing individually different numbers of dimensions is time-consuming. It usually takes hours or even days per training iteration on large corpora. In this paper we show a more efficient way to determine the optimal number of dimensions concerning quality measures such as precision/recall. We show that the quality of results gained using simpler and easier to compute scaling approaches like MDS or PCA correlates strongly with the expected quality when using the same number of Word2Vec training dimensions. This has even more impact if after initial Word2Vec training only a limited number of entities and their respective relations are of interest.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nächstes Kapitel Combining Semantic and Lexical Measures to Evaluate Medical Terms Similarity

https://www.pubpharm.de/vufind/.

https://www.whocc.no/atc_ddd_index/.

https://www.ncbi.nlm.nih.gov/pubmed/.

The complete list can be downloaded under: http://www.ifis.cs.tu-bs.de/webfm_send/2295.

https://www.drugbank.ca/.

https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/.

https://lucene.apache.org/.

https://radimrehurek.com/gensim/models/word2vec.html.

https://github.com/RaRe-Technologies/gensim-data/issues/28.

https://code.google.com/archive/p/word2vec/.

Wawrzinek, J., Balke, W.-T.: Semantic facettation in pharmaceutical collections using deep learning for active substance contextualization. In: Choemprayong, S., Crestani, F., Cunningham, S.J. (eds.) ICADL 2017. LNCS, vol. 10647, pp. 41–53. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70232-2_4CrossRef

Wang, Z.Y., Zhang, H.Y.: Rational drug repositioning by medical genetics. Nat. Biotechnol. 31(12), 1080 (2013)CrossRef

Abdelaziz, I., Fokoue, A., Hassanzadeh, O., Zhang, P., Sadoghi, M.: Large-scale structural and textual similarity-based mining of knowledge graph to predict drug–drug interactions. Web Semant.: Sci., Serv. Agents World Wide Web 44, 104–117 (2017)CrossRef

Leser, U., Hakenberg, J.: What makes a gene name? Named entity recognition in the biomedical literature. Brief. Bioinform. 6(4), 357–369 (2005)CrossRef

Lotfi Shahreza, M., Ghadiri, N., Mousavi, S.R., Varshosaz, J., Green, J.R.: A review of network-based approaches to drug repositioning. Brief. Bioinform. bbx017 (2017)

Dudley, J.T., Deshpande, T., Butte, A.J.: Exploiting drug–disease relationships for computational drug repositioning. Brief. Bioinform. 12(4), 303–311 (2011)CrossRef

Willett, P., Barnard, J.M., Downs, G.M.: Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38(6), 983–996 (1998)CrossRef

Ngo, D.L., et al.: Application of word embedding to drug repositioning. J. Biomed. Sci. Eng. 9(01), 7 (2016)CrossRef

Lengerich, B.J., Maas, A.L., Potts, C.: Retrofitting Distributional Embeddings to Knowledge Graphs with Functional Relations. arXiv preprint arXiv:1708.00112 (2017)

10.

Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 238–247 (2014)

11.

Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751 (2013)

12.

Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: NIPS (2013)

13.

Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing, pp. 2177–2185 (2014)

14.

Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

15.

Bengio, Y., Courville, A., Vincent, P., Collobert, R., Weston, J., et al.: Natural language processing (almost) from scratch. IEEE Trans. Pattern Anal. Mach. Intell. 35, 384–394 (2014)

16.

Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification, vol. 2, pp. 427–431 (2016). Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Valencia, Spain, 3–7 April 2017

17.

Borg, I., Groenen, P.J.: Modern Multidimensional Scaling: Theory and Applications. Springer, New york (2005). https://doi.org/10.1007/0-387-28981-XCrossRefMATH

18.

Weinberg, S.L.: An introduction to multidimensional scaling. Meas. Eval. Couns. Dev. 24, 12–36 (1991)

19.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetMATH

20.

Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)CrossRef

21.

Hamilton, W.L., Leskovec, J., Jurafsky, D.: Diachronic word embeddings reveal statistical laws of semantic change. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016, pp. 1489–1501 (2016)

22.

Altman, D.G., Bland, J.M.: Measurement in medicine: the analysis of method comparison studies. Statistician 32, 307–317 (1983)CrossRef

23.

Schönemann, P.H.: A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1–10 (1966)MathSciNetCrossRef

24.

Jessop, D.M., Adams, S.E., Willighagen, E.L., Hawizy, L., Murray-Rust, P.: OSCAR4: a flexible architecture for chemical text-mining. J. Cheminformatics 3(1), 41 (2011)CrossRef

25.

Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)CrossRef

26.

Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Advances in Neural Information Processing Systems, pp. 2177–2185 (2014)

27.

Gittens, A., Achlioptas, D., Mahoney, M.W.: Skip-gram - zipf + uniform = vector additivity. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1, pp. 69–76 (2017)

28.

Li, Y., Xu, L., Tian, F., Jiang, L., Zhong, X., Chen, E.: Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In: IJCAI International Joint Conference on Artificial Intelligence, pp. 3650–3656 (2015)

29.

Canese, K.: PubMed relevance sort. NLM Tech. Bull 394, e2 (2013)

Titel: Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities
verfasst von: Janus Wawrzinek
José María González Pinto
Philipp Markiewka
Wolf-Tilo Balke
Verlag: Springer International Publishing
Buch: Data Integration in the Life Sciences
Print ISBN: 978-3-030-06015-2

Electronic ISBN: 978-3-030-06016-9

Copyright-Jahr: 2019
DOI: https://doi.org/10.1007/978-3-030-06016-9_1

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"