Abstract
Correctly modeling semantic relatedness between texts, and consequently between the concepts these texts represent, has become an important part of many intelligent information retrieval and knowledge processing systems. The need for such systems is especially evident in the biomedical domain, where the sheer volume of scientific publishing contributes to information overload. In this paper we present a novel method for approximating semantic relatedness in domain-focused settings. The approach extends the well-known Explicit Semantic Analysis (ESA) method by leveraging the semantics of a domain-specific document corpus. We evaluate the proposed method on a set of datasets that are a de facto reference standard for the task of approximating biomedical semantic relatedness, comparing it with other state-of-the-art methods as well as with baselines established with the original ESA method. The experimental results suggest that the proposed method combines the semantics of a general corpus and a domain-specific corpus to provide significant improvements over the original method.
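For readers unfamiliar with the baseline, the following is a minimal sketch of the general idea behind the original ESA method: a text is mapped to a vector of TF-IDF weights over a set of concept documents, and relatedness is taken as the cosine between two such vectors. It does not implement the domain-specific extension proposed in this paper. The three-entry concept corpus, the whitespace tokenizer, and all function names are illustrative assumptions; a practical setup would use a large general corpus such as Wikipedia.

```python
import math
from collections import Counter

# Hypothetical toy "concept" corpus standing in for a large general corpus.
CONCEPTS = {
    "Heart": "heart blood cardiac muscle pump artery",
    "Diabetes": "insulin glucose blood sugar pancreas",
    "Football": "ball goal team match player",
}


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption of this sketch)."""
    return text.lower().split()


def build_index(concepts):
    """Build an inverted index: term -> {concept name: TF-IDF weight}."""
    n = len(concepts)
    tf = {name: Counter(tokenize(doc)) for name, doc in concepts.items()}
    # Document frequency: number of concept documents containing each term.
    df = Counter(term for counts in tf.values() for term in counts)
    index = {}
    for name, counts in tf.items():
        for term, count in counts.items():
            idf = math.log(n / df[term])
            index.setdefault(term, {})[name] = count * idf
    return index


def concept_vector(text, index):
    """Sum the concept weights of every known term in the text."""
    vec = Counter()
    for term in tokenize(text):
        for concept, weight in index.get(term, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm = norm_u * norm_v
    return dot / norm if norm else 0.0


def relatedness(text_a, text_b, index):
    """ESA-style relatedness: cosine between the two concept vectors."""
    return cosine(concept_vector(text_a, index), concept_vector(text_b, index))


INDEX = build_index(CONCEPTS)

if __name__ == "__main__":
    print(relatedness("cardiac blood", "glucose blood", INDEX))
    print(relatedness("cardiac artery", "insulin glucose", INDEX))
```

In this toy setting, two texts score above zero only when their terms occur in at least one common concept document, which is why the choice of the underlying corpus matters for domain-focused vocabularies.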
Notes
We evaluated the algorithm with values of k between 1 and 15, and the method appears to work well within this range. In the evaluation presented here we discuss only the results for k = 1 and k = 10, for illustrative purposes.
See https://seriousstats.wordpress.com/2012/02/05/comparing-correlations/ for discussion and code.
Acknowledgments
We would like to thank the anonymous referees for their invaluable contributions towards improving the manuscript.
The work presented in this paper was partially supported by grants TIN2014-58304-R (Ministerio de Ciencia e Innovación), P11-TIC-7529 and P12-TIC-1519 (Plan Andaluz de Investigación, Desarrollo e Innovación).
Cite this article
Rybiński, M., Aldana Montes, J.F. DomESA: a novel approach for extending domain-oriented lexical relatedness calculations with domain-specific semantics. J Intell Inf Syst 49, 315–331 (2017). https://doi.org/10.1007/s10844-017-0442-y
DOI: https://doi.org/10.1007/s10844-017-0442-y