Abstract
Domain-specific information retrieval has been a prominent and ongoing research area in natural language processing. Many researchers have applied different techniques to handle technical and domain specificity and to provide mature models for various domains of interest. The main bottleneck in these studies is their heavy dependence on domain experts, which makes the entire process time-consuming and cumbersome. In this study, we have developed three novel models, which are compared against a gold standard generated from the online repositories provided specifically for the legal domain. The three models use vector space representations of the legal domain: document vectors are generated by two different mechanisms, and the third model is an ensemble of the two. The study covers the research carried out in representing legal case documents in different vector spaces while incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which supports the need to incorporate domain-specific semantic similarity measures into the information retrieval process. The study also shows the impact of varying the distribution of the word similarity measures against varying document vector dimensions, which can lead to improvements in legal information retrieval.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sugathadasa, K. et al. (2019). Legal Document Retrieval Using Document Vector Embeddings and Deep Learning. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2018. Advances in Intelligent Systems and Computing, vol 857. Springer, Cham. https://doi.org/10.1007/978-3-030-01177-2_12
DOI: https://doi.org/10.1007/978-3-030-01177-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01176-5
Online ISBN: 978-3-030-01177-2
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)