Legal Document Retrieval Using Document Vector Embeddings and Deep Learning

Sugathadasa, Keet; Ayesha, Buddhi; de Silva, Nisansa; Perera, Amal Shehan; Jayawardana, Vindula; Lakmal, Dimuthu; Perera, Madhavi

doi:10.1007/978-3-030-01177-2_12


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 857)

Included in the following conference series:

Science and Information Conference


Abstract

Domain-specific information retrieval has been a prominent and ongoing research area in natural language processing. Many researchers have applied different techniques to overcome technical and domain-specific challenges and to provide mature models for various domains of interest. The main bottleneck in these studies is their heavy dependence on domain experts, which makes the entire process time-consuming and cumbersome. In this study, we developed three novel models for the legal domain and compared them against a gold standard generated from the online repositories provided. The three models incorporate vector space representations of the legal domain: document vectors are generated by two different mechanisms, and the third model is an ensemble of the two. The study covers the research carried out in representing legal case documents in different vector spaces while incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which demonstrates the need to incorporate domain-specific semantic similarity measures into the information retrieval process. The study also shows the impact of varying the distribution of word similarity measures against varying document vector dimensions, which can lead to improvements in legal information retrieval.



Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Moratuwa, Moratuwa, Sri Lanka
Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana & Dimuthu Lakmal
University of London International Programmes, University of London, London, UK
Madhavi Perera

Corresponding author

Correspondence to Keet Sugathadasa.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Department of Information Science, Saga University, Honjo, Saga, Japan
Kohei Arai
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Supriya Kapoor
The Science and Information (SAI) Organization, Bradford, West Yorkshire, UK
Rahul Bhatia

About this paper

Cite this paper

Sugathadasa, K. et al. (2019). Legal Document Retrieval Using Document Vector Embeddings and Deep Learning. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2018. Advances in Intelligent Systems and Computing, vol 857. Springer, Cham. https://doi.org/10.1007/978-3-030-01177-2_12

DOI: https://doi.org/10.1007/978-3-030-01177-2_12
Published: 02 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01176-5
Online ISBN: 978-3-030-01177-2
eBook Packages: Intelligent Technologies and Robotics (R0)
