Legal Document Retrieval Using Document Vector Embeddings and Deep Learning

  • Conference paper
Intelligent Computing (SAI 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 857)

Abstract

Domain-specific information retrieval has been a prominent and ongoing research topic in natural language processing. Many researchers have applied different techniques to overcome technical and domain-specific challenges and to provide mature models for various domains of interest. The main bottleneck in these studies is their heavy reliance on domain experts, which makes the entire process time-consuming and cumbersome. In this study, we developed three novel models and compared them against a gold standard generated from online repositories provided specifically for the legal domain. The three models use vector space representations of the legal domain: document vectors are generated by two different mechanisms, and the third model is an ensemble of the two. This study presents research on representing legal case documents in different vector spaces while incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which demonstrates the need to incorporate domain-specific semantic similarity measures into the information retrieval process. This study also shows the impact of varying the distribution of word similarity measures against varying document vector dimensions, which can lead to improvements in legal information retrieval.
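
The abstract does not say how the two document-vector models are combined into the ensemble, so the following is only a minimal sketch of one common way such an ensemble could rank documents: score each candidate by cosine similarity in each vector space and take a weighted average. The function names, the `weight` parameter, the toy vectors, and the case identifiers are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_scores(query_a, query_b, docs_a, docs_b, weight=0.5):
    """Weighted average of per-model cosine similarities.

    docs_a and docs_b map the same document ids to vectors produced by two
    different (hypothetical) document-embedding models. The averaging rule is
    an assumption for illustration, not the paper's stated method.
    """
    scores = {}
    for doc_id in docs_a:
        sim_a = cosine(query_a, docs_a[doc_id])
        sim_b = cosine(query_b, docs_b[doc_id])
        scores[doc_id] = weight * sim_a + (1 - weight) * sim_b
    return scores


def top_k(scores, k):
    """Return the k document ids with the highest ensemble score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: the two models may use different vector dimensions.
docs_a = {"case_1": [1.0, 0.0], "case_2": [0.0, 1.0], "case_3": [1.0, 1.0]}
docs_b = {
    "case_1": [0.0, 1.0, 0.0],
    "case_2": [1.0, 0.0, 0.0],
    "case_3": [0.0, 1.0, 1.0],
}
query_a = [1.0, 0.0]
query_b = [0.0, 1.0, 0.0]

scores = ensemble_scores(query_a, query_b, docs_a, docs_b)
ranking = top_k(scores, 2)
print(ranking)
```

Because each model is scored in its own space before the scores are merged, the two embeddings need not share a dimensionality, which makes it straightforward to vary the vector dimensions of one model independently of the other.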

Notes

  1. https://www.westlaw.com/

  2. https://www.lexisnexis.com/

  3. https://www.westlaw.com/

  4. https://www.lexisnexis.com/

Author information

Corresponding author

Correspondence to Keet Sugathadasa.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Sugathadasa, K. et al. (2019). Legal Document Retrieval Using Document Vector Embeddings and Deep Learning. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Computing. SAI 2018. Advances in Intelligent Systems and Computing, vol 857. Springer, Cham. https://doi.org/10.1007/978-3-030-01177-2_12
