27-08-2023

An Approach for Analyzing Unstructured Text Data Using Topic Modeling Techniques for Efficient Information Extraction

Authors: Ashwini Zadgaonkar, Avinash J. Agrawal

Published in: New Generation Computing

Abstract

Topic modeling techniques are widely used for document clustering, large-scale text analysis, information extraction from unstructured text documents, feature selection from large corpora, and various recommendation systems. This work proposes a framework that uses topic modeling techniques to extract legal information from the unstructured judgments of the Indian judicial system. The approach aims to replace time-consuming manual judgment analysis with automated analysis that can examine a large number of judgments in less time. We experimented with different topic modeling methodologies for information extraction. The proposed framework is built on Latent Dirichlet Allocation (LDA) and categorizes legal judgments into extracted topic groups. Indian Supreme Court judgments form the experimental corpus. The three main elements of the framework are pre-processing, applying the topic model, and model evaluation using a coherence score metric. The framework was successfully applied in batches to corpora of 100, 500, and 1000 legal judgments. To evaluate it quantitatively, the framework is used to measure legal judgment similarity. As future scope, we suggest various legal tasks whose performance could benefit from the proposed framework.
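
The three stages named in the abstract (pre-processing, fitting an LDA topic model, and scoring it with a coherence metric), followed by a similarity check between judgments, can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything specific here is an assumption: the tiny stopword list, the collapsed Gibbs sampler and its hyperparameters `alpha` and `beta`, the choice of the UMass variant of coherence, cosine similarity over document-topic vectors, and the three invented toy sentences standing in for judgments.

```python
import math
import random
import re
from collections import Counter

# Assumed, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "was", "for", "that", "with", "under", "from"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and words under 3 letters."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if len(t) > 2 and t not in STOPWORDS]

def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    Returns (theta, topic_word): per-document topic proportions and
    per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    def bump(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            bump(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, w, assign[d][i], -1)  # remove the token's current topic
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k_new
                bump(d, w, k_new, +1)

    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in row]
             for row, doc in zip(doc_topic, docs)]
    return theta, topic_word

def top_words(word_counts, n):
    """Most frequent words of one topic, ignoring words with zero count."""
    return [w for w, c in word_counts.most_common(n) if c > 0]

def umass_coherence(words, docs):
    """UMass coherence: sum of log((D(w_m, w_l) + 1) / D(w_l)) over pairs l < m."""
    doc_sets = [set(doc) for doc in docs]
    score = 0.0
    for m in range(1, len(words)):
        for l in range(m):
            d_l = sum(words[l] in s for s in doc_sets)
            d_ml = sum(words[m] in s and words[l] in s for s in doc_sets)
            score += math.log((d_ml + 1) / d_l)
    return score

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

if __name__ == "__main__":
    # Invented toy texts, not real judgments.
    texts = [
        "The appellant was convicted of murder and sentenced by the trial court.",
        "The tenant failed to pay rent under the lease and the landlord sought eviction.",
        "The accused appealed the murder conviction before the high court.",
    ]
    docs = [preprocess(t) for t in texts]
    theta, topic_word = fit_lda(docs, num_topics=2)
    for k, counts in enumerate(topic_word):
        words = top_words(counts, 4)
        print(k, words, umass_coherence(words, docs))
    print("similarity(doc0, doc2):", cosine(theta[0], theta[2]))
```

On a corpus this small the topics are not meaningful; the sketch only shows how the stages connect. Comparing coherence across different values of `num_topics` is one common way to pick the number of topics, and a library implementation would normally replace the hand-written sampler at realistic corpus sizes.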

Metadata

Title: An Approach for Analyzing Unstructured Text Data Using Topic Modeling Techniques for Efficient Information Extraction
Authors: Ashwini Zadgaonkar, Avinash J. Agrawal
Publication date: 27-08-2023
Publisher: Springer Japan
Published in: New Generation Computing
Print ISSN: 0288-3635
Electronic ISSN: 1882-7055
DOI: https://doi.org/10.1007/s00354-023-00230-5
