
2023 | OriginalPaper | Chapter

AI/ML Based Sensitive Data Discovery and Classification of Unstructured Data Sources

Authors : Shravani Ponde, Akshay Kulkarni, Rashmi Agarwal

Published in: Intelligent Systems and Machine Learning

Publisher: Springer Nature Switzerland


Abstract

The amount of data produced every day is enormous. According to Forbes, 2.5 quintillion bytes of data are created daily (Marr, 2018). The volume of unstructured data is also growing daily, forcing organizations to spend significant time, effort, and money to manage and govern their data assets. This volume of unstructured data also creates data privacy challenges in handling, auditing, and meeting the requirements imposed by governing bodies such as governments and auditors, by data protection, legislative, and federal laws, and by regulations and standards such as the General Data Protection Regulation (GDPR), the Basel Committee on Banking Supervision (BCBS) standards, the Health Insurance Portability and Accountability Act (HIPAA), and the California Consumer Privacy Act (CCPA).
Organizations must set up a robust data protection framework and governance to identify, classify, protect, and monitor the sensitive data residing in unstructured data sources. Data discovery and classification is the process of scanning an organization's data sources, both structured and unstructured, that could potentially contain sensitive or regulated data.
Most organizations use various data discovery and classification tools to scan structured and unstructured sources. However, gaps in how these tools scan and discover sensitive data elements in unstructured sources keep organizations from meeting their overall privacy and protection needs. Hence, they resort to manual methods to fill these gaps.
The main objective of this study is to build a solution that systematically scans an unstructured data source, detects the sensitive data elements, automatically classifies them according to the data classification categories, and visualizes the results on a dashboard. The solution uses Machine Learning (ML) and Natural Language Processing (NLP) techniques to detect the sensitive data elements contained in unstructured data sources. It can be used as a first step before performing data encryption, tokenization, anonymization, and masking as part of the overall data protection journey.
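
The scan, detect, classify, and summarize flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract states that the solution uses ML and NLP techniques, whereas this example substitutes simple regular expressions for the detection step. The pattern set, the category names ("Confidential", "Restricted"), and the function names `scan_text` and `summarize` are all hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical detection patterns; a regex stand-in for the ML/NLP
# detection step, not taken from the chapter.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD_NUMBER": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

# Hypothetical mapping from element type to classification category.
CATEGORY = {
    "EMAIL": "Confidential",
    "US_SSN": "Restricted",
    "CARD_NUMBER": "Restricted",
}


def scan_text(text):
    """Return every pattern match in the text, ordered by position."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({
                "type": label,
                "value": match.group(),
                "category": CATEGORY[label],
                "start": match.start(),
            })
    return sorted(findings, key=lambda f: f["start"])


def summarize(findings):
    """Count findings per classification category (input for a dashboard)."""
    return Counter(f["category"] for f in findings)


sample = "Contact jane.doe@example.com; SSN 123-45-6789; card 4111 1111 1111 1111."
findings = scan_text(sample)
summary = summarize(findings)

for f in findings:
    print(f["type"], f["category"], f["value"])
print(dict(summary))
```

In a fuller pipeline, the per-category counts returned by `summarize` would feed the dashboard, and a trained named-entity recognizer would replace or complement the patterns for elements such as personal names, which regular expressions cannot capture reliably.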


Metadata
Copyright Year
2023
DOI
https://doi.org/10.1007/978-3-031-35081-8_31