A Scalable Architecture for Automated Data Classification and Sensitive Information Discovery Using Artificial Intelligence

Muppidi Sudheer Kumar

doi:10.63282/3050-922X.IJERET-V4I2P117

Authors

Muppidi Sudheer Kumar, Data Governance Lead, Kemper, Tallahassee, FL, USA.

Keywords:

Artificial Intelligence (AI), Automated Data Classification, Sensitive Information Discovery, Data Governance, Data Security, Machine Learning, Natural Language Processing (NLP), Sensitive Data Detection, Data Privacy, Scalable Architecture, Intelligent Data Management

Abstract

The continuous expansion of enterprise data across cloud computing platforms, distributed storage systems, and digital communication networks has made sensitive information significantly harder to manage and secure. Traditional rule-based and manual data classification techniques are often inadequate for large-scale heterogeneous datasets because of limited scalability, low contextual awareness, and high operational overhead. To address the growing demands of enterprise data governance, privacy protection, and compliance with cybersecurity regulations, this paper presents a scalable, AI-powered architecture for automated data classification and sensitive information discovery. The proposed solution combines machine learning, deep learning, Natural Language Processing (NLP), and transformer-based models to automatically classify structured, semi-structured, and unstructured enterprise data. The architecture comprises several functional components: data ingestion, data preprocessing, AI-based classification, sensitive data discovery, compliance management, and secure data storage. Advanced NLP and Named Entity Recognition (NER) techniques identify confidential entities, including personally identifiable information (PII), healthcare records, financial data, and organizational secrets. Cloud-native distributed processing and scalable monitoring frameworks further improve processing efficiency, flexibility, and real-time data governance capabilities. Experimental results show that the proposed AI-based architecture outperforms the traditional rule-based architecture in classification accuracy, sensitive data detection performance, scalability, and operational efficiency. The framework also provides automated governance and auditing to support compliance with regulations such as GDPR, HIPAA, and CCPA. In conclusion, the proposed architecture offers a secure and intelligent way to manage enterprise data in today's digital landscape.
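
The component flow named in the abstract (preprocessing, sensitive data discovery, classification, and auditing) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Record` type, the function names, and the two regular-expression patterns are all hypothetical, and the regex detector is only a simple stand-in for the NER and transformer-based models the paper describes.

```python
import re
from dataclasses import dataclass, field

# ASSUMPTION: two illustrative regex patterns stand in for a trained NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class Record:
    """Hypothetical unit of ingested data moving through the pipeline."""
    source: str
    text: str
    findings: list = field(default_factory=list)
    label: str = "unclassified"


def preprocess(record: Record) -> Record:
    # Normalize whitespace so detectors see consistent input.
    record.text = " ".join(record.text.split())
    return record


def discover(record: Record) -> Record:
    # Collect (entity kind, matched value) pairs for every pattern hit.
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(record.text):
            record.findings.append((kind, match.group()))
    return record


def classify(record: Record) -> Record:
    # Simplified two-level labeling; a real system would use a trained model.
    record.label = "sensitive" if record.findings else "public"
    return record


def audit(record: Record) -> dict:
    # Log only the entity kinds, never the matched values themselves.
    kinds = sorted({kind for kind, _ in record.findings})
    return {"source": record.source, "label": record.label, "kinds": kinds}


def run_pipeline(records: list) -> list:
    return [audit(classify(discover(preprocess(r)))) for r in records]


if __name__ == "__main__":
    sample = [
        Record("crm", "Contact  jane@example.com, SSN 123-45-6789"),
        Record("wiki", "Lunch menu for Friday"),
    ]
    for entry in run_pipeline(sample):
        print(entry)
```

Keeping each stage as a separate function means the regex detector in `discover` could be swapped for a model-based one without changing the surrounding stages, which is the kind of modularity the described architecture implies.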
