Causal Inference and Graph-Based AI Models for Root Cause Analysis in Telecom and Networking Systems
DOI:
https://doi.org/10.63282/3050-922X.IJERET-V3I1P109
Keywords:
Causal Inference, Graph Neural Networks, Root Cause Analysis, Knowledge Graphs, Telecom Fault Diagnosis, Bayesian Networks, Network Automation, AIOps, Network Management, Observability
Abstract
The growing complexity of telecom and networking systems, driven by 5G, cloud-native architectures, and virtualization, calls for intelligent Root Cause Analysis (RCA) of faults, anomalies, and performance degradation. Classical RCA methods tend not to scale or adapt to the high-dimensional, dynamic nature of today's systems. This paper proposes an integrative approach that combines causal inference with graph-based AI models to improve RCA accuracy and efficiency in telecom networks. We discuss the use of probabilistic graphical models (Bayesian networks, Markov networks), causal discovery methods (PC, GES, and LiNGAM), and knowledge graphs for representing network relationships and inferring causality from observational data. We present evidence from case studies at Ericsson, Nokia, and AT&T Labs that these approaches not only outperform correlational heuristics but also support explainable diagnostics and proactive fault prevention. The method proceeds from network telemetry ingestion through causal graph construction and intervention analysis to inference. Findings indicate that RCA precision improves by more than 30 percent in large-scale simulated environments. We discuss the use of domain knowledge, hybrid data-driven and model-based practice, and system design trade-offs. The paper outlines a path toward autonomous, self-healing networks based on causal AI.
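To make the causal-graph step mentioned in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes a causal graph is already available (for example, from domain knowledge or a discovery method such as PC) and applies one simple heuristic: an alarmed element is a root-cause candidate if none of its causal ancestors is also alarmed. The topology, node names, and function names are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical causal graph for illustration: each key is a cause,
# each value lists the elements it directly affects.
CAUSAL_EDGES = {
    "power_unit": ["router_a"],
    "router_a": ["link_ab", "link_ac"],
    "link_ab": ["cell_1"],
    "link_ac": ["cell_2"],
    "cell_1": [],
    "cell_2": [],
}


def ancestors(graph, node):
    """Return every node that has a directed path into `node`."""
    # Invert the cause -> effects mapping into effect -> direct causes.
    parents = {n: set() for n in graph}
    for cause, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, set()).add(cause)

    # Breadth-first walk upstream from the node's direct causes.
    seen = set()
    queue = deque(parents.get(node, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, ()))
    return seen


def root_cause_candidates(graph, alarmed):
    """Alarmed nodes with no alarmed causal ancestor, sorted by name."""
    alarmed = set(alarmed)
    return sorted(
        node for node in alarmed
        if not (ancestors(graph, node) & alarmed)
    )


if __name__ == "__main__":
    alarms = {"router_a", "link_ab", "cell_1", "cell_2"}
    print(root_cause_candidates(CAUSAL_EDGES, alarms))
```

In this toy topology, alarms on the router, one link, and both cells collapse to a single candidate, the router, because every other alarmed element lies downstream of it. A correlational heuristic would see four co-occurring alarms with no ordering among them; the directed graph is what supplies the ordering.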