Monitoring Isn’t Observability: Lessons from Running Enterprise Microservices

Sumith Thalary

doi:10.63282/3050-922X.IJERET-V4I2P115

Authors

Sumith Thalary, Sr. Cloud DevOps Engineer, Rexel USA, Dallas, TX.

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V4I2P115

Keywords:

Observability Vs Monitoring, Enterprise Microservices Observability, Logs Metrics Traces Explained, Alert Fatigue In DevOps Teams, SRE Observability Best Practices, AIOps For Enterprise Monitoring, AI-Driven Observability Platform, ML-Based Anomaly Detection In Microservices, AI-Powered Root Cause Analysis, Machine Learning Predictive Alerting

Abstract

Microservices architectures have been adopted rapidly and have fundamentally changed how modern enterprise applications are developed, distributed, and managed. Unlike traditional monolithic systems, microservices spread complexity across hundreds or thousands of services that interact dynamically on cloud infrastructure. Under such conditions, keeping systems stable, available, and reliable becomes more difficult. Early adopters relied heavily on traditional monitoring tools that tracked CPU usage, memory, service uptime, and error rates. Although these metrics help in understanding the health of a system, they often provide little information for diagnosing failures in a highly distributed system. This limitation has given rise to observability as a broader paradigm for exploring system behavior. Monitoring is mainly concerned with a set of metrics and with notifications when a system deviates from its expected behavior. Observability, by contrast, is the property that allows engineers to gain insight into the internal state of a complex system by analyzing telemetry data such as logs, metrics, and distributed traces. Observability systems let teams investigate unknown failure modes and emergent behavior in real time, a capability required to manage microservices ecosystems in which the dependencies between services change constantly. Many enterprises have treated monitoring and observability as interchangeable notions, which has led to operational blind spots and ineffective incident resolution. Monitoring systems tend to answer the question "Is something wrong?" Observability systems address the more fundamental question "Why is it wrong?"
Without observability tools, engineering teams struggle to find root causes in distributed environments, which leads to extended outages, poor user experience, and high operational costs. This paper investigates the practical differences between monitoring and observability in a microservice-based enterprise setting. Drawing on practical operational experience and DevOps practice, the paper highlights the architectural, operational, and analytical constraints of traditional monitoring solutions. The study examines the lifecycle of telemetry data, considers distributed tracing approaches, and assesses how effectively observability platforms streamline system debugging, reliability engineering, and incident response processes. The paper proposes a structured observability framework that combines telemetry pipelines, service instrumentation, and data correlation, among other components. The framework builds on the three pillars of observability (metrics, logs, and traces) and also includes current best practices such as service dependency mapping, anomaly detection, and automated root cause analysis. The paper illustrates that organizations that adopt observability-based operations achieve shorter mean time to detection (MTTD) and mean time to recovery (MTTR) than organizations that rely exclusively on monitoring systems. Findings from enterprise microservices implementations indicate that observability can greatly improve system diagnostics, operational transparency, and collaboration between development and operations teams. Observability also supports proactive reliability engineering because it allows system behavior to be predicted under diverse workloads. The results support the contention that current distributed architectures are operationally complex and can hardly be managed with monitoring alone.
Businesses need to shift to holistic observability approaches that offer an in-depth understanding of service interactions and system behavior. Observability practices allow organizations to manage incidents better, increase system resilience, and sustain high service reliability in cloud-native environments that are becoming more difficult to manage.
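
The contrast drawn in the abstract between "Is something wrong?" and "Why is it wrong?" can be illustrated with a minimal Python sketch. It is not the framework proposed in the paper. The `Span` and `LogRecord` types, the service names, the sample data, and the 0.25 alert threshold are all hypothetical, chosen only to show how an aggregate metric differs from telemetry correlated by a shared trace identifier.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One unit of work in a distributed trace (hypothetical shape)."""
    trace_id: str
    service: str
    duration_ms: float
    error: bool


@dataclass
class LogRecord:
    """A log line tagged with the trace it belongs to (hypothetical shape)."""
    trace_id: str
    service: str
    message: str


def error_rate(spans):
    """Monitoring view: a single aggregate number to compare with a threshold."""
    return sum(s.error for s in spans) / len(spans) if spans else 0.0


def explain_failures(spans, logs):
    """Observability view: group failing spans and their logs by trace ID."""
    logs_by_trace = {}
    for rec in logs:
        logs_by_trace.setdefault(rec.trace_id, []).append(rec)

    report = {}
    for s in spans:
        if s.error:
            entry = report.setdefault(s.trace_id, {"services": [], "logs": []})
            entry["services"].append(s.service)
    for trace_id, entry in report.items():
        entry["logs"] = [
            f"{r.service}: {r.message}" for r in logs_by_trace.get(trace_id, [])
        ]
    return report


# Illustrative sample telemetry (invented for this sketch).
spans = [
    Span("t1", "checkout", 120.0, False),
    Span("t2", "checkout", 950.0, True),
    Span("t2", "inventory", 900.0, True),
    Span("t3", "checkout", 110.0, False),
]
logs = [
    LogRecord("t2", "inventory", "db connection pool exhausted"),
    LogRecord("t1", "checkout", "order placed"),
]

rate = error_rate(spans)                  # monitoring: is something wrong?
alert = rate > 0.25                       # assumed alert threshold
report = explain_failures(spans, logs)    # observability: why is it wrong?
print(rate, alert, report)
```

In this sketch the alert only says that the error rate crossed a threshold, while the correlated report names the services involved in the failing trace and attaches the log line that points toward a cause.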

