Building Observability into Full-Stack Systems: Metrics That Matter

Kiran Kumar Pappula; Sunil Anasuri; Guru Pramod Rusum

doi:10.63282/3050-922X.IJERET-V2I4P106

Authors

Kiran Kumar Pappula, Independent Researcher, USA
Sunil Anasuri, Independent Researcher, USA
Guru Pramod Rusum, Independent Researcher, USA


Keywords:

Observability, Full-Stack Systems, Metrics, Distributed Tracing, OpenTelemetry, ELK Stack, Jaeger

Abstract

This paper defines a framework for observability in full-stack systems that links frontend performance and backend health metrics with log aggregation and traceability. Observability practice is shifting towards data-rich, event-driven approaches, an important step towards resilient, scalable systems. The full-stack paradigm requires telemetry to be integrated at the frontend, backend, infrastructure, and application levels. We propose a unified model that quantifies the relationship between system behaviour and user experience using structured metrics, logs, and traces. The framework builds on the open OpenTelemetry standard and integrates Jaeger for distributed tracing, Prometheus for metrics collection, and the ELK stack for log aggregation. The objective is to gain deep insight into system state and performance bottlenecks and to support anomaly detection. The architecture is organized into five layers: Instrumentation, Telemetry Collection, Analysis, Visualization, and Action. Each layer is mapped to technical components and levels of observability. We also formulate an analytic model that measures observability coverage in terms of signal density and the correlation coefficient between traces and metrics. The framework was evaluated in a case study of a microservices-based e-commerce application with a React.js frontend. Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) both improved substantially. We also identify telemetry noise, data storage cost, and cross-domain correlation as challenges encountered in this case study. Our results offer a viable route for organizations seeking to implement observability in production.
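As a rough illustration of the quantities named in the abstract, the sketch below computes MTTD, MTTR, and a Pearson correlation coefficient between a trace-derived series and a metric series. It is not the paper's analytic model, whose exact formulas are not given here. The incident timestamps, the sample series, the helper names `mean_delta` and `pearson`, and the choice to measure both MTTD and MTTR from the start of the fault are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from math import sqrt


def mean_delta(starts, ends):
    """Mean of (end - start) over paired timestamps, returned as a timedelta."""
    deltas = [end - start for start, end in zip(starts, ends)]
    return sum(deltas, timedelta()) / len(deltas)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical incident log: (fault began, alert fired, fix confirmed).
incidents = [
    (datetime(2021, 1, 1, 10, 0), datetime(2021, 1, 1, 10, 6), datetime(2021, 1, 1, 10, 40)),
    (datetime(2021, 1, 2, 14, 0), datetime(2021, 1, 2, 14, 4), datetime(2021, 1, 2, 14, 30)),
]
began, detected, resolved = zip(*incidents)

# Assumption: both intervals are measured from the start of the fault.
mttd = mean_delta(began, detected)
mttr = mean_delta(began, resolved)

# Hypothetical per-window samples: span latency from traces, CPU load from metrics.
trace_latency_ms = [120, 135, 180, 240, 310]
cpu_utilisation = [0.35, 0.40, 0.55, 0.70, 0.88]
r = pearson(trace_latency_ms, cpu_utilisation)

print(f"MTTD={mttd}, MTTR={mttr}, trace/metric correlation={r:.3f}")
```

A coefficient near 1 for a pair of series like these would suggest that the trace and metric signals move together, which is the kind of cross-signal agreement the coverage model is described as measuring. Teams that define MTTR from detection rather than from fault start would pass `detected` instead of `began` as the first argument.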

