Resilience by Design: Site Reliability Engineering for Multi-Cloud Systems

Authors

  • Hitesh Allam, Software Engineer at Concor IT, USA

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V3I2P106

Keywords:

Site Reliability Engineering, Multi-Cloud, Resilience Engineering, Fault Tolerance, DevOps, Service Level Objectives (SLOs), Infrastructure Automation, Observability, Chaos Engineering, High Availability

Abstract

Companies are rapidly adopting multi-cloud architectures to improve performance, reduce vendor lock-in, and increase flexibility. This shift brings new challenges, particularly in guaranteeing consistency, resilience, and stability across several cloud platforms. This article investigates how Site Reliability Engineering (SRE) can serve as a strategic framework for addressing these challenges by building resilience into the basic design of multi-cloud systems. It gives an overview of SRE concepts and practices, emphasizing SRE's focus on automation, observability, and proactive problem management. The research underlines that resilience is not merely useful but a necessary property of distributed, cloud-native applications operating across different environments. The primary goal is to show how SRE serves as both a mindset and a method for building fault-tolerant systems capable of adapting to and recovering from disruptions with minimal impact on user experience. We describe an approach to building robust infrastructure that combines error budgets, service-level objectives (SLOs), chaos engineering, and thorough monitoring. A case study of a global company running applications on AWS, Azure, and Google Cloud offers insights into deployment challenges, reliability strategies, and the impact of SRE-driven practices. The results show considerable improvements in system availability, recovery times, and operational efficiency. Our findings confirm that resilience comes from intentional design and thorough engineering rather than from reactive fixes. The article offers a methodology for companies seeking to harness the benefits of multi-cloud ecosystems while maintaining high service reliability, positioning resilience as a fundamental design concern rather than a side issue.

Published

2022-06-30

Section

Articles

How to Cite

1.
Allam H. Resilience by Design: Site Reliability Engineering for Multi-Cloud Systems. IJERET [Internet]. 2022 Jun. 30 [cited 2025 Sep. 12];3(2):49-5. Available from: https://ijeret.org/index.php/ijeret/article/view/160