Designing Resilient Distributed Workflows: Stage-Aware API Failure Handling and Operational Trade-offs
DOI: https://doi.org/10.63282/3050-922X.IJERET-V7I1P128
Keywords: Distributed Systems, API Failure Handling, Resilient Workflows, Partial and Ambiguous Failures, Idempotency and Retries, Compensation and Reconciliation, Operational Recovery, Microservices Architecture
Abstract
Distributed applications increasingly rely on API-driven workflows that span multiple independently deployed services. While this architecture improves scalability and modularity, it also exposes systems to partial and ambiguous failures that can leave workflows in inconsistent states. The impact of a failure depends on factors such as the transaction stage, the certainty of the API outcome, system throughput, and business priorities, which makes one-size-fits-all recovery strategies ineffective. This paper analyzes API call failures in distributed workflows and evaluates context- and stage-aware mitigation strategies. Using an order management system as a case study, it examines failures during inventory reservation, payment processing, and shipping initiation. For each API, the paper considers stage-specific recovery mechanisms, including retries, compensation or rollback, deferred processing, and manual intervention. The analysis highlights trade-offs between correctness, operational cost, throughput, latency, and customer experience. The paper demonstrates that resilience in distributed workflows must be explicitly designed with stage and outcome awareness, combining automated recovery with manageable manual intervention to achieve practical and sustainable operations.
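As a rough illustration of what stage- and outcome-aware recovery could look like, the following Python sketch maps a workflow stage and an API outcome to one of the recovery mechanisms named above. The names (`Stage`, `Outcome`, `Action`, `choose_recovery`), the retry budget of three attempts, and the specific policy choices are assumptions made for this example; they are not taken from the paper.

```python
from enum import Enum, auto


class Stage(Enum):
    INVENTORY_RESERVATION = auto()
    PAYMENT_PROCESSING = auto()
    SHIPPING_INITIATION = auto()


class Outcome(Enum):
    SUCCESS = auto()
    DEFINITE_FAILURE = auto()  # the call is known not to have taken effect
    AMBIGUOUS = auto()         # e.g. a timeout: the effect is unknown


class Action(Enum):
    PROCEED = auto()
    RETRY = auto()
    COMPENSATE = auto()
    DEFER = auto()
    MANUAL_REVIEW = auto()


# Hypothetical policy for a definite failure once retries are exhausted.
EXHAUSTED_ACTION = {
    Stage.INVENTORY_RESERVATION: Action.COMPENSATE,  # cancel the order
    Stage.PAYMENT_PROCESSING: Action.COMPENSATE,     # release the reservation
    Stage.SHIPPING_INITIATION: Action.DEFER,         # payment done; ship later
}


def choose_recovery(stage, outcome, attempts, idempotent, max_attempts=3):
    """Pick a recovery action from the stage, the outcome, and the attempts so far."""
    if outcome is Outcome.SUCCESS:
        return Action.PROCEED
    # An ambiguous result must not be blindly retried unless the call is
    # idempotent; otherwise a retry could duplicate the side effect.
    if outcome is Outcome.AMBIGUOUS and not idempotent:
        return Action.MANUAL_REVIEW
    if attempts < max_attempts:
        return Action.RETRY
    # Retry budget exhausted: an unresolved ambiguous outcome is escalated
    # for reconciliation; a definite failure follows the per-stage policy.
    if outcome is Outcome.AMBIGUOUS:
        return Action.MANUAL_REVIEW
    return EXHAUSTED_ACTION[stage]
```

For instance, under these assumptions `choose_recovery(Stage.PAYMENT_PROCESSING, Outcome.AMBIGUOUS, 1, idempotent=False)` escalates to manual review rather than risking a duplicate charge, while the same call with `idempotent=True` is retried. A production policy would also weigh throughput, latency, and business priority, which this sketch leaves out.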