Agentic Data Engineering: LLM-Augmented Pipeline Generation, Self-Healing ETL, and Autonomous Repair

Authors

  • Jeevan Krishna Paruchuri, Independent Researcher, USA.

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V7I2P105

Keywords:

LLM Agents, ReAct, Self-Healing Pipelines, AIOps, Data Engineering, Tool Use, Human-in-the-Loop, Pipeline Repair

Abstract

Production data engineering organizations spend a significant fraction of their on-call time on a narrow class of recurring incidents (schema drift, transient cluster failures, late-arriving data, configuration regressions) that are individually simple but collectively expensive. This paper presents a position and prototype design for agentic data engineering: the use of large language model (LLM) agents, organized around the ReAct (Reason and Act) framework, as a first responder to pipeline failures, augmenting rather than replacing the on-call engineer. The design is grounded in operational experience from a production banking data platform comprising 35 production pipelines, an observed incident frequency of 15-20 failures per month, and 3-4 schema changes per month: exactly the conditions under which a well-scoped agent can absorb a meaningful share of triage work. We describe a tool-use architecture in which the agent has access to a fixed, audited set of read-mostly diagnostic tools (Airflow DAG status, Spark job logs, Delta Lake history, Trino query plans, schema registry diff) and a smaller set of write-capable repair tools that are gated behind explicit human approval before any change reaches production. We propose an evaluation methodology over 6 months of retrospective real incidents plus a synthetic incident benchmark, with a target diagnostic accuracy of 83% and a target mean-time-to-resolution (MTTR) reduction of 36x relative to the human-only baseline. We address the regulatory realities that govern any automated change in financial services (SOX change management, GDPR data handling, immutable audit trails) and argue that the chain-of-thought reasoning produced by ReAct agents is not a limitation but an audit feature, making the agent's decisions more inspectable than equivalent script-based automation.
We conclude with a frank discussion of the limitations: hallucinated diagnoses, the cost of tool calls under high incident volume, the risk of automation bias on the human reviewer, and the need for narrowly scoped agents rather than generalist ones. The contribution of the paper is a practitioner-oriented design framework and a reproducible evaluation protocol, intended to bridge the gap between the LLM agent literature and the operational realities of regulated data platforms.
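The gated tool-use loop summarized in the abstract can be sketched in a few lines. The sketch below is illustrative only: the tool names, incident format, and approval callback are assumptions for exposition, not the paper's actual implementation. It shows the two design points the abstract emphasizes: read-mostly diagnostic tools execute freely, while any write-capable repair tool is routed through an explicit human approval gate, and the full (thought, action, observation) trace is retained as an audit trail.

```python
from typing import Callable

# Read-mostly diagnostic tools (stubbed for illustration).
DIAGNOSTIC_TOOLS: dict[str, Callable[..., str]] = {
    "airflow_dag_status": lambda dag_id: f"{dag_id}: failed at task load_orders",
    "schema_registry_diff": lambda table: f"{table}: column 'discount' added upstream",
}

# Write-capable repair tools (stubbed) - gated behind human approval.
REPAIR_TOOLS: dict[str, Callable[..., str]] = {
    "evolve_table_schema": lambda table: f"{table}: schema evolved, column merged",
}

def run_incident(steps, approve: Callable[[str, str, dict], bool]):
    """Execute a fixed sequence of (thought, tool, args) ReAct-style steps.

    Every write-capable tool call is routed through `approve`; the full
    (thought, tool, observation) trace is returned as an audit trail.
    """
    trace = []
    for thought, tool, args in steps:
        if tool in DIAGNOSTIC_TOOLS:
            observation = DIAGNOSTIC_TOOLS[tool](**args)
        elif tool in REPAIR_TOOLS:
            if approve(thought, tool, args):  # human-in-the-loop gate
                observation = REPAIR_TOOLS[tool](**args)
            else:
                observation = "repair rejected by human reviewer"
        else:
            observation = f"unknown tool: {tool}"
        trace.append((thought, tool, observation))
    return trace

# Example: a schema-drift incident with an approving reviewer.
steps = [
    ("Check why the DAG failed", "airflow_dag_status", {"dag_id": "orders_daily"}),
    ("Failure looks like drift; diff the schema", "schema_registry_diff", {"table": "orders"}),
    ("Drift confirmed; propose schema evolution", "evolve_table_schema", {"table": "orders"}),
]
trace = run_incident(steps, approve=lambda thought, tool, args: True)
```

In a real deployment the `steps` sequence would be produced iteratively by the LLM (each observation feeding the next reasoning step) and `approve` would block on a ticketing or chat workflow, but the gating invariant is the same: no write reaches production without a recorded human decision.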

References

[1] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing Reasoning and Acting in Language Models," in Proc. ICLR, 2023.

[2] OpenAI, "GPT-4 Technical Report," 2023. https://openai.com/research/gpt-4

[3] M. Chen et al., "Evaluating Large Language Models Trained on Code," 2021. https://arxiv.org/abs/2107.03374

[4] T. Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," in Proc. NeurIPS, 2023.

[5] A. Vaswani et al., "Attention Is All You Need," in Proc. NeurIPS, 2017.

[6] D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," in Proc. NeurIPS, 2015.

[7] E. Breck, S. Cai, E. Nielsen, M. Salib, and D. Sculley, "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction," in IEEE Big Data, 2017.

[8] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, "Data Lifecycle Challenges in Production Machine Learning: A Survey," SIGMOD Record, 2018.

[9] M. Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in Proc. NSDI, 2012.

[10] M. Armbrust et al., "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores," Proc. VLDB Endowment, 2020.

[11] Apache Airflow Documentation. https://airflow.apache.org/docs/

[12] Apache Spark Documentation. https://spark.apache.org/docs/latest/

[13] Sarbanes-Oxley Act of 2002, Public Law 107-204, 116 Stat. 745.

[14] Regulation (EU) 2016/679 (General Data Protection Regulation, GDPR).

[15] B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Eds., Site Reliability Engineering: How Google Runs Production Systems. O'Reilly, 2016.

[16] P. Lewis et al., "Retrieval-Augmented Generation for Enterprise Data Platforms," Proc. VLDB, 2025.

[17] J. Park et al., "Agentic AI Systems for Autonomous Data Pipeline Management," IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 3, pp. 891-907, 2025.

[18] W. Chen et al., "Unified Lakehouse Architectures: Open Formats, Zero-Copy Sharing, and AI-Native Governance," Proc. SIGMOD, 2026.

[19] L. Zhang et al., "Responsible AI Frameworks for Regulated Industries: A Practitioner Survey," IEEE Access, vol. 14, pp. 28301-28319, 2026.

Published

2026-04-09

Section

Articles

How to Cite

Paruchuri JK. Agentic Data Engineering: LLM-Augmented Pipeline Generation, Self-Healing ETL, and Autonomous Repair. IJERET [Internet]. 2026 Apr. 9 [cited 2026 Apr. 20];7(2):35-4. Available from: https://ijeret.org/index.php/ijeret/article/view/565