Natural Language Interfaces for Self-Service Analytics on Data Lakes: Design Patterns, Governance, and Lessons from a Production Deployment

Jeevan Krishna Paruchuri

doi:10.63282/3050-922X.IJERET-V6I3P118

Authors

Jeevan Krishna Paruchuri, Independent Researcher, USA.

DOI:

https://doi.org/10.63282/3050-922X.IJERET-V6I3P118

Keywords:

Natural Language to SQL, Self-Service Analytics, LLM, Data Lake, dbt Semantic Layer, Governance, Row-Level Security, Schema Retrieval

Abstract

The promise of natural-language interfaces to enterprise data is older than the LLMs that finally make it tractable, and the gap between the demo and the production deployment is larger than most organizations appreciate. This paper presents a case study of building a natural-language analytics interface on top of a production banking data lake comprising 1,300+ student engagement and curriculum datasets served by dbt Semantic Layer. The existing SQL surface had reached 80% dbt Semantic Layer adoption and was processing more than 100,000 queries per month at 8-25 ms p99 query overhead, while the remaining 20%, non-technical analysts, were structurally excluded from self-service because they did not write SQL. We describe the architecture that emerged: a schema-retrieval layer that selects only the relevant subset of the 1,300-table catalog for each natural-language question; an LLM that produces a candidate SQL query; a multi-stage validation layer that enforces governance constraints (row-level security, column masking, query cost ceilings) before any query is executed; and a post-hoc rewriting step that handles the LLM's failure modes: qualified column references, ambiguous joins, and hallucinated columns. We report pilot results from a four-week deployment to a population of business analysts: 75% adoption (defined as analysts running at least one NL query per week), 80% query accuracy (the produced SQL returns the correct answer to the asked question), and 1.2-second median end-to-end latency. We also report the limitations: hallucinated column names remain the dominant failure mode despite schema retrieval, LLM API costs are non-trivial at the query volumes the SQL surface handles, and the governance plumbing is more complex than the natural-language layer.
The contribution is a practitioner-grounded design framework with explicit attention to row-level security and audit, intended for teams considering whether the cost of building this layer is justified by the analyst population it unblocks.
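
As a rough illustration of the schema-retrieval and validation stages described above, the following Python sketch may help. It is not the deployed system: the abstract does not say how retrieval or validation are implemented, so keyword overlap stands in for the retrieval method, and a regular expression that accepts only a single-table SELECT stands in for a real SQL parser. The names Table, retrieve_tables, validate_and_rewrite, and the example catalog are hypothetical, and the cost estimate is assumed to be supplied by the caller (for example from a query planner).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical catalog entry carrying per-user governance metadata."""
    name: str
    columns: frozenset
    masked: frozenset = frozenset()  # columns the current user may not read
    row_filter: str = ""             # row-level security predicate, if any


class ValidationError(Exception):
    """Raised when a candidate query violates a governance rule."""


def tokenize(text):
    # Lowercase and split on non-alphanumerics: "account_id" -> {"account", "id"}.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_tables(question, catalog, k=2):
    """Return up to k tables whose names or columns share words with the question."""
    words = tokenize(question)

    def score(table):
        vocab = tokenize(table.name)
        for column in table.columns:
            vocab |= tokenize(column)
        return len(words & vocab)

    ranked = sorted(catalog, key=score, reverse=True)
    return [t for t in ranked[:k] if score(t) > 0]


# Assumption: only "SELECT col, ... FROM table" is accepted; anything else is rejected.
SIMPLE_SELECT = re.compile(
    r"^\s*select\s+(?P<cols>.+?)\s+from\s+(?P<table>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def validate_and_rewrite(sql, tables, estimated_cost, cost_ceiling=100.0):
    """Check a candidate query against governance rules, then append the row filter."""
    match = SIMPLE_SELECT.match(sql)
    if match is None:
        raise ValidationError("unsupported query shape in this sketch")

    by_name = {t.name: t for t in tables}
    table = by_name.get(match["table"].lower())
    if table is None:
        raise ValidationError(f"table not in retrieved schema: {match['table']}")

    cols = [c.strip().lower() for c in match["cols"].split(",")]
    unknown = [c for c in cols if c not in table.columns]
    if unknown:  # "*" also lands here, since it is not an explicit column
        raise ValidationError(f"hallucinated columns: {unknown}")

    blocked = [c for c in cols if c in table.masked]
    if blocked:
        raise ValidationError(f"masked columns requested: {blocked}")

    if estimated_cost > cost_ceiling:
        raise ValidationError("estimated cost exceeds ceiling")

    rewritten = f"SELECT {', '.join(cols)} FROM {table.name}"
    if table.row_filter:
        rewritten += f" WHERE {table.row_filter}"
    return rewritten


# Hypothetical two-table catalog used only for illustration.
ACCOUNTS = Table(
    name="accounts",
    columns=frozenset({"account_id", "balance", "ssn", "region"}),
    masked=frozenset({"ssn"}),
    row_filter="region = 'EU'",
)
COURSES = Table(name="courses", columns=frozenset({"course_id", "title"}))
CATALOG = [ACCOUNTS, COURSES]
```

In this sketch the checks run in a fixed order (query shape, known table, known columns, masking, cost), and the row-level filter is appended only after every check passes, so a rejected query is never rewritten or executed. A production validator would need a real SQL parser to handle joins, existing WHERE clauses, and qualified column names.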

