Project Case Study
English to SQL Exploration
An exploration into turning natural-language questions into reliable SQL—focused on safety, explainability, and analytics outputs that non-technical users can trust.
Context
Business users often know what they want to measure, but not how to query the data. Analysts become a bottleneck for simple questions (“show me utilisation by client”, “compare hours this month vs last month”). The goal was to reduce friction without compromising data integrity or security.
Problem
- High dependency on analysts: simple questions still required SQL skills and domain knowledge.
- Ambiguity in language: terms like “utilisation”, “active matters”, and “this quarter” need consistent definitions.
- Safety risk: free-form query generation can produce expensive, incorrect, or over-broad SQL.
- Trust gap: users need to understand what was queried and why the results are correct.
Approach
- Schema grounding: constrained the model to approved tables, columns, and business definitions.
- Intent → query plan: generated a structured plan first (metrics, filters, time window, grouping) before SQL.
- Guardrails: enforced read-only queries, row limits, safe joins, and blocked sensitive fields by policy.
- Validation loop: SQL linting + explain step (“what this query does”) and retry on errors.
- UX output: returned results as tables + simple charts, with plain-English explanation alongside.
Key Decisions
- Explainability is mandatory: every SQL response includes a “what it did” summary.
- Definitions matter: business terms are backed by a glossary and reused consistently.
- Safety first: default to smaller scoped queries and require refinement for broader pulls.
- Human override: analysts can promote approved queries into “trusted templates”.
Learnings
English-to-SQL works best when it behaves like a guided analyst—not a magic box. The more you formalise definitions, enforce guardrails, and show the logic, the more usable (and safer) the system becomes.
Next Enhancements
- Clarifying questions when intent is ambiguous (“Which date range?”, “Which client group?”).
- Cost-aware planning (estimate query weight and refuse expensive patterns by default).
- Semantic layer for business metrics so “utilisation” is always computed the same way.