Text-to-SQL and Database Agents
How agents answer questions over structured data by generating and executing SQL: schema context, few-shot prompting, self-correction, safety constraints, benchmarks (Spider, BIRD-SQL), and tooling (LangChain SQLDatabaseToolkit, LlamaIndex NLSQLTableQueryEngine, Vanna, MCP Postgres server).
A text-to-SQL agent accepts a natural-language question, generates a SQL query, executes it against a real database, and synthesizes an answer from the result. The gap between a demo and a production system is wide: schema size, ambiguity, dialect differences, multi-table joins, and query safety all require deliberate engineering.
Why it is hard
Schema understanding — large databases have hundreds of tables and thousands of columns. Feeding the entire schema into the context window is often infeasible and noisy. The model must identify which tables and columns are relevant to the question (schema linking) before writing correct SQL.
Ambiguity — natural language is imprecise. "Sales last quarter" requires knowing the calendar convention, time-zone handling, and whether "sales" means orders, invoices, or revenue rows.
Dialect differences — SQLite, PostgreSQL, MySQL, BigQuery, Snowflake, and DuckDB differ in date functions, string operators, window syntax, and quoting rules. A prompt tuned for one dialect degrades on another.
Correctness — a query can execute without error yet return a wrong answer (wrong JOIN condition, missing filter, off-by-one aggregation). Execution success does not imply semantic correctness.
Core techniques
Schema context injection — provide the model with DDL (CREATE TABLE statements), column descriptions, and 2–3 sample rows per table. This is the single highest-leverage intervention. For large schemas, use retrieval to select only the relevant subset of tables and columns — see /resources/rag-retrieval-for-agents for the retrieval patterns.
Few-shot examples — include 3–10 verified question/SQL pairs in the prompt. Examples calibrate the model to your dialect, naming conventions, and common query patterns. Dynamic few-shot (retrieving the most similar examples for each incoming question) consistently outperforms static few-shot on held-out questions.
Self-correction (run → read error → fix) — after generating a query, execute it; if it returns a database error, feed the error message back to the model with the original question and schema and ask it to produce a corrected query. One correction round captures the majority of fixable errors. Limit to 2–3 attempts to avoid infinite loops.
Query validation before execution — for higher-stakes pipelines, add a validation step: parse the SQL with a library (e.g., sqlglot) and check that all referenced table/column names exist in the schema before executing. This catches hallucinated identifiers without a live database round-trip.
Read-only constraint — connect to the database with a read-only role (no INSERT/UPDATE/DELETE privileges). This is a hard safety boundary, not a prompt instruction.
Safety
Text-to-SQL agents combine two injection attack surfaces: the natural-language input (prompt injection) and the generated SQL (SQL injection). A user can craft a question that instructs the model to emit destructive SQL, or stacked statements that escape a read-only transaction. Datadog Security Labs documented a real example: the reference Anthropic Postgres MCP server allowed statement stacking that bypassed its read-only transaction guard (COMMIT; DROP SCHEMA public CASCADE; executed successfully), patched by switching to prepared statements.
Mandatory controls:
- Read-only DB role — enforce at the database connection level, not in the prompt.
- Row limits — add
LIMIT Nto every generated query (or enforce via query rewriting) to prevent runaway full-table scans. - Allowlist / schema restriction — expose only the tables the agent needs; deny access to system catalogs and sensitive tables.
- No statement stacking — use parameterized queries or prepared statement APIs; do not pass raw model output as a multi-statement string to the database driver.
- Human approval for writes — if the use case requires write access, add an explicit human-in-the-loop confirmation step before executing any INSERT/UPDATE/DELETE.
See /resources/agentic-security-checklist (prompt injection mitigations) and /resources/code-execution-sandboxing (database sandboxing patterns).
Transparency
Return the generated SQL to the caller alongside the result. Agents that hide the SQL make it impossible to audit wrong answers or catch injection attempts. Logged SQL is also the primary signal for fine-tuning and few-shot improvement.
Benchmarks
Spider (Yale LILY Lab) — 10,181 questions over 200 databases across 138 domains, with complex cross-domain SQL. The original benchmark where models now exceed 90% execution accuracy, making it a baseline rather than a frontier test. Official site: https://yale-lily.github.io/spider
Spider 2.0 (ICLR 2025 Oral) — enterprise-grade successor using real Snowflake, BigQuery, and SQLite databases with large, complex schemas. The best models score roughly 21% success rate, highlighting the gap between toy benchmarks and production workloads. GitHub: https://github.com/xlang-ai/Spider2
BIRD-SQL — 12,751 question-SQL pairs over 95 large databases (33.4 GB total) spanning 37 professional domains including dirty/null data and external-knowledge requirements. Evaluates execution accuracy (EX): whether the generated query returns the correct result set. Human performance is ~93% EX; strong models reach ~72% EX (as of mid-2026). BIRD-Interact accepted to ICLR 2026 (Oral). Official site: https://bird-bench.github.io/
Tooling
LangChain SQLDatabaseToolkit — a set of tools (QuerySQLDatabaseTool, InfoSQLDatabaseTool, ListSQLDatabaseTool) used with create_react_agent (LangGraph). The toolkit handles schema introspection, query execution, and error feedback. Scope credentials to read-only before connecting. Docs: https://python.langchain.com/docs/modules/agents/toolkits/sql_database
LlamaIndex NLSQLTableQueryEngine — generates SQL from natural language over a SQLAlchemy- connected database. For schemas too large to fit the context window, SQLTableRetrieverQueryEngine dynamically selects relevant tables at query time. Docs: https://developers.llamaindex.ai/python/examples/index_structs/struct_indices/sqlindexdemo/
Vanna (vanna-ai/vanna) — open-source Python RAG framework for text-to-SQL. Trains on DDL, documentation strings, and verified question-SQL pairs stored in a vector store; uses retrieved examples as few-shot context at inference time. GitHub: https://github.com/vanna-ai/vanna
MCP Postgres server — the reference @modelcontextprotocol/server-postgres (now archived at modelcontextprotocol/servers-archived) provided read-only PostgreSQL access as an MCP tool. Community successors with configurable access include crystaldba/postgres-mcp. All queries should run in READ ONLY transactions; see the Datadog SQL-injection case study for why prepared statements are required even in read-only mode. Archived repo: https://github.com/modelcontextprotocol/servers-archived/tree/main/src/postgres
Practical checklist
- Give the model the schema (DDL + column descriptions + sample rows), not just table names.
- Use dynamic few-shot: retrieve the 3–5 most similar verified examples per question.
- Connect with a read-only, schema-scoped database role — enforce at the driver, not the prompt.
- Add a
LIMITclause to every generated query. - Validate SQL syntax and referenced identifiers before execution.
- Run the query; on error, feed the error back and retry (max 2–3 rounds).
- Return the SQL in the response for auditability.
- For writes, require explicit human approval.
Cross-links: /resources/rag-retrieval-for-agents · /resources/agentic-security-checklist · /resources/code-execution-sandboxing · /resources/evaluating-ai-agents
Verified sources
- Spider benchmark (Yale LILY Lab): https://yale-lily.github.io/spider
- Spider 2.0 (ICLR 2025 Oral, GitHub): https://github.com/xlang-ai/Spider2
- BIRD-SQL benchmark: https://bird-bench.github.io/
- LangChain SQLDatabaseToolkit docs: https://python.langchain.com/docs/modules/agents/toolkits/sql_database
- LangChain SQL agent API reference: https://reference.langchain.com/python/langchain-community/agent_toolkits/sql/toolkit/SQLDatabaseToolkit
- LlamaIndex text-to-SQL guide: https://developers.llamaindex.ai/python/examples/index_structs/struct_indices/sqlindexdemo/
- Vanna (vanna-ai/vanna) GitHub: https://github.com/vanna-ai/vanna
- MCP Postgres server (archived): https://github.com/modelcontextprotocol/servers-archived/tree/main/src/postgres
- Datadog — SQL injection in the Postgres MCP server: https://securitylabs.datadoghq.com/articles/mcp-vulnerability-case-study-SQL-injection-in-the-postgresql-mcp-server/