In Part 1, the naïve Text-to-SQL approach sent 8,414 tokens of schema context to generate 16 tokens of SQL - a 526:1 input-to-output ratio. This post engineers the fix: a deterministic schema pruner - context engineering at the schema layer - that selects only the tables relevant to each query, with no LLM dependency.

[Figure] Schema Pruning for Text-to-SQL: A dense cluster of database tables is filtered through a deterministic pruner, selecting only the few tables relevant to each query, without needing an LLM.

TL;DR: A deterministic schema pruner, using FK graph traversal and 3-layer entity resolution, reduces context by 93% on average across 15 benchmark queries against a 35-table schema. Mean recall is 1.00 (every needed table is selected), mean precision is 0.93 (minimal extra tables). No LLM calls. Fully reproducible. The implementation is ~500 lines of Python with one external dependency (tiktoken).

The Problem: 526:1

Part 1 established the token waste problem with a concrete measurement. For the simplest possible query - “Show all active products in the Electronics category” - the naïve approach sends the full 35-table DDL as context:

| Context strategy | Input tokens | vs. 16-token answer |
|---|---|---|
| Full schema DDL (naïve approach) | 8,414 | 526x |
| CREATE TABLE blocks only | 5,230 | 327x |
| Only the products table | 166 | 10x |

Note on token counts: All benchmarks in this post use 5,164 tokens (CREATE TABLE blocks only from the committed schema, measured with tiktoken) as the baseline. Part 1 reported 8,414 for full DDL including comments, DROP statements, and operational commands; stripping those to CREATE TABLE blocks only is itself a 38% reduction - a point from Part 1 that the benchmarks below build on.

The query only needs one table. The naïve approach sends all 35. That is not just expensive - at enterprise scale (200-500+ tables), the schema exceeds most models’ context windows entirely and the approach becomes technically impossible.

But the gap between 5,230 tokens (every CREATE TABLE block) and 166 tokens (just the products table) suggests that if we can identify which tables a query needs, we can cut context by 95%+ per query.

The question is: how do you identify the right tables from a natural language query, without using an LLM?

The Idea: FK Graph Pruning

The schema is not a flat list of tables. It is a graph. Foreign key relationships define edges between tables, and those edges encode which tables are structurally related.

Given a query like “Show total revenue by product category”, the approach is:

  1. Resolve the query to seed tables - tables directly referenced by the natural language
  2. Traverse the FK graph from those seeds to find structurally connected tables
  3. Prune the schema to only the selected tables

[Figure] FK graph pruning for GQ-013: “Show revenue by product for products manufactured in Singapore.” The resolver maps query terms to 5 seed tables out of 35, reducing tokens from 5,164 to 751 (85.5%). 30 tables are excluded.

For the revenue query:

  • “revenue” resolves to orders and order_items (seed tables)
  • “product” resolves to products (seed table)
  • All three are FK-connected (order_items → orders, order_items → products), so no additional traversal needed
  • Result: 3 tables instead of 35 - a 91% token reduction

The critical insight is that entity resolution can be deterministic. Business terminology maps predictably to database tables: “revenue” means orders, “shipping” means shipments, “quality” means quality inspections. These mappings don’t change between queries. They can be compiled once and applied without an LLM call.
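The compiled mapping is nothing more exotic than a dict. A minimal sketch of its shape, using only the example entries quoted above (the repo's full map has ~30 entries):

```python
# Minimal sketch of the entity map described above: a plain dict compiled
# once, consulted per query with no LLM call. Entries shown are only the
# examples from the post, not the repo's full map.
ENTITY_MAP = {
    "revenue": {"orders", "order_items"},
    "shipping": {"shipments"},
    "quality": {"quality_inspections"},
    "profit": {"profitability_analysis"},
}

def lookup(term: str) -> set:
    """Resolve one business term to its seed tables (empty set if unmapped)."""
    return ENTITY_MAP.get(term.lower(), set())
```

Because the map is plain data, adding a new business term is a one-line change - no retraining, no re-indexing, no API call.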

Other approaches to table selection exist: [1] embedding-based retrieval - encode table descriptions as vectors, retrieve nearest matches (RAG for schema selection), [2] LLM-powered extraction - ask a model to identify relevant tables directly. Both work, but both add latency, cost, and infrastructure dependencies. Embedding-based retrieval adds ~50-200ms per query for vector lookup plus requires maintaining an embedding index; LLM extraction adds ~500-2000ms plus API cost. The deterministic pruner runs in <1ms.

Deterministic keyword resolution is faster, fully reproducible, and free. For the majority of queries where business terms map predictably to tables, it is the better trade-off.

Building the Pruner

The implementation has two core components: a 3-layer entity resolver that identifies seed tables from natural language, and a BFS traversal that expands seeds through the FK graph.

3-Layer Entity Resolution

The resolver processes the query through three layers, each catching what the previous one misses:

def resolve_tables(self, query: str) -> Set[str]:
    seeds: Set[str] = set()
    query_lower = query.lower()
    query_words = set(re.findall(r"\w+", query_lower))
    resolved_words: Set[str] = set()

    # Layer 1: Direct table name matching
    for table in self._all_tables:
        singular = _singularize(name=table)
        if table in query_lower:
            seeds.add(table)
            resolved_words.add(table)
        elif singular in query_lower:
            seeds.add(table)
            resolved_words.add(singular)

    # Layer 2: Business entity mapping
    for term, tables in ENTITY_MAP.items():
        if term in query_words:
            seeds.update(tables)
            resolved_words.add(term)

    # Layer 3: Column name matching
    for col_name, tables in self._column_index.items():
        if len(col_name) < 6:
            continue
        if col_name in resolved_words:
            continue
        if col_name in query_lower:
            seeds.update(tables)

    if not seeds:
        seeds = {"orders", "products", "customers"}

    return seeds

Layer 1 catches direct references: “Show all orders” → orders. It handles both plural table names and their singular forms (“each product” → products).

Layer 2 maps business terminology to tables. The ENTITY_MAP has ~30 entries mapping terms like “revenue” to {orders, order_items}, “shipping” to {shipments}, “profit” to {profitability_analysis}, and so on. This is where domain knowledge lives. A marketing analyst asking about “campaign conversion rates” doesn’t know (or care) that the data lives in campaigns and conversion_funnels. The entity map bridges that gap.

Layer 3 catches column-name references that layers 1 and 2 missed. If the query mentions unit_cost, and unit_cost is a column in raw_materials, that table gets added as a seed. A stop-list excludes generic column names (id, status, name, created_at) that would match too broadly.

One subtlety: Layer 3 skips words already resolved by layers 1-2. Without this, “revenue” would match both the entity map (Layer 2 → orders, order_items) and a column name in profitability_analysis (Layer 3), pulling in an unrelated table. The resolved_words set prevents this double-counting.
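The _column_index that Layer 3 consults can be built once at initialisation by inverting the schema. A sketch of that inversion, assuming column names per table are already parsed out of the DDL (the stop-list entries are the generic names mentioned above):

```python
# Hypothetical sketch: invert table -> columns into column -> tables,
# applying the stop-list of generic column names described above.
STOPLIST = {"id", "status", "name", "created_at"}

def build_column_index(columns_by_table: dict) -> dict:
    index: dict = {}
    for table, cols in columns_by_table.items():
        for col in cols:
            if col in STOPLIST:
                continue  # generic names would match almost every query
            index.setdefault(col, set()).add(table)
    return index
```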

FK Graph Traversal

Once seed tables are identified, find_minimal_tables runs BFS through the FK graph, bounded by a max_depth parameter that controls how many FK hops to traverse. Higher depth finds more structurally related tables but risks pulling in too many. The benchmarks below show that depth=0 (seeds only, no traversal) achieves 100% recall on all test queries - the 3-layer resolver is strong enough that BFS expansion is rarely needed.

This is a useful result: it means the pruner’s accuracy comes from entity resolution quality, not graph traversal depth. For well-formed business queries, the right tables are identified directly from the natural language. BFS is insurance for edge cases where the query doesn’t explicitly mention a needed join table - for instance, if a query mentions “Singapore products” but not “manufacturing” or “production,” the resolver finds products directly but needs production_runs (FK-connected to products) for the Singapore filter. BFS at depth=1 would catch it.
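The bounded BFS described above can be sketched as a standalone function (the function name and adjacency-map shape are assumptions here; in the repo, find_minimal_tables is a method on the pruner):

```python
from collections import deque

def bfs_expand(seeds: set, fk_edges: dict, max_depth: int = 1) -> set:
    """Expand seed tables through the FK adjacency map, bounded by hop count.

    fk_edges maps each table to its FK-connected neighbours (bidirectional).
    max_depth=0 returns the seeds unchanged.
    """
    selected = set(seeds)
    frontier = deque((table, 0) for table in seeds)
    while frontier:
        table, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # do not expand past the hop budget
        for neighbour in fk_edges.get(table, ()):
            if neighbour not in selected:
                selected.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return selected
```

On the Singapore example: with seeds={"products"}, depth=0 returns just the seed, while depth=1 pulls in production_runs through its FK edge to products.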

Parsing the FK Graph

The pruner builds its FK graph by parsing DDL at initialization. It handles both inline FOREIGN KEY constraints and ALTER TABLE ADD FOREIGN KEY statements:

# Parse inline FOREIGN KEY per CREATE TABLE block
fk_in_block = re.compile(
    r"FOREIGN\s+KEY\s*\((\w+)\)\s*REFERENCES\s+(\w+)\s*\((\w+)\)",
    re.IGNORECASE,
)
for table_name, ddl_block in self._table_ddl.items():
    for match in fk_in_block.finditer(ddl_block):
        self._add_fk_edge(table_name, match.group(1), match.group(2), match.group(3))

Each FK creates a bidirectional edge. If orders.customer_id references customers.customer_id, both orders → customers and customers → orders are added. This ensures BFS can traverse in either direction.

For the 35-table schema, this produces 59 FK edges - substantial connectivity that BFS can exploit when needed.
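The bidirectional bookkeeping can be sketched as follows (the repo's _add_fk_edge also records the column names; this sketch keeps only the table-level adjacency that BFS needs):

```python
def add_fk_edge(graph: dict, child_table: str, parent_table: str) -> None:
    """Record an FK as a bidirectional edge so BFS can walk either direction."""
    graph.setdefault(child_table, set()).add(parent_table)
    graph.setdefault(parent_table, set()).add(child_table)
```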

Benchmarks

I benchmarked the pruner against 15 golden queries spanning simple single-table lookups, multi-table joins, cross-domain analytics, and complex inventory queries. Each query runs through resolve_tables → find_minimal_tables (depth=0) → prune_schema, measuring token reduction, precision, and recall against expected tables. Precision measures whether all selected tables are needed; recall measures whether all needed tables are selected.

| ID | Query | Tokens | Reduction | P | R |
|---|---|---|---|---|---|
| GQ-001 | How many orders were placed last month? | 5,164 → 157 | 97.0% | 1.00 | 1.00 |
| GQ-002 | Show total revenue by product category | 5,164 → 465 | 91.0% | 1.00 | 1.00 |
| GQ-003 | List all products with their current inventory levels | 5,164 → 239 | 95.4% | 1.00 | 1.00 |
| GQ-006 | Show me monthly sales trends for the last 6 months | 5,164 → 365 | 92.9% | 1.00 | 1.00 |
| GQ-007 | Which employees have the highest number of quality inspections? | 5,164 → 218 | 95.8% | 1.00 | 1.00 |
| GQ-008 | Show me the best products | 5,164 → 100 | 98.1% | 1.00 | 1.00 |
| GQ-009 | What is the average order value by customer segment? | 5,164 → 471 | 90.9% | 0.67 | 1.00 |
| GQ-010 | Show production run details for product P001 | 5,164 → 313 | 93.9% | 1.00 | 1.00 |
| GQ-011 | Show shipment status and delivery partner details | 5,164 → 307 | 94.1% | 1.00 | 1.00 |
| GQ-012 | What is the total cost allocation per department? | 5,164 → 453 | 91.2% | 0.50 | 1.00 |
| GQ-013 | Show revenue by product for products manufactured in Singapore | 5,164 → 751 | 85.5% | 1.00 | 1.00 |
| GQ-014 | What is the conversion rate for each marketing campaign? | 5,164 → 319 | 93.8% | 1.00 | 1.00 |
| GQ-015 | Which warehouses have products below their safety stock reorder point? | 5,164 → 658 | 87.3% | 0.80 | 1.00 |
| GQ-016 | Show supplier reliability scores and their raw material lead times | 5,164 → 189 | 96.3% | 1.00 | 1.00 |
| GQ-017 | Show me monthly profit margins by product category | 5,164 → 389 | 92.5% | 1.00 | 1.00 |

Summary

| Metric | Value |
|---|---|
| Mean token reduction | 93.0% |
| Median token reduction | 93.8% |
| Mean precision | 0.93 |
| Mean recall | 1.00 |
| Total tokens across 15 queries | 77,460 → 5,394 (72,066 saved) |
| Queries failing recall threshold (< 0.8) | 0 |

Recall is 1.00 across all 15 queries - every table needed for every query was selected. This is the metric that matters most: a missing table means the LLM cannot generate correct SQL.

Precision is 0.93 - three queries selected extra tables beyond what was strictly needed (GQ-009, GQ-012, GQ-015). This is the right trade-off. Including an extra table costs a few tokens; missing a needed table breaks the query entirely.
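Both metrics reduce to set arithmetic over selected versus expected tables. A sketch, using GQ-012's tables from the benchmark table above:

```python
def precision_recall(selected: set, expected: set) -> tuple:
    """Precision: fraction of selected tables that were needed.
    Recall: fraction of needed tables that were selected."""
    true_positives = len(selected & expected)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# GQ-012: the pruner selected four tables, but only two were needed.
selected = {"cost_allocations", "departments", "production_lines", "warehouses"}
expected = {"cost_allocations", "departments"}
p, r = precision_recall(selected, expected)  # (0.5, 1.0)
```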

Ablation: What Each Layer Contributes

How much does each resolver layer contribute? To answer this, I ran all 15 queries through three configurations - Layer 1 only, Layers 1+2, and all three layers:

| Config | Mean precision | Mean recall | Queries with R < 1.0 |
|---|---|---|---|
| L1 (table names only) | 0.93 | 0.57 | 12/15 |
| L1+L2 (+ entity map) | 0.99 | 1.00 | 0/15 |
| L1+L2+L3 (+ column matching) | 0.93 | 1.00 | 0/15 |

Layer 1 alone fails 12 of 15 queries - direct table name matching cannot resolve business terms like “revenue,” “quality,” or “profit.” Layer 2 (the ENTITY_MAP) is the critical component: it lifts recall from 0.57 to 1.00 and eliminates all failures. Layer 3 (column name matching) adds no recall on these 15 queries but costs 6 points of precision from extra table matches. Layer 3 remains valuable as insurance for queries referencing column names not covered by the entity map, but the ablation confirms that Layer 2 carries the resolver.

This is reproducible: python demos/05_schema_pruning_ablation_study.py (or --verbose for per-query detail).

A closer look at GQ-012. The query “What is the total cost allocation per department?” needs cost_allocations and departments. The pruner selected those two plus production_lines and warehouses (precision: 0.50). The cause: Layer 3’s substring matching found “location” inside “allocation”, pulling in every table with a location column. The cost is 453 tokens instead of ~280 - still a 91% reduction, but it illustrates the precision ceiling of substring matching. GQ-009 has a similar pattern: “segment” (a column in customer_lifetime_value) matched via Layer 3, adding one extra table. Both cases confirm the design choice: substring matching catches real references that exact-match would miss, at the cost of occasional false positives.

The hardest query is GQ-013: “Show revenue by product for products manufactured in Singapore”. It spans 5 tables across 3 domains (E-commerce, Manufacturing, Analytics). The pruner resolves all 5 correctly - orders, order_items, products, production_runs, production_lines - still achieving an 85.5% reduction.

End-to-End Validation

The benchmarks above validate the pruner in isolation — does it select the right tables? A separate validation script tests the full pipeline:

  • For each golden query, generate SQL with the full schema and again with the pruned schema
  • Execute both against the database and compare outcomes
  • If pruning breaks SQL generation, this is where it shows up
  • Call order is randomised per query to avoid first-call latency bias (TCP/TLS handshake penalty)

The 526:1 Problem, Solved

Recall the motivating example from Part 1: “Show all active products in the Electronics category” needed 16 tokens of SQL but received 8,414 tokens of schema context.

With pruning, that same query receives 100 tokens of context (just the products table’s CREATE TABLE block). The ratio drops from 526:1 to 6:1.

Sending all 35 tables when the LLM needs one is context pollution - irrelevant schema diluting the signal for correct SQL generation. The pruner implements context engineering at the schema layer: finding the minimal effective context for each query, rather than sending everything on every call. The 526:1-to-6:1 improvement is a direct measure of its effect.

The token reduction maps directly to cost. Schema context alone, at 10,000 queries per day:

| Model | Price (per 1M input) | Before pruning | After pruning | Annual saving |
|---|---|---|---|---|
| gpt-4o-mini | $0.15 | $7.75/day | $0.54/day | $2,632 |
| gpt-4o | $2.50 | $129/day | $9.00/day | $43,800 |

These figures represent savings on schema context alone; total per-query cost includes system prompt and response tokens, which are unaffected by pruning. The savings scale linearly with query volume. At 100K queries per day on gpt-4o, pruning saves over $430K per year - on a 35-table schema.
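The table's figures follow from straightforward arithmetic. A check, using the mean pruned context of 5,394 / 15 ≈ 360 tokens from the summary above (rounding the daily figures as the table does yields $2,632; unrounded, the saving is ≈ $2,630):

```python
def daily_cost(tokens_per_query: float, queries_per_day: int, price_per_1m: float) -> float:
    """Daily spend on schema-context input tokens alone."""
    return tokens_per_query * queries_per_day * price_per_1m / 1_000_000

QUERIES_PER_DAY = 10_000
BASELINE_TOKENS = 5_164      # full CREATE TABLE context
PRUNED_TOKENS = 5_394 / 15   # mean pruned context (~360 tokens)

before = daily_cost(BASELINE_TOKENS, QUERIES_PER_DAY, 0.15)  # gpt-4o-mini: ~$7.75/day
after = daily_cost(PRUNED_TOKENS, QUERIES_PER_DAY, 0.15)     # ~$0.54/day
annual_saving = (before - after) * 365
```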

Related Work

The task this pruner performs - identifying which tables a natural language query needs - has a name in the research literature: schema linking. It is a well-studied subproblem of text-to-SQL. Several recent systems tackle it, and the approaches fall into three categories: prompt an LLM, train a dedicated model, or match deterministically.

[Figure] Schema Linking for Text-to-SQL: How Approaches Compare. The pruner occupies the bottom-left corner: zero-latency, single-schema production deployment. Published methods cluster in the upper region: cross-database generalisation with LLM or model inference cost.

Most published approaches use an LLM for schema linking. DIN-SQL prompts GPT-4 with few-shot examples to classify which tables and columns are relevant, then chains three more LLM calls for query classification, SQL generation, and self-correction. C3SQL goes further: it prompts GPT-3.5 ten times for table recall, takes a majority vote, then generates 20 candidate SQL queries and picks the most common result. Both achieve strong accuracy on Spider (DIN-SQL: 85.3%, C3SQL: 82.3%), but both pay for it in latency and API cost. DIN-SQL needs four sequential LLM calls per query. C3SQL needs roughly forty.

RESDSQL takes a different route: train a dedicated RoBERTa-Large cross-encoder to score table relevance, then feed only the top-ranked tables to a T5-3B model for SQL generation. The decoupling principle, letting a specialised component handle schema selection rather than burdening the SQL generator with the full schema, is the same principle behind this pruner. The difference is in the mechanism. RESDSQL trains a neural classifier on 7,000 labelled examples and needs a GPU at inference time. This pruner uses deterministic rules with a hand-curated entity map and runs on anything.

The most interesting comparison is DAIL-SQL, because its schema linking is also deterministic. It uses n-gram matching adapted from RAT-SQL, checking whether question words overlap with table and column names. This is conceptually close to Layer 1 of this pruner. DAIL-SQL pairs this deterministic linker with carefully selected few-shot examples and GPT-4, achieving 86.6% on Spider, the highest of the four. That result matters: the top-performing system on Spider uses deterministic schema linking. Pattern matching works for table selection, even at the top of the leaderboard.

Where this pruner goes further than DAIL-SQL’s n-gram matcher is Layer 2. The entity map resolves business terminology that no surface-level matcher can catch: “revenue” to orders and order_items, “quality” to quality_inspections, “profit” to profitability_analysis. The ablation confirms this is critical. Layer 1 alone (the equivalent of n-gram matching) achieves 0.57 recall on the benchmark queries. Layer 2 lifts it to 1.00. The cost is a hand-curated mapping, roughly 30 entries for 35 tables, that must be maintained per schema.

This pruner has not been evaluated on Spider. Spider measures cross-database generalisation: the ability to handle schemas never seen during development. This pruner is designed for single-schema production deployment, where the entity map is curated once for a known business domain and amortised across all queries. Tables may be added over time, but the domain is stable. For this application, cross-database generalisation is not a requirement. The schema is known, the entity map is maintainable, and the trade-off is zero-latency schema linking with no model dependency.

DIN-SQL: Pourreza & Rafiei, NeurIPS 2023. RESDSQL: Li et al., AAAI 2023. C3SQL: Dong et al., 2023. DAIL-SQL: Gao et al., VLDB 2024.

Limitations

This is a deterministic system. It has predictable failure modes:

Novel terminology. If a user asks about “throughput” meaning production output, the entity map won’t match it. The entity map has ~30 entries covering common business terms, but it cannot anticipate every synonym.

Implicit table references. A query like “Which managers approved the most expenses?” requires understanding that “managers” are employees with a certain role, and “expenses” map to cost allocations. Neither word appears in the entity map or column names.

Schema evolution. When new tables are added or renamed, the entity map needs manual updates. The column index rebuilds automatically from DDL, but business term mappings do not. At enterprise scale, the entity map grows sublinearly: many tables share business terminology (e.g., “revenue” maps to the same tables regardless of how many other tables exist), so a 500-table schema does not require 500 mappings - but the maintenance burden is real.

These limitations are exactly why Part 3 introduces LLM-powered entity extraction on top of deterministic pruning. The deterministic layer handles the 80-90% of queries where business terms map predictably to tables. The LLM layer handles the rest - synonyms, implicit references, and novel terminology - while the deterministic layer provides a fast, reproducible baseline that the LLM can fall back to if extraction fails.

Try It

The full implementation, tests, and benchmark script are in the companion repository.

git clone https://github.com/nirmalyaghosh/text-to-sql.git
cd text-to-sql
uv pip install -e .

# Run the benchmark
python demos/05_schema_pruning_benchmark.py

# Verbose mode (shows per-query table details)
python demos/05_schema_pruning_benchmark.py --verbose

# Single query
python demos/05_schema_pruning_benchmark.py --query GQ-013

# Run the tests
pytest tests/test_schema_pruner.py -v

No database connection needed. No API keys. No LLM calls. The pruner works entirely from DDL text.

What’s Next

The pruner demonstrated here is a standalone module - no agent framework, no LLM dependency, no infrastructure requirements. It is the foundation that the agentic system builds on.

In Part 3, I introduce a 5-agent architecture where the Schema Intelligence Agent uses this pruner as its first pass, then falls back to LLM-powered entity extraction for queries that deterministic matching cannot resolve.


Part 1: Text-to-SQL the Naïve Way - the failure modes

Part 2: Schema Pruning for Text-to-SQL (this post) - engineering the context window

Part 3: Agentic Text-to-SQL - the multi-agent architecture

Part 4: Red-Teaming the Security Agent (will be published soon) - Crescendo-style attacks

Part 5: MCP Integration (will be published soon) - dynamic schema discovery

Related: Schema Linking for Text-to-SQL (will be published soon) - extended benchmarks on Spider


ℹ️ Version History

  • 26 Feb 2026 : Post published
  • 28 Feb 2026 : Added Related Work section
  • 02 Mar 2026 : Added End-to-End Validation subsection
  • 08 Mar 2026 : Added link to Part 3