In Part 1, the naïve Text-to-SQL approach sent 8,414 tokens of schema context to generate 16 tokens of SQL - a 526:1 input-to-output ratio. This post engineers the fix: a deterministic schema pruner that selects only the tables relevant to each query, with no LLM dependency.
TL;DR: A deterministic schema pruner, using FK graph traversal and 3-layer entity resolution, reduces context by 93% on average across 15 benchmark queries against a 35-table schema. Mean recall is 1.00 (every needed table is selected); mean precision is 0.93 (minimal extra tables). No LLM calls. Fully reproducible. The implementation is ~500 lines of Python with one external dependency (tiktoken).
The Problem: 526:1
Part 1 established the token waste problem with a concrete measurement. For the simplest possible query - “Show all active products in the Electronics category” - the naïve approach sends the full 35-table DDL as context:
| Context strategy | Input tokens | vs. Answer |
|---|---|---|
| Full schema DDL (naïve approach) | 8,414 | 526x |
| CREATE TABLE blocks only | 5,230 | 327x |
| Only the products table | 166 | 10x |
Note on token counts: Part 1 reported 8,414 / 5,230 tokens measured during drafting. The committed schema produces 8,348 / 5,164 tokens - the difference is due to minor DDL edits (schema naming) between the draft measurement and the final commit. The benchmarks in this post use the reproducible figures from the committed schema: 5,164 tokens for CREATE TABLE blocks, measured with tiktoken.get_encoding("o200k_base").
The query only needs one table. The naïve approach sends all 35. That is not just expensive - at enterprise scale (200-500+ tables), the schema exceeds most models’ context windows entirely and the approach becomes technically impossible.
But the gap between 5,230 tokens (every CREATE TABLE block) and 166 tokens (just the products table) suggests that if we can identify which tables a query needs, we can cut context by 95%+ per query.
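A quick back-of-envelope check on that claim, using the draft figures from the table above:

```python
create_only = 5_230     # all CREATE TABLE blocks (draft measurement)
products_only = 166     # just the products table

reduction = 1 - products_only / create_only
print(f"{reduction:.1%}")  # → 96.8%
```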
The question is: how do you identify the right tables from a natural language query, without using an LLM?
The Idea: FK Graph Pruning
The schema is not a flat list of tables. It is a graph. Foreign key relationships define edges between tables, and those edges encode which tables are structurally related.
Given a query like “Show total revenue by product category”, the approach is:
- Resolve the query to seed tables - tables directly referenced by the natural language
- Traverse the FK graph from those seeds to find structurally connected tables
- Prune the schema to only the selected tables
For the revenue query:
- “revenue” resolves to orders and order_items (seed tables)
- “product” resolves to products (seed table)
- All three are FK-connected (order_items → orders, order_items → products), so no additional traversal needed
- Result: 3 tables instead of 35 - a 91% token reduction
The critical insight is that entity resolution can be deterministic. Business terminology maps predictably to database tables: “revenue” means orders, “shipping” means shipments, “quality” means quality inspections. These mappings don’t change between queries. They can be compiled once and applied without an LLM call.
Other approaches to table selection exist: [1] embedding-based retrieval - encode table descriptions as vectors, retrieve nearest matches (RAG for schema selection), [2] LLM-powered extraction - ask a model to identify relevant tables directly. Both work, but both add latency, cost, and infrastructure dependencies - a vector database or an API call on every query.
Deterministic keyword resolution is faster, fully reproducible, and free. For the majority of queries where business terms map predictably to tables, it is the better trade-off.
Building the Pruner
The implementation has two core components: a 3-layer entity resolver that identifies seed tables from natural language, and a BFS traversal that expands seeds through the FK graph.
3-Layer Entity Resolution
The resolver processes the query through three layers, each catching what the previous one misses:
```python
def resolve_tables(self, query: str) -> Set[str]:
    seeds: Set[str] = set()
    query_lower = query.lower()
    query_words = set(re.findall(r"\w+", query_lower))
    resolved_words: Set[str] = set()

    # Layer 1: Direct table name matching (plural or singular form)
    for table in self._all_tables:
        singular = table[:-1] if table.endswith("s") else table
        if table in query_lower or singular in query_lower:
            seeds.add(table)
            resolved_words.add(table if table in query_lower else singular)

    # Layer 2: Business entity mapping
    for term, tables in ENTITY_MAP.items():
        if term in query_words:
            seeds.update(tables)
            resolved_words.add(term)

    # Layer 3: Column name matching
    for col_name, tables in self._column_index.items():
        if len(col_name) < 6:
            continue  # short names (id, name, ...) match too broadly
        if col_name in resolved_words:
            continue  # already resolved by layers 1-2
        if col_name in query_lower:
            seeds.update(tables)

    # Fallback: no match at all - default to the core commerce tables
    if not seeds:
        seeds = {"orders", "products", "customers"}
    return seeds
```
Layer 1 catches direct references: “Show all orders” → orders. It handles both plural table names and their singular forms (“each product” → products).
Layer 2 maps business terminology to tables. The ENTITY_MAP has ~30 entries:
```python
ENTITY_MAP = {
    "revenue": {"orders", "order_items"},
    "sales": {"orders", "order_items"},
    "shipping": {"shipments"},
    "inventory": {"finished_goods_inventory"},
    "production": {"production_runs"},
    "quality": {"quality_inspections"},
    "profit": {"profitability_analysis"},
    "cost": {"cost_allocations"},
    # ... ~25 more entries
}
```
This is where domain knowledge lives. A marketing analyst asking about “campaign conversion rates” doesn’t know (or care) that the data lives in campaigns and conversion_funnels. The entity map bridges that gap.
Layer 3 catches column-name references that layers 1 and 2 missed. If the query mentions unit_cost, and unit_cost is a column in raw_materials, that table gets added as a seed. A stop-list excludes generic column names (id, status, name, created_at) that would match too broadly.
One subtlety: Layer 3 skips words already resolved by layers 1-2. Without this, “revenue” would match both the entity map (Layer 2 → orders, order_items) and a column name in profitability_analysis (Layer 3), pulling in an unrelated table. The resolved_words set prevents this double-counting.
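To make the layer interaction concrete, here is a self-contained toy version of the resolver. The table set, entity map, and column index below are illustrative stand-ins, not the repository's actual data:

```python
import re

# Toy inputs (illustrative, not the real 35-table schema)
ALL_TABLES = {"orders", "order_items", "products", "profitability_analysis"}
ENTITY_MAP = {"revenue": {"orders", "order_items"}}
COLUMN_INDEX = {"revenue": {"profitability_analysis"}}

def resolve(query: str) -> set:
    seeds, resolved = set(), set()
    q = query.lower()
    words = set(re.findall(r"\w+", q))
    # Layer 1: direct table names (plural or singular form)
    for t in ALL_TABLES:
        singular = t[:-1] if t.endswith("s") else t
        if t in q or singular in q:
            seeds.add(t)
            resolved.add(t if t in q else singular)
    # Layer 2: business entity mapping
    for term, tables in ENTITY_MAP.items():
        if term in words:
            seeds.update(tables)
            resolved.add(term)
    # Layer 3: column names, skipping words layers 1-2 already claimed
    for col, tables in COLUMN_INDEX.items():
        if col in resolved or col not in q:
            continue
        seeds.update(tables)
    return seeds

print(sorted(resolve("Show total revenue by product category")))
# → ['order_items', 'orders', 'products']
# profitability_analysis stays out: "revenue" was claimed by Layer 2
```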
FK Graph Traversal
Once seed tables are identified, BFS expands through the FK graph:
```python
def find_minimal_tables(self, seed_tables: Set[str], max_depth: int = 2) -> Set[str]:
    if not seed_tables:
        return set()

    visited: Set[str] = set()
    queue: List[Tuple[str, int]] = [
        (t, 0) for t in seed_tables if t in self._all_tables
    ]
    while queue:
        table, depth = queue.pop(0)
        if table in visited:
            continue
        visited.add(table)
        if depth < max_depth:
            neighbors = self._fk_graph.get(table, set())
            for neighbor in neighbors:
                if neighbor not in visited:
                    queue.append((neighbor, depth + 1))
    return visited
```
The max_depth parameter controls how many FK hops to traverse. Higher depth finds more structurally related tables but risks pulling in too many. The benchmarks below show that depth=0 (seeds only, no traversal) achieves 100% recall on all test queries - the 3-layer resolver is strong enough that BFS expansion is rarely needed.
This is a useful result: it means the pruner’s accuracy comes from entity resolution quality, not graph traversal depth. For well-formed business queries, the right tables are identified directly from the natural language. BFS is insurance for edge cases where the query doesn’t explicitly mention a needed join table.
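The effect of max_depth can be sketched on a toy FK graph (the edge set below is illustrative; a deque replaces the list-based queue for idiomatic BFS):

```python
from collections import deque

# Toy bidirectional FK adjacency (illustrative edge set)
FK_GRAPH = {
    "order_items": {"orders", "products"},
    "orders": {"order_items", "customers"},
    "products": {"order_items"},
    "customers": {"orders"},
}

def expand(seeds: set, max_depth: int) -> set:
    visited = set()
    queue = deque((t, 0) for t in seeds if t in FK_GRAPH)
    while queue:
        table, depth = queue.popleft()
        if table in visited:
            continue
        visited.add(table)
        if depth < max_depth:
            for neighbor in FK_GRAPH.get(table, set()) - visited:
                queue.append((neighbor, depth + 1))
    return visited

print(sorted(expand({"order_items"}, 0)))  # → ['order_items'] (seeds only)
print(sorted(expand({"order_items"}, 1)))  # → ['order_items', 'orders', 'products']
```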
Parsing the FK Graph
The pruner builds its FK graph by parsing DDL at initialization. It handles both inline FOREIGN KEY constraints and ALTER TABLE ADD FOREIGN KEY statements:
```python
# Parse inline FOREIGN KEY constraints per CREATE TABLE block
fk_in_block = re.compile(
    r"FOREIGN\s+KEY\s*\((\w+)\)\s*REFERENCES\s+(\w+)\s*\((\w+)\)",
    re.IGNORECASE,
)
for table_name, ddl_block in self._table_ddl.items():
    for match in fk_in_block.finditer(ddl_block):
        self._add_fk_edge(table_name, match.group(1), match.group(2), match.group(3))
```
Each FK creates a bidirectional edge. If orders.customer_id references customers.customer_id, both orders → customers and customers → orders are added. This ensures BFS can traverse in either direction.
For the 35-table schema, this produces 59 FK edges - substantial connectivity that BFS can exploit when needed.
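The body of _add_fk_edge is not shown in the excerpt above; a minimal sketch of the bidirectional behavior it needs to implement might look like this:

```python
from collections import defaultdict

fk_graph = defaultdict(set)

def add_fk_edge(from_table: str, from_col: str, to_table: str, to_col: str) -> None:
    # Record both directions so BFS can walk child -> parent and parent -> child
    fk_graph[from_table].add(to_table)
    fk_graph[to_table].add(from_table)

add_fk_edge("orders", "customer_id", "customers", "customer_id")
print(fk_graph["customers"])  # → {'orders'} (the reverse edge exists too)
```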
Benchmarks
I benchmarked the pruner against 15 golden queries spanning simple single-table lookups, multi-table joins, cross-domain analytics, and complex inventory queries. Each query runs through resolve_tables → find_minimal_tables (depth=0) → prune_schema, measuring token reduction, precision, and recall against expected tables. Precision measures whether all selected tables are needed; recall measures whether all needed tables are selected.
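Per query, the two metrics reduce to simple set arithmetic. A sketch (the selected/expected table sets here are illustrative, not the benchmark's actual expected sets):

```python
def precision_recall(selected: set, expected: set) -> tuple:
    hits = len(selected & expected)
    precision = hits / len(selected) if selected else 0.0  # all selected needed?
    recall = hits / len(expected) if expected else 0.0     # all needed selected?
    return precision, recall

# A GQ-012-style case: two extra tables selected beyond the expected pair
p, r = precision_recall(
    selected={"cost_allocations", "departments", "employees", "projects"},
    expected={"cost_allocations", "departments"},
)
print(p, r)  # → 0.5 1.0
```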
| ID | Query | Tokens | Reduction | P | R |
|---|---|---|---|---|---|
| GQ-001 | How many orders were placed last month? | 5,164 → 157 | 97.0% | 1.00 | 1.00 |
| GQ-002 | Show total revenue by product category | 5,164 → 465 | 91.0% | 1.00 | 1.00 |
| GQ-003 | List all products with their current inventory levels | 5,164 → 239 | 95.4% | 1.00 | 1.00 |
| GQ-006 | Show me monthly sales trends for the last 6 months | 5,164 → 365 | 92.9% | 1.00 | 1.00 |
| GQ-007 | Which employees have the highest number of quality inspections? | 5,164 → 218 | 95.8% | 1.00 | 1.00 |
| GQ-008 | Show me the best products | 5,164 → 100 | 98.1% | 1.00 | 1.00 |
| GQ-009 | What is the average order value by customer segment? | 5,164 → 471 | 90.9% | 0.67 | 1.00 |
| GQ-010 | Show production run details for product P001 | 5,164 → 313 | 93.9% | 1.00 | 1.00 |
| GQ-011 | Show shipment status and delivery partner details | 5,164 → 307 | 94.1% | 1.00 | 1.00 |
| GQ-012 | What is the total cost allocation per department? | 5,164 → 453 | 91.2% | 0.50 | 1.00 |
| GQ-013 | Show revenue by product for products manufactured in Singapore | 5,164 → 751 | 85.5% | 1.00 | 1.00 |
| GQ-014 | What is the conversion rate for each marketing campaign? | 5,164 → 319 | 93.8% | 1.00 | 1.00 |
| GQ-015 | Which warehouses have products below their safety stock reorder point? | 5,164 → 658 | 87.3% | 0.80 | 1.00 |
| GQ-016 | Show supplier reliability scores and their raw material lead times | 5,164 → 189 | 96.3% | 1.00 | 1.00 |
| GQ-017 | Show me monthly profit margins by product category | 5,164 → 389 | 92.5% | 1.00 | 1.00 |
Summary
| Metric | Value |
|---|---|
| Mean token reduction | 93.0% |
| Median token reduction | 93.8% |
| Mean precision | 0.93 |
| Mean recall | 1.00 |
| Total tokens across 15 queries | 77,460 → 5,394 (72,066 saved) |
| Queries failing recall threshold (< 0.8) | 0 |
Recall is 1.00 across all 15 queries - every table needed for every query was selected. This is the metric that matters most: a missing table means the LLM cannot generate correct SQL.
Precision is 0.93 - three queries selected extra tables beyond what was strictly needed (GQ-009, GQ-012, GQ-015). This is the right trade-off. Including an extra table costs a few tokens; missing a needed table breaks the query entirely.
The hardest query is GQ-013: “Show revenue by product for products manufactured in Singapore”. It spans 5 tables across 3 domains (E-commerce, Manufacturing, Analytics). The pruner resolves all 5 correctly - orders, order_items, products, production_runs, production_lines - still achieving an 85.5% reduction.
The 526:1 Problem, Solved
Recall the motivating example from Part 1: “Show all active products in the Electronics category” needed 16 tokens of SQL but received 8,414 tokens of schema context.
With pruning, that same query receives 100 tokens of context - just the products table’s CREATE TABLE block, measured against the committed schema. The ratio drops from 526:1 to roughly 6:1.
Sending all 35 tables when the LLM needs one is context pollution - irrelevant schema diluting the signal for correct SQL generation. The pruner implements context engineering at the schema layer: finding the minimal effective context for each query, rather than sending everything on every call. The 526:1-to-6:1 improvement is a direct measure of its effect.
The token reduction maps directly to cost. Schema context alone, at 10,000 queries per day:
| Model | Price (per 1M input) | Before pruning | After pruning | Annual saving |
|---|---|---|---|---|
| gpt-4o-mini | $0.15 | $7.75/day | $0.54/day | $2,632 |
| gpt-4o | $2.50 | $129/day | $9.00/day | $43,800 |
The savings scale linearly with query volume. At 100K queries per day on gpt-4o, pruning saves over $430K per year - on a 35-table schema.
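The per-day figures in the table can be reproduced (to within rounding; the annual numbers were computed from the rounded daily costs) with a few lines of arithmetic:

```python
QUERIES_PER_DAY = 10_000
TOKENS_BEFORE = 5_164       # full CREATE TABLE context per query
TOKENS_AFTER = 5_394 / 15   # mean pruned context across the 15 benchmark queries

def daily_cost(tokens_per_query: float, price_per_million: float) -> float:
    """Daily input-token spend on schema context alone."""
    return QUERIES_PER_DAY * tokens_per_query / 1e6 * price_per_million

for model, price in [("gpt-4o-mini", 0.15), ("gpt-4o", 2.50)]:
    before = daily_cost(TOKENS_BEFORE, price)
    after = daily_cost(TOKENS_AFTER, price)
    print(f"{model}: ${before:.2f}/day -> ${after:.2f}/day, "
          f"~${(before - after) * 365:,.0f}/year saved")
```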
Limitations
This is a deterministic system. It has predictable failure modes:
Novel terminology. If a user asks about “throughput” meaning production output, the entity map won’t match it. The entity map has ~30 entries covering common business terms, but it cannot anticipate every synonym.
Implicit table references. A query like “Which managers approved the most expenses?” requires understanding that “managers” are employees with a certain role, and “expenses” map to cost allocations. Neither word appears in the entity map or column names.
Schema evolution. When new tables are added or renamed, the entity map needs manual updates. The column index rebuilds automatically from DDL, but business term mappings do not.
These limitations are exactly why Part 3 introduces LLM-powered entity extraction on top of deterministic pruning. The deterministic layer handles the 80-90% of queries where business terms map predictably to tables. The LLM layer handles the rest - synonyms, implicit references, and novel terminology - while the deterministic layer provides a fast, reproducible baseline that the LLM can fall back to if extraction fails.
Try It
The full implementation, tests, and benchmark script are in the companion repository.
```shell
git clone https://github.com/nirmalyaghosh/text-to-sql.git
cd text-to-sql
pip install -e .

# Run the benchmark
python demos/05_schema_pruning_benchmark.py

# Verbose mode (shows per-query table details)
python demos/05_schema_pruning_benchmark.py --verbose

# Single query
python demos/05_schema_pruning_benchmark.py --query GQ-013

# Run the tests
pytest tests/test_schema_pruner.py -v
```
No database connection needed. No API keys. No LLM calls. The pruner works entirely from DDL text.
What’s Next
The pruner demonstrated here is a standalone module - no agent framework, no LLM dependency, no infrastructure requirements. It is the foundation that the agentic system builds on.
In Part 3, I introduce a 6-agent architecture where the Schema Intelligence Agent uses this pruner as its first pass, then applies LLM-powered entity extraction for queries that deterministic matching cannot resolve. The result: the speed and reproducibility of deterministic pruning for common queries, with LLM intelligence for the long tail.
Part 1: Text-to-SQL the Naïve Way - the failure modes
Part 2: Schema Pruning for Text-to-SQL (this post) - engineering the context window
Part 3: Agentic Text-to-SQL (will be published soon) - the multi-agent architecture
Part 4: Red-Teaming the Security Agent (will be published soon) - Crescendo-style attacks
Part 5: MCP Integration (will be published soon) - dynamic schema discovery
Related: Schema Linking for Text-to-SQL (will be published soon) - deterministic pruning against DIN-SQL, RESDSQL, and Spider
Version History
- 26 Feb 2026 : Post published