How to reduce TTFT in production: practical patterns, implementation strategies, and edge cases to watch for.
TL;DR: Prefix caching can deliver 70-90% TTFT reduction for cached requests with stable prompts (the first request still incurs the full prefill cost). Tiered context loading reduces token count by 30-70% depending on query distribution, translating to proportional TTFT improvements. Batch processing requires careful monitoring of batch_size × prompt_tokens to avoid out-of-memory (OOM) errors. This post covers practical patterns, provider-specific considerations, and common production edge cases.
Context from Part 1
In Part 1, I covered how prompt size directly impacts TTFT through the prefill stage, where the model generates its KV cache. The relationship is linear: approximately 0.20-0.24ms per token, though this varies by provider and infrastructure.
This post will cover a few practical strategies to reduce these delays.
Mitigation and Optimisation
Several techniques can be used to manage the delays associated with large prompts. These include:
- Prefix Caching
- Parallel Processing (Cake)
- Prompt Engineering
Prefix Caching
Prefix caching allows the system to store and reuse the KV cache for frequently used or repeated prompt prefixes [#2].
This is particularly effective for:
- System instructions that remain constant across conversations
- Shared documents or context that appear in multiple requests
- Few-shot examples that are reused
How it works:
- The first request processes the full prompt and generates the KV cache
- The system saves the KV cache for the prefix portion
- Subsequent requests with the same prefix skip recomputation
- Only the new tokens (user query) need to be processed
Impact:
First request (3,000 token prompt):
Prefill: 600ms
Subsequent requests (2,500 token cached prefix + 500 new tokens):
Prefill: 100ms (83% reduction)
👉 Takeaway: If the system instructions or context are stable, prefix caching can significantly reduce TTFT for subsequent requests. Provider implementations vary widely (see Provider-Specific Variations below).
Parallel Processing (Cake)
New systems like “Cake” attempt to reduce TTFT by parallelizing KV cache generation [#3].
How it works:
- Simultaneously compute the cache on the GPU from the start of the prompt
- Load a saved cache from disk in reverse order
- Merge the two caches as they meet in the middle
This reduces TTFT by overlapping computation and I/O, particularly effective when:
- The prefix is cached on disk but needs to be loaded
- The suffix is new and needs to be computed
- Disk I/O and GPU computation can run in parallel
Prompt Engineering
Splitting complex prompts into parallel smaller prompts reduces perceived delay by pre-processing context before user queries arrive [#1].
This approach is particularly effective for:
- Document-heavy applications where users upload files before querying
- Long-running conversations where history can be pre-summarized
- Multi-tenant systems where context is shared across users
For implementation details, see Pattern 4: Background Context Pre-processing.
The Hidden Costs
Often overlooked considerations:
- Predictive loading logic complexity:
- Which documents will users query? Wrong predictions waste compute.
- When to invalidate pre-processed summaries?
- Too aggressive → cache misses
- Too conservative → stale data
- How to prioritize pre-processing under resource constraints?
- Background compute economics:
- Background processing increases costs in low-utilization scenarios
- 🚨️ Break-even occurs when queries significantly outnumber uploads (say, 3:1 or higher)
- Below that threshold, pre-processing costs may exceed the latency benefits users experience
- Cache invalidation complexity:
- Document updates require re-processing all dependent summaries
- In collaborative environments, updates may occur faster than summary generation
Trade-off Decision Framework
This pattern delivers maximum ROI when:
- Context change frequency is LOW relative to query frequency (documents remain stable for hours/days, not minutes)
- Query patterns are predictable (e.g., FAQ scenarios)
Skip this pattern when:
- Documents change frequently (collaborative editing, real-time feeds)
- Query patterns are unpredictable (exploratory analysis)
- Team lacks infrastructure to monitor cache hit rates
Practical Implementation Patterns
Here are a few effective patterns that can be used in production systems:
Pattern 1: Tiered Context Loading
The context is loaded progressively based on query complexity.
def get_context(user_query, complexity_threshold=0.7):
    # Always load: lightweight system prompt (~200 tokens)
    context = get_system_prompt()

    # Conditional: add history only if needed
    if needs_history(user_query):
        context += get_recent_history(max_tokens=500)

    # Conditional: add documents only for complex queries
    if query_complexity(user_query) > complexity_threshold:
        context += get_relevant_docs(max_tokens=1000)

    return context

# NOTE: The pseudo-code is only for illustration
How token reduction translates to TTFT improvement:
Query complexity classification determines context size:
- Simple queries (e.g., “What is X?”) → 200 tokens (system prompt only)
- Medium queries (e.g., “How does X work?”) → 700 tokens (+ recent history)
- Complex queries (e.g., “Compare X and Y in context Z”) → 1,700 tokens (+ documents)
Token reduction analysis:
Assuming typical query distribution: 40% simple / 30% medium / 30% complex
Average tokens with tiered loading:
(0.4 × 200) + (0.3 × 700) + (0.3 × 1,700) = 800 tokens
Without tiering (always full context): 1,700 tokens
👉 Reduction: (1,700 - 800) / 1,700 = 53%
This translates to proportional TTFT improvement:
- Original TTFT: 1,700 × 0.24ms = 408ms
- Optimised TTFT: 800 × 0.24ms = 192ms
- Improvement: 216ms (53% reduction)
The 30-70% range depends on query distribution:
- FAQ-heavy applications (60% simple): ~65% reduction
- Analysis-heavy applications (60% complex): ~35% reduction
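To sanity-check these figures against your own traffic, the expected savings can be computed directly from the query mix. A minimal sketch; the tier sizes and the 0.24ms/token rate below are the illustrative values used in this section, not measured constants:

# Estimate average prompt size and TTFT savings from tiered context loading.
# Substitute your own tier sizes, query mix, and per-token prefill rate.
TIER_TOKENS = {"simple": 200, "medium": 700, "complex": 1_700}
MS_PER_TOKEN = 0.24  # upper-bound prefill rate from Part 1

def estimate_savings(distribution: dict[str, float]) -> dict[str, float]:
    """distribution maps tier name -> fraction of queries (should sum to ~1)."""
    avg_tokens = sum(TIER_TOKENS[tier] * share for tier, share in distribution.items())
    full_tokens = max(TIER_TOKENS.values())  # always-full-context baseline
    return {
        "avg_tokens": avg_tokens,
        "token_reduction_pct": 100 * (full_tokens - avg_tokens) / full_tokens,
        "avg_ttft_ms": avg_tokens * MS_PER_TOKEN,
        "baseline_ttft_ms": full_tokens * MS_PER_TOKEN,
    }

# 40% simple / 30% medium / 30% complex -> ~800 tokens, ~53% reduction
print(estimate_savings({"simple": 0.4, "medium": 0.3, "complex": 0.3}))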
Implementation note: Profile your query distribution for 1-2 weeks to understand potential impact before starting implementation.
💼 Recent experience: Across recent projects I have worked on, query distribution varies significantly:
- Knowledge base systems: 70% simple queries
- Document analysis tools: 60% complex queries
References:
- OpenViking implements a similar 3-tier approach (L0/L1/L2) [#7]
- Context management tools commonly trigger tiering at >70% context utilization [#8]
- Architecture patterns for tiered context loading [#9]
Pattern 2: Aggressive Summarisation
Aggressively summarise conversation history.
def prepare_conversation_context(messages, max_tokens=1000):
    recent_messages = messages[-5:]  # Keep last 5 verbatim
    older_messages = messages[:-5]

    if len(older_messages) > 0:
        # Summarise older context into a single message
        summary = summarise(older_messages, target_tokens=200)
        return [summary] + recent_messages

    return recent_messages

# NOTE: The pseudo-code is only for illustration
Trade-off:
- Token reduction: 2,000 → 1,000 (50%)
- TTFT improvement calculation:
- Original: 2,000 × 0.20ms = 400ms (lower) to 2,000 × 0.24ms = 480ms (upper)
- Optimised: 1,000 × 0.20ms = 200ms (lower) to 1,000 × 0.24ms = 240ms (upper)
- Improvement: 200-240ms reduction
- Context loss: Minimal (recent messages preserved)
🚨️ Monitor summary quality. Aggressive summarization can lose:
- Nuanced context
- User preferences mentioned earlier
- Important constraints from prior messages
Pattern 3: Cached System Instructions
Use prefix caching for stable system prompts.
# Mark prefix for caching (implementation varies by provider)
system_prompt = """
You are a helpful assistant...
[1,000 tokens of instructions]
"""

# First call: full prefill
response = llm.complete(
    prompt=system_prompt + user_query,
    cache_prefix=True  # Provider-specific
)

# Subsequent calls: cached prefix
# Only user_query tokens are processed
response = llm.complete(
    prompt=system_prompt + user_query,
    cache_prefix=True
)

# NOTE: The pseudo-code is only for illustration
Impact:
- First call: 1,100 tokens → TTFT calculation:
- Lower bound: 1,100 × 0.20ms = 220ms
- Upper bound: 1,100 × 0.24ms = 264ms
- TTFT: 220-264ms
- Cached calls: 100 tokens → TTFT calculation:
- Lower bound: 100 × 0.20ms = 20ms
- Upper bound: 100 × 0.24ms = 24ms
- TTFT: 20-24ms
- Reduction: ~90% (from 220-264ms to 20-24ms)
Pattern 4: Background Context Pre-processing
❌ Synchronous (naive):
User query triggers → Process 3,600 tokens → TTFT: 720-864ms
✅ Asynchronous (optimised):
Background: Pre-process 2,500 tokens (docs + history)
User query triggers → Process 1,000 tokens → TTFT: 200-240ms
Improvement: ~72% TTFT reduction
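A minimal sketch of the asynchronous flow, using an in-memory cache and a placeholder summariser; all helper names here are illustrative, not a specific framework's API:

import asyncio

# In-memory cache of pre-processed context, keyed by session ID.
# summarise() is a stand-in for a real (cheap) summarisation model call.
_context_cache: dict[str, str] = {}

async def summarise(text: str, target_tokens: int) -> str:
    await asyncio.sleep(0)            # placeholder for an async model call
    return text[: target_tokens * 4]  # rough chars-per-token heuristic

async def on_document_upload(session_id: str, document_text: str) -> None:
    # Background path: runs when documents arrive, before any query.
    # The expensive ~2,500-token condensation happens here, off the query path.
    _context_cache[session_id] = await summarise(document_text, target_tokens=800)

async def build_query_prompt(session_id: str, query: str) -> str:
    # Query path: only the pre-processed summary plus the query hit prefill (~1,000 tokens).
    context = _context_cache.get(session_id, "")
    return f"{context}\n\nUser: {query}"  # hand this to your LLM client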
When this pattern matters:
- Document-heavy applications where users upload files before querying
- Long-running conversations where history can be pre-summarized
- Multi-tenant systems where context is shared across users
For implementation considerations, see The Hidden Costs and Trade-off Decision Framework.
Things To Take Note Of
1. TTFT ≠ Generation Speed
TTFT measures how long until the first token appears (the prefill phase). Generation speed measures how many tokens per second are produced once generation starts (the decode phase). The two are not the same metric.
Prefill Phase (TTFT):
├─ Process all input tokens in parallel
├─ Generate KV cache for entire input
└─ Computational cost: Linear with input tokens
Decode Phase (Generation Speed):
├─ Process one token at a time
├─ Use cached KV values from prefill
└─ Computational cost: Linear with output tokens
Two common scenarios:
- Fast TTFT + slow generation (small prompt, slow model)
- Slow TTFT + fast generation (large prompt, fast model)
Optimise both independently.
| Metric | Optimisation Target | Techniques |
|---|---|---|
| TTFT | Reduce input tokens | Prefix caching, prompt compression, tiered context loading |
| Generation speed | Increase throughput | Quantization, speculative decoding, batch processing |
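Because the two phases are optimised independently, measure them independently. A minimal sketch using the OpenAI Python SDK's streaming interface (the model name is illustrative; any streaming client exposes the same timing points):

import time
from openai import OpenAI

client = OpenAI()

def measure_latency(prompt: str, model: str = "gpt-4o-mini") -> dict:
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # end of prefill
            n_tokens += 1  # content chunks roughly approximate tokens

    end = time.perf_counter()
    if first_token_at is None:  # no content received
        first_token_at = end
    decode_time = end - first_token_at
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "tokens_per_s": n_tokens / decode_time if decode_time > 0 else float("inf"),
    }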
2. Batch Size Amplifies KV Cache Costs
The KV cache scales linearly with both the number of tokens and the batch size. This multiplication can quickly overwhelm GPU memory.
Memory calculation:
From NVIDIA’s formula:
Total KV cache size (bytes) = batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(FP16)
For a 7B parameter model like Llama 2 (model card):
- Layers: 32
- Hidden size: 4,096
- Precision: FP16 (2 bytes per value)
Example 1: Single request:
Prompt: 2,000 tokens
Batch size: 1
Using NVIDIA's formula:
KV cache = batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(FP16)
= 1 × 2,000 × 2 × 32 × 4,096 × 2 bytes
= 1,048,576,000 bytes
≈ 1 GB
Example 2: Batch of 8 requests:
Prompt: 2,000 tokens each
Batch size: 8
Using NVIDIA's formula:
KV cache = batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(FP16)
= 8 × 2,000 × 2 × 32 × 4,096 × 2 bytes
= 8,388,608,000 bytes
≈ 8 GB
Why this matters:
A GPU with 24GB VRAM might handle:
- 10-12 concurrent requests with 2,000-token prompts
- Only 4-5 concurrent requests with 5,000-token prompts
💡 Takeaway: In production batch inference, monitor batch_size × prompt_tokens, not just batch size alone. Naive batch processing will hit OOM sooner or later; consider adaptive batch sizing, as sketched below.
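A minimal sketch of adaptive batch sizing driven by the KV-cache formula above; the model constants match the Llama 2 7B example, while the memory budget and generation headroom are assumptions to tune for your hardware:

# Cap batch size by estimated KV-cache memory rather than by request count.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
BYTES_PER_VALUE = 2            # FP16
KV_BUDGET_BYTES = 8 * 1024**3  # e.g., what is left for KV cache on a 24 GB card

def kv_cache_bytes(batch_size: int, sequence_length: int) -> int:
    # NVIDIA formula: batch × seq_len × 2 (K and V) × layers × hidden × dtype size
    return batch_size * sequence_length * 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE

def max_batch_size(prompt_tokens: int, max_new_tokens: int = 512) -> int:
    # Reserve room for generated tokens too, since the cache grows during decode
    seq_len = prompt_tokens + max_new_tokens
    per_request = kv_cache_bytes(1, seq_len)
    return max(1, KV_BUDGET_BYTES // per_request)

print(max_batch_size(2_000))  # larger prompts -> fewer concurrent slots
print(max_batch_size(5_000))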
3. Provider-Specific Variations
Different LLM providers implement prefill and caching differently, leading to significant TTFT variations for the same prompt.
- Anthropic Claude: Uses explicit prompt caching that requires developers to mark content with cache_control parameters [#4]. Enables latency reductions of up to 85% for long prompts when caching is enabled [#5]. Cache entries have a 5-minute TTL (default) or a 1-hour TTL (optional, at higher cost). A minimal example follows the OpenAI snippet below.
- OpenAI: Implements automatic prompt caching, enabled by default for prompts exceeding 1,024 tokens, with a 50% cost reduction for cached tokens [#6].
  - Caching occurs transparently, without code changes.
  - Setting the prompt_cache_key parameter helps improve cache hit rates through cache routing (i.e. requests are routed to a machine based on a hash of the initial prefix of the prompt, so there is a higher chance of a cache hit).
  - Cache lifetime can be extended using the prompt_cache_retention parameter, as shown below.
{
"model": "gpt-5.1",
"input": "Your prompt goes here...",
"prompt_cache_retention": "24h"
}
# From https://platform.openai.com/docs/guides/prompt-caching
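For comparison, here is a minimal sketch of Anthropic's explicit opt-in using the Python SDK; the model name is illustrative, and parameter details should be checked against the current prompt-caching documentation:

import anthropic

client = anthropic.Anthropic()

# In practice the cached prefix must exceed the provider's minimum length
# (around 1,024 tokens for most Claude models); this placeholder is short.
LONG_INSTRUCTIONS = "You are a helpful assistant..."

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_INSTRUCTIONS,
            # Everything up to and including this block becomes the cached prefix
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What is prefix caching?"}],
)
print(response.content[0].text)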
👉 Always benchmark TTFT with your specific provider and workload patterns rather than relying on general estimates.
4. Edge Cases
Common edge cases likely to occur in multi-user production environments.
- Power users & long conversation history:
  - Regular user: 5 messages, 200 tokens
  - Power user: 50 messages, 5,000 tokens
  - Outcome: the power user waits ~25× longer (with no optimisation in place)
- Large document uploads:
  - Typical query: 300 tokens
  - With a 100-page PDF: 50,000+ tokens
  - Outcome: TTFT increases from ~60ms to ~10,000ms (167× slower)
- Multi-turn debugging sessions:
  - Turn 1: 500 tokens
  - Turn 5: 2,500 tokens (accumulated context)
  - Turn 10: 5,000 tokens
  - Turn 20: 10,000 tokens
  - Each turn gets progressively slower without context management
How To Mitigate Edge Cases?
Mitigation Strategy 1: Progressive Degradation
def prepare_context(user_query, conversation_history):
    token_budget = 3000  # Target token limit

    # Essential: Always include
    context = get_system_prompt()  # 500 tokens
    context += user_query          # ~100 tokens
    remaining_budget = token_budget - count_tokens(context)

    # Optional: Add based on remaining budget
    if remaining_budget > 1000:
        # Include recent history
        recent = get_recent_messages(
            conversation_history,
            max_tokens=min(remaining_budget, 2000)
        )
        context += recent
        remaining_budget -= count_tokens(recent)

    if remaining_budget > 500:
        # Include relevant documents
        docs = get_relevant_docs(
            user_query,
            max_tokens=remaining_budget
        )
        context += docs

    return context

# NOTE: The pseudo-code is only for illustration
Mitigation Strategy 2: Aggressive Summarisation At Thresholds
def manage_conversation_context(messages):
    total_tokens = sum(count_tokens(m) for m in messages)

    # Threshold-based compression
    if total_tokens < 2000:
        return messages  # No compression needed
    elif total_tokens < 5000:
        # Summarise older messages, keep recent verbatim
        old_messages = messages[:-10]
        recent_messages = messages[-10:]
        summary = summarise(old_messages, target_tokens=500)
        return [summary] + recent_messages
    else:  # > 5000 tokens - aggressive compression
        # Keep only last 5 messages + summary of all prior
        old_messages = messages[:-5]
        recent_messages = messages[-5:]
        summary = summarise(old_messages, target_tokens=300)
        return [summary] + recent_messages

# NOTE: The pseudo-code is only for illustration
Implementation Strategy: A Phased Approach
Prioritise the optimisations based on ROI and implementation effort:
Phase 1: Foundation
1. TTFT Monitoring
Why first? One cannot optimise what one does not measure. 🚨️ Teams that skip this step risk wasting cycles on premature optimisation.
- Percentile tracking (P50, P95, P99), per endpoint
- User segment analysis (power users vs typical)
- Effort required: <1 week for an engineer with 2-3 years of experience
- Additional cost: minimal (only extra logging is involved)
- Impact: indirect on its own, but it enables the significant gains in later phases
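As a starting point, percentile tracking can be as simple as the sketch below, assuming one TTFT sample is logged per request (the per-endpoint log and example values are illustrative; in production you would emit these to your metrics system):

import statistics
from collections import defaultdict

# ttft_log[endpoint] is a list of TTFT samples in milliseconds,
# appended by your request handler (collection not shown here).
ttft_log: dict[str, list[float]] = defaultdict(list)

def ttft_percentiles(endpoint: str) -> dict[str, float]:
    samples = sorted(ttft_log[endpoint])
    if len(samples) < 2:
        return {}
    # statistics.quantiles with n=100 yields the 1st..99th percentile cut points
    q = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Example: report per-endpoint percentiles for dashboards/alerts
ttft_log["/chat"] += [180, 210, 240, 950, 2400, 205, 220, 1900, 230, 260]
print(ttft_percentiles("/chat"))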
2. Prefix Caching for System Prompts
Why second? Highest ROI with lowest implementation risk
- Implement provider-native caching
- Monitor cache hit rates
- Target stable prompts with a high cache hit rate (>70%)
- Effort required: minimal, since only small code changes are needed
- Impact: significant
Phase 2: Conditional Optimisations
3. Tiered Context Loading
Why third? Adds complexity (query classification), so only implement it if Phase 1 monitoring shows high variance in prompt sizes.
- Build a query classifier (determines if query is simple/medium/complex)†
- Implement conditional context loading
- Monitor metrics:
- Classification accuracy (target: >85%)
- TTFT improvement per tier
- Response quality vs full context baseline
- Effort required: 1-2 weeks for 2 engineers
- Impact: 30-50% additional TTFT reduction
† Query Classifier Implementation: Options include fine-tuned BERT (best accuracy), GPT-4o-mini (fastest deployment), or rule-based heuristics (simplest). Input: raw query text. Output: simple/medium/complex + confidence. Example: “What is X?” → simple (system prompt only), “Compare X and Y considering Z” → complex (full context).
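As a concrete starting point for the rule-based option, a minimal heuristic classifier might look like this; the keywords and thresholds are assumptions to tune against labelled queries from your Phase 1 logs:

import re

# Small heuristic classifier: good enough to bootstrap tiered loading
# before investing in a fine-tuned model.
COMPLEX_HINTS = re.compile(r"\b(compare|versus|vs\.?|trade-?offs?|analy[sz]e|why)\b", re.I)
MEDIUM_HINTS = re.compile(r"\b(how|explain|walk me through|steps?)\b", re.I)

def classify_query(query: str) -> tuple[str, float]:
    """Returns (tier, confidence). Tiers: simple / medium / complex."""
    words = len(query.split())
    if COMPLEX_HINTS.search(query) or words > 40:
        return "complex", 0.7
    if MEDIUM_HINTS.search(query) or words > 15:
        return "medium", 0.6
    return "simple", 0.8

print(classify_query("What is X?"))                     # ('simple', 0.8)
print(classify_query("Compare X and Y considering Z"))  # ('complex', 0.7)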
4. Aggressive Summarisation
- Trigger implementation only if conversation-history token counts consistently exceed the configured threshold at high percentiles (e.g., P95)
- Prerequisites:
- Requires a summary quality evaluation pipeline to be in place prior to development
- Human-in-the-loop validation
- Rollback mechanisms
Why last? Introduces non-determinism and requires quality monitoring infrastructure
Advanced & Emerging Techniques
Automatic Prefix Caching in vLLM (Self-Hosted)
For self-hosted deployments, vLLM’s Automatic Prefix Caching detects and reuses shared prefixes across requests without manual intervention. It works transparently with paged attention, making it ideal for multi-user or multi-turn workloads.
Key benefits:
- No code changes needed beyond standard vLLM server launch
- Excellent cache hit rates when many requests share system prompts or RAG context
- Combines well with paged attention for memory efficiency under variable batch sizes
Usage example (OpenAI-compatible server):
# Automatic prefix caching is enabled by default in recent vLLM versions
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--enable-prefix-caching # Optional flag (often default-on)
Subsequent requests with identical prefixes automatically reuse the KV cache. Monitor cache hits via vLLM metrics (e.g., vllm:prefix_cache_hit_rate).
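On the client side, no special handling is needed; reusing an identical system prompt across requests is what produces cache hits. A minimal sketch against the OpenAI-compatible endpoint started above (the localhost URL, API key, and prompts are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = "You are a helpful assistant..."  # stable prefix shared across requests

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# The second call reuses the KV cache built for SYSTEM_PROMPT on the first call.
print(ask("What is prefix caching?"))
print(ask("How does it reduce TTFT?"))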
Prefill-Decode Disaggregation
Emerging in advanced serving systems (e.g., DistServe, Moreh variants), this runs prefill on separate high-memory nodes while decode uses optimised throughput nodes, reducing TTFT bottlenecks in mixed workloads.
vLLM supports an experimental disaggregated prefill mode via separate prefill/decode instances and KV transfer connectors.
Example configuration (simplified NixlConnector):
# Prefill instance
python -m vllm.entrypoints.openai.api_server \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode instance
python -m vllm.entrypoints.openai.api_server \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
NOTE:
- This is only available in self-hosted/open-source engines like vLLM.
- Major hosted providers (such as Anthropic, OpenAI, Azure OpenAI, etc.) do not expose disaggregation controls via their APIs - prefill and decode are handled internally on shared infrastructure.
Chunked or Streaming Prefill
Some engines support generating early tokens before full prefill completes (speculative/partial/chunked prefill), overlapping phases for lower perceived TTFT. TensorRT-LLM and certain research systems implement this effectively.
However, this remains experimental and model-dependent. Major hosted providers (such as Anthropic, OpenAI, Azure OpenAI, etc.) do not support true chunked/streaming prefill via their APIs. Streaming only begins after complete prefill. Early token generation would require custom self-hosted setups.
ROI Calculation
Daily savings = V requests/day × (X-Y) ms improvement × $C per ms
Break-even days = Engineering cost / Daily savings
Where:
- V = Daily request volume
- X = Current P95 TTFT (ms)
- Y = Target P95 TTFT (ms)
- $C = Business cost per millisecond of latency
- Engineering cost = Z days × engineer daily rate
An example:
Given:
- Current P95 TTFT: 2,400ms
- Target P95 TTFT: 1,200ms (improvement: 1,200ms)
- Daily request volume: 200 requests/day
- Business cost: $6,000/hour = $1.67/second = $0.00167/ms
- Engineering effort: 5 days @ $1,000/day = $5,000
Calculate:
- Daily savings = 200 × 1,200ms × $0.00167/ms = $400.80/day
- Break-even = $5,000 / $400.80 = 12.5 days
Conclusion:
- The optimisation pays for itself in 12.5 working days.
- After that, it delivers $400.80/day in ongoing savings (~$100K annually).
💡 Key insight: At 200 requests/day with 1,200ms improvement, even a 5-day engineering effort breaks even in under 3 weeks.
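The same calculation as a small script, plugging in the illustrative figures above:

def roi(requests_per_day: int, current_p95_ms: float, target_p95_ms: float,
        cost_per_ms: float, engineering_cost: float) -> dict[str, float]:
    # Daily savings = volume × latency improvement × business cost per ms
    improvement_ms = current_p95_ms - target_p95_ms
    daily_savings = requests_per_day * improvement_ms * cost_per_ms
    return {
        "daily_savings": round(daily_savings, 2),
        "break_even_days": round(engineering_cost / daily_savings, 1),
    }

# Example from above: 200 req/day, 2,400 -> 1,200 ms, $6,000/hour ≈ $0.00167/ms,
# 5 engineering days at $1,000/day
print(roi(200, 2_400, 1_200, 0.00167, 5_000))
# -> {'daily_savings': 400.8, 'break_even_days': 12.5}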
References
- [#1] How input token count impacts the latency of AI chat tools (Jan 2026)
- [#2] Prefix caching
- [#3] Compute or Load KV Cache? Why not both? (arXiv, Oct 2024)
- [#4] Prompt caching with Claude
- [#5] Prompt caching announcement
- [#6] OpenAI Prompt Caching Documentation
- [#7] OpenViking Tiered Context Loading
- [#8] MCP Context Loader
- [#9] DeepWiki Context Engineer Architecture