How to reduce TTFT in production: practical patterns, implementation strategies, and edge cases to watch for.
TL;DR: Prefix caching can deliver 70-90% TTFT reduction for cached requests with stable prompts (the first request still incurs the full prefill cost). Tiered context loading reduces token count by 30-70% depending on query distribution, translating to proportional TTFT improvements. Batch processing requires careful monitoring of batch_size × prompt_tokens to avoid out-of-memory (OOM) errors. This post covers practical patterns, provider-specific considerations, and common production edge cases.
Context from Part 1
In Part 1, I covered how prompt size directly impacts TTFT through the prefill stage, where the model generates its KV cache. The relationship is linear: approximately 0.20-0.24ms per token, though this varies by provider and infrastructure.
This post will cover a few practical strategies to reduce these delays.
Mitigation and Optimisation
Several techniques can be used to manage the delays associated with large prompts. These include:
- Prefix Caching
- Parallel Processing (Cake)
- Prompt Engineering
Prefix Caching
Prefix caching allows the system to store and reuse the KV cache for frequently used or repeated prompt prefixes [#2].
This is particularly effective for:
- System instructions that remain constant across conversations
- Shared documents or context that appear in multiple requests
- Few-shot examples that are reused
How it works:
- The first request processes the full prompt and generates the KV cache
- The system saves the KV cache for the prefix portion
- Subsequent requests with the same prefix skip recomputation
- Only the new tokens (user query) need to be processed
Impact:
First request (3,000 token prompt):
Prefill: 600ms
Subsequent requests (2,500 token cached prefix + 500 new tokens):
Prefill: 100ms (83% reduction)
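These figures follow directly from the linear prefill model in Part 1. A minimal sketch of the arithmetic, assuming the 0.20ms/token lower bound (the per-token cost and token counts are illustrative, not measured):
PREFILL_MS_PER_TOKEN = 0.20  # assumed lower bound from Part 1 (0.20-0.24 ms/token)
def estimate_prefill_ms(prompt_tokens, cached_prefix_tokens=0):
    # Only tokens not covered by the cached prefix incur prefill cost
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached * PREFILL_MS_PER_TOKEN
print(estimate_prefill_ms(3_000))                              # 600.0ms (first request)
print(estimate_prefill_ms(3_000, cached_prefix_tokens=2_500))  # 100.0ms (cached prefix)
# NOTE: The pseudo-code is only for illustration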
📌 Takeaway: If the system instructions or context are stable, prefix caching can significantly reduce TTFT for subsequent requests. Provider implementations vary widely (see Provider-Specific Variations below).
Parallel Processing (Cake)
New systems like “Cake” attempt to reduce TTFT by parallelizing KV cache generation [#3].
How it works:
- Simultaneously compute the cache on the GPU from the start of the prompt
- Load a saved cache from disk in reverse order
- Merge the two caches as they meet in the middle
This reduces TTFT by overlapping computation and I/O, particularly effective when:
- The prefix is cached on disk but needs to be loaded
- The suffix is new and needs to be computed
- Disk I/O and GPU computation can run in parallel
Prompt Engineering
Splitting complex prompts into parallel smaller prompts reduces perceived delay by pre-processing context before user queries arrive [#1].
This approach is particularly effective for:
- Document-heavy applications where users upload files before querying
- Long-running conversations where history can be pre-summarized
- Multi-tenant systems where context is shared across users
For implementation details, see Pattern 4: Background Context Pre-processing.
The Hidden Costs
Often overlooked considerations:
- Predictive loading logic complexity:
- Which documents will users query? Wrong predictions waste compute.
- When to invalidate pre-processed summaries?
- Too aggressive → cache misses
- Too conservative → stale data
- How to prioritize pre-processing under resource constraints?
- Background compute economics:
- Background processing increases costs in low-utilization scenarios
- 🗨️ Break-even occurs when queries significantly outnumber uploads (say, 3:1 or higher)
- Below that threshold, pre-processing costs may exceed the latency benefits users experience
- Cache invalidation complexity:
- Document updates require re-processing all dependent summaries
- In collaborative environments, updates may occur faster than summary generation
Trade-off Decision Framework
This pattern delivers maximum ROI when:
- Context change frequency is LOW relative to query frequency (documents remain stable for hours/days, not minutes)
- Query patterns are predictable (e.g., FAQ scenarios)
Skip this pattern when:
- Documents change frequently (collaborative editing, real-time feeds)
- Query patterns are unpredictable (exploratory analysis)
- Team lacks infrastructure to monitor cache hit rates
Practical Implementation Patterns
Here are a few effective patterns that can be used in production systems:
Pattern 1: Tiered Context Loading (Demand-Driven Context Injection)
Load context on demand based on query complexity, i.e. only inject the context that each query actually needs.
def get_context(user_query, complexity_threshold=0.7):
# Always load: lightweight system prompt (~200 tokens)
context = get_system_prompt()
# Conditional: add history only if needed
if needs_history(user_query):
context += get_recent_history(max_tokens=500)
# Conditional: add documents only for complex queries
if query_complexity(user_query) > complexity_threshold:
context += get_relevant_docs(max_tokens=1000)
return context
# NOTE: The pseudo-code is only for illustration
How token reduction translates to TTFT improvement:
Query complexity classification determines context size:
- Simple queries (e.g., “What is X?”) → 200 tokens (system prompt only)
- Medium queries (e.g., “How does X work?”) → 700 tokens (+ recent history)
- Complex queries (e.g., “Compare X and Y in context Z”) → 1,700 tokens (+ documents)
Token reduction analysis:
Assuming typical query distribution: 40% simple / 30% medium / 30% complex
Average tokens with tiered loading:
(0.4 × 200) + (0.3 × 700) + (0.3 × 1,700) = 800 tokens
Without tiering (always full context): 1,700 tokens
Reduction: (1,700 - 800) / 1,700 ≈ 53%
This translates to proportional TTFT improvement:
- Original TTFT: 1,700 × 0.24ms = 408ms
- Optimised TTFT: 800 × 0.24ms = 192ms
- Improvement: 216ms (53% reduction)
The 30-70% range depends on query distribution:
- FAQ-heavy applications (60% simple): ~65% reduction
- Analysis-heavy applications (60% complex): ~35% reduction
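To see how the 30-70% range emerges from the query mix, here is a minimal sketch of the weighted-average arithmetic above; the tier sizes and distributions are the illustrative figures from this section:
TIER_TOKENS = {"simple": 200, "medium": 700, "complex": 1_700}
FULL_CONTEXT_TOKENS = 1_700  # baseline: always load the full context
def expected_reduction(distribution):
    # distribution maps tier name -> share of traffic (shares sum to 1.0)
    avg_tokens = sum(TIER_TOKENS[tier] * share for tier, share in distribution.items())
    return 1 - avg_tokens / FULL_CONTEXT_TOKENS
print(expected_reduction({"simple": 0.4, "medium": 0.3, "complex": 0.3}))  # ~0.53
print(expected_reduction({"simple": 0.6, "medium": 0.2, "complex": 0.2}))  # FAQ-heavy, ~0.65
print(expected_reduction({"simple": 0.4, "medium": 0.0, "complex": 0.6}))  # analysis-heavy, ~0.35
# NOTE: The pseudo-code is only for illustration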
Implementation note: Profile your query distribution for 1-2 weeks to understand potential impact before starting implementation.
💼 Recent Experience: in recent projects I have worked on, query distribution has varied significantly:
- Knowledge base systems: 70% simple queries
- Document analysis tools: 60% complex queries
References:
- OpenViking implements a similar 3-tier approach (L0/L1/L2) [#7]
- Context management tools commonly trigger tiering at >70% context utilization [#8]
- Architecture patterns for tiered context loading [#9]
Pattern 2: Aggressive Summarisation
Aggressively summarise the conversation history, i.e. compact older messages rather than truncating them; this reduces the token count while preserving the semantic meaning.
def prepare_conversation_context(messages, max_tokens=1000):
    recent_messages = messages[-5:]  # Keep last 5 verbatim
    older_messages = messages[:-5]
    if len(older_messages) > 0:
        # Summarise older context into a single synthetic message
        summary = summarise(older_messages, target_tokens=200)
        return [summary] + recent_messages
    return recent_messages
# NOTE: The pseudo-code is only for illustration
Trade-off:
- Token reduction: 2,000 → 1,000 (50%)
- TTFT improvement calculation:
- Original: 2,000 × 0.20ms = 400ms (lower) to 2,000 × 0.24ms = 480ms (upper)
- Optimised: 1,000 × 0.20ms = 200ms (lower) to 1,000 × 0.24ms = 240ms (upper)
- Improvement: 200-240ms reduction
- Context loss: Minimal (recent messages preserved)
🗨️ Monitor summary quality. Aggressive summarisation can lose:
- Nuanced context
- User preferences mentioned earlier
- Important constraints from prior messages
Pattern 3: Cached System Instructions
Use prefix caching for stable system prompts.
# Mark prefix for caching (implementation varies by provider)
system_prompt = """
You are a helpful assistant...
[1,000 tokens of instructions]
"""
# First call: full prefill
response = llm.complete(
prompt=system_prompt + user_query,
cache_prefix=True # Provider-specific
)
# Subsequent calls: cached prefix
# Only user_query tokens are processed
response = llm.complete(
prompt=system_prompt + user_query,
cache_prefix=True
)
# NOTE: The pseudo-code is only for illustration
Impact:
- First call: 1,100 tokens → TTFT calculation:
- Lower bound: 1,100 × 0.20ms = 220ms
- Upper bound: 1,100 × 0.24ms = 264ms
- TTFT: 220-264ms
- Cached calls: 100 tokens → TTFT calculation:
- Lower bound: 100 × 0.20ms = 20ms
- Upper bound: 100 × 0.24ms = 24ms
- TTFT: 20-24ms
- Reduction: 90% improvement (from 220-264ms to 20-24ms)
Measured Results: Prefix Caching on a Shared API
The estimates above assume that prefill cost dominates TTFT. To test this, I ran a prefix caching benchmark against Azure OpenAI (gpt-4o-mini, Australia East) from Singapore, measuring cache-miss TTFT vs cache-hit TTFT across five prefix sizes (N=3 miss measurements, N=20 hit measurements per size). Each prefix size starts with a throwaway warmup request, so the first “real” measurement isn’t skewed by connection setup overhead.
| Prefix Size | Cache Miss (mean) | Cache Hit (P50) | Cache Hit (P95) | Measured Reduction | Valid Hits | Calculated Reduction |
|---|---|---|---|---|---|---|
| ~1,500 tokens | 1.015s | 1.150s | 2.821s | -13.3% | 18/20 | 99.0% |
| ~3,000 tokens | 1.404s | 0.949s | 1.603s | 32.4% | 20/20 | 99.5% |
| ~5,000 tokens | 1.732s | 1.057s | 1.618s | 39.0% | 20/20 | 99.7% |
| ~10,000 tokens | 1.379s | 1.201s | 1.988s | 12.9% | 15/20 | 99.9% |
| ~20,000 tokens | 1.486s | 1.411s | 1.953s | 5.0% | 10/20 | 99.9% |
The “Calculated Reduction” column assumes 100% of prefix tokens are cached and only the user query tokens (~15) incur prefill cost, i.e. the arithmetic estimate from the pattern above. The “Measured Reduction” column is what the benchmark actually observed. The “Valid Hits” column shows how many of the 20 cache-hit measurements succeeded without errors.
Why the gap? On a shared API endpoint, network round-trip latency (Singapore to Australia East) and infrastructure overhead account for a large, roughly fixed portion of TTFT. Prefix caching eliminates prefill compute on the server, but this saving is small relative to the total latency budget. The sweet spot is in the 3,000-5,000 token range, where measured reduction peaks at 32-39%.
What happens at larger prefix sizes? At 10,000 and 20,000 tokens, the benchmark hit API rate limits (HTTP 429), with 5 and 10 failed measurements respectively. The higher token volume per request increases the likelihood of hitting per-minute token quotas on shared endpoints. This degrades both data quality (fewer valid measurements) and the measured reduction (the miss baseline becomes unreliable with partial caching from prior requests bleeding through).
This is itself a production-relevant finding: prefix caching at scale is not just a compute problem, it is a rate limiting problem. Larger prefixes mean more tokens per request, which means fewer requests before hitting quotas. The engineering response is adaptive rate limiting, i.e. a self-adjusting mechanism that decreases request frequency when rate-limited and gradually increases it when succeeding. I implemented this pattern previously in a newsletter analysis agent using a decorator-based approach with configurable backoff. Porting this to the benchmark is the next step.
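A minimal sketch of that adaptive backoff idea is shown below. It is a generic retry decorator for rate-limit errors rather than the actual implementation from that agent; RateLimitError stands in for whatever exception your client library raises on HTTP 429.
import functools
import random
import time
class RateLimitError(Exception):
    """Stand-in for the exception your LLM client raises on HTTP 429."""
def with_backoff(max_retries=5, base_delay=1.0, max_delay=60.0):
    # Retry the wrapped call with exponential backoff plus jitter when rate-limited
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries:
                        raise
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    time.sleep(delay + random.uniform(0, delay * 0.1))
        return wrapper
    return decorator
@with_backoff(max_retries=4)
def measure_once(prompt):
    ...  # issue one streaming request and time the first token
# NOTE: The pseudo-code is only for illustration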
The takeaway: prefix caching delivers meaningful TTFT reduction on shared API endpoints, peaking at 32-39% for 3,000-5,000 token prefixes. The benefit is bounded by two factors: network and infrastructure latency (which dominates at small prefix sizes), and API rate limits (which constrain measurement quality at large prefix sizes). For self-hosted deployments where both constraints are absent, the measured reduction would approach the calculated estimate.
The benchmark script is available in the accompanying ttft-benchmark repository.
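For readers who want to reproduce the measurement without the full script, the core loop is just a streaming call timed to the first content chunk. A sketch using the OpenAI Python SDK's streaming interface (model name, prefix size, and query are placeholders; the real benchmark adds warmups, repetitions, and Azure-specific client setup):
import time
from openai import OpenAI  # for Azure OpenAI, use AzureOpenAI with your endpoint and key
client = OpenAI()
def measure_ttft(prefix, query, model="gpt-4o-mini"):
    # Time from request start until the first streamed content chunk arrives
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": prefix},
                  {"role": "user", "content": query}],
        stream=True,
        max_tokens=16,  # only the first token matters for TTFT
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start
prefix = "You are a helpful assistant. " * 500  # stable prefix, a few thousand tokens
miss = measure_ttft(prefix, "What is TTFT?")                 # first request: cache miss
hit = measure_ttft(prefix, "Why does prompt size matter?")   # same prefix: likely cache hit
print(f"miss={miss:.3f}s hit={hit:.3f}s")
# NOTE: The pseudo-code is only for illustration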
Pattern 4: Background Context Pre-processing
❌ Synchronous (naive):
User query triggers → Process 3,600 tokens → TTFT: 720-864ms

✅ Asynchronous (optimised):
Background: Pre-process 2,500 tokens (docs + history)
User query triggers → Process 1,000 tokens → TTFT: 200-240ms

Improvement: 72-77% TTFT reduction
When this pattern matters:
- Document-heavy applications where users upload files before querying
- Long-running conversations where history can be pre-summarized
- Multi-tenant systems where context is shared across users
For implementation considerations, see The Hidden Costs and Trade-off Decision Framework.
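A minimal sketch of the asynchronous flow, using asyncio as a stand-in for a production task queue; summarise_document, load_raw_document, and build_prompt are hypothetical helpers:
import asyncio
preprocessed = {}  # doc_id -> pre-computed summary; use Redis or similar in production
async def preprocess_document(doc_id, text):
    # Runs in the background as soon as the upload completes
    preprocessed[doc_id] = await summarise_document(text)  # hypothetical LLM call
def on_document_upload(doc_id, text):
    # Fire-and-forget: summarisation runs while the user is still reading/typing
    asyncio.create_task(preprocess_document(doc_id, text))
async def on_user_query(doc_id, query):
    summary = preprocessed.get(doc_id)
    if summary is None:
        # Pre-processing has not finished: fall back to the full prefill cost
        summary = await summarise_document(load_raw_document(doc_id))
    return build_prompt(summary, query)  # compact context -> small prompt -> low TTFT
# NOTE: The pseudo-code is only for illustration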
Things To Take Note Of
1. TTFT ≠ Generation Speed
TTFT measures how long until the first token appears (prefill phase). Generation speed measures how many tokens per second are produced after generation starts (decode phase). The two are independent and not interchangeable.
Prefill Phase (TTFT):
├─ Process all input tokens in parallel
├─ Generate KV cache for entire input
└─ Computational cost: linear with input tokens
Decode Phase (Generation Speed):
├─ Process one token at a time
├─ Use cached KV values from prefill
└─ Computational cost: linear with output tokens
Two kinds of scenarios:
- Fast TTFT + slow generation (small prompt, slow model)
- Slow TTFT + fast generation (large prompt, fast model)
Optimise both independently.
| Metric | Optimisation Target | Techniques |
|---|---|---|
| TTFT | Reduce input tokens | Prefix caching, prompt compression, tiered context loading |
| Generation speed | Increase throughput | Quantization, speculative decoding, batch processing |
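Because the two metrics come from different phases, record them separately from the same streamed response. A minimal sketch, assuming you have already logged the request start time and the arrival time of each streamed chunk:
def split_metrics(request_start, chunk_times, output_tokens):
    # chunk_times: arrival timestamps (seconds) of each streamed chunk, in order
    ttft = chunk_times[0] - request_start                 # prefill phase
    decode_window = chunk_times[-1] - chunk_times[0]      # decode phase
    tokens_per_sec = output_tokens / decode_window if decode_window > 0 else float("inf")
    return ttft, tokens_per_sec
# NOTE: The pseudo-code is only for illustration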
2. Batch Size Amplifies KV Cache Costs
The KV cache scales linearly with both the number of tokens and the batch size. This multiplication can quickly overwhelm GPU memory.
Memory calculation:
From NVIDIA’s formula:
Total KV cache size (bytes) = batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(FP16)
For a 7B parameter model like Llama 2 (model card):
- Layers: 32
- Hidden size: 4,096
- Precision: FP16 (2 bytes per value)
Example 1: Single request:
Prompt: 2,000 tokens
Batch size: 1
Using NVIDIA's formula:
KV cache = batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(FP16)
= 1 × 2,000 × 2 × 32 × 4,096 × 2 bytes
= 1,048,576,000 bytes
≈ 1 GB
Example 2: Batch of 8 requests:
Prompt: 2,000 tokens each
Batch size: 8
Using NVIDIA's formula:
KV cache = batch_size × sequence_length × 2 × num_layers × hidden_size × sizeof(FP16)
= 8 × 2,000 × 2 × 32 × 4,096 × 2 bytes
= 8,388,608,000 bytes
≈ 8 GB
Why this matters:
A GPU with 24GB VRAM might handle:
- 10-12 concurrent requests with 2,000-token prompts
- Only 4-5 concurrent requests with 5,000-token prompts
💡 Takeaway: In production batch inference, monitor batch_size × prompt_tokens, not just batch size alone. Naive batch processing will hit OOM sooner or later; consider adaptive batch sizing (a minimal sketch follows).
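A minimal sketch of that monitoring idea: project the KV cache footprint with NVIDIA's formula and cap the batch accordingly. The model dimensions are the Llama 2 7B figures above; the KV budget assumes roughly 10 GB left on a 24 GB card after ~14 GB of FP16 weights.
NUM_LAYERS, HIDDEN_SIZE, FP16_BYTES = 32, 4_096, 2  # Llama 2 7B, FP16
def kv_cache_bytes(batch_size, seq_len):
    # NVIDIA formula: batch x seq_len x 2 (K and V) x layers x hidden x bytes per value
    return batch_size * seq_len * 2 * NUM_LAYERS * HIDDEN_SIZE * FP16_BYTES
def max_batch_size(seq_len, kv_budget_bytes):
    # Largest batch whose KV cache fits in the budget (never below 1)
    return max(1, kv_budget_bytes // kv_cache_bytes(1, seq_len))
KV_BUDGET = 10 * 1024**3  # assumed VRAM left for KV cache on a 24 GB GPU
print(kv_cache_bytes(8, 2_000) / 1e9)    # ~8.4e9 bytes, the batch-of-8 example above
print(max_batch_size(2_000, KV_BUDGET))  # ~10 concurrent 2,000-token prompts
print(max_batch_size(5_000, KV_BUDGET))  # ~4 concurrent 5,000-token prompts
# NOTE: The pseudo-code is only for illustration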
3. Provider-Specific Variations
Different LLM providers implement prefill and caching differently, leading to significant TTFT variations for the same prompt.
- Anthropic Claude: Uses explicit prompt caching that requires developers to mark content with cache_control parameters [#4]. Enables latency reductions of up to 85% for long prompts when caching is enabled [#5]. Cache entries have a 5-minute TTL (default) or a 1-hour TTL (optional, at higher cost).
- OpenAI: Implements automatic prompt caching, enabled by default for prompts exceeding 1,024 tokens, with a 50% cost reduction for cached tokens [#6].
  - Caching occurs transparently, without code changes.
  - Setting the prompt_cache_key parameter helps improve cache hit rates through cache routing (requests are routed to a machine based on a hash of the initial prefix of the prompt, increasing the chance of a cache hit).
  - Cache lifetime can be extended using the prompt_cache_retention parameter.
{
"model": "gpt-5.1",
"input": "Your prompt goes here...",
"prompt_cache_retention": "24h"
}
# From https://platform.openai.com/docs/guides/prompt-caching
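For comparison, a sketch of Claude's explicit approach: the stable prefix is marked with a cache_control block (see [#4]). The model name, system prompt, and user query below are placeholders.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder model name
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,               # the stable ~1,000-token prefix
            "cache_control": {"type": "ephemeral"},   # mark this block for caching
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
# NOTE: The pseudo-code is only for illustration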
📌 Always benchmark TTFT with your specific provider and workload patterns rather than relying on general estimates.
4. Edge Cases
Common edge cases likely to occur in multi-user production environments:
- Power users & long conversation history:
  Regular user: 5 messages, 200 tokens
  Power user: 50 messages, 5,000 tokens
  Outcome: the power user waits ~25× longer (when no optimisation is in place)
  This is context pollution through accumulation, i.e. each turn adds more tokens to the prompt; without active context management, the most engaged users end up with the worst experience.
- Large document uploads:
  Typical query: 300 tokens
  With a 100-page PDF: 50,000+ tokens
  Outcome: TTFT increases from ~60ms to ~10,000ms (167× slower)
Multi-turn debugging sessions:
Turn 1: 500 tokens Turn 5: 2,500 tokens (accumulated context) Turn 10: 5,000 tokens Turn 20: 10,000 tokens Each turn gets progressively slower without context management
How To Mitigate Edge Cases?
Mitigation Strategy 1: Progressive Degradation
def prepare_context(user_query, conversation_history):
token_budget = 3000 # Target token limit
# Essential: Always include
context = get_system_prompt() # 500 tokens
context += user_query # ~100 tokens
remaining_budget = token_budget - count_tokens(context)
# Optional: Add based on remaining budget
if remaining_budget > 1000:
# Include recent history
recent = get_recent_messages(
conversation_history,
max_tokens=min(remaining_budget, 2000)
)
context += recent
remaining_budget -= count_tokens(recent)
if remaining_budget > 500:
# Include relevant documents
docs = get_relevant_docs(
user_query,
max_tokens=remaining_budget
)
context += docs
return context
# NOTE: The pseudo-code is only for illustration
Mitigation Strategy 2: Aggressive Summarisation At Thresholds
def manage_conversation_context(messages):
total_tokens = sum(count_tokens(m) for m in messages)
# Threshold-based compression
if total_tokens < 2000:
return messages # No compression needed
elif total_tokens < 5000:
# Summarise older messages, keep recent verbatim
old_messages = messages[:-10]
recent_messages = messages[-10:]
summary = summarise(old_messages, target_tokens=500)
return [summary] + recent_messages
else: # > 5000 tokens - aggressive compression
# Keep only last 5 messages + summary of all prior
old_messages = messages[:-5]
recent_messages = messages[-5:]
summary = summarise(old_messages, target_tokens=300)
return [summary] + recent_messages
# NOTE: The pseudo-code is only for illustration
Implementation Strategy: A Phased Approach
Prioritise the optimisations based on ROI and implementation effort:
Phase 1: Foundation
1. TTFT Monitoring
Why first? One cannot optimise what one does not measure. 🗨️ Teams that skip this step risk wasting cycles on premature optimisation.
- Percentile tracking (P50, P95, P99), per endpoint
- User segment analysis (power users vs typical)
- Minimal effort required for implementation: <1 week, assuming an engineer with 2-3 years of experience
- Additional cost: minimal (since only additional logging involved)
- Impact: indirect on its own, but it underpins every subsequent optimisation (see the logging sketch below)
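A minimal sketch of the logging side, assuming TTFT samples (in milliseconds) are recorded per endpoint; the percentile helper uses only the standard library:
from collections import defaultdict
import statistics
ttft_log = defaultdict(list)  # endpoint -> list of TTFT samples in ms
def record_ttft(endpoint, ttft_ms):
    ttft_log[endpoint].append(ttft_ms)
def ttft_percentiles(endpoint):
    samples = sorted(ttft_log[endpoint])
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points: cuts[i] ~ (i+1)th percentile
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
# NOTE: The pseudo-code is only for illustration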
2. Prefix Caching for System Prompts
Why second? Highest ROI with lowest implementation risk
- Implement provider-native caching
- Monitor cache hit rates
- Target stable prompts which have high hit rate (> 70%)
- Minimal effort required, since minimal code changes
- Impact: significant
Phase 2: Conditional Optimisations
3. Tiered Context Loading
Why third: Adds complexity (query classification), so only implement if Phase 1 monitoring shows high variance in prompt sizes.
- Build a query classifier (determines if a query is simple/medium/complex)†
- Implement conditional context loading
- Monitor metrics:
- Classification accuracy (target: >85%)
- TTFT improvement per tier
- Response quality vs full context baseline
- Effort required: 1-2 weeks for 2 engineers
- Impact: 30-50% additional TTFT reduction
† Query Classifier Implementation: Options include fine-tuned BERT (best accuracy), GPT-4o-mini (fastest deployment), or rule-based heuristics (simplest). Input: raw query text. Output: simple/medium/complex + confidence. Example: “What is X?” → simple (system prompt only), “Compare X and Y considering Z” → complex (full context).
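As an illustration of the simplest of those options, a rule-based heuristic might look like the sketch below; the keyword sets and length threshold are assumptions to be tuned against your own query log:
COMPARE_WORDS = {"compare", "difference", "versus", "vs", "trade-off"}
EXPLAIN_WORDS = {"how", "why", "explain"}
def classify_query(query):
    # Returns "simple", "medium", or "complex" (rule-based placeholder classifier)
    q = query.lower()
    if any(word in q for word in COMPARE_WORDS) or len(q.split()) > 25:
        return "complex"   # full context: system prompt + history + documents
    if any(word in q for word in EXPLAIN_WORDS):
        return "medium"    # system prompt + recent history
    return "simple"        # system prompt only
# NOTE: The pseudo-code is only for illustration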
4. Aggressive Summarisation
- Implement only if conversation history consistently pushes prompt size beyond the configured percentile threshold (e.g., at P95)
- Prerequisites:
- Requires a summary quality evaluation pipeline to be in place prior to development
- Human-in-the-loop validation
- Rollback mechanisms
Why last: Introduces non-determinism and requires quality monitoring infrastructure
Advanced & Emerging Techniques
Automatic Prefix Caching in vLLM (Self-Hosted)
For self-hosted deployments, vLLM’s Automatic Prefix Caching detects and reuses shared prefixes across requests without manual intervention. It works transparently with paged attention, making it ideal for multi-user or multi-turn workloads.
Key benefits:
- No code changes needed beyond standard vLLM server launch
- Excellent cache hit rates when many requests share system prompts or RAG context
- Combines well with paged attention for memory efficiency under variable batch sizes
Usage example (OpenAI-compatible server):
# Automatic prefix caching is enabled by default in recent vLLM versions
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 8 \
--max-model-len 32768 \
--enable-prefix-caching # Optional flag (often default-on)
Subsequent requests with identical prefixes automatically reuse the KV cache. Monitor cache hits via vLLM metrics (e.g., vllm:prefix_cache_hit_rate).
Prefill-Decode Disaggregation
Emerging in advanced serving systems (e.g., DistServe, Moreh variants), this runs prefill on separate high-memory nodes while decode uses optimised throughput nodes, reducing TTFT bottlenecks in mixed workloads.
vLLM supports an experimental disaggregated prefill mode via separate prefill/decode instances and KV transfer connectors.
Example configuration (simplified NixlConnector):
# Prefill instance
python -m vllm.entrypoints.openai.api_server \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_producer"}'
# Decode instance
python -m vllm.entrypoints.openai.api_server \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_consumer"}'
NOTE:
- This is only available in self-hosted/open-source engines like vLLM.
- Major hosted providers (such as Anthropic, OpenAI, Azure OpenAI, etc.) do not expose disaggregation controls via their APIs - prefill and decode are handled internally on shared infrastructure.
Chunked or Streaming Prefill
Some engines support generating early tokens before full prefill completes (speculative/partial/chunked prefill), overlapping phases for lower perceived TTFT. TensorRT-LLM and certain research systems implement this effectively.
However, this remains experimental and model-dependent. Major hosted providers (such as Anthropic, OpenAI, Azure OpenAI, etc.) do not support true chunked/streaming prefill via their APIs. Streaming only begins after complete prefill. Early token generation would require custom self-hosted setups.
Related Work
The patterns in this post operate at the application level, i.e. controlling what tokens enter the prompt. They are complementary to system-level optimisations that reduce the per-token cost of prefill.
PagedAttention [#10] introduced memory-efficient KV cache management by borrowing virtual memory concepts from operating systems. Instead of allocating contiguous GPU memory per sequence, PagedAttention stores KV cache in non-contiguous blocks, reducing memory waste from 60-80% to near-zero. This is the foundation underlying vLLM’s serving engine. The prefix caching patterns in this post (Patterns 3 and 4) build on top of PagedAttention’s memory management, i.e. efficient prefix reuse requires efficient KV cache storage first.
FlashAttention [#11] reduces prefill compute through IO-aware exact attention. By tiling attention computation to exploit GPU SRAM hierarchy, FlashAttention achieves 2-4x wall-clock speedup on prefill without approximation. This is complementary to application-level token reduction: FlashAttention reduces the per-token cost; the patterns here reduce the number of tokens. Both compound.
SGLang and RadixAttention [#12] implement tree-based prefix caching with LRU eviction for multi-turn and branching workloads. Rather than caching a single linear prefix, RadixAttention maintains a radix tree of all previously seen prefixes, enabling cache hits even when requests share partial prefixes. This is the system-level implementation of the prefix reuse that Pattern 3 exploits at the application level. SGLang’s automatic prefix sharing across requests in multi-turn conversations is particularly relevant to the edge cases discussed in Section 4.
Orca [#13] introduced continuous batching (iteration-level scheduling) for LLM serving. Rather than waiting for an entire batch to complete before starting the next, Orca schedules at the iteration level, allowing new requests to join mid-batch. The batch size discussion in Section 2 of this post operates on top of this scheduling paradigm, i.e. the batch_size x prompt_tokens memory constraint applies regardless of whether batching is static or continuous, but Orca’s approach mitigates the head-of-line blocking that static batching causes when prompt sizes vary.
SARATHI [#14] formalises chunked prefill to avoid decode stalls in mixed batches. When prefill and decode requests share a GPU, long prefill operations block decode tokens from being generated, causing latency spikes. SARATHI splits prefill into uniform chunks and interleaves them with decode steps, bounding the prefill-induced latency to chunk size. This directly relates to the “Chunked or Streaming Prefill” technique discussed in the Advanced section, i.e. SARATHI provides the formal analysis of the problem that section references informally.
ROI Calculation
Daily savings = V requests/day × (X - Y) ms improvement × $C per ms
Break-even days = Engineering cost / Daily savings
Where:
- V = Daily request volume
- X = Current P95 TTFT (ms)
- Y = Target P95 TTFT (ms)
- $C = Business cost per millisecond of latency
- Engineering cost = Z days à engineer daily rate
An example:
Given:
- Current P95 TTFT: 2,400ms
- Target P95 TTFT: 1,200ms (improvement: 1,200ms)
- Daily request volume: 200 requests/day
- Business cost: $6,000/hour = $1.67/second = $0.00167/ms
- Engineering effort: 5 days @ $1,000/day = $5,000
Calculate:
- Daily savings = 200 × 1,200ms × $0.00167/ms = $400.80/day
- Break-even = $5,000 / $400.80 = 12.5 days
Conclusion:
- The optimisation pays for itself in 12.5 working days.
- After that, it delivers $400.80/day in ongoing savings (~$100K annually, assuming ~250 working days).
💡 Key insight: At 200 requests/day with 1,200ms improvement, even a 5-day engineering effort breaks even in under 3 weeks.
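The same arithmetic as a small helper, handy for sensitivity checks on your own volumes; the call below reproduces the worked example above:
def roi(requests_per_day, current_p95_ms, target_p95_ms, cost_per_ms, engineering_cost):
    # Returns (daily savings in dollars, break-even time in days)
    daily_savings = requests_per_day * (current_p95_ms - target_p95_ms) * cost_per_ms
    return daily_savings, engineering_cost / daily_savings
print(roi(200, 2_400, 1_200, 0.00167, 5_000))  # (400.8, ~12.5 days)
# NOTE: The pseudo-code is only for illustration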
References
- How input token count impacts the latency of AI chat tools (Jan, 2026)
- Prefix caching
- Compute or Load KV Cache? Why not both? (arXiv, Oct, 2024)
- Prompt caching with Claude
- Prompt caching announcement
- OpenAI Prompt Caching Documentation
- OpenViking Tiered Context Loading
- MCP Context Loader
- DeepWiki Context Engineer Architecture
- Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., SOSP 2023)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., NeurIPS 2022); FlashAttention-2 (Dao, 2023)
- SGLang: Efficient Execution of Structured Language Model Programs (Zheng et al., 2024)
- Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., OSDI 2022)
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills (Agrawal et al., 2023)
ℹ️ Version History
- 07 Feb 2026 : Post published
- 04 Mar 2026 : Added Related Work section (5 papers)
- 05 Mar 2026 : Added prefix caching benchmark (5 sizes, rate limiting analysis)