Understanding the mechanics of Time to First Token (TTFT) and why extra prompt tokens can lead to a poor user experience (UX).
TL;DR: Prompt size has a direct, linear relationship with TTFT on dedicated infrastructure, with per-token latency ranging from 0.04-0.24ms depending on the setup. However, on shared infrastructure, network and queuing variance (±500ms) often dominates the prompt processing signal (40-120ms). Understanding both the underlying mechanics and your deployment model is critical for optimising production systems.
The Problem: That Uncomfortable Pause
If you have built LLM-powered applications, you have probably noticed the uncomfortable pause between when a user finishes their input and when the AI begins responding. In real-time chat applications, this pause can make or break the perceived responsiveness of the system.
This is not about how fast the model generates tokens once it starts. This is about how long it takes to begin generating in the first place.
The likely culprit? The size of the prompt.
The Linear Relationship
Prompt size has a direct, linear relationship with the initial delay of an LLM response, specifically the TTFT.
As the number of input tokens increases, the time required for the model to process the prompt and begin generating a response also grows.
Quantifying the impact:
Multiple sources have documented this relationship, though with significant variance:
- Per-token latency: Every additional input token increases the P95 TTFT by approximately 0.24ms [#2]
- Bulk increments: Every additional 500 tokens in a prompt adds roughly 20-30 milliseconds of latency [#1]
- Cumulative effect: Reducing a prompt by 1,000 tokens results in a 100ms to 240ms reduction in TTFT [#5]
⚠️ Note: These estimates come from sources using different infrastructure (shared API vs. dedicated capacity), which explains the variance in per-token rates.
Available Data On Per-Token TTFT
Granular per-token latency data are rarely published. Across the available sources, per-token TTFT ranges from 0.04ms [#1] to 0.24ms [#2], depending on model and infrastructure type.
[Reference #1, Talkative] mentions:
“On average, every additional 500 tokens in a prompt increased response time by 20–30 milliseconds”
Per-token TTFT derived from this figure:
- Lower bound: 20ms ÷ 500 tokens = 0.04ms per token
- Upper bound: 30ms ÷ 500 tokens = 0.06ms per token
- Model: GPT-4o
- Infrastructure: OpenAI standard API
However, they also note:
“Even with small prompts, a request to OpenAI will add ~800ms of dead air time”
This 800ms baseline suggests significant network and queuing overhead beyond prompt processing.
[Reference #2, Glean] mentions:
“For every additional input token, the P95 TTFT increases by ~0.24ms and the average TTFT increases by ~0.20ms”
Per-token TTFT reported:
- P95: 0.24ms per token
- Average: 0.20ms per token
- Model: GPT-4 Turbo
- Infrastructure: Azure PTU (Provisioned Throughput Units - dedicated capacity)
- Test range: 50 to 100,000 tokens
Practical Reasons For Variance
This variance stems from factors such as:
- Infrastructure: Dedicated capacity (Azure PTU) vs. shared APIs
- Model architecture: GPT-4o is optimised for speed (roughly 2x faster than GPT-4 Turbo)
- Model size: Number of parameters
- Batch size and current load: Server utilisation impacts queuing
- Network round-trip latency: Geographic distance to API endpoint
Network Round-Trip Latency Example
Geographic differences alone (37ms vs 224ms) can exceed the entire prompt-processing overhead (40-120ms for 2,000 tokens), which is why, depending on the setup, infrastructure choices may matter more than prompt optimisation.
Below is an example (based on Azure's published network round-trip latency statistics) of the additional latency an application hosted in Singapore would experience when accessing various Azure OpenAI endpoints:
| API Client Location | API Endpoint (Azure region) | Network Round-Trip Latency |
|---|---|---|
| Singapore | South India | 37ms |
| Singapore | Australia East | 94ms |
| Singapore | West US | 171ms |
| Singapore | East US | 224ms |
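If you want to sanity-check these figures from your own deployment region, a rough proxy is the TCP connection time to each endpoint. The sketch below is a minimal example only: the hostnames are placeholders for your own Azure OpenAI resource endpoints, and TCP connect time approximates (but is not identical to) the round-trip latencies in the table above.

```python
import socket
import time

# Placeholder hostnames - substitute your own Azure OpenAI resource endpoints.
ENDPOINTS = {
    "South India": "example-southindia.openai.azure.com",
    "Australia East": "example-australiaeast.openai.azure.com",
    "East US": "example-eastus.openai.azure.com",
}

def tcp_connect_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Median TCP connect time in milliseconds - a rough proxy for network round-trip latency."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]

for region, host in ENDPOINTS.items():
    print(f"{region}: ~{tcp_connect_ms(host):.0f} ms")
```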
Empirical Testing With My Own Micro-Benchmark
To validate these published estimates, I ran my own micro-benchmark using 5 LLMs on Azure OpenAI endpoints in Australia and India, plus OpenRouter’s shared infrastructure.
Key findings:
- Flat or inverted relationship: 4 of 5 LLMs showed larger prompts completing at similar or faster speeds than smaller ones → infrastructure variance overwhelmed the prompt-processing signal.
- Only one model (gpt-4.1-mini) showed a measurable positive rate: +0.017 ms/token, within the expected theoretical range.
- Network routing added 600–1,400ms before any model processing began → dwarfing the ~200ms a 5,000-token prompt theoretically adds.
The uncomfortable conclusion: on shared API infrastructure, optimising prompt size for TTFT yields no reliable improvement. The per-token latency (0.10–0.25ms) that matters on dedicated capacity disappears into noise on standard endpoints.
For full methodology, raw results, and the benchmark script, see my ttft-benchmark Git repository.
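For reference, the core measurement is simple: stream the response and time the arrival of the first chunk. The sketch below is not the script from the repository, just a minimal illustration using the OpenAI Python SDK; the model name is a placeholder, and the measured figure includes network and queuing time, not just prefill.

```python
import time
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft_ms(prompt: str, model: str = "gpt-4o-mini") -> float:
    """Time (ms) from sending the request to receiving the first streamed chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=1,  # only the first token matters for this measurement
    )
    for _ in stream:  # the first iteration completes when the first chunk arrives
        break
    return (time.perf_counter() - start) * 1000

# Compare a short prompt against the same prompt padded with filler text.
short_prompt = "Summarise the benefits of caching in one sentence."
long_prompt = short_prompt + " Ignore the following filler. " + ("lorem ipsum " * 2000)

print(f"short prompt: {measure_ttft_ms(short_prompt):.0f} ms")
print(f"long prompt:  {measure_ttft_ms(long_prompt):.0f} ms")
```

In practice you would repeat each measurement many times and compare percentiles rather than single runs, since the variance described above easily swamps any one data point.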
Why the Discrepancy with Published Benchmarks?
Published benchmarks used dedicated infrastructure (Azure PTU) specifically to eliminate confounding variables. My results on shared infrastructure reveal what happens in several production scenarios:
- Network routing variance: OpenRouter dynamically routes requests to different backend servers
- Queuing delays: Server-side queues introduce variable wait times (likely ~2.5-3s baseline)
- Cold starts: Some requests may hit servers requiring initialization
- Load balancing: Server load varies across requests
Per-Token TTFT Estimate Used In This Post
For this post, I will use 0.20-0.24ms per token as a conservative (worst-case) estimate, based on Glean's empirical testing [#2]. Per-token rates vary dramatically, so for production planning, benchmark your specific provider, model, and infrastructure configuration rather than relying on published estimates.
These numbers matter. In real-time chat applications, 200ms of added latency is noticeable and can lead to the perception that the system is “slow”. When running small language models (SLMs) locally, long prompts can add tens of seconds to response times.
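As a back-of-the-envelope model, TTFT can be approximated as a fixed infrastructure baseline plus a linear per-token cost. The sketch below is purely illustrative; the default values come from the figures cited above (the ~800ms shared-API baseline [#1] and Glean's P95 rate [#2]) and should be replaced with numbers from your own benchmarks.

```python
def estimated_ttft_ms(input_tokens: int,
                      per_token_ms: float = 0.24,  # P95 per-token rate on dedicated capacity [#2]
                      baseline_ms: float = 800.0   # network + queuing overhead on a shared API [#1]
                      ) -> float:
    """Back-of-the-envelope TTFT: a fixed overhead plus a linear per-token cost."""
    return baseline_ms + input_tokens * per_token_ms

for tokens in (500, 2_000, 10_000):
    print(f"{tokens:>6} input tokens -> ~{estimated_ttft_ms(tokens):.0f} ms TTFT")
```

Note how the baseline dominates until the prompt grows into the thousands of tokens, which mirrors the micro-benchmark findings above.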
The Mechanics of the Prefill Stage
The delay caused by prompt size primarily occurs during the prefill stage - the phase where the model processes the entire input sequence before generating its first output token.
Here is what happens during prefill:
KV Cache Generation
The model generates a Key-Value (KV) cache that acts as its “short-term memory” for the session [#3]. This cache:
- Stores the computed key and value vectors for each token in the input
- Grows linearly with the number of tokens [#4]
- Must be generated before the first completion token can be produced
- Can quickly overwhelm GPU memory under long context or large batch size settings [#4]
The KV cache is why the LLM can maintain coherent conversations without reprocessing the entire history on every turn - but building it is expensive.
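To get a feel for why the KV cache "can quickly overwhelm GPU memory", the standard sizing formula is 2 (keys and values) × layers × KV heads × head dimension × bytes per value, per token. The sketch below uses dimensions that roughly match a 7B-parameter model without grouped-query attention; these are assumptions for illustration, not measurements of any hosted model.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,     # assumed: 7B-class model
                   num_kv_heads: int = 32,   # assumed: no grouped-query attention
                   head_dim: int = 128,
                   bytes_per_value: int = 2  # fp16
                   ) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

for tokens in (1_000, 8_000, 32_000):
    print(f"{tokens:>6} tokens -> {kv_cache_bytes(tokens) / 1e9:.2f} GB of KV cache")
```

At roughly half a megabyte per token under these assumptions, a 32,000-token context already consumes around 17GB before any batching, which is why long-context and large-batch settings hit memory limits so quickly [#4].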
Computational Intensity
Processing these tokens is compute-intensive because the model must run the input through its entire Transformer network before producing the first completion token [#3]. This includes:
- Multi-head attention computations across all layers
- Feed-forward network passes
- Layer normalisations
- Position encodings
All of this work happens before the end user sees the first word of the response.
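Hosted APIs do all of this server-side, but the two phases are easy to see with an open model. The sketch below uses Hugging Face transformers with gpt2 purely as a small stand-in: the first forward pass (prefill) pushes the whole prompt through every layer and returns the KV cache, while each later step (decode) feeds a single token and reuses that cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # a small model purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Prompt size affects time to first token because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: the entire prompt goes through the full network in one pass,
    # producing logits for the next token and the KV cache for every prompt token.
    prefill = model(input_ids, use_cache=True)
    past_key_values = prefill.past_key_values
    next_token = prefill.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Decode: each subsequent step feeds only the newest token and reuses the cache,
    # which is why generation after the first token is comparatively cheap.
    decode = model(next_token, past_key_values=past_key_values, use_cache=True)

print(f"prompt tokens processed during prefill: {input_ids.shape[1]}")
print(f"tokens processed during one decode step: {next_token.shape[1]}")
```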
Quadratic vs Linear Costs
The theoretical cost of the attention mechanism is quadratic relative to the context window size. However, recent empirical testing (conducted by Glean) shows that TTFT typically scales linearly with the actual number of input tokens [#2].
Why the discrepancy?
GPUs can skip unpopulated values in the context window [#2]. Modern implementations use sparse attention mechanisms and efficient memory layouts that avoid computing attention over empty slots.
Side-by-side comparison:
| Cost Model | Theoretical Complexity | Empirical Observation | Reason |
|---|---|---|---|
| Attention mechanism | O(n²) where n = context window | O(n) where n = actual input tokens | GPUs skip unpopulated context window slots |
| TTFT scaling | Quadratic with window size | Linear with input tokens | Sparse attention and efficient memory layouts |
💡 Takeaway: While attention is theoretically quadratic, production systems exhibit linear TTFT growth because modern implementations optimise for actual token count, not theoretical window size.
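Written side by side, the theoretical and empirical cost models look like this; the linear coefficient is taken from Glean's figures [#2], while the intercept standing in for fixed network and queuing overhead is an assumption, not a measured constant:

$$
\underbrace{O\!\left(n_{\text{window}}^{2}\right)}_{\text{theoretical attention cost}}
\quad\text{vs.}\quad
\text{TTFT}(n_{\text{input}}) \approx \beta_0 + \beta_1\, n_{\text{input}},
\qquad \beta_1 \approx 0.20\text{-}0.24\ \text{ms/token}
$$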
Conclusion
Prompt size directly impacts Time to First Token (TTFT) through the mechanics of the prefill stage. On dedicated infrastructure, every token you add to your prompt adds a small but measurable amount of user-perceived latency, on the order of 0.2ms per token.
Key Takeaways
- TTFT scales linearly with input tokens (the exact number may vary based on provider, infrastructure, and other factors, such as network round-trip latency)
- Prefill generates the KV cache before the first token appears
- The attention mechanism is theoretically O(n²) but empirically O(n) due to sparse implementations
- Real-time chat applications and edge deployments are most sensitive to TTFT
For production systems, especially those targeting interactive experiences, understanding this relationship is foundational.
When To Prioritise What
When to prioritise prompt optimisation:
- Dedicated infrastructure: Azure PTU, AWS Provisioned Throughput, self-hosted models
- High-volume applications: Small per-request savings compound at scale
- After controlling variance: Once network and infrastructure latency is consistent
When to prioritise other factors:
- Shared API infrastructure: Network routing and queuing variance typically dominates
- Geographic optimisation: Choosing the right region can save 100-200ms (more than prompt optimisation)
- First response experiences: Infrastructure baseline matters more than marginal token costs
The next time you’re debugging why your LLM feels sluggish, check your infrastructure setup and network latency first. Prompt size optimisation delivers the most value once these foundational factors are already optimised.
What’s Next
Understanding the mechanics of TTFT is the first step. In Part 2 (to be published in a few days), I will cover practical optimisation strategies: prefix caching, parallel processing, prompt engineering patterns, and edge-case handling for production systems.
References
- [#1] How Prompt Size Affects LLM Response Time in Voice AI, Talkative (May 2025)
- [#2] How input token count impacts the latency of AI chat tools, Glean (Jan 2026)
- [#3] Understand LLM latency and throughput metrics, Anyscale Docs
- [#4] ".. since KV cache scales linearly with the number of tokens and batch size, it can quickly overwhelm the memory capacity of existing GPUs under long context or large batch size settings .." (arXiv, May 2024)
- [#5] How KV caches impact time to first token for LLMs (Jan 2026)
ℹ️ Version History
- 02 Feb 2026 : Post published
- 03 Feb 2026 : Added own micro-benchmark using Claude Sonnet 4.5 via OpenRouter’s API
- 04 Feb 2026 : Extended own micro-benchmark to cover 5 LLMs