Understanding the mechanics of Time to First Token (TTFT) and why extra prompt tokens can lead to a poor user experience (UX).
TL;DR: Prompt size has a direct, linear relationship with TTFT on dedicated infrastructure, with per-token latency ranging from 0.04-0.24ms depending on the setup. However, on shared infrastructure, network and queuing variance (±500ms) often dominates the prompt processing signal (40-120ms). Understanding both the underlying mechanics and your deployment model is critical for optimising production systems.
The Problem: That Uncomfortable Pause
If you have built LLM-powered applications, you have probably noticed the uncomfortable pause between when a user finishes their input and when the AI begins responding. In real-time chat applications, this pause can make or break the perceived responsiveness of the system.
This is not about how fast the model generates tokens once it starts. This is about how long it takes to begin generating in the first place.
The likely culprit? The size of the prompt.
The Linear Relationship
Prompt size has a direct, linear relationship with the initial delay of an LLM response, specifically the TTFT.
As the number of input tokens increases, the time required for the model to process the prompt and begin generating a response also grows.
Quantifying the impact:
Multiple sources have documented this relationship, though with significant variance:
- Per-token latency: Every additional input token increases the P95 TTFT by approximately 0.24ms [#2]
- Bulk increments: Every additional 500 tokens in a prompt adds roughly 20-30 milliseconds of latency [#1]
- Cumulative effect: Reducing a prompt by 1,000 tokens results in a 100ms to 240ms reduction in TTFT [#5]
⚠️ Note: These estimates come from sources using different infrastructure (shared API vs. dedicated capacity), which explains the variance in per-token rates.
Available Data On Per-Token TTFT
Granular per-token latency data are rarely published. Across the available sources, per-token TTFT ranges from 0.04ms [#1] to 0.24ms [#2], depending on model and infrastructure type.
[Reference #1, Talkative] mentions:
“On average, every additional 500 tokens in a prompt increased response time by 20–30 milliseconds”
Per-token TTFT derived from this figure:
- Lower bound: 20ms ÷ 500 tokens = 0.04ms per token
- Upper bound: 30ms ÷ 500 tokens = 0.06ms per token
- Model: GPT-4o
- Infrastructure: OpenAI standard API
However, they also note:
“Even with small prompts, a request to OpenAI will add ~800ms of dead air time”
This 800ms baseline suggests significant network and queuing overhead beyond prompt processing.
[Reference #2, Glean] mentions:
“For every additional input token, the P95 TTFT increases by ~0.24ms and the average TTFT increases by ~0.20ms”
Per-token TTFT reported:
- P95: 0.24ms per token
- Average: 0.20ms per token
- Model: GPT-4 Turbo
- Infrastructure: Azure PTU (Provisioned Throughput Units - dedicated capacity)
- Test range: 50 to 100,000 tokens
Practical Reasons For Variance
This variance stems from factors such as:
- Infrastructure: Dedicated capacity (Azure PTU) vs. shared APIs
- Model architecture: GPT-4o is optimised for speed (roughly 2x faster than GPT-4 Turbo)
- Model size: Number of parameters
- Batch size and current load: Server utilisation impacts queuing
- Network round-trip latency: Geographic distance to API endpoint
Network Round-Trip Latency Example
Geographic differences alone (37ms vs 224ms) can exceed the entire prompt-processing overhead (40-120ms for 2,000 tokens), which is why, depending on the setup, infrastructure choices may matter more than prompt optimisation.
Below is an example (based on Azure's published network round-trip latency statistics) of the additional latency an application hosted in Singapore would experience when accessing various Azure OpenAI endpoints:
| API Client Location | API Endpoint (Azure region) | Network Round-Trip Latency |
|---|---|---|
| Singapore | South India | 37ms |
| Singapore | Australia East | 94ms |
| Singapore | West US | 171ms |
| Singapore | East US | 224ms |
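If you want to sanity-check these figures from your own deployment region, a rough proxy is the TCP connection time to each endpoint. The sketch below is a minimal example only: the hostnames are placeholders for your own Azure OpenAI resource endpoints, and TCP connect time approximates (but is not identical to) the round-trip latencies in the table above.

```python
import socket
import time

# Placeholder hostnames - substitute your own Azure OpenAI resource endpoints.
ENDPOINTS = {
    "South India": "example-southindia.openai.azure.com",
    "Australia East": "example-australiaeast.openai.azure.com",
    "East US": "example-eastus.openai.azure.com",
}

def tcp_connect_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Median TCP connect time in milliseconds - a rough proxy for network round-trip latency."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[len(timings) // 2]

for region, host in ENDPOINTS.items():
    print(f"{region}: ~{tcp_connect_ms(host):.0f} ms")
```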
Empirical Testing With My Own Micro-Benchmark
To validate these published estimates, I ran my own micro-benchmark using 5 LLMs on Azure OpenAI endpoints in Australia and India, plus OpenRouter’s shared infrastructure.
Key findings:
- Flat or inverted relationship: 4 of 5 LLMs showed larger prompts completing at similar or faster speeds than smaller ones → infrastructure variance overwhelmed the prompt-processing signal.
- Only one model (gpt-4.1-mini) showed a measurable positive rate: +0.017 ms/token, within the expected theoretical range.
- Network routing added 600–1,400ms before any model processing began → dwarfing the ~200ms a 5,000-token prompt theoretically adds.
The uncomfortable conclusion: on shared API infrastructure, optimising prompt size for TTFT yields no reliable improvement. The per-token latency (0.10–0.25ms) that matters on dedicated capacity disappears into noise on standard endpoints.
For full methodology, raw results, and the benchmark script, see my ttft-benchmark Git repository.
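For reference, the core measurement is simple: stream the response and time the arrival of the first chunk. The sketch below is not the script from the repository, just a minimal illustration using the OpenAI Python SDK; the model name is a placeholder, and the measured figure includes network and queuing time, not just prefill.

```python
import time
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft_ms(prompt: str, model: str = "gpt-4o-mini") -> float:
    """Time (ms) from sending the request to receiving the first streamed chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=1,  # only the first token matters for this measurement
    )
    for _ in stream:  # the first iteration completes when the first chunk arrives
        break
    return (time.perf_counter() - start) * 1000

# Compare a short prompt against the same prompt padded with filler text.
short_prompt = "Summarise the benefits of caching in one sentence."
long_prompt = short_prompt + " Ignore the following filler. " + ("lorem ipsum " * 2000)

print(f"short prompt: {measure_ttft_ms(short_prompt):.0f} ms")
print(f"long prompt:  {measure_ttft_ms(long_prompt):.0f} ms")
```

In practice you would repeat each measurement many times and compare percentiles rather than single runs, since the variance described above easily swamps any one data point.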
Why the Discrepancy with Published Benchmarks?
Published benchmarks used dedicated infrastructure (Azure PTU) specifically to eliminate confounding variables. My results on shared infrastructure reveal what happens in several production scenarios:
- Network routing variance: OpenRouter dynamically routes requests to different backend servers
- Queuing delays: Server-side queues introduce variable wait times (likely ~2.5-3s baseline)
- Cold starts: Some requests may hit servers requiring initialization
- Load balancing: Server load varies across requests
Per-Token TTFT Estimate Used In This Post
For this post, I will use 0.20-0.24ms per token as a conservative (worst-case) estimate, based on Glean's empirical testing [#2]. Per-token rates vary dramatically, so for production planning, benchmark your specific provider, model, and infrastructure configuration rather than relying on published estimates.
These numbers matter. In real-time chat applications, 200ms of added latency is noticeable and can lead to the perception that the system is “slow”. When running small language models (SLMs) locally, long prompts can add tens of seconds to response times.
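As a back-of-the-envelope model, TTFT can be approximated as a fixed infrastructure baseline plus a linear per-token cost. The sketch below is purely illustrative; the default values come from the figures cited above (the ~800ms shared-API baseline [#1] and Glean's P95 rate [#2]) and should be replaced with numbers from your own benchmarks.

```python
def estimated_ttft_ms(input_tokens: int,
                      per_token_ms: float = 0.24,  # P95 per-token rate on dedicated capacity [#2]
                      baseline_ms: float = 800.0   # network + queuing overhead on a shared API [#1]
                      ) -> float:
    """Back-of-the-envelope TTFT: a fixed overhead plus a linear per-token cost."""
    return baseline_ms + input_tokens * per_token_ms

for tokens in (500, 2_000, 10_000):
    print(f"{tokens:>6} input tokens -> ~{estimated_ttft_ms(tokens):.0f} ms TTFT")
```

Note how the baseline dominates until the prompt grows into the thousands of tokens, which mirrors the micro-benchmark findings above.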
The Mechanics of the Prefill Stage
The delay caused by prompt size primarily occurs during the prefill stage - the phase where the model processes the entire input sequence before generating its first output token.
Here is what happens during prefill:
KV Cache Generation
The model generates a Key-Value (KV) cache that acts as its “short-term memory” for the session [#3]. This cache:
- Stores the computed key and value vectors for each token in the input
- Grows linearly with the number of tokens [#4]
- Must be generated before the first completion token can be produced
- Can quickly overwhelm GPU memory under long context or large batch size settings [#4]
The KV cache is why the LLM can maintain coherent conversations without reprocessing the entire history on every turn - but building it is expensive.
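To get a feel for why the KV cache "can quickly overwhelm GPU memory", the standard sizing formula is 2 (keys and values) × layers × KV heads × head dimension × bytes per value, per token. The sketch below uses dimensions that roughly match a 7B-parameter model without grouped-query attention; these are assumptions for illustration, not measurements of any hosted model.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,     # assumed: 7B-class model
                   num_kv_heads: int = 32,   # assumed: no grouped-query attention
                   head_dim: int = 128,
                   bytes_per_value: int = 2  # fp16
                   ) -> int:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

for tokens in (1_000, 8_000, 32_000):
    print(f"{tokens:>6} tokens -> {kv_cache_bytes(tokens) / 1e9:.2f} GB of KV cache")
```

At roughly half a megabyte per token under these assumptions, a 32,000-token context already consumes around 17GB before any batching, which is why long-context and large-batch settings hit memory limits so quickly [#4].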
Computational Intensity
Processing these tokens is compute-intensive because the model must run the input through its entire Transformer network before producing the first completion token [#3]. This includes:
- Multi-head attention computations across all layers
- Feed-forward network passes
- Layer normalisations
- Position encodings
All of this work happens before the end user sees the first word of the response.
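Hosted APIs do all of this server-side, but the two phases are easy to see with an open model. The sketch below uses Hugging Face transformers with gpt2 purely as a small stand-in: the first forward pass (prefill) pushes the whole prompt through every layer and returns the KV cache, while each later step (decode) feeds a single token and reuses that cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # a small model purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Prompt size affects time to first token because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: the entire prompt goes through the full network in one pass,
    # producing logits for the next token and the KV cache for every prompt token.
    prefill = model(input_ids, use_cache=True)
    past_key_values = prefill.past_key_values
    next_token = prefill.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    # Decode: each subsequent step feeds only the newest token and reuses the cache,
    # which is why generation after the first token is comparatively cheap.
    decode = model(next_token, past_key_values=past_key_values, use_cache=True)

print(f"prompt tokens processed during prefill: {input_ids.shape[1]}")
print(f"tokens processed during one decode step: {next_token.shape[1]}")
```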
Quadratic vs Linear Costs
The theoretical cost of the attention mechanism is quadratic relative to the context window size. However, recent empirical testing (conducted by Glean) shows that TTFT typically scales linearly with the actual number of input tokens [#2].
Why the discrepancy?
GPUs can skip unpopulated values in the context window [#2]. Modern implementations use sparse attention mechanisms and efficient memory layouts that avoid computing attention over empty slots.
Side-by-side comparison:
| Cost Model | Theoretical Complexity | Empirical Observation | Reason |
|---|---|---|---|
| Attention mechanism | O(n²) where n = context window | O(n) where n = actual input tokens | GPUs skip unpopulated context window slots |
| TTFT scaling | Quadratic with window size | Linear with input tokens | Sparse attention and efficient memory layouts |
💡 Takeaway: While attention is theoretically quadratic, production systems exhibit linear TTFT growth because modern implementations optimise for actual token count, not theoretical window size.
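Written side by side, the theoretical and empirical cost models look like this; the linear coefficient is taken from Glean's figures [#2], while the intercept standing in for fixed network and queuing overhead is an assumption, not a measured constant:

$$
\underbrace{O\!\left(n_{\text{window}}^{2}\right)}_{\text{theoretical attention cost}}
\quad\text{vs.}\quad
\text{TTFT}(n_{\text{input}}) \approx \beta_0 + \beta_1\, n_{\text{input}},
\qquad \beta_1 \approx 0.20\text{-}0.24\ \text{ms/token}
$$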
Conclusion
Prompt size directly impacts Time to First Token (TTFT) through the mechanics of the prefill stage. On dedicated infrastructure, every token you add to your prompt adds a small but measurable amount of user-perceived latency, on the order of 0.2ms per token.
Key Takeaways
- TTFT scales linearly with input tokens (the exact number may vary based on provider, infrastructure, and other factors, such as network round-trip latency)
- Prefill generates the KV cache before the first token appears
- The attention mechanism is theoretically O(n²) but empirically O(n) due to sparse implementations
- Real-time chat applications and edge deployments are most sensitive to TTFT
For production systems, especially those targeting interactive experiences, understanding this relationship is foundational.
When To Prioritise What
When to prioritise prompt optimisation:
- Dedicated infrastructure: Azure PTU, AWS Provisioned Throughput, self-hosted models
- High-volume applications: Small per-request savings compound at scale
- After controlling variance: Once network and infrastructure latency is consistent
When to prioritise other factors:
- Shared API infrastructure: Network routing and queuing variance typically dominates
- Geographic optimisation: Choosing the right region can save 100-200ms (more than prompt optimisation)
- First response experiences: Infrastructure baseline matters more than marginal token costs
The next time you’re debugging why your LLM feels sluggish, check your infrastructure setup and network latency first. Prompt size optimisation delivers the most value once these foundational factors are already optimised.
What’s Next
Understanding the mechanics of TTFT is the first step. In Part 2 (to be published in a few days), I will cover practical optimisation strategies: prefix caching, parallel processing, prompt engineering patterns, and edge-case handling for production systems.
References
- [#1] How Prompt Size Affects LLM Response Time in Voice AI, Talkative (May 2025)
- [#2] How input token count impacts the latency of AI chat tools, Glean (Jan 2026)
- [#3] Understand LLM latency and throughput metrics, Anyscale Docs
- [#4] ".. since KV cache scales linearly with the number of tokens and batch size, it can quickly overwhelm the memory capacity of existing GPUs under long context or large batch size settings .." (arXiv, May 2024)
- [#5] How KV caches impact time to first token for LLMs (Jan 2026)
ℹ️ Version History
- 02 Feb 2026 : Post published
- 03 Feb 2026 : Added own micro-benchmark using Claude Sonnet 4.5 via OpenRouter’s API
- 04 Feb 2026 : Extended own micro-benchmark to cover 5 LLMs