Prompt Caching: The Cost Optimization Most Teams Haven't Touched Yet

Prompt caching can cut repeated context costs by up to 90%. Here is how it works across major providers and why most teams are not using it yet.

Andrej Gamser
14 Min Read

You are stuffing the same 8,000-token system prompt into every request. The same document. The same few-shot examples. Hundreds of times an hour. And you're paying full input token price on every single one.

We were doing this for months on a document analysis pipeline before someone actually looked at the per-request token breakdown. Input costs were 3x what we'd budgeted — not because the model was expensive, but because we were re-processing 12 pages of identical document context on every turn of a multi-turn conversation. The same bytes, tokenized and charged fresh, over and over.

Prompt caching fixes this. And for most teams the implementation cost is embarrassingly low — in many cases, a single line of config or literally nothing at all.

But there's more to it than cost. To understand why prompt caching matters this much, you have to understand what's actually happening on the provider's GPU cluster when you send a request.


TL;DR

  • If you're sending the same system prompt, document, or RAG chunks on every request, you're paying full input price every time — prompt caching drops that to 10–50% of the standard rate
  • Some providers cache automatically with zero code changes (for models such as gpt-5.4, deepseek-v4, grok-4.2, and gemini-3.1-pro-preview), so you might already be getting partial hits and not know it
  • claude-sonnet-4-6 requires explicit cache_control markers, and cache writes cost more than standard input (1.25x) — if your cache thrashes, you're paying a premium for nothing
  • The most common reason caching fails silently: a dynamic value (timestamp, session ID, user name) injected early in the prompt that busts the cache on every call
  • If you're not checking usage.prompt_tokens_details.cached_tokens in your API responses right now, you're flying blind to both cost and latency regressions


Why Caching Is a Performance Multiplier, Not Just a Cost Play

Most articles frame prompt caching as a billing optimization. It is. But the deeper win is latency and throughput.

LLM inference happens in two distinct phases: pre-fill and decode. During pre-fill, the model processes your entire input prompt to compute the Key-Value (KV) states for every token. Because attention mechanisms scale quadratically with sequence length, computing a 50,000-token prompt requires massive computation. This is why your Time to First Token (TTFT) is slow — it's entirely compute-bound.

During decode (generation), the model produces tokens one at a time. This phase is memory-bandwidth bound and comparatively fast per token.

When prompt caching is active, the provider stores the computed KV states of your prefix in GPU memory. When your next request hits that same node with the same prefix, the model skips pre-fill for the cached tokens and only has to process the uncached suffix before it starts generating.

What this means in practice:

  • Latency: TTFT drops significantly on large payloads. On heavy context windows (50k+ tokens), the difference between a cache hit and a miss can be the difference between sub-second and multi-second TTFT.
  • Throughput: Because you're not forcing the provider to burn compute on pre-fill, you're less likely to trigger aggressive server-side rate limits. Your effective Tokens Per Minute (TPM) capacity goes up.

So when you see a 90% discount on cached input tokens, the provider isn't being generous. They're charging you less because you're consuming dramatically less compute.
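
To get a feel for the scale, here is a rough back-of-the-envelope sketch. It counts only the (query, key) pairs that causal attention has to score during pre-fill and ignores everything else in the model, so treat it as an illustration rather than a cost model. The function name and the 49,000-token cached prefix are assumptions made for the example.

```python
def causal_attention_pairs(total_tokens: int, cached_prefix: int = 0) -> int:
    """Count (query, key) pairs scored during pre-fill with causal attention.

    Token i attends to tokens 1..i, so a full pre-fill scores N*(N+1)/2 pairs.
    With a cached prefix, only the uncached suffix tokens act as queries.
    """
    if not 0 <= cached_prefix <= total_tokens:
        raise ValueError("cached_prefix must be between 0 and total_tokens")
    full = total_tokens * (total_tokens + 1) // 2
    skipped = cached_prefix * (cached_prefix + 1) // 2
    return full - skipped


# Illustrative numbers: a 50,000-token prompt, 49,000 tokens of it already cached.
cold = causal_attention_pairs(50_000)
warm = causal_attention_pairs(50_000, cached_prefix=49_000)
print(f"cold: {cold:,} pairs, warm: {warm:,} pairs ({warm / cold:.1%} of cold)")
```

Under these assumptions the warm request scores roughly 4% of the pairs the cold one does, which is the kind of gap that shows up as TTFT.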


How It Works Across Providers

Zero-Configuration Providers

Several providers cache automatically. You don't change your request payload. You don't set any flags. It just happens.

The list: OpenAI, DeepSeek, Google AI Studio, Google Vertex AI, Grok, Moonshot AI, and Baseten.

The specifics that matter:

  • Google AI Studio / Vertex AI: Implicit caching on gemini-3.1-pro-preview and newer models. Cache reads cost 0.10x the standard input price — a 90% discount. TTL is typically 3–5 minutes. No storage cost. No configuration.
  • OpenAI: Automatic caching on gpt-5.4 and others. Minimum threshold of 1,024 tokens before caching kicks in. Cache reads priced at 0.25x to 0.50x depending on the model. If your system prompt is under 1,024 tokens, caching won't activate — either accept it or add more detailed instructions (which you probably should have anyway).
  • DeepSeek: deepseek-v4 caches automatically. Competitive pricing on cache reads. Works well for long-context scenarios.
  • Grok: grok-4.2 caches automatically with zero configuration.

When you're routing through FastRouter, sticky routing keeps requests in the same conversation pinned to the same provider endpoint. This matters because the KV cache lives on a specific physical GPU node. If consecutive requests in a conversation bounce between different backend instances via load balancing, the cache stays cold on all of them. Without sticky routing, expect highly volatile cache hit rates — anywhere from 20% to 60% depending entirely on the provider's internal traffic shaping.

Anthropic Claude: Explicit Opt-In Required

claude-sonnet-4-6 is the notable exception to the "it just works" pattern. You must explicitly flag where cache breakpoints should occur using cache_control markers.

FastRouter supports two approaches, documented at docs.fastrouter.ai/prompt-caching:

Top-level caching — the simpler option. You add a single cache_control field at the request root. FastRouter places the cache breakpoint automatically and advances it as the conversation grows. This is what you want for most chat and conversational use cases.

Per-block caching — for precise control. When you have a large stable payload (a reference document, a codebase, a character definition) that you want cached exactly, you place cache_control on individual content blocks. Up to four breakpoints per request.

Here's where the economics get interesting and where teams mess up:

  • Cache reads: 0.10x standard input price
  • Cache writes: 1.25x standard input price (5-minute TTL) or 2x (1-hour TTL)

Read that again. Cache writes cost more than a standard request. The first time you populate the cache, you're paying a premium. With the 5-minute TTL you're ahead after a single cache hit (1.25x + 0.10x = 1.35x, versus 2.0x uncached); with the 1-hour TTL it takes two hits (2x + 0.10x + 0.10x = 2.2x, versus 3.0x). If your cache gets busted on every request because of a prompt structure issue, you're actually paying more than if you had no caching at all.

We learned this the hard way. We enabled caching on a pipeline where a timestamp was embedded in the system prompt. Every request was a cache write (1.25x cost) and none was a cache read, so there were no savings to offset it. We were paying 25% more than baseline on those prompt tokens until someone noticed the pattern in the billing dashboard.
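
The arithmetic is easy to sanity-check yourself. The sketch below expresses the cost of the cached prefix in units of "one uncached send", using the multipliers quoted above. The function names are made up for this example, and it assumes the cache never expires between requests.

```python
def cached_prefix_cost(requests: int, write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Cost of a prefix over `requests` calls: one cache write, then reads.

    Units are multiples of the standard input price for that prefix.
    Assumes every request after the first is a cache hit.
    """
    if requests < 1:
        return 0.0
    return write_mult + read_mult * (requests - 1)


def thrashing_cost(requests: int, write_mult: float = 1.25) -> float:
    """Cost when the prefix changes every call, so every request is a write."""
    return write_mult * requests


for n in (1, 2, 3, 10):
    print(
        f"{n:>2} requests | uncached {n * 1.0:.2f}x"
        f" | 5-min TTL {cached_prefix_cost(n):.2f}x"
        f" | 1-hour TTL {cached_prefix_cost(n, write_mult=2.0):.2f}x"
        f" | thrashing {thrashing_cost(n):.2f}x"
    )
```

The thrashing column is the timestamp bug: it grows at 1.25x per request and never gets the read discount.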


Making Cache Hits Actually Happen

This is the section that matters most. Getting caching enabled is easy. Getting a high cache hit rate requires understanding one rule:

Static content first. Dynamic content last.

Your prompt sequence typically looks like this:

[system prompt] → [documents/context] → [conversation history] → [user message]

Everything from the start of the sequence up to the first byte that differs from the previous request is cacheable. Everything after that first difference is processed at full price.
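
A quick way to see this on your own prompts is to diff two consecutive requests and measure how much of the start they share. The helper below is a minimal sketch: it compares characters, whereas providers match on tokens, so read the result as an approximation. The function names and sample prompts are illustrative.

```python
def shared_prefix_length(previous: str, current: str) -> int:
    """Number of leading characters two rendered prompts have in common."""
    limit = min(len(previous), len(current))
    for i in range(limit):
        if previous[i] != current[i]:
            return i
    return limit


def cacheable_fraction(previous: str, current: str) -> float:
    """Share of the current prompt that repeats the previous one's prefix."""
    if not current:
        return 0.0
    return shared_prefix_length(previous, current) / len(current)


document = "DOC " * 2000  # stand-in for a large, stable document

# Dynamic value first: the shared prefix ends inside the timestamp.
bad_prev = "time=12:00:01\n" + document + "Q1"
bad_curr = "time=12:00:02\n" + document + "Q2"

# Dynamic value last: the whole document is shared.
good_prev = document + "\ntime=12:00:01 Q1"
good_curr = document + "\ntime=12:00:02 Q2"

print(f"dynamic first: {cacheable_fraction(bad_prev, bad_curr):.1%} shared")
print(f"dynamic last:  {cacheable_fraction(good_prev, good_curr):.1%} shared")
```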

Good caching candidates:

  • System prompts that don't change between requests
  • Documents being analyzed across multiple conversation turns
  • Codebase or file tree context in coding assistants
  • Shared few-shot examples
  • RAG chunks that are stable within a session

Cache busters — things that destroy your hit rate:

  • Timestamps anywhere in the cached prefix
  • Session IDs or request IDs injected early in the prompt
  • Randomized few-shot example ordering
  • User names or account details injected before your stable context
  • Any per-request dynamic value that appears before your static content

The most common failure mode I've seen: a team injects a current_time or user_context block between the system prompt and the document context. The system prompt might get cached (if it crosses the minimum token threshold), but the much larger document context — often thousands of tokens — never hits cache because it comes after the dynamic block.

The fix is straightforward. Move the dynamic content to the end:

[system prompt] → [static documents] → [stable RAG chunks] → [dynamic user context] → [user message]

This maximizes the cacheable prefix length.
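
In code, the ordering rule can be enforced in one place: the function that assembles the request. The builder below is a sketch under assumed names (build_messages and its parameters are hypothetical, not part of any SDK). It keeps the system prompt and documents in a fixed leading block and pushes per-request values into the final user message.

```python
def build_messages(system_prompt: str, documents: list[str], history: list[dict],
                   user_context: str, user_message: str) -> list[dict]:
    """Assemble chat messages with static content first and dynamic content last."""
    # Stable across requests: system prompt plus documents, always in the same order.
    static_prefix = system_prompt + "\n\n" + "\n\n".join(documents)
    messages = [{"role": "system", "content": static_prefix}]
    # Conversation history only grows at the end, so earlier turns stay in the prefix.
    messages.extend(history)
    # Per-request values (timestamps, user details) travel with the newest message.
    messages.append({"role": "user", "content": f"{user_context}\n\n{user_message}"})
    return messages


first = build_messages("You are an analyst.", ["<contract text>"], [],
                       "current_time=2025-01-01T12:00:01Z", "Summarize clause 4.")
second = build_messages("You are an analyst.", ["<contract text>"], [],
                        "current_time=2025-01-01T12:00:09Z", "Summarize clause 4.")
print(first[0] == second[0])  # the leading block is identical across requests
```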

For RAG applications specifically: if your retrieved chunks vary per query, they're hard to cache at the chunk level. But if you have a stable preamble of instructions before the chunks, cache that. You won't cache the chunks themselves, but you'll cache everything before them. For document analysis with multiple turns on the same document, the document itself is an excellent caching target — it's large, stable within a session, and appears in every request.


Implementing Caching via FastRouter

For zero-config providers, nothing changes in your request. Caching happens server-side. For Anthropic, you need to add cache markers. Here's what per-block caching looks like against claude-sonnet-4-6 through FastRouter.

cURL

```bash
curl https://api.fastrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FASTROUTER_API_KEY" \
  -d '{
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a senior codebase assistant. Here is the entire project context: [INSERT LARGE STABLE PAYLOAD]..."
          },
          {
            "type": "text",
            "text": "End of project context. Follow these strict formatting rules...",
            "cache_control": { "type": "ephemeral" }
          }
        ]
      },
      {
        "role": "user",
        "content": "Why is the redis connection failing in worker.ts?"
      }
    ]
  }'
```

Python (OpenAI SDK pointed at FastRouter)

```python
import os
from openai import OpenAI

# OpenAI SDK pointed at FastRouter's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://api.fastrouter.ai/api/v1",
    api_key=os.environ.get("FASTROUTER_API_KEY"),
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a senior codebase assistant. [INSERT LARGE STABLE PAYLOAD]...",
                },
                {
                    # Cache breakpoint: everything up to and including this block
                    "type": "text",
                    "text": "End of project context. Follow these strict rules...",
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "Why is the redis connection failing in worker.ts?",
        },
    ],
)

# usage, prompt_tokens_details, or cached_tokens may be missing or None; treat all as 0
details = getattr(response.usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", 0) or 0
print(f"Cached tokens: {cached}")
```

TypeScript

```typescript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://api.fastrouter.ai/api/v1',
  apiKey: process.env.FASTROUTER_API_KEY,
});

async function runCachedQuery() {
  const response = await client.chat.completions.create({
    model: 'anthropic/claude-sonnet-4-6',
    messages: [
      {
        role: 'system',
        content: [
          {
            type: 'text',
            text: 'You are a senior codebase assistant. [INSERT LARGE STABLE PAYLOAD]...',
          },
          {
            type: 'text',
            text: 'End of project context. Follow these strict rules...',
            // @ts-ignore - FastRouter extension mapped to Anthropic cache_control
            cache_control: { type: 'ephemeral' },
          },
        ],
      },
      {
        role: 'user',
        content: 'Why is the redis connection failing in worker.ts?',
      },
    ],
  });

  // Fall back to 0 when the provider omits the cache details
  const cachedTokens =
    (response.usage as any)?.prompt_tokens_details?.cached_tokens ?? 0;
  console.log(`Cached Tokens: ${cachedTokens}`);
}

// Surface request failures instead of leaving an unhandled rejection
runCachedQuery().catch(console.error);
```

Note: The cache_control field on content blocks is a FastRouter extension that gets translated to Anthropic's native caching format. The @ts-ignore in TypeScript is deliberate: the OpenAI SDK's type system doesn't know about this field. If your setup differs from what's shown here, check the FastRouter docs rather than guessing at parameters.


Checking Whether Caching Is Working

Don't guess. Measure.

API responses include a prompt_tokens_details object. Here's what to look for:

```json
{
  "usage": {
    "prompt_tokens": 8500,
    "completion_tokens": 350,
    "total_tokens": 8850,
    "prompt_tokens_details": {
      "cached_tokens": 7200,
      "cache_write_tokens": 0
    }
  }
}
```

  • cached_tokens > 0: Cache is being hit. You're saving money and compute on those tokens.
  • cache_write_tokens > 0 and cached_tokens == 0: You're populating the cache but not reading from it. Either this is the first request in a session (expected) or your cache is getting busted on every request (problem).
  • Both are zero: Caching isn't active, or your prompt is below the minimum token threshold.
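
Those three cases translate directly into a small check you can run on every response. This is a minimal sketch that takes the usage object as a plain dict shaped like the JSON above; the function name and the returned labels are made up for this example.

```python
def cache_status(usage: dict) -> str:
    """Classify one response's usage block as 'hit', 'write_only', or 'inactive'."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    written = details.get("cache_write_tokens") or 0
    if cached > 0:
        return "hit"         # reading from cache on at least part of the prompt
    if written > 0:
        return "write_only"  # first request in a session, or the cache is being busted
    return "inactive"        # caching not active, or prompt below the minimum threshold


sample_usage = {
    "prompt_tokens": 8500,
    "completion_tokens": 350,
    "total_tokens": 8850,
    "prompt_tokens_details": {"cached_tokens": 7200, "cache_write_tokens": 0},
}
print(cache_status(sample_usage))
```

A single "write_only" is normal. A long run of them on the same conversation is the signal to go look at your prompt structure.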

Per-request cache usage also shows up in the Activity Logs on the FastRouter dashboard. This is easier than parsing response bodies at scale when you're trying to spot patterns.

If you haven't looked at these numbers yet, do it now. Teams on zero-config providers often find caching is already partially working — but the hit rate is 20% when it should be 80% because of a dynamic value injected in the wrong place.


Real Failure Modes

Beyond the timestamp-in-prefix problem, here are failure modes we've encountered or seen other teams hit:

Cache thrashing from conversation history growth. In a multi-turn conversation, the conversation history grows on every turn. If the history is inserted before your document context, the cache gets busted every turn. Put stable context (system prompt, documents) before growing context (conversation history).

Minimum token thresholds not met. OpenAI requires 1,024 tokens minimum. Anthropic's thresholds vary by model. If your system prompt is 500 tokens, it won't be cached on gpt-5.4. Either pad the stable prefix with more detailed instructions or accept that caching won't help for that particular prompt.

Load balancer bouncing killing cache hit rate. If you're calling providers directly and load balancing across multiple API keys or instances, consecutive requests might hit different backend caches. The KV cache lives on specific GPU nodes — if your request lands on a different node, the cache isn't there. Sticky routing solves this. Without it, every request is potentially a cache miss.

TTL expiration on idle users. Default TTLs are short — typically 3–5 minutes. If your users go idle for 10 minutes between messages, the cache is cold when they come back. Don't build cost projections assuming 100% cache hit rates. In practice, expect 60–80% hit rates on active conversations and much lower on bursty, irregular traffic.

High-concurrency LRU eviction. Providers have finite GPU VRAM. The stated TTL is a ceiling, not a guarantee. If you send a burst of traffic across many unique sessions, the provider's VRAM fills up and older caches get evicted before the TTL expires via Least Recently Used (LRU) policies. When your user replies at minute 4, they may suffer a cold start. This introduces bimodality in your latency distribution — p50 TTFT looks great, but p95 spikes hard.

Over-caching with Anthropic's write premium. With claude-sonnet-4-6, every cache write costs 1.25x. If you're caching a prefix that only gets reused once before the TTL expires, you're paying 1.25x + 0.10x = 1.35x across two requests, versus 2.0x without caching (1.0x + 1.0x). Still a win. But if the cache write happens and the user never sends a follow-up? You paid 1.25x instead of 1.0x. For single-turn workloads on Anthropic, caching can cost you more.

Assuming cached = free. Cached tokens are cheaper, not free. Factor the read cost into your unit economics. And on Anthropic, the write cost is above baseline — your actual savings depend entirely on how many subsequent requests hit the cache before TTL expiration.


When Prompt Caching Delivers the Most Value

The workloads with the highest return share a pattern: high request volume + large repeated context + stable prompt prefixes.

  • Document analysis — same document, multiple user turns. The document tokens get cached after the first turn. Every subsequent turn reads from cache.
  • Coding assistants — codebase or file tree injected as context on every request. Often 5,000–20,000 tokens of stable context. Cache it.
  • Evaluation pipelines — running many prompt variations against the same base prompt and test cases. The base prompt caches; the variations are the dynamic suffix.
  • Batch classification — processing thousands of items with a shared system prompt and few-shot examples. High request volume keeps the cache warm.
  • RAG pipelines with session-stable chunks — if your retrieved chunks don't change within a conversation session, they're cacheable.

Where caching doesn't move the needle: short interactive prompts with minimal repeated context, single-turn workloads with no follow-up, highly variable inputs where almost nothing is stable across requests. Don't bother optimizing what doesn't matter.


Monitoring Cache Performance

You cannot optimize what you do not measure. A cache miss still returns a valid 200 response — it just takes longer and costs more. This makes cache degradation a silent failure.

Track three metrics at the application level:

  1. Cache Hit Ratio: cached_tokens / prompt_tokens per request. Aggregate this per endpoint. If it drops below your expected baseline, investigate prompt structure changes.
  2. TTFT correlated with cache hits: Measure the time from request dispatch to the first streamed chunk. Plot this against your cache hit ratio. The correlation should be obvious — when cache hits drop, TTFT spikes.
  3. Cache Write Frequency (Anthropic only): If cache_write_tokens is consistently high and cached_tokens is consistently zero, your prompt structure is volatile and you're burning money on write penalties.
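
Here is one way these three metrics could be collected in application code. It is a sketch, not a monitoring library: the CacheStats class, the endpoint name, and the TTFT values are all invented for illustration, and it assumes you already measure TTFT yourself and pass it in.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CacheStats:
    """Running cache totals for one endpoint."""
    prompt_tokens: int = 0
    cached_tokens: int = 0
    write_tokens: int = 0
    ttft_hit: list = field(default_factory=list)   # seconds, requests with cached tokens
    ttft_miss: list = field(default_factory=list)  # seconds, requests without

    def record(self, usage: dict, ttft_seconds: float) -> None:
        """Fold one response's usage block and measured TTFT into the totals."""
        details = usage.get("prompt_tokens_details") or {}
        cached = details.get("cached_tokens") or 0
        self.prompt_tokens += usage.get("prompt_tokens") or 0
        self.cached_tokens += cached
        self.write_tokens += details.get("cache_write_tokens") or 0
        (self.ttft_hit if cached > 0 else self.ttft_miss).append(ttft_seconds)

    @property
    def hit_ratio(self) -> float:
        """Aggregate cached_tokens / prompt_tokens across recorded requests."""
        return self.cached_tokens / self.prompt_tokens if self.prompt_tokens else 0.0

    def needs_attention(self, min_hit_ratio: float) -> bool:
        """True when traffic has been recorded and the ratio is below the baseline."""
        return self.prompt_tokens > 0 and self.hit_ratio < min_hit_ratio


stats_by_endpoint = defaultdict(CacheStats)

# Illustrative values only: one cache miss (write) followed by one hit.
stats_by_endpoint["/analyze"].record(
    {"prompt_tokens": 8500,
     "prompt_tokens_details": {"cached_tokens": 0, "cache_write_tokens": 8300}},
    ttft_seconds=2.1,
)
stats_by_endpoint["/analyze"].record(
    {"prompt_tokens": 8500,
     "prompt_tokens_details": {"cached_tokens": 7200, "cache_write_tokens": 0}},
    ttft_seconds=0.4,
)

for endpoint, stats in stats_by_endpoint.items():
    print(endpoint, f"hit ratio {stats.hit_ratio:.0%}",
          "ALERT" if stats.needs_attention(0.6) else "ok")
```

Keeping the TTFT samples split by hit and miss makes the correlation in metric 2 easy to plot later.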

Add alerting on cache metrics the same way you'd alert on error rates or latency spikes. A sudden drop in cache hit rate could indicate a deployed prompt template change, a bug that's injecting dynamic content in the wrong position, or even prompt injection attacks modifying your prefix.


What to Do This Week

  1. Audit your current cache hit rates. Pull the last 48 hours of API responses. Check prompt_tokens_details for your requests in the Activity Log. If you're on a zero-config provider and sending repeated system prompts over 1,024 tokens, you should be seeing cached_tokens > 0. If you're not, you have a prompt structure problem.
  2. Restructure one prompt template. Pick your highest-volume endpoint. Move all static content (system prompt, documents, few-shot examples) to the front. Move all dynamic content (user context, timestamps, session data) to the end. Deploy it and compare cache hit rates before and after.
  3. Add cache markers for Anthropic. If you're using claude-sonnet-4-6 and haven't added cache_control markers, you're getting zero caching. Add per-block caching to your primary endpoint. The code examples above show exactly how. Check cached_tokens in the response after your first few requests to confirm it's working.
  4. Set up monitoring. Log cached_tokens and cache_write_tokens per request. Create a dashboard view plotting cache hit ratio alongside TTFT. If cache hit rate drops, you'll know immediately — and you'll have the TTFT correlation to prove the impact.
  5. Run the cost math. Take your last month's input token costs on your highest-volume endpoint. Estimate the cacheable portion. Multiply by the discount you'd get on those tokens: 0.90 for providers with 0.10x cache reads, or 0.50 for OpenAI's worst case. Then multiply by your expected hit rate (60–80% on active conversations is a realistic range). That's your rough savings estimate. If the number is material, prioritize the prompt restructuring work this sprint.
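
For the cost math, keep in mind that cached tokens are still billed at the read rate, so the saving on each cached token is one minus the read multiplier, scaled by how often you actually hit the cache. The sketch below works that through. The function name and the dollar figures are hypothetical, and it ignores Anthropic's write premium, so treat the result as an upper-end estimate on that provider.

```python
def estimated_savings(input_cost: float, cacheable_fraction: float,
                      read_multiplier: float, hit_rate: float = 1.0) -> float:
    """Rough savings on input spend from prompt caching.

    input_cost:         current input token spend for the period
    cacheable_fraction: share of input tokens in the stable prefix (0..1)
    read_multiplier:    cache read price relative to standard input (0.10, 0.50, ...)
    hit_rate:           share of requests expected to hit the cache (0..1)
    """
    for name, value in (("cacheable_fraction", cacheable_fraction),
                        ("read_multiplier", read_multiplier),
                        ("hit_rate", hit_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return input_cost * cacheable_fraction * (1.0 - read_multiplier) * hit_rate


# Hypothetical month: $10,000 of input spend, 80% of it a stable prefix.
best_case = estimated_savings(10_000, 0.8, read_multiplier=0.10, hit_rate=0.7)
worst_case = estimated_savings(10_000, 0.8, read_multiplier=0.50, hit_rate=0.7)
print(f"0.10x reads: ${best_case:,.0f}   0.50x reads: ${worst_case:,.0f}")
```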

This isn't a multi-sprint project. The prompt restructuring is typically a one-line change. The monitoring is a few log lines. The savings start on the next request after deployment.


FastRouter is an LLM gateway providing a single OpenAI-compatible endpoint to 150+ models. Prompt caching is supported across all major providers with automatic sticky routing to maximize cache hit rates. No markup on API calls. fastrouter.ai

