Stop Paying Full Price for Tokens You've Already Sent
Cut LLM costs on repeated context with Prompt Caching on FastRouter. Automatic for OpenAI, DeepSeek, and Gemini. One field for Anthropic Claude.

If your prompts include long system instructions, RAG chunks, or document context, you're likely paying full input price every single time — even when that content hasn't changed. Prompt Caching on FastRouter fixes that automatically.
The Problem
Most LLM applications send the same large blocks of context with every request. A legal assistant that always loads a 10,000-token contract. A customer support bot that always includes a full product knowledge base. An eval pipeline that always sends the same scoring rubric. Every call, same tokens, full price.
At scale, that's a significant and entirely avoidable cost.
How Prompt Caching Works on FastRouter
Prompt Caching reduces the cost of repeated context by charging a fraction of the normal input price when a cache hit occurs. The exact discount depends on the provider — but it's substantial across the board.
For most providers, caching is completely automatic. For Anthropic Claude, you opt in with a single field.
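Here's a rough back-of-envelope sketch of what that discount means in dollars. The prices and volumes below are illustrative placeholders, not FastRouter's actual rates:

```python
# Back-of-envelope savings estimate for a cached prompt prefix.
# All numbers are illustrative placeholders; substitute your model's real rates.
INPUT_PRICE_PER_MTOK = 3.00    # $ per 1M uncached input tokens (example)
CACHE_READ_MULTIPLIER = 0.10   # example cache-read rate (varies by provider)

static_prefix_tokens = 10_000  # system prompt + RAG context reused every call
dynamic_tokens = 500           # the part that changes per request
requests_per_day = 20_000

def daily_cost(cache_hit: bool) -> float:
    prefix_rate = CACHE_READ_MULTIPLIER if cache_hit else 1.0
    per_request_tokens = static_prefix_tokens * prefix_rate + dynamic_tokens
    return per_request_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK * requests_per_day

print(f"without caching: ${daily_cost(False):,.2f}/day")  # $630.00/day
print(f"with caching:    ${daily_cost(True):,.2f}/day")   # $90.00/day
```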
Zero-Config Providers — It Just Works
For the following providers, FastRouter handles caching automatically. No changes to your requests needed:
| Provider | Cache write | Cache read |
|---|---|---|
| OpenAI | Free | 0.25x – 0.50x input price |
| DeepSeek | Same as input | ~0.10x input price |
| Google AI Studio | Free | 0.10x input price |
| Google Vertex AI | Free | 0.10x input price |
| Grok | Free | See provider pricing |
| Moonshot AI | Free | See provider pricing |
| Baseten | Free | See provider pricing |
OpenAI requires a minimum of 1,024 tokens before caching applies.
Google Gemini 2.5 and Gemini 3.x models on AI Studio and Vertex AI support implicit caching — FastRouter keeps your prompt prefixes stable to maximise cache hits automatically. The 0.10x cache-read rate (a 90% discount on input tokens) applies to all Gemini 2.5 and newer models, including the current Gemini 3.1 series. Cache TTL is typically 3–5 minutes.
To maximise cache hits across all providers: keep large static content (system instructions, RAG context, few-shot examples) at the beginning of your prompt, and push dynamic content to the end.
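As a concrete sketch of that ordering, here's one way to keep the static prefix byte-for-byte identical across requests; the content strings are placeholders:

```python
# Keep large static content first and byte-for-byte identical across requests,
# so the provider's prefix-based cache can match it; only the tail varies.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp..."   # placeholder
STATIC_CONTEXT = "<product knowledge base, few-shot examples, rubric>"  # placeholder

def build_messages(user_question: str) -> list[dict]:
    return [
        # Stable prefix: never interleave per-request data in here.
        {"role": "system", "content": f"{STATIC_SYSTEM_PROMPT}\n\n{STATIC_CONTEXT}"},
        # Dynamic content goes last so the shared prefix stays cacheable.
        {"role": "user", "content": user_question},
    ]
```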
Minimum token thresholds for Gemini:
| Model | Min tokens |
|---|---|
| Gemini 3.1 Pro | 4,096 |
| Gemini 3.1 Flash | 1,024 |
| Gemini 3.1 Flash-Lite | 1,024 |
| Gemini 2.5 Pro | 4,096 |
| Gemini 2.5 Flash | 1,024 |
| Gemini 2.5 Flash-Lite | 1,024 |
Anthropic Claude — Opt-In with cache_control
Anthropic requires you to explicitly mark what should be cached. FastRouter supports two approaches.
Option A — Top-level (recommended for chat)
Add cache_control once at the request root. FastRouter automatically places the cache breakpoint at the last cacheable block and advances it as the conversation grows.
```json
{
  "model": "anthropic/claude-sonnet-4.6",
  "cache_control": { "type": "ephemeral" },
  "messages": [...]
}
```
This top-level form only works when the request is routed to Anthropic directly.
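If you call FastRouter through an OpenAI-compatible SDK, the top-level field can be passed as an extra body parameter. A minimal sketch, assuming an OpenAI-compatible endpoint at api.fastrouter.ai/v1 and an API key in a FASTROUTER_API_KEY environment variable (check the docs for the exact values):

```python
# Sketch: sending the top-level cache_control through an OpenAI-compatible
# client. The base URL and env var name are assumptions, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fastrouter.ai/v1",   # assumed endpoint
    api_key=os.environ["FASTROUTER_API_KEY"],  # assumed env var name
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": "<long, stable system prompt>"},
        {"role": "user", "content": "Summarize the findings."},
    ],
    # extra_body merges this field into the request root, matching Option A.
    extra_body={"cache_control": {"type": "ephemeral"}},
)
print(response.usage)
```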
Option B — Per-block (for precise control)
Place cache_control on individual content blocks. Useful when you have a large stable payload — a document, RAG chunks, or a character card — and want to cache exactly that block. Maximum 4 breakpoints per request.
```json
{
  "messages": [
    {
      "role": "system",
      "content": [
        { "type": "text", "text": "You are a research assistant." },
        {
          "type": "text",
          "text": "<large document>",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Summarize the findings." }
  ]
}
```
Per-block caching works across both Anthropic and Google Vertex AI.
Anthropic cache TTL and model minimums:
| TTL | Syntax | Write cost | Read cost |
|---|---|---|---|
| 5 min (default) | `{ "type": "ephemeral" }` | 1.25x input | 0.10x input |
| Min tokens | Models |
|---|---|
| 4,096 | Opus 4.5, 4.6, 4.7 · Haiku 4.5 |
| 2,048 | Sonnet 4.6 · Haiku 3.5 |
| 1,024 | Sonnet 4, 4.5 · Opus 4, 4.1 · Sonnet 3.7 |
Checking Whether You're Hitting the Cache
Every API response includes a prompt_tokens_details object. If cached_tokens is greater than 0, you're hitting the cache:
```json
"prompt_tokens_details": {
  "cached_tokens": 10318,
  "cache_write_tokens": 0
}
```
You can also check per-request cache usage directly in the Activity Logs page flyout on the FastRouter dashboard.
Stack It With Flex
Prompt Caching and Flex Processing work together. For background pipelines with repeated context, combining :flex with prompt caching can reduce input token costs to a small fraction of the standard rate — 0.10x cache read price on top of the ~50% Flex discount, depending on the provider.
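A sketch of stacking the two, reusing the assumed client and build_messages helper from the earlier examples; the :flex suffix and cache_control field go on the same request:

```python
# Sketch: a background-pipeline request that stacks Flex pricing with prompt
# caching. Reuses `client` and `build_messages` from the sketches above;
# the ":flex" model suffix enables FastRouter's Flex Processing.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6:flex",  # Flex-priced routing
    messages=build_messages("Score this transcript against the rubric."),
    extra_body={"cache_control": {"type": "ephemeral"}},  # cache the stable prefix
)
```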
Get Started
Prompt Caching is available now. For zero-config providers, there's nothing to do — it's already on. For Anthropic Claude, add cache_control to your next request and check the cached_tokens field in the response to confirm.
Documentation: docs.fastrouter.ai/prompt-caching