Stop Paying Full Price for Tokens You've Already Sent
Cut LLM costs on repeated context with Prompt Caching on FastRouter. Automatic for OpenAI, DeepSeek, and Gemini. One field for Anthropic Claude.

If your prompts include long system instructions, RAG chunks, or document context, you're likely paying full input price every single time — even when that content hasn't changed. Prompt Caching on FastRouter fixes that automatically.
The Problem
Most LLM applications send the same large blocks of context with every request. A legal assistant that always loads a 10,000-token contract. A customer support bot that always includes a full product knowledge base. An eval pipeline that always sends the same scoring rubric. Every call, same tokens, full price.
At scale, that's a significant and entirely avoidable cost.
How Prompt Caching Works on FastRouter
Prompt Caching reduces the cost of repeated context by charging a fraction of the normal input price when a cache hit occurs. The exact discount depends on the provider — but it's substantial across the board.
For most providers, caching is completely automatic. For Anthropic Claude, you opt in with a single field.
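Here's a rough back-of-envelope sketch of what that discount means in dollars. The prices and volumes below are illustrative placeholders, not FastRouter's actual rates:

```python
# Back-of-envelope savings estimate for a cached prompt prefix.
# All numbers are illustrative placeholders; substitute your model's real rates.
INPUT_PRICE_PER_MTOK = 3.00    # $ per 1M uncached input tokens (example)
CACHE_READ_MULTIPLIER = 0.10   # example cache-read rate (varies by provider)

static_prefix_tokens = 10_000  # system prompt + RAG context reused every call
dynamic_tokens = 500           # the part that changes per request
requests_per_day = 20_000

def daily_cost(cache_hit: bool) -> float:
    prefix_rate = CACHE_READ_MULTIPLIER if cache_hit else 1.0
    per_request_tokens = static_prefix_tokens * prefix_rate + dynamic_tokens
    return per_request_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK * requests_per_day

print(f"without caching: ${daily_cost(False):,.2f}/day")  # $630.00/day
print(f"with caching:    ${daily_cost(True):,.2f}/day")   # $90.00/day
```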
Zero-Config Providers — It Just Works
For the following providers, FastRouter handles caching automatically. No changes to your requests needed:
| Provider | Cache write | Cache read |
|---|---|---|
| OpenAI | Free | 0.25x – 0.50x input price |
| DeepSeek | Same as input | ~0.10x input price |
| Google AI Studio | Free | 0.10x input price |
| Google Vertex AI | Free | 0.10x input price |
| Grok | Free | See provider pricing |
| Moonshot AI | Free | See provider pricing |
| Baseten | Free | See provider pricing |
OpenAI requires a minimum of 1,024 tokens before caching applies.
Google Gemini 2.5 and Gemini 3.x models on AI Studio and Vertex AI support implicit caching — FastRouter keeps your prompt prefixes stable to maximise cache hits automatically. The 0.10x cache-read rate (a 90% discount on input tokens) applies to all Gemini 2.5 and newer models, including the current Gemini 3.1 series. Cache TTL is typically 3–5 minutes.
To maximise cache hits across all providers: keep large static content (system instructions, RAG context, few-shot examples) at the beginning of your prompt, and push dynamic content to the end.
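As a concrete sketch of that ordering, here's one way to keep the static prefix byte-for-byte identical across requests; the content strings are placeholders:

```python
# Keep large static content first and byte-for-byte identical across requests,
# so the provider's prefix-based cache can match it; only the tail varies.
STATIC_SYSTEM_PROMPT = "You are a support assistant for Acme Corp..."   # placeholder
STATIC_CONTEXT = "<product knowledge base, few-shot examples, rubric>"  # placeholder

def build_messages(user_question: str) -> list[dict]:
    return [
        # Stable prefix: never interleave per-request data in here.
        {"role": "system", "content": f"{STATIC_SYSTEM_PROMPT}\n\n{STATIC_CONTEXT}"},
        # Dynamic content goes last so the shared prefix stays cacheable.
        {"role": "user", "content": user_question},
    ]
```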
Minimum token thresholds for Gemini:
| Model | Min tokens |
|---|---|
| Gemini 3.1 Pro | 4,096 |
| Gemini 3.1 Flash | 1,024 |
| Gemini 3.1 Flash-Lite | 1,024 |
| Gemini 2.5 Pro | 4,096 |
| Gemini 2.5 Flash | 1,024 |
| Gemini 2.5 Flash-Lite | 1,024 |
Anthropic Claude — Opt-In with cache_control
Anthropic requires you to explicitly mark what should be cached. FastRouter supports two approaches.
Option A — Top-level (recommended for chat)
Add cache_control once at the request root. FastRouter automatically places the cache breakpoint at the last cacheable block and advances it as the conversation grows.
```json
{
  "model": "anthropic/claude-sonnet-4.6",
  "cache_control": { "type": "ephemeral" },
  "messages": [...]
}
```
This top-level form only works when the request is routed to Anthropic directly.
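If you call FastRouter through an OpenAI-compatible SDK, the top-level field can be passed as an extra body parameter. A minimal sketch, assuming an OpenAI-compatible endpoint at api.fastrouter.ai/v1 and an API key in a FASTROUTER_API_KEY environment variable (check the docs for the exact values):

```python
# Sketch: sending the top-level cache_control through an OpenAI-compatible
# client. The base URL and env var name are assumptions, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fastrouter.ai/v1",   # assumed endpoint
    api_key=os.environ["FASTROUTER_API_KEY"],  # assumed env var name
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6",
    messages=[
        {"role": "system", "content": "<long, stable system prompt>"},
        {"role": "user", "content": "Summarize the findings."},
    ],
    # extra_body merges this field into the request root, matching Option A.
    extra_body={"cache_control": {"type": "ephemeral"}},
)
print(response.usage)
```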
Option B — Per-block (for precise control)
Place cache_control on individual content blocks. Useful when you have a large stable payload — a document, RAG chunks, or a character card — and want to cache exactly that block. Maximum 4 breakpoints per request.
```json
{
  "messages": [
    {
      "role": "system",
      "content": [
        { "type": "text", "text": "You are a research assistant." },
        {
          "type": "text",
          "text": "<large document>",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Summarize the findings." }
  ]
}
```
Per-block caching works across both Anthropic and Google Vertex AI.
Anthropic cache TTL and model minimums:
| TTL | Syntax | Write cost | Read cost |
|---|---|---|---|
| 5 min (default) | `{ "type": "ephemeral" }` | 1.25x input | 0.10x input |
| Min tokens | Models |
|---|---|
| 4,096 | Opus 4.5, 4.6, 4.7 · Haiku 4.5 |
| 2,048 | Sonnet 4.6 · Haiku 3.5 |
| 1,024 | Sonnet 4, 4.5 · Opus 4, 4.1 · Sonnet 3.7 |
Checking Whether You're Hitting the Cache
Every API response includes a prompt_tokens_details object. If cached_tokens is greater than 0, you're hitting the cache:
```json
"prompt_tokens_details": {
  "cached_tokens": 10318,
  "cache_write_tokens": 0
}
```
You can also check per-request cache usage directly in the Activity Logs page flyout on the FastRouter dashboard.
Stack It With Flex
Prompt Caching and Flex Processing work together. For background pipelines with repeated context, combining :flex with prompt caching can reduce input token costs to a small fraction of the standard rate — 0.10x cache read price on top of the ~50% Flex discount, depending on the provider.
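A sketch of stacking the two, reusing the assumed client and build_messages helper from the earlier examples; the :flex suffix and cache_control field go on the same request:

```python
# Sketch: a background-pipeline request that stacks Flex pricing with prompt
# caching. Reuses `client` and `build_messages` from the sketches above;
# the ":flex" model suffix enables FastRouter's Flex Processing.
response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4.6:flex",  # Flex-priced routing
    messages=build_messages("Score this transcript against the rubric."),
    extra_body={"cache_control": {"type": "ephemeral"}},  # cache the stable prefix
)
```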
Get Started
Prompt Caching is available now. For zero-config providers, there's nothing to do — it's already on. For Anthropic Claude, add cache_control to your next request and check the cached_tokens field in the response to confirm.
Documentation: docs.fastrouter.ai/prompt-caching