Stop Paying Full Price for Tokens You've Already Sent

Cut LLM costs on repeated context with Prompt Caching on FastRouter. Automatic for OpenAI, DeepSeek, and Gemini. One field for Anthropic Claude.

Andrej Gamser

If your prompts include long system instructions, RAG chunks, or document context, you're likely paying full input price every single time — even when that content hasn't changed. Prompt Caching on FastRouter fixes that automatically.


The Problem

Most LLM applications send the same large blocks of context with every request. A legal assistant that always loads a 10,000-token contract. A customer support bot that always includes a full product knowledge base. An eval pipeline that always sends the same scoring rubric. Every call, same tokens, full price.

At scale, that's a significant and entirely avoidable cost.


How Prompt Caching Works on FastRouter

Prompt Caching reduces the cost of repeated context by charging a fraction of the normal input price when a cache hit occurs. The exact discount depends on the provider — but it's substantial across the board.
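
For a rough sense of scale: on a request where 9,000 of 10,000 prompt tokens hit a cache priced at 0.10x the input rate, you pay the equivalent of 1,000 + 9,000 × 0.10 = 1,900 full-price input tokens, roughly an 80% cut on that request's input cost (exact rates vary by provider; see the tables below).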

For most providers, caching is completely automatic. For Anthropic Claude, you opt in with a single field.


Zero-Config Providers — It Just Works

For the following providers, FastRouter handles caching automatically. No changes to your requests needed:

| Provider | Cache write | Cache read |
| --- | --- | --- |
| OpenAI | Free | 0.25x – 0.50x input price |
| DeepSeek | Same as input | ~0.10x input price |
| Google AI Studio | Free | 0.10x input price |
| Google Vertex AI | Free | 0.10x input price |
| Grok | Free | See provider pricing |
| Moonshot AI | Free | See provider pricing |
| Baseten | Free | See provider pricing |

OpenAI requires a minimum of 1,024 tokens before caching applies.

Google Gemini 2.5 and Gemini 3.x models on AI Studio and Vertex AI support implicit caching — FastRouter keeps your prompt prefixes stable to maximise cache hits automatically. The 0.10x cache-read rate (a 90% discount on input tokens) applies to all Gemini 2.5 and newer models, including the current Gemini 3.1 series. Cache TTL is typically 3–5 minutes.

To maximise cache hits across all providers: keep large static content (system instructions, RAG context, few-shot examples) at the beginning of your prompt, and push dynamic content to the end.
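
As a sketch (the model slug and content here are illustrative, not prescriptive), a cache-friendly request keeps the large static block at the front and leaves only the final user turn to vary between calls:

```json
{
  "model": "openai/gpt-4.1",
  "messages": [
    {
      "role": "system",
      "content": "You are a support assistant. <static instructions and product knowledge base>"
    },
    { "role": "user", "content": "How do I reset my password?" }
  ]
}
```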

Minimum token thresholds for Gemini:

| Model | Min tokens |
| --- | --- |
| Gemini 3.1 Pro | 4,096 |
| Gemini 3.1 Flash | 1,024 |
| Gemini 3.1 Flash-Lite | 1,024 |
| Gemini 2.5 Pro | 4,096 |
| Gemini 2.5 Flash | 1,024 |
| Gemini 2.5 Flash-Lite | 1,024 |


Anthropic Claude — Opt-In with cache_control

Anthropic requires you to explicitly mark what should be cached. FastRouter supports two approaches.

Option A — Top-level (recommended for chat)

Add cache_control once at the request root. FastRouter automatically places the cache breakpoint at the last cacheable block and advances it as the conversation grows.

```json
{
  "model": "anthropic/claude-sonnet-4.6",
  "cache_control": { "type": "ephemeral" },
  "messages": [...]
}
```

The top-level form only works when the request is routed directly to Anthropic.

Option B — Per-block (for precise control)

Place cache_control on individual content blocks. Useful when you have a large stable payload — a document, RAG chunks, or a character card — and want to cache exactly that block. Maximum 4 breakpoints per request.

```json
{
  "messages": [
    {
      "role": "system",
      "content": [
        { "type": "text", "text": "You are a research assistant." },
        {
          "type": "text",
          "text": "<large document>",
          "cache_control": { "type": "ephemeral" }
        }
      ]
    },
    { "role": "user", "content": "Summarize the findings." }
  ]
}
```

Per-block caching works across both Anthropic and Vertex.

Anthropic cache TTL and model minimums:

| TTL | Syntax | Write cost | Read cost |
| --- | --- | --- | --- |
| 5 min (default) | { "type": "ephemeral" } | 1.25x input | 0.10x input |

| Min tokens | Models |
| --- | --- |
| 4,096 | Opus 4.5, 4.6, 4.7 · Haiku 4.5 |
| 2,048 | Sonnet 4.6 · Haiku 3.5 |
| 1,024 | Sonnet 4, 4.5 · Opus 4, 4.1 · Sonnet 3.7 |


Checking Whether You're Hitting the Cache

Every API response includes a prompt_tokens_details object. If cached_tokens is greater than 0, you're hitting the cache:

```json
"prompt_tokens_details": {
  "cached_tokens": 10318,
  "cache_write_tokens": 0
}
```
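
On the first request that creates the cache you would expect the reverse, with cached_tokens at 0 and the written prefix reported instead (illustrative numbers; field availability can vary by provider):

```json
"prompt_tokens_details": {
  "cached_tokens": 0,
  "cache_write_tokens": 10318
}
```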

You can also check per-request cache usage directly in the Activity Logs page flyout on the FastRouter dashboard.


Stack It With Flex

Prompt Caching and Flex Processing work together. For background pipelines with repeated context, combining :flex with prompt caching can reduce input token costs to a small fraction of the standard rate — 0.10x cache read price on top of the ~50% Flex discount, depending on the provider.
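
As a back-of-the-envelope example: tokens that both hit the cache at a 0.10x read rate and run under the ~50% Flex discount land at roughly 0.10 × 0.5 = 0.05x the standard input price, though the exact figure depends on the provider and on how the two discounts compose.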


Get Started

Prompt Caching is available now. For zero-config providers, there's nothing to do — it's already on. For Anthropic Claude, add cache_control to your next request and check the cached_tokens field in the response to confirm.

Documentation: docs.fastrouter.ai/prompt-caching

FastRouter — The LLM Gateway Built for Production | fastrouter.ai
