Prompt Caching

Stop paying full price for repeated context

Long system prompts, RAG chunks, and documents are cached across requests-so repeated input tokens are billed at a fraction of the normal price, with no extra infrastructure to run.

No credit card required · Free to start

Prompt cache
Prompt · 12,480 tokens94% reused

System prompt + RAG context + document

cached

Latest user question

new
First requestCache write

Cached input billed

Standard

Request cost

$0.01248

The static prefix is processed once, then cached for reuse.

Repeat requestCache read

Cached input billed

0.10× input

Request cost

$0.00184

11,776 tokens served from cache · cache_write_tokens: 0.

On a repeatCached input ↓ 90%
Why prompt caching

Cheaper repeated context, automatically

Turn the static parts of your prompts-instructions, context, and documents-into low-cost cached tokens across the providers you already use.

Up to 90% cheaper input

Cached context is billed as low as 0.10× the normal input price on supported providers-so repeated system prompts, RAG chunks, and documents cost a fraction of the first call.

Zero-config on major providers

OpenAI, DeepSeek, Google AI Studio and Vertex AI, Grok, Moonshot AI, and Baseten cache automatically-no changes to your requests needed.

Sticky routing keeps caches warm

FastRouter pins each conversation to the provider whose cache is already primed, then falls back automatically if that provider goes down.

How it works

From repeated prefix to cache hit

Put your stable content first, and FastRouter's sticky routing keeps each conversation on a provider whose cache is already warm-so repeat calls read the prefix instead of reprocessing it.

Request

Stable prefix first

Static contextDynamic tail
  • Keep system prompts, RAG chunks, and documents at the start.
  • Push the changing user input to the end of the prompt.

FastRouter

Sticky routing

Conversation-awareCache-aware
  • Pins each conversation to a provider whose cache stays warm.
  • Falls back automatically if that provider goes down.

First call

Writes the prefix

The provider processes and stores the static prefix in its cache.

Repeat call

Reads from cache

Cached tokens are billed at a fraction of the normal input price.

200 OK · usage
Cache hit
"usage": {
"prompt_tokens": 12480,
"prompt_tokens_details": {
"cached_tokens": 11776,
"cache_write_tokens": 0
}
}
cached_tokens > 0You're hitting the cache

See your cache savings on every response

Prompt caching is transparent. Every response reports exactly how many tokens were served from cache, so you can verify savings without guesswork.

  • prompt_tokens_details reports cached_tokens and cache_write_tokens-a cached_tokens value above 0 means you are hitting the cache.
  • See per-request cache usage in the Activity Logs page flyout on the FastRouter dashboard.
Sticky routing

Keep every conversation on a warm cache

Cache hits only happen when repeat requests reach the same provider. FastRouter pins each conversation to one endpoint so its cache stays warm-while different conversations still spread across providers.

Conversation-aware pinning

A conversation is identified by hashing the first system and first user message, so each one consistently routes to the same provider.

Only when it pays off

Sticky routing kicks in only when a provider's cache-read price is lower than its regular input price-otherwise requests route normally.

Automatic fallback

If the pinned provider goes down, FastRouter falls back automatically. A manual provider.order always takes precedence and skips sticky routing.

Sticky routing

conv a1f3c9

Conversation identified by hashing the first system + first user message.

Req 1Req 2Req 3
Provider AWarm

cache read 0.10×

88% cache hit rate

If Provider A goes down, FastRouter falls back automatically. A manual provider.order always takes precedence.

Zero-config providers

Automatic caching across the providers you already use

Most providers cache repeated prompt prefixes with no request changes at all. FastRouter keeps your prefixes stable to maximize hits and bills cache reads at the provider's discounted rate.

No request changes

OpenAI, DeepSeek, Google AI Studio and Vertex AI, Grok, Moonshot AI, and Baseten all cache automatically.

Implicit Gemini caching

Gemini 2.5 and newer cache implicitly at 0.10× read-a 90% discount-and FastRouter keeps prompt prefixes stable to maximize cache hits.

Minimum token thresholds

Caching applies past a per-model minimum-for example 1,024 tokens on OpenAI and Gemini Flash, and 4,096 on Gemini 2.5 Pro.

Zero-config providers

Automatic
ProviderCache read
  • OpenAI0.25–0.50×
  • DeepSeek~0.10×
  • Google AI Studio0.10×
  • Google Vertex AI0.10×
  • GrokProvider rate
  • Moonshot AIProvider rate
  • BasetenProvider rate

Cache write is free on most providers. Gemini 2.5+ caches implicitly, read at 0.10× input.

Anthropic Claude

Precise cache control when you need it

Anthropic caches what you explicitly mark with cache_control. Add it once for chat, or place it on individual blocks to cache exactly a document or RAG payload.

Top-level for chat

Add cache_control once at the request root and FastRouter places the breakpoint at the last cacheable block, advancing it as the conversation grows.

Per-block for precision

Mark up to 4 individual content blocks to cache exactly a document, RAG chunks, or a character card-this works across Anthropic and Vertex.

Choose your TTL

Pick 5-minute (default) or 1-hour ephemeral caching; both read at 0.10× input, with the 1-hour option ideal for long sessions.

cache_control

Anthropic
{
"model": "anthropic/claude-sonnet-4.6",
"cache_control": { "type": "ephemeral" },
"messages": [ … ]
}
Top-level for chatPer-block · max 4
5 mindefault

read 0.10×

1 hourttl: 1h

read 0.10×

Provider support

Zero-config caching vs explicit cache_control

Most providers cache automatically, while Anthropic gives you explicit, per-block control. Either way, sticky routing keeps the cache warm across requests.

Comparison of zero-config provider caching and Anthropic cache_control
How it worksZero-config providersOpenAI, Gemini, DeepSeek…Anthropic ClaudeExplicit cache_control
Setup
Caches automaticallyIncludedNot included
Needs cache_control markersNot includedIncluded
Per-block control (max 4 breakpoints)Not includedIncluded
Pricing
Cache read price0.10×–0.50× input0.10× input
Cache write costFree on most1.25× (5m) / 2× (1h)
Control
Configurable TTLNot includedProvider-managedIncluded5m or 1h
Sticky routing keeps cache warmIncludedIncluded

Cache read and write prices are multiples of each provider's standard input price. Anthropic caching requires explicit cache_control.

Use cases

Built for prompts that repeat their context

Anywhere the same instructions, context, or documents show up again, prompt caching turns that repeated input into low-cost cached tokens.

Long system prompts

Reuse detailed system instructions and tool definitions across every request instead of paying to reprocess them on each call.

RAG and document Q&A

Cache large retrieved context and documents so repeated questions over the same material are billed at a fraction of the input price.

Multi-turn chat and agents

Keep growing conversation history on a warm cache with sticky routing, so each new turn only pays full price for the latest message.

High-volume pipelines

Front-load few-shot examples once and reuse them across high-volume classification, extraction, and generation jobs.

FAQ

Prompt caching questions, answered

Prompt caching reduces the cost of repeated context-long system prompts, RAG chunks, and documents-by charging a fraction of the normal input price when that content is served from a provider's cache. It caches repeated prompt content across requests, so you don't pay full price to reprocess the same prefix every time.

Response caching returns a full stored response for an identical or similar request through FastRouter's own gateway cache. Prompt caching instead reuses the repeated input tokens-the prompt prefix-at a provider's discounted cache-read rate, while the model still generates a fresh completion. They're complementary: response caching saves the whole call, prompt caching saves the repeated context within a call.

OpenAI, DeepSeek, Google AI Studio, Google Vertex AI, Grok, Moonshot AI, and Baseten cache automatically with no changes to your requests. Google's Gemini 2.5 and newer models also cache implicitly. Note that OpenAI requires a minimum of 1,024 tokens before caching applies.

Anthropic requires you to explicitly mark what to cache with cache_control. You can add it once at the request root (recommended for chat), where FastRouter places the breakpoint at the last cacheable block and advances it as the conversation grows, or place it on up to 4 individual content blocks for precise control. You can choose a 5-minute (default) or 1-hour TTL, and both read at 0.10× input.

When a request benefits from caching, FastRouter pins subsequent requests for that model and conversation to the same provider endpoint so the cache stays warm. A conversation is identified by hashing the first system and first user message. Sticky routing only kicks in when a provider's cache-read price is lower than its regular input price, and FastRouter falls back automatically if that provider goes down. A manual provider.order takes precedence and skips sticky routing.

Keep large static content-system instructions, RAG context, and few-shot examples-at the beginning of your prompt, and push dynamic content to the end. FastRouter keeps your prompt prefixes stable to help. For Gemini implicit caching, the cache TTL is typically 3-5 minutes, so frequent requests with a stable prefix get the most reuse.

Every API response includes a prompt_tokens_details object with cached_tokens and cache_write_tokens. A cached_tokens value greater than 0 means you're hitting the cache. You can also see per-request cache usage in the Activity Logs page flyout on the FastRouter dashboard.

Start caching repeated context today

Keep your static prefixes warm with sticky routing and let supported providers serve repeated context at a fraction of the price-across every model you route to.