Up to 90% cheaper input
Cached context is billed as low as 0.10× the normal input price on supported providers-so repeated system prompts, RAG chunks, and documents cost a fraction of the first call.
Long system prompts, RAG chunks, and documents are cached across requests-so repeated input tokens are billed at a fraction of the normal price, with no extra infrastructure to run.
No credit card required · Free to start
System prompt + RAG context + document
cachedLatest user question
newCached input billed
Standard
Request cost
$0.01248
The static prefix is processed once, then cached for reuse.
Cached input billed
0.10× input
Request cost
$0.00184
11,776 tokens served from cache · cache_write_tokens: 0.
Turn the static parts of your prompts-instructions, context, and documents-into low-cost cached tokens across the providers you already use.
Cached context is billed as low as 0.10× the normal input price on supported providers-so repeated system prompts, RAG chunks, and documents cost a fraction of the first call.
OpenAI, DeepSeek, Google AI Studio and Vertex AI, Grok, Moonshot AI, and Baseten cache automatically-no changes to your requests needed.
FastRouter pins each conversation to the provider whose cache is already primed, then falls back automatically if that provider goes down.
Put your stable content first, and FastRouter's sticky routing keeps each conversation on a provider whose cache is already warm-so repeat calls read the prefix instead of reprocessing it.
Request
FastRouter
First call
The provider processes and stores the static prefix in its cache.
Repeat call
Cached tokens are billed at a fraction of the normal input price.
Prompt caching is transparent. Every response reports exactly how many tokens were served from cache, so you can verify savings without guesswork.
Cache hits only happen when repeat requests reach the same provider. FastRouter pins each conversation to one endpoint so its cache stays warm-while different conversations still spread across providers.
A conversation is identified by hashing the first system and first user message, so each one consistently routes to the same provider.
Sticky routing kicks in only when a provider's cache-read price is lower than its regular input price-otherwise requests route normally.
If the pinned provider goes down, FastRouter falls back automatically. A manual provider.order always takes precedence and skips sticky routing.
Sticky routing
Conversation identified by hashing the first system + first user message.
cache read 0.10×
88% cache hit rate
If Provider A goes down, FastRouter falls back automatically. A manual provider.order always takes precedence.
Most providers cache repeated prompt prefixes with no request changes at all. FastRouter keeps your prefixes stable to maximize hits and bills cache reads at the provider's discounted rate.
OpenAI, DeepSeek, Google AI Studio and Vertex AI, Grok, Moonshot AI, and Baseten all cache automatically.
Gemini 2.5 and newer cache implicitly at 0.10× read-a 90% discount-and FastRouter keeps prompt prefixes stable to maximize cache hits.
Caching applies past a per-model minimum-for example 1,024 tokens on OpenAI and Gemini Flash, and 4,096 on Gemini 2.5 Pro.
Zero-config providers
Cache write is free on most providers. Gemini 2.5+ caches implicitly, read at 0.10× input.
Anthropic caches what you explicitly mark with cache_control. Add it once for chat, or place it on individual blocks to cache exactly a document or RAG payload.
Add cache_control once at the request root and FastRouter places the breakpoint at the last cacheable block, advancing it as the conversation grows.
Mark up to 4 individual content blocks to cache exactly a document, RAG chunks, or a character card-this works across Anthropic and Vertex.
Pick 5-minute (default) or 1-hour ephemeral caching; both read at 0.10× input, with the 1-hour option ideal for long sessions.
cache_control
read 0.10×
read 0.10×
Most providers cache automatically, while Anthropic gives you explicit, per-block control. Either way, sticky routing keeps the cache warm across requests.
| How it works | Zero-config providersOpenAI, Gemini, DeepSeek… | Anthropic ClaudeExplicit cache_control |
|---|---|---|
| Setup | ||
| Caches automatically | Included | Not included |
| Needs cache_control markers | Not included | Included |
| Per-block control (max 4 breakpoints) | Not included | Included |
| Pricing | ||
| Cache read price | 0.10×–0.50× input | 0.10× input |
| Cache write cost | Free on most | 1.25× (5m) / 2× (1h) |
| Control | ||
| Configurable TTL | Not includedProvider-managed | Included5m or 1h |
| Sticky routing keeps cache warm | Included | Included |
Cache read and write prices are multiples of each provider's standard input price. Anthropic caching requires explicit cache_control.
Anywhere the same instructions, context, or documents show up again, prompt caching turns that repeated input into low-cost cached tokens.
Reuse detailed system instructions and tool definitions across every request instead of paying to reprocess them on each call.
Cache large retrieved context and documents so repeated questions over the same material are billed at a fraction of the input price.
Keep growing conversation history on a warm cache with sticky routing, so each new turn only pays full price for the latest message.
Front-load few-shot examples once and reuse them across high-volume classification, extraction, and generation jobs.
Prompt caching reduces the cost of repeated context-long system prompts, RAG chunks, and documents-by charging a fraction of the normal input price when that content is served from a provider's cache. It caches repeated prompt content across requests, so you don't pay full price to reprocess the same prefix every time.
Response caching returns a full stored response for an identical or similar request through FastRouter's own gateway cache. Prompt caching instead reuses the repeated input tokens-the prompt prefix-at a provider's discounted cache-read rate, while the model still generates a fresh completion. They're complementary: response caching saves the whole call, prompt caching saves the repeated context within a call.
OpenAI, DeepSeek, Google AI Studio, Google Vertex AI, Grok, Moonshot AI, and Baseten cache automatically with no changes to your requests. Google's Gemini 2.5 and newer models also cache implicitly. Note that OpenAI requires a minimum of 1,024 tokens before caching applies.
Anthropic requires you to explicitly mark what to cache with cache_control. You can add it once at the request root (recommended for chat), where FastRouter places the breakpoint at the last cacheable block and advances it as the conversation grows, or place it on up to 4 individual content blocks for precise control. You can choose a 5-minute (default) or 1-hour TTL, and both read at 0.10× input.
When a request benefits from caching, FastRouter pins subsequent requests for that model and conversation to the same provider endpoint so the cache stays warm. A conversation is identified by hashing the first system and first user message. Sticky routing only kicks in when a provider's cache-read price is lower than its regular input price, and FastRouter falls back automatically if that provider goes down. A manual provider.order takes precedence and skips sticky routing.
Keep large static content-system instructions, RAG context, and few-shot examples-at the beginning of your prompt, and push dynamic content to the end. FastRouter keeps your prompt prefixes stable to help. For Gemini implicit caching, the cache TTL is typically 3-5 minutes, so frequent requests with a stable prefix get the most reuse.
Every API response includes a prompt_tokens_details object with cached_tokens and cache_write_tokens. A cached_tokens value greater than 0 means you're hitting the cache. You can also see per-request cache usage in the Activity Logs page flyout on the FastRouter dashboard.
Keep your static prefixes warm with sticky routing and let supported providers serve repeated context at a fraction of the price-across every model you route to.