Response Caching

Serve repeated requests in milliseconds

Add a cache_key and FastRouter caches LLM responses across providers-identical and similar requests return instantly at a fraction of the cost, while everything else routes as normal.

Get started for free Book a demo

No credit card required · Free to start

~90% lower cost
on cache hits

Response cache

cache_key: myapp-faq

"Tell me about physics"

First requestCache MISS

Latency

1,240 ms

Cost

$0.00030

Routed to the provider, then stored in the cache.

Repeat requestCache HIT

Latency

8 ms

Cost

$0.00003

Served from cache · similarity 0.92 · billed at 0.1×.

On a hitLatency ↓ 99% · Cost ↓ 90%

Why response caching

Caching that's automatic, accurate, and in your control

Turn repeated and similar prompts into instant, near-free responses-without changing how you call models.

Instant cache hits

Cache hits return in under 10ms, so repeated and similar prompts skip the model round-trip entirely and feel instant to your users.

Up to 90% cheaper on hits

Cached responses are billed at 0.1× standard token pricing, and cache storage itself is free-so repeat traffic costs a fraction of the original call.

Exact and semantic matching

Reuse responses for identical requests or paraphrases with a tunable similarity threshold-plus per-request control over TTL, model, and provider matching.

How it works

From request to cache hit in one lookup

Add a cache_key and FastRouter checks the cache before calling a model. A hit returns instantly; a miss calls the model, then stores the response for next time.

Request

Chat completion

cache_keycache { }

Send a cache_key header to switch caching on.
An optional cache object tunes TTL and matching.

FastRouter

Cache lookup

Exact matchSemantic match

Hashes prompt, sampling params, and optional model.
Reuses near-matches above your similarity threshold.

On hit

Returns instantly

The stored response is returned in under 10ms at 0.1× cost.

On miss

Calls the model

FastRouter routes to the provider, then stores the result for reuse.

Cache configuration

Enabled

cache_key

myapp-faq

expiration_time

3600sTTL 1h

filter_on_model

filter_on_provider

conversation_mode

full_conversation

similarity_threshold0.75

AggressiveDefaultExact 1.0

Gateway caching, configured per request

Response Caching is FastRouter's own gateway cache-distinct from provider-native prompt caching-so it works the same way across every provider you route to. Tune it per request with a small cache object, or keep the defaults and just send a cache_key.

expiration_time sets the TTL in seconds-3600 by default, adjustable from 60 to 86,400.
Custom cache_key namespaces like myapp-faq or user_123_session keep cache scopes clean and predictable.

Matching & lookup

Match the requests that should share an answer

FastRouter combines your cache_key with hashed request attributes to build the lookup key, then applies a similarity score to decide whether a near-match is close enough to reuse.

Exact and semantic

A similarity_threshold of 1.0 matches only identical requests; the 0.75 default reuses minor rewordings and paraphrases.

Filter on model or provider

filter_on_model (on by default) and filter_on_provider (off by default) decide whether a match must share the same model or provider.

Sampling-aware keys

temperature, top_p, and max_tokens are always part of the key, so different sampling never returns the wrong cached response.

Cache lookup key

Hashed

Included in the key

prompt_messagesAlways
temperatureAlways
top_pAlways
max_tokensAlways
modelfilter_on_model

Ignored for matching

providerstreamusernstop

Conversation modes

Cache the right context for every chat

Multi-turn chats need different matching than one-off questions. Choose how much conversation context counts toward a cache match so answers stay relevant.

Full conversation

Match on the entire message history-ideal for stateful, multi-turn conversations that build on context.

Last message only

Match on just the latest user message, perfect for FAQs and stateless bots where each question stands alone.

Last N turns

Match on the last N user-assistant pairs plus the system message for context-aware assistants. One turn is one user message and one response.

Conversation matching

SUA

full_conversation
Entire message history
SUAU
last_message_only
Only the last user message
SUAU
last_n_turns
Last N turns + system
SUUA

Hits, misses & billing

Every response shows how it was served

Cache behavior is transparent on every call. Responses carry cache metadata, hits are billed at a fraction of the cost, and streaming works either way.

Cache metadata on every call

Responses include cached: true and a similarity score on hits, so you always know what came from cache.

Hits billed at 0.1×

Cache hits are billed at one-tenth of standard token pricing; misses bill normally and are stored for next time.

Instant or streamed

Hits return instantly, or stream chunk-by-chunk when stream is true-so caching never breaks your streaming UX.

200 OK · response

Cache HIT

{

"cached": true,

"similarity": 0.92,

"usage": {

"total_tokens": 193,

"cost": 0.00002956

}

Billed at 0.1×~90% saved vs miss

Hit vs miss

What changes between a hit and a miss

A cache hit and a cache miss return the same shape of response-but the speed, cost, and provider load are worlds apart.

Comparison of cache hit and cache miss behavior
Behavior	Cache HITServed from cache	Cache MISSRouted to provider
Performance
Typical response time	Under 10ms	Provider latency
Skips the model round-trip	Included	Not included
Cost
Token billing	0.1× standard	Standard
~90% cost savings	Included	Not included
Behavior
Calls the upstream provider	Not included	Included
Stored for future reuse	IncludedAlready stored	Included
Works with streaming	Included	Included

Cache hits are billed at 0.1× standard token pricing; cache storage is free.

Use cases

Built for traffic that repeats itself

Wherever the same or similar prompts come up again, caching turns them into instant, low-cost responses-without extra infrastructure.

Chatbots and agents

Reuse answers across repeated agent and chatbot exchanges so common turns return instantly instead of hitting the model every time.

FAQs and support flows

Serve frequently asked questions straight from cache with last_message_only matching-fast, consistent answers for stateless bots.

Dashboards and summaries

Back data summaries and reporting views with cached responses so repeated loads stay fast and cheap under heavy refresh traffic.

Predictable, repetitive APIs

Cache product descriptions, classifications, and other repeatable queries under a custom cache_key namespace for precise control.

FAQ

Response caching questions, answered

Send a cache_key header with your request. The cache_key is a namespace that groups related requests under a shared cache scope-if it's omitted, caching is disabled. You can fine-tune behavior with an optional cache object in the request body, but the header alone is enough to start caching.

FastRouter combines your cache_key with hashed request attributes-the prompt messages, temperature, top_p, max_tokens, and optionally the model (filter_on_model) or provider (filter_on_provider)-to form the lookup key. The similarity_threshold is then applied to decide whether a near-match is close enough to reuse. Parameters like stream, user, n, and stop are ignored when matching.

Cache lifetime is set per request with expiration_time, the TTL in seconds. It defaults to 3600 seconds (one hour) and can be set anywhere from 60 seconds up to 86,400 seconds (24 hours).

The similarity_threshold controls how closely a request must match to reuse a cached response. A value of 1.0 returns only exact matches, the 0.75 default reuses minor rewordings and paraphrases, and lower values reuse more aggressively. If no cached entry meets the threshold, the request is treated as a cache miss.

Cache hits are billed at 0.1× standard token pricing-roughly a 90% saving-while cache misses bill at the standard rate and are then stored for reuse. Cache storage itself is free.

Yes. On a cache hit with stream set to true, the cached response is chunked and streamed back; with streaming off it returns instantly. A cache miss streams from the provider as usual and stores the response so the next identical or similar request can be served from cache.

Start caching repeated responses today

Add a cache_key, tune your matching, and turn repeated and similar prompts into instant, near-free responses across every provider.

Get started for free Talk to us