Back
How FastRouter Keeps You "Stuck" to the Right Provider — and Why That Saves You Money

How FastRouter Keeps You "Stuck" to the Right Provider — and Why That Saves You Money

Sticky routing pins each conversation to one provider endpoint so your prompt cache stays warm. Here is how FastRouter handles it automatically.

Vamsi Krishna
Vamsi Krishna
8 Min Read|Latest -

A beginner-friendly tour of sticky routing: what it is, where it kicks in, and what you get out of it as a user.


If you've ever used FastRouter, you know the pitch: one API, dozens of LLM providers, automatic failover, smart pricing. What you may not know is that under the hood, FastRouter goes out of its way to break its own load-balancing rules in a few specific situations. It deliberately sends your follow-up requests back to the same provider that handled your earlier ones.

This behaviour is called sticky routing, and it's one of those features you don't notice when it's working — which is exactly the point. Let's unpack what it does, why it matters, and where it kicks in.

What is "sticky routing"?

A normal LLM-router decision goes something like this: "for this model, sort the providers by price/latency/health, send the request to the top one." That works great for one-shot prompts.

It works terribly for multi-turn conversations. Why? Because a lot of the things that make LLM calls fast and cheap — prompt caches, tool-call IDs, conversation threads — are per-provider state. The moment you bounce a follow-up request to a different provider, you throw all of that state away.

Sticky routing fixes that. It says: "For requests that look like a continuation of an earlier request, prefer the provider that served the earlier one." Same model, same conversation, same provider — at least when it makes sense to.

FastRouter implements this in three different places, each tuned for a different "continuation" signal.


1. Sticky routing for tool calls — making function-calling actually work

What it does

Modern LLMs support tool calls (a.k.a. function calls). The flow looks like this:

  1. You send: "What's the weather in Paris?" with a get_weather tool defined.
  2. The model replies: "I'd like to call get_weather with {city: 'Paris'}. Here's a tool_call_id of call_abc123 so you know which call this answer is for."
  3. Your code runs the tool and sends a follow-up: "The tool with id call_abc123 returned 18°C."
  4. The model finishes: "It's 18°C in Paris."

Steps 1–2 and 3–4 are two separate API calls. From FastRouter's perspective, that's two routing decisions.

Here's the catch: call_abc123 is an identifier minted by whichever provider answered step 2. If OpenAI generated it, only OpenAI knows what it means. Send step 3 to Anthropic and you'll get a confused "no such tool call" error back, breaking your agent loop.

FastRouter handles this with a small fast cache:

  • When a response comes back containing tool calls, FastRouter quietly remembers which provider produced them — tagged against your account so it never gets mixed up with anyone else's.
  • When a new request arrives that references one of those tool calls, FastRouter recognises it and routes the request straight back to the same provider that started the conversation.

What you get out of it

  • Your tool-calling agents just work. No mysterious "tool call not found" errors mid-conversation when failover or load-balancing happens to land you on a different provider.
  • You don't have to pin a provider yourself. You can keep using FastRouter's automatic routing for the first request, and the system figures out the rest.
  • Tenant isolation is handled for you. Even if two of your customers happen to generate the same tool-call ID, their requests can never cross-contaminate.

2. Sticky routing for warm prompt caches — saving you money turn after turn

What it does

Almost every major provider — OpenAI, Anthropic, Google, Grok — now offers prompt caching: send the same prefix twice in a short window and the second call is dramatically cheaper and faster, because the provider has internally cached the encoded version of that prefix.

The exact pricing varies, but as a rough rule, cache-read tokens cost 10–50% of regular prompt tokens. For an agent that re-sends a 5,000-token system prompt on every turn, the savings stack up quickly.

There's a catch, of course. Prompt caches are provider-private. OpenAI's cache means nothing to Anthropic. So if your conversation bounces between providers, you're paying full price every single turn.

FastRouter solves this with a second sticky-routing layer:

  1. Fingerprint the conversation. When a request comes in, FastRouter creates a small fingerprint of it — just enough to recognise the same conversation if it shows up again.
  2. Remember which provider warmed the cache. If the response shows the provider actually used its prompt cache, FastRouter quietly notes that this provider is the "warm" one for this conversation.
  3. Prefer, don't force. On the next request that matches the same fingerprint, FastRouter moves that provider to the front of the line — but the usual fallbacks stay in place. If the preferred provider is down or rate-limited, you still failover gracefully.

What you get out of it

  • Lower bills, automatically. As long as you keep using the same conversation, you ride the same provider's prompt cache. No cache thrashing, no paying full price for the same 5K-token system prompt on every turn.
  • Lower latency. Cached prompt tokens are not just cheaper — they skip the encoding step on the provider side, which typically shaves hundreds of milliseconds off the time-to-first-token.
  • Failover is preserved. Sticky routing only prefers the warm provider — it doesn't lock you in. If that provider is misbehaving, FastRouter still moves on to the next one. You never trade reliability for cost optimisation.
  • Works across API styles. The same behaviour applies whether your client speaks the OpenAI-style chat API, the Responses API, or Anthropic's messages API.

3. Sticky routing for stateful conversations — keeping threads on the right provider

What it does

Some providers support server-side conversation state. Instead of sending the full message history on every turn, you send a single previous_response_id and the provider stitches the new turn onto the stored conversation thread.

This is the strongest possible "I am continuing an earlier request" signal — and unlike a tool-call ID, this thread genuinely only exists on one provider's servers. There's no way to fall back to a different provider mid-conversation, period.

So FastRouter handles previous_response_id more strictly than the other two:

  • When the original request goes out, FastRouter quietly remembers which provider served it, alongside any details needed to continue that exact conversation later.
  • When a follow-up request arrives that references it, FastRouter locks the request to that one provider — no fallbacks, no re-shuffling — because nothing else could continue the thread anyway.
  • Behind the scenes, FastRouter also translates between its own response IDs and the provider's, so your code only ever sees a stable, FastRouter-shaped ID.

What you get out of it

  • Stateful conversations actually stay stateful. You can use previous_response_id against FastRouter exactly like you'd use it against the underlying provider. The router handles all the bookkeeping.
  • Provider transparency is preserved. Your client only ever sees FastRouter response IDs; it never has to know which underlying provider served the conversation.

A quick mental model: three "continuation" signals, three levels of stickiness

Signal in the request

What FastRouter does

Why

tool_call_id in messages

Routes back to the provider that started the tool call.

Tool-call IDs are provider-specific; the wrong provider has no idea what the ID refers to.

Same conversation seen earlier with a warm cache

Prefers the previously warm provider, with normal failover still in place.

Provider-side prompt caches are private; staying put keeps cache hits — and the savings — coming.

previous_response_id

Locks to exactly one provider — the one holding the conversation thread.

Server-side conversation threads live on a single provider; nothing else can continue them.

Notice the escalation: a hint for cache warmth, a redirect for tool-call correctness, a hard lock for stateful threads. Each level matches how unforgiving the underlying mechanism is.


Why this matters for users

If you're building anything more sophisticated than a single-shot text generator — agents, RAG pipelines, multi-turn assistants — sticky routing is doing serious work for you behind the scenes:

  1. You write simpler code. No need to manually pin providers or remember which provider answered a previous turn. Send the request to FastRouter; the right thing happens.
  2. You spend less. Prompt-cache stickiness can cut input-token costs by half or more on conversation-heavy workloads. The savings are entirely automatic — no flag to flip, no parameter to tune.
  3. Your agents don't break. Tool-call stickiness eliminates the most common, most confusing failure mode of multi-provider routing: the "what tool call?" error mid-loop.
  4. Failover still works. The cache-warmth and tool-call layers only prefer a provider — they don't lock you in. You get cheaper, faster requests without giving up the multi-provider resilience that's the whole reason you're using a router in the first place.
  5. Your account stays isolated. Every sticky-routing decision is scoped to your account, so there's no chance of crossover with anyone else's traffic.
The most elegant thing about FastRouter's sticky routing is that, as a user, you never really see it. You just notice that your bills are smaller than the raw token math suggests, your tool-calling agents don't randomly explode, and your stateful conversations stay coherent — and you go on with your day.

That's the whole point.

Related Articles

Your Fine-Tuned Models Now Work Inside FastRouter
Your Fine-Tuned Models Now Work Inside FastRouter
Integration & Architecture

Your Fine-Tuned Models Now Work Inside FastRouter

Add fine-tuned and custom model endpoints to FastRouter. Route them like any standard model — with full observability, cost tracking, and governance.

Andrej Gamser
Andrej Gamser
3 Min ReadMay, 18 2026