Activate with one suffix
Append :flex to any supported model ID and FastRouter routes the request to the provider's Flex tier. Your API key, endpoint, and request body stay exactly the same.
Append :flex to any supported model ID and FastRouter routes the request to the provider's discounted Flex tier-significantly lower token costs in exchange for variable throughput. Your key, endpoint, and payload stay exactly the same.
No credit card required · Free to start
Flex trades a little latency for a lot of savings on the requests that don't need to be instant-without changing how you call the API.
Append :flex to any supported model ID and FastRouter routes the request to the provider's Flex tier. Your API key, endpoint, and request body stay exactly the same.
Flex sends eligible requests to the provider's discounted tier, cutting token costs substantially-around 50% on some models, with savings that vary per model and provider.
Flex relaxes throughput guarantees in exchange for lower cost, so async and background jobs run cheaper while latency-sensitive traffic stays on the standard tier.
Add the suffix, FastRouter detects it and selects the provider's Flex tier, and you get back the same response shape-just at a lower token cost.
Request
FastRouter
Provider
Response
The only change
openai/gpt-5.4-nano
openai/gpt-5.4-nano:flex
Flex lives entirely in the model field. There is no separate endpoint, SDK, or account to set up-flip a request to Flex by editing one string, and flip it back just as easily.
Flex is a routing suffix, not a new API. Keep your OpenAI-compatible client, the same endpoint, and the same payload-just add :flex to the model ID.
openai/gpt-5.4-nano becomes openai/gpt-5.4-nano:flex. That single edit is the only change required.
Your base URL, API key, headers, and message format stay identical across every SDK and language.
Add provider.only to keep a request on a specific provider's Flex tier so it is never rerouted away.
When you route through the Flex tier, you pay the provider's discounted token rate. The savings can be substantial-with the exact reduction depending on the model and provider.
Both prompt and completion tokens are billed at the lower Flex rate-no separate plan, no commitment.
For GPT-5.4 Nano on OpenAI, input drops from $0.20 to $0.10 and output from $1.25 to $0.63 per 1M tokens.
Check the model catalog for per-model Flex pricing across every supported provider before you switch.
Cost per 1M tokens
Input tokens
Output tokens
~50% off per token
Example: GPT-5.4 Nano on OpenAI. Savings vary by model.
Flex shines on throughput-tolerant workloads. Keep interactive, streaming, and SLA-bound traffic on the standard tier, and move everything async to Flex.
Extraction, classification, batch summarisation, evals, and scheduled jobs all fit the Flex tier well.
Flex requests can slow down during peak provider demand, so avoid them for real-time, user-facing responses.
Decide tier on a per-call basis-route the same model through Flex or standard depending on the workload.
Route by workload
Both tiers share the same API, key, and payload. The difference is the trade-off you make-Flex swaps guaranteed throughput for a lower token cost.
| Characteristic | StandardDefault tier | FlexAppend :flex |
|---|---|---|
| Cost & savings | ||
| Discounted token pricing | Not included | Included |
| Up to ~50% lower cost on eligible models | Not included | Included |
| Latency & throughput | ||
| Guaranteed low latency | Included | Not included |
| Consistent throughput at peak load | Included | Not includedVariable |
| Recommended for streaming to users | Included | Not included |
| Setup | ||
| Activation | Default | :flex suffix |
| Same key, endpoint & payload | Included | Included |
| Provider pinning with provider.only | Included | Included |
| Best fit | ||
| Batch, eval & background jobs | Not included | Included |
| Real-time, user-facing traffic | Included | Not included |
A check marks the tier that best fits each characteristic. Flex is available on select OpenAI and Google models-look for the Flex tab in the model catalog.
Anywhere throughput matters more than instant latency, Flex turns the same request into a cheaper one.
Run extraction and classification pipelines on Flex, where steady throughput matters more than instant responses.
Summarise large document sets in the background and pay significantly less per token across the whole job.
Generate evaluation runs and fine-tuning datasets at scale without paying the standard-tier rate on every call.
Move cron-driven and async preprocessing to Flex to cut cost on non-urgent, latency-tolerant workloads.
Flex is a tiered pricing mode offered by OpenAI and Google on select models. When you append :flex to a model ID, FastRouter routes the request to the provider's Flex tier, which significantly reduces token costs in exchange for variable throughput and potentially higher latency under peak load.
Append :flex to the model ID-for example, openai/gpt-5.4-nano:flex or google/gemini-3.1-pro-preview:flex. Your API key, endpoint, headers, and request body stay exactly the same, so the suffix is the only change required.
Flex is currently available on select OpenAI and Google models (Gemini API on Vertex AI and AI Studio). Check the model catalog for a Flex tab under Provider Details-if it is present, Flex pricing is available for that model.
Savings vary by model. As a documented example, GPT-5.4 Nano on OpenAI drops from $0.20 to $0.10 per 1M input tokens and from $1.25 to $0.63 per 1M output tokens-roughly 50% off tokens. Check the model catalog for per-model Flex pricing across providers.
Avoid Flex for real-time, user-facing, or streaming responses. Flex requests can experience higher tail latencies during peak provider load, so interactive, latency-sensitive, and SLA-bound workloads should stay on the standard tier.
Yes. Add a provider pin such as provider.only set to openai (or googleaistudio / googlevertexai) so the request goes to the intended provider's Flex tier and isn't rerouted elsewhere.
Append :flex to a supported model and route batch, eval, and background workloads to the provider's discounted tier-no new SDK, no code rewrite.