Flex Pricing

Lower-cost inference for workloads that can wait

Append :flex to any supported model ID and FastRouter routes the request to the provider's discounted Flex tier, trading variable throughput for significantly lower token costs. Your key, endpoint, and payload stay exactly the same.


No credit card required · Free to start

~50% lower token cost

Inference pricing: openai/gpt-5.4-nano, per 1M tokens
Tier	Latency	Input	Output
Standard (default)	Low latency	$0.20	$1.25
Flex (:flex)	Variable	$0.10	$0.63

Why Flex Pricing

Meaningful savings, one suffix away

Flex trades a little latency for a lot of savings on the requests that don't need to be instant, without changing how you call the API.

Activate with one suffix

Append :flex to any supported model ID and FastRouter routes the request to the provider's Flex tier. Your API key, endpoint, and request body stay exactly the same.

Pay less per token

Flex sends eligible requests to the provider's discounted tier, cutting token costs by around 50% on some models. Savings vary by model and provider.

Trade latency you don't need

Flex relaxes throughput guarantees in exchange for lower cost, so async and background jobs run cheaper while latency-sensitive traffic stays on the standard tier.

How it works

From :flex suffix to a cheaper response

Add the suffix; FastRouter detects it, selects the provider's Flex tier, and returns the same response shape at a lower token cost.

Request: model …:flex (OpenAI-compatible)

Append :flex to a supported model ID.
Key, endpoint, and payload stay unchanged.

FastRouter: detects :flex (tier routing, provider.only)

Selects the provider's Flex tier for the call.
Honors a provider pin so it isn't rerouted.

Provider: Flex tier (OpenAI, Google)

Runs on the provider's discounted capacity.
Throughput varies with provider load.

Response: same shape, lower token cost

Identical response format to standard calls.
Billed at the Flex token rate.

The only change

Before (Standard): openai/gpt-5.4-nano
After (Flex, with the :flex suffix): openai/gpt-5.4-nano:flex

A routing flag, not a migration

Flex lives entirely in the model field. There is no separate endpoint, SDK, or account to set up: flip a request to Flex by editing one string, and flip it back just as easily.

Add :flex to a supported model ID; the API key, endpoint, and payload are unchanged.
Pin the provider with provider.only so the call lands on the intended Flex tier.
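Because the switch lives in one string, it can be wrapped in a pair of small helpers. The following is a minimal sketch; the function names `to_flex` and `to_standard` are illustrative and not part of any FastRouter SDK.

```python
FLEX_SUFFIX = ":flex"


def to_flex(model_id: str) -> str:
    """Return the model ID with the :flex suffix, adding it only if missing."""
    if model_id.endswith(FLEX_SUFFIX):
        return model_id
    return model_id + FLEX_SUFFIX


def to_standard(model_id: str) -> str:
    """Return the model ID with a trailing :flex suffix removed, if present."""
    if model_id.endswith(FLEX_SUFFIX):
        return model_id[: -len(FLEX_SUFFIX)]
    return model_id
```

Both helpers are idempotent, so calling `to_flex` on an ID that already carries the suffix leaves it unchanged.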

Drop-in activation

Turn on savings without touching your stack

Flex is a routing suffix, not a new API. Keep your OpenAI-compatible client, the same endpoint, and the same payload, and just add :flex to the model ID.

One-line model change

openai/gpt-5.4-nano becomes openai/gpt-5.4-nano:flex. That single edit is the only change required.

Same OpenAI-compatible API

Your base URL, API key, headers, and message format stay identical across every SDK and language.

Optional provider pin

Add provider.only to keep a request on a specific provider's Flex tier so it is never rerouted away.

POST /api/v1/chat/completions

{
  "model": "openai/gpt-5.4-nano:flex",
  "provider": { "only": ["openai"] },
  "messages": [{ "role": "user", "content": "Summarise…" }]
}

Routed to OpenAI Flex tier · ~50% lower cost
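The same request can be assembled in Python with only the standard library. This is a sketch under stated assumptions: the host in `BASE_URL` is a placeholder, and the `Authorization: Bearer` header is assumed from the usual OpenAI-compatible convention rather than confirmed here. Only the path, the :flex suffix, and the provider.only field come from the example above.

```python
import json
import urllib.request

# Assumption: placeholder host. Substitute your real FastRouter base URL.
BASE_URL = "https://example.invalid"
CHAT_PATH = "/api/v1/chat/completions"  # path taken from the example above


def build_flex_payload(model_id, prompt, provider=None):
    """Build a chat payload whose model ID carries the :flex suffix."""
    if not model_id.endswith(":flex"):
        model_id += ":flex"
    payload = {
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
    }
    if provider is not None:
        # Optional pin so the call stays on one provider's Flex tier.
        payload["provider"] = {"only": [provider]}
    return payload


def build_request(api_key, payload):
    """Wrap the payload in a POST request object; nothing is sent here."""
    return urllib.request.Request(
        BASE_URL + CHAT_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: bearer-token auth, as in most OpenAI-compatible APIs.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result of `build_request(...)` to `urllib.request.urlopen` once `BASE_URL` points at a real endpoint.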

Lower token cost

Meaningful savings on eligible requests

When you route through the Flex tier, you pay the provider's discounted token rate. The savings can be substantial, with the exact reduction depending on the model and provider.

Discounted input and output

Both prompt and completion tokens are billed at the lower Flex rate, with no separate plan and no commitment.

Up to ~50% off on some models

For GPT-5.4 Nano on OpenAI, input drops from $0.20 to $0.10 and output from $1.25 to $0.63 per 1M tokens.

Usage-based, per model

Check the model catalog for per-model Flex pricing across every supported provider before you switch.

Cost per 1M tokens: gpt-5.4-nano
Token type	Standard	Flex
Input tokens	$0.20	$0.10
Output tokens	$1.25	$0.63

~50% off per token. Example: GPT-5.4 Nano on OpenAI. Savings vary by model.
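To see what those rates mean for a whole job, here is a small worked calculation. It uses only the GPT-5.4 Nano prices quoted above; the token counts are made-up inputs for illustration, and other models will have different rates.

```python
# Prices per 1M tokens for GPT-5.4 Nano on OpenAI, as quoted above.
PRICES_PER_1M = {
    "standard": {"input": 0.20, "output": 1.25},
    "flex": {"input": 0.10, "output": 0.63},
}


def job_cost(tier, input_tokens, output_tokens):
    """Return the dollar cost of a job on the given tier."""
    rates = PRICES_PER_1M[tier]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


def savings_pct(input_tokens, output_tokens):
    """Return the percentage saved by running the same job on Flex."""
    standard = job_cost("standard", input_tokens, output_tokens)
    flex = job_cost("flex", input_tokens, output_tokens)
    return 100 * (standard - flex) / standard
```

For a hypothetical batch of 10M input tokens and 2M output tokens, this gives $4.50 on Standard versus $2.26 on Flex, a saving of about 49.8%.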

Right tier, right request

Send the work that can wait to Flex

Flex shines on throughput-tolerant workloads. Keep interactive, streaming, and SLA-bound traffic on the standard tier, and move everything async to Flex.

Best for background work

Extraction, classification, batch summarisation, evals, and scheduled jobs all fit the Flex tier well.

Higher tail latency under load

Flex requests can slow down during peak provider demand, so avoid them for real-time, user-facing responses.

Mix tiers per request

Decide the tier on a per-call basis: route the same model through Flex or Standard depending on the workload.

Route by workload

Send to Flex

Data extraction & classification
Batch document summarisation
Eval & dataset generation
Scheduled background jobs

Keep on Standard

Real-time chat & interactive UIs
Streaming responses to users
Latency-sensitive agent loops
Voice & realtime applications
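The split above can be expressed as a per-call decision. This is a sketch only: the workload labels are hypothetical names chosen for this example, and anything not explicitly marked as Flex-friendly defaults to the standard tier.

```python
# Hypothetical workload labels, grouped per the lists above.
FLEX_WORKLOADS = {
    "extraction",
    "classification",
    "batch_summarisation",
    "eval",
    "dataset_generation",
    "scheduled_job",
}


def model_for(workload, base_model):
    """Return the model ID to use for a workload.

    Flex-friendly workloads get the :flex suffix. Everything else,
    including unknown labels, stays on the standard tier.
    """
    if workload in FLEX_WORKLOADS:
        return base_model + ":flex"
    return base_model
```

Defaulting unknown workloads to Standard is a deliberately conservative choice, so that latency-sensitive traffic is never moved to Flex by accident.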

Standard vs Flex

Pick the tier that fits each request

Both tiers share the same API, key, and payload. The difference is the trade-off you make: Flex swaps guaranteed throughput for a lower token cost.

Comparison of FastRouter Standard and Flex inference tiers
Characteristic	Standard (default tier)	Flex (append :flex)
Cost & savings
Discounted token pricing	Not included	Included
Up to ~50% lower cost on eligible models	Not included	Included
Latency & throughput
Guaranteed low latency	Included	Not included
Consistent throughput at peak load	Included	Not included (variable)
Recommended for streaming to users	Included	Not included
Setup
Activation	Default	:flex suffix
Same key, endpoint & payload	Included	Included
Provider pinning with provider.only	Included	Included
Best fit
Batch, eval & background jobs	Not included	Included
Real-time, user-facing traffic	Included	Not included

"Included" marks the tier that best fits each characteristic. Flex is available on select OpenAI and Google models; look for the Flex tab in the model catalog.

Use cases

Built for async and at-scale workloads

Anywhere throughput matters more than instant latency, Flex turns the same request into a cheaper one.

Data extraction & classification

Run extraction and classification pipelines on Flex, where finishing the whole job cheaply matters more than instant responses.

Batch document summarisation

Summarise large document sets in the background and pay significantly less per token across the whole job.

Eval & dataset generation

Generate evaluation runs and fine-tuning datasets at scale without paying the standard-tier rate on every call.

Scheduled background jobs

Move cron-driven and async preprocessing to Flex to cut cost on non-urgent, latency-tolerant workloads.

FAQ

Flex Pricing questions, answered

Flex is a tiered pricing mode offered by OpenAI and Google on select models. When you append :flex to a model ID, FastRouter routes the request to the provider's Flex tier, which significantly reduces token costs in exchange for variable throughput and potentially higher latency under peak load.

Append :flex to the model ID, for example openai/gpt-5.4-nano:flex or google/gemini-3.1-pro-preview:flex. Your API key, endpoint, headers, and request body stay exactly the same, so the suffix is the only change required.

Flex is currently available on select OpenAI and Google models (Gemini API on Vertex AI and AI Studio). Check the model catalog for a Flex tab under Provider Details. If it is present, Flex pricing is available for that model.

Savings vary by model. As a documented example, GPT-5.4 Nano on OpenAI drops from $0.20 to $0.10 per 1M input tokens and from $1.25 to $0.63 per 1M output tokens, roughly 50% off per token. Check the model catalog for per-model Flex pricing across providers.

Avoid Flex for real-time, user-facing, or streaming responses. Flex requests can experience higher tail latencies during peak provider load, so interactive, latency-sensitive, and SLA-bound workloads should stay on the standard tier.

Flex can be combined with provider pinning. Set provider.only to openai (or googleaistudio / googlevertexai) so the request goes to the intended provider's Flex tier and isn't rerouted elsewhere.
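A pin can be checked before the request is sent. The sketch below uses the three provider values named above; the mapping from the model ID prefix (`openai/`, `google/`) to allowed values is an assumption made for this example, as are the function and variable names.

```python
# Provider values named above. Keying them by model ID prefix is an
# assumption made for this example.
PROVIDER_PINS = {
    "openai": ["openai"],
    "google": ["googleaistudio", "googlevertexai"],
}


def pinned_flex_payload(model_id, messages, pin=None):
    """Build a Flex payload with a validated provider.only pin."""
    vendor = model_id.split("/", 1)[0]
    allowed = PROVIDER_PINS.get(vendor)
    if allowed is None:
        raise ValueError(f"no known Flex provider for {model_id!r}")
    chosen = pin if pin is not None else allowed[0]
    if chosen not in allowed:
        raise ValueError(f"{chosen!r} is not a valid pin for {model_id!r}")
    if not model_id.endswith(":flex"):
        model_id += ":flex"
    return {
        "model": model_id,
        "provider": {"only": [chosen]},
        "messages": messages,
    }
```

Rejecting a mismatched pin up front means a Google model is never sent with an OpenAI pin, or the reverse.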

Cut the cost of every request that can wait

Append :flex to a supported model and route batch, eval, and background workloads to the provider's discounted tier, with no new SDK and no code rewrite.
