
Slash Your AI Costs in Half with FastRouter Flex Processing: The Zero-Code Way to Save 50%

Cut batch processing costs ~50% by appending :flex to your model ID. No code refactors, no migration — just cheaper inference.

Ritesh Prasad
7 Min Read

If you're running AI workloads at any scale, you know the pain: your LLM bill keeps climbing. Data extraction pipelines, eval runs, batch summarization, background classification—these jobs churn through billions of tokens every month, and most teams pay full "real-time" prices for tasks that don't need sub-second responses.

FastRouter's Flex Pricing changes that equation. By appending a single suffix to your model ID—literally :flex—you can cut token costs by roughly 50% across OpenAI and Google models, with zero changes to your API key, endpoint, or code structure.

This isn't a "maybe someday" optimization. It's live, it's simple, and it works today. Let's walk through exactly what Flex Processing is, how much you'll actually save, and how to turn it on in your production stack.


What Is Flex Processing?

Flex processing is a tiered inference mode that trades guaranteed low latency for dramatically lower token costs. When you route a request through FastRouter with the :flex suffix, your call goes to the provider's "economy tier"—think of it as the batch processing lane, priced at Batch API rates with additional prompt caching discounts platform.openai.com.

Here's the core trade-off, stated plainly by docs.fastrouter.ai:

"Flex requests may experience higher tail latencies during peak provider load. Use the standard tier for interactive or streaming use-cases."

But for non-interactive workloads—the kind that run in queues, cron jobs, or offline pipelines—that trade-off is a no-brainer. You get:

  • Same models, same context windows, same quality
  • ~50% lower token costs on both input and output
  • Zero infrastructure changes—just a suffix

OpenAI describes Flex as "ideal for non-production or lower priority tasks, such as model evaluations, data enrichment, and asynchronous workloads" platform.openai.com. FastRouter exposes this across both OpenAI (e.g., gpt-5.4-nano, gpt-5.5, o3, o4-mini) and Google Gemini (via Vertex AI and AI Studio, e.g., gemini-3.1-pro-preview) through a unified API docs.fastrouter.ai.


The Numbers: Real Savings on Real Models

Let's anchor this in concrete pricing. Using GPT-5.4 Nano as an example (from docs.fastrouter.ai):

Tier     | Input (per 1M tokens) | Output (per 1M tokens) | Context Window
Standard | $0.20                 | $1.25                  | 400,000
✦ Flex   | $0.10                 | $0.63                  | 400,000

Result: ~50% savings across the board.

OpenAI's own pricing tables confirm this pattern across the lineup. For example, gpt-5.5 on Flex/Batch pricing drops from $5.00 → $2.50 per million input tokens, and o3 drops from $2.00 → $1.00 developers.openai.com. The same logic applies to Google's Gemini models when routed through FastRouter's Flex tier.

Back-of-the-Envelope Example

Imagine a nightly batch job processing:

  • 50 million input tokens
  • 10 million output tokens
  • Using openai/gpt-5.4-nano

Standard tier cost:

  • Input: 50M × $0.20 / 1M = $10.00
  • Output: 10M × $1.25 / 1M = $12.50
  • Total per run: $22.50

Flex tier cost:

  • Input: 50M × $0.10 / 1M = $5.00
  • Output: 10M × $0.63 / 1M = $6.30
  • Total per run: $11.30

You just saved ~$11 per run, or about $4,000 per year if that job runs nightly. Scale that across multiple pipelines, eval suites, and enrichment jobs, and it's easy to reclaim tens of thousands of dollars annually medium.com.
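If you want to sanity-check these numbers against your own volumes, the arithmetic scripts in a few lines (a quick sketch in Python, using the GPT-5.4 Nano rates from the table above):

# Cost comparison for the nightly job above, at GPT-5.4 Nano rates.
RATES = {
    "standard": {"input": 0.20, "output": 1.25},  # $ per 1M tokens
    "flex":     {"input": 0.10, "output": 0.63},
}

def run_cost(tier: str, input_tokens: float, output_tokens: float) -> float:
    r = RATES[tier]
    return input_tokens / 1e6 * r["input"] + output_tokens / 1e6 * r["output"]

standard = run_cost("standard", 50e6, 10e6)  # $22.50
flex = run_cost("flex", 50e6, 10e6)          # $11.30
print(f"Savings per run: ${standard - flex:.2f}")  # ~$11.20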

Industry analyses show that intelligent routing—combining model selection with tier selection—can cut AI infrastructure bills by 50–80% without measurable quality loss aicostcheck.com langrouter.ai. Flex Processing is the tier-selection piece of that puzzle.


When to Use Flex (and When to Avoid It)

FastRouter's docs spell out the decision framework clearly docs.fastrouter.ai:

✅ Use Flex For:

  • Data extraction & classification pipelines (invoices, contracts, logs, tickets)
  • Batch document summarization (knowledge bases, call transcripts, meeting notes)
  • Eval and fine-tuning dataset generation (synthetic examples, benchmarks)
  • Scheduled background jobs (nightly ETL, enrichment cron jobs, periodic reporting)
  • Cost-optimized preprocessing at scale (tagging content libraries, CRM enrichment)

❌ Use Standard For:

  • Real-time chat & interactive UIs
  • Streaming responses to end users
  • Latency-sensitive agent loops
  • Voice or real-time applications
  • SLA-bound enterprise workflows

Rule of thumb: If a human is waiting on the response, use standard. If the job runs in a queue or cron, use Flex.


How to Enable Flex in FastRouter (It's Just a Suffix)

The implementation is intentionally trivial. You're already calling FastRouter; now you're just telling it which processing tier to use.

Step 1: Check Model Support

Open the FastRouter model catalog and look for the "Flex" tab in Provider Details. If it's there, Flex pricing is available docs.fastrouter.ai.

Examples:

openai/gpt-5.4-nano
openai/gpt-5.4-mini
openai/gpt-5.5
google/gemini-3.1-pro-preview

Step 2: Append :flex to the Model ID

Change:

1"model": "openai/gpt-5.4-nano"

To:

1"model": "openai/gpt-5.4-nano:flex"

Or for Google:

1"model": "google/gemini-3.1-pro-preview:flex"

Your API key, endpoint, and payload stay exactly the same docs.fastrouter.ai.

Step 3 (Optional): Pin the Provider

To ensure your request hits the exact Flex tier you expect, pin the provider:

1"provider": {"only": ["openai"]}

# or

1"provider": {"only": ["googlevertexai"]}

# or

1"provider": {"only": ["googleaistudio"]}

This prevents FastRouter from rerouting to an alternative provider that might not have Flex for that model.


Code Examples: Before vs. After

cURL

# ✦ With Flex: ~50% cheaper
curl 'https://api.fastrouter.ai/api/v1/chat/completions' \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openai/gpt-5.4-nano:flex",
    "provider": { "only": ["openai"] },
    "messages": [
      { "role": "user", "content": "Summarise this document..." }
    ]
  }'

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://api.fastrouter.ai/api/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-5.4-nano:flex",
    extra_body={"provider": {"only": ["openai"]}},
    messages=[
        {"role": "user", "content": "Summarise this document..."}
    ],
)

print(response.choices[0].message.content)

TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.fastrouter.ai/api/v1",
  apiKey: process.env.FASTROUTER_API_KEY,
});

const response = await client.chat.completions.create({
  model: "openai/gpt-5.4-nano:flex",
  // @ts-expect-error - FastRouter routing extension
  provider: { only: ["openai"] },
  messages: [
    { role: "user", content: "Summarise this document..." },
  ],
});

console.log(response.choices[0].message.content);

You're not rewriting business logic. You're not migrating providers. You're telling FastRouter: "Use the cheaper Flex lane for this workload."


Handling Flex Trade-Offs: Timeouts & Capacity

Because Flex runs on lower-priority capacity at the provider level, you should design for:

1. Higher Latency

Don't put Flex behind synchronous user flows. Use queues, cron jobs, or async workers. OpenAI recommends raising client timeouts to as long as 15 minutes for some Flex workloads platform.openai.com.

2. Occasional Resource Errors

OpenAI notes that Flex may return 429 Resource Unavailable when capacity is tight. You are not charged when this happens platform.openai.com.

Recommended mitigation:

  • Implement retries with exponential backoff
  • For critical jobs, add a fallback to standard tier on repeated failures (by dropping :flex or using service_tier: "auto")

FastRouter doesn't change those semantics; it just makes routing into the Flex tier across providers trivial.
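Here's a minimal sketch of that retry-plus-fallback pattern, using the OpenAI Python SDK pointed at FastRouter as in the earlier example (the retry count and backoff schedule are illustrative, not prescriptive):

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://api.fastrouter.ai/api/v1",
    api_key="YOUR_API_KEY",
)

def flex_with_fallback(messages, model="openai/gpt-5.4-nano", max_retries=3):
    # Try the Flex tier first, backing off exponentially on capacity errors.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=f"{model}:flex",
                messages=messages,
                timeout=900,  # Flex can be slow; allow up to 15 minutes
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
    # Repeated 429s: fall back to the standard tier by dropping :flex.
    return client.chat.completions.create(model=model, messages=messages)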


Flex + Model Routing: Stacking Savings

Flex Processing is even more powerful when combined with model routing—sending each request to the cheapest model that can do the job well.

Industry data shows:

  • Most workloads split into simple, moderate, and complex tasks
  • Routing simple tasks to nano/mini models, moderate ones to mid-tier, and complex ones to flagships cuts costs by 50–80% aicostcheck.com
  • Flagship models like GPT-5 and Claude Opus are 10–60× more expensive per token than nano/mini options langrouter.ai

A cost-efficient architecture with FastRouter might look like:

Tier 1 (cheapest): Nano/mini + Flex for background/simple work
Example:

openai/gpt-5.4-nano:flex for extraction pipelines

Tier 2: Mid-range on standard for interactive but not extreme tasks
Example:

openai/gpt-5.4-mini

Tier 3: Flagship on standard/priority for complex reasoning and SLAs
Example:

openai/gpt-5.5 or o3

With this setup, you're not just picking the right model—you're also picking the right processing tier. FastRouter makes that a one-line change.
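As a toy sketch of that three-tier mapping (the complexity labels and routing logic here are assumptions to adapt to your own workload, not a FastRouter feature):

TIERS = {
    "simple":   "openai/gpt-5.4-nano:flex",  # Tier 1: background/batch work
    "moderate": "openai/gpt-5.4-mini",       # Tier 2: interactive mid-range
    "complex":  "openai/gpt-5.5",            # Tier 3: flagship reasoning
}

def pick_model(complexity: str, interactive: bool) -> str:
    model = TIERS[complexity]
    # Never send a human-facing request through the Flex lane.
    if interactive and model.endswith(":flex"):
        model = model.removesuffix(":flex")
    return model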


Getting Started: A Simple Rollout Plan

You can pilot Flex in a single afternoon:

  1. Identify candidate workloads – Any job that runs on a schedule or via a queue is a good candidate.
  2. Flip those jobs to :flex models – Start with openai/gpt-5.4-nano:flex or a Gemini :flex variant.
  3. Monitor:
    • Per-job cost
    • P95 / P99 latency
    • Error and retry rates
  4. Roll out gradually – If the latency profile fits your batch window, migrate more workloads.

Most teams discover that 70–90% of their background traffic can move to Flex safely. Suddenly, the jobs everyone forgot about become the biggest source of savings.
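For the per-job cost piece of that monitoring, the usage block on each response is enough. A rough sketch at the GPT-5.4 Nano Flex rates above (adjust the constants to your model):

FLEX_INPUT_PER_M = 0.10   # $ per 1M input tokens (GPT-5.4 Nano Flex)
FLEX_OUTPUT_PER_M = 0.63  # $ per 1M output tokens

def job_cost(responses) -> float:
    # Sum token usage across a batch of chat completion responses.
    total = 0.0
    for r in responses:
        total += r.usage.prompt_tokens / 1e6 * FLEX_INPUT_PER_M
        total += r.usage.completion_tokens / 1e6 * FLEX_OUTPUT_PER_M
    return total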


Summary

FastRouter's Flex Processing is one of the highest-leverage cost optimizations available in 2026:

  • ~50% lower token costs on supported models, with no code refactors
  • Multi-provider support across OpenAI and Google (Vertex AI / AI Studio)
  • Perfect fit for batch, eval, and background jobs
  • Stacks with model routing and prompt caching for even deeper savings

If you're still running all your workloads on standard tiers, you're almost certainly overpaying. Start by adding :flex to one high-volume background pipeline, measure the savings, and expand from there.

For full details and supported models, see the FastRouter Flex Pricing docs and OpenAI's Flex processing guide.
