You may not need a bigger, more expensive LLM. You may just need an architecture that stops treating every tiny task like a luxury.

If you're building autonomous AI agents at any scale — coding assistants, CI/CD pipelines, developer copilots — you've probably felt the cost creep. Tasks that individually cost $0.10 to $0.30 quietly compound into hundreds or thousands of dollars a month when run at volume.

The instinct is to optimize prompts, set token limits, or switch to a cheaper model wholesale. But there's a more nuanced approach worth considering: treating AI models the way you'd treat any other computing resource in a distributed system — and routing work based on what each step actually needs.

This post walks through one such architecture, the "Architect-Editor" pipeline, and explores the tradeoffs involved in adopting it.

Note: The numbers and benchmarks in this article are based on a specific experimental setup described below. Your results will vary depending on task complexity, model choices, and infrastructure. Think of this as a starting point for your own experiments, not a guaranteed outcome.

The Problem: Paying Frontier Prices for Commodity Work

The core issue with most agent deployments isn't the AI itself — it's using the wrong AI for the wrong job.

Think of building a software feature like constructing a building. You need a skilled Senior Architect to design the structure, ensure integrity, and make the hard decisions. But you don't need that same architect to lay every brick. That work can be handled by someone with narrower, well-defined instructions.

In most AI agent setups, every step of every task — whether it's designing a system or writing a boilerplate helper function — hits the same frontier model. When you audit where tokens are actually being spent, you'll often find that a large proportion of generation work is routine: filling in predictable structure, writing standard patterns, or executing well-scoped sub-tasks.

The question worth asking is: "Does this specific step actually require a frontier model, or could a smaller, cheaper model handle it given clear enough instructions?"

For many workflows, the answer may surprise you.

A Common Pitfall: Turn-by-Turn Routing

The obvious first attempt is to build a router that inspects each request as it comes in and decides which model to send it to. Call it the "traffic cop" approach.

Using FastRouter's BYOK (Bring Your Own Key) Custom Hosts feature, you can register a locally-running open-source model (via Ollama and ngrok) as a provider alongside Anthropic and OpenAI — all under one unified API. This makes it technically straightforward to route some requests to a free local model and others to a frontier model.

The problem is that agent workflows are stateful. Coding agents, in particular, use tools (file reading, bash execution, etc.) on nearly every turn. A naive router that sees tool calls in the payload and escalates to the expensive model will end up routing 100% of traffic to the frontier model — completely defeating the purpose.

Key insight: You cannot effectively route AI tasks turn-by-turn when the workflow requires continuous, shared context across turns. The routing granularity needs to match the task granularity.

The Architect-Editor Pipeline: An Alternative Approach

Instead of routing at the turn level, consider routing at the task level. The idea is to decompose the work upfront, then assign each piece to the most appropriate model.

Here's how this looks in practice for an autonomous coding agent:

Role 1: The Architect (Frontier Model)

At the start of a task, you call the frontier model exactly once. You give it the full context — the user's goal, any relevant constraints — and ask it to produce a structured plan: which files need to be created, what each one should do, and any cross-file dependencies.

This is where frontier reasoning genuinely earns its cost. Planning, understanding intent, anticipating edge cases, structuring a coherent multi-file system — these are hard problems that benefit from a capable model.

Estimated cost: ~$0.01 per task (one well-scoped planning call).

Role 2: The Editor (Smaller Local Model)

Once you have the plan, you slice it into individual, well-scoped sub-tasks — typically one file at a time — and hand each to a smaller, locally-running model. The instructions are tight: "Here is the spec. Write exactly this file."

Smaller models (7B–14B parameter range) can perform well here because the task is bounded. They're not reasoning about the whole system — they're executing a narrow, well-defined piece of it.

Estimated cost: ~$0.00 per file (local compute).

The flow:

User submits a task (e.g., "Build a Python REST API with these endpoints")
Frontier model (Architect) produces a structured JSON blueprint — files to create, specs for each
For each file in the blueprint, the local model (Editor) writes the implementation
A local sandbox runs syntax checks and linters
On failure, the model retries up to twice — only then does it escalate that specific file to the frontier model

The Engineering You Can't Skip: Defensive Layers

One thing that becomes clear quickly when working with smaller open-source models: they're less consistent about output formatting. A model that's supposed to return clean JSON might wrap it in markdown fences, drop a closing bracket, or add explanatory text before the JSON block.

If your pipeline isn't built to handle this gracefully, a single formatting quirk crashes the whole run.

The solution is a coercion layer — sometimes called a _coerce_tool_calls adapter — that sits between the model's raw output and your execution logic. It cleans up malformed JSON, strips unexpected wrappers, and standardizes the output before anything downstream tries to act on it.

This layer is not optional. If you're routing any traffic to smaller models, plan for output inconsistency and build accordingly. The goal isn't to get the small model to be perfect — it's to build a system that handles imperfection gracefully.

What the Numbers Might Look Like

To give you a rough sense of the potential savings, here's an illustrative benchmark based on a multi-file Python package generation task (7 modules, a database layer, and mocked tests). These numbers come from a specific experimental run and should be treated as directional, not definitive:

Metric	Frontier Model Only	Hybrid (Architect-Editor)
Cost Per Task	$0.280	~$0.126 (est. 55% lower)
Files on Local Compute	0 / 12	12 / 12
Quality Score	97.5 / 100	95.0 / 100

Important caveat: These figures are from a controlled benchmark on a specific task type. Real-world savings will depend heavily on your task complexity, the quality of your planning prompts, your local hardware, and how well your coercion layer handles edge cases. We'd encourage you to run your own benchmarks on representative tasks before drawing conclusions.

That said, the directional logic holds: if most of your token spend is on execution (writing code, filling templates, generating structured output from a clear spec), there's likely room to shift some of that to cheaper compute without meaningfully degrading quality.

At 1,000 tasks per month, even a 40% cost reduction is material. At enterprise scale across CI/CD pipelines and developer tooling, the math becomes compelling enough to warrant serious exploration.

Design Principles Worth Taking Seriously

Whether or not this specific architecture fits your use case, the underlying principles are broadly applicable:

Route by task, not by turn: Agent workflows need coherent context. Instead of interrupting mid-flow, decompose upfront and assign entire, bounded sub-tasks to the right model.
Build defensively: Assume smaller models will produce inconsistent output. Build translation and coercion layers that absorb those inconsistencies before they propagate.
Instrument everything: You can't optimize what you can't measure. Unified observability across cloud and local providers — which FastRouter's BYOK feature enables — is what makes cost attribution and quality tracking possible.
Match model capability to task complexity: Not every step needs frontier reasoning. Identify where the reasoning actually matters and concentrate your spend there.

Is This Worth Trying?

The honest answer: it depends on your workload.

This approach works best when your agent tasks are decomposable — when there's a clear separation between the planning work (which benefits from frontier reasoning) and the execution work (which can be bounded and spec'd). Coding agents, document generation pipelines, and structured data extraction are reasonable candidates.

It works less well for tasks that require deep reasoning throughout, where context needs to flow continuously across many steps, or where the quality gap between frontier and local models is unacceptable for your use case.

The tools to experiment with this are already available. FastRouter's BYOK feature makes it straightforward to register a local Ollama instance as a provider alongside cloud models, giving you a single API, unified logging, and the ability to precisely measure cost and quality tradeoffs as you tune your approach.

If you're spending meaningfully on AI agent infrastructure, it's an experiment worth running.