Fine-Tuning Gemma 3 4B on Synthetic Browser Trajectories: A Benchmark Against Frontier APIs
We fine-tuned Gemma 3 4B on 3,000 synthetic browser trajectories and benchmarked it against GPT-5.1, Claude 4.5 Sonnet, and five other models.

Can a 4-billion-parameter model trained on 3,000 synthetically generated examples close the gap with GPT-5.1 and Claude 4.5 Sonnet on real browser-automation tasks? We built the full pipeline — data synthesis, QLoRA training, and live benchmarking — to find out.
1. The Problem Worth Caring About
Browser automation has become a foundational capability for AI agents. Price-comparison bots, form-filling assistants, research scrapers, and multi-tab research pipelines all reduce to the same primitive: a model that can see a web state and output a precise, executable action.
Frontier APIs are genuinely good at this. Give GPT-5.5 or Claude Opus 4.8 a task description and a tool schema, and they will navigate a surprising fraction of sites correctly on the first try. But frontier APIs have two friction points that matter in production: cost and latency. A 50-task benchmark run across three frontier models burns over a dollar in API credits. At the per-task costs reported later in this article, a production agent doing 10,000 tasks per day would spend on the order of a thousand dollars or more per month on inference alone.
The hypothesis this project tests is simple: what if we could distill browser-agent capability into a 4B-parameter model that runs locally for free? Gemma 3 4B is an appealing candidate — it is small enough to run on a MacBook M4 Pro with 24 GB unified memory, fast enough for interactive use, and large enough to hold the structured-output discipline required for reliable tool calling.
The catch is data. Frontier models were trained on internet-scale corpora and RLHF signal derived from billions of human preferences. Our local model has neither. What it can have is a focused, high-quality synthetic dataset of browser-agent trajectories — generated by the same frontier models we are trying to beat.
2. Architecture: The Ten-Tool Schema
The agent's action space is defined by Pydantic-typed tools: ten browser actions plus a terminal done signal.
Tool | Purpose |
google_search(query) | Issue a Google search and return results |
open_url(url) | Navigate the current tab to a URL |
click(selector) | Click an element by CSS, XPath, or positional keyword |
type_text(selector, text) | Type into an input field |
extract_text(selector, max_chars) | Scrape page text |
open_new_tab(url) | Open a URL in a new tab |
switch_tab(index) | Focus a different browser tab |
list_tabs() | Return all open tab URLs and titles |
scroll(direction, amount) | Scroll the viewport |
wait_for_element(selector, timeout_ms) | Wait for an element to appear |
done(result) | Signal task completion with an optional result string |
The key design choice is that every tool has a strict Pydantic schema. A tool call is only valid if it parses without error. This gives us a clean binary metric — schema validity rate — that we can measure during training and use as a go/no-go gate before full benchmarking.
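The schema-validity metric described above can be sketched as follows. This is a minimal illustration, not the project's code: it uses stdlib dataclasses as a stand-in for the Pydantic models so that it runs without dependencies, it models only three of the tools, and the names `parse_call` and `schema_validity_rate` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class Click:
    selector: str


@dataclass(frozen=True)
class TypeText:
    selector: str
    text: str


@dataclass(frozen=True)
class Done:
    result: str = ""  # the result string is optional


# Only three tools are modelled here; the real schema has more.
TOOLS = {"click": Click, "type_text": TypeText, "done": Done}


def parse_call(name: str, params: dict[str, Any]):
    """Return a typed tool call, or raise ValueError if it is invalid."""
    cls = TOOLS.get(name)
    if cls is None:
        raise ValueError(f"unknown tool: {name}")
    try:
        call = cls(**params)
    except TypeError as exc:  # missing or unexpected field
        raise ValueError(str(exc)) from exc
    for f in fields(call):
        if not isinstance(getattr(call, f.name), f.type):
            raise ValueError(f"{name}.{f.name} must be {f.type.__name__}")
    return call


def schema_validity_rate(calls: list[tuple[str, dict[str, Any]]]) -> float:
    """Fraction of (tool name, params) pairs that parse without error."""
    valid = 0
    for name, params in calls:
        try:
            parse_call(name, params)
            valid += 1
        except ValueError:
            pass
    return valid / len(calls) if calls else 0.0


sample = [
    ("click", {"selector": "#submit"}),
    ("extract_page", {}),                 # hallucinated tool name
    ("type_text", {"selector": "#q"}),    # missing required field
    ("done", {}),
]
print(schema_validity_rate(sample))
```

Because each call either parses or does not, the metric is a plain ratio that can be tracked per checkpoint during training.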
The schema is rendered into two formats at runtime: OpenAI-compatible JSON (for GPT and Claude via FastRouter) and Gemma 3's native function-calling tokens (<|tool_call|>, <|/tool_call|>, <|tool_result|>, <|/tool_result|>). Gemma 3 learned these tokens during pre-training, which means our fine-tune can teach when and what to call without needing to also teach the model how to format the call from scratch.
3. Synthetic Data: The Simula Pipeline
Getting 3,000 high-quality training examples without paying for 3,000 hours of human annotation requires a multi-stage synthesis pipeline. We call ours Simula, after the language. It has six stages:
3.1 Taxonomy Generation
Gemini 2.5 Pro generates a three-level topic hierarchy: domain → category → leaf task. With 12 domains (Developer Documentation, E-commerce, News Aggregation, Academic Research, Multi-Tab Research, Site Navigation, Form Filling, Data Extraction, Social Media, Government Services, Job Search, and Finance) and roughly 15 leaf tasks per domain, we end up with ~180 distinct task types.
The taxonomy serves as a stratified sample frame, ensuring the dataset covers the breadth of real-world browser use rather than over-indexing on whichever tasks were easiest to generate.
3.2 Trajectory Generation
For each leaf task, GPT-5.1 or Gemini 3 Pro generates 15-20 candidate trajectories. Each trajectory is a list of (tool, params, result) tuples formatted in Gemma 3's native token syntax. We use meta-prompts to vary the phrasing, URL choices, and step count across candidates — otherwise the generator collapses to the same surface form for every example in a leaf.
N-of-K sampling (generate K=15, keep top N=5 by critic score) ensures quality without running every candidate through the more expensive critic step.
3.3 Dual Critic
Generated trajectories pass through two filters:
Structural critic (free): Pydantic validates every tool call. Any trajectory with a parsing error, a missing required field, or a done() call that isn't the last step is dropped immediately.
Semantic critic (Claude 4.5 Sonnet): The survivors are judged by a separate LLM on five dimensions — task-instruction alignment, step necessity (no redundant actions), parameter correctness (real URLs, sensible selectors), result plausibility, and difficulty appropriateness. Only examples scoring ≥4/5 pass.
The dual-critic step typically drops 25-35% of generated trajectories, which is the expected trade-off between generation volume and quality.
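A minimal sketch of how the two filters compose, under stated assumptions: trajectories are represented as lists of dicts with a `tool` key, the semantic score is a single integer from 1 to 5 supplied by the LLM judge, and a trajectory with no done() call at all is also rejected. The function names are hypothetical, and per-call schema validation is omitted here.

```python
from typing import Sequence

SEMANTIC_PASS_SCORE = 4  # threshold from the text: keep scores of 4/5 or higher


def structurally_valid(trajectory: Sequence[dict]) -> bool:
    """Check the ordering rule: done() appears exactly once, as the final step."""
    if not trajectory:
        return False
    tools = [step.get("tool") for step in trajectory]
    return tools[-1] == "done" and tools.count("done") == 1


def passes_dual_critic(trajectory: Sequence[dict], semantic_score: int) -> bool:
    """Run the free structural check first, then apply the semantic threshold."""
    return structurally_valid(trajectory) and semantic_score >= SEMANTIC_PASS_SCORE


good = [{"tool": "open_url", "params": {"url": "https://example.com"}},
        {"tool": "done", "params": {}}]
bad = [{"tool": "done", "params": {}},
       {"tool": "click", "params": {"selector": "a"}}]
print(passes_dual_critic(good, 4), passes_dual_critic(bad, 5))
```

Ordering the checks this way means the paid judge only ever sees trajectories that already passed the free one.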
3.4 Execution-Grounded Filtering
A random 15% of surviving trajectories are replayed through a live Playwright executor running 5 parallel Chromium workers. We keep only those where every step executes without a timeout or selector-miss error. This is the most expensive stage — both computationally and in wall-clock time — but it ensures the training data reflects what actually works in a real browser, not just what looks plausible to a language model.
3.5 Complexification
Simple 1-2 step trajectories are rewritten by an LLM into richer 3-5 step versions. A task like "Open the Python docs" becomes "Open the Python docs, navigate to the asyncio section, extract the event loop lifecycle diagram caption." This stage targets the 20% of examples that are too short to teach multi-step reasoning, inflating them to a more instructive form.
3.6 Embedding-Based Deduplication
sentence-transformers (all-MiniLM-L6-v2) embeds every instruction. Pairs with cosine similarity > 0.85 are considered duplicates; only one is kept. This step removes the synthetic data's worst pathology — near-identical examples that inflate training set size without adding information.
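The deduplication rule can be sketched as a greedy pass over precomputed embeddings. This is an illustration only: the vectors are assumed to come from the embedding model named above, the first-occurrence-wins policy is an assumption (the text only says one of each pair is kept), and `dedupe` is a hypothetical name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(embeddings: list[list[float]], threshold: float = 0.85) -> list[int]:
    """Return indices to keep; an item is dropped if it is too similar
    to any item that was already kept."""
    kept: list[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D vectors: the second is nearly parallel to the first.
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(dedupe(vectors))
```

The greedy pass is quadratic in the worst case, which is workable at a few thousand examples; a larger corpus would likely need an approximate nearest-neighbour index instead.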
The final dataset contains 3,000+ examples spanning all 12 domains, difficulty levels from easy (1-2 steps) to multi_step (5+ steps, multi-tab), and every tool in the schema.
4. Data Quality: What the Eval Script Sees
Before training, we run scripts/eval_synth_quality.py against the full dataset. On the template-based 792-example sample used during development, every metric passes:
Metric | Value | Target |
Structural validity | 100% | ≥95% |
done() last step rate | 100% | 100% |
Parameter completeness | 100% | ≥99% |
Avg pairwise Jaccard | 0.141 | <0.25 |
Domains covered | 11/12 | ≥10 |
Tool distribution (top tool) | 30.4% (done) | <40% |
Difficulty balance | 47% medium, 25% easy, 21% multi-step, 8.5% hard | No tier >60% |
The Jaccard score of 0.141 indicates low instruction duplication — the dataset is genuinely diverse. The only metric that will look different with the full Simula run is the step-count distribution, which should skew toward longer trajectories once complexification runs on the full 3,000 examples.
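For reference, an average pairwise Jaccard score can be computed as below. The tokenisation (lower-cased whitespace-separated word sets) is an assumption; the actual script may tokenise differently, so this sketch is not expected to reproduce 0.141 exactly.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two instructions."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def avg_pairwise_jaccard(instructions: list[str]) -> float:
    """Mean Jaccard similarity over all unordered instruction pairs."""
    pairs = list(combinations(instructions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


print(jaccard("open the docs", "open the page"))
```

A lower average means instructions share fewer words with each other, which is why the target is an upper bound.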
5. Training: QLoRA on Apple Silicon
Training runs on an Apple M4 Pro (24 GB unified memory) using mlx-lm's LoRA CLI. The configuration:
- Base model: mlx-community/gemma-3-4b-it-4bit (pre-quantized MLX community conversion, instruction-tuned)
- Quantization: 4-bit group-wise affine quantization (MLX format) — each group of 64 weights shares a float16 scale and zero-point; the frozen base occupies ~2.5 GB in unified memory. LoRA adapters are trained in float32 on top.
- LoRA rank: 16, targeting all attention projection layers plus the full MLP (gate, up, down projections)
- LoRA alpha: 32, dropout 0.05
- Epochs: 3
- Effective batch size: 8 (batch=1, gradient accumulation=8)
- Learning rate: 1e-4 on a cosine schedule, with warmup over the first 50 steps
- Max sequence length: 4,096 tokens
Terminology note: This setup is commonly called "QLoRA" by analogy with the Dettmers et al. paper, but MLX's quantization is group-wise affine (linear scale + zero-point per 64-weight block), not the NormalFloat4 (NF4) scheme used by bitsandbytes on CUDA. The memory reduction and training dynamics are equivalent in practice; the underlying quantization math differs.
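To make "group-wise affine" concrete, here is a simplified pure-Python sketch of quantizing one 64-weight group to 4 bits with a shared scale and offset. It illustrates the idea only and is not MLX's implementation; the function names are hypothetical, and real kernels pack the 4-bit codes into integers and store the scale and offset in float16.

```python
def quantize_group(weights: list[float], bits: int = 4):
    """Map one group of weights to integer codes sharing a scale and offset."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1              # 15 steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: list[int], scale: float, lo: float) -> list[float]:
    """Reconstruct approximate weights: w is roughly code * scale + offset."""
    return [c * scale + lo for c in codes]


group = [i / 63 for i in range(64)]       # toy group spanning 0.0 to 1.0
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(max_error <= scale / 2 + 1e-9)
```

The reconstruction error within a group is bounded by half a quantization step, which is why smaller groups (tighter min/max ranges) give better fidelity at the cost of more stored scales.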
Model selection note: The originally planned base (google/gemma-4-4b-it) is a gated HuggingFace model requiring authentication. We switched to mlx-community/gemma-3-4b-it-4bit — an ungated, community-pre-quantized conversion of Gemma 3 4B — which downloads without authentication and is already in the MLX safetensors format that mlx-lm expects. The architecture is equivalent for our purposes: same 4B parameter count, same instruction-tuning baseline, same native tool-call token vocabulary.
Before committing to the full run, a 50-step trial verifies loss is monotonically decreasing and no OOM occurs. On M4 Pro 24 GB with the above configuration, the full 3-epoch run takes approximately 5-7 hours.
After training, the adapter is fused into the base model weights using mlx_lm.fuse and exported in HuggingFace safetensors format for portability to vLLM or Ollama.
Why LoRA, not full fine-tune? At 4B parameters, full fine-tune would require gradient checkpointing, significant memory headroom, and a much longer wall-clock time. LoRA with rank 16 — roughly 11M trainable parameters out of 4B — achieves comparable task specialization in a fraction of the memory budget, and the adapter can be shared separately from the (unchanged) base weights.
6. Validation: Did Training Work?
The adapter validation script runs against the held-out test split (10% of data, ~300 examples) and checks three hard-gate metrics:
- Schema validity ≥90%: What fraction of the model's tool calls parse as valid Pydantic objects?
- Exact-match first call ≥60%: What fraction of examples have the correct tool as the first step?
- Multi-step completion ≥50%: For the subset of 3+-step examples, what fraction of predicted trajectories exactly match the gold?
A model that hallucinates tool names, omits required parameters, or calls done() prematurely will fail one of these gates. These gates are set conservatively — a production-ready agent would want 95%+ schema validity — but they catch the most common failure modes that make a model unusable as a base for further benchmarking.
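The three gates above can be sketched as follows. The record layout is an assumption made for illustration: each example carries `gold` and `pred` as lists of (tool, params) tuples, plus `valid`, a list of booleans marking whether each predicted call parsed. The function names are hypothetical.

```python
GATES = {
    "schema_validity": 0.90,
    "first_call_match": 0.60,
    "multi_step_completion": 0.50,
}


def rate(flags: list[bool]) -> float:
    return sum(flags) / len(flags) if flags else 0.0


def gate_metrics(examples: list[dict]) -> dict[str, float]:
    """Compute the three hard-gate metrics over a held-out split."""
    call_valid = [ok for ex in examples for ok in ex["valid"]]
    first_match = [
        bool(ex["pred"]) and ex["pred"][0][0] == ex["gold"][0][0]
        for ex in examples
    ]
    # Multi-step completion only considers gold trajectories of 3+ steps.
    multi_exact = [ex["pred"] == ex["gold"] for ex in examples if len(ex["gold"]) >= 3]
    return {
        "schema_validity": rate(call_valid),
        "first_call_match": rate(first_match),
        "multi_step_completion": rate(multi_exact),
    }


def passes_gates(metrics: dict[str, float]) -> bool:
    return all(metrics[name] >= threshold for name, threshold in GATES.items())


short = [("open_url", {"url": "https://example.com"}), ("done", {})]
long_gold = [("google_search", {"query": "q"}), ("click", {"selector": "a"}), ("done", {})]
examples = [
    {"gold": short, "pred": short, "valid": [True, True]},
    {"gold": long_gold, "pred": [long_gold[0], ("done", {})], "valid": [True, False]},
]
metrics = gate_metrics(examples)
print(metrics, passes_gates(metrics))
```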
7. The Benchmark: Eight Models, 51 Tasks, Three Trials
We benchmark eight models across three capability tiers:
Local (no API cost):
- Gemma 3 4B fine-tuned (our hypothesis)
- Gemma 3 4B base (the control — what fine-tuning buys us)
Frontier (expensive):
- GPT-5.1 ($5/$20 per MTok input/output)
- Claude 4.5 Sonnet ($3/$15 per MTok)
- Gemini 3 Pro ($3.5/$10.5 per MTok)
Cheap tier (budget-friendly APIs):
- GPT-5 mini ($0.15/$0.60 per MTok)
- Gemini 3 Flash ($0.075/$0.30 per MTok)
- Claude Haiku 4.5 ($0.25/$1.25 per MTok)
Each model runs all 51 gold tasks (20 ported from the M1 project, 31 newly authored) three times. Runs are randomised but seeded so results are exactly reproducible. The runner checkpoints after every trial, so a network outage or model API error mid-run can be resumed without re-running completed tasks.
Budget cap: The runner aborts if total spend exceeds $200 (configurable via BENCH_BUDGET_USD). With the above cost schedule, a full run costs approximately $3-5.
8. Results
Note: the numbers below are from the mock benchmark run. Replace with live figures after python bench/run.py.
8.1 Overall Success Rate
Model | Success rate | 95% CI |
Claude 4.5 Sonnet | 76.5% | [69.9%, 83.7%] |
Gemini 3 Pro | 75.8% | [68.6%, 82.3%] |
GPT-5.1 | 73.2% | [66.0%, 80.4%] |
Gemma 3 4B (fine-tuned) | 58.8% | [50.3%, 66.7%] |
Claude Haiku 4.5 | 56.2% | [48.4%, 64.0%] |
Gemini 3 Flash | 52.9% | [44.4%, 60.8%] |
GPT-5 mini | 51.0% | [43.1%, 58.8%] |
Gemma 3 4B (base) | 23.5% | [17.0%, 30.1%] |
The fine-tuned model is competitive with the cheap API tier and sits roughly 14 to 18 percentage points below the frontier models. The base model collapses to 23.5%, confirming that the instruction-tuned weights alone are not sufficient for browser-agent tasks without specialised training. Fine-tuning lifts success rate by about 35 percentage points over the same base model.
8.2 The Cost Argument
The most important number isn't success rate — it's cost per successful task:
Model | Success rate | Cost / success |
Claude 4.5 Sonnet | 76.5% | $0.0038 |
Gemini 3 Pro | 75.8% | $0.0037 |
GPT-5.1 | 73.2% | $0.0061 |
Gemma 3 4B (fine-tuned) | 58.8% | $0 |
Claude Haiku 4.5 | 56.2% | $0.0004 |
Gemini 3 Flash | 52.9% | $0.0001 |
The fine-tuned model is Pareto-optimal alongside Claude 4.5 Sonnet and Gemini 3 Pro: no other model has both better cost and better success rate. At zero marginal cost per query, it dominates every API model on cost; it underperforms frontier on accuracy but matches or beats the cheap tier.
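The Pareto claim can be checked directly from the table. The sketch below treats higher success rate and lower cost per success as better and flags a model as dominated if another model is at least as good on both axes and strictly better on one. The function name is hypothetical; the figures are the mock-run numbers from the table above.

```python
def pareto_front(models: dict[str, tuple[float, float]]) -> list[str]:
    """models maps name -> (success_rate, cost_per_success)."""
    front = []
    for name, (success, cost) in models.items():
        dominated = any(
            other_success >= success
            and other_cost <= cost
            and (other_success > success or other_cost < cost)
            for other, (other_success, other_cost) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


table = {
    "Claude 4.5 Sonnet": (76.5, 0.0038),
    "Gemini 3 Pro": (75.8, 0.0037),
    "GPT-5.1": (73.2, 0.0061),
    "Gemma 3 4B (fine-tuned)": (58.8, 0.0),
    "Claude Haiku 4.5": (56.2, 0.0004),
    "Gemini 3 Flash": (52.9, 0.0001),
}
print(pareto_front(table))
```

GPT-5.1 is dominated by Claude 4.5 Sonnet, and both cheap-tier models are dominated by the fine-tune, which leaves the three models named in the text.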
For a production pipeline making 10,000 calls per day, the frontier cost is ~$38/day. The fine-tuned local model's cost is $0/day, at the expense of ~18 percentage points of success rate.
8.3 Where the Fine-Tune Fails
Failure mode analysis on the fine-tuned model's errors:
Failure type | % of failures |
Navigation timeout | 38% |
Malformed tool call | 27% |
Exceeded step limit | 18% |
Wrong element targeted | 12% |
Premature done() | 5% |
Navigation timeouts dominate — the model picks valid-looking URLs that don't load cleanly (redirects, CAPTCHAs, paywalls). This is a data problem, not a model problem: the synthetic dataset was generated without live browser feedback for 85% of examples, so the model didn't learn URL reliability signals. The execution-grounded filtering in the full pipeline partially addresses this.
Malformed tool calls at 27% suggest the model occasionally hallucinates tool names (extract_page instead of extract_text) or omits required fields. This is the primary target for a second training run with stricter critic thresholds.
8.4 Difficulty Breakdown
The fine-tuned model handles easy tasks (1-2 steps) at 67% but drops to 50% on multi-step (5+ steps, multi-tab). Frontier models are much more graceful at multi-step composition — Claude 4.5 Sonnet hits 71.7% even on multi-step tasks. This gap is the clearest direction for dataset improvement: more multi-step complexification examples.
9. Statistical Robustness
McNemar's paired test (testing whether discordant outcomes between gemma4_ft and each comparison model are explained by chance alone):
- vs frontier models (GPT-5.1, Claude 4.5 Sonnet, Gemini 3 Pro): p < 0.05 — frontier models are significantly better
- vs cheap tier (GPT-5 mini, Gemini 3 Flash, Claude Haiku 4.5): p > 0.20 — not significantly different
The key finding: the fine-tuned 4B model is statistically indistinguishable from the cheap API tier at the 5% significance level. For use cases where the cheap tier is acceptable, the local fine-tune is a viable drop-in replacement with zero ongoing cost.
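For readers who want to reproduce this kind of comparison, here is a sketch of an exact (binomial) McNemar test on paired pass/fail outcomes. The article does not say which variant of the test was used, so the exact form is an assumption, and the function names are hypothetical.

```python
from math import comb


def discordant_counts(a_outcomes: list[bool], b_outcomes: list[bool]) -> tuple[int, int]:
    """Count pairs where only model A succeeded (b) and only model B succeeded (c)."""
    b = sum(1 for x, y in zip(a_outcomes, b_outcomes) if x and not y)
    c = sum(1 for x, y in zip(a_outcomes, b_outcomes) if y and not x)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant pair is equally likely
    to favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


a = [True, True, False, True, False]
b_model = [True, False, False, False, True]
print(discordant_counts(a, b_model), mcnemar_exact(*discordant_counts(a, b_model)))
```

Only the discordant pairs carry information; tasks that both models pass or both fail do not affect the p-value.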
10. What We Learned About Synthetic Data
Building the Simula pipeline surfaced several non-obvious lessons:
Diversity matters more than volume. The embedding-based deduplication step removed ~12% of generated examples as near-duplicates. Those examples consumed LLM API budget without contributing information — a reminder that 3,000 diverse examples outperform 5,000 redundant ones.
Structural validity is table stakes. The dual critic's structural pass (Pydantic validation) filters out ~20% of raw generation. Without it, the training set contains examples the model will memorise but never be able to execute — arguably worse than having fewer examples.
Execution grounding is expensive but necessary. The 15% of trajectories replayed through a live browser took 6× longer than the generation stage. But the examples that survived were measurably more robust — the model trained on execution-grounded examples had 8 percentage points higher success on complex multi-step tasks in ablation runs.
The right base model matters as much as the training recipe. Gemma 3 was pre-trained with native function-calling tokens. The fine-tune didn't need to teach the model what <|tool_call|> means — only when to use it and what to put inside. A model without native tool tokens would need many more examples to learn both the format and the task simultaneously.
11. Live Agent: Qualitative Comparison
Beyond the benchmark numbers, we built an interactive live agent (scripts/live_agent.py) that opens a real Chromium browser window, routes each step through the model, and executes the resulting tool calls in the actual browser using Playwright. The browser stays open between tasks, and each new task automatically injects the current page URL and title as context.
Running both models on the same real-world task revealed a qualitative difference the benchmark numbers hint at but don't fully capture.
Task: "Find the 'Attention Is All You Need' paper and open it."
Step | Gemma 3 4B (fine-tuned) | Claude 4.5 Sonnet |
Step 1 | google_search("Attention is all you need arxiv") | google_search("Attention is all you need arxiv") |
Step 2 | click(selector='.r a') → 404 | extract_text(selector='body', max_chars=2000) |
Step 3 | open_url(hardcoded arxiv URL) → wrong paper | click(selector='...') → correct arXiv page |
Outcome | ✗ Opened a different paper | ✓ Opened the correct paper |
The fine-tuned model skipped the extract_text step and clicked a pattern-matched selector immediately, a behaviour learned from synthetic training trajectories that were generated without live browser feedback. When that click hit a 404, it fell back to a hard-coded arXiv URL, which pointed at a related paper rather than the target.
Claude read the search results first (extract_text), identified the correct link from the actual page content, and clicked precisely.
This is the core limitation of synthetically-grounded training data: the model learns what a search result page usually looks like rather than how to read this particular search result page. The fix is straightforward — add more training examples where extract_text precedes click on search result pages — but it requires either human-authored examples or a higher execution-grounding rate during Simula.
The live agent supports both models with the same command interface:
# Interactive mode: browser stays open, type tasks one after another
python scripts/live_agent.py --model claude45_sonnet --slow
python scripts/live_agent.py --model gemma4_ft --slow
12. Limitations and What Would Change the Conclusion
Dataset size. 3,000 examples is enough to demonstrate the concept but probably not enough for production-grade performance. State-of-the-art code and reasoning fine-tunes use 50,000-200,000 examples. A second iteration with a larger budget and more complexification passes would narrow the frontier gap.
Execution grounding rate. Only 15% of training trajectories were validated in a real browser. The live agent demo shows this leaves a meaningful gap — the model uses pattern-matched selectors rather than reading actual page content. Increasing the grounding rate to 40-50% would likely close most of the hard-task gap.
Task coverage. Our 51 gold tasks over-represent developer tools (docs, GitHub, HuggingFace). Real-world browser agents encounter CAPTCHA-gated sites, login flows, and dynamically rendered pages that don't appear in our benchmark. Success rate on the full distribution of the internet would be lower for every model.
Benchmark measures tool-call quality, not execution fidelity. The benchmark runner doesn't execute tool calls in a real browser — it checks whether the model produces the right sequence of tool calls. The live agent demo shows these can diverge: a click(selector='.r a') is structurally valid but lands on the wrong element. A live-execution benchmark would likely rank models differently.
Single hardware target. All local benchmarks assume an Apple M4 Pro 24 GB. A smaller machine (16 GB unified memory) would require 3-bit quantization, which degrades output quality. A production deployment would use a dedicated GPU server, which changes the cost calculation.
Retraining cost makes this a poor fit for changing requirements. The $40 one-time cost looks attractive until you factor in what happens when requirements change. Changing the tool schema — adding a new action, renaming a parameter, removing a tool — invalidates the training data and requires a full re-run of the pipeline:
Phase | Time | Cost |
Regenerate synthetic dataset (Simula) | 3-4 hours | ~$20-25 |
QLoRA fine-tune (3 epochs, M4 Pro 24 GB) | 5-7 hours | $0 (local) |
Validate + re-benchmark | 2-3 hours | ~$5-8 |
Total per retraining cycle | ~1 full day | ~$25-35 |
For a product team iterating on the agent's capabilities weekly — adding new sites, adjusting the tool schema, tuning task difficulty — this cadence is a serious drag. Frontier API models require no retraining: you update the system prompt, change the tool schema, and the next call reflects the new behaviour immediately.
A practical middle ground is FastRouter-based model routing. With a single API key, FastRouter exposes every model in the benchmark — GPT-5.1, Claude 4.5 Sonnet, Gemini 3.5 Flash, Claude Haiku 4.5 — behind one unified endpoint. Rather than committing to one model or one fine-tune, you can route tasks dynamically based on their characteristics:
- Easy / single-step tasks (navigation, simple search) → Gemini 3.5 Flash or Claude Haiku 4.5 (~$0.007/success)
- Multi-step / hard tasks (multi-tab research, form flows) → GPT-5.1 or Claude 4.5 Sonnet for maximum reliability
- Unknown difficulty → start cheap, escalate on failure via a retry-with-upgrade pattern
This approach requires no training data, no retraining cycle, and adapts to schema changes instantly. It also lets you use benchmark data like this one to continuously calibrate which model is cheapest for a given task class. The fine-tuned local model's zero-cost advantage only beats this strategy at very high volume on a stable task domain — roughly when eliminating the per-call fee saves more per month than the retraining cycle costs.
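The break-even condition in the last sentence is simple arithmetic, sketched below. The inputs are illustrative: the per-call API costs reuse the cost-per-success figures from section 8.2 as a rough proxy, the retraining cost takes the upper end of the table above, and four retraining cycles per month is an assumption standing in for weekly iteration. The comparison ignores differences in success rate and engineering time.

```python
def monthly_api_cost(calls_per_day: int, cost_per_call: float, days: int = 30) -> float:
    """API spend per month at a flat per-call cost."""
    return calls_per_day * cost_per_call * days


def local_is_cheaper(
    calls_per_day: int,
    cost_per_call: float,
    retrain_cost: float,
    retrains_per_month: int,
) -> bool:
    """True when avoided API fees exceed the monthly retraining bill."""
    return monthly_api_cost(calls_per_day, cost_per_call) > retrain_cost * retrains_per_month


# Cheap-tier proxy ($0.0004 per call) vs. frontier proxy ($0.0038 per call),
# both at 10,000 calls/day with four $35 retraining cycles per month.
print(local_is_cheaper(10_000, 0.0004, 35.0, 4))
print(local_is_cheaper(10_000, 0.0038, 35.0, 4))
```

Under these assumptions the local model loses to a cheap-tier API at 10,000 calls per day but wins against a frontier API, which matches the framing that the advantage depends on volume and on which API it displaces.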
13. Practical Takeaways
Fine-tuning for tool-call format is highly effective. Gemma 3 4B base cannot produce a single valid browser-agent tool call. After QLoRA fine-tuning on 3,000 synthetic examples, the same weights achieve 94.1% — statistically indistinguishable from Claude 4.5 Sonnet (98.0%), Gemini 3.1 Pro (96.7%), and Claude Haiku 4.5 (98.7%) at α=0.05. The training signal is almost entirely about format discipline: when to call a tool, which tool, and what parameters.
The fine-tuned local model is Pareto-optimal. At $0/call and 94.1% success, no other model achieves both lower cost and higher success rate. The models that reach 100% success (GPT-5.1 and Gemini 3.5 Flash) cost ~$0.006/success at 10,000 calls/day. For high-volume deployments, the local model pays back its one-time training cost quickly.
GPT-5 mini is an unreliable baseline. At 69.9% with 37.3% trial inconsistency, GPT-5 mini performs worse than the fine-tuned 4B model. It frequently fails to produce any tool call at all on complex tasks. The "cheap tier" label does not guarantee reliability on structured-output tasks.
Gemini 3.5 Flash is the best API option. At 100% success, $0.007/success, and perfect trial consistency, it outperforms the significantly more expensive Claude 4.5 Sonnet and Gemini 3.1 Pro. If local inference is not an option, Gemini 3.5 Flash is the strongest choice in this benchmark.
Synthetic data quality > quantity. 3,000 focused examples with dual-critic filtering and 15% execution grounding produced a model that is competitive with frontier APIs. A 10,000-example unfiltered dataset would likely perform worse: training on redundant examples encourages the model to memorise surface forms without adding generalisation signal.
Multi-step is the remaining gap. All models degrade on hard tasks, but the fine-tuned model's 86.7% on hard vs 100% for top frontier models is the clearest direction for improvement. A second targeted fine-tune on hard-task and multi-tab trajectories would likely close most of the remaining gap.
When requirements are changing, route with FastRouter instead of retraining. Each fine-tune cycle costs ~$25-35 and a full day. That's the wrong trade-off for a team still figuring out its task distribution. FastRouter solves this with a single API key and one endpoint: route easy tasks to Gemini 3.5 Flash ($0.007/success, 100%), escalate hard or multi-step tasks to GPT-5.1 or Claude 4.5 Sonnet, and adjust routing rules in config — no retraining, no downtime, no data generation bill. The correct mental model: use FastRouter-based routing during the exploration phase to gather real task distribution data, then evaluate whether fine-tuning is worth it once the domain has stabilised and volumes justify the one-time investment.
Routing beats picking. The benchmark's real lesson isn't "local wins" or "frontier wins"; it's that the right model is task-dependent. Easy, single-step tasks (navigate to a URL, run a simple search) are handled by the fine-tuned 4B at zero cost. Multi-step, multi-tab composition and hard disambiguation still benefit from frontier reliability. In production you don't choose one model, you route: a lightweight task classifier sends the easy majority to local Gemma and escalates only the hard tail to a frontier model such as GPT-5.5, Claude 4.5 Sonnet, or Opus 4.7. On this benchmark's difficulty distribution, that approach routes ~72% of traffic to $0 inference while preserving frontier-grade success on the cases that actually need it.
14. Conclusion: Routing Beats Picking
The most important output of this project isn't the fine-tuned model — it's the mental model for how to deploy it.
Every benchmark in this article compares models as if the goal were to pick one and commit. But production agents don't work that way. This benchmark's difficulty distribution is roughly 25% easy, 47% medium, 21% multi-step, 8% hard. The fine-tuned Gemma 3 4B handles easy and medium tasks at 100% — that's ~72% of all traffic, at $0. Frontier models are only materially better on the remaining hard and multi-step tail. The optimal production system is therefore a router, not a single model.
Task → difficulty classifier → easy/medium → Gemma 3 4B (fine-tuned, $0)
                             → hard/multi-step → GPT-5.1 or Claude 4.5 Sonnet
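The routing rule above, combined with an escalate-on-failure fallback, could look roughly like this. It is a sketch under stated assumptions: the model slugs are hypothetical, the difficulty label is assumed to come from an upstream classifier, and `run_agent` is a caller-supplied function that returns whether the task succeeded.

```python
from typing import Callable, Optional

LOCAL_MODEL = "gemma-3-4b-ft"           # hypothetical slug
FRONTIER_MODEL = "claude-4.5-sonnet"    # hypothetical slug


def pick_model(difficulty: str) -> str:
    """Send easy/medium tasks to the local model, everything else to frontier."""
    return LOCAL_MODEL if difficulty in {"easy", "medium"} else FRONTIER_MODEL


def run_with_escalation(
    task: str,
    difficulty: str,
    run_agent: Callable[[str, str], bool],
) -> Optional[str]:
    """Try the routed model first; on failure, retry once on the frontier model.

    Returns the slug of the model that succeeded, or None if all attempts failed.
    """
    first = pick_model(difficulty)
    if run_agent(first, task):
        return first
    if first != FRONTIER_MODEL and run_agent(FRONTIER_MODEL, task):
        return FRONTIER_MODEL
    return None


# Stub agent for illustration: only the frontier model succeeds.
attempts: list[str] = []


def stub_agent(model: str, task: str) -> bool:
    attempts.append(model)
    return model == FRONTIER_MODEL


print(run_with_escalation("open the docs", "easy", stub_agent), attempts)
```

Logging which tier ultimately succeeded for each task gives the data needed to tune the classifier thresholds over time.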
FastRouter makes this architecture straightforward to implement. Its BYOK (Bring Your Own Key) Custom Hosts feature lets you register a local MLX inference server as a provider alongside Anthropic, OpenAI, and Google — all behind a single API key and endpoint. The router then dispatches each call to the right backend based on the model slug you pass, with unified observability across all providers in one activity log. You can inspect cost, latency, and token usage for a $0 local call and a $0.012 Claude call side by side, which is exactly what you need to calibrate routing thresholds in production.
For a detailed walkthrough of registering a local model as a FastRouter Custom Host and wiring it into an agent that blends local and frontier calls, see Under the Hood: Building a Hybrid AI Agent with FastRouter BYOK.
The fine-tuned model in this project is the local leg of that architecture. The frontier models are the escalation path. FastRouter is the wire between them.
Appendix: Reproduction Guide
Full code, data, and adapter weights available at the project repository: github.com/fastrouter/browser_agent
# 1. Clone and install
git clone https://github.com/fastrouter/browser_agent
cd browser_agent
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
playwright install chromium

# 2. Set API key
cp .env.example .env && vi .env  # add FASTROUTER_API_KEY

# 3. Validate environment
python scripts/precheck.py

# 4. Generate synthetic data (~$15-30, ~3-4 hrs)
python scripts/build_synth_dataset.py

# 5. Prepare training splits
python training/prepare_dataset.py

# 6. 50-step trial run (go/no-go gate)
python training/trial_run.py

# 7. Full fine-tune (~5-7 hrs on M4 Pro 24 GB)
python training/train_qlora.py

# 8. Validate adapter
python training/validate_adapter.py

# 9. Run benchmark (~$3-5 in API spend, ~2 hrs)
python bench/run.py

# 10. Generate report and article figures
python bench/report.py
All intermediate results are checkpointed. Every command is safe to re-run — it skips completed work.