Reflection-driven optimization
A reflection model reads your prompt's failures and proposes targeted edits, so every mutation is grounded in what actually went wrong, not guesswork.
GEPA (Genetic-Pareto optimization) evolves your LLM prompts over iterations using reflection and mutation, then keeps only the Pareto-best variants, measured against LLM-judge metrics on your own data.
Example result: Status: Optimization complete · Final score: 1.000 · Accuracy: 9.7 · Helpfulness: 9.5 · Safety: 10
You are a senior support specialist. Read the customer's message, identify the underlying issue, then reply with a concise, empathetic, step-by-step resolution. If key details are missing, ask exactly one clarifying question before…
Stop hand-tuning prompts and guessing. GEPA searches for better prompts the way a strong engineer would: by studying real failures, proposing fixes, and keeping only what measurably wins.
Each candidate is scored on every metric and only Pareto-optimal prompts survive, so improvements compound instead of trading one metric away for another.
Light, Medium, and Heavy tiers cap how far the search runs, and a Credit Utilization Estimate appears before each run so spend is predictable.
Each iteration runs your prompt on real data, reflects on the failures, mutates the prompt, and scores every candidate so only Pareto-optimal variants survive into the next round.
GEPA engine
Three mechanisms, every iteration
Reflection
Diagnoses why an output failed.
Mutation
Proposes targeted prompt edits.
Pareto selection
Keeps only variants that improve.
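Pareto selection can be illustrated with a small sketch. This is not FastRouter's implementation; the candidate names and scores below are made up for illustration, and the only idea taken from the text is that a variant survives when no other variant is at least as good on every metric and strictly better on one.

```python
from typing import Dict, List

Scores = Dict[str, float]  # metric name -> judge score in [0, 1]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical scores, invented for this example.
candidates = {
    "baseline": {"accuracy": 0.70, "helpfulness": 0.70, "safety": 0.90},
    "variant_a": {"accuracy": 0.85, "helpfulness": 0.75, "safety": 0.90},
    "variant_b": {"accuracy": 0.80, "helpfulness": 0.90, "safety": 0.85},
    "variant_c": {"accuracy": 0.80, "helpfulness": 0.70, "safety": 0.85},
}

front = pareto_front(candidates)
print(front)  # ['variant_a', 'variant_b']
```

Here variant_a and variant_b both survive because each wins on a different metric, while the baseline and variant_c are beaten across the board by variant_a.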
01 · Base prompt
Start from your current production system prompt.
02 · Run on data
Execute across imported logs and uploaded datasets.
03 · Reflect on failures
The reflection model diagnoses what went wrong.
04 · Mutate prompt
Targeted edits are proposed from that feedback.
05 · Evaluate (Pareto)
LLM judges score every metric from 0 to 1.
06 · Keep the best
Pareto selection retains the top-performing variants.
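The six steps above can be sketched as a small loop. This is a toy under stated assumptions, not the GEPA engine itself: `optimize`, `toy_evaluate`, `toy_reflect`, and `toy_mutate` are hypothetical names, the "models" are deterministic stand-ins instead of LLM calls, and picking the parent by summed score is a simplification chosen for brevity.

```python
from typing import Callable, Dict

Scores = Dict[str, float]  # metric name -> score in [0, 1]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)


def optimize(
    base_prompt: str,
    evaluate: Callable[[str], Scores],
    reflect: Callable[[str, Scores], str],
    mutate: Callable[[str, str], str],
    iterations: int,
) -> Dict[str, Scores]:
    """Return the surviving (non-dominated) prompts and their scores."""
    pool = {base_prompt: evaluate(base_prompt)}  # steps 01, 02, 05 for the baseline
    for _ in range(iterations):
        parent = max(pool, key=lambda p: sum(pool[p].values()))
        feedback = reflect(parent, pool[parent])  # step 03
        child = mutate(parent, feedback)          # step 04
        if child in pool:
            continue                              # nothing new to test
        child_scores = evaluate(child)            # steps 02 and 05
        if any(dominates(s, child_scores) for s in pool.values()):
            continue                              # rejected
        # Step 06: accept the child and drop anything it dominates.
        pool = {p: s for p, s in pool.items() if not dominates(child_scores, s)}
        pool[child] = child_scores
    return pool


# Deterministic stand-ins for the evaluator, reflection, and mutation steps.
HINTS = {"accuracy": "Cite the relevant policy.", "helpfulness": "Give numbered steps."}


def toy_evaluate(prompt: str) -> Scores:
    return {metric: (0.9 if hint in prompt else 0.5) for metric, hint in HINTS.items()}


def toy_reflect(prompt: str, scores: Scores) -> str:
    return min(scores, key=scores.get)  # name the weakest metric


def toy_mutate(prompt: str, feedback: str) -> str:
    hint = HINTS[feedback]
    return prompt if hint in prompt else f"{prompt} {hint}"


result = optimize(
    "You are a support specialist.", toy_evaluate, toy_reflect, toy_mutate, iterations=5
)
for prompt, scores in result.items():
    print(scores, prompt)
```

In this toy run the first two iterations are accepted, each fixing the weakest metric, and the remaining iterations propose nothing new, so a single prompt survives.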
Import the examples that matter: chat-completion logs straight from your Activity Log, or your own CSV, JSON, or JSONL files. Good and Bad annotations become the signal GEPA prioritizes.
Pull real chat-completion logs you've already served in production into an optimization run.
Upload labeled examples as CSV, JSON, or JSONL to target a specific task or edge case.
Thumbs-up and thumbs-down annotations become a textual gradient that steers which failures get fixed first.
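As a rough illustration of the upload path, the sketch below parses a JSONL dataset and orders the rows so that Bad-annotated examples come first. The field names (`input`, `output`, `annotation`), the placeholder outputs, and the annotation values are assumptions made for this example; the actual schema FastRouter expects is not described here.

```python
import json
from typing import Dict, List

# Hypothetical JSONL content; field names and annotations are illustrative only.
RAW_JSONL = """\
{"input": "Rotate an API key without downtime?", "output": "(reply)", "annotation": "good"}
{"input": "What is the refund window for annual plans?", "output": "(reply)", "annotation": "bad"}
{"input": "Why did my request fail with a 429?", "output": "(reply)", "annotation": "bad"}
{"input": "Steps to invite a project member", "output": "(reply)"}
"""


def load_examples(raw: str) -> List[Dict[str, str]]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def prioritize(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Order examples Bad first, unannotated next, Good last (stable within groups)."""
    rank = {"bad": 0, "good": 2}
    return sorted(examples, key=lambda ex: rank.get(ex.get("annotation"), 1))


examples = prioritize(load_examples(RAW_JSONL))
for ex in examples:
    print(ex.get("annotation", "none"), "|", ex["input"])
```

Sorting is only one way to model "which failures get fixed first"; a real optimizer could just as well weight or sample by annotation.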
Imported input data
Rotate an API key without downtime?
What is the refund window for annual plans?
Why did my request fail with a 429?
Steps to invite a project member
Choose the model whose prompt ships to production, a cheaper model to reflect on failures, a budget tier, and the LLM-as-a-judge that scores every candidate.
The model whose prompt is optimized and then used in production with the refined instructions.
Reviews failures and proposes edits. It can be a cheaper model to keep optimization affordable.
Choose Light, Medium, or Heavy. Each mode sets the iteration count and a max batch size; FastRouter sizes the actual batch per step for you.
Optimizer configuration
Optimization model
Refined & shipped to production
Reflection model
Reviews failures, proposes edits
Evaluator model
Shared LLM-as-a-judge
Iterations: 25
Max batch size: 5
The details view shows the optimized prompt, its composite score versus your baseline, the improvement percentage, and a full breakdown of every accepted and rejected iteration.
A single 0-1 composite across your chosen metrics tracks the strength of every candidate.
See how each iteration scored on Accuracy, Helpfulness, Safety, and any other metric you added.
The final score is reported as a clear percentage lift over your starting prompt.
All iterations
Default: Baseline
Iteration 1: Accepted
Iteration 2: Rejected
Iteration 3: Accepted
Iteration 4: Best kept
Higher tiers run more iterations for more quality at a higher cost. Start light to find direction, then go heavier when you're ready to maximize a high-traffic prompt.
| Detail | Light | Medium (Recommended) | Heavy |
|---|---|---|---|
| Optimization budget | | | |
| Iterations | 10 | 25 | 50 |
| Max batch size | 3 | 5 | 10 |
| Best for | Quick direction | Balanced quality | Maximum quality |
| Relative cost | Low | Medium | High |
| Turnaround | Fastest | Balanced | Most thorough |
Max batch size is the upper bound per step; FastRouter may evaluate fewer samples depending on your dataset. Each iteration calls the optimization, reflection, and evaluator models, so higher budgets cost more. A Credit Utilization Estimate is shown before every run.
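To get a feel for how the tiers scale, the sketch below computes an upper bound on the number of samples evaluated per run. It assumes one batch per iteration and a batch capped by the dataset size; both are simplifications for illustration. `Tier`, `TIERS`, and `max_samples_evaluated` are hypothetical names, and this does not reproduce FastRouter's Credit Utilization Estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    iterations: int
    max_batch_size: int


# Values taken from the tier table above.
TIERS = {
    "light": Tier(iterations=10, max_batch_size=3),
    "medium": Tier(iterations=25, max_batch_size=5),
    "heavy": Tier(iterations=50, max_batch_size=10),
}


def max_samples_evaluated(tier_name: str, dataset_size: int) -> int:
    """Upper bound on samples scored in one run (assumes one batch per iteration)."""
    tier = TIERS[tier_name]
    batch = min(tier.max_batch_size, dataset_size)  # assumed cap, not a documented rule
    return tier.iterations * batch


for name in TIERS:
    print(name, max_samples_evaluated(name, dataset_size=100))
# light 30
# medium 125
# heavy 500
```

With a dataset of only 4 rows, the Heavy bound under these assumptions drops to 200, which is one way to see why the actual batch can be smaller than the tier maximum.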
GEPA shines whenever prompt quality directly affects outcomes, and when the failures you care about are already sitting in your logs.
Raise answer quality without endless manual prompt tinkering: GEPA runs the search and surfaces what works.
Feed in real logged failures so the optimizer fixes the exact cases your users actually hit.
Lock a consistent voice and safety posture across an entire product surface, not just one prompt.
Squeeze maximum quality out of a prompt before it serves high-volume production traffic.
GEPA (Genetic-Pareto) is a state-of-the-art evolutionary algorithm for prompts. Instead of hand-editing, it runs your prompt on real data, uses a reflection model to analyze failures, mutates the prompt based on that feedback, scores each candidate with LLM-as-a-judge metrics, and keeps only the Pareto-optimal variants. Repeating this loop over many iterations steadily evolves a stronger prompt.
Any supported chat model can be the Optimization Model, the one whose prompt is refined and shipped to production. The Reflection Model that reviews failures and the shared Evaluator Model that scores candidates can be different, often cheaper, models. For example, you might run GPT-5.5 in production while Claude Opus 4.8 reflects on failures and Gemini 3 Pro acts as the judge.
Each evaluation metric is an LLM-as-a-judge criterion that returns a score between 0 and 1 plus written feedback. GEPA combines your metrics into a single composite score, and the details view reports the optimized prompt's final composite against your baseline as a clear improvement percentage.
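The arithmetic behind a composite score and an improvement percentage might look like the sketch below. The text does not say how metrics are combined, so the unweighted mean is an assumption, and the scores are invented for the example; `composite` and `improvement_pct` are hypothetical helper names.

```python
from statistics import mean
from typing import Dict


def composite(scores: Dict[str, float]) -> float:
    """Combine per-metric scores in [0, 1] into one number (assumed: unweighted mean)."""
    return mean(scores.values())


def improvement_pct(baseline: float, final: float) -> float:
    """Relative lift of the final composite over the baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline composite must be non-zero")
    return (final - baseline) / baseline * 100


# Hypothetical judge scores for a baseline and an optimized prompt.
baseline_scores = {"accuracy": 0.6, "helpfulness": 0.5, "safety": 0.7}
optimized_scores = {"accuracy": 0.9, "helpfulness": 0.8, "safety": 1.0}

base = composite(baseline_scores)
final = composite(optimized_scores)
lift = improvement_pct(base, final)
print(f"baseline={base:.2f} final={final:.2f} lift={lift:.1f}%")
```

With these made-up numbers the composite moves from about 0.60 to about 0.90, a lift of roughly 50 percent.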
Yes. Optimization runs on the input data you provide: chat-completion logs imported from your Activity Log, or datasets you upload as CSV, JSON, or JSONL. Good and Bad annotations on your completions act as a textual gradient that tells the optimizer which failures to prioritize.
Cost scales with iterations because each one calls the optimization, reflection, and evaluator models. You cap it by choosing a budget tier: Light (10 iterations), Medium (25), or Heavy (50). FastRouter shows a Credit Utilization Estimate before every run so there are no surprises.
No. Batch size isn't a manual setting. Each budget mode defines a maximum batch size (Light up to 3, Medium up to 5, Heavy up to 10), and FastRouter automatically chooses how many samples to evaluate per step based on your dataset, so the actual batch can be smaller than the maximum. Larger batches give a more stable signal per mutation; smaller ones iterate faster.
Import your data, set a budget, and let GEPA evolve a measurably better prompt, then push it straight to production.