Prompt Optimization

Optimize prompts automatically with GEPA

GEPA (Genetic-Pareto optimization) evolves your LLM prompts over iterations using reflection and mutation, then keeps only the Pareto-best variants, measured against LLM-judge metrics on your own data.

No credit card required · Free to start

Optimization run

Status

Optimization complete

Done

Final score

1.000

+14% improvement

9.7

Accuracy

9.5

Helpfulness

10

Safety

You are a senior support specialist. Read the customer's message, identify the underlying issue, then reply with a concise, empathetic, step-by-step resolution. If key details are missing, ask exactly one clarifying question before…

Why GEPA

Prompt engineering that improves itself

Stop hand-tuning prompts and guessing. GEPA searches for better prompts the way a strong engineer would: by studying real failures, proposing fixes, and keeping only what measurably wins.

Reflection-driven optimization

A reflection model reads your prompt's failures and proposes targeted edits, so every mutation is grounded in what actually went wrong, not guesswork.

Pareto selection keeps winners

Each candidate is scored on every metric and only Pareto-optimal prompts survive, so improvements compound instead of trading one metric away for another.
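As a rough illustration of what "only Pareto-optimal prompts survive" means, here is a minimal sketch of a Pareto filter. The function names and the candidate scores are hypothetical and chosen for illustration; this is not FastRouter's implementation.

```python
from typing import Dict, List

Scores = Dict[str, float]  # metric name -> score in [0, 1]


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical scores, for illustration only.
candidates = {
    "prompt_a": {"accuracy": 0.80, "safety": 0.90},
    "prompt_b": {"accuracy": 0.85, "safety": 0.90},  # dominates prompt_a
    "prompt_c": {"accuracy": 0.70, "safety": 1.00},  # weaker accuracy, best safety
}
front = pareto_front(candidates)
print(front)
```

Here `prompt_a` is dropped because `prompt_b` matches or beats it on every metric, while `prompt_c` survives because nothing beats it on safety.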

Budget-aware by design

Light, Medium, and Heavy tiers cap how far the search runs, and a Credit Utilization Estimate appears before each run so spend is predictable.

How GEPA works

Reflect, mutate, evaluate, then keep the best

Each iteration runs your prompt on real data, reflects on the failures, mutates the prompt, and scores every candidate so only Pareto-optimal variants survive into the next round.

GEPA engine

Three mechanisms, every iteration

Reflection

Diagnoses why an output failed.

Mutation

Proposes targeted prompt edits.

Pareto selection

Keeps only variants that improve.

Composite score: 0.59 → 1.000
  1. Base prompt

    Start from your current production system prompt.

  2. Run on data

    Execute across imported logs and uploaded datasets.

  3. Reflect on failures

    The reflection model diagnoses what went wrong.

  4. Mutate prompt

    Targeted edits are proposed from that feedback.

  5. Evaluate (Pareto)

    LLM judges score every metric from 0 to 1.

  6. Keep the best

    Pareto selection retains the top-performing variants.

  7. Repeat

    Repeat the loop until the budget tier's iteration limit is reached, keeping only Pareto-optimal prompts.
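The steps above can be sketched as a short loop. This is a toy under stated assumptions: `run`, `reflect`, `mutate`, and `judge` are hypothetical stand-ins for real model calls, and mutating the newest survivor is a simplification, since the page does not describe how a parent prompt is chosen.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]       # metric name -> judge score in [0, 1]
Candidate = Tuple[str, Scores]  # (prompt text, its scores)


def dominates(a: Scores, b: Scores) -> bool:
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)


def optimize(
    base_prompt: str,
    examples: List[str],
    run: Callable[[str, List[str]], List[str]],
    reflect: Callable[[str, List[str]], str],
    mutate: Callable[[str, str], str],
    judge: Callable[[List[str]], Scores],
    iterations: int,
) -> List[Candidate]:
    # Steps 1-2: start from the base prompt and score it on the data.
    pool: List[Candidate] = [(base_prompt, judge(run(base_prompt, examples)))]
    for _ in range(iterations):
        parent = pool[-1][0]  # assumption: mutate the newest survivor
        feedback = reflect(parent, run(parent, examples))  # step 3
        child = mutate(parent, feedback)                   # step 4
        pool.append((child, judge(run(child, examples))))  # step 5
        # Step 6: drop every candidate that another candidate dominates.
        pool = [(p, s) for p, s in pool if not any(dominates(o, s) for _, o in pool)]
    return pool


# Deterministic toy stand-ins so the sketch runs without any model calls.
def run(prompt: str, examples: List[str]) -> List[str]:
    return [f"{prompt} :: {ex}" for ex in examples]


def reflect(prompt: str, outputs: List[str]) -> str:
    return "Be more specific."


def mutate(prompt: str, feedback: str) -> str:
    return f"{prompt} {feedback}"


def judge(outputs: List[str]) -> Scores:
    hints = outputs[0].count("Be more specific.")
    return {"accuracy": min(1.0, 0.6 + 0.1 * hints), "safety": 1.0}


survivors = optimize(
    "You are a support assistant.",
    ["Why did my request fail with a 429?"],
    run, reflect, mutate, judge, iterations=3,
)
print(survivors)
```

In this toy every child scores higher on accuracy than its parent, so each round leaves a single survivor; with real judges, several non-dominated prompts can coexist in the pool.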
Input data & feedback

Optimize against your real failures

Import the examples that matter: chat-completion logs straight from your Activity Log, or your own CSV, JSON, or JSONL. Good and Bad annotations become the signal GEPA prioritizes.

Activity Log import

Pull real chat-completion logs you've already served in production into an optimization run.

Bring your own dataset

Upload labeled examples as CSV, JSON, or JSONL to target a specific task or edge case.

Good / Bad as a gradient

Thumbs annotations become a textual gradient that steers which failures get fixed first.

Imported input data

42 examples
Activity Log · CSV · JSON · JSONL
  • Rotate an API key without downtime?

    Good
  • What is the refund window for annual plans?

    Bad
  • Why did my request fail with a 429?

    Bad
  • Steps to invite a project member

    Good
Good / Bad votes become the textual gradient GEPA optimizes against.
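To make the data format concrete, here is a minimal sketch that parses a JSONL dataset and puts Bad-annotated examples first. The field names `input` and `annotation` are assumptions, since the page does not document an upload schema, and the sort only illustrates prioritization, not how GEPA builds its textual gradient.

```python
import json
from typing import Dict, List

# Hypothetical JSONL payload using the sample questions shown above.
JSONL = """\
{"input": "Rotate an API key without downtime?", "annotation": "Good"}
{"input": "What is the refund window for annual plans?", "annotation": "Bad"}
{"input": "Why did my request fail with a 429?", "annotation": "Bad"}
{"input": "Steps to invite a project member", "annotation": "Good"}
"""


def load_examples(text: str) -> List[Dict[str, str]]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def prioritize(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Stable sort that moves Bad-annotated examples to the front."""
    return sorted(examples, key=lambda e: e["annotation"] != "Bad")


examples = prioritize(load_examples(JSONL))
for e in examples:
    print(e["annotation"], "-", e["input"])
```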
Optimizer

Configure exactly how it tunes

Choose the model whose prompt ships to production, a cheaper model to reflect on failures, a budget tier, and the LLM-as-a-judge that scores every candidate.

Optimization model

The model whose prompt is optimized and then used in production with the refined instructions.

Reflection model

Reviews failures and proposes edits. It can be a cheaper model to keep optimization affordable.

Pick a budget mode

Choose Light, Medium, or Heavy. Each mode sets the iteration count and a max batch size; FastRouter sizes the actual batch per step for you.

Optimizer configuration

Optimization model

Refined & shipped to production

GPT-5.5

Reflection model

Reviews failures, proposes edits

Claude Opus 4.8

Evaluator model

Shared LLM-as-a-judge

Gemini 3 Pro
Budget mode (higher = more quality, more cost)
Light · Medium · Heavy

25

Iterations

5

Max batch size

Evaluation metrics
Accuracy · Helpfulness · Tone & Style · Safety
Results

Watch quality climb every iteration

The details view shows the optimized prompt, its composite score versus your baseline, the improvement percentage, and a full breakdown of every accepted and rejected iteration.

Composite scoring

A single 0-1 composite across your chosen metrics tracks the strength of every candidate.

Per-metric breakdown

See how each iteration scored on Accuracy, Helpfulness, Safety, and any other metric you added.

Improvement vs baseline

The final score is reported as a clear percentage lift over your starting prompt.
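The page does not say how metrics are aggregated or how the lift is computed, so the following is only a sketch under two assumptions: the composite is an unweighted mean of the metric scores, and the improvement is the relative change from the baseline composite. The scores are hypothetical, and the product's own figures may be calculated differently.

```python
from typing import Dict


def composite(scores: Dict[str, float]) -> float:
    """Assumption: composite = unweighted mean of per-metric scores in [0, 1]."""
    return sum(scores.values()) / len(scores)


def lift_percent(baseline: float, final: float) -> float:
    """Assumption: improvement = relative change from the baseline, in percent."""
    return (final - baseline) / baseline * 100


# Hypothetical per-metric scores, for illustration only.
final_scores = {"accuracy": 0.5, "helpfulness": 0.75, "safety": 1.0}
final = composite(final_scores)
improvement = lift_percent(baseline=0.5, final=final)
print(f"composite={final:.2f}, lift={improvement:.0f}%")
```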

All iterations

Composite climbing
  • Default

    Baseline
    0.59
  • Iteration 1

    Accepted
    0.74
  • Iteration 2

    Rejected
    0.68
  • Iteration 3

    Accepted
    0.86
  • Iteration 4

    Best kept
    1.00
Optimization budget

Pick the budget that fits the job

Higher tiers run more iterations for more quality at a higher cost. Start light to find direction, then go heavier when you're ready to maximize a high-traffic prompt.

Comparison of GEPA optimization budget tiers

Detail            Light             Medium (Recommended)   Heavy
Iterations        10                25                     50
Max batch size    3                 5                      10
Best for          Quick direction   Balanced quality       Maximum quality
Relative cost     Low               Medium                 High
Turnaround        Fastest           Balanced               Most thorough

Max batch size is the upper bound per step; FastRouter may evaluate fewer samples depending on your dataset. Each iteration calls the optimization, reflection, and evaluator models, so higher budgets cost more. A Credit Utilization Estimate is shown before every run.
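For a feel of how the tiers scale, this sketch multiplies iterations by max batch size to get an upper bound on samples evaluated per run. It assumes one batch per iteration and counts samples, not credits; the `BudgetTier` class is hypothetical, and the real Credit Utilization Estimate presumably also reflects model pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetTier:
    iterations: int
    max_batch_size: int

    def max_sample_evaluations(self) -> int:
        """Upper bound, assuming one batch per iteration at the maximum size."""
        return self.iterations * self.max_batch_size


# Iteration counts and batch caps as listed in the tier comparison.
TIERS = {
    "light": BudgetTier(iterations=10, max_batch_size=3),
    "medium": BudgetTier(iterations=25, max_batch_size=5),
    "heavy": BudgetTier(iterations=50, max_batch_size=10),
}

bounds = {name: tier.max_sample_evaluations() for name, tier in TIERS.items()}
print(bounds)
```

Under these assumptions Heavy evaluates up to roughly 17 times as many samples as Light (500 versus 30), which is why starting light to find direction is the cheaper first move.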

Built for production prompts

Where automated optimization pays off

GEPA shines whenever prompt quality directly affects outcomes, and when the failures you care about are already sitting in your logs.

Lift accuracy, hands-free

Raise answer quality without endless manual prompt tinkering: GEPA runs the search and surfaces what works.

Cut hallucinations & failures

Feed in real logged failures so the optimizer fixes the exact cases your users actually hit.

Standardize tone & safety

Lock a consistent voice and safety posture across an entire product surface, not just one prompt.

Harden before you scale

Squeeze maximum quality out of a prompt before it serves high-volume production traffic.

FAQ

Answers for AI & product teams

GEPA (Genetic-Pareto) is an evolutionary algorithm for optimizing prompts. Instead of hand-editing, it runs your prompt on real data, uses a reflection model to analyze failures, mutates the prompt based on that feedback, scores each candidate with LLM-as-a-judge metrics, and keeps only the Pareto-optimal variants. Repeating this loop over many iterations steadily evolves a stronger prompt.

Any supported chat model can be the Optimization Model, the one whose prompt is refined and shipped to production. The Reflection Model that reviews failures and the shared Evaluator Model that scores candidates can be different, often cheaper, models. For example, you might run GPT-5.5 in production while Claude Opus 4.8 reflects on failures and Gemini 3 Pro acts as the judge.

Each evaluation metric is an LLM-as-a-judge criterion that returns a score between 0 and 1 plus written feedback. GEPA combines your metrics into a single composite score, and the details view reports the optimized prompt's final composite against your baseline as a clear improvement percentage.

Yes. Optimization runs on the input data you provide: chat-completion logs imported from your Activity Log, or datasets you upload as CSV, JSON, or JSONL. Good and Bad annotations on your completions act as a textual gradient that tells the optimizer which failures to prioritize.

Cost scales with iterations because each one calls the optimization, reflection, and evaluator models. You cap it by choosing a budget tier: Light (10 iterations), Medium (25), or Heavy (50). FastRouter shows a Credit Utilization Estimate before every run so there are no surprises.

No. Batch size isn't a manual setting. Each budget mode defines a maximum batch size (Light up to 3, Medium up to 5, Heavy up to 10), and FastRouter automatically chooses how many samples to evaluate per step based on your dataset, so it can be lower than the max. Larger batches give more stable signal per mutation; smaller ones iterate faster.

Ship your best prompt, not your first draft

Import your data, set a budget, and let GEPA evolve a measurably better prompt, then push it straight to production.