# Evaluations

Run LLM-as-judge evaluations on your agent traces: define scoring criteria, and an LLM judge automatically scores each trace against them.
## Creating an Evaluation
```bash
curl -X POST https://api.retrace.yashbogam.me/api/v1/evaluations \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Agent Quality",
    "criteria": [
      {"name": "accuracy", "description": "Factually correct?", "weight": 1.0},
      {"name": "helpfulness", "description": "Addresses the question?", "weight": 0.8}
    ],
    "judge_model": "gemini-3.1-pro-preview"
  }'
```

## Running Evaluations
```bash
curl -X POST https://api.retrace.yashbogam.me/api/v1/evaluations/{id}/run \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{"trace_ids": ["trace-1", "trace-2"]}'
```

## How It Works
- Retrace summarizes your trace (spans, inputs, outputs, errors)
- The judge LLM scores each criterion from 0.0 to 1.0
- A weighted average produces the overall score
- The judge provides textual feedback
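The scoring step above can be sketched as a plain weighted average. This is a minimal illustration, not Retrace's internal implementation; the function and variable names are hypothetical, and per-criterion scores are assumed to arrive as a dict keyed by criterion name.

```python
def overall_score(scores: dict, criteria: list) -> float:
    """Weighted average of per-criterion judge scores, each in [0.0, 1.0].

    `scores` maps criterion name -> judge score; `criteria` carries the weights
    (matching the shape sent when creating the evaluation).
    """
    total_weight = sum(c["weight"] for c in criteria)
    weighted = sum(scores[c["name"]] * c["weight"] for c in criteria)
    return weighted / total_weight

# Criteria from the example evaluation above.
criteria = [
    {"name": "accuracy", "weight": 1.0},
    {"name": "helpfulness", "weight": 0.8},
]
# Hypothetical judge output for one trace.
judge_scores = {"accuracy": 0.9, "helpfulness": 0.8}

print(round(overall_score(judge_scores, criteria), 4))  # (0.9 + 0.64) / 1.8 ≈ 0.8556
```

Note that weights are normalized by their sum, so the overall score stays in the 0.0–1.0 range regardless of how large the weights are.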
> [!TIP]
> Use gemini-2.5-flash for fast, cheap evaluations. Use gemini-3.1-pro-preview for high-stakes production evals.