# Evaluations

Run LLM-as-judge evaluations on your agent traces: define scoring criteria, and an LLM judge automatically scores each trace against them.
## Creating an Evaluation
```bash
curl -X POST https://api.retrace.yashbogam.me/api/v1/evaluations \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Agent Quality",
    "criteria": [
      {"name": "accuracy", "description": "Factually correct?", "weight": 1.0},
      {"name": "helpfulness", "description": "Addresses the question?", "weight": 0.8}
    ],
    "judge_model": "gemini-3.1-pro-preview"
  }'
```

## Running Evaluations
```bash
curl -X POST https://api.retrace.yashbogam.me/api/v1/evaluations/{id}/run \
  -H "x-retrace-key: rt_live_..." \
  -H "Content-Type: application/json" \
  -d '{"trace_ids": ["trace-1", "trace-2"]}'
```

## How It Works
- Retrace summarizes your trace (spans, inputs, outputs, errors)
- The judge LLM scores each criterion from 0.0 to 1.0
- A weighted average produces the overall score
- The judge provides textual feedback
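The scoring step above can be sketched as a plain weighted average. This is a minimal illustration, not Retrace's internal implementation; the function and variable names are hypothetical, and per-criterion scores are assumed to arrive as a dict keyed by criterion name.

```python
def overall_score(scores: dict, criteria: list) -> float:
    """Weighted average of per-criterion judge scores, each in [0.0, 1.0].

    `scores` maps criterion name -> judge score; `criteria` carries the weights
    (matching the shape sent when creating the evaluation).
    """
    total_weight = sum(c["weight"] for c in criteria)
    weighted = sum(scores[c["name"]] * c["weight"] for c in criteria)
    return weighted / total_weight

# Criteria from the example evaluation above.
criteria = [
    {"name": "accuracy", "weight": 1.0},
    {"name": "helpfulness", "weight": 0.8},
]
# Hypothetical judge output for one trace.
judge_scores = {"accuracy": 0.9, "helpfulness": 0.8}

print(round(overall_score(judge_scores, criteria), 4))  # (0.9 + 0.64) / 1.8 ≈ 0.8556
```

Note that weights are normalized by their sum, so the overall score stays in the 0.0–1.0 range regardless of how large the weights are.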
> [!TIP]
> Use gemini-2.5-flash for fast, cheap evaluations. Use gemini-3.1-pro-preview for high-stakes production evals.