From 5be38c3a271cb4402bb83a6c758a56cc0b0aafef Mon Sep 17 00:00:00 2001
From: Christian Gunderman
Date: Mon, 26 Jan 2026 15:25:37 -0800
Subject: [PATCH] Add slash command for diagnosing test failures.

---
 evals/README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/evals/README.md b/evals/README.md
index eb3cf2be70..cdcbcff124 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -160,8 +160,7 @@ failing evaluations. It will:
    `evals/logs`.
 2. **Fix**: Suggest and apply targeted fixes to the prompt or tool definitions.
    It prioritizes minimal changes to `prompt.ts`, tool instructions, and
-   modules that contribute to the prompt. It generally tries to avoid changing
-   the test itself.
+   modules that contribute to the prompt.
 3. **Verify**: Re-run the test 3 times across multiple models (e.g., Gemini 3.0,
    Gemini 3 Flash, Gemini 2.5 Pro) to ensure stability and calculate a success
    rate.