Fixing Behavioral Evals

Use this guide when asked to debug, troubleshoot, or fix a failing behavioral evaluation.

1. 🔍 Investigate

Fetch Nightly Results: Use the gh CLI to inspect the latest run from evals-nightly.yml if applicable.
- Example view URL: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml
Isolate: DO NOT push changes or start remote runs. Confine investigation to the local workspace.
Read Logs:
- Eval logs live in evals/logs/<test_name>.log.
- Enable verbose debugging via export GEMINI_DEBUG_LOG_FILE="debug.log".
Diagnose: Audit tool logs and telemetry. Note if due to setup/assert.
- Tip: Proactively add custom logging/diagnostics to check hypotheses.

Targeted Location: Locate the test case and the corresponding prompt/code.
Iterative Scope: Make extreme change first to verify scope, then refine to a minimal, targeted change.
Assertion Fidelity:
- Changing the test prompt is a last resort (prompts are often vague by design).
- Warning: Do not lose test fidelity by making prompts too direct/easy.
- Primary Fix Trigger: Adjust tool descriptions, system prompts (snippets.ts), or modules that contribute to the prompt template.
- Warning: Prompts have multiple configurations; ensure your fix targets the correct config for the model in question.
Architecture Options: If prompt or instruction tuning triggers no improvement, analyze loop composition.
- AgentLoop: Defined by context + toolset + prompt.
- Enhancements: Loops perform best with direct prompts, fewer irrelevant tools, low goal density, and minimal low-value/irrelevant context.
- Modifications: Compose subagents or isolate tools. Ground in observed traces.
- Warning: Think deeply before offering recommendations; avoid parroting abstract design guidelines.

Run Local: Run Vitest in non-interactive mode on just the file.
Log Audit: Prioritize diagnosing failures via log comparison before triggering heavy test runs.
Stability Limit: Run the test 3 times locally on key models (can use scripts to run in parallel for speed):
- Gemini 3.0
- Gemini 3 Flash
- Gemini 2.5 Pro
Flakiness Rule: If it passes 2/3 times, it may be inherent noise difficult to improve without a structural split.

Provide a summary of: