gemini-cli/.gemini/skills/behavioral-evals/references/fixing.md

Fixing Behavioral Evals

Use this guide when asked to debug, troubleshoot, or fix a failing behavioral evaluation.


1. 🔍 Investigate

  1. Fetch Nightly Results: If applicable, use the gh CLI to inspect the latest run of evals-nightly.yml.
    • Example view URL: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml
  2. Isolate: DO NOT push changes or start remote runs. Confine investigation to the local workspace.
  3. Read Logs:
    • Eval logs live in evals/logs/<test_name>.log.
    • Enable verbose debugging via export GEMINI_DEBUG_LOG_FILE="debug.log".
  4. Diagnose: Audit tool logs and telemetry. Note whether the failure stems from test setup or from the assertion itself.
    • Tip: Proactively add custom logging/diagnostics to check hypotheses.
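The investigation steps above might look like the following shell session. The run ID, test name, and comparison log below are placeholders, and `gh` must be installed and authenticated:

```shell
# 1. Inspect the latest nightly run via the gh CLI.
gh run list --workflow=evals-nightly.yml --limit=5
gh run view 1234567890 --log-failed   # 1234567890 is a placeholder run ID

# 3. Read the local eval log and enable verbose debugging for reruns.
less evals/logs/my_failing_eval.log   # substitute the real test name
export GEMINI_DEBUG_LOG_FILE="debug.log"

# 4. Compare a failing log against a known-good run to spot the divergence.
diff evals/logs/my_failing_eval.log evals/logs/my_passing_eval.log | head -50
```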

2. 🛠️ Fix Strategy

  1. Targeted Location: Locate the test case and the corresponding prompt/code.
  2. Iterative Scope: Make an extreme change first to verify you are editing the right scope, then refine it into a minimal, targeted change.
  3. Assertion Fidelity:
    • Changing the test prompt is a last resort (prompts are often vague by design).
    • Warning: Do not lose test fidelity by making prompts too direct/easy.
    • Primary Fix Targets: Adjust tool descriptions, system prompts (snippets.ts), or modules that contribute to the prompt template.
    • Fixes should generally start by improving the prompt in @packages/core/src/prompts/snippets.ts.
    • Instructional Generality: Changes to the system prompt should aim to be as general as possible while still accomplishing the goal. Specificity should be added only as needed.
      • Principle: Instead of creating "forbidden lists" for specific syntax (e.g., "Don't use Object.create()"), formulate a broader engineering principle that covers the underlying issue (e.g., "Prioritize explicit composition over hidden prototype manipulation"). This improves steerability across a wider range of similar scenarios.
      • Low Specificity: "Follow ecosystem best practices"
      • Medium Specificity: "Utilize OOP and functional best practices, as applicable"
      • High Specificity: Provide ecosystem-specific hints as examples of a broader principle rather than direct instructions. e.g., "NEVER use hacks like bypassing the type system or employing 'hidden' logic (e.g.: reflection, prototype manipulation). Instead, use explicit and idiomatic language features (e.g.: type guards, explicit class instantiation, or object spread) that maintain structural integrity."
    • Prompt Simplification: Once the test is passing, use ask_user to determine if prompt simplification is desired.
      • Criteria: Simplification should be attempted only if there are related clauses that can be de-duplicated or reparented under a single heading.
      • Verification: As part of simplification, you MUST identify and run any behavioral eval tests that might be affected by the changes to ensure no regressions are introduced.
    • Test fixes should not "cheat" by changing a test's GEMINI.md file or by updating the test's prompt to instruct it not to reproduce the bug.
    • Warning: Prompts have multiple configurations; ensure your fix targets the correct config for the model in question.
  4. Architecture Options: If prompt or instruction tuning yields no improvement, analyze the loop composition.
    • AgentLoop: Defined by context + toolset + prompt.
    • Enhancements: Loops perform best with direct prompts, fewer irrelevant tools, low goal density, and minimal low-value/irrelevant context.
    • Modifications: Compose subagents or isolate tools. Ground in observed traces.
    • Warning: Think deeply before offering recommendations; avoid parroting abstract design guidelines.

3. ✅ Verify

  1. Run Local: Run Vitest in non-interactive mode on just the affected test file.
  2. Log Audit: Prioritize diagnosing failures via log comparison before triggering heavy test runs.
  3. Stability Limit: Run the test 3 times locally on key models (can use scripts to run in parallel for speed):
    • Gemini 3.0
    • Gemini 3 Flash
    • Gemini 2.5 Pro
  4. Flakiness Rule: If the test passes 2/3 times, the failure may be inherent noise that is difficult to eliminate without a structural split.
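The stability check can be scripted along these lines. The GEMINI_MODEL variable, model IDs, and test path here are assumptions rather than confirmed configuration; adapt them to the repo's actual eval runner:

```shell
#!/usr/bin/env bash
# Sketch: run one eval file three times per model, in parallel, saving a log per run.
TEST_FILE="evals/my_failing_eval.test.ts"   # substitute the real file
for model in gemini-3.0 gemini-3-flash gemini-2.5-pro; do
  for i in 1 2 3; do
    GEMINI_MODEL="$model" npx vitest run "$TEST_FILE" \
      > "run-${model}-${i}.log" 2>&1 &
  done
done
wait   # block until all nine runs finish
```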

4. 📊 Report

Provide a summary of:

  • Test success rate for each tested model (e.g., 3/3 = 100%).
  • Root cause identification and fix explanation.
  • If unfixed, provide high-confidence architecture recommendations.
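The per-model success rate in the report can be tallied from saved run logs with a small helper. The "FAIL" marker and log naming below are assumptions about the runner's output, not a confirmed format:

```shell
# Count passing runs among a set of log files.
# Assumes a failing run's log contains the word "FAIL" (an assumption).
count_passes() {
  # grep -L lists the files that do NOT contain the pattern; count them.
  grep -L "FAIL" "$@" 2>/dev/null | wc -l
}

# Example against fixture logs:
printf 'all tests passed\n' > run1.log
printf '1 test FAIL\n'      > run2.log
printf 'all tests passed\n' > run3.log
count_passes run1.log run2.log run3.log   # prints 2 (i.e., 2/3 passed)
```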