mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-04-29 06:25:16 -07:00
4.6 KiB
4.6 KiB
Fixing Behavioral Evals
Use this guide when asked to debug, troubleshoot, or fix a failing behavioral evaluation.
1. 🔍 Investigate
- Fetch Nightly Results: Use the
ghCLI to inspect the latest run fromevals-nightly.ymlif applicable.- Example view URL:
https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml
- Example view URL:
- Isolate: DO NOT push changes or start remote runs. Confine investigation to the local workspace.
- Read Logs:
- Eval logs live in
evals/logs/<test_name>.log. - Enable verbose debugging via
export GEMINI_DEBUG_LOG_FILE="debug.log".
- Eval logs live in
- Diagnose: Audit tool logs and telemetry. Note if due to setup/assert.
- Tip: Proactively add custom logging/diagnostics to check hypotheses.
2. 🛠️ Fix Strategy
- Targeted Location: Locate the test case and the corresponding prompt/code.
- Iterative Scope: Make extreme change first to verify scope, then refine to a minimal, targeted change.
- Assertion Fidelity:
- Changing the test prompt is a last resort (prompts are often vague by design).
- Warning: Do not lose test fidelity by making prompts too direct/easy.
- Primary Fix Trigger: Adjust tool descriptions, system prompts
(
snippets.ts), or modules that contribute to the prompt template. - Fixes should generally try to improve the prompt
@packages/core/src/prompts/snippets.tsfirst. - Instructional Generality: Changes to the system prompt should aim to
be as general as possible while still accomplishing the goal. Specificity
should be added only as needed.
- Principle: Instead of creating "forbidden lists" for specific syntax
(e.g., "Don't use
Object.create()"), formulate a broader engineering principle that covers the underlying issue (e.g., "Prioritize explicit composition over hidden prototype manipulation"). This improves steerability across a wider range of similar scenarios. - Low Specificity: "Follow ecosystem best practices"
- Medium Specificity: "Utilize OOP and functional best practices, as applicable"
- High Specificity: Provide ecosystem-specific hints as examples of a broader principle rather than direct instructions. e.g., "NEVER use hacks like bypassing the type system or employing 'hidden' logic (e.g.: reflection, prototype manipulation). Instead, use explicit and idiomatic language features (e.g.: type guards, explicit class instantiation, or object spread) that maintain structural integrity."
- Principle: Instead of creating "forbidden lists" for specific syntax
(e.g., "Don't use
- Prompt Simplification: Once the test is passing, use
ask_userto determine if prompt simplification is desired.- Criteria: Simplification should be attempted only if there are related clauses that can be de-duplicated or reparented under a single heading.
- Verification: As part of simplification, you MUST identify and run any behavioral eval tests that might be affected by the changes to ensure no regressions are introduced.
- Test fixes should not "cheat" by changing a test's
GEMINI.mdfile or by updating the test's prompt to instruct it to not repro the bug. - Warning: Prompts have multiple configurations; ensure your fix targets the correct config for the model in question.
- Architecture Options: If prompt or instruction tuning triggers no
improvement, analyze loop composition.
- AgentLoop: Defined by
context + toolset + prompt. - Enhancements: Loops perform best with direct prompts, fewer irrelevant tools, low goal density, and minimal low-value/irrelevant context.
- Modifications: Compose subagents or isolate tools. Ground in observed traces.
- Warning: Think deeply before offering recommendations; avoid parroting abstract design guidelines.
- AgentLoop: Defined by
3. ✅ Verify
- Run Local: Run Vitest in non-interactive mode on just the file.
- Log Audit: Prioritize diagnosing failures via log comparison before triggering heavy test runs.
- Stability Limit: Run the test 3 times locally on key models (can use
scripts to run in parallel for speed):
- Gemini 3.0
- Gemini 3 Flash
- Gemini 2.5 Pro
- Flakiness Rule: If it passes 2/3 times, it may be inherent noise difficult to improve without a structural split.
4. 📊 Report
Provide a summary of:
- Test success rate for each tested model (e.g., 3/3 = 100%).
- Root cause identification and fix explanation.
- If unfixed, provide high-confidence architecture recommendations.