mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-04-20 18:14:29 -07:00
2.7 KiB
2.7 KiB
Fixing Behavioral Evals
Use this guide when asked to debug, troubleshoot, or fix a failing behavioral evaluation.
1. 🔍 Investigate
- Fetch Nightly Results: Use the
ghCLI to inspect the latest run fromevals-nightly.ymlif applicable.- Example view URL:
https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml
- Example view URL:
- Isolate: DO NOT push changes or start remote runs. Confine investigation to the local workspace.
- Read Logs:
- Eval logs live in
evals/logs/<test_name>.log. - Enable verbose debugging via
export GEMINI_DEBUG_LOG_FILE="debug.log".
- Eval logs live in
- Diagnose: Audit tool logs and telemetry. Note if due to setup/assert.
- Tip: Proactively add custom logging/diagnostics to check hypotheses.
2. 🛠️ Fix Strategy
- Targeted Location: Locate the test case and the corresponding prompt/code.
- Iterative Scope: Make extreme change first to verify scope, then refine to a minimal, targeted change.
- Assertion Fidelity:
- Changing the test prompt is a last resort (prompts are often vague by design).
- Warning: Do not lose test fidelity by making prompts too direct/easy.
- Primary Fix Trigger: Adjust tool descriptions, system prompts
(
snippets.ts), or modules that contribute to the prompt template. - Warning: Prompts have multiple configurations; ensure your fix targets the correct config for the model in question.
- Architecture Options: If prompt or instruction tuning triggers no
improvement, analyze loop composition.
- AgentLoop: Defined by
context + toolset + prompt. - Enhancements: Loops perform best with direct prompts, fewer irrelevant tools, low goal density, and minimal low-value/irrelevant context.
- Modifications: Compose subagents or isolate tools. Ground in observed traces.
- Warning: Think deeply before offering recommendations; avoid parroting abstract design guidelines.
- AgentLoop: Defined by
3. ✅ Verify
- Run Local: Run Vitest in non-interactive mode on just the file.
- Log Audit: Prioritize diagnosing failures via log comparison before triggering heavy test runs.
- Stability Limit: Run the test 3 times locally on key models (can use
scripts to run in parallel for speed):
- Gemini 3.0
- Gemini 3 Flash
- Gemini 2.5 Pro
- Flakiness Rule: If it passes 2/3 times, it may be inherent noise difficult to improve without a structural split.
4. 📊 Report
Provide a summary of:
- Test success rate for each tested model (e.g., 3/3 = 100%).
- Root cause identification and fix explanation.
- If unfixed, provide high-confidence architecture recommendations.