--- name: behavioral-evals description: Guidance for creating, running, fixing, and promoting behavioral evaluations. Use when verifying agent decision logic, debugging failures, debugging prompt steering, or adding workspace regression tests. --- # Behavioral Evals ## Overview Behavioral evaluations (evals) are tests that validate the **agent's decision-making** (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions. > [!NOTE] > **Single Source of Truth**: For core concepts, policies, running tests, and general best practices, always refer to **[evals/README.md](file:///Users/abhipatel/code/gemini-cli/docs/evals/README.md)**. --- ## 🔄 Workflow Decision Tree 1. **Does a prompt/tool change need validation?** * *No* -> Normal integration tests. * *Yes* -> Continue below. 2. **Is it UI/Interaction heavy?** * *Yes* -> Use `appEvalTest` (`AppRig`). See **[creating.md](references/creating.md)**. * *No* -> Use `evalTest` (`TestRig`). See **[creating.md](references/creating.md)**. 3. **Is it a new test?** * *Yes* -> Set policy to `USUALLY_PASSES`. * *No* -> `ALWAYS_PASSES` (locks in regression). 4. **Are you fixing a failure or promoting a test?** * *Fixing* -> See **[fixing.md](references/fixing.md)**. * *Promoting* -> See **[promoting.md](references/promoting.md)**. --- ## 📋 Quick Checklist ### 1. Setup Workspace Seed the workspace with necessary files using the `files` object to simulate a realistic scenario (e.g., NodeJS project with `package.json`). * *Details in **[creating.md](references/creating.md)*** ### 2. Write Assertions Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or index verification on `rig.readToolLogs()`. * *Details in **[creating.md](references/creating.md)*** ### 3. Verify Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows. * *See **[evals/README.md](file:///Users/abhipatel/code/gemini-cli/docs/evals/README.md)** for running commands.* --- ## 📦 Bundled Resources Detailed procedural guides: * **[creating.md](references/creating.md)**: Assertion strategies, Rig selection, Mock MCPs. * **[fixing.md](references/fixing.md)**: Step-by-step automated investigation, architecture diagnosis guidelines. * **[promoting.md](references/promoting.md)**: Candidate identification criteria and threshold guidelines.