feat(skills): add behavioral-evals skill with fixing and promoting guides (#23349)

This commit is contained in:
Abhi
2026-03-23 17:06:43 -04:00
committed by GitHub
parent fbf38361ad
commit db14cdf92b
10 changed files with 509 additions and 143 deletions
@@ -0,0 +1,151 @@
# Creating Behavioral Evals
## 🔬 Rig Selection
| Rig Type | Import From | Architecture | Use When |
| :---------------- | :--------------------- | :------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- |
| **`evalTest`**    | `./test-helper.js`     | **Subprocess**. Runs the CLI in a separate process and waits for it to exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer.  |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |
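As a rough orientation, a subprocess-rig eval is a single `evalTest` call whose
case object combines the `name`, `timeout`, `setup`, and `assert` pieces used in
the patterns below. This is a minimal sketch only; the exact `EvalCase` shape
(e.g., how the user prompt is supplied) is defined by the helper and is not
reproduced here.
```typescript
import { expect } from 'vitest';
// Subprocess rig, per the table above.
import { evalTest } from './test-helper.js';

evalTest('USUALLY_PASSES', {
  name: 'agent inspects the workspace before answering',
  timeout: 60000, // safety limit against runaway loops
  setup: async (rig) => {
    // Seed workspace anchors here (package.json, GEMINI.md, ...).
  },
  assert: async (rig) => {
    await rig.waitForTelemetryReady();
    const toolLogs = rig.readToolLogs();
    // Trajectory auditing is the safe pattern for the subprocess rig.
    expect(toolLogs.length).toBeGreaterThan(0);
  },
});
```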
---
## 🏗️ Scenario Design
Evals must simulate realistic agent environments to effectively test
decision-making.
- **Workspace State**: Seed with standard project anchors if testing general
capabilities:
- `package.json` for NodeJS environments.
- Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless you are testing exact prompt steering (a seeding sketch follows
  this list).
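A minimal sketch of a `setup` hook that seeds such a workspace. The seeding API
is not documented in this guide, so `rig.createFile` below is a hypothetical
stand-in for whatever file-seeding helper the rig actually exposes:
```typescript
setup: async (rig) => {
  // Standard project anchors so the scenario resembles a real NodeJS repo.
  // NOTE: rig.createFile is hypothetical; substitute the rig's real helper.
  await rig.createFile('package.json', JSON.stringify({ name: 'demo-app', type: 'module' }));
  await rig.createFile('tsconfig.json', '{ "compilerOptions": { "strict": true } }');
  await rig.createFile('GEMINI.md', '# Project conventions\n');

  // Enough structure that the agent must search/navigate rather than being
  // handed the answer in a single file.
  await rig.createFile('src/index.ts', "export { loadConfig } from './config.js';");
  await rig.createFile('src/config.ts', '// the behavior under test lives here');
},
```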
---
## ❌ Fail First Principle
Before asserting a new capability or locking in a fix, **verify that the test
fails first**.
- It is easy to accidentally write an eval that asserts behaviors that are
already met or pass by default.
- **Process**: reproduce the failure with the test -> apply the fix
  (prompt/tool) -> verify the test now passes (see the sketch below).
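For example, the fail-first loop can be exercised locally with the single-file
run command described in the running guide (the file name is illustrative):
```bash
# 1. Run the new eval before the fix lands; it should fail.
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts

# 2. Apply the prompt/tool fix, rebuild the binary, and re-run; it should now pass.
npm run build && npm run bundle
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```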
---
## ✋ Testing Patterns
### 1. Breakpoints
Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.
```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
const confirmation = await rig.waitForPendingConfirmation('ask_user');
expect(confirmation).toBeDefined();
}
```
### 2. Tool Confirmation Race
When asserting multiple triggers (e.g., "enters plan mode then asks question"):
```typescript
assert: async (rig) => {
let confirmation = await rig.waitForPendingConfirmation([
'enter_plan_mode',
'ask_user',
]);
if (confirmation?.name === 'enter_plan_mode') {
rig.acceptConfirmation('enter_plan_mode');
confirmation = await rig.waitForPendingConfirmation('ask_user');
}
expect(confirmation?.name).toBe('ask_user');
};
```
### 3. Audit Tool Logs
Audit exact operations to ensure efficiency (e.g., no redundant reads).
```typescript
assert: async (rig, result) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const writeCall = toolLogs.find(
(log) => log.toolRequest.name === 'write_file',
);
expect(writeCall).toBeDefined();
};
```
### 4. Mock MCP Facades
To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.
```typescript
setup: async (rig) => {
rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const workspaceCall = toolLogs.find(
(log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText'
);
expect(workspaceCall).toBeDefined();
};
```
---
## ⚠️ Safety & Efficiency Guardrails
### 1. Breakpoint Deadlocks
Breakpoints (`setBreakpoint`) pause execution, but in standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run. **Combining
the two will hang indefinitely.**
- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.
### 2. Runaway Timeout
Always set a budget boundary in the `EvalCase` to prevent runaway loops from
burning quota:
```typescript
evalTest('USUALLY_PASSES', {
name: '...',
timeout: 60000, // 1 minute safety limit
// ...
});
```
### 3. Efficiency Assertion (Turn limits)
Check if a tool is called _early_ using index checks:
```typescript
assert: async (rig) => {
const toolLogs = rig.readToolLogs();
const toolCallIndex = toolLogs.findIndex(
(log) => log.toolRequest.name === 'cli_help',
);
expect(toolCallIndex).toBeGreaterThan(-1);
expect(toolCallIndex).toBeLessThan(5); // Called within first 5 turns
};
```
@@ -0,0 +1,71 @@
# Fixing Behavioral Evals
Use this guide when asked to debug, troubleshoot, or fix a failing behavioral
evaluation.
---
## 1. 🔍 Investigate
1. **Fetch Nightly Results**: Use the `gh` CLI to inspect the latest run from
   `evals-nightly.yml` if applicable (see the sketch after this list).
- _Example view URL_:
`https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`
2. **Isolate**: DO NOT push changes or start remote runs. Confine investigation
to the local workspace.
3. **Read Logs**:
- Eval logs live in `evals/logs/<test_name>.log`.
- Enable verbose debugging via `export GEMINI_DEBUG_LOG_FILE="debug.log"`.
4. **Diagnose**: Audit tool logs and telemetry. Note whether the failure
   originates in the `setup` hook or the `assert` phase.
- **Tip**: Proactively add custom logging/diagnostics to check hypotheses.
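A sketch of step 1 using standard `gh` commands (the run ID is a placeholder
taken from the list output):
```bash
# Recent nightly eval runs and their conclusions.
gh run list --workflow=evals-nightly.yml --limit 5

# Pull only the failed job logs for the run you are investigating.
gh run view <run-id> --log-failed
```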
---
## 2. 🛠️ Fix Strategy
1. **Targeted Location**: Locate the test case and the corresponding
prompt/code.
2. **Iterative Scope**: Make an extreme change first to verify the scope of
   influence, then refine it to a minimal, targeted change.
3. **Assertion Fidelity**:
- Changing the test prompt is a **last resort** (prompts are often vague by
design).
- **Warning**: Do not lose test fidelity by making prompts too direct/easy.
- **Primary Fix Trigger**: Adjust tool descriptions, system prompts
(`snippets.ts`), or **modules that contribute to the prompt template**.
- **Warning**: Prompts have multiple configurations; ensure your fix targets
the correct config for the model in question.
4. **Architecture Options**: If prompt or instruction tuning yields no
   improvement, analyze the loop composition.
- **AgentLoop**: Defined by `context + toolset + prompt`.
- **Enhancements**: Loops perform best with direct prompts, fewer irrelevant
tools, low goal density, and minimal low-value/irrelevant context.
- **Modifications**: Compose subagents or isolate tools. Ground in observed
traces.
- **Warning**: Think deeply before offering recommendations; avoid parroting
abstract design guidelines.
---
## 3. ✅ Verify
1. **Run Local**: Run Vitest in non-interactive mode on just the file.
2. **Log Audit**: Prioritize diagnosing failures via log comparison before
triggering heavy test runs.
3. **Stability Limit**: Run the test **3 times** locally on each key model
   (scripts can run them in parallel for speed; see the sketch after this list):
- **Gemini 3.0**
- **Gemini 3 Flash**
- **Gemini 2.5 Pro**
4. **Flakiness Rule**: If it passes 2/3 times, the remaining failure may be
   inherent noise that is difficult to improve without a structural split.
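A sketch of the stability check for one model and one file. How the model under
test is pinned is an assumption here (shown as a `GEMINI_MODEL`-style variable);
adjust it to whatever selection mechanism your setup actually uses:
```bash
# Run the eval three times; stop early on the first failure.
for i in 1 2 3; do
  # GEMINI_MODEL is assumed as the model-selection mechanism for this sketch.
  GEMINI_MODEL=gemini-2.5-pro RUN_EVALS=1 \
    npx vitest run --config evals/vitest.config.ts my_feature.eval.ts || break
done
```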
---
## 4. 📊 Report
Provide a summary of:
- Test success rate for each tested model (e.g., 3/3 = 100%).
- Root cause identification and fix explanation.
- If unfixed, provide high-confidence architecture recommendations.
@@ -0,0 +1,55 @@
# Promoting Behavioral Evals
Use this guide when asked to analyze nightly results and promote incubated tests
to stable suites.
---
## 1. 🔍 Investigate candidates
1. **Audit Nightly Logs**: Use the `gh` CLI to fetch results from
   `evals-nightly.yml` (see the sketch after this list; direct URL:
   `https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`).
- **Tip**: The aggregate summary from the most recent run integrates the
last 7 runs of history automatically.
- **Safety**: DO NOT push changes or start remote runs. All verification is
local.
2. **Assess Stability**: Identify tests that pass **100% of the time** across
   ALL enabled models in each of the **last 7 consecutive nightly runs**.
- _100% means the test passed 3/3 times for every model and run._
3. **Promotion Targets**: Tests meeting these criteria are candidates for
promotion from `USUALLY_PASSES` to `ALWAYS_PASSES`.
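A sketch for pulling the recent nightly history with standard `gh` flags; the
per-test, per-model pass rates still have to be read from each run's summary:
```bash
# Conclusions of the last 7 nightly runs.
gh run list --workflow=evals-nightly.yml --limit 7 \
  --json displayTitle,conclusion,createdAt

# Open a specific run to read the aggregate eval summary.
gh run view <run-id>
```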
---
## 2. 🚥 Promotion Steps
1. **Locate File**: Locate the eval file in the `evals/` directory.
2. **Update Policy**: Modify the policy argument to `ALWAYS_PASSES`.
```typescript
evalTest('ALWAYS_PASSES', { ... })
```
3. **Targeting**: Follow guidelines in `evals/README.md` regarding stable suite
organization.
4. **Constraint**: Your final change must be **minimal and targeted** strictly
to promoting the test status. Do not refactor the test or setup fixtures.
---
## 3. ✅ Verify
1. **Run Promoted Tests**: Run the promoted test locally using non-interactive
   Vitest to confirm it is structurally valid (command sketch below).
2. **Verify Suite Inclusion**: Check that the test is now picked up by the
   standard stable suite run.
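A sketch of both verification steps using the commands from the running guide
(file name illustrative; `RUN_EVALS=1` is only required for incubated tests but
is harmless here):
```bash
# 1. Structure check on the promoted file.
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts

# 2. Confirm the test is picked up by the stable (ALWAYS_PASSES) suite.
npm run test:always_passing_evals
```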
---
## 4. 📊 Report
Provide a summary of:
- Which tests were promoted.
- The success-rate evidence (e.g., 7/7 runs passed for all models).
- If no candidates qualified, list the next closest candidates and their current
pass rate.
@@ -0,0 +1,95 @@
# Running & Promoting Evals
## 🛠️ Prerequisites
Behavioral evals run against the compiled binary. You **must** rebuild and
bundle the project after making changes:
```bash
npm run build && npm run bundle
```
---
## 🏃‍♂️ Running Tests
### 1. Configure Environment Variables
Evals require a standard API key. If your `.env` file has multiple keys or
comments, use this precise extraction setup:
```bash
export GEMINI_API_KEY=$(grep '^GEMINI_API_KEY=' .env | cut -d '=' -f2) && RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts <file_name>
```
### 2. Commands
| Command | Scope | Description |
| :---------------------------------- | :-------------- | :------------------------------------------------- |
| `npm run test:always_passing_evals` | `ALWAYS_PASSES` | Fast feedback, runs in CI. |
| `npm run test:all_evals` | All | Runs nightly incubation tests. Sets `RUN_EVALS=1`. |
### Target Specific File
_Note: `RUN_EVALS=1` is required for incubated (`USUALLY_PASSES`) tests._
```bash
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```
---
## 🐞 Debugging and Logs
If a test fails, verify:
- **Tool Trajectory Logs**: the sequence of calls in `evals/logs/<test_name>.log`.
- **Verbose Reasoning**: Capture raw buffer traces by setting
`GEMINI_DEBUG_LOG_FILE`:
```bash
export GEMINI_DEBUG_LOG_FILE="debug.log"
```
---
### 🎯 Verify Model Targeting
- **Tip:** Standard evals benchmark against model variations. If a test passes
on Flash but fails on Pro (or vice versa), the issue is usually in the **tool
description**, not the prompt definition. Flash is sensitive to "instruction
bloat," while Pro is sensitive to "ambiguous intent."
---
## 🚥 Deflaking & Promotion
To maintain CI stability, all new evals follow a strict incubation period.
### 1. Incubation (`USUALLY_PASSES`)
New tests must be created with the `USUALLY_PASSES` policy.
```typescript
evalTest('USUALLY_PASSES', { ... })
```
They run in **Evals: Nightly** workflows and do not block PR merges.
### 2. Investigate Failures
If a nightly eval regresses, investigate it with the agent command:
```bash
gemini /fix-behavioral-eval [optional-run-uri]
```
### 3. Promotion (`ALWAYS_PASSES`)
Once a test scores 100% consistency over multiple nightly cycles:
```bash
gemini /promote-behavioral-eval
```
_Do not promote manually._ The command verifies trajectory logs before updating
the file policy.