feat(skills): add behavioral-evals skill with fixing and promoting guides (#23349)

This commit is contained in:
Abhi
2026-03-23 17:06:43 -04:00
committed by GitHub
parent fbf38361ad
commit db14cdf92b
10 changed files with 509 additions and 143 deletions
@@ -0,0 +1,151 @@
# Creating Behavioral Evals
## 🔬 Rig Selection
| Rig Type | Import From | Architecture | Use When |
| :---------------- | :--------------------- | :------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- |
| **`evalTest`**    | `./test-helper.js`     | **Subprocess**. Runs the CLI in a separate process and waits for it to exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer.  |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |
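As a rough orientation, a subprocess-rig eval is a single `evalTest` call whose
case object combines the `name`, `timeout`, `setup`, and `assert` pieces used in
the patterns below. This is a minimal sketch only; the exact `EvalCase` shape
(e.g., how the user prompt is supplied) is defined by the helper and is not
reproduced here.
```typescript
import { expect } from 'vitest';
// Subprocess rig, per the table above.
import { evalTest } from './test-helper.js';

evalTest('USUALLY_PASSES', {
  name: 'agent inspects the workspace before answering',
  timeout: 60000, // safety limit against runaway loops
  setup: async (rig) => {
    // Seed workspace anchors here (package.json, GEMINI.md, ...).
  },
  assert: async (rig) => {
    await rig.waitForTelemetryReady();
    const toolLogs = rig.readToolLogs();
    // Trajectory auditing is the safe pattern for the subprocess rig.
    expect(toolLogs.length).toBeGreaterThan(0);
  },
});
```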
---
## 🏗️ Scenario Design
Evals must simulate realistic agent environments to effectively test
decision-making.
- **Workspace State**: Seed with standard project anchors if testing general
capabilities:
- `package.json` for NodeJS environments.
- Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless you are testing exact prompt steering (a seeding sketch follows
  this list).
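A minimal sketch of a `setup` hook that seeds such a workspace. The seeding API
is not documented in this guide, so `rig.createFile` below is a hypothetical
stand-in for whatever file-seeding helper the rig actually exposes:
```typescript
setup: async (rig) => {
  // Standard project anchors so the scenario resembles a real NodeJS repo.
  // NOTE: rig.createFile is hypothetical; substitute the rig's real helper.
  await rig.createFile('package.json', JSON.stringify({ name: 'demo-app', type: 'module' }));
  await rig.createFile('tsconfig.json', '{ "compilerOptions": { "strict": true } }');
  await rig.createFile('GEMINI.md', '# Project conventions\n');

  // Enough structure that the agent must search/navigate rather than being
  // handed the answer in a single file.
  await rig.createFile('src/index.ts', "export { loadConfig } from './config.js';");
  await rig.createFile('src/config.ts', '// the behavior under test lives here');
},
```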
---
## ❌ Fail First Principle
Before asserting a new capability or locking in a fix, **verify that the test
fails first**.
- It is easy to accidentally write an eval that asserts behaviors that are
already met or pass by default.
- **Process**: reproduce the failure with the test -> apply the fix
  (prompt/tool) -> verify the test now passes (see the sketch below).
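For example, the fail-first loop can be exercised locally with the single-file
run command described in the running guide (the file name is illustrative):
```bash
# 1. Run the new eval before the fix lands; it should fail.
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts

# 2. Apply the prompt/tool fix, rebuild the binary, and re-run; it should now pass.
npm run build && npm run bundle
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```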
---
## ✋ Testing Patterns
### 1. Breakpoints
Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.
```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
const confirmation = await rig.waitForPendingConfirmation('ask_user');
expect(confirmation).toBeDefined();
}
```
### 2. Tool Confirmation Race
When asserting multiple triggers (e.g., "enters plan mode then asks question"):
```typescript
assert: async (rig) => {
let confirmation = await rig.waitForPendingConfirmation([
'enter_plan_mode',
'ask_user',
]);
if (confirmation?.name === 'enter_plan_mode') {
rig.acceptConfirmation('enter_plan_mode');
confirmation = await rig.waitForPendingConfirmation('ask_user');
}
expect(confirmation?.name).toBe('ask_user');
};
```
### 3. Audit Tool Logs
Audit exact operations to ensure efficiency (e.g., no redundant reads).
```typescript
assert: async (rig, result) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const writeCall = toolLogs.find(
(log) => log.toolRequest.name === 'write_file',
);
expect(writeCall).toBeDefined();
};
```
### 4. Mock MCP Facades
To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.
```typescript
setup: async (rig) => {
rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const workspaceCall = toolLogs.find(
(log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText'
);
expect(workspaceCall).toBeDefined();
};
```
---
## ⚠️ Safety & Efficiency Guardrails
### 1. Breakpoint Deadlocks
Breakpoints (`setBreakpoint`) pause execution, but in standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run. **Combining
the two will hang indefinitely.**
- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.
### 2. Runaway Timeout
Always set a budget boundary in the `EvalCase` to prevent runaway loops from
burning quota:
```typescript
evalTest('USUALLY_PASSES', {
name: '...',
timeout: 60000, // 1 minute safety limit
// ...
});
```
### 3. Efficiency Assertion (Turn limits)
Check if a tool is called _early_ using index checks:
```typescript
assert: async (rig) => {
const toolLogs = rig.readToolLogs();
const toolCallIndex = toolLogs.findIndex(
(log) => log.toolRequest.name === 'cli_help',
);
expect(toolCallIndex).toBeGreaterThan(-1);
expect(toolCallIndex).toBeLessThan(5); // Called within first 5 turns
};
```
@@ -0,0 +1,71 @@
# Fixing Behavioral Evals
Use this guide when asked to debug, troubleshoot, or fix a failing behavioral
evaluation.
---
## 1. 🔍 Investigate
1. **Fetch Nightly Results**: Use the `gh` CLI to inspect the latest run from
   `evals-nightly.yml` if applicable (see the sketch after this list).
- _Example view URL_:
`https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`
2. **Isolate**: DO NOT push changes or start remote runs. Confine investigation
to the local workspace.
3. **Read Logs**:
- Eval logs live in `evals/logs/<test_name>.log`.
- Enable verbose debugging via `export GEMINI_DEBUG_LOG_FILE="debug.log"`.
4. **Diagnose**: Audit tool logs and telemetry. Note whether the failure
   originates in the `setup` hook or the `assert` phase.
- **Tip**: Proactively add custom logging/diagnostics to check hypotheses.
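A sketch of step 1 using standard `gh` commands (the run ID is a placeholder
taken from the list output):
```bash
# Recent nightly eval runs and their conclusions.
gh run list --workflow=evals-nightly.yml --limit 5

# Pull only the failed job logs for the run you are investigating.
gh run view <run-id> --log-failed
```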
---
## 2. 🛠️ Fix Strategy
1. **Targeted Location**: Locate the test case and the corresponding
prompt/code.
2. **Iterative Scope**: Make an extreme change first to verify the scope of
   influence, then refine it to a minimal, targeted change.
3. **Assertion Fidelity**:
- Changing the test prompt is a **last resort** (prompts are often vague by
design).
- **Warning**: Do not lose test fidelity by making prompts too direct/easy.
- **Primary Fix Trigger**: Adjust tool descriptions, system prompts
(`snippets.ts`), or **modules that contribute to the prompt template**.
- **Warning**: Prompts have multiple configurations; ensure your fix targets
the correct config for the model in question.
4. **Architecture Options**: If prompt or instruction tuning yields no
   improvement, analyze the loop composition.
- **AgentLoop**: Defined by `context + toolset + prompt`.
- **Enhancements**: Loops perform best with direct prompts, fewer irrelevant
tools, low goal density, and minimal low-value/irrelevant context.
- **Modifications**: Compose subagents or isolate tools. Ground in observed
traces.
- **Warning**: Think deeply before offering recommendations; avoid parroting
abstract design guidelines.
---
## 3. ✅ Verify
1. **Run Local**: Run Vitest in non-interactive mode on just the file.
2. **Log Audit**: Prioritize diagnosing failures via log comparison before
triggering heavy test runs.
3. **Stability Limit**: Run the test **3 times** locally on each key model
   (scripts can run them in parallel for speed; see the sketch after this list):
- **Gemini 3.0**
- **Gemini 3 Flash**
- **Gemini 2.5 Pro**
4. **Flakiness Rule**: If it passes 2/3 times, the remaining failure may be
   inherent noise that is difficult to improve without a structural split.
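A sketch of the stability check for one model and one file. How the model under
test is pinned is an assumption here (shown as a `GEMINI_MODEL`-style variable);
adjust it to whatever selection mechanism your setup actually uses:
```bash
# Run the eval three times; stop early on the first failure.
for i in 1 2 3; do
  # GEMINI_MODEL is assumed as the model-selection mechanism for this sketch.
  GEMINI_MODEL=gemini-2.5-pro RUN_EVALS=1 \
    npx vitest run --config evals/vitest.config.ts my_feature.eval.ts || break
done
```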
---
## 4. 📊 Report
Provide a summary of:
- Test success rate for each tested model (e.g., 3/3 = 100%).
- Root cause identification and fix explanation.
- If unfixed, provide high-confidence architecture recommendations.
@@ -0,0 +1,55 @@
# Promoting Behavioral Evals
Use this guide when asked to analyze nightly results and promote incubated tests
to stable suites.
---
## 1. 🔍 Investigate candidates
1. **Audit Nightly Logs**: Use the `gh` CLI to fetch results from
   `evals-nightly.yml` (see the sketch after this list; direct URL:
   `https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`).
- **Tip**: The aggregate summary from the most recent run integrates the
last 7 runs of history automatically.
- **Safety**: DO NOT push changes or start remote runs. All verification is
local.
2. **Assess Stability**: Identify tests that pass **100% of the time** across
   ALL enabled models in each of the **last 7 consecutive nightly runs**.
- _100% means the test passed 3/3 times for every model and run._
3. **Promotion Targets**: Tests meeting these criteria are candidates for
promotion from `USUALLY_PASSES` to `ALWAYS_PASSES`.
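A sketch for pulling the recent nightly history with standard `gh` flags; the
per-test, per-model pass rates still have to be read from each run's summary:
```bash
# Conclusions of the last 7 nightly runs.
gh run list --workflow=evals-nightly.yml --limit 7 \
  --json displayTitle,conclusion,createdAt

# Open a specific run to read the aggregate eval summary.
gh run view <run-id>
```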
---
## 2. 🚥 Promotion Steps
1. **Locate File**: Locate the eval file in the `evals/` directory.
2. **Update Policy**: Modify the policy argument to `ALWAYS_PASSES`.
```typescript
evalTest('ALWAYS_PASSES', { ... })
```
3. **Targeting**: Follow guidelines in `evals/README.md` regarding stable suite
organization.
4. **Constraint**: Your final change must be **minimal and targeted** strictly
to promoting the test status. Do not refactor the test or setup fixtures.
---
## 3. ✅ Verify
1. **Run Promoted Tests**: Run the promoted test locally using non-interactive
   Vitest to confirm it is structurally valid (command sketch below).
2. **Verify Suite Inclusion**: Check that the test is now picked up by the
   standard stable suite run.
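A sketch of both verification steps using the commands from the running guide
(file name illustrative; `RUN_EVALS=1` is only required for incubated tests but
is harmless here):
```bash
# 1. Structure check on the promoted file.
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts

# 2. Confirm the test is picked up by the stable (ALWAYS_PASSES) suite.
npm run test:always_passing_evals
```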
---
## 4. 📊 Report
Provide a summary of:
- Which tests were promoted.
- The success-rate evidence (e.g., 7/7 runs passed for all models).
- If no candidates qualified, list the next closest candidates and their current
pass rate.
@@ -0,0 +1,95 @@
# Running & Promoting Evals
## 🛠️ Prerequisites
Behavioral evals run against the compiled binary. You **must** rebuild and
bundle the project after making changes:
```bash
npm run build && npm run bundle
```
---
## 🏃‍♂️ Running Tests
### 1. Configure Environment Variables
Evals require a standard API key. If your `.env` file has multiple keys or
comments, use this precise extraction setup:
```bash
export GEMINI_API_KEY=$(grep '^GEMINI_API_KEY=' .env | cut -d '=' -f2) && RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts <file_name>
```
### 2. Commands
| Command | Scope | Description |
| :---------------------------------- | :-------------- | :------------------------------------------------- |
| `npm run test:always_passing_evals` | `ALWAYS_PASSES` | Fast feedback, runs in CI. |
| `npm run test:all_evals` | All | Runs nightly incubation tests. Sets `RUN_EVALS=1`. |
### Target Specific File
_Note: `RUN_EVALS=1` is required for incubated (`USUALLY_PASSES`) tests._
```bash
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```
---
## 🐞 Debugging and Logs
If a test fails, verify:
- **Tool Trajectory Logs**: the sequence of calls in `evals/logs/<test_name>.log`.
- **Verbose Reasoning**: Capture raw buffer traces by setting
`GEMINI_DEBUG_LOG_FILE`:
```bash
export GEMINI_DEBUG_LOG_FILE="debug.log"
```
---
### 🎯 Verify Model Targeting
- **Tip:** Standard evals benchmark against model variations. If a test passes
on Flash but fails on Pro (or vice versa), the issue is usually in the **tool
description**, not the prompt definition. Flash is sensitive to "instruction
bloat," while Pro is sensitive to "ambiguous intent."
---
## 🚥 Deflaking & Promotion
To maintain CI stability, all new evals follow a strict incubation period.
### 1. Incubation (`USUALLY_PASSES`)
New tests must be created with the `USUALLY_PASSES` policy.
```typescript
evalTest('USUALLY_PASSES', { ... })
```
They run in **Evals: Nightly** workflows and do not block PR merges.
### 2. Investigate Failures
If a nightly eval regresses, investigate it with the agent command:
```bash
gemini /fix-behavioral-eval [optional-run-uri]
```
### 3. Promotion (`ALWAYS_PASSES`)
Once a test scores 100% consistency over multiple nightly cycles:
```bash
gemini /promote-behavioral-eval
```
_Do not promote manually._ The command verifies trajectory logs before updating
the file policy.