# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type          | Import From            | Architecture                                                         | Use When                                                                                              |
| :---------------- | :--------------------- | :------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- |
| **`evalTest`**    | `./test-helper.js`     | **Subprocess**. Runs the CLI in a separate process and waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop.                | UI/Ink rendering. Safe for `setBreakpoint` triggers.                                                  |

---

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test
decision-making.

- **Workspace State**: Seed with standard project anchors if testing general
  capabilities:
  - `package.json` for NodeJS environments.
  - Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless testing exact prompt steering.
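
A minimal sketch of such a seed, as plain data. The file names and contents are illustrative, and the rig helper that would actually write them into the workspace (e.g. a `createFile(path, content)` method on the rig) is an assumption here, not a documented API:

```typescript
// Illustrative seed map of standard project anchors plus a couple of extra
// source files for structural complexity, so the agent must search/navigate.
const workspaceSeed: Record<string, string> = {
  'package.json': JSON.stringify({ name: 'fixture-app', type: 'module' }),
  'tsconfig.json': JSON.stringify({ compilerOptions: { strict: true } }),
  'GEMINI.md': '# Project conventions\n- Prefer small, focused diffs.',
  'src/index.ts': 'export { add } from "./math.js";',
  'src/math.ts': 'export const add = (a: number, b: number) => a + b;',
};

// Sanity-check that the standard anchors are all present before running.
const anchors = ['package.json', 'tsconfig.json', 'GEMINI.md'];
const missing = anchors.filter((f) => !(f in workspaceSeed));
console.log(missing.length === 0 ? 'seed ok' : `missing: ${missing.join(', ')}`);
```

In a real eval, the `setup` hook would iterate this map and write each entry into the test workspace before the agent starts.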

---

## ❌ Fail First Principle

Before asserting a new capability or locking in a fix, **verify that the test
fails first**.

- It is easy to accidentally write an eval that asserts behaviors that are
  already met or pass by default.
- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify
  test passes.

---

## ✋ Testing Patterns

### 1. Breakpoints

Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.

```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
}
```

### 2. Tool Confirmation Race

When asserting multiple triggers (e.g., "enters plan mode then asks question"):

```typescript
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
};
```

### 3. Audit Tool Logs

Audit exact operations to ensure efficiency (e.g., no redundant reads).

```typescript
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
};
```

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.

```typescript
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
};
```

---

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run. **This will
hang indefinitely.**

- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a budget boundary in the `EvalCase` to prevent runaway loops on
quota:

```typescript
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn limits)

Check whether a tool is called _early_ using index checks:

```typescript
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // Called within the first 5 tool calls
};
```

# Fixing Behavioral Evals

Use this guide when asked to debug, troubleshoot, or fix a failing behavioral
evaluation.

---

## 1. 🔍 Investigate

1. **Fetch Nightly Results**: Use the `gh` CLI to inspect the latest run from
   `evals-nightly.yml` if applicable.
   - _Example view URL_:
     `https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`
2. **Isolate**: DO NOT push changes or start remote runs. Confine the
   investigation to the local workspace.
3. **Read Logs**:
   - Eval logs live in `evals/logs/<test_name>.log`.
   - Enable verbose debugging via `export GEMINI_DEBUG_LOG_FILE="debug.log"`.
4. **Diagnose**: Audit tool logs and telemetry. Note whether the failure stems
   from the `setup` or the `assert` hook.
   - **Tip**: Proactively add custom logging/diagnostics to check hypotheses.

---

## 2. 🛠️ Fix Strategy

1. **Targeted Location**: Locate the test case and the corresponding
   prompt/code.
2. **Iterative Scope**: Make an extreme change first to verify the scope, then
   refine it to a minimal, targeted change.
3. **Assertion Fidelity**:
   - Changing the test prompt is a **last resort** (prompts are often vague by
     design).
   - **Warning**: Do not lose test fidelity by making prompts too direct/easy.
   - **Primary Fix Trigger**: Adjust tool descriptions, system prompts
     (`snippets.ts`), or **modules that contribute to the prompt template**.
   - **Warning**: Prompts have multiple configurations; ensure your fix targets
     the correct config for the model in question.
4. **Architecture Options**: If prompt or instruction tuning yields no
   improvement, analyze the loop composition.
   - **AgentLoop**: Defined by `context + toolset + prompt`.
   - **Enhancements**: Loops perform best with direct prompts, fewer irrelevant
     tools, low goal density, and minimal low-value/irrelevant context.
   - **Modifications**: Compose subagents or isolate tools. Ground changes in
     observed traces.
   - **Warning**: Think deeply before offering recommendations; avoid parroting
     abstract design guidelines.

---

## 3. ✅ Verify

1. **Run Local**: Run Vitest in non-interactive mode on just the file.
2. **Log Audit**: Prioritize diagnosing failures via log comparison before
   triggering heavy test runs.
3. **Stability Limit**: Run the test **3 times** locally on key models (you can
   use scripts to run them in parallel for speed):
   - **Gemini 3.0**
   - **Gemini 3 Flash**
   - **Gemini 2.5 Pro**
4. **Flakiness Rule**: If the test passes only 2/3 times, the failures may be
   inherent noise that is difficult to eliminate without a structural split.
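
The per-model stability check above boils down to a simple tally. A minimal sketch over illustrative run data (the model keys and result shape are assumptions, not a real results format):

```typescript
// Pass/fail results from 3 local runs per model (illustrative data).
const runs: Record<string, boolean[]> = {
  'gemini-3.0': [true, true, true],
  'gemini-3-flash': [true, true, false],
  'gemini-2.5-pro': [true, true, true],
};

// Report a success rate per model and flag anything below 3/3 as flaky.
const flaky: string[] = [];
for (const [model, results] of Object.entries(runs)) {
  const passes = results.filter(Boolean).length;
  console.log(`${model}: ${passes}/${results.length}`);
  if (passes < results.length) flaky.push(model);
}
console.log(flaky.length ? `flaky on: ${flaky.join(', ')}` : 'stable');
```

Any model appearing in the flaky list is a candidate for the log comparison described in step 2 before deeper fixes.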

---

## 4. 📊 Report

Provide a summary of:

- Test success rate for each tested model (e.g., 3/3 = 100%).
- Root cause identification and fix explanation.
- If unfixed, provide high-confidence architecture recommendations.

# Promoting Behavioral Evals

Use this guide when asked to analyze nightly results and promote incubated tests
to stable suites.

---

## 1. 🔍 Investigate candidates

1. **Audit Nightly Logs**: Use the `gh` CLI to fetch results from
   `evals-nightly.yml` (Direct URL:
   `https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml`).
   - **Tip**: The aggregate summary from the most recent run integrates the
     last 7 runs of history automatically.
   - **Safety**: DO NOT push changes or start remote runs. All verification is
     local.
2. **Assess Stability**: Identify tests that pass **100% of the time** across
   ALL enabled models over the **last 7 nightly runs** in a row.
   - _100% means the test passed 3/3 times for every model and run._
3. **Promotion Targets**: Tests meeting these criteria are candidates for
   promotion from `USUALLY_PASSES` to `ALWAYS_PASSES`.
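
The stability bar above can be sketched as a small eligibility check over run history. The data shape here is an assumption for illustration, not the real nightly summary format:

```typescript
// history[model] holds pass counts out of 3 for the last 7 nightly runs
// (illustrative shape and values).
const history: Record<string, number[]> = {
  'gemini-3.0': [3, 3, 3, 3, 3, 3, 3],
  'gemini-3-flash': [3, 3, 3, 3, 3, 3, 3],
  'gemini-2.5-pro': [3, 3, 3, 3, 3, 2, 3],
};

// Eligible for ALWAYS_PASSES only if every model passed 3/3 in all of the
// last 7 runs.
const eligible = Object.values(history).every(
  (runs) => runs.length >= 7 && runs.slice(-7).every((p) => p === 3),
);
console.log(eligible ? 'promote' : 'keep incubating');
```

Here the single 2/3 run for `gemini-2.5-pro` keeps the test in incubation, matching the "100% means 3/3 for every model and run" rule.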

---

## 2. 🚥 Promotion Steps

1. **Locate File**: Locate the eval file in the `evals/` directory.
2. **Update Policy**: Modify the policy argument to `ALWAYS_PASSES`.
   ```typescript
   evalTest('ALWAYS_PASSES', { ... })
   ```
3. **Targeting**: Follow the guidelines in `evals/README.md` regarding stable
   suite organization.
4. **Constraint**: Your final change must be **minimal and targeted**, strictly
   limited to promoting the test status. Do not refactor the test or setup
   fixtures.

---

## 3. ✅ Verify

1. **Run Promoted Tests**: Run the promoted test locally using non-interactive
   Vitest to confirm structural validity.
2. **Verify Suite Inclusion**: Check that the test is successfully picked up by
   the standard runnable ranges.

---

## 4. 📊 Report

Provide a summary of:

- Which tests were promoted.
- The success-rate evidence (e.g., 7/7 runs passed for all models).
- If no candidates qualified, the next closest candidates and their current
  pass rates.

# Running & Promoting Evals

## 🛠️ Prerequisites

Behavioral evals run against the compiled binary. After making changes, you
**must** build and bundle the project first:

```bash
npm run build && npm run bundle
```

---

## 🏃‍♂️ Running Tests

### 1. Configure Environment Variables

Evals require a standard API key. If your `.env` file has multiple keys or
comments, use this precise extraction setup:

```bash
export GEMINI_API_KEY=$(grep '^GEMINI_API_KEY=' .env | cut -d '=' -f2) && RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts <file_name>
```
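
To see what the `grep | cut` extraction does in isolation, here is a self-contained demo against a throwaway `.env` (the file path and key values are made up):

```shell
# Build a sample .env with a comment and multiple keys (illustrative values).
printf '# local secrets\nOTHER_KEY=foo\nGEMINI_API_KEY=abc123\n' > /tmp/sample.env

# Match only the exact key at line start, then take the value after '='.
grep '^GEMINI_API_KEY=' /tmp/sample.env | cut -d '=' -f2
# → abc123
```

Note that `cut -f2` keeps only the first `=`-separated field of the value; a key whose value itself contains `=` would need `-f2-` instead.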

### 2. Commands

| Command                             | Scope           | Description                                        |
| :---------------------------------- | :-------------- | :------------------------------------------------- |
| `npm run test:always_passing_evals` | `ALWAYS_PASSES` | Fast feedback, runs in CI.                         |
| `npm run test:all_evals`            | All             | Runs nightly incubation tests. Sets `RUN_EVALS=1`. |

### Target Specific File

_Note: `RUN_EVALS=1` is required for incubated (`USUALLY_PASSES`) tests._

```bash
RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts
```

---

## 🐞 Debugging and Logs

If a test fails, verify:

- **Tool Trajectory Logs**: the sequence of calls in
  `evals/logs/<test_name>.log`.
- **Verbose Reasoning**: Capture raw buffer traces by setting
  `GEMINI_DEBUG_LOG_FILE`:
  ```bash
  export GEMINI_DEBUG_LOG_FILE="debug.log"
  ```

---

### 🎯 Verify Model Targeting

- **Tip:** Standard evals benchmark against model variations. If a test passes
  on Flash but fails on Pro (or vice versa), the issue is usually in the **tool
  description**, not the prompt definition. Flash is sensitive to "instruction
  bloat," while Pro is sensitive to "ambiguous intent."

---

## 🚥 Deflaking & Promotion

To maintain CI stability, all new evals follow a strict incubation period.

### 1. Incubation (`USUALLY_PASSES`)

New tests must be created with the `USUALLY_PASSES` policy.

```typescript
evalTest('USUALLY_PASSES', { ... })
```

They run in **Evals: Nightly** workflows and do not block PR merges.

### 2. Investigate Failures

If a nightly eval regresses, investigate via the agent:

```bash
gemini /fix-behavioral-eval [optional-run-uri]
```

### 3. Promotion (`ALWAYS_PASSES`)

Once a test scores 100% consistency over multiple nightly cycles:

```bash
gemini /promote-behavioral-eval
```

_Do not promote manually._ The command verifies trajectory logs before updating
the file policy.