mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-05-12 12:54:07 -07:00
feat(skills): add behavioral-evals skill with fixing and promoting guides (#23349)
# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type | Import From | Architecture | Use When |
| :-- | :-- | :-- | :-- |
| **`evalTest`** | `./test-helper.js` | **Subprocess**. Runs the CLI in a separate process and waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |

---

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test
decision-making.

- **Workspace State**: Seed with standard project anchors if testing general
  capabilities:
  - `package.json` for NodeJS environments.
  - Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless testing exact prompt steering.

---

## ❌ Fail First Principle

Before asserting a new capability or locking in a fix, **verify that the test
fails first**.

- It is easy to accidentally write an eval that asserts behaviors that are
  already met or pass by default.
- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify
  test passes.

---

## ✋ Testing Patterns

### 1. Breakpoints

Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.

```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
}
```

### 2. Tool Confirmation Race

When asserting multiple triggers (e.g., "enters plan mode then asks question"):

```typescript
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
};
```

### 3. Audit Tool Logs

Audit exact operations to ensure efficiency (e.g., no redundant reads).

```typescript
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
};
```

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.

```typescript
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
};
```

---

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run, so a
breakpoint **will hang the test indefinitely.**

- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a budget boundary in the `EvalCase` to prevent runaway loops on
quota:

```typescript
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn limits)

Check if a tool is called _early_ using index checks:

```typescript
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // Among the first 5 tool calls
};
```