mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-04-23 19:44:30 -07:00
152 lines
4.5 KiB
Markdown
# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type          | Import From            | Architecture                                                          | Use When                                                                                              |
| :---------------- | :--------------------- | :-------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- |
| **`evalTest`**    | `./test-helper.js`     | **Subprocess**. Runs the CLI in a separate process and waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop.                  | UI/Ink rendering. Safe for `setBreakpoint` triggers.                                                  |

---

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test
decision-making.

- **Workspace State**: Seed with standard project anchors if testing general
  capabilities:
  - `package.json` for NodeJS environments.
  - Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless testing exact prompt steering.

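As an example, such a workspace can be seeded with Node's `fs` module in a plain fixture script; the file contents below are illustrative minimums, not required shapes:

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Seed a throwaway workspace with the standard anchors listed above.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'eval-ws-'));

fs.writeFileSync(
  path.join(dir, 'package.json'),
  JSON.stringify(
    { name: 'fixture', version: '0.0.0', scripts: { build: 'tsc' } },
    null,
    2,
  ),
);
fs.writeFileSync(
  path.join(dir, 'tsconfig.json'),
  '{ "compilerOptions": { "strict": true } }\n',
);
fs.writeFileSync(path.join(dir, 'GEMINI.md'), '# Fixture project\n');

// A couple of source files so the agent has to navigate, not guess.
fs.mkdirSync(path.join(dir, 'src'));
fs.writeFileSync(path.join(dir, 'src', 'index.ts'), 'export const answer = 42;\n');
```

Point the rig's working directory at `dir` (or inline the same writes in the test's `setup` hook) so every run starts from an identical, realistic state.
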
---

## ❌ Fail First Principle

Before asserting a new capability or locking in a fix, **verify that the test
fails first**.

- It is easy to accidentally write an eval that asserts behaviors that are
  already met or pass by default.
- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify
  test passes.

---

## ✋ Testing Patterns

### 1. Breakpoints

Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.

```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
}
```

### 2. Tool Confirmation Race

When asserting multiple triggers (e.g., "enters plan mode then asks question"):

```typescript
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
};
```

### 3. Audit Tool Logs

Audit exact operations to ensure efficiency (e.g., no redundant reads).

```typescript
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
};
```

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.

```typescript
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
};
```

---

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run, so a
breakpoint there **will hang indefinitely**.

- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a budget boundary in the `EvalCase` to prevent runaway loops on
quota:

```typescript
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn limits)

Check if a tool is called _early_ using index checks:

```typescript
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // Called within the first 5 tool calls
};
```