# Creating Behavioral Evals
## 🔬 Rig Selection

| Rig Type | Import From | Architecture | Use When |
|---|---|---|---|
| `evalTest` | `./test-helper.js` | Subprocess. Runs the CLI in a separate process and waits for exit. | Standard workspace tests. Do not use `setBreakpoint`; auditing history (`readToolLogs`) is safer. |
| `appEvalTest` | `./app-test-helper.js` | In-process. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |
## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test decision-making.

- Workspace State: Seed with standard project anchors if testing general capabilities:
  - `package.json` for NodeJS environments.
  - Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- Structural Complexity: Provide enough files to force the agent to search or navigate, rather than giving the answer directly. Avoid trivial one-file tests unless testing exact prompt steering.
## ❌ Fail-First Principle

Before asserting a new capability or locking in a fix, verify that the test fails first.

- It is easy to accidentally write an eval that asserts behavior the agent already exhibits, so it passes by default.
- Process: reproduce the failure with the test → apply the fix (prompt/tool) → verify the test passes.
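The pass-by-default trap is easiest to see with negative assertions. Below is a hypothetical, self-contained illustration (plain arrays, not the real rig): a "no redundant read" check evaluated against an empty tool log succeeds even though the agent never ran:

```js
// Hypothetical tool log: the eval never actually exercised the agent.
const toolLogs = [];

// "No redundant second read of the same file" -- a negative assertion.
const readCalls = toolLogs.filter((log) => log.name === 'read_file');
const redundantRead = readCalls[1]; // undefined: zero reads, not one

// The check passes vacuously, proving nothing about the fix under test.
const passesByDefault = redundantRead === undefined; // true
```

Seeing the test fail on an unfixed build is the only way to know the assertion has teeth.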
## ✋ Testing Patterns
### 1. Breakpoints

Verifies that the agent intends to use a tool BEFORE executing it. Useful for interactive prompts or safety checks.

```js
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
},
```
### 2. Tool Confirmation Race

When asserting multiple possible triggers (e.g., "enters plan mode, then asks a question"):

```js
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);
  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
},
```
### 3. Audit Tool Logs

Audit exact operations to ensure efficiency (e.g., no redundant reads).

```js
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
},
```
### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock server configuration in the setup hook.

```js
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
},
```
## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`, `rig.run()` waits for the process to exit before assertions run, so a breakpoint will hang the test indefinitely.

- Use breakpoints only with `appEvalTest` or interactive simulations.
- Use Audit Tool Logs (above) for standard trajectory tests.
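The hang can be modeled in miniature with plain promises (a simplified sketch, not the real rig API): the subprocess's exit promise never settles while the agent is paused at a breakpoint, so nothing downstream of `rig.run()` executes.

```js
// Simplified model of evalTest + setBreakpoint. The paused CLI awaits a
// confirmation that only the assert phase could send -- but assert runs
// after run() resolves, which it never does. A watchdog timer stands in
// for the suite timeout to make the hang observable.
async function simulateRunWithBreakpoint(watchdogMs) {
  const processExit = new Promise(() => {}); // never settles while paused
  const watchdog = new Promise((resolve) =>
    setTimeout(() => resolve('hung'), watchdogMs),
  );
  return Promise.race([processExit, watchdog]);
}
```

`await simulateRunWithBreakpoint(100)` resolves to `'hung'`; in a real eval, only the suite-level timeout would eventually kill the process.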
### 2. Runaway Timeout

Always set a budget boundary in the EvalCase to prevent runaway loops from burning quota:

```js
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```
### 3. Efficiency Assertions (Turn Limits)

Check that a tool is called early using index checks:

```js
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );
  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // called within the first 5 tool calls
},
```