gemini-cli/.gemini/skills/behavioral-evals/references/creating.md

# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type | Import From | Architecture | Use When |
| --- | --- | --- | --- |
| `evalTest` | `./test-helper.js` | Subprocess: runs the CLI in a separate process and waits for exit. | Standard workspace tests. Do not use `setBreakpoint`; auditing history (`readToolLogs`) is safer. |
| `appEvalTest` | `./app-test-helper.js` | In-process: runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |
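Both rigs consume the same EvalCase shape. A minimal sketch, assuming only the fields this document actually uses (`name`, `timeout`, `setup`, `assert`); the real EvalCase type may carry more:

```js
// Minimal EvalCase sketch. Only name/timeout/setup/assert are taken
// from this document; anything else about the shape is an assumption.
const evalCase = {
  name: 'agent-writes-config',
  timeout: 60000, // safety budget (see guardrails below)
  setup: async (rig) => {
    // seed the workspace before the agent runs
  },
  assert: async (rig) => {
    // audit rig.readToolLogs() after the run completes
  },
};

// Subprocess rig:
// evalTest('USUALLY_PASSES', evalCase); // from './test-helper.js'
```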

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test decision-making.

  • Workspace State: Seed with standard project anchors if testing general capabilities:
    • package.json for NodeJS environments.
    • Minimal configuration files (tsconfig.json, GEMINI.md).
  • Structural Complexity: Provide enough files to force the agent to search or navigate, rather than giving the answer directly. Avoid trivial one-file tests unless testing exact prompt steering.
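As a sketch, the anchors above can be seeded in the setup hook; `rig.createFile(path, content)` is an assumed helper name, so substitute whatever file-seeding API your rig exposes:

```js
// Standard anchors for a NodeJS workspace fixture.
const anchors = {
  'package.json': JSON.stringify({ name: 'fixture', version: '1.0.0' }),
  'tsconfig.json': '{ "compilerOptions": {} }',
  'GEMINI.md': '# Project conventions\n',
  'src/index.ts': 'export {};\n', // extra files force the agent to navigate
};

// Setup hook; rig.createFile is an assumption, not a documented API.
const setup = async (rig) => {
  for (const [path, content] of Object.entries(anchors)) {
    await rig.createFile(path, content);
  }
};
```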

## Fail First Principle

Before asserting a new capability or locking in a fix, verify that the test fails first.

  • It is easy to accidentally write an eval that asserts behavior the agent already exhibits, so it passes by default.
  • Process: reproduce failure with test -> apply fix (prompt/tool) -> verify test passes.

## Testing Patterns

### 1. Breakpoints

Verifies the agent intends to use a tool BEFORE executing it. Useful for interactive prompts or safety checks.

```js
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
},
```

### 2. Tool Confirmation Race

When asserting multiple possible triggers (e.g., "enters plan mode, then asks a question"):

```js
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
},
```

### 3. Audit Tool Logs

Audit the exact operations performed to ensure efficiency (e.g., no redundant reads).

```js
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
},
```
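Log audits also support negative efficiency checks. A sketch that flags redundant reads, assuming each entry carries the call arguments as `toolRequest.args` (only `toolRequest.name` is shown above, so treat the args shape as an assumption):

```js
// Count files that read_file touched more than once; any such file
// indicates a redundant read. The toolRequest.args.file_path shape
// is an assumption; align it with your rig's actual log schema.
const countRedundantReads = (toolLogs) => {
  const reads = new Map();
  for (const log of toolLogs) {
    if (log.toolRequest.name !== 'read_file') continue;
    const path = log.toolRequest.args.file_path;
    reads.set(path, (reads.get(path) ?? 0) + 1);
  }
  return [...reads.values()].filter((count) => count > 1).length;
};

// Hand-written log with one redundant read of a.ts:
const toolLogs = [
  { toolRequest: { name: 'read_file', args: { file_path: 'a.ts' } } },
  { toolRequest: { name: 'write_file', args: { file_path: 'a.ts' } } },
  { toolRequest: { name: 'read_file', args: { file_path: 'a.ts' } } },
];
```

In an assert hook, `expect(countRedundantReads(rig.readToolLogs())).toBe(0)` would pin the trajectory to a single read per file.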

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock server configuration in the setup hook.

```js
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
},
```

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`, `rig.run()` waits for the subprocess to exit before assertions run, so a paused agent never exits and the test hangs indefinitely.

  • Use Breakpoints for appEvalTest or interactive simulations.
  • Use Audit Tool Logs (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a timeout in the EvalCase so a runaway loop cannot burn through quota:

```js
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn Limits)

Check that a tool is called early by asserting on its index in the tool log:

```js
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // called within the first 5 tool calls
},
```