gemini-cli/.gemini/skills/behavioral-evals/references/creating.md

# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type | Import From | Architecture | Use When |
| --- | --- | --- | --- |
| `evalTest` | `./test-helper.js` | Subprocess: runs the CLI in a separate process and waits for exit. | Standard workspace tests. Do not use `setBreakpoint`; auditing history (`readToolLogs`) is safer. |
| `appEvalTest` | `./app-test-helper.js` | In-process: runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |
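Both rigs consume the same EvalCase shape. A minimal sketch, assuming only the fields this document actually uses (`name`, `timeout`, `setup`, `assert`); the real EvalCase type may carry more:

```js
// Minimal EvalCase sketch. Only name/timeout/setup/assert are taken
// from this document; anything else about the shape is an assumption.
const evalCase = {
  name: 'agent-writes-config',
  timeout: 60000, // safety budget (see guardrails below)
  setup: async (rig) => {
    // seed the workspace before the agent runs
  },
  assert: async (rig) => {
    // audit rig.readToolLogs() after the run completes
  },
};

// Subprocess rig:
// evalTest('USUALLY_PASSES', evalCase); // from './test-helper.js'
```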

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test decision-making.

  • Workspace State: Seed with standard project anchors if testing general capabilities:
    • package.json for NodeJS environments.
    • Minimal configuration files (tsconfig.json, GEMINI.md).
  • Structural Complexity: Provide enough files to force the agent to search or navigate, rather than giving the answer directly. Avoid trivial one-file tests unless testing exact prompt steering.
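As a sketch, the anchors above can be seeded in the setup hook; `rig.createFile(path, content)` is an assumed helper name, so substitute whatever file-seeding API your rig exposes:

```js
// Standard anchors for a NodeJS workspace fixture.
const anchors = {
  'package.json': JSON.stringify({ name: 'fixture', version: '1.0.0' }),
  'tsconfig.json': '{ "compilerOptions": {} }',
  'GEMINI.md': '# Project conventions\n',
  'src/index.ts': 'export {};\n', // extra files force the agent to navigate
};

// Setup hook; rig.createFile is an assumption, not a documented API.
const setup = async (rig) => {
  for (const [path, content] of Object.entries(anchors)) {
    await rig.createFile(path, content);
  }
};
```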

## Fail First Principle

Before asserting a new capability or locking in a fix, verify that the test fails first.

  • It is easy to accidentally write an eval that asserts behavior the agent already exhibits, so it passes by default.
  • Process: reproduce failure with test -> apply fix (prompt/tool) -> verify test passes.

## Testing Patterns

### 1. Breakpoints

Verifies the agent intends to use a tool BEFORE executing it. Useful for interactive prompts or safety checks.

```js
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
},
```

### 2. Tool Confirmation Race

When asserting multiple possible triggers (e.g., "enters plan mode, then asks a question"):

```js
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
},
```

### 3. Audit Tool Logs

Audit the exact operations performed to ensure efficiency (e.g., no redundant reads).

```js
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
},
```
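Log audits also support negative efficiency checks. A sketch that flags redundant reads, assuming each entry carries the call arguments as `toolRequest.args` (only `toolRequest.name` is shown above, so treat the args shape as an assumption):

```js
// Count files that read_file touched more than once; any such file
// indicates a redundant read. The toolRequest.args.file_path shape
// is an assumption; align it with your rig's actual log schema.
const countRedundantReads = (toolLogs) => {
  const reads = new Map();
  for (const log of toolLogs) {
    if (log.toolRequest.name !== 'read_file') continue;
    const path = log.toolRequest.args.file_path;
    reads.set(path, (reads.get(path) ?? 0) + 1);
  }
  return [...reads.values()].filter((count) => count > 1).length;
};

// Hand-written log with one redundant read of a.ts:
const toolLogs = [
  { toolRequest: { name: 'read_file', args: { file_path: 'a.ts' } } },
  { toolRequest: { name: 'write_file', args: { file_path: 'a.ts' } } },
  { toolRequest: { name: 'read_file', args: { file_path: 'a.ts' } } },
];
```

In an assert hook, `expect(countRedundantReads(rig.readToolLogs())).toBe(0)` would pin the trajectory to a single read per file.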

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock server configuration in the setup hook.

```js
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
},
```

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`, `rig.run()` waits for the subprocess to exit before assertions run, so a paused agent never exits and the test hangs indefinitely.

  • Use Breakpoints for appEvalTest or interactive simulations.
  • Use Audit Tool Logs (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a timeout in the EvalCase so a runaway loop cannot burn through quota:

```js
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn Limits)

Check that a tool is called early by asserting on its index in the tool log:

```js
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // called within the first 5 tool calls
},
```