mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-04-23 19:44:30 -07:00
152 lines
4.5 KiB
Markdown
# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type          | Import From            | Architecture                                                          | Use When                                                                                              |
| :---------------- | :--------------------- | :-------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------- |
| **`evalTest`**    | `./test-helper.js`     | **Subprocess**. Runs the CLI in a separate process and waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop.                  | UI/Ink rendering. Safe for `setBreakpoint` triggers.                                                  |

---

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test
decision-making.

- **Workspace State**: Seed with standard project anchors if testing general
  capabilities:
  - `package.json` for NodeJS environments.
  - Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless testing exact prompt steering.

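As an example, such a workspace can be seeded with Node's `fs` module in a plain fixture script; the file contents below are illustrative minimums, not required shapes:

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Seed a throwaway workspace with the standard anchors listed above.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'eval-ws-'));

fs.writeFileSync(
  path.join(dir, 'package.json'),
  JSON.stringify(
    { name: 'fixture', version: '0.0.0', scripts: { build: 'tsc' } },
    null,
    2,
  ),
);
fs.writeFileSync(
  path.join(dir, 'tsconfig.json'),
  '{ "compilerOptions": { "strict": true } }\n',
);
fs.writeFileSync(path.join(dir, 'GEMINI.md'), '# Fixture project\n');

// A couple of source files so the agent has to navigate, not guess.
fs.mkdirSync(path.join(dir, 'src'));
fs.writeFileSync(path.join(dir, 'src', 'index.ts'), 'export const answer = 42;\n');
```

Point the rig's working directory at `dir` (or inline the same writes in the test's `setup` hook) so every run starts from an identical, realistic state.
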
---

## ❌ Fail First Principle

Before asserting a new capability or locking in a fix, **verify that the test
fails first**.

- It is easy to accidentally write an eval that asserts behaviors that are
  already met or pass by default.
- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify
  test passes.

---

## ✋ Testing Patterns

### 1. Breakpoints

Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.

```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
}
```

### 2. Tool Confirmation Race

When asserting multiple triggers (e.g., "enters plan mode then asks question"):

```typescript
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
};
```

### 3. Audit Tool Logs

Audit exact operations to ensure efficiency (e.g., no redundant reads).

```typescript
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
};
```

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.

```typescript
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
};
```

---

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run, so a
breakpoint there **will hang indefinitely**.

- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a budget boundary in the `EvalCase` to prevent runaway loops on
quota:

```typescript
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn limits)

Check if a tool is called _early_ using index checks:

```typescript
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // Called within the first 5 tool calls
};
```