mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-05-12 12:54:07 -07:00
feat(skills): add behavioral-evals skill with fixing and promoting guides (#23349)
# Creating Behavioral Evals

## 🔬 Rig Selection

| Rig Type | Import From | Architecture | Use When |
| :-- | :-- | :-- | :-- |
| **`evalTest`** | `./test-helper.js` | **Subprocess**. Runs the CLI in a separate process and waits for exit. | Standard workspace tests. **Do not use `setBreakpoint`**; auditing history (`readToolLogs`) is safer. |
| **`appEvalTest`** | `./app-test-helper.js` | **In-Process**. Runs directly inside the runner loop. | UI/Ink rendering. Safe for `setBreakpoint` triggers. |

---

## 🏗️ Scenario Design

Evals must simulate realistic agent environments to effectively test
decision-making.

- **Workspace State**: Seed with standard project anchors if testing general
  capabilities:
  - `package.json` for NodeJS environments.
  - Minimal configuration files (`tsconfig.json`, `GEMINI.md`).
- **Structural Complexity**: Provide enough files to force the agent to _search_
  or _navigate_, rather than giving the answer directly. Avoid trivial one-file
  tests unless testing exact prompt steering.

---

## ❌ Fail First Principle

Before asserting a new capability or locking in a fix, **verify that the test
fails first**.

- It is easy to accidentally write an eval that asserts behaviors that are
  already met or pass by default.
- **Process**: reproduce failure with test -> apply fix (prompt/tool) -> verify
  test passes.

---

## ✋ Testing Patterns

### 1. Breakpoints

Verifies the agent _intends_ to use a tool BEFORE executing it. Useful for
interactive prompts or safety checks.

```typescript
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
  rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
  const confirmation = await rig.waitForPendingConfirmation('ask_user');
  expect(confirmation).toBeDefined();
}
```

### 2. Tool Confirmation Race

When asserting multiple triggers (e.g., "enters plan mode then asks question"):

```typescript
assert: async (rig) => {
  let confirmation = await rig.waitForPendingConfirmation([
    'enter_plan_mode',
    'ask_user',
  ]);

  if (confirmation?.name === 'enter_plan_mode') {
    rig.acceptConfirmation('enter_plan_mode');
    confirmation = await rig.waitForPendingConfirmation('ask_user');
  }
  expect(confirmation?.name).toBe('ask_user');
};
```

### 3. Audit Tool Logs

Audit exact operations to ensure efficiency (e.g., no redundant reads).

```typescript
assert: async (rig, result) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();

  const writeCall = toolLogs.find(
    (log) => log.toolRequest.name === 'write_file',
  );
  expect(writeCall).toBeDefined();
};
```

### 4. Mock MCP Facades

To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the `setup` hook.

```typescript
setup: async (rig) => {
  rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
  await rig.waitForTelemetryReady();
  const toolLogs = rig.readToolLogs();
  const workspaceCall = toolLogs.find(
    (log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText',
  );
  expect(workspaceCall).toBeDefined();
};
```

---

## ⚠️ Safety & Efficiency Guardrails

### 1. Breakpoint Deadlocks

Breakpoints (`setBreakpoint`) pause execution. In standard `evalTest`,
`rig.run()` waits for the process to exit _before_ assertions run, so a
breakpoint **will hang the test indefinitely.**

- **Use Breakpoints** for `appEvalTest` or interactive simulations.
- **Use Audit Tool Logs** (above) for standard trajectory tests.

### 2. Runaway Timeout

Always set a budget boundary in the `EvalCase` to prevent runaway loops on
quota:

```typescript
evalTest('USUALLY_PASSES', {
  name: '...',
  timeout: 60000, // 1 minute safety limit
  // ...
});
```

### 3. Efficiency Assertion (Turn limits)

Check if a tool is called _early_ using index checks:

```typescript
assert: async (rig) => {
  const toolLogs = rig.readToolLogs();
  const toolCallIndex = toolLogs.findIndex(
    (log) => log.toolRequest.name === 'cli_help',
  );

  expect(toolCallIndex).toBeGreaterThan(-1);
  expect(toolCallIndex).toBeLessThan(5); // Among the first 5 tool calls
};
```