Running & Promoting Evals

🛠️ Prerequisites

Behavioral evals run against the compiled binary, so you must rebuild and bundle the project after every change:

npm run build && npm run bundle

Evals require a standard API key. If your .env file contains multiple keys or comments, extract only the GEMINI_API_KEY line before running:

export GEMINI_API_KEY=$(grep '^GEMINI_API_KEY=' .env | cut -d '=' -f2-) && RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts <file_name>
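
To see why the `^` anchor matters, the following minimal sketch runs the same grep-and-cut pipeline against a throwaway .env file. The file contents and key values are made up for illustration. The sketch uses `-f2-` so that a value containing `=` is kept whole rather than cut at the second `=`.

```shell
# Build a throwaway .env with a commented-out key, an unrelated key,
# and a value that itself contains '=' (all values are made up).
tmp_env="$(mktemp)"
cat > "$tmp_env" <<'EOF'
# GEMINI_API_KEY=commented-out-key
OTHER_API_KEY=not-this-one
GEMINI_API_KEY=abc123==
EOF

# '^GEMINI_API_KEY=' matches only lines that start with the key name,
# so the comment line and OTHER_API_KEY are skipped.
# '-f2-' keeps everything after the first '=', including later '=' characters.
extracted_key="$(grep '^GEMINI_API_KEY=' "$tmp_env" | cut -d '=' -f2-)"
echo "$extracted_key"

rm -f "$tmp_env"
```

If the key line in your .env is wrapped in quotes, this pipeline keeps the quotes, so strip them separately.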

| Command | Scope | Description |
| --- | --- | --- |
| `npm run test:always_passing_evals` | `ALWAYS_PASSES` | Fast feedback, runs in CI. |
| `npm run test:all_evals` | All | Runs nightly incubation tests. Sets `RUN_EVALS=1`. |

Note: RUN_EVALS=1 is required for incubated (USUALLY_PASSES) tests. To run a single eval file:

RUN_EVALS=1 npx vitest run --config evals/vitest.config.ts my_feature.eval.ts

If a test fails, check:

Tool Trajectory Logs: Sequence of calls in evals/logs/<test_name>.log.
Verbose Reasoning: Capture raw buffer traces by setting GEMINI_DEBUG_LOG_FILE:
```
export GEMINI_DEBUG_LOG_FILE="debug.log"
```
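
As a rough sketch of the first check, the helper below builds the trajectory log path for a test and prints the tail of the log if it exists. The function name `trajectory_log_path` and the test name `my_feature` are hypothetical; how `<test_name>` is derived from an eval is not specified here, so adjust it to match the files you see in evals/logs/.

```shell
# Hypothetical helper: map a test name to its trajectory log path,
# following the evals/logs/<test_name>.log pattern.
trajectory_log_path() {
  printf 'evals/logs/%s.log\n' "$1"
}

# 'my_feature' is a placeholder test name.
log_file="$(trajectory_log_path "my_feature")"

if [ -f "$log_file" ]; then
  # Show the most recent entries in the log.
  tail -n 20 "$log_file"
else
  echo "no trajectory log at $log_file"
fi
```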

Tip: Standard evals are benchmarked across model variations. If a test passes on Flash but fails on Pro (or vice versa), the issue is usually in the tool description, not the prompt definition. Flash is sensitive to "instruction bloat," while Pro is sensitive to "ambiguous intent."

To maintain CI stability, all new evals follow a strict incubation period.

New tests must be created with the USUALLY_PASSES policy.

evalTest('USUALLY_PASSES', { ... })

Incubated tests run in the Evals: Nightly workflows and do not block PR merges.

If a nightly eval regresses, investigate it with the agent:

gemini /fix-behavioral-eval [optional-run-uri]

Once a test passes with 100% consistency over multiple nightly cycles, promote it:

gemini /promote-behavioral-eval

Do not promote manually. The command verifies trajectory logs before updating the file policy.
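
The promotion criterion (100% consistency across more than one nightly cycle) can be illustrated with a small arithmetic check. This is only a sketch: the `nightly_results` list is made-up data in a hypothetical format, and it does not replace the promote command, which performs its own verification against the trajectory logs.

```shell
# Hypothetical nightly outcomes for one eval, oldest first (made-up data).
nightly_results="pass pass pass pass pass"

total=0
passed=0
for outcome in $nightly_results; do
  total=$((total + 1))
  if [ "$outcome" = "pass" ]; then
    passed=$((passed + 1))
  fi
done

# Ready only when there is more than one cycle and every cycle passed.
if [ "$total" -gt 1 ] && [ "$passed" -eq "$total" ]; then
  promotion_ready="yes"
else
  promotion_ready="no"
fi

echo "passed $passed of $total nightly cycles; ready to promote: $promotion_ready"
```

A single failed cycle in the list flips `promotion_ready` to "no", which mirrors the requirement that a test stay in incubation until it is fully consistent.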