Promoting Behavioral Evals

Use this guide when asked to analyze nightly results and promote incubated tests to stable suites.

1. 🔍 Investigate candidates

Audit Nightly Logs: Use the gh CLI to fetch results from evals-nightly.yml (Direct URL: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml).
- Tip: The aggregate summary from the most recent run automatically includes the history of the last 7 runs.
- Safety: DO NOT push changes or start remote runs. All verification is local.
Assess Stability: Identify tests that passed 100% of the time for ALL enabled models across the last 7 consecutive nightly runs.
- 100% means the test passed 3/3 attempts for every model in every run.
Promotion Targets: Tests meeting these criteria are candidates for promotion from USUALLY_PASSES to ALWAYS_PASSES.
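The stability rule above can be sketched as a small filter. This is only an illustration: the `Outcome` record, the `promotionCandidates` helper, and the idea that results are already parsed into per-test, per-model, per-run cells are all assumptions, not the actual format of the nightly summary.

```typescript
// Hypothetical record for one test/model/run cell; the real nightly
// summary format may differ.
interface Outcome {
  test: string;
  model: string;
  run: number; // 0 = most recent nightly run
  passed: number; // attempts that passed
  attempts: number; // attempts made (3 per the rule above)
}

const REQUIRED_RUNS = 7;

// Returns the tests that passed every attempt for every enabled model in
// each of the last REQUIRED_RUNS runs. Missing data disqualifies a test.
function promotionCandidates(outcomes: Outcome[], models: string[]): string[] {
  const byTest: Record<string, Outcome[]> = {};
  for (const o of outcomes) {
    (byTest[o.test] = byTest[o.test] || []).push(o);
  }
  return Object.keys(byTest)
    .filter(
      (test) =>
        models.length > 0 &&
        models.every((model) => {
          for (let run = 0; run < REQUIRED_RUNS; run++) {
            const cell = byTest[test].find(
              (r) => r.model === model && r.run === run,
            );
            if (!cell || cell.attempts === 0 || cell.passed < cell.attempts) {
              return false;
            }
          }
          return true;
        }),
    )
    .sort();
}
```

Treating a missing cell as a failure is a deliberately conservative choice in this sketch: a test that was skipped or disabled for one model in any of the 7 runs does not qualify.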

2. Promote tests

Locate File: Find the test's eval file in the evals/ directory.
Update Policy: Change the policy argument from USUALLY_PASSES to ALWAYS_PASSES.
```
// Was: evalTest('USUALLY_PASSES', { ... })
evalTest('ALWAYS_PASSES', { ... })
```
Targeting: Follow guidelines in evals/README.md regarding stable suite organization.
Constraint: Keep the final change minimal and limited strictly to promoting the test status. Do not refactor the test or setup fixtures.

3. Verify locally

Run Promoted Tests: Run the promoted test locally using non-interactive Vitest to confirm it is structurally valid.
Verify Suite Inclusion: Check that the test is successfully picked up by standard runnable ranges.

4. Report

Provide a summary of:

- Which tests were promoted.
- The success-rate evidence (e.g., 7/7 runs passed for all models).
- If no candidates qualified, the next closest candidates and their current pass rates.
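For the last item, one possible way to rank the next closest candidates is by overall pass rate. The sketch below assumes a hypothetical `Tally` record holding each test's passed and total attempts summed over all models and runs; neither that shape nor the `rankNearMisses` helper comes from the eval tooling.

```typescript
// Hypothetical per-test totals, summed over all models and runs.
interface Tally {
  test: string;
  passed: number;
  attempts: number;
}

// Lists tests that fell short of 100%, best pass rate first, formatted
// as "name: passed/attempts (rate%)" for the report.
function rankNearMisses(tallies: Tally[], limit = 3): string[] {
  return tallies
    .filter((t) => t.attempts > 0 && t.passed < t.attempts)
    .sort((a, b) => b.passed / b.attempts - a.passed / a.attempts)
    .slice(0, limit)
    .map((t) => {
      const rate = ((100 * t.passed) / t.attempts).toFixed(1);
      return `${t.test}: ${t.passed}/${t.attempts} (${rate}%)`;
    });
}
```

With 2 models, 7 runs, and 3 attempts each, every test has 42 attempts, so a test with a single failed attempt would be reported as 41/42.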