Promoting Behavioral Evals
Use this guide when asked to analyze nightly results and promote incubated tests to stable suites.
1. 🔍 Investigate candidates
- Audit Nightly Logs: Use the `gh` CLI to fetch results from `evals-nightly.yml` (Direct URL: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml).
  - Tip: The aggregate summary from the most recent run integrates the last 7 runs of history automatically.
- Safety: DO NOT push changes or start remote runs. All verification is local.
- Assess Stability: Identify tests that pass 100% of the time across ALL enabled models over the last 7 consecutive nightly runs.
  - 100% means the test passed 3/3 times for every model and every run.
- Promotion Targets: Tests meeting these criteria are candidates for promotion from `USUALLY_PASSES` to `ALWAYS_PASSES`.
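
The stability rule above can be sketched as a small filter. This is only an illustration: the `EvalResult` shape, the function name, and the choice to treat a missing result as a failure are assumptions, not the format of the nightly summary itself.

```typescript
// Hypothetical record: one test's outcome for one model in one nightly run.
interface EvalResult {
  testName: string;
  model: string;
  runId: number; // identifier of the nightly run
  passed: number; // attempts that passed
  attempts: number; // attempts made (the guide expects 3)
}

const REQUIRED_RUNS = 7;
const REQUIRED_ATTEMPTS = 3;

// Returns the tests that passed 3/3 for every enabled model in every one of
// the given nightly runs. `lastRunIds` should hold the 7 most recent runs.
function findPromotionCandidates(
  results: EvalResult[],
  enabledModels: string[],
  lastRunIds: number[],
): string[] {
  // Without a full window of runs or any models, nothing can qualify.
  if (lastRunIds.length < REQUIRED_RUNS || enabledModels.length === 0) {
    return [];
  }

  // Index results by test name, then by "model|runId" for direct lookup.
  const byTest = new Map<string, Map<string, EvalResult>>();
  for (const r of results) {
    const perTest = byTest.get(r.testName) ?? new Map<string, EvalResult>();
    perTest.set(`${r.model}|${r.runId}`, r);
    byTest.set(r.testName, perTest);
  }

  const candidates: string[] = [];
  byTest.forEach((perTest, testName) => {
    const stable = enabledModels.every((model) =>
      lastRunIds.every((runId) => {
        const r = perTest.get(`${model}|${runId}`);
        // Assumption: a missing or incomplete result disqualifies the test.
        return (
          r !== undefined &&
          r.attempts === REQUIRED_ATTEMPTS &&
          r.passed === r.attempts
        );
      }),
    );
    if (stable) candidates.push(testName);
  });
  return candidates.sort();
}
```

A single 2/3 result for any model in any of the 7 runs is enough to keep a test out of the candidate list, which matches the "100%" definition above.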
2. 🚥 Promotion Steps
- Locate File: Locate the eval file in the `evals/` directory.
- Update Policy: Modify the policy argument to `ALWAYS_PASSES`, e.g. `evalTest('ALWAYS_PASSES', { ... })`.
- Targeting: Follow guidelines in `evals/README.md` regarding stable suite organization.
- Constraint: Your final change must be minimal and targeted strictly to promoting the test status. Do not refactor the test or setup fixtures.
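
One way to sanity-check the "minimal change" constraint is to compare the file before and after the edit. The helper below is a hypothetical sketch, not part of the repository: it assumes the policy is written as a single-quoted string literal, as in the `evalTest('ALWAYS_PASSES', { ... })` example, and that a promotion never adds or removes lines.

```typescript
// Returns true only if every changed line differs from the original by
// exactly one swap of 'USUALLY_PASSES' for 'ALWAYS_PASSES'.
function isMinimalPromotion(before: string, after: string): boolean {
  const oldLines = before.split('\n');
  const newLines = after.split('\n');

  // Adding or removing lines means more than the policy was touched.
  if (oldLines.length !== newLines.length) return false;

  let changed = 0;
  for (let i = 0; i < oldLines.length; i++) {
    if (oldLines[i] === newLines[i]) continue;
    // The only edit allowed on a changed line is the policy swap.
    const expected = oldLines[i].replace("'USUALLY_PASSES'", "'ALWAYS_PASSES'");
    if (expected !== newLines[i]) return false;
    changed++;
  }

  // An unchanged file is not a promotion.
  return changed > 0;
}
```

For example, changing `evalTest('USUALLY_PASSES', { ... })` to `evalTest('ALWAYS_PASSES', { ... })` passes the check, while the same change combined with a renamed fixture or a reformatted line does not.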
3. ✅ Verify
- Run Promoted Tests: Run the promoted test locally using non-interactive Vitest to confirm structural validity.
- Verify Suite Inclusion: Check that the test is successfully picked up by standard runnable ranges.
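
As a rough sketch of the local run, the helper below assembles a non-interactive Vitest invocation. `vitest run` and the `-t` name filter are standard Vitest CLI options; the helper name and the file path in the usage note are hypothetical, and the repository may need its own config or environment settings, so treat `evals/README.md` as the authority.

```typescript
// Builds the argument list for a one-shot (non-watch) Vitest run.
function buildVitestArgs(evalFile: string, testName?: string): string[] {
  // `vitest run` executes the tests once and exits instead of watching.
  const args = ['vitest', 'run', evalFile];
  if (testName) {
    // `-t` restricts the run to tests whose name matches the pattern.
    // The pattern is treated as a regex, so escape special characters.
    args.push('-t', testName);
  }
  return args;
}
```

The result can be passed to a process runner, for instance `spawnSync('npx', buildVitestArgs('evals/example.eval.ts'), { stdio: 'inherit' })` from `node:child_process`, where `evals/example.eval.ts` stands in for the real eval file.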
4. 📊 Report
Provide a summary of:
- Which tests were promoted.
- The success rate evidence for each promoted test (e.g., 7/7 runs passed for all models).
- If no candidates qualified, the next closest candidates and their current pass rates.
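
The summary could be assembled along these lines. The record shapes, field names, and output wording are illustrative assumptions; only the three pieces of information listed above come from this guide.

```typescript
// Hypothetical record for a promoted test and its evidence.
interface PromotedTest {
  name: string;
  runsPassed: number;
  runsTotal: number;
}

// Hypothetical record for a test that fell short of 100%.
interface NearMiss {
  name: string;
  passed: number; // attempts passed across all models and runs
  total: number; // attempts made across all models and runs
}

function formatPromotionReport(
  promoted: PromotedTest[],
  nearMisses: NearMiss[],
): string {
  const lines: string[] = [];
  if (promoted.length > 0) {
    lines.push('Promoted tests:');
    for (const t of promoted) {
      lines.push(
        `- ${t.name}: ${t.runsPassed}/${t.runsTotal} runs passed for all models`,
      );
    }
  } else {
    lines.push('No candidates qualified. Next closest candidates:');
    // Highest pass rate first, without mutating the caller's array.
    const ranked = [...nearMisses].sort(
      (a, b) => b.passed / b.total - a.passed / a.total,
    );
    for (const t of ranked) {
      lines.push(`- ${t.name}: ${t.passed}/${t.total} attempts passed`);
    }
  }
  return lines.join('\n');
}
```

Reporting raw counts rather than rounded percentages keeps the evidence easy to verify against the nightly summary.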