Promoting Behavioral Evals

Use this guide when asked to analyze nightly results and promote incubated tests to stable suites.


1. 🔍 Investigate candidates

  1. Audit Nightly Logs: Use the gh CLI to fetch results from evals-nightly.yml (Direct URL: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml).
    • Tip: The aggregate summary from the most recent run integrates the last 7 runs of history automatically.
    • Safety: DO NOT push changes or start remote runs. All verification is local.
  2. Assess Stability: Identify tests that pass 100% of the time across ALL enabled models in each of the last 7 consecutive nightly runs.
    • 100% means the test passed 3/3 times for every model in every run.
  3. Promotion Targets: Tests meeting these criteria are candidates for promotion from USUALLY_PASSES to ALWAYS_PASSES.
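
The stability rule above can be sketched as a small filter. This is an illustrative sketch only: the NightlyResult shape, the run indexing, and the findPromotionCandidates helper are hypothetical and are not part of the eval tooling, so the nightly summary would need to be mapped into this shape first.

```typescript
// Hypothetical record for one test/model/run cell of the nightly summary.
// The real summary format may differ; adapt the parsing accordingly.
interface NightlyResult {
  test: string;
  model: string;
  run: number; // 0 = most recent nightly run, 6 = oldest run considered
  passes: number; // attempts that passed
  attempts: number; // attempts made (3 per the rule above)
}

const REQUIRED_RUNS = 7;

// Returns the tests that passed every attempt, for every enabled model,
// in each of the last REQUIRED_RUNS runs. A missing cell counts as a failure.
function findPromotionCandidates(
  results: NightlyResult[],
  enabledModels: string[],
): string[] {
  const byTest = new Map<string, NightlyResult[]>();
  for (const r of results) {
    const rows = byTest.get(r.test) ?? [];
    rows.push(r);
    byTest.set(r.test, rows);
  }

  const candidates: string[] = [];
  byTest.forEach((rows, test) => {
    let stable = true;
    for (const model of enabledModels) {
      for (let run = 0; run < REQUIRED_RUNS; run++) {
        const cell = rows.find((r) => r.model === model && r.run === run);
        if (!cell || cell.attempts === 0 || cell.passes !== cell.attempts) {
          stable = false;
        }
      }
    }
    if (stable) candidates.push(test);
  });
  return candidates.sort();
}
```

Treating a missing cell as a failure is a deliberately conservative choice: a test that did not run for one model on one night has no evidence of stability for that cell.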

2. 🚥 Promotion Steps

  1. Locate File: Find the eval file in the evals/ directory.
  2. Update Policy: Modify the policy argument to ALWAYS_PASSES.
    evalTest('ALWAYS_PASSES', { ... })
    
  3. Targeting: Follow guidelines in evals/README.md regarding stable suite organization.
  4. Constraint: Your final change must be minimal and targeted strictly to promoting the test status. Do not refactor the test or setup fixtures.

3. Verify

  1. Run Promoted Tests: Run the promoted test locally using non-interactive Vitest to confirm it is structurally valid.
  2. Verify Suite Inclusion: Check that the test is successfully picked up by standard runnable ranges.

4. 📊 Report

Provide a summary of:

  • Which tests were promoted.
  • The success-rate evidence for each (e.g., 7/7 runs passed for all models).
  • If no candidates qualified, the next closest candidates and their current pass rates.
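
When no test qualifies, the "next closest candidates" can be ranked by aggregate pass rate. The sketch below assumes a hypothetical AttemptTally record and rankNearMisses helper; neither exists in the repository, and the input would have to be built from the nightly summary by hand.

```typescript
// Hypothetical tally for one test/model/run cell; not a real tooling type.
interface AttemptTally {
  test: string;
  passes: number;
  attempts: number;
}

interface NearMiss {
  test: string;
  passRate: number; // fraction of attempts passed, in [0, 1)
}

// Sums passes and attempts per test, drops tests at 100%, and sorts the
// rest from highest to lowest pass rate (ties broken by test name).
function rankNearMisses(tallies: AttemptTally[]): NearMiss[] {
  const totals = new Map<string, { passes: number; attempts: number }>();
  for (const t of tallies) {
    const sum = totals.get(t.test) ?? { passes: 0, attempts: 0 };
    sum.passes += t.passes;
    sum.attempts += t.attempts;
    totals.set(t.test, sum);
  }

  const ranked: NearMiss[] = [];
  totals.forEach((sum, test) => {
    if (sum.attempts > 0 && sum.passes < sum.attempts) {
      ranked.push({ test, passRate: sum.passes / sum.attempts });
    }
  });
  return ranked.sort(
    (a, b) => b.passRate - a.passRate || a.test.localeCompare(b.test),
  );
}
```

The first few entries of the returned list, with their pass rates formatted as percentages, map directly onto the last bullet of the report.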