Add slash command for promoting behavioral evals to CI blocking (#20575)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-13 13:22:35 -07:00 · 2026-02-27 19:11:30 +00:00
parent e00e8f4728
commit b2b6092c24
2 changed files with 100 additions and 8 deletions
@@ -0,0 +1,29 @@
 description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs."
 prompt = """
 You are an expert at analyzing and promoting behavioral evaluations.
 1. **Investigate**:
   - Use 'gh' cli to fetch the results from the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
   - DO NOT push any changes or start any runs. The rest of your evaluation will be local.
   - Evals are in evals/ directory and are documented by evals/README.md.
   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row.
   - NOTE: the results summary from the most recent run contains the last 7 runs test results. 100% means the test passed 3/3 times for that model and run.
   - If a test meets these criteria, it is a candidate for promotion.
 2. **Promote**:
   - For each candidate test, locate the test file in the evals/ directory.
   - Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations). 
   - Ensure you follow any guidelines in evals/README.md for stable tests.
   - Your **final** change should be **minimal and targeted** to just promoting the test status.
 3. **Verify**:
   - Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode.
   - Check that the test is now part of the expected standard or stable test suites.
 4. **Report**:
   - Provide a summary of the tests that were promoted.
   - Include the success rate evidence (7/7 runs passed for all models) for each promoted test.
   - If no tests met the criteria for promotion, clearly state that and summarize the closest candidates.
 {{args}}
 """
@@ -46,18 +46,20 @@ two arguments:
 #### Policies
-Policies control how strictly a test is validated. Tests should generally use
+Policies control how strictly a test is validated.
 the ALWAYS_PASSES policy to offer the strictest guarantees.
 USUALLY_PASSES exists to enable assertion of less consistent or aspirational
 behaviors.
 - `ALWAYS_PASSES`: Tests expected to pass 100% of the time. These are typically
-  trivial and test basic functionality. These run in every CI.
+  trivial and test basic functionality. These run in every CI and can block PRs
  on failure.
 - `USUALLY_PASSES`: Tests expected to pass most of the time but may have some
  flakiness due to non-deterministic behaviors. These are run nightly and used
  to track the health of the product from build to build.
 **All new behavioral evaluations must be created with the `USUALLY_PASSES`
 policy.** A subset that prove to be highly stable over time may be promoted to
 `ALWAYS_PASSES`. For more information, see
 [Test promotion process](#test-promotion-process).
 #### `EvalCase` Properties
 - `name`: The name of the evaluation case.
@@ -76,7 +78,8 @@ import { describe, expect } from 'vitest';
 import { evalTest } from './test-helper.js';
 describe('my_feature', () => {
-  evalTest('ALWAYS_PASSES', {
+  // New tests MUST start as USUALLY_PASSES and be promoted via /promote-behavioral-eval
  evalTest('USUALLY_PASSES', {
    name: 'should do something',
    prompt: 'do it',
    assert: async (rig, result) => {
@@ -114,6 +117,39 @@ npm run test:all_evals
 This command sets the `RUN_EVALS` environment variable to `1`, which enables the
 `USUALLY_PASSES` tests.
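As a minimal sketch of that gating, assume the test helper decides per test whether to run, based on the policy and the `RUN_EVALS` variable. The function name `shouldRunEval` and the exact-match check against `'1'` are hypothetical and are not taken from `test-helper.js`.

```typescript
// Hypothetical sketch; the real logic in test-helper.js may differ.
type EvalPolicy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';

function shouldRunEval(
  policy: EvalPolicy,
  env: Record<string, string | undefined>,
): boolean {
  // ALWAYS_PASSES tests run in every CI invocation.
  if (policy === 'ALWAYS_PASSES') {
    return true;
  }
  // Assumption: USUALLY_PASSES tests run only when RUN_EVALS is exactly '1'.
  return env['RUN_EVALS'] === '1';
}

// Without RUN_EVALS, only ALWAYS_PASSES tests would run.
console.log(shouldRunEval('ALWAYS_PASSES', {}));
console.log(shouldRunEval('USUALLY_PASSES', {}));
console.log(shouldRunEval('USUALLY_PASSES', { RUN_EVALS: '1' }));
```

Passing the environment as a parameter, rather than reading it globally, keeps the sketch easy to exercise in isolation.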
 ## Ensuring Eval is Stable Prior to Check-in
 The
 [Evals: Nightly](https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml)
 run is considered to be the source of truth for the quality of an eval test.
 Each run executes every test 3 times in a row for each supported model. The
 result is then scored 0%, 33%, 66%, or 100%, corresponding to 0, 1, 2, or 3
 passing executions.
 Googlers can schedule a manual run against their branch by clicking the link
 above.
 Tests should score at least 66% with key models including Gemini 3.1 pro, Gemini
 3.0 pro, and Gemini 3 flash prior to check-in, and they must pass 100% of the
 time before they are promoted.
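The scoring and thresholds above can be sketched as follows. This is an illustration only: the names `runScore`, `readyForCheckIn`, and the `model-a`/`model-b` keys are hypothetical, and the nightly workflow may compute its scores differently.

```typescript
// Hypothetical sketch of the per-run scoring described above.
const EXECUTIONS_PER_RUN = 3;
const CHECK_IN_THRESHOLD = 66;

// Maps the number of passing executions (0..3) to 0, 33, 66, or 100.
function runScore(passes: number): number {
  if (!Number.isInteger(passes) || passes < 0 || passes > EXECUTIONS_PER_RUN) {
    throw new RangeError(`passes must be an integer in 0..${EXECUTIONS_PER_RUN}`);
  }
  // Round down so that 2/3 is reported as 66 rather than 67.
  return Math.floor((passes * 100) / EXECUTIONS_PER_RUN);
}

// A test is ready for check-in when every key model scores at least 66%.
function readyForCheckIn(passesByModel: Record<string, number>): boolean {
  const counts = Object.values(passesByModel);
  return (
    counts.length > 0 &&
    counts.every((passes) => runScore(passes) >= CHECK_IN_THRESHOLD)
  );
}

// Placeholder model names; substitute the models enabled for the run.
console.log(readyForCheckIn({ 'model-a': 3, 'model-b': 2 }));
console.log(readyForCheckIn({ 'model-a': 3, 'model-b': 1 }));
```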
 ## Test promotion process
 To maintain a stable and reliable CI, all new behavioral evaluations follow a
 mandatory deflaking process.
 1. **Incubation**: You must create all new tests with the `USUALLY_PASSES`
   policy. This lets them be monitored in the nightly runs without blocking PRs.
 2. **Monitoring**: The test must complete at least 10 nightly runs across all
   supported models.
 3. **Promotion**: Promotion to `ALWAYS_PASSES` happens exclusively through the
   `/promote-behavioral-eval` slash command. This command verifies that the 100%
   success rate requirement is met across the last 7 nightly runs before
   updating the test policy.
 This promotion process is essential for preventing the introduction of flaky
 evaluations into the CI.
 ## Reporting
 Results for evaluations are available on GitHub Actions:
@@ -135,7 +171,7 @@ aggregated into a **Nightly Summary** attached to the workflow run.
 - **Pass Rate (%)**: Each cell represents the percentage of successful runs for
  a specific test in that workflow instance.
- **History**: The table shows the pass rates for the last 10 nightly runs,
+- **History**: The table shows the pass rates for the last 7 nightly runs,
  allowing you to identify if a model's behavior is trending towards
  instability.
 - **Total Pass Rate**: An aggregate metric of all evaluations run in that batch.
@@ -184,8 +220,35 @@ gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/
 When investigating failures manually, you can also enable verbose agent logs by
 setting the `GEMINI_DEBUG_LOG_FILE` environment variable.
 ### Best practices
 It's highly recommended to manually review and/or ask the agent to iterate on
 any prompt changes, even if they pass all evals. The prompt should prefer
 positive traits ('do X') and resort to negative traits ('do not do X') only when
 unable to accomplish the goal with positive traits. Gemini is quite good at
 introspecting on its prompt when asked the right questions.
 ## Promoting evaluations
 Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES`
 exclusively using the `/promote-behavioral-eval` slash command. Manual promotion
 is not allowed; this ensures that the 100% success rate requirement is met
 empirically.
 ### `/promote-behavioral-eval`
 This command automates the promotion of stable tests by:
 1.  **Investigating**: Analyzing the results of the last 7 nightly runs on the
    `main` branch using the `gh` CLI.
 2.  **Criteria Check**: Identifying tests that have passed 100% of the time for
    ALL enabled models across the entire 7-run history.
 3.  **Promotion**: Updating the test file's policy from `USUALLY_PASSES` to
    `ALWAYS_PASSES`.
 4.  **Verification**: Running the promoted test locally to ensure correctness.
 To run it:
 ```bash
 gemini /promote-behavioral-eval
 ```
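The criteria check in step 2 can be sketched as below. The `PassRateHistory` shape, the `isPromotionCandidate` name, and the model keys are assumptions made for illustration. The command itself is driven by a prompt, and the real nightly summary format may differ.

```typescript
// Hypothetical data shape: one pass rate (0..100) per nightly run and model,
// ordered oldest first.
type PassRateHistory = Record<string, number[]>;

const REQUIRED_RUNS = 7;

// A test is a candidate only if every enabled model scored 100% in each of
// the last 7 nightly runs.
function isPromotionCandidate(history: PassRateHistory): boolean {
  const perModel = Object.values(history);
  if (perModel.length === 0) {
    return false; // No data means no evidence of stability.
  }
  return perModel.every((rates) => {
    const recent = rates.slice(-REQUIRED_RUNS);
    return (
      recent.length === REQUIRED_RUNS && recent.every((rate) => rate === 100)
    );
  });
}

const stable: PassRateHistory = {
  'model-a': [100, 100, 100, 100, 100, 100, 100],
  'model-b': [100, 100, 100, 100, 100, 100, 100],
};
const flaky: PassRateHistory = {
  'model-a': [100, 100, 100, 100, 100, 100, 100],
  'model-b': [100, 100, 66, 100, 100, 100, 100],
};

console.log(isPromotionCandidate(stable));
console.log(isPromotionCandidate(flaky));
```

Requiring a full 7-run window means a test with a shorter history is never treated as a candidate, even if all of its recorded runs passed.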