From b2b6092c24ae71eec91b3917f3d4aa0a1dfc7ab3 Mon Sep 17 00:00:00 2001
From: Christian Gunderman
Date: Fri, 27 Feb 2026 19:11:30 +0000
Subject: [PATCH] Add slash command for promoting behavioral evals to CI
 blocking (#20575)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 .gemini/commands/promote-behavioral-eval.toml | 29 +++++++
 evals/README.md                               | 79 +++++++++++++++++--
 2 files changed, 100 insertions(+), 8 deletions(-)
 create mode 100644 .gemini/commands/promote-behavioral-eval.toml

diff --git a/.gemini/commands/promote-behavioral-eval.toml b/.gemini/commands/promote-behavioral-eval.toml
new file mode 100644
index 0000000000..9893e9b02b
--- /dev/null
+++ b/.gemini/commands/promote-behavioral-eval.toml
@@ -0,0 +1,29 @@
+description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs."
+prompt = """
+You are an expert at analyzing and promoting behavioral evaluations.
+
+1. **Investigate**:
+   - Use the 'gh' CLI to fetch the results of the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
+   - DO NOT push any changes or start any runs. The rest of your evaluation will be local.
+   - Evals are in the evals/ directory and are documented by evals/README.md.
+   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row.
+   - NOTE: the results summary from the most recent run contains the last 7 runs' test results. 100% means the test passed 3/3 times for that model and run.
+   - If a test meets these criteria, it is a candidate for promotion.
+
+2. **Promote**:
+   - For each candidate test, locate the test file in the evals/ directory.
+   - Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations).
+   - Ensure you follow any guidelines in evals/README.md for stable tests.
+   - Your **final** change should be **minimal and targeted** to just promoting the test status.
+
+3. **Verify**:
+   - Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode.
+   - Check that each test is now part of the expected standard or stable test suites.
+
+4. **Report**:
+   - Provide a summary of the tests that were promoted.
+   - Include the success rate evidence (7/7 runs passed for all models) for each promoted test.
+   - If no tests met the criteria for promotion, clearly state that and summarize the closest candidates.
+
+{{args}}
+"""
diff --git a/evals/README.md b/evals/README.md
index eb3cf2be70..41ce3440b8 100644
--- a/evals/README.md
+++ b/evals/README.md
@@ -46,18 +46,20 @@ two arguments:

 #### Policies

-Policies control how strictly a test is validated. Tests should generally use
-the ALWAYS_PASSES policy to offer the strictest guarantees.
-
-USUALLY_PASSES exists to enable assertion of less consistent or aspirational
-behaviors.
+Policies control how strictly a test is validated.

 - `ALWAYS_PASSES`: Tests expected to pass 100% of the time. These are typically
-  trivial and test basic functionality. These run in every CI.
+  trivial and test basic functionality. These run in every CI run and can block
+  PRs on failure.
 - `USUALLY_PASSES`: Tests expected to pass most of the time but may have some
   flakiness due to non-deterministic behaviors. These are run nightly and used
   to track the health of the product from build to build.

+**All new behavioral evaluations must be created with the `USUALLY_PASSES`
+policy.** A subset that proves to be highly stable over time may be promoted to
+`ALWAYS_PASSES`. For more information, see
+[Test promotion process](#test-promotion-process).
+
 #### `EvalCase` Properties

 - `name`: The name of the evaluation case.
@@ -76,7 +78,8 @@ import { describe, expect } from 'vitest';
 import { evalTest } from './test-helper.js';

 describe('my_feature', () => {
-  evalTest('ALWAYS_PASSES', {
+  // New tests MUST start as USUALLY_PASSES and be promoted via /promote-behavioral-eval
+  evalTest('USUALLY_PASSES', {
     name: 'should do something',
     prompt: 'do it',
     assert: async (rig, result) => {
@@ -114,6 +117,39 @@ npm run test:all_evals

 This command sets the `RUN_EVALS` environment variable to `1`, which enables
 the `USUALLY_PASSES` tests.

+## Ensuring an Eval Is Stable Prior to Check-in
+
+The
+[Evals: Nightly](https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml)
+run is considered the source of truth for the quality of an eval test. Each run
+executes a test 3 times in a row for each supported model. The result is then
+scored 0%, 33%, 66%, or 100%, indicating how many of the individual executions
+passed.
+
+Googlers can schedule a manual run against their branch by clicking the link
+above.
+
+Tests should score at least 66% with key models, including Gemini 3.1 Pro,
+Gemini 3.0 Pro, and Gemini 3 Flash, prior to check-in, and they must pass 100%
+of the time before they are promoted.
+
+## Test promotion process
+
+To maintain a stable and reliable CI, all new behavioral evaluations follow a
+mandatory deflaking process.
+
+1. **Incubation**: You must create all new tests with the `USUALLY_PASSES`
+   policy. This lets them be monitored in the nightly runs without blocking PRs.
+2. **Monitoring**: The test must complete at least 10 nightly runs across all
+   supported models.
+3. **Promotion**: Promotion to `ALWAYS_PASSES` happens exclusively through the
+   `/promote-behavioral-eval` slash command. This command verifies that the
+   100% success rate requirement is met across recent nightly runs before
+   updating the test policy.
+
+This promotion process is essential for preventing the introduction of flaky
+evaluations into the CI.
+
 ## Reporting

 Results for evaluations are available on GitHub Actions:

@@ -135,7 +171,7 @@ aggregated into a **Nightly Summary** attached to the workflow run.

 - **Pass Rate (%)**: Each cell represents the percentage of successful runs for
   a specific test in that workflow instance.
-- **History**: The table shows the pass rates for the last 10 nightly runs,
+- **History**: The table shows the pass rates for the last 7 nightly runs,
   allowing you to identify if a model's behavior is trending towards
   instability.
 - **Total Pass Rate**: An aggregate metric of all evaluations run in that batch.
@@ -184,8 +220,35 @@ gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/
 When investigating failures manually, you can also enable verbose agent logs by
 setting the `GEMINI_DEBUG_LOG_FILE` environment variable.

+### Best practices
+
 It's highly recommended to manually review and/or ask the agent to iterate on
 any prompt changes, even if they pass all evals. The prompt should prefer
 positive traits ('do X') and resort to negative traits ('do not do X') only
 when unable to accomplish the goal with positive traits. Gemini is quite good
 at introspecting on its prompt when asked the right questions.
+
+## Promoting evaluations
+
+Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES`
+exclusively using the `/promote-behavioral-eval` slash command. Manual
+promotion is not allowed; this ensures that the 100% success rate requirement
+is empirically met.
+
+### `/promote-behavioral-eval`
+
+This command automates the promotion of stable tests by:
+
+1. **Investigating**: Analyzing the results of the last 7 nightly runs on the
+   `main` branch using the `gh` CLI.
+2. **Criteria Check**: Identifying tests that have passed 100% of the time for
+   ALL enabled models across the entire 7-run history.
+3. **Promotion**: Updating the test file's policy from `USUALLY_PASSES` to
+   `ALWAYS_PASSES`.
+4.
**Verification**: Running the promoted test locally to ensure correctness. + +To run it: + +```bash +gemini /promote-behavioral-eval +```
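
For illustration, the scoring and promotion criteria described in the README changes above can be sketched in TypeScript. This is a hypothetical helper, not code from the patch or the repository: it assumes each nightly run executes a test 3 times per model and floors the pass fraction to a whole percentage (0, 33, 66, or 100), as the README describes.

```typescript
// Hypothetical sketch (not part of the patch): score one nightly run of a
// test, where each entry in `executions` is one pass/fail execution.
function scoreRun(executions: boolean[]): number {
  const passes = executions.filter(Boolean).length;
  // Flooring yields the README's 0%, 33%, 66%, or 100% buckets for 3 runs.
  return Math.floor((passes / executions.length) * 100);
}

// A test qualifies for promotion only if every run in the inspected history
// (e.g., the last 7 nightly runs, for every enabled model) scored 100%.
function isPromotionCandidate(history: boolean[][]): boolean {
  return history.length > 0 && history.every((run) => scoreRun(run) === 100);
}
```

A test that fails even one of its 3 executions in one run scores 66% for that run and drops out of candidacy, which matches the command's "100% across the entire 7-run history" requirement.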