mirror of https://github.com/google-gemini/gemini-cli.git
synced 2026-04-25 04:24:51 -07:00
Add slash command for promoting behavioral evals to CI blocking (#20575)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
committed by GitHub
parent e00e8f4728
commit b2b6092c24
@@ -0,0 +1,29 @@
+description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs."
+prompt = """
+You are an expert at analyzing and promoting behavioral evaluations.
+
+1. **Investigate**:
+   - Use 'gh' cli to fetch the results from the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
+   - DO NOT push any changes or start any runs. The rest of your evaluation will be local.
+   - Evals are in evals/ directory and are documented by evals/README.md.
+   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row.
+   - NOTE: the results summary from the most recent run contains the last 7 runs test results. 100% means the test passed 3/3 times for that model and run.
+   - If a test meets this criteria, it is a candidate for promotion.
+
+2. **Promote**:
+   - For each candidate test, locate the test file in the evals/ directory.
+   - Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations).
+   - Ensure you follow any guidelines in evals/README.md for stable tests.
+   - Your **final** change should be **minimal and targeted** to just promoting the test status.
+
+3. **Verify**:
+   - Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode.
+   - Check that the test is now part of the expected standard or stable test suites.
+
+4. **Report**:
+   - Provide a summary of the tests that were promoted.
+   - Include the success rate evidence (7/7 runs passed for all models) for each promoted test.
+   - If no tests met the criteria for promotion, clearly state that and summarize the closest candidates.
+
+{{args}}
+"""
+71
-8
@@ -46,18 +46,20 @@ two arguments:
 
 #### Policies
 
-Policies control how strictly a test is validated. Tests should generally use
-the ALWAYS_PASSES policy to offer the strictest guarantees.
-
-USUALLY_PASSES exists to enable assertion of less consistent or aspirational
-behaviors.
+Policies control how strictly a test is validated.
 
 - `ALWAYS_PASSES`: Tests expected to pass 100% of the time. These are typically
-  trivial and test basic functionality. These run in every CI.
+  trivial and test basic functionality. These run in every CI and can block PRs
+  on failure.
 - `USUALLY_PASSES`: Tests expected to pass most of the time but may have some
   flakiness due to non-deterministic behaviors. These are run nightly and used
   to track the health of the product from build to build.
 
+**All new behavioral evaluations must be created with the `USUALLY_PASSES`
+policy.** A subset that prove to be highly stable over time may be promoted to
+`ALWAYS_PASSES`. For more information, see
+[Test promotion process](#test-promotion-process).
+
 #### `EvalCase` Properties
 
 - `name`: The name of the evaluation case.
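Reviewer note on the hunk above: the two policies differ in when a test runs and whether a failure blocks a PR. The sketch below is one way to picture that gating. It is not the repository's actual helper: `EvalPolicy`, `Env`, `shouldRunEval`, and `blocksPullRequest` are hypothetical names, and treating `RUN_EVALS` as enabled only when it equals the string `'1'` is an assumption based on the README's description of `npm run test:all_evals`.

```typescript
// Hypothetical sketch; these names do not come from the gemini-cli codebase.
type EvalPolicy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';

// The environment is passed in explicitly so the sketch has no Node.js dependency.
type Env = Record<string, string | undefined>;

// ALWAYS_PASSES evals run in every CI. USUALLY_PASSES evals run only when
// RUN_EVALS is set (assumed here to mean the literal value '1').
function shouldRunEval(policy: EvalPolicy, env: Env): boolean {
  if (policy === 'ALWAYS_PASSES') {
    return true;
  }
  return env['RUN_EVALS'] === '1';
}

// In this sketch only ALWAYS_PASSES failures are treated as PR-blocking.
function blocksPullRequest(policy: EvalPolicy): boolean {
  return policy === 'ALWAYS_PASSES';
}
```

Under these assumptions, a newly added `USUALLY_PASSES` test is skipped in ordinary CI and only contributes to the nightly signal, which is why starting every test there cannot destabilize PR checks.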
@@ -76,7 +78,8 @@ import { describe, expect } from 'vitest';
 import { evalTest } from './test-helper.js';
 
 describe('my_feature', () => {
-  evalTest('ALWAYS_PASSES', {
+  // New tests MUST start as USUALLY_PASSES and be promoted via /promote-behavioral-eval
+  evalTest('USUALLY_PASSES', {
     name: 'should do something',
     prompt: 'do it',
     assert: async (rig, result) => {
@@ -114,6 +117,39 @@ npm run test:all_evals
 This command sets the `RUN_EVALS` environment variable to `1`, which enables the
 `USUALLY_PASSES` tests.
 
+## Ensuring Eval is Stable Prior to Check-in
+
+The
+[Evals: Nightly](https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml)
+run is considered to be the source of truth for the quality of an eval test.
+Each run of it executes a test 3 times in a row, for each supported model. The
+result is then scored 0%, 33%, 66%, or 100% respectively, to indicate how many
+of the individual executions passed.
+
+Googlers can schedule a manual run against their branch by clicking the link
+above.
+
+Tests should score at least 66% with key models including Gemini 3.1 pro, Gemini
+3.0 pro, and Gemini 3 flash prior to check in and they must pass 100% of the
+time before they are promoted.
+
+## Test promotion process
+
+To maintain a stable and reliable CI, all new behavioral evaluations follow a
+mandatory deflaking process.
+
+1. **Incubation**: You must create all new tests with the `USUALLY_PASSES`
+   policy. This lets them be monitored in the nightly runs without blocking PRs.
+2. **Monitoring**: The test must complete at least 10 nightly runs across all
+   supported models.
+3. **Promotion**: Promotion to `ALWAYS_PASSES` happens exclusively through the
+   `/promote-behavioral-eval` slash command. This command verifies the 100%
+   success rate requirement is met across many runs before updating the test
+   policy.
+
+This promotion process is essential for preventing the introduction of flaky
+evaluations into the CI.
+
 ## Reporting
 
 Results for evaluations are available on GitHub Actions:
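Reviewer note on the hunk above: the added README text scores each test per model from three executions (0%, 33%, 66%, or 100%) and sets two thresholds, at least 66% before check-in and 100% before promotion. A minimal sketch of that arithmetic follows. The function names and the `Record<string, number>` shape keyed by model name are assumptions for illustration, and truncating toward zero is assumed because the README lists 33 and 66 rather than 67.

```typescript
// Hypothetical sketch of the scoring described in the README hunk above.

// Pass rate for one test, one model, one nightly run.
// `executions` holds one boolean per execution (three in the nightly workflow).
function passRate(executions: boolean[]): number {
  if (executions.length === 0) {
    throw new Error('at least one execution is required');
  }
  const passed = executions.filter(Boolean).length;
  // Truncate so that 1/3 maps to 33 and 2/3 maps to 66.
  return Math.floor((passed / executions.length) * 100);
}

// Check-in bar: every key model scores at least 66%.
function readyForCheckIn(ratesByModel: Record<string, number>): boolean {
  const rates = Object.values(ratesByModel);
  return rates.length > 0 && rates.every((rate) => rate >= 66);
}

// Promotion bar for a single run: every model scores exactly 100%.
function perfectRun(ratesByModel: Record<string, number>): boolean {
  const rates = Object.values(ratesByModel);
  return rates.length > 0 && rates.every((rate) => rate === 100);
}
```

For example, a test that passes two of three executions on one model and all three on another would clear the check-in bar in this sketch but would not count as a perfect run.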
@@ -135,7 +171,7 @@ aggregated into a **Nightly Summary** attached to the workflow run.
 
 - **Pass Rate (%)**: Each cell represents the percentage of successful runs for
   a specific test in that workflow instance.
-- **History**: The table shows the pass rates for the last 10 nightly runs,
+- **History**: The table shows the pass rates for the last 7 nightly runs,
   allowing you to identify if a model's behavior is trending towards
   instability.
 - **Total Pass Rate**: An aggregate metric of all evaluations run in that batch.
@@ -184,8 +220,35 @@ gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/
 When investigating failures manually, you can also enable verbose agent logs by
 setting the `GEMINI_DEBUG_LOG_FILE` environment variable.
 
+### Best practices
+
 It's highly recommended to manually review and/or ask the agent to iterate on
 any prompt changes, even if they pass all evals. The prompt should prefer
 positive traits ('do X') and resort to negative traits ('do not do X') only when
 unable to accomplish the goal with positive traits. Gemini is quite good at
 instrospecting on its prompt when asked the right questions.
+
+## Promoting evaluations
+
+Evaluations must be promoted from `USUALLY_PASSES` to `ALWAYS_PASSES`
+exclusively using the `/promote-behavioral-eval` slash command. Manual promotion
+is not allowed to ensure that the 100% success rate requirement is empirically
+met.
+
+### `/promote-behavioral-eval`
+
+This command automates the promotion of stable tests by:
+
+1. **Investigating**: Analyzing the results of the last 7 nightly runs on the
+   `main` branch using the `gh` CLI.
+2. **Criteria Check**: Identifying tests that have passed 100% of the time for
+   ALL enabled models across the entire 7-run history.
+3. **Promotion**: Updating the test file's policy from `USUALLY_PASSES` to
+   `ALWAYS_PASSES`.
+4. **Verification**: Running the promoted test locally to ensure correctness.
+
+To run it:
+
+```bash
+gemini /promote-behavioral-eval
+```
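Reviewer note on the hunk above: the "Criteria Check" step selects tests that scored 100% for all enabled models across the whole 7-run history. The sketch below shows one way to express that filter. It is an illustration only: the `History` shape, `REQUIRED_RUNS`, and `promotionCandidates` are hypothetical names, and the real command is a prompt executed by the agent rather than a function like this.

```typescript
// Hypothetical sketch of the criteria check; not code from the repository.

// history[testName][modelName] = pass rates (0 to 100), ordered oldest to newest.
type History = Record<string, Record<string, number[]>>;

// Number of consecutive nightly runs that must be perfect (from the README text).
const REQUIRED_RUNS = 7;

function promotionCandidates(
  history: History,
  enabledModels: string[],
): string[] {
  // With no enabled models there is no evidence, so nothing qualifies.
  if (enabledModels.length === 0) {
    return [];
  }
  return Object.keys(history).filter((testName) =>
    enabledModels.every((model) => {
      const rates = history[testName]?.[model] ?? [];
      const recent = rates.slice(-REQUIRED_RUNS);
      // A test with fewer than REQUIRED_RUNS results has not proven itself yet.
      return (
        recent.length === REQUIRED_RUNS && recent.every((rate) => rate === 100)
      );
    }),
  );
}
```

A test with a single 66% result for one model anywhere in the last seven runs, or with fewer than seven recorded runs, is excluded in this sketch, which matches the README's requirement that the success rate be demonstrated rather than assumed.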