From f1868e599c838ead8b09996f5fd6e15cfd9af5a4 Mon Sep 17 00:00:00 2001
From: Christian Gunderman
Date: Mon, 2 Mar 2026 15:43:48 -0800
Subject: [PATCH] feat(evals): add demotion capability to promote-behavioral-eval command

---
 .gemini/commands/promote-behavioral-eval.toml | 27 ++++++++++---------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/.gemini/commands/promote-behavioral-eval.toml b/.gemini/commands/promote-behavioral-eval.toml
index 9893e9b02b..5c937cf62c 100644
--- a/.gemini/commands/promote-behavioral-eval.toml
+++ b/.gemini/commands/promote-behavioral-eval.toml
@@ -1,29 +1,30 @@
-description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs."
+description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs, and demote evals that have failed."
 prompt = """
-You are an expert at analyzing and promoting behavioral evaluations.
+You are an expert at analyzing, promoting, and demoting behavioral evaluations.
 
 1. **Investigate**:
    - Use 'gh' cli to fetch the results from the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
    - DO NOT push any changes or start any runs. The rest of your evaluation will be local.
    - Evals are in evals/ directory and are documented by evals/README.md.
-   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row.
+   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row. These are candidates for promotion.
+   - Identify tests currently marked as stable/promoted that have failed in recent runs for any model. These are candidates for demotion.
    - NOTE: the results summary from the most recent run contains the last 7 runs test results. 100% means the test passed 3/3 times for that model and run.
-   - If a test meets this criteria, it is a candidate for promotion.
 
-2. **Promote**:
+2. **Promote and Demote**:
    - For each candidate test, locate the test file in the evals/ directory.
-   - Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations).
-   - Ensure you follow any guidelines in evals/README.md for stable tests.
-   - Your **final** change should be **minimal and targeted** to just promoting the test status.
+   - Promote tests according to the project's standard promotion process (e.g., moving to a stable suite, updating tags, or removing skip/flaky annotations).
+   - Demote tests according to the project's standard process (e.g., adding skip/flaky annotations, moving out of stable suites).
+   - Ensure you follow any guidelines in evals/README.md for stable and flaky tests.
+   - Your **final** change should be **minimal and targeted** to just updating the test status.
 
 3. **Verify**:
-   - Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode.
-   - Check that the test is now part of the expected standard or stable test suites.
+   - Run the updated tests locally to validate that they still execute correctly or are skipped as expected. Be sure to run vitest in non-interactive mode.
+   - Check that the tests are now part of the expected standard, stable, or flaky test suites.
 
 4. **Report**:
-   - Provide a summary of the tests that were promoted.
-   - Include the success rate evidence (7/7 runs passed for all models) for each promoted test.
-   - If no tests met the criteria for promotion, clearly state that and summarize the closest candidates.
+   - Provide a summary of the tests that were promoted and demoted.
+   - Include the success or failure rate evidence for each updated test.
+   - If no tests met the criteria for promotion or demotion, clearly state that and summarize the closest candidates.
 
 {{args}}
 """