From f1868e599c838ead8b09996f5fd6e15cfd9af5a4 Mon Sep 17 00:00:00 2001
From: Christian Gunderman <gundermanc@google.com>
Date: Mon, 2 Mar 2026 15:43:48 -0800
Subject: [PATCH] feat(evals): add demotion capability to
 promote-behavioral-eval command

---
 .gemini/commands/promote-behavioral-eval.toml | 27 ++++++++++---------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/.gemini/commands/promote-behavioral-eval.toml b/.gemini/commands/promote-behavioral-eval.toml
index 9893e9b02b..5c937cf62c 100644
--- a/.gemini/commands/promote-behavioral-eval.toml
+++ b/.gemini/commands/promote-behavioral-eval.toml
@@ -1,29 +1,30 @@
-description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs."
+description = "Promote behavioral evals that have a 100% success rate over the last 7 nightly runs, and demote evals that have failed."
 prompt = """
-You are an expert at analyzing and promoting behavioral evaluations.
+You are an expert at analyzing, promoting, and demoting behavioral evaluations.
 
 1. **Investigate**:
    - Use 'gh' cli to fetch the results from the most recent run from the main branch: https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml.
    - DO NOT push any changes or start any runs. The rest of your evaluation will be local.
    - Evals are in evals/ directory and are documented by evals/README.md.
-   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row.
+   - Identify tests that have passed 100% of the time for ALL enabled models across the past 7 runs in a row. These are candidates for promotion.
+   - Identify tests currently marked as stable/promoted that have failed for any model in any of the past 7 runs. These are candidates for demotion.
    - NOTE: the results summary from the most recent run contains the last 7 runs test results. 100% means the test passed 3/3 times for that model and run.
-   - If a test meets this criteria, it is a candidate for promotion.
 
-2. **Promote**:
+2. **Promote and Demote**:
    - For each candidate test, locate the test file in the evals/ directory.
-   - Promote the test according to the project's standard promotion process (e.g., moving it to a stable suite, updating its tags, or removing skip/flaky annotations). 
-   - Ensure you follow any guidelines in evals/README.md for stable tests.
-   - Your **final** change should be **minimal and targeted** to just promoting the test status.
+   - Promote tests according to the project's standard promotion process (e.g., moving to a stable suite, updating tags, or removing skip/flaky annotations).
+   - Demote tests according to the project's standard demotion process (e.g., adding skip/flaky annotations or moving them out of stable suites).
+   - Ensure you follow any guidelines in evals/README.md for stable and flaky tests.
+   - Your **final** change should be **minimal and targeted** to just updating the test status.
 
 3. **Verify**:
-   - Run the promoted tests locally to validate that they still execute correctly. Be sure to run vitest in non-interactive mode.
-   - Check that the test is now part of the expected standard or stable test suites.
+   - Run the updated tests locally to validate that they still execute correctly or are skipped as expected. Be sure to run vitest in non-interactive mode.
+   - Check that the tests are now part of the expected standard, stable, or flaky test suites.
 
 4. **Report**:
-   - Provide a summary of the tests that were promoted.
-   - Include the success rate evidence (7/7 runs passed for all models) for each promoted test.
-   - If no tests met the criteria for promotion, clearly state that and summarize the closest candidates.
+   - Provide a summary of the tests that were promoted and demoted.
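+   - Include the success rate evidence (7/7 runs passed for all models) for each promoted test, and the failure evidence (which runs and models failed) for each demoted test.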
+   - If no tests met the criteria for promotion or demotion, clearly state that and summarize the closest candidates.
 
 {{args}}
 """