feat(evals): add PR impact analysis workflow

2026-07-09 01:27:41 -07:00 · 2026-03-18 15:30:07 -07:00
parent a5a461c234
commit 6bef72cddd
3 changed files with 192 additions and 46 deletions
@@ -200,10 +200,30 @@ Results for evaluations are available on GitHub Actions:
 - **CI Evals**: Included in the
  [E2E (Chained)](https://github.com/google-gemini/gemini-cli/actions/workflows/chained_e2e.yml)
  workflow. These must pass 100% for every PR.
+- **PR Impact Analysis**: Run automatically via the
+  [Evals: PR Impact](https://github.com/google-gemini/gemini-cli/actions/workflows/eval-pr.yml)
+  workflow on PRs that modify prompts, tools, or agent logic. This provides a
+  "before and after" comparison of behavioral eval stability and fails the PR
+  if regressions are detected.
 - **Nightly Evals**: Run daily via the
  [Evals: Nightly](https://github.com/google-gemini/gemini-cli/actions/workflows/evals-nightly.yml)
  workflow. These track the long-term health and stability of model steering.

+### PR Impact Report
+
+When a PR triggers the impact analysis, a bot posts a comment on the PR with a
+table showing the change in pass rates for critical models (e.g. Gemini 3.1 Pro
+and Gemini 3 Flash).
+
+- **Baseline**: The pass rate from the latest successful nightly run on `main`.
+- **Current**: The pass rate on the PR branch, measured by a "lite" run of a
+  single attempt.
+- **Impact**: A delta showing whether stability improved (🟢), regressed (🔴),
+  or remained stable (⚪).
+
+A PR that introduces a regression (🔴) in any behavioral evaluation fails this
+check and must be investigated before merging.
+
 ### Nightly Report Format

 The nightly workflow executes the full evaluation suite multiple times