mirror of https://github.com/google-gemini/gemini-cli.git synced 2026-06-19 15:56:48 -07:00

Christian Gunderman b2b6092c24 Add slash command for promoting behavioral evals to CI blocking (#20575 )

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

2026-02-27 19:11:30 +00:00

answer-vs-act.eval.ts

Fix issue where Gemini CLI can make changes when simply asked a question (#17608 )

2026-01-27 19:47:33 +00:00

app-test-helper.ts

feat(core): experimental in-progress steering hints (1 of 3) (#19008 )

2026-02-17 22:59:33 +00:00

automated-tool-use.eval.ts

Encourage agent to utilize ecosystem tools to perform work (#17881 )

2026-02-04 02:02:25 +00:00

edit-locations-eval.eval.ts

Fix issue where Gemini CLI creates tests in a new file (#18409 )

2026-02-10 20:53:29 +00:00

frugalReads.eval.ts

Stabilize tests. (#20095 )

2026-02-24 00:01:39 +00:00

frugalSearch.eval.ts

Stabilize tests. (#20095 )

2026-02-24 00:01:39 +00:00

generalist_agent.eval.ts

Cleanup post delegate_to_agent removal (#17875 )

2026-01-29 18:24:35 +00:00

generalist_delegation.eval.ts

feat(core): Enable generalist agent (#19665 )

2026-02-26 16:38:49 +00:00

gitRepo.eval.ts

Demote git evals to nightly run. (#17030 )

2026-01-19 19:00:41 +00:00

grep_search_functionality.eval.ts

feat(core): rename grep_search include parameter to include_pattern (#20328 )

2026-02-26 04:16:21 +00:00

hierarchical_memory.eval.ts

Disable failing eval test (#19455 )

2026-02-18 19:27:21 +00:00

interactive-hang.eval.ts

Stabilize tests. (#20095 )

2026-02-24 00:01:39 +00:00

model_steering.eval.ts

feat(core): experimental in-progress steering hints (2 of 2) (#19307 )

2026-02-18 22:05:50 +00:00

plan_mode.eval.ts

fix(core): clarify plan mode constraints and exit mechanism (#19438 )

2026-02-18 20:09:59 +00:00

README.md

Add slash command for promoting behavioral evals to CI blocking (#20575 )

2026-02-27 19:11:30 +00:00

save_memory.eval.ts

test(evals): mark all save_memory evals as USUALLY_PASSES due to unreliability (#18786 )

2026-02-11 02:16:52 +00:00

shell-efficiency.eval.ts

test(core): remove hardcoded model from TestRig (#18710 )

2026-02-10 07:54:23 +00:00

subagents.eval.ts

Refactor subagent delegation to be one tool per agent (#17346 )

2026-01-23 02:18:31 +00:00

test-helper.ts

feat(core): experimental in-progress steering hints (1 of 3) (#19008 )

2026-02-17 22:59:33 +00:00

tool_output_masking.eval.ts

test(evals): add behavioral tests for tool output masking (#19172 )

2026-02-18 05:07:25 +00:00

validation_fidelity_pre_existing_errors.eval.ts

chore(evals): update validation_fidelity_pre_existing_errors to USUALLY_PASSES (#18617 )

2026-02-09 01:31:22 -08:00

validation_fidelity.eval.ts

Demote unreliable test. (#20571 )

2026-02-27 16:48:46 +00:00

vitest.config.ts

feat(core): experimental in-progress steering hints (1 of 3) (#19008 )

2026-02-17 22:59:33 +00:00

README.md

Behavioral Evals

Behavioral evaluations (evals) are tests designed to validate the agent's behavior in response to specific prompts. They serve as a critical feedback loop for changes to system prompts, tool definitions, and other model-steering mechanisms.

Why Behavioral Evals?

Traditional integration tests verify that the system functions correctly (e.g., "does the file writer actually write to disk?"). Behavioral evals instead verify that the model chooses to take the correct action (e.g., "does the model decide to write to disk when asked to save code?").

They are also distinct from broad industry benchmarks (like SWE-bench). While benchmarks measure general capabilities across complex challenges, our behavioral evals focus on specific, granular behaviors relevant to the Gemini CLI's features.

Key Characteristics

- Feedback Loop: They help us understand how changes to prompts or tools affect the model's decision-making.
  - Did a change to the system prompt make the model less likely to use tool X?
  - Did a new tool definition confuse the model?
- Regression Testing: They prevent regressions in model steering.
- Non-Determinism: Unlike unit tests, LLM behavior can be non-deterministic. We distinguish between behaviors that should be robust (ALWAYS_PASSES) and those that are generally reliable but might occasionally vary (USUALLY_PASSES).

Creating an Evaluation

Evaluations are located in the evals directory. Each evaluation is a Vitest test file that uses the evalTest function from evals/test-helper.ts.

`evalTest`

The evalTest function is a helper that runs a single evaluation case. It takes two arguments:

policy: The consistency expectation for this test ('ALWAYS_PASSES' or 'USUALLY_PASSES').
evalCase: An object defining the test case.

Policies

Policies control how strictly a test is validated.

ALWAYS_PASSES: Tests expected to pass 100% of the time. These are typically trivial and test basic functionality. They run in every CI run and can block PRs on failure.
USUALLY_PASSES: Tests expected to pass most of the time but that may be flaky because model behavior is non-deterministic. They run nightly and are used to track the health of the product from build to build.

All new behavioral evaluations must be created with the USUALLY_PASSES policy. Tests that prove to be highly stable over time may be promoted to ALWAYS_PASSES. For more information, see Test promotion process.

`EvalCase` Properties

name: The name of the evaluation case.
prompt: The prompt to send to the model.
params: An optional object with parameters to pass to the test rig (e.g., settings).
assert: An async function that takes the test rig and the result of the run and asserts that the result is correct.
log: An optional boolean that, if set to true, will log the tool calls to a file in the evals/logs directory.
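The properties above can be pictured as a TypeScript shape. This is only a sketch: the names `EvalPolicy`, `RigLike`, `EvalCaseSketch`, and `describeCase` are hypothetical, the real types live in evals/test-helper.ts and may differ, and the README does not say what type the run result has (a string is assumed here).

```typescript
// Hypothetical types for illustration; see evals/test-helper.ts for the real ones.
type EvalPolicy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';

// Placeholder for the test rig, whose API is not described in this README.
type RigLike = unknown;

interface EvalCaseSketch {
  name: string; // name of the evaluation case
  prompt: string; // prompt sent to the model
  params?: Record<string, unknown>; // optional test rig parameters (e.g., settings)
  // Assumption: the run result is modeled as a string.
  assert: (rig: RigLike, result: string) => Promise<void>;
  log?: boolean; // when true, tool calls are logged under evals/logs
}

const exampleCase: EvalCaseSketch = {
  name: 'should do something',
  prompt: 'do it',
  params: { settings: {} },
  log: true,
  assert: async (_rig, result) => {
    if (result.length === 0) {
      throw new Error('expected a non-empty result');
    }
  },
};

// Small helper that labels a case with its policy, e.g. for log output.
function describeCase(policy: EvalPolicy, evalCase: EvalCaseSketch): string {
  return `[${policy}] ${evalCase.name}`;
}
```

Only `name`, `prompt`, and `assert` are required in this sketch; `params` and `log` are optional, matching the descriptions above.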

Example

import { describe, expect } from 'vitest';
import { evalTest } from './test-helper.js';

describe('my_feature', () => {
  // New tests MUST start as USUALLY_PASSES and be promoted via /promote-behavioral-eval
  evalTest('USUALLY_PASSES', {
    name: 'should do something',
    prompt: 'do it',
    assert: async (rig, result) => {
      // assertions
    },
  });
});

Running Evaluations

First, build the bundled Gemini CLI. You must do this after every code change.

npm run build
npm run bundle

Always Passing Evals

To run the evaluations that are expected to always pass (CI safe):

npm run test:always_passing_evals

All Evals

To run all evaluations, including those that may be flaky ("usually passes"):

npm run test:all_evals

This command sets the RUN_EVALS environment variable to 1, which enables the USUALLY_PASSES tests.
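As a rough illustration of that gating, the sketch below decides whether a test should run for a given policy and environment. The function name `shouldRunEval` and the exact check are assumptions; the README only states that RUN_EVALS set to 1 enables the USUALLY_PASSES tests, not how the helper implements it.

```typescript
type EvalPolicy = 'ALWAYS_PASSES' | 'USUALLY_PASSES';
type Env = Record<string, string | undefined>;

// Hypothetical gate: ALWAYS_PASSES tests run unconditionally, while
// USUALLY_PASSES tests run only when RUN_EVALS is set to "1".
function shouldRunEval(policy: EvalPolicy, env: Env): boolean {
  if (policy === 'ALWAYS_PASSES') {
    return true;
  }
  return env['RUN_EVALS'] === '1';
}
```

Passing the environment as a parameter, rather than reading a global, keeps the sketch easy to exercise in isolation.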

Ensuring Eval is Stable Prior to Check-in

The Evals: Nightly run is considered the source of truth for the quality of an eval test. Each run executes a test 3 times in a row for each supported model. The result is then scored 0%, 33%, 66%, or 100%, depending on whether 0, 1, 2, or 3 of the individual executions passed.

Googlers can schedule a manual run against their branch by clicking the link above.

Tests should score at least 66% with key models, including Gemini 3.1 Pro, Gemini 3.0 Pro, and Gemini 3 Flash, prior to check-in. They must pass 100% of the time before they are promoted.
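The scoring arithmetic can be sketched as follows. The function names and the idea of passing in one score per key model are illustrative assumptions, not part of the nightly workflow's actual code.

```typescript
// Convert "passed out of attempts" into the whole-number percentage used
// in the report: 0, 33, 66, or 100 when there are 3 attempts.
function scorePercent(passed: number, attempts = 3): number {
  if (attempts <= 0 || passed < 0 || passed > attempts) {
    throw new RangeError('passed must be between 0 and attempts');
  }
  return Math.floor((passed / attempts) * 100);
}

// Hypothetical check-in gate: every key model scores at least 66%.
function readyForCheckIn(scoresByModel: number[]): boolean {
  return scoresByModel.length > 0 && scoresByModel.every((s) => s >= 66);
}

// Hypothetical promotion gate: every model scores 100%.
function readyForPromotion(scoresByModel: number[]): boolean {
  return scoresByModel.length > 0 && scoresByModel.every((s) => s === 100);
}
```

For example, a test that passes 2 of 3 executions on each key model scores 66% everywhere, which is enough for check-in but not for promotion.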

Test promotion process

To maintain a stable and reliable CI, all new behavioral evaluations follow a mandatory deflaking process.

Incubation: You must create all new tests with the USUALLY_PASSES policy. This lets them be monitored in the nightly runs without blocking PRs.
Monitoring: The test must complete at least 10 nightly runs across all supported models.
Promotion: Promotion to ALWAYS_PASSES happens exclusively through the /promote-behavioral-eval slash command. This command verifies the 100% success rate requirement is met across many runs before updating the test policy.

This promotion process is essential for preventing the introduction of flaky evaluations into the CI.

Reporting

Results for evaluations are available on GitHub Actions:

CI Evals: Included in the E2E (Chained) workflow. These must pass 100% for every PR.
Nightly Evals: Run daily via the Evals: Nightly workflow. These track the long-term health and stability of model steering.

Nightly Report Format

The nightly workflow executes the full evaluation suite multiple times (currently 3 attempts) to account for non-determinism. These results are aggregated into a Nightly Summary attached to the workflow run.

How to interpret the report:

Pass Rate (%): Each cell represents the percentage of successful runs for a specific test in that workflow instance.
History: The table shows the pass rates for the last 7 nightly runs, allowing you to identify if a model's behavior is trending towards instability.
Total Pass Rate: An aggregate metric of all evaluations run in that batch.

A significant drop in the pass rate for a USUALLY_PASSES test, even if it doesn't drop to 0%, often indicates that a recent change to a system prompt or tool definition has made the model's behavior less reliable.
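One way to reason about such a drop is to compare the latest pass rate against the average of the earlier runs in the history. The sketch below is an assumption-laden illustration: the report does not define "significant", so the 30-point threshold and the function names are hypothetical, and the aggregate is assumed to be a plain mean.

```typescript
// Mean of a list of pass rates (in percent).
function mean(values: number[]): number {
  if (values.length === 0) {
    throw new RangeError('need at least one value');
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical heuristic: flag a test when its latest pass rate falls at
// least `thresholdPoints` percentage points below its earlier average.
function hasSignificantDrop(
  previousRates: number[],
  latestRate: number,
  thresholdPoints = 30,
): boolean {
  if (previousRates.length === 0) {
    return false; // no baseline to compare against
  }
  return mean(previousRates) - latestRate >= thresholdPoints;
}
```

With six earlier runs at 100% and a latest run at 66%, this heuristic flags the test even though it still passes most of the time.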

Fixing Evaluations

If an evaluation is failing or has a regressed pass rate, you can use the /fix-behavioral-eval command within Gemini CLI to help investigate and fix the issue.

`/fix-behavioral-eval`

This command is designed to automate the investigation and fixing process for failing evaluations. It will:

Investigate: Fetch the latest results from the nightly workflow using the gh CLI, identify the failing test, and review test trajectory logs in evals/logs.
Fix: Suggest and apply targeted fixes to the prompt or tool definitions. It prioritizes minimal changes to prompt.ts, tool instructions, and modules that contribute to the prompt. It generally tries to avoid changing the test itself.
Verify: Re-run the test 3 times across multiple models (e.g., Gemini 3.0, Gemini 3 Flash, Gemini 2.5 Pro) to ensure stability and calculate a success rate.
Report: Provide a summary of the success rate for each model and details on the applied fixes.

To use it, run:

gemini /fix-behavioral-eval

You can also provide a link to a specific GitHub Action run or the name of a specific test to focus the investigation:

gemini /fix-behavioral-eval https://github.com/google-gemini/gemini-cli/actions/runs/123456789

When investigating failures manually, you can also enable verbose agent logs by setting the GEMINI_DEBUG_LOG_FILE environment variable.

Best practices

It's highly recommended to manually review and/or ask the agent to iterate on any prompt changes, even if they pass all evals. The prompt should prefer positive traits ('do X') and resort to negative traits ('do not do X') only when unable to accomplish the goal with positive traits. Gemini is quite good at introspecting on its prompt when asked the right questions.

Promoting evaluations

Evaluations must be promoted from USUALLY_PASSES to ALWAYS_PASSES exclusively using the /promote-behavioral-eval slash command. Manual promotion is not allowed; this ensures that the 100% success rate requirement is empirically met.

`/promote-behavioral-eval`

This command automates the promotion of stable tests by:

Investigating: Analyzing the results of the last 7 nightly runs on the main branch using the gh CLI.
Criteria Check: Identifying tests that have passed 100% of the time for ALL enabled models across the entire 7-run history.
Promotion: Updating the test file's policy from USUALLY_PASSES to ALWAYS_PASSES.
Verification: Running the promoted test locally to ensure correctness.

To run it:

gemini /promote-behavioral-eval
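The criteria check described above can be sketched as a small predicate. This is not the command's implementation: the `NightlyRun` shape, the function name `isPromotable`, and the model names in the usage note are hypothetical, and the 7-run window is taken from the description of the command.

```typescript
// One nightly run for a single test: model name -> pass rate in percent.
type NightlyRun = Record<string, number>;

// A test qualifies when the most recent `requiredRuns` nightly runs exist
// and every model scored 100% in each of them.
function isPromotable(runs: NightlyRun[], requiredRuns = 7): boolean {
  if (runs.length < requiredRuns) {
    return false; // not enough history yet
  }
  return runs.slice(-requiredRuns).every((run) => {
    const rates = Object.values(run);
    return rates.length > 0 && rates.every((rate) => rate === 100);
  });
}
```

For instance, seven runs of `{ 'model-a': 100, 'model-b': 100 }` qualify, while a single 66% for any model anywhere in that window does not.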
