mirror of https://github.com/google-gemini/gemini-cli.git
synced 2026-03-10 14:10:37 -07:00

Add some dos and don'ts to behavioral evals README. (#20629)
committed by GitHub
parent 01927a36d1
commit 25f59a0099
@@ -3,7 +3,8 @@

Behavioral evaluations (evals) are tests designed to validate the agent's
behavior in response to specific prompts. They serve as a critical feedback loop
for changes to system prompts, tool definitions, and other model-steering
mechanisms.
mechanisms, as a tool for assessing feature reliability by model, and as a
safeguard against regressions.

## Why Behavioral Evals?
@@ -30,6 +31,48 @@ CLI's features.

those that are generally reliable but might occasionally vary
(`USUALLY_PASSES`).

## Best Practices

When designing behavioral evals, aim for scenarios that accurately reflect
real-world usage while remaining small and maintainable.

- **Realistic Complexity**: Evals should be complicated enough to be
  "realistic." They should operate on actual files and a source directory,
  mirroring how a real agent interacts with a workspace. Remember that the agent
  may behave differently in a larger codebase, so we want to avoid scenarios
  that are too simple to be realistic.
  - _Good_: An eval that provides a small, functional React component and asks
    the agent to add a specific feature, requiring it to read the file,
    understand the context, and write the correct changes.
  - _Bad_: An eval that simply asks the agent a trivia question or asks it to
    write a generic script without providing any local workspace context.
- **Maintainable Size**: Evals should be small enough to reason about and
  maintain. We probably can't check in an entire repo as a test case, though
  over time we will want these evals to mature into more and more realistic
  scenarios.
  - _Good_: A test setup with 2-3 files (e.g., a source file, a config file, and
    a test file) that isolates the specific behavior being evaluated.
  - _Bad_: A test setup containing dozens of files from a complex framework
    where the setup logic itself is prone to breaking.
- **Unambiguous and Reliable Assertions**: Assertions must be clear and specific
  to ensure the test passes for the right reason.
  - _Good_: Checking that a modified file contains a specific AST node or exact
    string, or verifying that a tool was called with the right parameters.
  - _Bad_: Only checking that a tool call occurred, which could happen for an
    unrelated reason, or expecting specific verbatim LLM output.
- **Fail First**: Write tests that fail before your prompt or tool change. We
  want to be sure the test fails before your "fix". It's pretty easy to
  accidentally create a passing test that asserts behaviors we get for free. In
  general, every eval should be accompanied by a prompt change, and most prompt
  changes should be accompanied by an eval.
  - _Good_: Observing a failure, writing an eval that reliably reproduces the
    failure, modifying the prompt/tool, and then verifying the eval passes.
  - _Bad_: Writing an eval that passes on the first run and assuming your new
    prompt change was responsible.
- **Less is More**: Prefer fewer, more realistic tests that assert the major
  paths over many smaller, unit-test-like ones. These are evals, so the value
  is in testing how the agent works in a semi-realistic scenario.

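The assertion practice above can be sketched as a small, self-contained check.
This is an illustrative sketch, not the repo's actual eval harness: the
temp-workspace layout, the `assertFileContains` helper, and the `Button.tsx`
fixture are all hypothetical, and the agent run itself is elided.

```typescript
import { mkdtempSync, writeFileSync, readFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";
import { ok } from "node:assert";

// A small workspace (2 files), per the "Maintainable Size" practice.
// File names and contents are hypothetical fixtures.
const workspace = mkdtempSync(join(tmpdir(), "eval-"));
writeFileSync(
  join(workspace, "Button.tsx"),
  "export const Button = () => <button>Click me</button>;\n",
);
writeFileSync(join(workspace, "config.json"), '{ "theme": "light" }\n');

// An unambiguous, string-level assertion: the eval passes only if the file
// actually contains the exact text the agent was asked to produce, not
// merely because some tool call happened.
function assertFileContains(path: string, needle: string): void {
  const content = readFileSync(path, "utf8");
  ok(
    content.includes(needle),
    `expected ${path} to contain ${JSON.stringify(needle)}`,
  );
}

// After a (hypothetical) agent run edits Button.tsx, the eval would check
// for the specific requested change; here we just demonstrate the check.
assertFileContains(join(workspace, "Button.tsx"), "export const Button");
```

A check like this fails with a descriptive message whenever the expected string
is absent, which keeps the eval's pass/fail reason unambiguous.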
## Creating an Evaluation

Evaluations are located in the `evals` directory. Each evaluation is a Vitest