mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-06-15 05:47:18 -07:00
6355e2d8a1
Comprehensive automation upgrades for performance and memory baselines. Includes GitHub Actions workflows for remote updates, automatic local comparisons against main, and git-ignored temporary baselines. - Added update-baselines.yml GitHub Action to automate remote baseline upgrades efficiently in CI. - Created scripts/run-perf-tests.js to wrap performance executions, safely stashing dirty alterations and gathering main-branch baselines locally when run without arguments. - Enhanced PerfTestHarness and MemoryTestHarness to accommodate tolerance limits assertions safely. - Updated test files to process TEMP_BASELINES_PATH environment variables, protecting tracked files clean during local evaluations. - Formed docs/performance-and-memory-testing.md safely centrally detailing general strategies. - Obsoleted folder files perf-tests/README.md, and memory-tests/README.md deleted altogether. - Registered temporary baseline outputs inside .gitignore and updated scripts/clean.js safely for fast removals on npm run clean.
111 lines
3.3 KiB
Markdown
111 lines
3.3 KiB
Markdown
# Performance & Memory Testing Infrastructure
|
|
|
|
## Overview
|
|
|
|
Gemini CLI features a highly reliable performance and memory regression testing
|
|
pipeline. To curb anomalies and yields accurate results, the harness applies:
|
|
|
|
- **IQR Outlier Filtering**: Discards anomalous metrics from evaluation safely.
|
|
- **Median Sampling**: Takes `N` runs, evaluating strictly median averages
|
|
effortlessly.
|
|
- **Warmup Runs**: Discards first samples smoothly preventing JIT artifacts.
|
|
- **Tolerance Boundary**: Default restrictions at 15% tolerance prevent
|
|
unwarranted panics effortlessly.
|
|
|
|
---
|
|
|
|
## Baseline Management
|
|
|
|
There are two core strategies for calibrating tolerances on performance
|
|
benchmarks:
|
|
|
|
- **Approach A: Normalize for Testing Servers**: Tests run directly on the
|
|
automated cloud servers, and those scores are recorded as official, static
|
|
baselines.
|
|
- **Approach B: Machine-Agnostic Daily Comparisons**: Static baseline files are
|
|
ignored. Every night, the test is run against today's and yesterday's code on
|
|
the exact same server.
|
|
|
|
### Recommended Strategy: GitHub Action + Approach A
|
|
|
|
#### Local Development & PR Checks
|
|
|
|
- **Local Testing**: If you are a developer trying to quickly test your code
|
|
changes against performance or memory impacts, simply run the standard local
|
|
perf or memory tests directly without arguments. The harness stashes dirty
|
|
alterations automatically, refreshes baseline settings against the most
|
|
up-to-date `main` branch dynamically using non-tracked ephemeral files, and
|
|
yields immediate comparison feedback.
|
|
- **PR Merges**: Please note that if your alterations intentionally necessitate
|
|
adjustments across baseline metrics, you should trigger the GitHub Action to
|
|
recalibrate baselines in tandem with merging your PR. This is so that
|
|
subsequent nightly audits appropriately do their evaluation comparisons
|
|
against the new tolerances successfully!
|
|
|
|
#### Nightly Build Health Audits
|
|
|
|
- Strict Approach A procedures apply daily across platforms on dedicated
|
|
environments, avoiding the "boiling frog" issue where micro-regressions
|
|
quietly slip past over periods of duration.
|
|
|
|
---
|
|
|
|
## Running Tests
|
|
|
|
### Performance CPU Tests
|
|
|
|
```bash
|
|
# Run tests (compare against committed baselines)
|
|
npm run test:perf
|
|
|
|
# Verbose output
|
|
VERBOSE=true npm run test:perf
|
|
|
|
# Keep test artifacts for debugging
|
|
KEEP_OUTPUT=true npm run test:perf
|
|
```
|
|
|
|
### Memory Tests
|
|
|
|
```bash
|
|
# Run memory tests (compare against local main baselines)
|
|
npm run test:memory
|
|
```
|
|
|
|
---
|
|
|
|
## Architecture & Configuration
|
|
|
|
### Performance Tests Directory Tree
|
|
|
|
- `perf-tests/baselines.json`: Committed baseline values
|
|
- `perf-tests/globalSetup.ts`: Test environment setup
|
|
- `perf-tests/perf-usage.test.ts`: Test scenarios
|
|
- `perf-tests/perf.*.responses`: Fake API responses per scenario
|
|
|
|
### Memory Tests Directory Tree
|
|
|
|
- `memory-tests/baselines.json`: Committed memory values
|
|
- `memory-tests/memory-usage.test.ts`: Memory test scenarios
|
|
|
|
---
|
|
|
|
## CI Integration
|
|
|
|
These tests are strictly excluded from `preflight` constraints and remain
|
|
designed strictly for nightly daily audits accurately:
|
|
|
|
```yaml
|
|
- name: Performance regression tests
|
|
run: npm run test:perf
|
|
```
|
|
|
|
---
|
|
|
|
## Adding New Scenarios
|
|
|
|
1. Add a fake response file: `perf.<scenario-name>.responses` or
|
|
`memory.<scenario-name>.responses`.
|
|
2. Add a test case in `perf-usage.test.ts` or `memory-usage.test.ts` applying
|
|
`harness.runScenario()`.
|