mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-05-31 06:02:47 -07:00
test(perf): overhaul performance and memory baseline management
Comprehensive automation upgrades for performance and memory baselines. Includes GitHub Actions workflows for remote updates, automatic local comparisons against main, and git-ignored temporary baselines.

- Added update-baselines.yml GitHub Action to automate remote baseline updates in CI.
- Created scripts/run-perf-tests.js to wrap performance runs, stashing uncommitted changes and gathering main-branch baselines locally when run without arguments.
- Enhanced PerfTestHarness and MemoryTestHarness to support tolerance-limit assertions.
- Updated test files to honor the TEMP_BASELINES_PATH environment variable, keeping tracked files clean during local runs.
- Added docs/performance-and-memory-testing.md as the central description of the overall strategy.
- Deleted the obsolete perf-tests/README.md and memory-tests/README.md.
- Registered temporary baseline outputs in .gitignore and updated scripts/clean.js to remove them on npm run clean.
@@ -0,0 +1,110 @@
# Performance & Memory Testing Infrastructure

## Overview

Gemini CLI includes a performance and memory regression testing pipeline. To
reduce noise and yield accurate results, the harness applies:

- **IQR Outlier Filtering**: Discards anomalous samples before evaluation.
- **Median Sampling**: Takes `N` runs and evaluates the median of the samples.
- **Warmup Runs**: Discards the first samples to avoid JIT artifacts.
- **Tolerance Boundary**: A default tolerance of 15% prevents false alarms from
  ordinary run-to-run variance.
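
The four techniques above can be sketched roughly as follows. This is a minimal illustration, not the actual `PerfTestHarness` code: the function names are hypothetical, the 1.5 × IQR fences are the conventional Tukey rule and are assumed here, and the tolerance check is assumed to be one-sided (only regressions fail).

```typescript
// Hypothetical helpers; names and exact rules are assumptions for illustration.

/** Linearly interpolated quantile of an ascending-sorted array. */
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

/** Drops samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed fences). */
function filterOutliersIqr(samples: number[]): number[] {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return sorted.filter((v) => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr);
}

/** Median of the samples (input does not need to be sorted). */
function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  return quantile(sorted, 0.5);
}

/** Discards warmup runs, filters outliers, then reports the median. */
function summarize(rawSamples: number[], warmupRuns: number): number {
  const measured = rawSamples.slice(warmupRuns);
  if (measured.length === 0) {
    throw new Error('no samples left after discarding warmup runs');
  }
  return median(filterOutliersIqr(measured));
}

/** True when the measurement is at most tolerancePercent above the baseline. */
function isWithinTolerance(
  measured: number,
  baseline: number,
  tolerancePercent = 15,
): boolean {
  return measured <= baseline * (1 + tolerancePercent / 100);
}

// Example: the first run (500) is warmup and 400 is filtered as an outlier.
const summary = summarize([500, 100, 102, 98, 101, 99, 400], 1); // → 100
const passes = isWithinTolerance(summary, 90); // → true (limit is about 103.5)
```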

---

## Baseline Management

There are two core strategies for calibrating tolerances on performance
benchmarks:

- **Approach A: Normalize for Testing Servers**: Tests run directly on the
  automated cloud servers, and those scores are recorded as official, static
  baselines.
- **Approach B: Machine-Agnostic Daily Comparisons**: Static baseline files are
  ignored. Every night, the tests run against both today's and yesterday's code
  on the same server.

### Recommended Strategy: GitHub Action + Approach A

#### Local Development & PR Checks

- **Local Testing**: To quickly check the performance or memory impact of your
  changes, run the standard local perf or memory tests without arguments. The
  harness automatically stashes uncommitted changes, gathers fresh baselines
  from the latest `main` branch into untracked temporary files, and reports the
  comparison immediately.
- **PR Merges**: If your changes intentionally shift baseline metrics, trigger
  the GitHub Action to update the baselines when you merge your PR, so that
  subsequent nightly audits compare against the new baselines.
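
One way a test file could pick between the committed baselines and the temporary, untracked ones is sketched below. The `TEMP_BASELINES_PATH` variable is named in the commit message; the function name, the fallback rule, and the example temp path are assumptions for illustration rather than the repository's actual code.

```typescript
// Sketch only: resolveBaselinesPath is a hypothetical helper.
type Env = Record<string, string | undefined>;

/**
 * Returns the temporary baselines file when TEMP_BASELINES_PATH is set to a
 * non-empty value, otherwise the committed baselines file.
 */
function resolveBaselinesPath(env: Env, committedPath: string): string {
  const temp = env['TEMP_BASELINES_PATH'];
  return temp !== undefined && temp.trim() !== '' ? temp : committedPath;
}

// Local run: a wrapper script would set the variable (path is hypothetical).
const localPath = resolveBaselinesPath(
  { TEMP_BASELINES_PATH: '/tmp/perf-baselines.json' },
  'perf-tests/baselines.json',
);

// Nightly run: the variable is unset, so the committed file is used.
const nightlyPath = resolveBaselinesPath({}, 'perf-tests/baselines.json');
```

In a real test file the first argument would typically be `process.env`. Passing the environment in explicitly keeps the helper easy to test and ensures tracked baseline files are never written during local runs.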

#### Nightly Build Health Audits

- Approach A is applied strictly every night, across platforms and on dedicated
  environments. Comparing against static baselines avoids the "boiling frog"
  problem, where small regressions accumulate unnoticed over time.
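
The "boiling frog" effect can be illustrated with a small simulation. The numbers are hypothetical: a metric that regresses by 1% per day never trips a 15% day-over-day check (Approach B), but a static baseline (Approach A) flags it once the cumulative drift exceeds the tolerance.

```typescript
// Illustrative simulation; the drift rate and day count are made-up inputs.

/** Metric values for day 0..days, growing by dailyDrift each day. */
function simulateDrift(
  baseline: number,
  dailyDrift: number,
  days: number,
): number[] {
  const values = [baseline];
  for (let d = 1; d <= days; d++) {
    values.push(values[d - 1] * (1 + dailyDrift));
  }
  return values;
}

/**
 * First day whose value exceeds its reference by more than the tolerance,
 * or null if no day fails. 'static' compares against day 0;
 * 'dayOverDay' compares against the previous day.
 */
function firstFailingDay(
  values: number[],
  tolerance: number,
  mode: 'static' | 'dayOverDay',
): number | null {
  for (let d = 1; d < values.length; d++) {
    const reference = mode === 'static' ? values[0] : values[d - 1];
    if (values[d] > reference * (1 + tolerance)) {
      return d;
    }
  }
  return null;
}

const drifting = simulateDrift(100, 0.01, 20);
const staticFailure = firstFailingDay(drifting, 0.15, 'static'); // → 15
const dailyFailure = firstFailingDay(drifting, 0.15, 'dayOverDay'); // → null
```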

---

## Running Tests

### Performance CPU Tests

```bash
# Run tests (compare against committed baselines)
npm run test:perf

# Verbose output
VERBOSE=true npm run test:perf

# Keep test artifacts for debugging
KEEP_OUTPUT=true npm run test:perf
```

### Memory Tests

```bash
# Run memory tests (compare against local main baselines)
npm run test:memory
```

---

## Architecture & Configuration

### Performance Tests Directory Tree

- `perf-tests/baselines.json`: Committed baseline values
- `perf-tests/globalSetup.ts`: Test environment setup
- `perf-tests/perf-usage.test.ts`: Test scenarios
- `perf-tests/perf.*.responses`: Fake API responses per scenario

### Memory Tests Directory Tree

- `memory-tests/baselines.json`: Committed memory values
- `memory-tests/memory-usage.test.ts`: Memory test scenarios

---

## CI Integration

These tests are excluded from `preflight` and are designed to run only in the
nightly audits:

```yaml
- name: Performance regression tests
  run: npm run test:perf
```

---

## Adding New Scenarios

1. Add a fake response file: `perf.<scenario-name>.responses` or
   `memory.<scenario-name>.responses`.
2. Add a test case in `perf-usage.test.ts` or `memory-usage.test.ts` that calls
   `harness.runScenario()`.