test(perf): overhaul performance and memory baseline management

Comprehensive automation upgrades for performance and memory baselines. Includes GitHub Actions workflows for remote updates, automatic local comparisons against main, and git-ignored temporary baselines.

- Added update-baselines.yml GitHub Action to automate remote baseline updates in CI.
- Created scripts/run-perf-tests.js to wrap performance runs; when run without arguments, it stashes uncommitted changes and gathers main-branch baselines locally.
- Enhanced PerfTestHarness and MemoryTestHarness to support tolerance-limit assertions.
- Updated test files to read the TEMP_BASELINES_PATH environment variable, keeping tracked files clean during local runs.
- Added docs/performance-and-memory-testing.md as the central description of the testing strategy.
- Deleted the now-obsolete perf-tests/README.md and memory-tests/README.md.
- Added temporary baseline outputs to .gitignore and updated scripts/clean.js to remove them on npm run clean.
Sri Pasumarthi
2026-04-16 16:10:31 -07:00
parent daf5006237
commit 6355e2d8a1
17 changed files with 650 additions and 136 deletions
@@ -0,0 +1,110 @@
# Performance & Memory Testing Infrastructure
## Overview
Gemini CLI includes a performance and memory regression testing pipeline. To
reduce noise and yield accurate results, the harness applies:
- **IQR Outlier Filtering**: Discards anomalous samples before evaluation.
- **Median Sampling**: Takes `N` runs and evaluates the median.
- **Warmup Runs**: Discards the first samples to avoid JIT artifacts.
- **Tolerance Boundary**: A default tolerance of 15% prevents false alarms
from normal run-to-run variation.
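
The four techniques above can be sketched as a small pipeline. This is a minimal illustration, not the harness's actual code: the function names, the 1.5 × IQR fence, the linear-interpolation quartiles, and the one-sided (regression-only) tolerance check are all assumptions.

```typescript
// Hypothetical helpers illustrating the statistics described above.
// None of these names come from PerfTestHarness or MemoryTestHarness.

// Linear-interpolation quantile over an already sorted array (assumed method).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// IQR outlier filtering: keep samples within 1.5 * IQR of the quartiles.
function filterOutliersIqr(samples: number[]): number[] {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return sorted.filter((v) => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr);
}

// Median sampling: the middle value of the remaining samples.
function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  return quantile(sorted, 0.5);
}

// Warmup runs: drop the first `warmupRuns` samples, then filter and take the median.
function summarize(rawSamples: number[], warmupRuns: number): number {
  return median(filterOutliersIqr(rawSamples.slice(warmupRuns)));
}

// Tolerance boundary: flag only results that exceed the baseline by more than
// the tolerance (assumed one-sided; 15% is the default stated in this document).
function exceedsTolerance(
  measured: number,
  baseline: number,
  tolerancePercent = 15,
): boolean {
  return measured > baseline * (1 + tolerancePercent / 100);
}

// One slow warmup run (500) and one outlier (300) are both discarded.
const summary = summarize([500, 100, 102, 98, 101, 99, 300], 1); // → 100
```

Taking the median after filtering keeps a single noisy run from moving the reported value, which is why a fixed tolerance can stay fairly tight.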
---
## Baseline Management
There are two core strategies for calibrating tolerances on performance
benchmarks:
- **Approach A: Normalize for Testing Servers**: Tests run directly on the
automated cloud servers, and those scores are recorded as official, static
baselines.
- **Approach B: Machine-Agnostic Daily Comparisons**: Static baseline files are
ignored. Every night, the test is run against today's and yesterday's code on
the exact same server.
### Recommended Strategy: GitHub Action + Approach A
#### Local Development & PR Checks
- **Local Testing**: To quickly check your changes for performance or memory
impact, run the standard local perf or memory tests without arguments. The
harness automatically stashes uncommitted changes, regenerates baselines from
the latest `main` branch into untracked temporary files, and reports the
comparison immediately.
- **PR Merges**: If your changes intentionally shift baseline metrics, trigger
the GitHub Action to recalibrate the baselines when you merge your PR, so that
subsequent nightly audits compare against the updated baselines.
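
As a rough sketch of how a test file might choose between the temporary and the committed baselines: the commit message says the test files read `TEMP_BASELINES_PATH`, but the helper name and the fallback behavior below are assumptions, not the repository's code.

```typescript
// Hypothetical helper: prefer the untracked temporary baselines when
// TEMP_BASELINES_PATH is set, otherwise fall back to the committed file.
function resolveBaselinesPath(
  env: Record<string, string | undefined>,
  committedPath: string,
): string {
  const tempPath = env["TEMP_BASELINES_PATH"];
  return tempPath !== undefined && tempPath.length > 0 ? tempPath : committedPath;
}

// Local run: the wrapper script is assumed to have exported the variable.
// The temporary file name here is made up for illustration.
const localPath = resolveBaselinesPath(
  { TEMP_BASELINES_PATH: "/tmp/perf-baselines.json" },
  "perf-tests/baselines.json",
);

// Run without the variable: the committed baselines are used.
const committedOnlyPath = resolveBaselinesPath({}, "perf-tests/baselines.json");
```

In a real test file the first argument would typically be `process.env`. Writing to a path outside the tracked files is what keeps the working tree clean during local comparisons.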
#### Nightly Build Health Audits
- Approach A is applied strictly every day, across platforms, on dedicated
environments. This avoids the "boiling frog" problem, where micro-regressions
quietly accumulate over time.
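
To make the "boiling frog" effect concrete, the sketch below simulates a steady 1% slowdown per day against a 15% tolerance. The numbers and names are purely illustrative and are not taken from the real baselines.

```typescript
// Illustrative values only: a 1% daily regression and the default 15% tolerance.
const tolerance = 0.15;
const dailyRegression = 0.01;
const staticBaseline = 100; // e.g. milliseconds, recorded once (Approach A)

// Returns the first day each strategy would fail, or null if it never does.
function simulateDrift(days: number): {
  staticDay: number | null;
  dayOverDay: number | null;
} {
  let previous = staticBaseline;
  let staticDay: number | null = null;
  let dayOverDay: number | null = null;
  for (let day = 1; day <= days; day++) {
    const current = previous * (1 + dailyRegression);
    // Approach A: compare against the fixed, recorded baseline.
    if (staticDay === null && current > staticBaseline * (1 + tolerance)) {
      staticDay = day;
    }
    // Approach B: compare against yesterday's measurement only.
    if (dayOverDay === null && current > previous * (1 + tolerance)) {
      dayOverDay = day;
    }
    previous = current;
  }
  return { staticDay, dayOverDay };
}

const drift = simulateDrift(30); // → { staticDay: 15, dayOverDay: null }
```

Under these assumptions the static baseline catches the drift on day 15, while the day-over-day comparison never fails even though the metric ends up roughly 35% slower after 30 days.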
---
## Running Tests
### Performance CPU Tests
```bash
# Run tests (compare against committed baselines)
npm run test:perf
# Verbose output
VERBOSE=true npm run test:perf
# Keep test artifacts for debugging
KEEP_OUTPUT=true npm run test:perf
```
### Memory Tests
```bash
# Run memory tests (compare against local main baselines)
npm run test:memory
```
---
## Architecture & Configuration
### Performance Tests Directory Tree
- `perf-tests/baselines.json`: Committed baseline values
- `perf-tests/globalSetup.ts`: Test environment setup
- `perf-tests/perf-usage.test.ts`: Test scenarios
- `perf-tests/perf.*.responses`: Fake API responses per scenario
### Memory Tests Directory Tree
- `memory-tests/baselines.json`: Committed memory values
- `memory-tests/memory-usage.test.ts`: Memory test scenarios
---
## CI Integration
These tests are excluded from `preflight` and are designed for nightly audits
only:
```yaml
- name: Performance regression tests
  run: npm run test:perf
```
---
## Adding New Scenarios
1. Add a fake response file: `perf.<scenario-name>.responses` or
`memory.<scenario-name>.responses`.
2. Add a test case in `perf-usage.test.ts` or `memory-usage.test.ts` that calls
`harness.runScenario()`.