# CPU Performance Integration Test Harness
## Overview
This directory contains performance/CPU integration tests for the Gemini CLI.
These tests measure wall-clock time, CPU usage, and event loop responsiveness to
detect regressions across key scenarios.
CPU performance is inherently noisy, especially in CI. The harness addresses
this with:
- **IQR outlier filtering** — discards anomalous samples
- **Median sampling** — takes N runs, reports the median after filtering
- **Warmup runs** — discards the first run to mitigate JIT compilation noise
- **15% default tolerance** — tolerates minor fluctuations instead of failing on every slight regression
## Running
```bash
# Run tests (compare against committed baselines)
npm run test:perf

# Update baselines (after intentional changes)
npm run test:perf:update-baselines

# Verbose output
VERBOSE=true npm run test:perf

# Keep test artifacts for debugging
KEEP_OUTPUT=true npm run test:perf
```
## How It Works
### Measurement Primitives
The `PerfTestHarness` class (in `packages/test-utils`) provides:
- **`performance.now()`** — high-resolution wall-clock timing
- **`process.cpuUsage()`** — user + system CPU microseconds (delta between
start/stop)
- **`perf_hooks.monitorEventLoopDelay()`** — event loop delay histogram
(p50/p95/p99/max)
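A minimal sketch of how these three primitives combine into one measurement (simplified; the real harness layers sampling and filtering on top, and the `measure` name and `Measurement` shape here are ours, not the harness API):

```typescript
import { performance, monitorEventLoopDelay } from "node:perf_hooks";

interface Measurement {
  wallClockMs: number;
  cpuTotalUs: number;
  eventLoopDelayP99Ms: number;
}

// Run one scenario and capture all three metrics around it.
async function measure(fn: () => Promise<void>): Promise<Measurement> {
  const histogram = monitorEventLoopDelay();
  histogram.enable();
  const cpuStart = process.cpuUsage();
  const t0 = performance.now();

  await fn();

  const wallClockMs = performance.now() - t0;
  const cpu = process.cpuUsage(cpuStart); // user/system delta since cpuStart
  histogram.disable();

  return {
    wallClockMs,
    cpuTotalUs: cpu.user + cpu.system,
    eventLoopDelayP99Ms: histogram.percentile(99) / 1e6, // histogram is in ns
  };
}
```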
### Noise Reduction
1. **Warmup**: First run is discarded to mitigate JIT compilation artifacts
2. **Multiple samples**: Each scenario runs N times (default 5)
3. **IQR filtering**: Samples outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR] are discarded
4. **Median**: The median of remaining samples is used for comparison
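Steps 3–4 can be sketched as follows (function names are ours; the quartile computation here uses simple index selection, which may differ from the harness's exact method):

```typescript
// Median of an already-sorted array.
function median(sorted: number[]): number {
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Discard IQR outliers, then return the median of what remains.
function filteredMedian(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = sorted[Math.floor(sorted.length / 4)];
  const q3 = sorted[Math.floor((3 * sorted.length) / 4)];
  const iqr = q3 - q1;
  const kept = sorted.filter(
    (s) => s >= q1 - 1.5 * iqr && s <= q3 + 1.5 * iqr,
  );
  return median(kept);
}
```

For example, with samples `[100, 102, 98, 101, 500]`, the 500 ms outlier falls outside the IQR fence and is dropped before the median is taken.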
### Baseline Management
Baselines are stored in `baselines.json` in this directory. Each scenario has:
```json
{
  "cold-startup-time": {
    "wallClockMs": 1234.5,
    "cpuTotalUs": 567890,
    "eventLoopDelayP99Ms": 12.3,
    "timestamp": "2026-04-08T..."
  }
}
```
Tests fail if the measured value exceeds `baseline × 1.15` (15% tolerance).
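The comparison itself reduces to a one-liner (a sketch; the `TOLERANCE` and `withinBaseline` names are ours):

```typescript
const TOLERANCE = 0.15; // 15% default

// A scenario passes when the measured value does not exceed baseline × 1.15.
function withinBaseline(measured: number, baseline: number): boolean {
  return measured <= baseline * (1 + TOLERANCE);
}
```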
To recalibrate after intentional changes:
```bash
npm run test:perf:update-baselines
# then commit baselines.json
```
### Report Output
After all tests, the harness prints an ASCII summary:
```
═══════════════════════════════════════════════════
PERFORMANCE TEST REPORT
═══════════════════════════════════════════════════
cold-startup-time: 1234.5 ms (Baseline: 1200.0 ms, Delta: +2.9%) ✅
idle-cpu-usage: 2.1 % (Baseline: 2.0 %, Delta: +5.0%) ✅
skill-loading-time: 1567.8 ms (Baseline: 1500.0 ms, Delta: +4.5%) ✅
```
## Architecture
```
perf-tests/
├── README.md ← you are here
├── baselines.json ← committed baseline values
├── globalSetup.ts ← test environment setup
├── perf-usage.test.ts ← test scenarios
├── perf.*.responses ← fake API responses per scenario
├── tsconfig.json ← TypeScript config
└── vitest.config.ts ← vitest config (serial, isolated)
packages/test-utils/src/
├── perf-test-harness.ts ← PerfTestHarness class
└── index.ts ← re-exports
```
## CI Integration
These tests are **excluded from `preflight`** and designed for nightly CI:
```yaml
- name: Performance regression tests
  run: npm run test:perf
```
## Adding a New Scenario
1. Add a fake response file: `perf.<scenario-name>.responses`
2. Add a test case in `perf-usage.test.ts` using `harness.runScenario()`
3. Run `npm run test:perf:update-baselines` to establish initial baseline
4. Commit the updated `baselines.json`