test(perf): overhaul performance and memory baseline management

Automates performance and memory baseline management. Includes GitHub Actions workflows for remote updates, automatic local comparisons against main, and git-ignored temporary baselines.

- Added the update-baselines.yml GitHub Action to automate remote baseline updates in CI.
- Created scripts/run-perf-tests.js to wrap performance runs; when run without arguments, it stashes uncommitted changes and gathers main-branch baselines locally.
- Enhanced PerfTestHarness and MemoryTestHarness to support tolerance-limit assertions.
- Updated test files to honor the TEMP_BASELINES_PATH environment variable, keeping tracked files clean during local runs.
- Added docs/performance-and-memory-testing.md as the central description of the testing strategy.
- Deleted the now-obsolete perf-tests/README.md and memory-tests/README.md.
- Added temporary baseline outputs to .gitignore and updated scripts/clean.js to remove them on npm run clean.
2026-07-22 15:51:18 -07:00 · 2026-04-16 16:10:31 -07:00
parent daf5006237
commit 6355e2d8a1
17 changed files with 650 additions and 136 deletions
@@ -0,0 +1,110 @@
+# Performance & Memory Testing Infrastructure
+
+## Overview
+
+Gemini CLI has a performance and memory regression testing pipeline. To reduce
+noise and yield accurate results, the harness applies:
+
+- **IQR Outlier Filtering**: Discards anomalous samples before evaluation.
+- **Median Sampling**: Takes `N` runs and evaluates the median rather than the
+  mean.
+- **Warmup Runs**: Discards the first samples to avoid JIT artifacts.
+- **Tolerance Boundary**: A default 15% tolerance prevents false alarms from
+  normal run-to-run variation.
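+
+The four techniques above can be combined roughly as in the sketch below. This
+is an illustration only, not the harness's actual code: the function names, the
+1.5 x IQR fence, the single warmup run, and the one-sided tolerance check are
+all assumptions.
+
+```typescript
+// Hypothetical helpers; none of these names come from PerfTestHarness.
+
+/** Median of a non-empty numeric array. */
+function median(values: number[]): number {
+  const sorted = [...values].sort((a, b) => a - b);
+  const mid = Math.floor(sorted.length / 2);
+  return sorted.length % 2 === 0
+    ? (sorted[mid - 1] + sorted[mid]) / 2
+    : sorted[mid];
+}
+
+/** Linearly interpolated quantile (q in [0, 1]) of an already sorted array. */
+function quantile(sorted: number[], q: number): number {
+  const pos = (sorted.length - 1) * q;
+  const lo = Math.floor(pos);
+  const hi = Math.ceil(pos);
+  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
+}
+
+/** Keep only samples inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] (assumed fence). */
+function filterOutliersIqr(values: number[]): number[] {
+  const sorted = [...values].sort((a, b) => a - b);
+  const q1 = quantile(sorted, 0.25);
+  const q3 = quantile(sorted, 0.75);
+  const iqr = q3 - q1;
+  return sorted.filter((v) => v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr);
+}
+
+/** Drop warmup samples, filter outliers, then take the median. */
+function summarize(samples: number[], warmupRuns = 1): number {
+  return median(filterOutliersIqr(samples.slice(warmupRuns)));
+}
+
+/** True when the measurement is at most `tolerance` above the baseline. */
+function withinTolerance(
+  measured: number,
+  baseline: number,
+  tolerance = 0.15,
+): boolean {
+  return measured <= baseline * (1 + tolerance);
+}
+```
+
+With samples such as `[500, 100, 102, 98, 101, 99, 400]`, the first value is
+dropped as warmup and `400` falls outside the IQR fence, so the summary is the
+median of the remaining five runs.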
+
+---
+
+## Baseline Management
+
+There are two core strategies for calibrating tolerances on performance
+benchmarks:
+
+- **Approach A: Normalize for Testing Servers**: Tests run directly on the
+  automated cloud servers, and those scores are recorded as official, static
+  baselines.
+- **Approach B: Machine-Agnostic Daily Comparisons**: Static baseline files are
+  ignored. Every night, the test is run against today's and yesterday's code on
+  the exact same server.
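+
+To make Approach B concrete: because both versions run on the same machine, the
+check reduces to a relative change between two measurements, with no stored
+baseline. The sketch below is a hypothetical illustration; the function names
+and the reuse of the 15% tolerance here are assumptions.
+
+```typescript
+// Hypothetical same-machine comparison for Approach B.
+
+/** Fractional change from the previous measurement to the current one. */
+function relativeChange(previous: number, current: number): number {
+  if (previous <= 0) {
+    throw new RangeError("previous measurement must be positive");
+  }
+  return (current - previous) / previous;
+}
+
+/** Flags a regression when the current run is slower by more than `tolerance`. */
+function isRegression(
+  previous: number,
+  current: number,
+  tolerance = 0.15,
+): boolean {
+  return relativeChange(previous, current) > tolerance;
+}
+```
+
+Under Approach A, the `previous` value would instead come from a committed
+baseline file recorded on the CI servers.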
+
+### Recommended Strategy: GitHub Action + Approach A
+
+#### Local Development & PR Checks
+
+- **Local Testing**: To quickly check the performance or memory impact of your
+  changes, run the standard local perf or memory tests without arguments. The
+  harness automatically stashes uncommitted changes, regenerates baselines from
+  the latest `main` branch into untracked temporary files, and reports the
+  comparison immediately.
+- **PR Merges**: If your changes intentionally shift baseline metrics, trigger
+  the GitHub Action to recalibrate the baselines when you merge your PR, so
+  that subsequent nightly audits compare against the updated baselines.
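+
+One way a test file might choose between the committed baselines and the
+temporary local ones is sketched below. `TEMP_BASELINES_PATH` is the
+environment variable named in this change, but the helper name and the exact
+fallback behavior are assumptions, not the actual test code.
+
+```typescript
+// Hypothetical helper: prefer the temporary baselines file when one is set.
+function resolveBaselinesPath(
+  testDir: string,
+  env: Record<string, string | undefined>,
+): string {
+  const override = env["TEMP_BASELINES_PATH"];
+  // Fall back to the committed file so tracked baselines stay the default.
+  return override !== undefined && override.length > 0
+    ? override
+    : `${testDir}/baselines.json`;
+}
+```
+
+In a test file this would be called with `process.env` as the second argument,
+so local runs read and write the untracked file and leave the tracked
+`baselines.json` unmodified.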
+
+#### Nightly Build Health Audits
+
+- Approach A is applied strictly every day, across platforms, on dedicated
+  environments. This avoids the "boiling frog" problem, in which
+  micro-regressions quietly slip past over time.
+
+---
+
+## Running Tests
+
+### Performance CPU Tests
+
+```bash
+# Run tests (compare against committed baselines)
+npm run test:perf
+
+# Verbose output
+VERBOSE=true npm run test:perf
+
+# Keep test artifacts for debugging
+KEEP_OUTPUT=true npm run test:perf
+```
+
+### Memory Tests
+
+```bash
+# Run memory tests (compare against local main baselines)
+npm run test:memory
+```
+
+---
+
+## Architecture & Configuration
+
+### Performance Tests Directory Tree
+
+- `perf-tests/baselines.json`: Committed baseline values
+- `perf-tests/globalSetup.ts`: Test environment setup
+- `perf-tests/perf-usage.test.ts`: Test scenarios
+- `perf-tests/perf.*.responses`: Fake API responses per scenario
+
+### Memory Tests Directory Tree
+
+- `memory-tests/baselines.json`: Committed memory values
+- `memory-tests/memory-usage.test.ts`: Memory test scenarios
+
+---
+
+## CI Integration
+
+These tests are excluded from `preflight` and are designed to run only in
+nightly audits:
+
+```yaml
+- name: Performance regression tests
+  run: npm run test:perf
+```
+
+---
+
+## Adding New Scenarios
+
+1. Add a fake response file: `perf.<scenario-name>.responses` or
+   `memory.<scenario-name>.responses`.
+2. Add a test case in `perf-usage.test.ts` or `memory-usage.test.ts` that calls
+   `harness.runScenario()`.