mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-05-15 14:23:02 -07:00
6355e2d8a1
Comprehensive automation upgrades for performance and memory baselines, including a GitHub Actions workflow for remote baseline updates, automatic local comparisons against main, and git-ignored temporary baselines. - Added an update-baselines.yml GitHub Action to automate remote baseline updates in CI. - Created scripts/run-perf-tests.js to wrap performance test runs, stashing dirty changes and gathering main-branch baselines locally when run without arguments. - Extended PerfTestHarness and MemoryTestHarness to support tolerance-bound assertions. - Updated test files to honor the TEMP_BASELINES_PATH environment variable, keeping tracked files clean during local runs. - Added docs/performance-and-memory-testing.md as central documentation for the testing strategy. - Deleted the now-obsolete perf-tests/README.md and memory-tests/README.md. - Registered temporary baseline outputs in .gitignore and updated scripts/clean.js to remove them on npm run clean.
3.3 KiB
Performance & Memory Testing Infrastructure
Overview
Gemini CLI features a performance and memory regression testing pipeline. To curb anomalies and yield accurate results, the harness applies:
- IQR Outlier Filtering: Discards anomalous samples before evaluation.
- Median Sampling: Runs each scenario N times and evaluates the median, which is robust to outliers.
- Warmup Runs: Discards the first samples to avoid JIT warm-up artifacts.
- Tolerance Bounds: A default 15% tolerance prevents false alarms from normal run-to-run variance.
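The filtering and sampling steps above can be sketched as follows. This is a minimal illustration, and the function names are hypothetical, not the harness's actual API:

```javascript
// Sketch of IQR outlier filtering and median sampling.
// Function names are illustrative, not the harness's real API.

function quantile(sorted, q) {
  // Linear-interpolation quantile over a pre-sorted array.
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function filterOutliersIQR(samples) {
  // Discard samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return sorted.filter((s) => s >= q1 - 1.5 * iqr && s <= q3 + 1.5 * iqr);
}

function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Example: one slow outlier (900 ms) is dropped before taking the median.
const samples = [100, 102, 98, 101, 99, 900];
const filtered = filterOutliersIQR(samples);
const result = median(filtered); // 100
```

Combining both steps means a single pathological run (GC pause, machine noise) cannot shift the reported metric.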
Baseline Management
There are two common strategies for establishing baselines for performance benchmarks:
- Approach A: Normalize for Testing Servers: Tests run directly on the automated cloud servers, and those scores are recorded as official, static baselines.
- Approach B: Machine-Agnostic Daily Comparisons: Static baseline files are ignored. Every night, the test is run against today's and yesterday's code on the exact same server.
Recommended Strategy: GitHub Action + Approach A
Local Development & PR Checks
- Local Testing: To quickly check your code changes for performance or memory impact, run the standard local perf or memory tests without arguments. The harness automatically stashes dirty changes, regenerates baselines against the latest main branch using untracked temporary files, and reports an immediate comparison.
- PR Merges: If your changes intentionally shift baseline metrics, trigger the GitHub Action to recalibrate baselines when merging your PR, so that subsequent nightly audits compare against the new values.
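The temporary-baseline mechanism can be approximated as follows. This is a hedged sketch: the resolver function name and the `.tmp-baselines/` path are assumptions based on the description above, not the exact implementation:

```javascript
// Sketch: tests resolve baselines from TEMP_BASELINES_PATH when the local
// wrapper has generated fresh main-branch baselines, and fall back to the
// committed, tracked file otherwise. Function name is hypothetical.
function resolveBaselinesPath(committedPath, env = process.env) {
  // TEMP_BASELINES_PATH points at git-ignored baselines built from main,
  // so local comparisons never dirty the tracked baselines.json.
  if (env.TEMP_BASELINES_PATH) {
    return env.TEMP_BASELINES_PATH;
  }
  return committedPath;
}

// Local run with fresh main-branch baselines (hypothetical temp path):
const localRun = resolveBaselinesPath('perf-tests/baselines.json', {
  TEMP_BASELINES_PATH: '.tmp-baselines/perf.json',
});

// CI / nightly run against the committed baselines:
const ciRun = resolveBaselinesPath('perf-tests/baselines.json', {});
```

Because the temporary path is git-ignored, local comparisons leave the working tree clean.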
Nightly Build Health Audits
- Strict Approach A procedures run daily across platforms on dedicated environments, avoiding the "boiling frog" problem where micro-regressions quietly accumulate over time.
Running Tests
Performance CPU Tests
# Run tests (compare against committed baselines)
npm run test:perf
# Verbose output
VERBOSE=true npm run test:perf
# Keep test artifacts for debugging
KEEP_OUTPUT=true npm run test:perf
Memory Tests
# Run memory tests (compare against local main baselines)
npm run test:memory
Architecture & Configuration
Performance Tests Directory Tree
- perf-tests/baselines.json: Committed baseline values
- perf-tests/globalSetup.ts: Test environment setup
- perf-tests/perf-usage.test.ts: Test scenarios
- perf-tests/perf.*.responses: Fake API responses per scenario
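To illustrate how a committed baseline feeds the tolerance check, here is a hedged sketch. The baseline shape, scenario name, and metric field are hypothetical, not the actual schema of perf-tests/baselines.json:

```javascript
// Hypothetical baseline shape; the real baselines.json schema may differ.
const baselines = {
  'simple-prompt': { durationMs: 1200 },
};

// Default 15% tolerance, matching the bound described above.
function withinTolerance(scenario, measuredMs, tolerance = 0.15) {
  const baseline = baselines[scenario].durationMs;
  // A regression is flagged only when the measurement exceeds
  // baseline * (1 + tolerance); faster runs always pass.
  return measuredMs <= baseline * (1 + tolerance);
}

const pass = withinTolerance('simple-prompt', 1350); // 12.5% slower: within bound
const fail = withinTolerance('simple-prompt', 1400); // ~16.7% slower: regression
```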
Memory Tests Directory Tree
- memory-tests/baselines.json: Committed memory baseline values
- memory-tests/memory-usage.test.ts: Memory test scenarios
CI Integration
These tests are excluded from preflight checks and run only in the nightly audits:
- name: Performance regression tests
run: npm run test:perf
Adding New Scenarios
- Add a fake response file: perf.<scenario-name>.responses or memory.<scenario-name>.responses.
- Add a test case in perf-usage.test.ts or memory-usage.test.ts using harness.runScenario().
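A new scenario test might look roughly like the following. This is a sketch only: the runScenario signature and result shape are assumptions, and a minimal synchronous stub stands in for the real PerfTestHarness:

```javascript
// Stub standing in for PerfTestHarness; the real runScenario signature
// and result shape are assumptions, not the actual API.
const harness = {
  runScenario(name) {
    // The real harness would replay perf.<name>.responses, take warmup and
    // measured runs, filter outliers, and compare against the baseline.
    return { scenario: name, medianMs: 1234, withinTolerance: true };
  },
};

// Mirrors the shape of a test case in perf-usage.test.ts:
const result = harness.runScenario('my-new-scenario');
if (!result.withinTolerance) {
  throw new Error(`Performance regression in ${result.scenario}`);
}
```

The matching `perf.my-new-scenario.responses` file would supply the fake API responses the scenario replays.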