mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-05-15 14:23:02 -07:00
6355e2d8a1
Comprehensive automation upgrades for performance and memory baselines, including a GitHub Actions workflow for remote baseline updates, automatic local comparisons against main, and git-ignored temporary baselines. - Added an update-baselines.yml GitHub Action to automate remote baseline updates in CI. - Created scripts/run-perf-tests.js to wrap performance test runs, stashing dirty changes and gathering main-branch baselines locally when run without arguments. - Extended PerfTestHarness and MemoryTestHarness to support tolerance-bound assertions. - Updated test files to honor the TEMP_BASELINES_PATH environment variable, keeping tracked files clean during local runs. - Added docs/performance-and-memory-testing.md as central documentation for the testing strategy. - Deleted the now-obsolete perf-tests/README.md and memory-tests/README.md. - Registered temporary baseline outputs in .gitignore and updated scripts/clean.js to remove them on npm run clean.
3.3 KiB
Performance & Memory Testing Infrastructure
Overview
Gemini CLI features a performance and memory regression testing pipeline. To curb anomalies and yield accurate results, the harness applies:
- IQR Outlier Filtering: Discards anomalous samples before evaluation.
- Median Sampling: Runs each scenario N times and evaluates the median, which is robust to outliers.
- Warmup Runs: Discards the first samples to avoid JIT warm-up artifacts.
- Tolerance Bounds: A default 15% tolerance prevents false alarms from normal run-to-run variance.
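The filtering and sampling steps above can be sketched as follows. This is a minimal illustration, and the function names are hypothetical, not the harness's actual API:

```javascript
// Sketch of IQR outlier filtering and median sampling.
// Function names are illustrative, not the harness's real API.

function quantile(sorted, q) {
  // Linear-interpolation quantile over a pre-sorted array.
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function filterOutliersIQR(samples) {
  // Discard samples outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return sorted.filter((s) => s >= q1 - 1.5 * iqr && s <= q3 + 1.5 * iqr);
}

function median(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Example: one slow outlier (900 ms) is dropped before taking the median.
const samples = [100, 102, 98, 101, 99, 900];
const filtered = filterOutliersIQR(samples);
const result = median(filtered); // 100
```

Combining both steps means a single pathological run (GC pause, machine noise) cannot shift the reported metric.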
Baseline Management
There are two common strategies for establishing baselines for performance benchmarks:
- Approach A: Normalize for Testing Servers: Tests run directly on the automated cloud servers, and those scores are recorded as official, static baselines.
- Approach B: Machine-Agnostic Daily Comparisons: Static baseline files are ignored. Every night, the test is run against today's and yesterday's code on the exact same server.
Recommended Strategy: GitHub Action + Approach A
Local Development & PR Checks
- Local Testing: To quickly check your code changes for performance or memory impact, run the standard local perf or memory tests without arguments. The harness automatically stashes dirty changes, regenerates baselines against the latest main branch using untracked temporary files, and reports an immediate comparison.
- PR Merges: If your changes intentionally shift baseline metrics, trigger the GitHub Action to recalibrate baselines when merging your PR, so that subsequent nightly audits compare against the new values.
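The temporary-baseline mechanism can be approximated as follows. This is a hedged sketch: the resolver function name and the `.tmp-baselines/` path are assumptions based on the description above, not the exact implementation:

```javascript
// Sketch: tests resolve baselines from TEMP_BASELINES_PATH when the local
// wrapper has generated fresh main-branch baselines, and fall back to the
// committed, tracked file otherwise. Function name is hypothetical.
function resolveBaselinesPath(committedPath, env = process.env) {
  // TEMP_BASELINES_PATH points at git-ignored baselines built from main,
  // so local comparisons never dirty the tracked baselines.json.
  if (env.TEMP_BASELINES_PATH) {
    return env.TEMP_BASELINES_PATH;
  }
  return committedPath;
}

// Local run with fresh main-branch baselines (hypothetical temp path):
const localRun = resolveBaselinesPath('perf-tests/baselines.json', {
  TEMP_BASELINES_PATH: '.tmp-baselines/perf.json',
});

// CI / nightly run against the committed baselines:
const ciRun = resolveBaselinesPath('perf-tests/baselines.json', {});
```

Because the temporary path is git-ignored, local comparisons leave the working tree clean.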
Nightly Build Health Audits
- Strict Approach A procedures run daily across platforms on dedicated environments, avoiding the "boiling frog" problem where micro-regressions quietly accumulate over time.
Running Tests
Performance CPU Tests
# Run tests (compare against committed baselines)
npm run test:perf
# Verbose output
VERBOSE=true npm run test:perf
# Keep test artifacts for debugging
KEEP_OUTPUT=true npm run test:perf
Memory Tests
# Run memory tests (compare against local main baselines)
npm run test:memory
Architecture & Configuration
Performance Tests Directory Tree
- perf-tests/baselines.json: Committed baseline values
- perf-tests/globalSetup.ts: Test environment setup
- perf-tests/perf-usage.test.ts: Test scenarios
- perf-tests/perf.*.responses: Fake API responses per scenario
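To illustrate how a committed baseline feeds the tolerance check, here is a hedged sketch. The baseline shape, scenario name, and metric field are hypothetical, not the actual schema of perf-tests/baselines.json:

```javascript
// Hypothetical baseline shape; the real baselines.json schema may differ.
const baselines = {
  'simple-prompt': { durationMs: 1200 },
};

// Default 15% tolerance, matching the bound described above.
function withinTolerance(scenario, measuredMs, tolerance = 0.15) {
  const baseline = baselines[scenario].durationMs;
  // A regression is flagged only when the measurement exceeds
  // baseline * (1 + tolerance); faster runs always pass.
  return measuredMs <= baseline * (1 + tolerance);
}

const pass = withinTolerance('simple-prompt', 1350); // 12.5% slower: within bound
const fail = withinTolerance('simple-prompt', 1400); // ~16.7% slower: regression
```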
Memory Tests Directory Tree
- memory-tests/baselines.json: Committed memory baseline values
- memory-tests/memory-usage.test.ts: Memory test scenarios
CI Integration
These tests are excluded from preflight checks and run only in the nightly audits:
- name: Performance regression tests
run: npm run test:perf
Adding New Scenarios
- Add a fake response file: perf.<scenario-name>.responses or memory.<scenario-name>.responses.
- Add a test case in perf-usage.test.ts or memory-usage.test.ts using harness.runScenario().
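A new scenario test might look roughly like the following. This is a sketch only: the runScenario signature and result shape are assumptions, and a minimal synchronous stub stands in for the real PerfTestHarness:

```javascript
// Stub standing in for PerfTestHarness; the real runScenario signature
// and result shape are assumptions, not the actual API.
const harness = {
  runScenario(name) {
    // The real harness would replay perf.<name>.responses, take warmup and
    // measured runs, filter outliers, and compare against the baseline.
    return { scenario: name, medianMs: 1234, withinTolerance: true };
  },
};

// Mirrors the shape of a test case in perf-usage.test.ts:
const result = harness.runScenario('my-new-scenario');
if (!result.withinTolerance) {
  throw new Error(`Performance regression in ${result.scenario}`);
}
```

The matching `perf.my-new-scenario.responses` file would supply the fake API responses the scenario replays.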