From 8cec56706476e73db5a12df6080d6e1b09fc3c56 Mon Sep 17 00:00:00 2001
From: Samee Zahid
Date: Wed, 29 Apr 2026 16:08:54 -0700
Subject: [PATCH] docs(core): add automated gemma setup guide (#26233)

Co-authored-by: Samee Zahid
---
 docs/cli/model-routing.md        |  6 +--
 docs/core/gemma-setup.md         | 83 ++++++++++++++++++++++++++++++++
 docs/core/index.md               |  5 +-
 docs/core/local-model-routing.md | 21 +++++---
 4 files changed, 103 insertions(+), 12 deletions(-)
 create mode 100644 docs/core/gemma-setup.md

diff --git a/docs/cli/model-routing.md b/docs/cli/model-routing.md
index c9ec073a64..a29dd98a9b 100644
--- a/docs/cli/model-routing.md
+++ b/docs/cli/model-routing.md
@@ -34,11 +34,11 @@ Gemini CLI will use a locally-running **Gemma** model to make routing decisions
reduce costs associated with hosted model usage while offering similar routing
decision latency and quality.

-In order to use this feature, the local Gemma model **must** be served behind a
-Gemini API and accessible via HTTP at an endpoint configured in `settings.json`.
+The easiest way to set this up is to use the automated `gemini gemma setup`
+command.

For more details on how to configure local model routing, see
-[Local Model Routing](../core/local-model-routing.md).
+[`gemini gemma` — Automated Local Model Routing Setup](../core/gemma-setup.md).

### Model selection precedence

diff --git a/docs/core/gemma-setup.md b/docs/core/gemma-setup.md
new file mode 100644
index 0000000000..166bdc885b
--- /dev/null
+++ b/docs/core/gemma-setup.md
@@ -0,0 +1,83 @@
# `gemini gemma` — Automated Local Model Routing Setup

Local model routing uses a local Gemma 3 1B model running on your machine to
classify user requests, routing simple requests (like file reads) to Gemini
Flash and complex requests (like architecture discussions) to Gemini Pro.

> [!NOTE]
> This is an experimental feature currently under active development.

## What is this?

This feature saves cloud API costs by using local inference for task
classification instead of a cloud-based classifier. It adds roughly 100 ms of
local latency but can significantly reduce the overall token usage for hosted
models.

## Quick start

```bash
# One command does everything: downloads runtime, pulls model, configures settings, starts server
gemini gemma setup
```

You'll be prompted to accept the Gemma Terms of Use. The model is ~1 GB.

After setup, **just use the CLI normally** — routing happens automatically on
every request.

## Commands

| Command               | What it does                                                   |
| --------------------- | -------------------------------------------------------------- |
| `gemini gemma setup`  | Full install (binary + model + settings + server start)        |
| `gemini gemma status` | Health check — shows what's installed and running              |
| `gemini gemma start`  | Start the LiteRT server (auto-starts on CLI launch by default) |
| `gemini gemma stop`   | Stop the LiteRT server                                         |
| `gemini gemma logs`   | Tail the server logs to see routing requests live              |
| `/gemma`              | In-session status check (type it inside the CLI)               |

## Verifying it works

1. Run `gemini gemma status` — all checks should show green
2. Open two terminals:
   - Terminal 1: `gemini gemma logs` (watch for incoming requests)
   - Terminal 2: use the CLI normally
3. You should see classification requests appear in the logs as you interact
   with the CLI
4. Type `/gemma` inside a session to see a quick status panel
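
For a direct smoke test of the server itself, you can send it a request by
hand. This is a minimal sketch that makes three assumptions you should verify
against `gemini gemma status` and the logs: the server listens on port 8080,
it exposes the standard Gemini API `generateContent` route, and the model is
named `gemma-3-1b-it`:

```bash
# Hypothetical direct request to the local LiteRT server. The port, route, and
# model name are assumptions; confirm them with `gemini gemma status`.
curl -s http://localhost:8080/v1beta/models/gemma-3-1b-it:generateContent \
  -H 'Content-Type: application/json' \
  -d '{"contents": [{"parts": [{"text": "read the file README.md"}]}]}'
```

A JSON response (rather than a connection error) means the server is up and
answering Gemini-API-style requests.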
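
You can also exercise the silent-fallback path described below under "How it
works under the hood" using only the documented commands:

```bash
# Take the local server down; the CLI should keep working, routing via the
# cloud classifier with no errors or disruption.
gemini gemma stop

# Use the CLI normally here, then bring local routing back:
gemini gemma start
gemini gemma status   # all checks should show green again
```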

## Setup flags

```bash
gemini gemma setup --port 8080    # custom port
gemini gemma setup --no-start     # don't start the server after install
gemini gemma setup --force       # re-download everything
gemini gemma setup --skip-model   # binary only, skip the ~1 GB model download
```

## How it works under the hood

- Local Gemma classifies each request as "simple" or "complex" (~100 ms)
- Simple → Flash, Complex → Pro
- If the local server is down, the CLI silently falls back to the cloud
  classifier — no errors, no disruption

## Disabling

Set `enabled: false` in settings to turn the feature off, or run
`gemini gemma stop` to stop just the server:

```json
{ "experimental": { "gemmaModelRouter": { "enabled": false } } }
```

## Advanced setup

If you are in an environment where the `gemini gemma setup` command cannot
automatically download binaries (for example, behind a strict corporate
firewall), you can perform the setup manually.

For more information, see the
[Manual Local Model Routing Setup guide](./local-model-routing.md).
diff --git a/docs/core/index.md b/docs/core/index.md
index 2724e8e922..ca10cc6e48 100644
--- a/docs/core/index.md
+++ b/docs/core/index.md
@@ -15,8 +15,9 @@ requests sent from `packages/cli`. For a general overview of Gemini CLI, see the
  modular GEMINI.md import feature using @file.md syntax.
- **[Policy Engine](../reference/policy-engine.md):** Use the Policy Engine for
  fine-grained control over tool execution.
-- **[Local Model Routing (experimental)](./local-model-routing.md):** Learn how
-  to enable use of a local Gemma model for model routing decisions.
+- **[Local Model Routing (experimental)](./gemma-setup.md):** Learn how to
+  enable the use of a local Gemma model for model routing decisions using the
+  automated setup command.

## Role of the core

diff --git a/docs/core/local-model-routing.md b/docs/core/local-model-routing.md
index 220ee13c46..3ab3709ed1 100644
--- a/docs/core/local-model-routing.md
+++ b/docs/core/local-model-routing.md
@@ -1,22 +1,29 @@
-# Local Model Routing (experimental)
+# Manual Local Model Routing Setup (experimental)

Gemini CLI supports using a local model for
[routing decisions](../cli/model-routing.md). When configured, Gemini CLI will
use a locally-running **Gemma** model to make routing decisions (instead of
sending routing decisions to a hosted model).
+
+> [!NOTE]
+> This is an experimental feature currently under active development.
+
+> [!IMPORTANT]
+> **Recommended:** We now provide a fully automated setup command. Use the
+> [`gemini gemma` Setup Guide](./gemma-setup.md) instead of following these
+> manual steps.

This feature can help reduce costs associated with hosted model usage while
offering similar routing decision latency and quality.

-> **Note: Local model routing is currently an experimental feature.**
-
-## Setup
+## Manual Setup

Using a Gemma model for routing decisions requires that an implementation of a
Gemma model be running locally on your machine, served behind an HTTP endpoint
-and accessed via the Gemini API.
-
-To serve the Gemma model, follow these steps:
+and accessed via the Gemini API. If you cannot use the `gemini gemma setup`
+command, follow these manual steps:

### Download the LiteRT-LM runtime