mirror of
https://github.com/google-gemini/gemini-cli.git
synced 2026-03-13 07:30:52 -07:00
docs(local model routing): add docs on how to use Gemma for local model routing (#21365)
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com> Co-authored-by: Allen Hutchison <adh@google.com> Co-authored-by: matt korwel <matt.korwel@gmail.com>
@@ -26,6 +26,20 @@ policies.
the CLI will use an available fallback model for the current turn or the
remainder of the session.

### Local Model Routing (Experimental)

Gemini CLI supports using a local model for routing decisions. When configured,
Gemini CLI will use a locally running **Gemma** model to make routing decisions
(instead of sending routing requests to a hosted model). This feature can help
reduce costs associated with hosted model usage while offering similar routing
decision latency and quality.

In order to use this feature, the local Gemma model **must** be served behind a
Gemini API-compatible endpoint that is reachable over HTTP and configured in
`settings.json`.

For more details on how to configure local model routing, see
[Local Model Routing](../core/local-model-routing.md).

### Model selection precedence

The model used by Gemini CLI is determined by the following order of precedence:
@@ -38,5 +52,8 @@ The model used by Gemini CLI is determined by the following order of precedence:
3. **`model.name` in `settings.json`:** If neither of the above are set, the
   model specified in the `model.name` property of your `settings.json` file
   will be used.
4. **Default model:** If none of the above are set, the default model will be
4. **Local model (experimental):** If the Gemma local model router is enabled
   in your `settings.json` file, the CLI will use the local Gemma model
   (instead of Gemini models) to route the request to an appropriate model.
5. **Default model:** If none of the above are set, the default model will be
   used. The default model is `auto`

@@ -15,6 +15,8 @@ requests sent from `packages/cli`. For a general overview of Gemini CLI, see the
  modular GEMINI.md import feature using @file.md syntax.
- **[Policy Engine](../reference/policy-engine.md):** Use the Policy Engine for
  fine-grained control over tool execution.
- **[Local Model Routing (experimental)](./local-model-routing.md):** Learn how
  to enable use of a local Gemma model for model routing decisions.

## Role of the core

193
docs/core/local-model-routing.md
Normal file
@@ -0,0 +1,193 @@
# Local Model Routing (experimental)

Gemini CLI supports using a local model for
[routing decisions](../cli/model-routing.md). When configured, Gemini CLI will
use a locally running **Gemma** model to make routing decisions (instead of
sending routing requests to a hosted model).

This feature can help reduce costs associated with hosted model usage while
offering similar routing decision latency and quality.

> **Note: Local model routing is currently an experimental feature.**

## Setup

Using a Gemma model for routing decisions requires that an implementation of a
Gemma model be running locally on your machine, served behind an HTTP endpoint
and accessed via the Gemini API.

To serve the Gemma model, follow these steps:

### Download the LiteRT-LM runtime

The [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) runtime offers
pre-built binaries for serving models locally. Download the binary appropriate
for your system.
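
As a rough aid for picking the right download, the sketch below maps the current platform to the binary file names used in this guide. The helper name `lit_binary_name` and the normalization of CPU names are assumptions made for illustration; only the three file names come from the release links in this document.

```python
import platform

# File names as they appear in this guide's download links.
LIT_BINARIES = {
    ("windows", "x86_64"): "lit.windows_x86_64.exe",
    ("linux", "x86_64"): "lit.linux_x86_64",
    ("darwin", "arm64"): "lit.macos_arm64",
}


def lit_binary_name(system=None, machine=None):
    """Return the binary file name for an OS/CPU pair (hypothetical helper)."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    # Windows reports x86_64 CPUs as "AMD64"; some ARM systems report "aarch64".
    machine = {"amd64": "x86_64", "aarch64": "arm64"}.get(machine, machine)
    try:
        return LIT_BINARIES[(system, machine)]
    except KeyError:
        raise ValueError(
            f"no binary listed in this guide for {system}/{machine}"
        ) from None
```

Calling `lit_binary_name()` with no arguments inspects the current machine; a `ValueError` means this guide does not list a binary for that platform.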

#### Windows

1. Download
   [lit.windows_x86_64.exe](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.windows_x86_64.exe).
2. Using GPU on Windows requires the DirectXShaderCompiler. Download the
   [dxc zip from the latest release](https://github.com/microsoft/DirectXShaderCompiler/releases/download/v1.8.2505.1/dxc_2025_07_14.zip).
   Unzip the archive and, from the architecture-appropriate `bin\` directory,
   copy `dxil.dll` and `dxcompiler.dll` into the same location where you saved
   `lit.windows_x86_64.exe`.
3. (Optional) Test starting the runtime:
   `.\lit.windows_x86_64.exe serve --verbose`

#### Linux

1. Download
   [lit.linux_x86_64](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.linux_x86_64).
2. Ensure the binary is executable: `chmod a+x lit.linux_x86_64`
3. (Optional) Test starting the runtime: `./lit.linux_x86_64 serve --verbose`

#### MacOS

1. Download
   [lit.macos_arm64](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.macos_arm64).
2. Ensure the binary is executable: `chmod a+x lit.macos_arm64`
3. (Optional) Test starting the runtime: `./lit.macos_arm64 serve --verbose`

> **Note**: MacOS can be configured to only allow binaries from "App Store &
> Known Developers". If you encounter an error message when attempting to run
> the binary, you will need to allow the application. One option is to visit
> `System Settings -> Privacy & Security`, scroll to `Security`, and click
> `"Allow Anyway"` for `"lit.macos_arm64"`. Another option is to run
> `xattr -d com.apple.quarantine lit.macos_arm64` from the command line.

### Download the Gemma Model

Before using Gemma, you will need to download the model (and agree to the Gemma
Terms of Use).

This can be done via the LiteRT-LM runtime.

#### Windows

```bash
$ .\lit.windows_x86_64.exe pull gemma3-1b-gpu-custom

[Legal] The model you are about to download is governed by
the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing.

Full Terms: https://ai.google.dev/gemma/terms
Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy

Do you accept these terms? (Y/N): Y

Terms accepted.
Downloading model 'gemma3-1b-gpu-custom' ...
Downloading... 968.6 MB
Download complete.
```

#### Linux

```bash
$ ./lit.linux_x86_64 pull gemma3-1b-gpu-custom

[Legal] The model you are about to download is governed by
the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing.

Full Terms: https://ai.google.dev/gemma/terms
Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy

Do you accept these terms? (Y/N): Y

Terms accepted.
Downloading model 'gemma3-1b-gpu-custom' ...
Downloading... 968.6 MB
Download complete.
```

#### MacOS

```bash
$ ./lit.macos_arm64 pull gemma3-1b-gpu-custom

[Legal] The model you are about to download is governed by
the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing.

Full Terms: https://ai.google.dev/gemma/terms
Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy

Do you accept these terms? (Y/N): Y

Terms accepted.
Downloading model 'gemma3-1b-gpu-custom' ...
Downloading... 968.6 MB
Download complete.
```

### Start LiteRT-LM Runtime

Using the command appropriate to your system, start the LiteRT-LM runtime.
Configure the port that you want to use for your Gemma model. For the purposes
of this document, we will use port `9379`.

Example command for MacOS: `./lit.macos_arm64 serve --port=9379 --verbose`
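
If you script the startup, it can help to wait until the chosen port accepts connections before sending requests. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, and it only checks that something is listening on the port, not that the model is ready to answer.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, False if `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a server is listening on host:port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the port used in this document, `wait_for_port("localhost", 9379)` would block for up to 30 seconds while the runtime starts.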

### (Optional) Verify Model Serving

Send a quick prompt to the model via HTTP to validate successful model serving.
This will cause the runtime to download the model and run it once.

You should see a short joke in the server output as an indicator of success.

#### Windows

```powershell
# Run this in PowerShell to send a request to the server

$uri = "http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent"
$body = @{contents = @( @{
  role = "user"
  parts = @( @{ text = "Tell me a joke." } )
})} | ConvertTo-Json -Depth 10

Invoke-RestMethod -Uri $uri -Method Post -Body $body -ContentType "application/json"
```

#### Linux/MacOS

```bash
$ curl "http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{"contents":[{"role":"user","parts":[{"text":"Tell me a joke."}]}]}'
```
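
The same check can be sketched in a platform-independent way with the Python standard library. The URL path and request body mirror the commands above; the helper names `build_generate_request` and `send` are hypothetical, and the response is returned as parsed JSON without assuming anything about its shape.

```python
import json
import urllib.request


def build_generate_request(host, model, prompt):
    """Build a POST request for the generateContent endpoint used above."""
    url = f"{host.rstrip('/')}/v1beta/models/{model}:generateContent"
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the runtime from the previous step listening on port `9379`, `send(build_generate_request("http://localhost:9379", "gemma3-1b-gpu-custom", "Tell me a joke."))` sends the same prompt as the commands above.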

## Configuration

To use a local Gemma model for routing, you must explicitly enable it in your
`settings.json`:

```json
{
  "experimental": {
    "gemmaModelRouter": {
      "enabled": true,
      "classifier": {
        "host": "http://localhost:9379",
        "model": "gemma3-1b-gpu-custom"
      }
    }
  }
}
```

> Use the port you started your LiteRT-LM runtime on in the setup steps.
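
If your `settings.json` already contains other settings, the block above has to be merged in rather than pasted over the file. The sketch below shows one way to do that on settings that have already been loaded into a dictionary; `enable_local_router` is a hypothetical helper, and it assumes the file is plain JSON without comments.

```python
import copy


def enable_local_router(settings, port, model="gemma3-1b-gpu-custom"):
    """Return a copy of `settings` with the gemmaModelRouter block enabled."""
    updated = copy.deepcopy(settings)
    # Create the nested objects only if they are missing, keeping other keys.
    experimental = updated.setdefault("experimental", {})
    router = experimental.setdefault("gemmaModelRouter", {})
    router["enabled"] = True
    router["classifier"] = {
        "host": f"http://localhost:{port}",
        "model": model,
    }
    return updated
```

You would load the file with `json.load`, pass the result through `enable_local_router(settings, 9379)`, and write it back with `json.dump(..., indent=2)`. The input dictionary is left unchanged.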

### Configuration schema

| Field              | Type    | Required | Description                                                                                |
| :----------------- | :------ | :------- | :----------------------------------------------------------------------------------------- |
| `enabled`          | boolean | Yes      | Must be `true` to enable the feature.                                                      |
| `classifier`       | object  | Yes      | The configuration for the local model endpoint. It includes the host and model specifiers. |
| `classifier.host`  | string  | Yes      | The URL to the local model server. Should be `http://localhost:<port>`.                    |
| `classifier.model` | string  | Yes      | The model name to use for decisions. Must be `"gemma3-1b-gpu-custom"`.                     |
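
The rules in the table can be checked before restarting. The following is a minimal sketch of such a check, not the validation Gemini CLI itself performs; `validate_router_config` is a hypothetical helper that takes the `gemmaModelRouter` object and returns a list of problems.

```python
from urllib.parse import urlparse

EXPECTED_MODEL = "gemma3-1b-gpu-custom"


def _is_local_http_url(value):
    """True for strings of the form http://localhost:<port>."""
    if not isinstance(value, str):
        return False
    try:
        parsed = urlparse(value)
        return (
            parsed.scheme == "http"
            and parsed.hostname == "localhost"
            and parsed.port is not None
        )
    except ValueError:  # raised for a malformed port such as "localhost:abc"
        return False


def validate_router_config(router):
    """Return a list of problems found in a gemmaModelRouter object."""
    problems = []
    if router.get("enabled") is not True:
        problems.append("enabled must be true")
    classifier = router.get("classifier")
    if not isinstance(classifier, dict):
        problems.append("classifier must be an object")
        return problems
    if not _is_local_http_url(classifier.get("host")):
        problems.append("classifier.host should be http://localhost:<port>")
    if classifier.get("model") != EXPECTED_MODEL:
        problems.append(f'classifier.model must be "{EXPECTED_MODEL}"')
    return problems
```

An empty list means the object satisfies every row of the table.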

> **Note: You will need to restart Gemini CLI after configuration changes for
> local model routing to take effect.**