diff --git a/docs/cli/model-routing.md b/docs/cli/model-routing.md index 1f7ba5da09..3c7bd65bc5 100644 --- a/docs/cli/model-routing.md +++ b/docs/cli/model-routing.md @@ -26,6 +26,20 @@ policies. the CLI will use an available fallback model for the current turn or the remainder of the session. +### Local Model Routing (Experimental) + +Gemini CLI supports using a local model for routing decisions. When configured, +Gemini CLI will use a locally-running **Gemma** model to make routing decisions +(instead of sending routing decisions to a hosted model). This feature can help +reduce costs associated with hosted model usage while offering similar routing +decision latency and quality. + +In order to use this feature, the local Gemma model **must** be served behind a +Gemini API and accessible via HTTP at an endpoint configured in `settings.json`. + +For more details on how to configure local model routing, see +[Local Model Routing](../core/local-model-routing.md). + ### Model selection precedence The model used by Gemini CLI is determined by the following order of precedence: @@ -38,5 +52,8 @@ The model used by Gemini CLI is determined by the following order of precedence: 3. **`model.name` in `settings.json`:** If neither of the above are set, the model specified in the `model.name` property of your `settings.json` file will be used. -4. **Default model:** If none of the above are set, the default model will be +4. **Local model (experimental):** If the Gemma local model router is enabled + in your `settings.json` file, the CLI will use the local Gemma model + (instead of Gemini models) to route the request to an appropriate model. +5. **Default model:** If none of the above are set, the default model will be used. The default model is `auto` diff --git a/docs/core/index.md b/docs/core/index.md index adf186116f..afa13787b8 100644 --- a/docs/core/index.md +++ b/docs/core/index.md @@ -15,6 +15,8 @@ requests sent from `packages/cli`. For a general overview of Gemini CLI, see the modular GEMINI.md import feature using @file.md syntax. - **[Policy Engine](../reference/policy-engine.md):** Use the Policy Engine for fine-grained control over tool execution. +- **[Local Model Routing (experimental)](./local-model-routing.md):** Learn how + to enable use of a local Gemma model for model routing decisions. ## Role of the core diff --git a/docs/core/local-model-routing.md b/docs/core/local-model-routing.md new file mode 100644 index 0000000000..99f52511b0 --- /dev/null +++ b/docs/core/local-model-routing.md @@ -0,0 +1,193 @@ +# Local Model Routing (experimental) + +Gemini CLI supports using a local model for +[routing decisions](../cli/model-routing.md). When configured, Gemini CLI will +use a locally-running **Gemma** model to make routing decisions (instead of +sending routing decisions to a hosted model). + +This feature can help reduce costs associated with hosted model usage while +offering similar routing decision latency and quality. + +> **Note: Local model routing is currently an experimental feature.** + +## Setup + +Using a Gemma model for routing decisions requires that an implementation of a +Gemma model be running locally on your machine, served behind an HTTP endpoint +and accessed via the Gemini API. + +To serve the Gemma model, follow these steps: + +### Download the LiteRT-LM runtime + +The [LiteRT-LM](https://github.com/google-ai-edge/LiteRT-LM) runtime offers +pre-built binaries for locally-serving models. Download the binary appropriate +for your system. + +#### Windows + +1. Download + [lit.windows_x86_64.exe](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.windows_x86_64.exe). +2. Using GPU on Windows requires the DirectXShaderCompiler. Download the + [dxc zip from the latest release](https://github.com/microsoft/DirectXShaderCompiler/releases/download/v1.8.2505.1/dxc_2025_07_14.zip). + Unzip the archive and from the architecture-appropriate `bin\` directory, and + copy the `dxil.dll` and `dxcompiler.dll` into the same location as you saved + `lit.windows_x86_64.exe`. +3. (Optional) Test starting the runtime: + `.\lit.windows_x86_64.exe serve --verbose` + +#### Linux + +1. Download + [lit.linux_x86_64](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.linux_x86_64). +2. Ensure the binary is executable: `chmod a+x lit.linux_x86_64` +3. (Optional) Test starting the runtime: `./lit.linux_x86_64 serve --verbose` + +#### MacOS + +1. Download + [lit-macos-arm64](https://github.com/google-ai-edge/LiteRT-LM/releases/download/v0.9.0-alpha03/lit.macos_arm64). +2. Ensure the binary is executable: `chmod a+x lit.macos_arm64` +3. (Optional) Test starting the runtime: `./lit.macos_arm64 serve --verbose` + +> **Note**: MacOS can be configured to only allows binaries from "App Store & +> Known Developers". If you encounter an error message when attempting to run +> the binary, you will need to allow the application. One option is to visit +> `System Settings -> Privacy & Security`, scroll to `Security`, and click +> `"Allow Anyway"` for `"lit.macos_arm64"`. Another option is to run +> `xattr -d com.apple.quarantine lit.macos_arm64` from the commandline. + +### Download the Gemma Model + +Before using Gemma, you will need to download the model (and agree to the Terms +of Service). + +This can be done via the LiteRT-LM runtime. + +#### Windows + +```bash +$ .\lit.windows_x86_64.exe pull gemma3-1b-gpu-custom + +[Legal] The model you are about to download is governed by +the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing. + +Full Terms: https://ai.google.dev/gemma/terms +Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy + +Do you accept these terms? (Y/N): Y + +Terms accepted. +Downloading model 'gemma3-1b-gpu-custom' ... +Downloading... 968.6 MB +Download complete. +``` + +#### Linux + +```bash +$ ./lit.linux_x86_64 pull gemma3-1b-gpu-custom + +[Legal] The model you are about to download is governed by +the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing. + +Full Terms: https://ai.google.dev/gemma/terms +Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy + +Do you accept these terms? (Y/N): Y + +Terms accepted. +Downloading model 'gemma3-1b-gpu-custom' ... +Downloading... 968.6 MB +Download complete. +``` + +#### MacOS + +```bash +$ ./lit.lit.macos_arm64 pull gemma3-1b-gpu-custom + +[Legal] The model you are about to download is governed by +the Gemma Terms of Use and Prohibited Use Policy. Please review these terms and ensure you agree before continuing. + +Full Terms: https://ai.google.dev/gemma/terms +Prohibited Use Policy: https://ai.google.dev/gemma/prohibited_use_policy + +Do you accept these terms? (Y/N): Y + +Terms accepted. +Downloading model 'gemma3-1b-gpu-custom' ... +Downloading... 968.6 MB +Download complete. +``` + +### Start LiteRT-LM Runtime + +Using the command appropriate to your system, start the LiteRT-LM runtime. +Configure the port that you want to use for your Gemma model. For the purposes +of this document, we will use port `9379`. + +Example command for MacOS: `./lit.macos_arm64 serve --port=9379 --verbose` + +### (Optional) Verify Model Serving + +Send a quick prompt to the model via HTTP to validate successful model serving. +This will cause the runtime to download the model and run it once. + +You should see a short joke in the server output as an indicator of success. + +#### Windows + +``` +# Run this in PowerShell to send a request to the server + +$uri = "http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent" +$body = @{contents = @( @{ + role = "user" + parts = @( @{ text = "Tell me a joke." } ) +})} | ConvertTo-Json -Depth 10 + +Invoke-RestMethod -Uri $uri -Method Post -Body $body -ContentType "application/json" +``` + +#### Linux/MacOS + +```bash +$ curl "http://localhost:9379/v1beta/models/gemma3-1b-gpu-custom:generateContent" \ + -H 'Content-Type: application/json' \ + -X POST \ + -d '{"contents":[{"role":"user","parts":[{"text":"Tell me a joke."}]}]}' +``` + +## Configuration + +To use a local Gemma model for routing, you must explicitly enable it in your +`settings.json`: + +```json +{ + "experimental": { + "gemmaModelRouter": { + "enabled": true, + "classifier": { + "host": "http://localhost:9379", + "model": "gemma3-1b-gpu-custom" + } + } + } +} +``` + +> Use the port you started your LiteRT-LM runtime on in the setup steps. + +### Configuration schema + +| Field | Type | Required | Description | +| :----------------- | :------ | :------- | :----------------------------------------------------------------------------------------- | +| `enabled` | boolean | Yes | Must be `true` to enable the feature. | +| `classifier` | object | Yes | The configuration for the local model endpoint. It includes the host and model specifiers. | +| `classifier.host` | string | Yes | The URL to the local model server. Should be `http://localhost:`. | +| `classifier.model` | string | Yes | The model name to use for decisions. Must be `"gemma3-1b-gpu-custom"`. | + +> **Note: You will need to restart after configuration changes for local model +> routing to take effect.**