feat: implement adaptive thinking budget

2026-06-29 04:37:12 -07:00 · 2026-01-06 15:54:07 -08:00
parent 6f4b2ad0b9
commit 2404e4fae8
10 changed files with 450 additions and 1 deletion
@@ -0,0 +1,3 @@
+# Tracks
+
+- [Dynamic Thinking Budget](tracks/dynamic-thinking-budget/plan.md)
@@ -0,0 +1,101 @@
+# Dynamic Thinking Budget Plan
+
+## Context
+
+The current Gemini CLI implementation uses static thinking configurations
+defined in `settings.json` (or defaults).
+
+- **Gemini 2.x**: Uses a static `thinkingBudget` (e.g., 8192 tokens).
+- **Gemini 3**: Uses a static `thinkingLevel` (e.g., "HIGH").
+
+This "one-size-fits-all" approach is inefficient. Simple queries waste compute,
+while complex queries might not get enough reasoning depth. The goal is to
+implement an "Adaptive Budget Manager" that dynamically adjusts the
+`thinkingBudget` (for v2) or `thinkingLevel` (for v3) based on the complexity of
+the user's request.
+
+## Goals
+
+- Implement a **Complexity Classifier** using a lightweight model (e.g., Gemini
+  Flash) to analyze the user's prompt and history.
+- **Map complexity levels** to:
+  - `thinkingBudget` token counts for Gemini 2.x models.
+  - `thinkingLevel` enums for Gemini 3 models.
+- **Dynamically update** the `GenerateContentConfig` in `GeminiClient` before
+  the main model call.
+- Ensure **fallback mechanisms** if the classification fails.
+- (Optional) **Visual feedback** to the user regarding the determined
+  complexity.
+
+## Strategy
+
+### 1. Adaptive Budget Manager Service
+
+Create a new service `AdaptiveBudgetService` in
+`packages/core/src/services/adaptiveBudgetService.ts`.
+
+- **Functionality**:
+  - Takes `userPrompt` and `recentHistory` as input.
+  - Calls Gemini Flash (using `config.getBaseLlmClient()`) with a specialized
+    system prompt.
+  - Returns a `ComplexityLevel` (1-4).
+
+### 2. Budget/Level Mapping
+
+| Complexity Level | Gemini 2.x (`thinkingBudget`) | Gemini 3 (`thinkingLevel`) | Description                    |
+| :--------------- | :---------------------------- | :------------------------- | :----------------------------- |
+| **1 (Simple)**   | 1,024 tokens                  | `LOW`                      | Quick fixes, syntax questions. |
+| **2 (Moderate)** | 4,096 tokens                  | `MEDIUM` (or `LOW`)        | Function-level logic.          |
+| **3 (High)**     | 16,384 tokens                 | `HIGH`                     | Module-level refactoring.      |
+| **4 (Extreme)**  | 32,768+ tokens                | `HIGH`                     | Architecture, deep debugging.  |
+
+### 3. Integration Point
+
+Modify `packages/core/src/core/client.ts` to invoke the `AdaptiveBudgetService`
+before `sendMessageStream`.
+
+- **Flow**:
+  1.  User sends message.
+  2.  `GeminiClient` identifies the target model family (v2 or v3).
+  3.  Call `AdaptiveBudgetService.determineAdaptiveConfig()`.
+  4.  If **v2**: Calculate `thinkingBudget` based on complexity. Update config.
+  5.  If **v3**: Calculate `thinkingLevel` based on complexity. Update config.
+  6.  Proceed with `sendMessageStream`.
+
+### 4. Configuration
+
+Add settings to `packages/core/src/config/config.ts` and `settings.schema.json`:
+
+- `experimental.adaptiveThinking.enabled`: boolean (default false)
+- `experimental.adaptiveThinking.classifierModel`: string (default "classifier")
+
+## Insights from "J1: Exploring Simple Test-Time Scaling (STTS)"
+
+The paper (arXiv:2505.xxxx / 2512.19585) highlights that models trained with
+Reinforcement Learning (like Gemini 3) exhibit strong scaling trends when
+allocated more inference-time compute.
+
+- **Budget Forcing**: The "Adaptive Budget Manager" implements this by forcing
+  higher `thinkingLevel` or `thinkingBudget` for harder tasks, maximizing the
+  "verifiable reward" (correct code) for complex problems while saving latency
+  on simple ones.
+- **Best-of-N**: The paper suggests that generating N solutions and selecting
+  the best one is a powerful STTS method. While out of scope for _this_ specific
+  track, the "Complexity Classifier" we build here is the _prerequisite_ for
+  that future feature. We should only trigger expensive "Best-of-N" flows when
+  the Complexity Level is 3 or 4.
+
+## Files to Modify
+
+- `packages/core/src/services/adaptiveBudgetService.ts` (New)
+- `packages/core/src/core/client.ts`
+- `packages/core/src/config/config.ts`
+
+## Verification Plan
+
+1.  **Unit Tests**: Verify `AdaptiveBudgetService` returns correct mappings for
+    both model families.
+2.  **Integration Tests**: Mock API calls to ensure `thinkingLevel` is sent for
+    v3 and `thinkingBudget` for v2.
+3.  **Manual Verification**: Use debug logs to verify the correct parameters are
+    being sent to the API.
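A minimal sketch of the mapping table and steps 4 and 5 of the integration flow from the plan above, not the committed implementation. The `ModelFamily` type and the `thinkingParamFor` helper are hypothetical names introduced for illustration, and `ThinkingLevel` is declared locally rather than imported from `@google/genai` so the snippet stands alone. Level 2 maps to `LOW` here, matching the committed `LEVEL_MAPPING_V3`.

```typescript
// Local stand-ins so the sketch runs without the repo or @google/genai.
enum ComplexityLevel {
  SIMPLE = 1,
  MODERATE = 2,
  HIGH = 3,
  EXTREME = 4,
}

type ThinkingLevel = 'LOW' | 'HIGH';
type ModelFamily = 'v2' | 'v3'; // hypothetical: the plan's "v2 or v3" split

// Token budgets for Gemini 2.x, taken from the plan's mapping table.
const BUDGET_V2: Record<ComplexityLevel, number> = {
  [ComplexityLevel.SIMPLE]: 1024,
  [ComplexityLevel.MODERATE]: 4096,
  [ComplexityLevel.HIGH]: 16384,
  [ComplexityLevel.EXTREME]: 32768,
};

// Thinking levels for Gemini 3.
const LEVEL_V3: Record<ComplexityLevel, ThinkingLevel> = {
  [ComplexityLevel.SIMPLE]: 'LOW',
  [ComplexityLevel.MODERATE]: 'LOW',
  [ComplexityLevel.HIGH]: 'HIGH',
  [ComplexityLevel.EXTREME]: 'HIGH',
};

interface ThinkingParam {
  thinkingBudget?: number;
  thinkingLevel?: ThinkingLevel;
}

// Pick exactly one parameter per model family, never both.
function thinkingParamFor(
  family: ModelFamily,
  level: ComplexityLevel,
): ThinkingParam {
  return family === 'v3'
    ? { thinkingLevel: LEVEL_V3[level] }
    : { thinkingBudget: BUDGET_V2[level] };
}
```

Returning a single-field object keeps the two parameter styles from leaking into each other when the result is later merged into a request config.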
@@ -716,6 +716,7 @@ export async function loadCliConfig(
      settings.experimental?.codebaseInvestigatorSettings,
    introspectionAgentSettings:
      settings.experimental?.introspectionAgentSettings,
+    adaptiveThinking: settings.experimental?.adaptiveThinking,
    fakeResponses: argv.fakeResponses,
    recordResponses: argv.recordResponses,
    retryFetchErrors: settings.general?.retryFetchErrors,
@@ -1473,6 +1473,37 @@ const SETTINGS_SCHEMA = {
          },
        },
      },
+      adaptiveThinking: {
+        type: 'object',
+        label: 'Adaptive Thinking Settings',
+        category: 'Experimental',
+        requiresRestart: false,
+        default: {},
+        description: 'Configuration for Adaptive Thinking Budget.',
+        showInDialog: false,
+        properties: {
+          enabled: {
+            type: 'boolean',
+            label: 'Enable Adaptive Thinking',
+            category: 'Experimental',
+            requiresRestart: false,
+            default: false,
+            description:
+              'Enable adaptive thinking budget based on task complexity.',
+            showInDialog: true,
+          },
+          classifierModel: {
+            type: 'string',
+            label: 'Classifier Model',
+            category: 'Experimental',
+            requiresRestart: false,
+            default: 'classifier',
+            description:
+              'The model (or alias) to use for complexity classification.',
+            showInDialog: false,
+          },
+        },
+      },
    },
  },

@@ -73,6 +73,7 @@ import type { ModelConfigServiceConfig } from '../services/modelConfigService.js
 import { ModelConfigService } from '../services/modelConfigService.js';
 import { DEFAULT_MODEL_CONFIGS } from './defaultModelConfigs.js';
 import { ContextManager } from '../services/contextManager.js';
+import { AdaptiveBudgetService } from '../services/adaptiveBudgetService.js';

 // Re-export OAuth config type
 export type { MCPOAuthConfig, AnyToolInvocation };
@@ -335,6 +336,10 @@ export interface ConfigParameters {
  disableModelRouterForAuth?: AuthType[];
  codebaseInvestigatorSettings?: CodebaseInvestigatorSettings;
  introspectionAgentSettings?: IntrospectionAgentSettings;
+  adaptiveThinking?: {
+    enabled?: boolean;
+    classifierModel?: string;
+  };
  continueOnFailedApiCall?: boolean;
  retryFetchErrors?: boolean;
  enableShellOutputEfficiency?: boolean;
@@ -460,6 +465,10 @@ export class Config {
  private readonly outputSettings: OutputSettings;
  private readonly codebaseInvestigatorSettings: CodebaseInvestigatorSettings;
  private readonly introspectionAgentSettings: IntrospectionAgentSettings;
+  private readonly adaptiveThinking: {
+    enabled: boolean;
+    classifierModel: string;
+  };
  private readonly continueOnFailedApiCall: boolean;
  private readonly retryFetchErrors: boolean;
  private readonly enableShellOutputEfficiency: boolean;
@@ -491,6 +500,7 @@ export class Config {
  private readonly experimentalJitContext: boolean;
  private contextManager?: ContextManager;
  private terminalBackground: string | undefined = undefined;
+  private adaptiveBudgetService!: AdaptiveBudgetService;

  constructor(params: ConfigParameters) {
    this.sessionId = params.sessionId;
@@ -618,6 +628,10 @@ export class Config {
    this.introspectionAgentSettings = {
      enabled: params.introspectionAgentSettings?.enabled ?? false,
    };
+    this.adaptiveThinking = {
+      enabled: params.adaptiveThinking?.enabled ?? false,
+      classifierModel: params.adaptiveThinking?.classifierModel ?? 'classifier',
+    };
    this.continueOnFailedApiCall = params.continueOnFailedApiCall ?? true;
    this.enableShellOutputEfficiency =
      params.enableShellOutputEfficiency ?? true;
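The defaulting in the constructor hunk above can be sketched in isolation. This is an illustration under stated assumptions: `resolveAdaptiveThinking` is a hypothetical helper name, and the `settings` object only mimics the shape that `loadCliConfig` reads via `settings.experimental?.adaptiveThinking`.

```typescript
interface AdaptiveThinkingSettings {
  enabled?: boolean;
  classifierModel?: string;
}

// Mirrors the `??` defaults applied in the Config constructor: the feature is
// off unless explicitly enabled, and the classifier falls back to the
// 'classifier' model alias.
function resolveAdaptiveThinking(
  input?: AdaptiveThinkingSettings,
): Required<AdaptiveThinkingSettings> {
  return {
    enabled: input?.enabled ?? false,
    classifierModel: input?.classifierModel ?? 'classifier',
  };
}

// Parsed settings in which only the experimental flag is switched on.
const settings = { experimental: { adaptiveThinking: { enabled: true } } };
const resolved = resolveAdaptiveThinking(settings.experimental.adaptiveThinking);
```

Because `??` only replaces `null` and `undefined`, an explicit `enabled: false` in the settings is preserved rather than overwritten.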
@@ -763,6 +777,13 @@ export class Config {
      await this.contextManager.refresh();
    }

+    this.adaptiveBudgetService = new AdaptiveBudgetService(this);
+    if (this.adaptiveThinking.enabled) {
+      debugLogger.debug(
+        `Adaptive Thinking Budget enabled (classifier: ${this.adaptiveThinking.classifierModel})`,
+      );
+    }
+
    await this.geminiClient.initialize();
  }

@@ -770,6 +791,10 @@ export class Config {
    return this.contentGenerator;
  }

+  getAdaptiveBudgetService(): AdaptiveBudgetService {
+    return this.adaptiveBudgetService;
+  }
+
  async refreshAuth(authMethod: AuthType) {
    // Reset availability service when switching auth
    this.modelAvailabilityService.reset();
@@ -1664,6 +1689,10 @@ export class Config {
    return this.introspectionAgentSettings;
  }

+  getAdaptiveThinkingConfig(): { enabled: boolean; classifierModel: string } {
+    return this.adaptiveThinking;
+  }
+
  async createToolRegistry(): Promise<ToolRegistry> {
    const registry = new ToolRegistry(this, this.messageBus);

@@ -28,6 +28,7 @@ import { GeminiChat } from './geminiChat.js';
 import { retryWithBackoff } from '../utils/retry.js';
 import { getErrorMessage } from '../utils/errors.js';
 import { tokenLimit } from './tokenLimits.js';
+import { partListUnionToString } from './geminiRequest.js';
 import type {
  ChatRecordingService,
  ResumedSessionData,
@@ -620,6 +621,25 @@ export class GeminiClient {

    // availability logic
    const modelConfigKey: ModelConfigKey = { model: modelToUse };
+
+    // Adaptive Thinking Budget Integration
+    if (
+      !isInvalidStreamRetry &&
+      this.config.getAdaptiveThinkingConfig().enabled
+    ) {
+      const userMessage = partListUnionToString(request);
+      if (userMessage) {
+        const adaptiveConfig = await this.config
+          .getAdaptiveBudgetService()
+          .determineAdaptiveConfig(userMessage, modelToUse);
+
+        if (adaptiveConfig) {
+          modelConfigKey.thinkingBudget = adaptiveConfig.thinkingBudget;
+          modelConfigKey.thinkingLevel = adaptiveConfig.thinkingLevel;
+        }
+      }
+    }
+
    const { model: finalModel } = applyModelSelection(
      this.config,
      modelConfigKey,
@@ -0,0 +1,88 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+import { describe, it, expect, vi } from 'vitest';
+import {
+  AdaptiveBudgetService,
+  ComplexityLevel,
+} from './adaptiveBudgetService.js';
+import type { Config } from '../config/config.js';
+import { ThinkingLevel } from '@google/genai';
+
+describe('AdaptiveBudgetService', () => {
+  it('should map complexity levels to correct V2 budgets', () => {
+    const service = new AdaptiveBudgetService({} as Config);
+    expect(service.getThinkingBudgetV2(ComplexityLevel.SIMPLE)).toBe(1024);
+    expect(service.getThinkingBudgetV2(ComplexityLevel.MODERATE)).toBe(4096);
+    expect(service.getThinkingBudgetV2(ComplexityLevel.HIGH)).toBe(16384);
+    expect(service.getThinkingBudgetV2(ComplexityLevel.EXTREME)).toBe(32768);
+  });
+
+  it('should map complexity levels to correct V3 levels', () => {
+    const service = new AdaptiveBudgetService({} as Config);
+    expect(service.getThinkingLevelV3(ComplexityLevel.SIMPLE)).toBe(
+      ThinkingLevel.LOW,
+    );
+    expect(service.getThinkingLevelV3(ComplexityLevel.MODERATE)).toBe(
+      ThinkingLevel.LOW,
+    );
+    expect(service.getThinkingLevelV3(ComplexityLevel.HIGH)).toBe(
+      ThinkingLevel.HIGH,
+    );
+    expect(service.getThinkingLevelV3(ComplexityLevel.EXTREME)).toBe(
+      ThinkingLevel.HIGH,
+    );
+  });
+
+  it('should determine adaptive config based on LLM response', async () => {
+    const mockGenerateContent = vi.fn().mockResolvedValue({
+      candidates: [{ content: { parts: [{ text: '3' }] } }],
+    });
+
+    const mockConfig = {
+      getBaseLlmClient: () => ({
+        generateContent: mockGenerateContent,
+      }),
+      getAdaptiveThinkingConfig: () => ({
+        enabled: true,
+        classifierModel: 'gemini-2.0-flash',
+      }),
+    } as unknown as Config;
+
+    const service = new AdaptiveBudgetService(mockConfig);
+    const result = await service.determineAdaptiveConfig(
+      'Complex task',
+      'gemini-2.5-pro',
+    );
+
+    expect(result?.complexity).toBe(ComplexityLevel.HIGH);
+    expect(result?.thinkingBudget).toBe(16384);
+    expect(mockGenerateContent).toHaveBeenCalled();
+  });
+
+  it('should handle Gemini 3 models with thinkingLevel', async () => {
+    const mockConfig = {
+      getBaseLlmClient: () => ({
+        generateContent: vi.fn().mockResolvedValue({
+          candidates: [{ content: { parts: [{ text: '1' }] } }],
+        }),
+      }),
+      getAdaptiveThinkingConfig: () => ({
+        enabled: true,
+        classifierModel: 'gemini-2.0-flash',
+      }),
+    } as unknown as Config;
+
+    const service = new AdaptiveBudgetService(mockConfig);
+    const result = await service.determineAdaptiveConfig(
+      'Hi',
+      'gemini-3-pro-preview',
+    );
+
+    expect(result?.complexity).toBe(ComplexityLevel.SIMPLE);
+    expect(result?.thinkingLevel).toBe(ThinkingLevel.LOW);
+    expect(result?.thinkingBudget).toBeUndefined();
+  });
+});
@@ -0,0 +1,132 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+import type { Config } from '../config/config.js';
+import { debugLogger } from '../utils/debugLogger.js';
+import { isGemini2Model, isPreviewModel } from '../config/models.js';
+import { ThinkingLevel } from '@google/genai';
+
+export enum ComplexityLevel {
+  SIMPLE = 1,
+  MODERATE = 2,
+  HIGH = 3,
+  EXTREME = 4,
+}
+
+export const BUDGET_MAPPING_V2: Record<ComplexityLevel, number> = {
+  [ComplexityLevel.SIMPLE]: 1024,
+  [ComplexityLevel.MODERATE]: 4096,
+  [ComplexityLevel.HIGH]: 16384,
+  [ComplexityLevel.EXTREME]: 32768,
+};
+
+export const LEVEL_MAPPING_V3: Record<ComplexityLevel, ThinkingLevel> = {
+  [ComplexityLevel.SIMPLE]: ThinkingLevel.LOW,
+  [ComplexityLevel.MODERATE]: ThinkingLevel.LOW,
+  [ComplexityLevel.HIGH]: ThinkingLevel.HIGH,
+  [ComplexityLevel.EXTREME]: ThinkingLevel.HIGH,
+};
+
+export interface AdaptiveBudgetResult {
+  complexity: ComplexityLevel;
+  thinkingBudget?: number;
+  thinkingLevel?: ThinkingLevel;
+  strategyNote?: string;
+}
+
+export class AdaptiveBudgetService {
+  constructor(private config: Config) {}
+
+  /**
+   * Analyzes the user prompt and determines the optimal thinking configuration.
+   *
+   * Note on future scaling (per arXiv:2512.19585):
+   * At Complexity 4 (Extreme), we should consider:
+   * 1. Best-of-N: Generate multiple solutions.
+   * 2. LLM-as-a-Judge: Use a strong model to evaluate candidates.
+   * 3. Compiler Verification: Check code correctness via environment tools.
+   */
+  async determineAdaptiveConfig(
+    userPrompt: string,
+    model: string,
+  ): Promise<AdaptiveBudgetResult | undefined> {
+    const { classifierModel } = this.config.getAdaptiveThinkingConfig();
+
+    try {
+      const llm = this.config.getBaseLlmClient();
+      debugLogger.debug(
+        `AdaptiveBudgetService: Classifying prompt complexity using ${classifierModel}...`,
+      );
+      const systemPrompt = `You are a complexity classifier for a coding assistant.
+Analyze the user's request and determine the complexity of the task.
+Output ONLY a single integer from 1 to 4 based on the following scale:
+
+1 (Simple): Quick fixes, syntax questions, simple explanations, greetings.
+2 (Moderate): Function-level logic, writing small scripts, standard debugging.
+3 (High): Module-level refactoring, complex feature implementation, multi-file changes.
+4 (Extreme): Architecture design, deep root-cause analysis of obscure bugs, large-scale migrations.
+
+Request: ${userPrompt}
+Complexity Level:`;
+
+      const response = await llm.generateContent({
+        modelConfigKey: { model: classifierModel },
+        contents: [{ role: 'user', parts: [{ text: systemPrompt }] }],
+        promptId: 'adaptive-budget-classifier',
+        abortSignal: new AbortController().signal,
+      });
+
+      const text = response.candidates?.[0]?.content?.parts?.[0]?.text?.trim();
+      if (!text) {
+        debugLogger.debug(
+          'AdaptiveBudgetService: No response from classifier.',
+        );
+        return undefined;
+      }
+
+      const level = parseInt(text, 10) as ComplexityLevel;
+      if (isNaN(level) || level < 1 || level > 4) {
+        debugLogger.debug(
+          `AdaptiveBudgetService: Invalid complexity level returned: ${text}`,
+        );
+        return undefined;
+      }
+
+      const result: AdaptiveBudgetResult = { complexity: level };
+
+      // Gemini 3 (detected via isPreviewModel) takes thinkingLevel; Gemini 2.x
+      // takes thinkingBudget. Any other model gets no thinking override.
+      if (isPreviewModel(model)) {
+        result.thinkingLevel = LEVEL_MAPPING_V3[level] ?? ThinkingLevel.HIGH;
+      } else if (isGemini2Model(model)) {
+        result.thinkingBudget = BUDGET_MAPPING_V2[level];
+      }
+
+      if (level === ComplexityLevel.EXTREME) {
+        result.strategyNote =
+          'EXTREME complexity detected. Future implementations should use Best-of-N + Verification.';
+      }
+
+      debugLogger.debug(
+        `AdaptiveBudgetService: Complexity ${level} -> Thinking Param: ${result.thinkingLevel ?? result.thinkingBudget}`,
+      );
+      return result;
+    } catch (error) {
+      debugLogger.error(
+        'AdaptiveBudgetService: Error classifying complexity',
+        error,
+      );
+      return undefined;
+    }
+  }
+
+  getThinkingBudgetV2(level: ComplexityLevel): number {
+    return BUDGET_MAPPING_V2[level];
+  }
+
+  getThinkingLevelV3(level: ComplexityLevel): ThinkingLevel {
+    return LEVEL_MAPPING_V3[level] ?? ThinkingLevel.HIGH;
+  }
+}
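The reply parsing in `determineAdaptiveConfig` relies on `parseInt`, which yields `NaN` for a reply such as `Complexity Level: 3` and therefore falls back to no override. A stricter alternative, sketched here as an assumption rather than as part of this commit, extracts the digit explicitly. `parseComplexity` is a hypothetical helper name.

```typescript
type ComplexityLevel = 1 | 2 | 3 | 4;

// Returns the first standalone digit in the range 1-4 found in the reply, or
// undefined so the caller can keep the static thinking configuration.
function parseComplexity(
  reply: string | undefined,
): ComplexityLevel | undefined {
  // \b on both sides rejects digits embedded in longer numbers such as "10".
  const match = reply?.match(/\b([1-4])\b/);
  return match ? (Number(match[1]) as ComplexityLevel) : undefined;
}
```

The word boundaries are a deliberate trade-off: they accept replies that echo the prompt label or append a description, while still rejecting out-of-range numbers.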
@@ -4,7 +4,7 @@
 * SPDX-License-Identifier: Apache-2.0
 */

-import type { GenerateContentConfig } from '@google/genai';
+import type { GenerateContentConfig, ThinkingLevel } from '@google/genai';

 // The primary key for the ModelConfig is the model string. However, we also
 // support a secondary key to limit the override scope, typically an agent name.
@@ -26,6 +26,10 @@ export interface ModelConfigKey {
  // This allows overrides to specify different settings (e.g., higher temperature)
  // specifically for retry scenarios.
  isRetry?: boolean;
+
+  // Dynamic thinking configuration determined at runtime (e.g. via complexity classification)
+  thinkingBudget?: number;
+  thinkingLevel?: ThinkingLevel;
 }

 export interface ModelConfig {
@@ -205,6 +209,22 @@ export class ModelConfigService {
      }
    }

+    // Apply dynamic thinking parameters from context if present
+    if (
+      context.thinkingBudget !== undefined ||
+      context.thinkingLevel !== undefined
+    ) {
+      resolvedConfig.thinkingConfig = {
+        ...(resolvedConfig.thinkingConfig as object),
+        ...(context.thinkingBudget !== undefined
+          ? { thinkingBudget: context.thinkingBudget }
+          : {}),
+        ...(context.thinkingLevel !== undefined
+          ? { thinkingLevel: context.thinkingLevel }
+          : {}),
+      };
+    }
+
    return {
      model: baseModel,
      generateContentConfig: resolvedConfig,
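The merge in this hunk layers the runtime values over whatever `thinkingConfig` was already resolved. A standalone sketch of that spread, using a hypothetical `applyThinkingOverrides` helper and locally declared types, illustrates one consequence: a `thinkingBudget` already present in the resolved config survives when only `thinkingLevel` is supplied, so the merged object carries both fields.

```typescript
interface ThinkingConfig {
  thinkingBudget?: number;
  thinkingLevel?: string;
  includeThoughts?: boolean;
}

interface ThinkingOverrides {
  thinkingBudget?: number;
  thinkingLevel?: string;
}

function applyThinkingOverrides(
  base: ThinkingConfig | undefined,
  ctx: ThinkingOverrides,
): ThinkingConfig | undefined {
  if (ctx.thinkingBudget === undefined && ctx.thinkingLevel === undefined) {
    return base; // nothing to apply; leave the resolved config untouched
  }
  return {
    ...base, // spreading undefined is a no-op
    ...(ctx.thinkingBudget !== undefined
      ? { thinkingBudget: ctx.thinkingBudget }
      : {}),
    ...(ctx.thinkingLevel !== undefined
      ? { thinkingLevel: ctx.thinkingLevel }
      : {}),
  };
}

// A static budget from settings plus a runtime level from the classifier.
const merged = applyThinkingOverrides(
  { thinkingBudget: 8192, includeThoughts: true },
  { thinkingLevel: 'HIGH' },
);
```

If sending both fields for one model is undesirable, the merge would need to drop the field that does not apply to the target model family; that choice is left open here.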
@@ -1441,6 +1441,30 @@
            }
          },
          "additionalProperties": false
+        },
+        "adaptiveThinking": {
+          "title": "Adaptive Thinking Settings",
+          "description": "Configuration for Adaptive Thinking Budget.",
+          "markdownDescription": "Configuration for Adaptive Thinking Budget.\n\n- Category: `Experimental`\n- Requires restart: `no`\n- Default: `{}`",
+          "default": {},
+          "type": "object",
+          "properties": {
+            "enabled": {
+              "title": "Enable Adaptive Thinking",
+              "description": "Enable adaptive thinking budget based on task complexity.",
+              "markdownDescription": "Enable adaptive thinking budget based on task complexity.\n\n- Category: `Experimental`\n- Requires restart: `no`\n- Default: `false`",
+              "default": false,
+              "type": "boolean"
+            },
+            "classifierModel": {
+              "title": "Classifier Model",
+              "description": "The model (or alias) to use for complexity classification.",
+              "markdownDescription": "The model (or alias) to use for complexity classification.\n\n- Category: `Experimental`\n- Requires restart: `no`\n- Default: `classifier`",
+              "default": "classifier",
+              "type": "string"
+            }
+          },
+          "additionalProperties": false
        }
      },
      "additionalProperties": false