feat(workspaces): transform workspaces feature into a distributable extension

2026-06-13 12:57:12 -07:00 · 2026-03-19 09:53:51 -07:00
parent 8cdd22d27d
commit df2ac184dd
37 changed files with 105 additions and 382 deletions
@@ -0,0 +1,35 @@
+# Future State: Gemini Workspaces Platform
+
+This document outlines the long-term architectural evolution of the Workspaces feature (formerly "Workspace").
+
+## 🎯 Vision
+Transform Workspaces into a first-class platform capability that allows developers to seamlessly move intensive workloads (AI reasoning, complex builds, parallel testing) to any compute environment (Cloud or Local).
+
+## 🗺️ Evolutionary Roadmap
+
+### Phase 1: Generalization & Renaming (Current)
+- **Goal**: Make the feature useful for any repository, not just Gemini CLI.
+- **Action**: Rename to "Workspaces."
+- **Action**: Implement dynamic repository detection via Git.
+- **Action**: Isolate all state into `.gemini/workspaces/`.
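+
+One way the Git-based repository detection could work is to parse the URL of the `origin` remote. The sketch below is an assumption about the approach, not the shipped implementation; `parseGitRemote` and the `RepoInfo` shape are hypothetical names.
+
+```typescript
+// Hypothetical shape for the detected repository.
+interface RepoInfo {
+  host: string;
+  owner: string;
+  repo: string;
+}
+
+// Hypothetical helper: accepts both SSH-style (git@host:owner/repo.git)
+// and HTTPS-style (https://host/owner/repo) remote URLs.
+function parseGitRemote(url: string): RepoInfo | null {
+  const match = url
+    .trim()
+    .match(/^(?:https?:\/\/|ssh:\/\/git@|git@)([^/:]+)[/:]([^/]+)\/([^/]+?)(?:\.git)?\/?$/);
+  if (!match) return null;
+  const [, host, owner, repo] = match;
+  return { host, owner, repo };
+}
+```
+
+The input string would typically come from `git remote get-url origin`; unrecognized URL shapes return `null` so the caller can fall back to asking the user.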
+
+### Phase 2: Pluggable Compute Extensions
+- **Goal**: Decouple the infrastructure logic from the core CLI.
+- **Action**: Move `WorkerProviders` into a dedicated **Workspaces Extension**.
+- **Action**: Support multiple providers (GCP, AWS, Local Docker).
+- **Action**: Define a standard API for Workspace Providers.
+
+### Phase 3: Core Integration
+- **Goal**: Standardize the user experience.
+- **Action**: Move the high-level `gemini workspace` command into the core `gemini` binary.
+- **Action**: Implement automated "Environment Hand-off" where the local agent can automatically spin up a remote workspace when it detects a heavy task.
+
+### Phase 4: Public Marketplace
+- **Goal**: Community adoption.
+- **Action**: Publish the official GCP Workspace Extension.
+- **Action**: Provide a "Zero-Config" public base image for standard Node/TS development.
+
+## 🏗️ Architectural Principles
+1. **BYOC (Bring Your Own Cloud)**: Users connect their own infrastructure.
+2. **Nested Persistence**: Keep the environment in the container, but manage the lifecycle with the host.
+3. **Repo-Agnostic**: One set of tools should work for any project.
@@ -0,0 +1,28 @@
+# Architectural Mandate: High-Performance Workspace System
+
+## Infrastructure Strategy
+- **Base OS**: Always use **Container-Optimized OS (COS)** (`cos-stable` family). It is security-hardened and has Docker pre-installed.
+- **Provisioning**: Use the **Cloud-Init (`user-data`)** pattern. 
+    - *Note*: Avoid `gcloud compute instances create-with-container` on standard Linux images, as it uses a deprecated startup agent. On COS, use native `user-data` for the cleanest execution.
+- **Performance**: Provision with a minimum of **200GB PD-Balanced** disk to ensure high I/O throughput for Node.js builds and to satisfy GCP disk performance requirements.
+
+## Container Isolation
+- **Image**: `us-docker.pkg.dev/gemini-code-dev/gemini-cli/maintainer:latest`.
+- **Identity**: The container must be named **`maintainer-worker`**.
+- **Mounts**: Standardize on these host-to-container mappings:
+    - `~/dev` -> `/home/node/dev` (Persistence for worktrees)
+    - `~/.gemini` -> `/home/node/.gemini` (Shared credentials)
+    - `~/.workspace` -> `/home/node/.workspace` (Shared scripts/logs)
+- **Runtime**: The container runs as a persistent service (`--restart always`) acting as a "Remote Workstation" rather than an ephemeral task.
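+
+As an illustration, the naming, mount, and restart rules above could be assembled into `docker run` arguments roughly as follows. This is a sketch: `buildDockerRunArgs` is a hypothetical helper, and running detached (`-d`) with the image's default command is an assumption not stated in this document.
+
+```typescript
+// Mounts from the list above, relative to the host user's home directory.
+const MOUNTS: Array<[string, string]> = [
+  ["dev", "/home/node/dev"],
+  [".gemini", "/home/node/.gemini"],
+  [".workspace", "/home/node/.workspace"],
+];
+
+// Hypothetical helper: builds the argv passed to the `docker` binary.
+function buildDockerRunArgs(hostHome: string, image: string): string[] {
+  // `-d` (detached) is an assumption; name and restart policy come from the mandate.
+  const args = ["run", "-d", "--name", "maintainer-worker", "--restart", "always"];
+  for (const [hostDir, containerDir] of MOUNTS) {
+    args.push("-v", `${hostHome}/${hostDir}:${containerDir}`);
+  }
+  args.push(image);
+  return args;
+}
+```
+
+The host home directory is passed in explicitly because `docker` itself does not expand `~`; only the calling shell does.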
+
+## Orchestration Logic
+- **Worker Provider Abstraction**: Infrastructure is managed via a `WorkerProvider` interface (e.g., `GceCosProvider`). This decouples the orchestration logic from the underlying platform.
+- **Robust Connectivity**: The system uses a dual-path connectivity strategy:
+    1. **Fast-Path SSH**: Primary connection via a standard SSH alias (`gcli-worker`) for high-performance synchronization and interaction.
+    2. **IAP Fallback**: Automatic fallback to `gcloud compute ssh --tunnel-through-iap` for users off-VPC or when direct DNS resolution fails.
+- **Context Execution**: Use `docker exec -it maintainer-worker ...` for interactive tasks and `tmux` sessions. This provides persistence against connection drops while keeping the host OS "invisible."
+- **Path Resolution**: Both Host and Container must share identical tilde (`~`) paths to avoid mapping confusion in automation scripts.
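+
+The dual-path strategy above might be sketched like this. `Runner`, `WorkerTarget`, and `execOnWorker` are hypothetical names, and treating exit code 255 as "connection failed" relies on `ssh` reserving that code for its own errors; the real provider may detect failures differently.
+
+```typescript
+// Hypothetical types: a runner executes a command and returns its exit code.
+type Runner = (cmd: string, args: string[]) => number;
+
+interface WorkerTarget {
+  sshAlias: string; // e.g. "gcli-worker"
+  instance: string;
+  zone: string;
+}
+
+// ssh exits with 255 when the connection itself fails.
+// Caveat: a remote command that exits 255 would also trigger the fallback.
+const SSH_CONNECTION_ERROR = 255;
+
+function execOnWorker(
+  target: WorkerTarget,
+  remoteCommand: string,
+  run: Runner,
+): { path: "fast-path" | "iap"; exitCode: number } {
+  // 1. Fast path: plain SSH through the configured alias.
+  const fastExit = run("ssh", [target.sshAlias, remoteCommand]);
+  if (fastExit !== SSH_CONNECTION_ERROR) {
+    return { path: "fast-path", exitCode: fastExit };
+  }
+  // 2. Fallback: tunnel through IAP when the direct connection fails.
+  const iapExit = run("gcloud", [
+    "compute", "ssh", target.instance,
+    "--zone", target.zone,
+    "--tunnel-through-iap",
+    "--command", remoteCommand,
+  ]);
+  return { path: "iap", exitCode: iapExit };
+}
+```
+
+Injecting the runner keeps the fallback logic testable without a network; in practice it could wrap `spawnSync` from `node:child_process`.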
+
+## Maintenance
+- **Rebuilds**: If the environment drifts or the image updates, delete the VM and re-run the `provision` action.
+- **Status**: The Mission Control dashboard derives state by scanning host `tmux` sessions and container filesystem logs.
@@ -0,0 +1,49 @@
+# Network Architecture & Troubleshooting Research
+
+This document captures the empirical research and final configuration settled upon for the Gemini CLI Workspace system, specifically addressing the challenges of connecting from corporate environments to private GCP workers.
+
+## 🔍 The Challenge
+The goal was to achieve **Direct internal SSH** access to GCE workers that have **no public IP addresses**, allowing for high-performance file synchronization (`rsync`) and interactive sessions without the overhead of `gcloud` wrappers.
+
+## 🧪 What Was Tested
+
+### 1. Standard Internal DNS (`<instance>.<zone>.c.<project>.internal`)
+- **Result**: ❌ FAILED
+- **Observation**: Standard GCE internal DNS suffixes often fail to resolve or route correctly from local workstations in certain corporate environments, even when VPN/Peering is active.
+
+### 2. IAP Tunneling (`gcloud compute ssh --tunnel-through-iap`)
+- **Result**: ⚠️ INCONSISTENT
+- **Observation**: While IAP is the standard fallback for private VMs, it failed with "failed to connect to backend" (4003) when the underlying VPC network lacked proper configuration or when firewall rules were misaligned with the specific network interface.
+
+### 3. Custom "Auto" Networks
+- **Result**: ❌ FAILED
+- **Observation**: Creating a fresh VPC with default "auto" settings was insufficient. The "magic" corporate routing paths did not automatically extend to these new, isolated networks.
+
+## ✅ The Final State (The "Magic" Configuration)
+
+Through comparison with the working `gemini-cli-team-quota` project and empirical testing in a sandbox, we settled on the following requirements:
+
+### 1. Hostname Construction
+The system **MUST** use the following specific hostname pattern for direct internal reachability:
+`nic0.<instance>.<zone>.c.<project>.internal.gcpnode.com`
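+
+A minimal sketch of building this hostname from its parts; `GceInstanceRef` and `directHostname` are hypothetical names used only for illustration.
+
+```typescript
+// Hypothetical shape: the three values that identify a GCE worker.
+interface GceInstanceRef {
+  instance: string;
+  zone: string;
+  project: string;
+}
+
+// Follows the pattern above: nic0.<instance>.<zone>.c.<project>.internal.gcpnode.com
+function directHostname({ instance, zone, project }: GceInstanceRef): string {
+  return `nic0.${instance}.${zone}.c.${project}.internal.gcpnode.com`;
+}
+```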
+
+### 2. VPC Configuration
+The VPC (e.g., `iap-vpc`) must be a **Custom Mode** network with the following properties:
+- **Private Google Access**: MUST be enabled on the subnetwork. This allows the private VM to communicate with Google services (like Artifact Registry) without a public IP.
+- **Firewall Rule**: An ingress rule allowing `tcp:22` from `0.0.0.0/0`.
+    - *Note*: While `0.0.0.0/0` seems broad, in this context it is typically restricted by the corporate-level gateway/peering that provides the `internal.gcpnode.com` route.
+
+### 3. Worker Provider Abstraction
+To manage this complexity, we implemented a `WorkerProvider` architecture:
+- **`BaseProvider`**: Defines a common interface for `exec`, `sync`, and `provision`.
+- **`GceCosProvider`**: Encapsulates the GCE-specific "magic" (hostname construction, IAP fallbacks, COS startup scripts).
+
+## 🛠️ Why This Works
+This configuration aligns with the **Google Corporate Direct-Access** pattern. By using the `nic0` prefix and the `.gcpnode.com` suffix, the connection is routed through internal corporate proxies that recognize the authenticated developer identity and permit the direct SSH handshake to the private IP.
+
+## 📜 Technical Metadata Summary
+- **Network**: `iap-vpc` (Custom)
+- **Subnet**: `iap-subnet` (Private Google Access: Enabled)
+- **Identity**: OS Login (`enable-oslogin=TRUE`)
+- **Image**: Container-Optimized OS (COS)
+- **Connectivity**: Direct SSH via `nic0` -> Automatic Fallback to IAP.
@@ -0,0 +1,60 @@
+# Mission: GCE Container-First Refactor 🚀
+
+## Current State
+- **Architecture**: Persistent GCE VM (`gcli-workspace-mattkorwel`) with Fast-Path SSH (`gcli-worker`).
+- **Logic**: Decoupled scripts in `~/.workspace/scripts`, using Git Worktrees for concurrency.
+- **Auth**: Scoped GitHub PATs mirrored via setup.
+
+## The Goal (Container-OS Transition)
+Shift from a "Manual VM" to an "Invisible VM" (Container-Optimized OS) that runs our Sandbox Docker image directly.
+
+## Planned Changes
+1. **Multi-Stage Dockerfile**: ✅ VERIFIED
+   - Optimize `.gcp/Dockerfile.maintainer` to include `tsx`, `vitest`, `gh`, and system dependencies (`libsecret`, `build-essential`).
+   - *Verified locally: Node v20, GH CLI, Git, TSX, and Vitest are functional with required headers.*
+2. **Dedicated Pipeline**:
+   - Use `.gcp/maintainer-worker.yml` for isolated builds.
+   - **Tagging Strategy**: 
+     - `latest`: Automatically updated on every merge to `main`.
+     - `branch-name`: Created on-demand for PRs via `/gcbrun` comment.
+3. **Setup Script (`setup.ts`)**:
+   - Refactor `provision` to use `gcloud compute instances create-with-container`.
+   - Point to the new `maintainer` image in Artifact Registry.
+4. **Orchestrator (`orchestrator.ts`)**:
+   - Update SSH logic to include the `--container` flag.
+
+## GCP Console Setup (Two Triggers)
+
+### Trigger 1: Production Maintainer Image (Automatic)
+1. **Event**: Push to branch.
+2. **Branch**: `^main$`.
+3. **Configuration**: Point to `.gcp/maintainer-worker.yml`.
+4. **Purpose**: Keeps the stable "Golden Image" up to date for daily use.
+
+### Trigger 2: On-Demand Testing (Comment-Gated)
+1. **Event**: Pull request.
+2. **Base Branch**: `^main$`.
+3. **Comment Control**: Set to **"Required"** (e.g. `/gcbrun`).
+4. **Configuration**: Point to `.gcp/maintainer-worker.yml`.
+5. **Purpose**: Allows developers to test infrastructure changes before merging.
+
+## Phase 2: Refactoring setup.ts for Container-OS
+This phase is currently **ARCHIVED** in favor of the Persistent Workstation model. 
+
+### Implementation Logic (Snapshot)
+The orchestrator should launch isolated containers using this pattern:
+```bash
+docker run --rm -it \
+  --name workspace-job-id \
+  -v ~/dev/worktrees/job-id:/home/node/dev/worktree:rw \
+  -v ~/dev/main:/home/node/dev/main:ro \
+  -v ~/.gemini:/home/node/.gemini:ro \
+  -w /home/node/dev/worktree \
+  maintainer-image:latest \
+  sh -c "tsx ~/.workspace/scripts/entrypoint.ts ..."
+```
+
+## How to Resume
+1. Review the archived container-launch logic above.
+2. Update `setup.ts` to use `gcloud compute instances create-with-container`.
+3. Update `orchestrator.ts` to use `docker run` instead of standard `ssh`.
@@ -0,0 +1,107 @@
+# Workspace maintainer skill
+
+The `workspace` skill provides a high-performance, parallelized workflow for
+moving intensive developer tasks to a remote workstation. It leverages a
+Node.js orchestrator to run complex validation playbooks concurrently in a 
+dedicated terminal window.
+
+## Why use workspace?
+
+As a maintainer, you eventually reach the limits of how much work you can manage
+at once on a single local machine. Heavy builds, concurrent test suites, and
+multiple PRs in flight can quickly overload local resources, leading to 
+performance degradation and developer friction.
+
+While manual remote management is a common workaround, it is often cumbersome
+and context-heavy. The `workspace` skill addresses these challenges by providing:
+
+-   **Elastic compute**: Move resource-intensive build and lint suites to a
+    beefy remote workstation, keeping your local machine responsive.
+-   **Context preservation**: The main Gemini session remains interactive and
+    focused on high-level reasoning while automated tasks provide real-time
+    feedback in a separate window.
+-   **Automated orchestration**: The skill handles worktree provisioning, 
+    script synchronization, and environment isolation automatically.
+-   **True parallelism**: Infrastructure validation, CI checks, and behavioral 
+    proofs run simultaneously, compressing a 15-minute process into 3 minutes.
+
+## Agentic skills: Sync or Workspace
+
+The `workspace` system is designed to work in synergy with specialized agentic 
+skills. These skills can be run **synchronously** in your current terminal for
+quick tasks, or **moved** to a remote session for complex, iterative loops.
+
+-   **`review-pr`**: Conducts high-fidelity, behavioral code reviews. It assumes 
+    the infrastructure is already validated and focuses on physical proof of 
+    functionality.
+-   **`fix-pr`**: An autonomous "Fix-to-Green" loop. It iteratively addresses 
+    CI failures, merge conflicts, and review comments until the PR is mergeable.
+
+When you run `npm run workspace <PR> fix`, the orchestrator provisions the remote 
+environment and then launches a Gemini CLI session specifically powered by the
+`fix-pr` skill.
+
+## Architecture: The Hybrid Powerhouse
+
+The workspace system uses a **Hybrid VM + Docker** architecture designed for maximum performance and reliability:
+
+1.  **The GCE VM (Raw Power)**: By running on high-performance Google Compute Engine instances, we move heavy CPU and RAM tasks (like full project builds and massive test suites) off your local machine, keeping your primary workstation responsive.
+2.  **The Docker Container (Consistency & Resilience)**:
+    *   **Source of Truth**: The `.gcp/Dockerfile.maintainer` defines the exact environment. If a tool is added there, every maintainer gets it instantly.
+    *   **Zero Drift**: Containers are immutable. Every job starts in a fresh state, preventing the "OS rot" that typically affects persistent VMs.
+    *   **Local-to-Remote Parity**: The same image can be run locally on your Mac or remotely in GCP, ensuring that "it works on my machine" translates 100% to the remote worker.
+    *   **Safe Multi-tenancy**: Using Git Worktrees inside an isolated container environment allows multiple jobs to run in parallel without sharing state or polluting the host system.
+
+## Playbooks
+
+-   **`review`** (default): Build, CI check, static analysis, and behavioral proofs.
+-   **`fix`**: Iterative fixing of CI failures and review comments.
+-   **`ready`**: Final full validation (clean install + preflight) before merge.
+-   **`open`**: Provision a worktree and drop directly into a remote tmux session.
+
+## Scenarios and workflows
+
+### Getting Started (Onboarding)
+For a complete guide on setting up your remote environment, see the [Maintainer Onboarding Guide](../../../MAINTAINER_ONBOARDING.md).
+
+### Persistence and Job Recovery
+
+The workspace system is designed for high reliability and persistence. Jobs use a nested execution model to ensure they continue running even if your local terminal is closed or the connection is lost.
+
+### How it Works
+1.  **Host-Level Persistence**: The orchestrator launches each job in a named **`tmux`** session on the remote VM.
+2.  **Container Isolation**: The actual work is performed inside the persistent `maintainer-worker` Docker container.
+
+### Re-attaching to a Job
+If you lose your connection, you can easily resume your session:
+
+-   **Automatic**: Simply run the exact same command you started with (e.g., `npm run workspace 123 review`). The system will automatically detect the existing session and re-attach you.
+-   **Manual**: Use `npm run workspace:status` to find the session name, then use `ssh gcli-worker` to jump into the VM and `tmux attach -t <session>` to resume.
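+
+The nested model and the automatic re-attach could be expressed as a single remote command, sketched below. The session naming scheme and the helper names are assumptions; the one load-bearing detail is `tmux new-session -A`, which attaches to the named session if it already exists instead of creating a second one.
+
+```typescript
+// Hypothetical naming scheme: one tmux session per PR and playbook.
+function sessionName(pr: number, action: string): string {
+  return `workspace-${pr}-${action}`;
+}
+
+// Quote a string for a POSIX shell.
+function shellQuote(value: string): string {
+  return `'${value.replace(/'/g, `'\\''`)}'`;
+}
+
+// Host-level tmux session wrapping work done inside the persistent container.
+function buildRemoteLaunch(pr: number, action: string, jobCommand: string): string {
+  const inner = `docker exec -it maintainer-worker sh -c ${shellQuote(jobCommand)}`;
+  return `tmux new-session -A -s ${sessionName(pr, action)} ${shellQuote(inner)}`;
+}
+```
+
+Because the session name is derived from the same arguments the user typed, re-running the same command resolves to the same session.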
+
+## Technical details
+
+This skill uses a **Worker Provider** abstraction (`GceCosProvider`) to manage the remote lifecycle. It uses an isolated Gemini profile on the remote host (`~/.workspace/gemini-cli-config`) to ensure that verification tasks do not interfere with your primary configuration.
+
+### Directory structure
+- `scripts/providers/`: Modular worker implementations (GCE, etc.).
+- `scripts/orchestrator.ts`: Local orchestrator (syncs scripts and pops terminal).
+- `scripts/worker.ts`: Remote engine (provisions worktree and runs playbooks).
+- `scripts/check.ts`: Local status poller.
+- `scripts/clean.ts`: Remote cleanup utility.
+- `SKILL.md`: Instructional body used by the Gemini CLI agent.
+
+## Contributing
+
+If you want to improve this skill:
+1. Modify the TypeScript scripts in `scripts/`.
+2. Update `SKILL.md` if the agent's instructions need to change.
+3. Test your changes locally using `npm run workspace <PR>`.
+
+## Testing
+
+The orchestration logic for this skill is fully tested. To run the tests:
+```bash
+npx vitest .gemini/skills/workspace/tests/orchestration.test.ts
+```
+These tests mock the external environment (SSH, GitHub CLI, and the file system) to ensure that the orchestration scripts generate the correct commands and handle environment isolation accurately.
+
@@ -0,0 +1,39 @@
+---
+name: workspaces
+description: Expertise in managing and utilizing Gemini Workspaces for high-performance remote development tasks.
+---
+
+# Gemini Workspaces Skill
+
+This skill enables the agent to utilize **Gemini Workspaces**—a high-performance, persistent remote development platform. It allows the agent to move intensive tasks (PR reviews, complex repairs, full builds) from the local environment to a dedicated cloud worker.
+
+## 🛠️ Key Capabilities
+1. **Persistent Execution**: Jobs run in remote `tmux` sessions. Disconnecting or crashing the local terminal does not stop the remote work.
+2. **Parallel Infrastructure**: The agent can launch a heavy task (like a full build or CI run) in a workspace while continuing to assist the user locally.
+3. **Behavioral Fidelity**: Remote workers have full tool access (Git, Node, Docker, etc.) and high-performance compute, allowing the agent to provide behavioral proofs of its work.
+
+## 📋 Instructions for the Agent
+
+### When to use Workspaces
+- **Intensive Tasks**: Full preflight runs, large-scale refactors, or deep PR reviews.
+- **Persistent Logic**: When a task is expected to take longer than a few minutes and needs to survive local connection drops.
+- **Environment Isolation**: When you need a clean, high-performance environment to verify a fix without polluting the user's local machine.
+
+### How to use Workspaces
+1. **Setup**: If the user hasn't initialized their environment, instruct them to run `npm run workspace:setup`.
+2. **Launch**: Use the `workspace` command to start a playbook:
+   ```bash
+   npm run workspace <PR_NUMBER> [action]
+   ```
+   - Actions: `review` (default), `fix`, `ready`.
+3. **Check Status**: See global state and active sessions with `npm run workspace:status`, or deep-dive into specific PR logs with `npm run workspace:check <PR_NUMBER>`.
+4. **Cleanup**: 
+   - **Bulk**: Clear all sessions/worktrees with `npm run workspace:clean-all`.
+   - **Surgical**: Kill a specific PR task with `npm run workspace:kill <PR_NUMBER> <action>`.
+5. **Fleet**: Manage VM lifecycle with `npm run workspace:fleet [stop|provision|list]`.
+
+## ⚠️ Important Constraints
+- **Absolute Paths**: Always use absolute paths (e.g., `/mnt/disks/data/...`) when orchestrating remote commands.
+- **npx tsx**: When running scripts manually from the skill directory, always prefix with `npx tsx` to ensure dependencies are available.
+- **Be Behavioral**: Prioritize results from live execution (behavioral proofs) over static reading.
+- **Multi-tasking**: Remind the user they can continue chatting in the main window while the heavy workspace task runs in the separate terminal window.
@@ -0,0 +1,46 @@
+# Plan: Worker Provider Abstraction for Workspace System
+
+## Objective
+Abstract the remote execution infrastructure (GCE COS, GCE Linux, Cloud Workstations) behind a common `WorkerProvider` interface. This eliminates infrastructure-specific prompts (like "use container mode") and makes the system extensible to new backends.
+
+## Architectural Changes
+
+### 1. New Provider Abstraction
+Create a modular provider system where each infrastructure type implements a standard interface.
+- **Base Interface**: `WorkerProvider` (methods for `exec`, `sync`, `provision`, `getStatus`).
+- **Implementations**:
+    - `GceCosProvider`: Handles COS with Cloud-Init and `docker exec` wrapping.
+    - `GceLinuxProvider`: Handles standard Linux VMs with direct execution.
+    - `LocalDockerProvider`: (Future) Runs workspace tasks in a local container.
+    - `WorkstationProvider`: (Future) Integrates with Google Cloud Workstations.
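+
+A rough sketch of what the base interface and a COS implementation might look like. Only the four method names come from this plan; the signatures, the `WorkerState` type, and the dry-run class are assumptions for illustration.
+
+```typescript
+type WorkerState = "running" | "stopped" | "unknown";
+
+// Method names from the plan; parameter and return types are assumed.
+interface WorkerProvider {
+  exec(command: string): Promise<number>; // resolves to an exit code
+  sync(localDir: string, remoteDir: string): Promise<number>;
+  provision(): Promise<number>;
+  getStatus(): Promise<WorkerState>;
+}
+
+// Quote a string for a POSIX shell.
+function shellQuote(value: string): string {
+  return `'${value.replace(/'/g, `'\\''`)}'`;
+}
+
+// Dry-run COS provider: records what it would send instead of running it.
+class DryRunCosProvider implements WorkerProvider {
+  readonly sent: string[] = [];
+
+  async exec(command: string): Promise<number> {
+    // On COS the toolchain lives in the container, so wrap with `docker exec`.
+    this.sent.push(`docker exec maintainer-worker sh -c ${shellQuote(command)}`);
+    return 0;
+  }
+
+  async sync(localDir: string, remoteDir: string): Promise<number> {
+    this.sent.push(`rsync -az ${localDir}/ gcli-worker:${remoteDir}/`);
+    return 0;
+  }
+
+  async provision(): Promise<number> {
+    this.sent.push("provision");
+    return 0;
+  }
+
+  async getStatus(): Promise<WorkerState> {
+    return "unknown";
+  }
+}
+```
+
+A `GceLinuxProvider` would differ mainly in `exec`, passing the command through unwrapped.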
+
+### 2. Auto-Discovery
+Modify `setup.ts` to:
+- Prompt for a high-level "Provider Type" (e.g., "Google Cloud (COS)", "Google Cloud (Linux)").
+- Auto-detect environment details where possible (e.g., fetching internal IPs, identifying container names).
+
+### 3. Clean Orchestration
+Refactor `orchestrator.ts` to be provider-agnostic:
+- It asks the provider to "Ensure Ready" (wake VM).
+- It asks the provider to "Prepare Environment" (worktree setup).
+- It asks the provider to "Launch Task" (tmux initialization).
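+
+Under these assumptions, the provider-agnostic flow reduces to three calls in a fixed order. The method names below mirror the quoted steps but are hypothetical, and a real orchestrator would most likely be asynchronous.
+
+```typescript
+// Hypothetical lifecycle surface, one method per step above.
+interface JobProvider {
+  ensureReady(): void; // wake the VM
+  prepareEnvironment(pr: number): void; // worktree setup
+  launchTask(pr: number, action: string): void; // tmux initialization
+}
+
+// The orchestrator never branches on the infrastructure type.
+function runJob(provider: JobProvider, pr: number, action: string): void {
+  provider.ensureReady();
+  provider.prepareEnvironment(pr);
+  provider.launchTask(pr, action);
+}
+
+// A recording fake shows the call order without touching any infrastructure.
+class RecordingProvider implements JobProvider {
+  readonly calls: string[] = [];
+  ensureReady(): void {
+    this.calls.push("ensureReady");
+  }
+  prepareEnvironment(pr: number): void {
+    this.calls.push(`prepareEnvironment:${pr}`);
+  }
+  launchTask(pr: number, action: string): void {
+    this.calls.push(`launchTask:${pr}:${action}`);
+  }
+}
+```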
+
+## Implementation Steps
+
+### Phase 1: Infrastructure Cleanup
+- Move existing procedural logic from `fleet.ts`, `setup.ts`, and `orchestrator.ts` into a new `providers/` directory.
+- Create `ProviderFactory` to instantiate the correct implementation based on `settings.json`.
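+
+A possible shape for the factory is sketched below. The `providerType` key and its values are assumptions about what `settings.json` might contain, and the provider classes are reduced to stubs.
+
+```typescript
+// Hypothetical slice of settings.json relevant to provider selection.
+interface WorkspaceSettings {
+  providerType: string;
+}
+
+interface Provider {
+  readonly name: string;
+}
+
+// Stubs standing in for the real implementations.
+class GceCosProvider implements Provider {
+  readonly name = "GceCosProvider";
+}
+
+class GceLinuxProvider implements Provider {
+  readonly name = "GceLinuxProvider";
+}
+
+function createProvider(settings: WorkspaceSettings): Provider {
+  switch (settings.providerType) {
+    case "gce-cos":
+      return new GceCosProvider();
+    case "gce-linux":
+      return new GceLinuxProvider();
+    default:
+      // Fail loudly rather than silently picking a default backend.
+      throw new Error(`Unknown provider type: ${settings.providerType}`);
+  }
+}
+```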
+
+### Phase 2: Refactor Scripts
+- **`fleet.ts`**: Proxy all actions (`provision`, `rebuild`, `stop`) to the provider.
+- **`orchestrator.ts`**: Use the provider for the entire lifecycle of a job.
+- **`status.ts`**: Use the provider's `getStatus()` method to derive state.
+
+### Phase 3: Validation
+- Verify that the `gcli-worker` SSH alias and IAP tunneling remain functional.
+- Ensure "Fast-Path SSH" is still the primary interactive gateway.
+
+## Verification
+- Run `npm run workspace:fleet provision` and ensure it creates a COS-native worker.
+- Run `npm run workspace:setup` and verify it no longer asks cryptic infrastructure questions.
+- Launch a review and verify it uses `docker exec` internally for the COS provider.