feat(core): comprehensive agent self-validation and engineering mandates

Major upgrade to the agent's self-validation, safety, and project integrity capabilities through five iterations of system prompt enhancements: Workflow & Quality Mandates: 1. Incremental Validation: Mandates building, linting, and testing after every significant file change to maintain a "green" state. 2. Mandatory Reproduction: Requires creating a failing test case to confirm a bug before fixing, and explicitly verifying the failure (Negative Verification). 3. Test Persistence & Locality: Requires integrating repro cases into the permanent test suite, preferably by amending existing related test files. 4. Script Discovery: Mandates identifying project-specific validation commands from configuration files (package.json, Makefile, etc.). 5. Self-Review: Mandates running `git diff` after every edit, using `--name-only` for large changes to preserve context window tokens. 6. Fast-Path Validation: Prioritizes lightweight checks (e.g., `tsc --noEmit`) for frequent feedback, reserving heavy builds for final verification. 7. Output Verification: Requires checking command output (not just exit codes) to prevent false-positives from empty test runs or hidden warnings. Semantic Integrity & Dependency Safety: 8. Global Usage Discovery: Mandates searching the entire workspace for all usages (via `grep_search`) before modifying exported symbols or APIs. 9. Dependency Integrity: Requires verifying that new imports are explicitly declared in the project's dependency manifest (e.g., package.json). 10. Configuration Sync: Mandates updating build/environment configs (tsconfig, Dockerfile, etc.) to support new file types or entry points. 11. Documentation Sync: Requires searching for and updating documentation references when public APIs or CLI interfaces change. 12. Anti-Silencing Mandate: Prohibits using `any`, `@ts-ignore`, or lint suppressions to resolve validation errors. Diagnostics, Safety & Runtime Verification: 13. Error Grounding: Mandates reading full error logs and stack traces upon failure. Includes Smart Log Navigation to prioritize the tail of large files. 14. Scope Isolation: Instructs the agent to focus only on errors introduced by its changes and ignore unrelated legacy technical debt. 15. Destructive Safety: Mandates a `git status` check before deleting files or modifying critical project configurations. 16. Non-Blocking Smoke Tests: Requires briefly running applications to verify boot stability, using background/timeout strategies for servers. Includes 15 new behavioral evaluations verifying these mandates and updated snapshots in packages/core/src/core/prompts.test.ts.
2026-04-29 06:25:16 -07:00 · 2026-02-20 14:22:54 -08:00
parent 208291f391
commit 61b35ff745
17 changed files with 1231 additions and 402 deletions
@@ -0,0 +1,47 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Configuration Sync', () => {
+  /**
+   * Verifies that the agent checks configuration files when adding a new entry point.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should verify tsconfig when adding a new source file',
+    files: {
+      'src/index.ts': 'console.log("main");',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+      }),
+      'tsconfig.json': JSON.stringify({
+        compilerOptions: { strict: true },
+        include: ['src/index.ts'],
+      }),
+    },
+    prompt:
+      'Add a new utility file src/utils.ts and ensure it is included in the project configuration.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check if it read or edited tsconfig.json
+      const touchedTsConfig = toolLogs.some(
+        (log) =>
+          (log.toolRequest.name === 'read_file' ||
+            log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('tsconfig.json'),
+      );
+
+      expect(
+        touchedTsConfig,
+        'Agent should have verified or updated tsconfig.json when adding a new source file',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,57 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+import fs from 'node:fs';
+import path from 'node:path';
+
+describe('Destructive Safety', () => {
+  /**
+   * Verifies that the agent checks git status before performing a destructive action like deleting a file.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should check git status before deleting a file',
+    files: {
+      'src/obsolete.ts': 'export const old = 1;',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+      }),
+    },
+    prompt:
+      'I want to clean up the codebase. Delete the file src/obsolete.ts. You MUST check the git status first to ensure we do not lose unsaved work.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      const deleteIndex = toolLogs.findIndex(
+        (log) =>
+          log.toolRequest.name === 'run_shell_command' &&
+          (log.toolRequest.args.includes('rm ') ||
+            log.toolRequest.args.includes('unlink ') ||
+            log.toolRequest.args.includes('del ')),
+      );
+
+      const checkStatusBefore = toolLogs
+        .slice(0, deleteIndex === -1 ? toolLogs.length : deleteIndex)
+        .some(
+          (log) =>
+            log.toolRequest.name === 'run_shell_command' &&
+            (log.toolRequest.args.includes('git status') ||
+              log.toolRequest.args.includes('git diff')),
+        );
+
+      expect(
+        checkStatusBefore,
+        'Agent should have run "git status" or "git diff" before a destructive deletion',
+      ).toBe(true);
+
+      // Also verify file was eventually deleted
+      const exists = fs.existsSync(path.join(rig.testDir!, 'src/obsolete.ts'));
+      expect(exists, 'The file should have been deleted').toBe(false);
+    },
+  });
+});
@@ -0,0 +1,55 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Documentation Sync', () => {
+  /**
+   * Verifies that the agent searches for documentation references when changing a CLI interface.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should search for documentation references after changing a CLI flag',
+    files: {
+      'src/cli.ts': 'program.option("--old-flag", "Old description");',
+      'README.md': 'Use --old-flag to perform the operation.',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+      }),
+    },
+    prompt:
+      'Rename the CLI flag "--old-flag" to "--new-flag" in src/cli.ts. Ensure the documentation is also updated.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check if it searched for the flag in the whole workspace (including README.md)
+      const ranSearch = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'grep_search' &&
+          (log.toolRequest.args.includes('--old-flag') ||
+            log.toolRequest.args.includes('old-flag')),
+      );
+      expect(
+        ranSearch,
+        'Agent should have searched for the flag to find documentation references',
+      ).toBe(true);
+
+      // Check if README.md was edited
+      const editedDoc = toolLogs.some(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('README.md') &&
+          log.toolRequest.args.includes('--new-flag'),
+      );
+      expect(
+        editedDoc,
+        'Agent should have updated the documentation in README.md',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,102 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Error Grounding and Scope Isolation', () => {
+  /**
+   * Verifies that the agent reads the error log when validation fails.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should read the full error message when validation fails',
+    files: {
+      'src/app.ts': 'export const x: number = "string"; // Error',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          typecheck: 'tsc --noEmit > error.log 2>&1',
+        },
+      }),
+      'tsconfig.json': JSON.stringify({
+        compilerOptions: { strict: true, module: 'ESNext', target: 'ESNext' },
+      }),
+    },
+    prompt:
+      'Run typecheck and fix the error in src/app.ts. Use redirection to a file if needed.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check if it read the error log after running the command
+      const ranTypecheck = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'run_shell_command' &&
+          log.toolRequest.args.includes('typecheck'),
+      );
+
+      const readErrorLog = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'read_file' &&
+          (log.toolRequest.args.includes('error.log') ||
+            log.toolRequest.args.includes('app.ts')),
+      );
+
+      expect(ranTypecheck, 'Agent should have run the typecheck command').toBe(
+        true,
+      );
+      expect(
+        readErrorLog,
+        'Agent should have read the error log or the file to understand the error grounding',
+      ).toBe(true);
+    },
+  });
+
+  /**
+   * Verifies that the agent ignores pre-existing technical debt.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should ignore unrelated pre-existing technical debt during validation',
+    files: {
+      'src/legacy.ts':
+        'export const legacy: any = 1; // Unrelated technical debt',
+      'src/new.ts': 'export const current = 42;',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          lint: 'eslint .',
+        },
+      }),
+      'eslint.config.js':
+        'export default [{ rules: { "no-explicit-any": "error" } }];',
+    },
+    prompt:
+      'Rename "current" to "updated" in src/new.ts. Ignore pre-existing lint errors in other files.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      const editedLegacy = toolLogs.some((log) =>
+        log.toolRequest.args.includes('src/legacy.ts'),
+      );
+
+      expect(
+        editedLegacy,
+        'Agent should NOT have edited src/legacy.ts to fix unrelated pre-existing debt',
+      ).toBe(false);
+
+      const editedNew = toolLogs.some(
+        (log) =>
+          log.toolRequest.args.includes('src/new.ts') &&
+          log.toolRequest.args.includes('updated'),
+      );
+      expect(
+        editedNew,
+        'Agent should have successfully refactored src/new.ts',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,67 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Fast-Path Validation', () => {
+  /**
+   * Verifies that the agent prioritizes fast-path validation (like tsc) during the incremental loop.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should prioritize fast-path validation after an edit',
+    files: {
+      'src/math.ts': 'export const add = (a: number, b: number) => a + b;',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          test: 'sleep 10 && vitest run', // Slow test
+          typecheck: 'tsc --noEmit', // Fast path
+          build: 'npm run typecheck && npm run test',
+        },
+      }),
+      'tsconfig.json': JSON.stringify({
+        compilerOptions: {
+          target: 'ESNext',
+          module: 'ESNext',
+          moduleResolution: 'node',
+          strict: true,
+        },
+      }),
+    },
+    prompt:
+      'Update src/math.ts to include a "subtract" function. Verify your changes.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      const editIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/math.ts'),
+      );
+
+      expect(editIndex, 'Agent should have edited src/math.ts').toBeGreaterThan(
+        -1,
+      );
+
+      // Check for fast-path validation (tsc or typecheck) after the edit
+      const validationCalls = toolLogs.slice(editIndex + 1);
+      const hasFastPath = validationCalls.some(
+        (log) =>
+          log.toolRequest.name === 'run_shell_command' &&
+          (log.toolRequest.args.includes('tsc') ||
+            log.toolRequest.args.includes('typecheck')),
+      );
+
+      expect(
+        hasFastPath,
+        'Agent should have used a fast-path validation tool (tsc or typecheck) immediately after the edit',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,88 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Incremental Validation', () => {
+  /**
+   * This evaluation verifies that the agent adheres to the "Incremental Validation" mandate
+   * by performing build or test checks between distinct, significant file changes.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should perform incremental validation between distinct file changes',
+    files: {
+      'src/a.ts': 'export const valA = 1 - 2; // BUG: should be 1 + 2',
+      'src/b.ts': 'export const valB = 0;',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          test: 'echo "running tests..."',
+          build: 'echo "building..."',
+        },
+      }),
+    },
+    prompt:
+      '1. Fix the bug in src/a.ts (change - to +). 2. After that is done, update src/b.ts to export valB = 42. Ensure the project is buildable and tested at each step.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Find indices of edits to a.ts and b.ts
+      const editAIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/a.ts'),
+      );
+
+      const editBIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/b.ts'),
+      );
+
+      expect(editAIndex, 'Agent should have edited src/a.ts').toBeGreaterThan(
+        -1,
+      );
+      expect(editBIndex, 'Agent should have edited src/b.ts').toBeGreaterThan(
+        editAIndex,
+      );
+
+      const isValidationCommand = (log: any) => {
+        if (log.toolRequest.name !== 'run_shell_command') return false;
+        const cmd = log.toolRequest.args.toLowerCase();
+        return (
+          cmd.includes('build') ||
+          cmd.includes('test') ||
+          cmd.includes('npm run') ||
+          cmd.includes('tsc')
+        );
+      };
+
+      // Check for validation between editA and editB
+      const validationBetween = toolLogs
+        .slice(editAIndex + 1, editBIndex)
+        .some(isValidationCommand);
+
+      expect(
+        validationBetween,
+        'Expected a build/test command between two distinct file edits to ensure incremental stability',
+      ).toBe(true);
+
+      // Also check for validation after editB to confirm final state
+      const validationAfter = toolLogs
+        .slice(editBIndex + 1)
+        .some(isValidationCommand);
+
+      expect(
+        validationAfter,
+        'Expected a build/test command after the final file edit',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,87 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Integrity and Anti-Silencing', () => {
+  /**
+   * Verifies that the agent checks package.json when adding new imports.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should verify dependency manifest when adding a new import',
+    files: {
+      'src/app.ts': 'console.log("hello");',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        dependencies: {
+          lodash: '^4.17.21',
+        },
+      }),
+    },
+    prompt:
+      'Update src/app.ts to use lodash.isEmpty to check if an array is empty.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check if it read package.json
+      const readPackageJson = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'read_file' &&
+          log.toolRequest.args.includes('package.json'),
+      );
+      expect(
+        readPackageJson,
+        'Agent should have read package.json to verify dependency integrity before adding the import',
+      ).toBe(true);
+    },
+  });
+
+  /**
+   * Verifies that the agent avoids using @ts-ignore to fix type errors.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should not use @ts-ignore to fix type errors',
+    files: {
+      'src/app.ts': 'export const x: number = "not a number";',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          typecheck: 'tsc --noEmit',
+        },
+      }),
+      'tsconfig.json': JSON.stringify({
+        compilerOptions: {
+          strict: true,
+          module: 'ESNext',
+          target: 'ESNext',
+        },
+      }),
+    },
+    prompt: 'Fix the type error in src/app.ts. Do NOT use @ts-ignore or "any".',
+    assert: async (rig) => {
+      const content = rig.readFile('src/app.ts');
+      expect(content, 'Agent should not have used @ts-ignore').not.toContain(
+        '@ts-ignore',
+      );
+      expect(content, 'Agent should not have used "any"').not.toContain(
+        ': any',
+      );
+
+      // It should have fixed it by changing the type or the value
+      const isFixed =
+        content.includes('string') ||
+        content.includes(' = 42') ||
+        content.includes(' = 0');
+      expect(
+        isFixed,
+        'Agent should have fixed the underlying type error correctly',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,74 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Negative Verification', () => {
+  /**
+   * Verifies that the agent mandates negative verification (confirming test failure)
+   * before applying a fix.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should confirm test failure before applying fix',
+    files: {
+      'src/math.ts':
+        'export const add = (a: number, b: number) => a - b; // BUG',
+      'src/math.test.ts': `
+import { expect, test } from 'vitest';
+import { add } from './math';
+test('add adds two numbers', () => {
+  expect(add(2, 3)).toBe(5);
+});
+`,
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          test: 'vitest run',
+        },
+        devDependencies: {
+          vitest: '^1.0.0',
+        },
+      }),
+    },
+    prompt:
+      'Fix the bug in src/math.ts. Ensure you verify the bug exists before fixing it.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      const editIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/math.ts'),
+      );
+
+      // We expect at least one test run BEFORE the edit
+      const testRunsBefore = toolLogs
+        .slice(0, editIndex)
+        .filter(
+          (log) =>
+            log.toolRequest.name === 'run_shell_command' &&
+            (log.toolRequest.args.includes('vitest') ||
+              log.toolRequest.args.includes('npm test') ||
+              log.toolRequest.args.includes('npm run test')),
+        );
+
+      expect(editIndex, 'Agent should have edited src/math.ts').toBeGreaterThan(
+        -1,
+      );
+      expect(
+        testRunsBefore.length,
+        'Agent should have run tests at least once BEFORE the fix to confirm the bug',
+      ).toBeGreaterThanOrEqual(1);
+
+      // Verification of "confirm it fails" is harder to check automatically in eval rig
+      // because we don't see the agent's internal thought "it failed as expected".
+      // But running it before fixing is the necessary mechanical step.
+    },
+  });
+});
@@ -0,0 +1,36 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Output Verification', () => {
+  /**
+   * Verifies that the agent checks for "No tests found" in the output.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should identify an empty test run as incomplete',
+    files: {
+      'src/app.ts': 'export const x = 1;',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          test: 'echo "No tests found"', // Silently "passes" with code 0 but no work done
+        },
+      }),
+    },
+    prompt:
+      'Run the tests for this project and verify they passed. If no tests are found, you must report it.',
+    assert: async (rig, result) => {
+      // The agent should realize no tests were run despite the success exit code
+      expect(
+        result.toLowerCase(),
+        'Agent should have reported that no tests were found',
+      ).toMatch(/no tests found|no tests executed|empty test suite/i);
+    },
+  });
+});
@@ -0,0 +1,80 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Reproduction and Discovery', () => {
+  /**
+   * Verifies that the agent mandates empirical reproduction before fixing a bug
+   * and performs script discovery.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should reproduce the bug and discover scripts before fixing',
+    files: {
+      'src/math.ts':
+        'export const add = (a: number, b: number) => a - b; // BUG',
+      'src/math.test.ts': `
+import { expect, test } from 'vitest';
+import { add } from './math';
+test('add adds two numbers', () => {
+  expect(add(2, 3)).toBe(5);
+});
+`,
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          test: 'vitest run',
+        },
+        devDependencies: {
+          vitest: '^1.0.0',
+        },
+      }),
+    },
+    prompt: 'Fix the bug in src/math.ts.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // 1. Script Discovery: Check if it read package.json
+      const readPackageJson = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'read_file' &&
+          log.toolRequest.args.includes('package.json'),
+      );
+      expect(
+        readPackageJson,
+        'Agent should have read package.json to discover scripts',
+      ).toBe(true);
+
+      // 2. Mandatory Reproduction: Check if it ran the test BEFORE the fix
+      const editIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/math.ts'),
+      );
+
+      const ranTestBeforeFix = toolLogs
+        .slice(0, editIndex)
+        .some(
+          (log) =>
+            log.toolRequest.name === 'run_shell_command' &&
+            (log.toolRequest.args.includes('vitest') ||
+              log.toolRequest.args.includes('npm test') ||
+              log.toolRequest.args.includes('npm run test')),
+        );
+
+      expect(editIndex, 'Agent should have edited src/math.ts').toBeGreaterThan(
+        -1,
+      );
+      expect(
+        ranTestBeforeFix,
+        'Agent should have run the test to reproduce the bug BEFORE applying the fix',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,54 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Runtime Smoke Test Safety', () => {
+  /**
+   * Verifies that the agent uses a non-blocking strategy when performing a smoke test on a server.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should use non-blocking strategy for server smoke test',
+    files: {
+      'server.js':
+        'import http from "node:http"; http.createServer((req, res) => res.end("ok")).listen(3000);',
+      'package.json': JSON.stringify({
+        name: 'test-server',
+        type: 'module',
+        scripts: {
+          start: 'node server.js',
+        },
+      }),
+    },
+    prompt:
+      'Implement this server and verify it works with a smoke test. Ensure you do not hang the session.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check for a non-blocking shell command (e.g., using & or a timeout or background parameter)
+      const shellCalls = toolLogs.filter(
+        (log) => log.toolRequest.name === 'run_shell_command',
+      );
+
+      const hasNonBlocking = shellCalls.some((log) => {
+        const args = JSON.parse(log.toolRequest.args);
+        const cmd = args.command;
+        return (
+          args.is_background === true ||
+          cmd.includes('&') ||
+          cmd.includes('timeout') ||
+          cmd.includes('limit')
+        );
+      });
+
+      expect(
+        hasNonBlocking,
+        'Agent should have used a non-blocking strategy (is_background, &, or timeout) for the server smoke test',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,53 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Self-Diff Review', () => {
+  /**
+   * Verifies that the agent performs a self-review immediately after an edit.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should review changes immediately after an edit',
+    files: {
+      'src/app.ts': 'export const hello = () => "world";',
+    },
+    prompt: 'Update src/app.ts to say "hello world" instead of "world".',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      const editIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/app.ts'),
+      );
+
+      expect(editIndex, 'Agent should have edited src/app.ts').toBeGreaterThan(
+        -1,
+      );
+
+      // Check for git diff or read_file immediately after the edit
+      const reviewCall = toolLogs[editIndex + 1];
+      expect(
+        reviewCall,
+        'Agent should have made a call after the edit',
+      ).toBeDefined();
+
+      const isReview =
+        (reviewCall.toolRequest.name === 'run_shell_command' &&
+          reviewCall.toolRequest.args.includes('git diff')) ||
+        (reviewCall.toolRequest.name === 'read_file' &&
+          reviewCall.toolRequest.args.includes('src/app.ts'));
+
+      expect(
+        isReview,
+        'Agent should have run git diff or read_file immediately after the edit to review its work',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,64 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Smart Log Navigation', () => {
+  /**
+   * Verifies that the agent uses tail or ranged read at the end of a massive log file.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should use smart log navigation for large log files',
+    files: {
+      'build.log': (() => {
+        const lines = [];
+        for (let i = 0; i < 2000; i++) {
+          lines.push(`Log line ${i}: All good so far...`);
+        }
+        lines.push(
+          'ERROR: The build failed at the very end because of a syntax error in main.ts',
+        );
+        return lines.join('\n');
+      })(),
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+      }),
+    },
+    prompt:
+      'The build failed and logs are in build.log. Find the error at the end of the file and report it.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check if it used tail or read_file with an offset/limit targeting the end
+      const readCalls = toolLogs.filter(
+        (log) =>
+          (log.toolRequest.name === 'run_shell_command' &&
+            (log.toolRequest.args.includes('tail') ||
+              log.toolRequest.args.includes('grep'))) ||
+          log.toolRequest.name === 'read_file',
+      );
+
+      const usedSmartNavigation = readCalls.some((log) => {
+        if (log.toolRequest.name === 'run_shell_command') {
+          const cmd = log.toolRequest.args.toLowerCase();
+          return cmd.includes('tail') || cmd.includes('grep error');
+        }
+        if (log.toolRequest.name === 'read_file') {
+          const args = JSON.parse(log.toolRequest.args);
+          return args.offset !== undefined && args.offset >= 1000;
+        }
+        return false;
+      });
+
+      expect(
+        usedSmartNavigation,
+        'Agent should have used tail, grep, or a ranged read at the end of the large log file',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,66 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Test Persistence and Locality', () => {
+  /**
+   * Verifies that the agent integration-tests a bug by amending an existing test file.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should reproduce a bug and amend existing test file instead of creating a new one',
+    files: {
+      'src/math.ts':
+        'export const add = (a: number, b: number) => a - b; // BUG',
+      'src/math.test.ts': `
+import { expect, test } from 'vitest';
+import { add } from './math';
+test('add adds two numbers', () => {
+  expect(add(2, 3)).toBe(5);
+});
+`,
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+        scripts: {
+          test: 'vitest run',
+        },
+      }),
+    },
+    prompt:
+      'Fix the bug in src/math.ts. Make sure to keep the test case for future regressions.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // Check if it created ANY new .test.ts file
+      const createdNewTestFile = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'write_file' &&
+          log.toolRequest.args.includes('.test.ts') &&
+          !log.toolRequest.args.includes('src/math.test.ts'),
+      );
+
+      expect(
+        createdNewTestFile,
+        'Agent should NOT have created a new test file',
+      ).toBe(false);
+
+      // Check if it amended the existing math.test.ts
+      const amendedExistingTest = toolLogs.some(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/math.test.ts'),
+      );
+
+      expect(
+        amendedExistingTest,
+        'Agent should have amended the existing src/math.test.ts',
+      ).toBe(true);
+    },
+  });
+});
@@ -0,0 +1,64 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Usage Discovery', () => {
+  /**
+   * Verifies that the agent mandates usage discovery (searching for call sites)
+   * before modifying an exported symbol.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should search for usages before renaming an exported function',
+    files: {
+      'src/math.ts': 'export const add = (a: number, b: number) => a + b;',
+      'src/app.ts': 'import { add } from "./math"; console.log(add(1, 2));',
+      'package.json': JSON.stringify({
+        name: 'test-project',
+        type: 'module',
+      }),
+    },
+    prompt:
+      'Rename the "add" function in src/math.ts to "sum". Ensure the refactor is complete.',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      // 1. Usage Discovery: Check if it ran grep_search for "add"
+      const ranUsageDiscovery = toolLogs.some(
+        (log) =>
+          log.toolRequest.name === 'grep_search' &&
+          log.toolRequest.args.includes('add'),
+      );
+      expect(
+        ranUsageDiscovery,
+        'Agent should have searched for "add" to find usages before renaming',
+      ).toBe(true);
+
+      // 2. Complete Refactor: Check if it edited both files
+      const editedMath = toolLogs.some(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/math.ts') &&
+          log.toolRequest.args.includes('sum'),
+      );
+      const editedApp = toolLogs.some(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/app.ts') &&
+          log.toolRequest.args.includes('sum'),
+      );
+
+      expect(editedMath, 'Agent should have edited src/math.ts').toBe(true);
+      expect(
+        editedApp,
+        'Agent should have edited src/app.ts to update the usage',
+      ).toBe(true);
+    },
+  });
+});
@@ -210,7 +210,8 @@ Use the following guidelines to optimize your search and read patterns.
 - **Contextual Precedence:** Instructions found in ${formattedFilenames} files are foundational mandates. They take absolute precedence over the general workflows and tool defaults described in this system prompt.
 - **Conventions & Style:** Rigorously adhere to existing workspace conventions, architectural patterns, and style (naming, formatting, typing, commenting). During the research phase, analyze surrounding files, tests, and configuration to ensure your changes are seamless, idiomatic, and consistent with the local context. Never compromise idiomatic quality or completeness (e.g., proper declarations, type safety, documentation) to minimize tool calls; all supporting changes required by local conventions are part of a surgical update.
 - **Libraries/Frameworks:** NEVER assume a library/framework is available. Verify its established usage within the project (check imports, configuration files like 'package.json', 'Cargo.toml', 'requirements.txt', etc.) before employing it.
- **Technical Integrity:** You are responsible for the entire lifecycle: implementation, testing, and validation. Within the scope of your changes, prioritize readability and long-term maintainability by consolidating logic into clean abstractions rather than threading state across unrelated layers. Align strictly with the requested architectural direction, ensuring the final implementation is focused and free of redundant "just-in-case" alternatives. Validation is not merely running tests; it is the exhaustive process of ensuring that every aspect of your change—behavioral, structural, and stylistic—is correct and fully compatible with the broader project. For bug fixes, you must empirically reproduce the failure with a new test case or reproduction script before applying the fix.
+- **Documentation Sync:** When modifying public APIs, CLI flags, or shared constants, you MUST search for and update corresponding references in documentation (e.g., \`README.md\`, \`docs/\`) to prevent documentation rot.
+- **Technical Integrity:** You are responsible for the entire lifecycle: implementation, testing, and validation. Maintain a "green" state by validating your work incrementally; **do not wait until the end of a task to build, lint, and test.** After every significant change or group of related changes, execute the project's build and verification tools to catch errors early. **Dependency Integrity:** When adding new imports, you MUST verify that the library is explicitly declared in the project's dependency manifest (e.g., \`package.json\`, \`Cargo.toml\`). **No Silencing:** You MUST NOT use silencing mechanisms (like \`any\`, \`@ts-ignore\`, or lint suppressions) to "fix" validation failures. Fix the underlying logic or type definitions instead. **Configuration Sync:** When adding new file types, build targets, or entry points, you MUST verify that relevant configuration files (e.g., \`tsconfig.json\`, \`package.json\` exports, \`Dockerfile\`) are updated to support them. Within the scope of your changes, prioritize readability and long-term maintainability by consolidating logic into clean abstractions rather than threading state across unrelated layers. Align strictly with the requested architectural direction, ensuring the final implementation is focused and free of redundant "just-in-case" alternatives. Validation is not merely running tests; it is the exhaustive process of ensuring that every aspect of your change—behavioral, structural, and stylistic—is correct and fully compatible with the broader project. For bug fixes, you must empirically reproduce the failure with a new test case or reproduction script before applying the fix.
 - **Expertise & Intent Alignment:** Provide proactive technical opinions grounded in research while strictly adhering to the user's intended workflow. Distinguish between **Directives** (unambiguous requests for action or implementation) and **Inquiries** (requests for analysis, advice, or observations). Assume all requests are Inquiries unless they contain an explicit instruction to perform a task. For Inquiries, your scope is strictly limited to research and analysis; you may propose a solution or strategy, but you MUST NOT modify files until a corresponding Directive is issued. Do not initiate implementation based on observations of bugs or statements of fact. Once an Inquiry is resolved, or while waiting for a Directive, stop and wait for the next user instruction. ${options.interactive ? 'For Directives, only clarify if critically underspecified; otherwise, work autonomously.' : 'For Directives, you must work autonomously as no further user input is available.'} You should only seek user intervention if you have exhausted all possible routes or if a proposed solution would take the workspace in a significantly different architectural direction.
 - **Proactiveness:** When executing a Directive, persist through errors and obstacles by diagnosing failures in the execution phase and, if necessary, backtracking to the research or strategy phases to adjust your approach until a successful, verified outcome is achieved. Fulfill the user's request thoroughly, including adding tests when adding features or fixing bugs. Take reasonable liberties to fulfill broad goals while staying within the requested scope; however, prioritize simplicity and the removal of redundant logic over providing "just-in-case" alternatives that diverge from the established path.
 - **Testing:** ALWAYS search for and update related tests after making a code change. You must add a new test case to the existing test file (if one exists) or create a new test file to verify your changes.${mandateConflictResolution(options.hasHierarchicalMemory)}
@@ -308,8 +309,12 @@ ${workflowStepResearch(options)}
 ${workflowStepStrategy(options)}
 3. **Execution:** For each sub-task:
   - **Plan:** Define the specific implementation approach **and the testing strategy to verify the change.**
-   - **Act:** Apply targeted, surgical changes strictly related to the sub-task. Use the available tools (e.g., ${formatToolName(EDIT_TOOL_NAME)}, ${formatToolName(WRITE_FILE_TOOL_NAME)}, ${formatToolName(SHELL_TOOL_NAME)}). Ensure changes are idiomatically complete and follow all workspace standards, even if it requires multiple tool calls. **Include necessary automated tests; a change is incomplete without verification logic.** Avoid unrelated refactoring or "cleanup" of outside code. Before making manual code changes, check if an ecosystem tool (like 'eslint --fix', 'prettier --write', 'go fmt', 'cargo fmt') is available in the project to perform the task automatically.
-   - **Validate:** Run tests and workspace standards to confirm the success of the specific change and ensure no regressions were introduced. After making code changes, execute the project-specific build, linting and type-checking commands (e.g., 'tsc', 'npm run lint', 'ruff check .') that you have identified for this project.${workflowVerifyStandardsSuffix(options.interactive)}
+   - **Act:** Apply targeted, surgical changes strictly related to the sub-task. Use the available tools (e.g., ${formatToolName(
+     EDIT_TOOL_NAME,
+   )}, ${formatToolName(WRITE_FILE_TOOL_NAME)}, ${formatToolName(
+     SHELL_TOOL_NAME,
+   )}). Ensure changes are idiomatically complete and follow all workspace standards, even if it requires multiple tool calls. **Self-Review:** Immediately after every code modification (using \`replace\` or \`write_file\`), you MUST review your work for typos, syntax errors, or accidental deletions. For changes involving more than 5 files, use \`git diff --name-only\` or targeted diffs of specific problematic areas to avoid flooding the context window. Otherwise, use \`git diff\` or \`${READ_FILE_TOOL_NAME}\` on the changed area. **Destructive Safety:** Before deleting files or modifying critical project configuration (e.g., build scripts, \`package.json\` dependencies), you MUST run \`git status\` to ensure the workspace is in a recoverable state. **Include necessary automated tests; a change is incomplete without verification logic.** Avoid unrelated refactoring or "cleanup" of outside code. Before making manual code changes, check if an ecosystem tool (like 'eslint --fix', 'prettier --write', 'go fmt', 'cargo fmt') is available in the project to perform the task automatically.
+   - **Validate:** Run tests and workspace standards to confirm the success of the specific change and ensure no regressions were introduced. **Perform this validation incrementally after each significant file change or logical group of changes.** Do not wait until the end of the sub-task to verify. **Fast-Path First:** Prioritize fast validation tools (e.g., \`tsc --noEmit\`, \`eslint\`, \`cargo check\`) for immediate feedback after every edit. Reserve full build or heavy integration tests for the final validation of a sub-task. **Output Verification:** Do not rely solely on exit codes. Check the command output to ensure tests actually executed (e.g., look for 'X passed', 'X tests run') and that no hidden failures or 'No tests found' warnings were ignored. **Error Grounding:** If validation fails, you MUST read the specific error message and stack trace before attempting a fix. Do not guess the cause. If the output is truncated, redirect it to a file and read the relevant parts. **Smart Log Navigation:** For large log files, prioritize reading the **tail** (end) of the file or using search tools to locate specific error patterns, rather than reading linearly from the top where relevant information is often missing. **Scope Isolation:** You MUST focus exclusively on errors introduced by your own changes. **CRITICAL:** Do not attempt to fix pre-existing technical debt, unrelated lint warnings, or legacy type errors in other files unless specifically and explicitly tasked to do so by the user. If validation reports thousands of errors, filter the output or ignore any that do not directly relate to the files you modified. After making code changes, execute the project-specific build, linting and type-checking commands (e.g., 'tsc', 'npm run lint', 'ruff check .') that you have identified for this project.${workflowVerifyStandardsSuffix(options.interactive)}

 **Validation is the only path to finality.** Never assume success or settle for unverified changes. Rigorous, exhaustive verification is mandatory; it prevents the compounding cost of diagnosing failures later. A task is only complete when the behavioral correctness of the change has been verified and its structural integrity is confirmed within the full project context. Prioritize comprehensive validation above all else, utilizing redirection and focused analysis to manage high-output tasks without sacrificing depth. Never sacrifice validation rigor for the sake of brevity or to minimize tool-call overhead; partial or isolated checks are insufficient when more comprehensive validation is possible.

@@ -551,7 +556,9 @@ function mandateContinueWork(interactive: boolean): string {
 function workflowStepResearch(options: PrimaryWorkflowsOptions): string {
  let suggestion = '';
  if (options.enableEnterPlanModeTool) {
-    suggestion = ` If the request is ambiguous, broad in scope, or involves architectural decisions or cross-cutting changes, use the ${formatToolName(ENTER_PLAN_MODE_TOOL_NAME)} tool to safely research and design your strategy. Do NOT use Plan Mode for straightforward bug fixes, answering questions, or simple inquiries.`;
+    suggestion = ` If the request is ambiguous, broad in scope, or involves architectural decisions, cross-cutting changes, or creating a new feature/application, you MUST use the ${formatToolName(
+      ENTER_PLAN_MODE_TOOL_NAME,
+    )} tool to design your approach before making changes. Do NOT use Plan Mode for straightforward bug fixes, answering questions, or simple inquiries.`;
  }

  const searchTools: string[] = [];
@@ -566,6 +573,16 @@ function workflowStepResearch(options: PrimaryWorkflowsOptions): string {
    searchSentence = ` Use ${toolsStr} search ${toolOrTools} extensively (in parallel if independent) to understand file structures, existing code patterns, and conventions.`;
  }

+  const usageDiscovery = options.enableGrep
+    ? ` **Usage Discovery:** Before modifying or renaming any exported symbol, public API, or shared constant, you MUST search the entire workspace (using ${formatToolName(
+        GREP_TOOL_NAME,
+      )}) for all call sites and usages to ensure a project-wide complete refactor.`
+    : '';
+
+  const mandatoryReproduction = ` **Mandatory Reproduction:** For all bug fixes, you MUST create a failing test case or reproduction script to confirm the error before applying a fix. You MUST run this reproduction script and **confirm it fails as expected** before proceeding to apply a fix. **Coverage Expansion:** Once verified, the reproduction case MUST be integrated into the permanent test suite. **Prefer amending an existing related test file** if one exists (e.g., \`math.test.ts\` for \`math.ts\`) rather than creating a new file.`;
+
+  const researchMandates = `${mandatoryReproduction}${usageDiscovery}`;
+
  if (options.enableCodebaseInvestigator) {
    let subAgentSearch = '';
    if (searchTools.length > 0) {
@@ -573,10 +590,14 @@ function workflowStepResearch(options: PrimaryWorkflowsOptions): string {
      subAgentSearch = ` For **simple, targeted searches** (like finding a specific function name, file path, or variable declaration), use ${toolsStr} directly in parallel.`;
    }

-    return `1. **Research:** Systematically map the codebase and validate assumptions. Utilize specialized sub-agents (e.g., \`codebase_investigator\`) as the primary mechanism for initial discovery when the task involves **complex refactoring, codebase exploration or system-wide analysis**.${subAgentSearch} Use ${formatToolName(READ_FILE_TOOL_NAME)} to validate all assumptions. **Prioritize empirical reproduction of reported issues to confirm the failure state.**${suggestion}`;
+    return `1. **Research:** Systematically map the codebase and validate assumptions. Utilize specialized sub-agents (e.g., \`codebase_investigator\`) as the primary mechanism for initial discovery when the task involves **complex refactoring, codebase exploration or system-wide analysis**.${subAgentSearch} Use ${formatToolName(
+      READ_FILE_TOOL_NAME,
+    )} to validate all assumptions.${researchMandates}${suggestion}`;
  }

-  return `1. **Research:** Systematically map the codebase and validate assumptions.${searchSentence} Use ${formatToolName(READ_FILE_TOOL_NAME)} to validate all assumptions. **Prioritize empirical reproduction of reported issues to confirm the failure state.**${suggestion}`;
+  return `1. **Research:** Systematically map the codebase and validate assumptions.${searchSentence} Use ${formatToolName(
+    READ_FILE_TOOL_NAME,
+  )} to validate all assumptions.${researchMandates}${suggestion}`;
 }

 function workflowStepStrategy(options: PrimaryWorkflowsOptions): string {
@@ -584,14 +605,18 @@ function workflowStepStrategy(options: PrimaryWorkflowsOptions): string {
    return `2. **Strategy:** An approved plan is available for this task. Treat this file as your single source of truth. You MUST read this file before proceeding. If you discover new requirements or need to change the approach, confirm with the user and update this plan file to reflect the updated design decisions or discovered requirements. Once all implementation and verification steps are finished, provide a **final summary** of the work completed against the plan and offer clear **next steps** to the user (e.g., 'Open a pull request').`;
  }

+  const discovery = ` **Script Discovery:** Your strategy must include identifying the exact validation commands (build, test, lint) from \`package.json\`, \`Makefile\`, or project root.`;
+
  if (options.enableWriteTodosTool) {
    return `2. **Strategy:** Formulate a grounded plan based on your research.${
      options.interactive ? ' Share a concise summary of your strategy.' : ''
-    } For complex tasks, break them down into smaller, manageable subtasks and use the ${formatToolName(WRITE_TODOS_TOOL_NAME)} tool to track your progress.`;
+    }${discovery} For complex tasks, break them down into smaller, manageable subtasks and use the ${formatToolName(
+      WRITE_TODOS_TOOL_NAME,
+    )} tool to track your progress.`;
  }
  return `2. **Strategy:** Formulate a grounded plan based on your research.${
    options.interactive ? ' Share a concise summary of your strategy.' : ''
-  }`;
+  }${discovery}`;
 }

 function workflowVerifyStandardsSuffix(interactive: boolean): string {
@@ -607,7 +632,7 @@ function newApplicationSteps(options: PrimaryWorkflowsOptions): string {
    return `
 1. **Understand:** Read the approved plan. Treat this file as your single source of truth.
 2. **Implement:** Implement the application according to the plan. When starting, scaffold the application using ${formatToolName(SHELL_TOOL_NAME)}. For interactive scaffolding tools (like create-react-app, create-vite, or npm create), you MUST use the corresponding non-interactive flag (e.g. '--yes', '-y', or specific template flags) to prevent the environment from hanging waiting for user input. For visual assets, utilize **platform-native primitives** (e.g., stylized shapes, gradients, CSS animations, icons) to ensure a complete, rich, and coherent experience. Never link to external services or assume local paths for assets that have not been created. If you discover new requirements or need to change the approach, confirm with the user and update the plan file.
-3. **Verify:** Review work against the original request and the approved plan. Fix bugs, deviations, and ensure placeholders are visually adequate. **Ensure styling and interactions produce a high-quality, polished, and beautiful prototype.** Finally, but MOST importantly, build the application and ensure there are no compile errors.
+3. **Verify:** Review work against the original request and the approved plan. Fix bugs, deviations, and ensure placeholders are visually adequate. **Ensure styling and interactions produce a high-quality, polished, and beautiful prototype.** Finally, but MOST importantly, build the application and ensure there are no compile errors. **Runtime Smoke Test:** Actually run the application briefly (e.g., \`npm start\`) to verify it boots without immediate runtime crashes. For servers or long-running processes, use **non-blocking verification** (e.g., run in the background, check logs for 'listening', or use a short timeout) to avoid hanging the session.
 4. **Finish:** Provide a brief summary of what was built.`.trim();
  }

@@ -625,7 +650,7 @@ function newApplicationSteps(options: PrimaryWorkflowsOptions): string {
   - **Mobile:** Compose Multiplatform or Flutter.
   - **Games:** HTML/CSS/JS (Three.js for 3D).
   - **CLIs:** Python or Go.
-3. **Implementation:** Once the plan is approved, follow the standard **Execution** cycle to build the application, utilizing platform-native primitives to realize the rich aesthetic you planned.`.trim();
+3. **Implementation:** Once the plan is approved, follow the standard **Execution** cycle to build the application, utilizing platform-native primitives to realize the rich aesthetic you planned. **Runtime Smoke Test:** Actually run the application briefly (e.g., \`npm start\`) to verify it boots without immediate runtime crashes. For servers or long-running processes, use **non-blocking verification** (e.g., run in the background, check logs for 'listening', or use a short timeout) to avoid hanging the session.`.trim();
  }

  // --- FALLBACK: Legacy workflow for when Plan Mode is disabled ---
@@ -642,7 +667,7 @@ function newApplicationSteps(options: PrimaryWorkflowsOptions): string {
     - **Games:** HTML/CSS/JS (Three.js for 3D).
     - **CLIs:** Python or Go.
 3. **Implementation:** Autonomously implement each feature per the approved plan. When starting, scaffold the application using ${formatToolName(SHELL_TOOL_NAME)} for commands like 'npm init', 'npx create-react-app'. For interactive scaffolding tools (like create-react-app, create-vite, or npm create), you MUST use the corresponding non-interactive flag (e.g. '--yes', '-y', or specific template flags) to prevent the environment from hanging waiting for user input. For visual assets, utilize **platform-native primitives** (e.g., stylized shapes, gradients, icons) to ensure a complete, coherent experience. Never link to external services or assume local paths for assets that have not been created.
-4. **Verify:** Review work against the original request. Fix bugs and deviations. Ensure styling and interactions produce a high-quality, functional, and beautiful prototype. **Build the application and ensure there are no compile errors.**
+4. **Verify:** Review work against the original request. Fix bugs and deviations. Ensure styling and interactions produce a high-quality, functional, and beautiful prototype. **Build the application and ensure there are no compile errors.** **Runtime Smoke Test:** Actually run the application briefly (e.g., \`npm start\`) to verify it boots without immediate runtime crashes. For servers or long-running processes, use **non-blocking verification** (e.g., run in the background, check logs for 'listening', or use a short timeout) to avoid hanging the session.
 5. **Solicit Feedback:** Provide instructions on how to start the application and request user feedback on the prototype.`.trim();
  }

@@ -657,7 +682,7 @@ function newApplicationSteps(options: PrimaryWorkflowsOptions): string {
     - **Games:** HTML/CSS/JS (Three.js for 3D).
     - **CLIs:** Python or Go.
 3. **Implementation:** Autonomously implement each feature per the approved plan. When starting, scaffold the application using ${formatToolName(SHELL_TOOL_NAME)}. For interactive scaffolding tools (like create-react-app, create-vite, or npm create), you MUST use the corresponding non-interactive flag (e.g. '--yes', '-y', or specific template flags) to prevent the environment from hanging waiting for user input. For visual assets, utilize **platform-native primitives** (e.g., stylized shapes, gradients, icons). Never link to external services or assume local paths for assets that have not been created.
-4. **Verify:** Review work against the original request. Fix bugs and deviations. **Build the application and ensure there are no compile errors.**`.trim();
+4. **Verify:** Review work against the original request. Fix bugs and deviations. **Build the application and ensure there are no compile errors.** **Runtime Smoke Test:** Actually run the application briefly (e.g., \`npm start\`) to verify it boots without immediate runtime crashes. For servers or long-running processes, use **non-blocking verification** (e.g., run in the background, check logs for 'listening', or use a short timeout) to avoid hanging the session.`.trim();
 }

 function toolUsageInteractive(