feat(core): comprehensive agent self-validation and engineering mandates

Major upgrade to the agent's self-validation, safety, and project integrity capabilities through five iterations of system prompt enhancements:

Workflow & Quality Mandates:
1. Incremental Validation: Mandates building, linting, and testing after every significant file change to maintain a "green" state.
2. Mandatory Reproduction: Requires creating a failing test case to confirm a bug before fixing, and explicitly verifying the failure (Negative Verification).
3. Test Persistence & Locality: Requires integrating repro cases into the permanent test suite, preferably by amending existing related test files.
4. Script Discovery: Mandates identifying project-specific validation commands from configuration files (package.json, Makefile, etc.).
5. Self-Review: Mandates running `git diff` after every edit, using `--name-only` for large changes to preserve context window tokens.
6. Fast-Path Validation: Prioritizes lightweight checks (e.g., `tsc --noEmit`) for frequent feedback, reserving heavy builds for final verification.
7. Output Verification: Requires checking command output (not just exit codes) to prevent false-positives from empty test runs or hidden warnings.

Semantic Integrity & Dependency Safety:
8. Global Usage Discovery: Mandates searching the entire workspace for all usages (via `grep_search`) before modifying exported symbols or APIs.
9. Dependency Integrity: Requires verifying that new imports are explicitly declared in the project's dependency manifest (e.g., package.json).
10. Configuration Sync: Mandates updating build/environment configs (tsconfig, Dockerfile, etc.) to support new file types or entry points.
11. Documentation Sync: Requires searching for and updating documentation references when public APIs or CLI interfaces change.
12. Anti-Silencing Mandate: Prohibits using `any`, `@ts-ignore`, or lint suppressions to resolve validation errors.

Diagnostics, Safety & Runtime Verification:
13. Error Grounding: Mandates reading full error logs and stack traces upon failure. Includes Smart Log Navigation to prioritize the tail of large files.
14. Scope Isolation: Instructs the agent to focus only on errors introduced by its changes and ignore unrelated legacy technical debt.
15. Destructive Safety: Mandates a `git status` check before deleting files or modifying critical project configurations.
16. Non-Blocking Smoke Tests: Requires briefly running applications to verify boot stability, using background/timeout strategies for servers.

Includes 15 new behavioral evaluations verifying these mandates and updated snapshots in packages/core/src/core/prompts.test.ts.
2026-05-13 13:22:35 -07:00 · 2026-02-20 14:22:54 -08:00
parent 208291f391
commit 61b35ff745
17 changed files with 1231 additions and 402 deletions
@@ -0,0 +1,53 @@
+/**
+ * @license
+ * Copyright 2026 Google LLC
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+import { describe, expect } from 'vitest';
+import { evalTest } from './test-helper.js';
+
+describe('Self-Diff Review', () => {
+  /**
+   * Verifies that the agent performs a self-review immediately after an edit.
+   */
+  evalTest('USUALLY_PASSES', {
+    name: 'should review changes immediately after an edit',
+    files: {
+      'src/app.ts': 'export const hello = () => "world";',
+    },
+    prompt: 'Update src/app.ts to say "hello world" instead of "world".',
+    assert: async (rig) => {
+      const toolLogs = rig.readToolLogs();
+
+      const editIndex = toolLogs.findIndex(
+        (log) =>
+          (log.toolRequest.name === 'replace' ||
+            log.toolRequest.name === 'write_file') &&
+          log.toolRequest.args.includes('src/app.ts'),
+      );
+
+      expect(editIndex, 'Agent should have edited src/app.ts').toBeGreaterThan(
+        -1,
+      );
+
+      // Check for git diff or read_file immediately after the edit
+      const reviewCall = toolLogs[editIndex + 1];
+      expect(
+        reviewCall,
+        'Agent should have made a call after the edit',
+      ).toBeDefined();
+
+      const isReview =
+        (reviewCall.toolRequest.name === 'run_shell_command' &&
+          reviewCall.toolRequest.args.includes('git diff')) ||
+        (reviewCall.toolRequest.name === 'read_file' &&
+          reviewCall.toolRequest.args.includes('src/app.ts'));
+
+      expect(
+        isReview,
+        'Agent should have run git diff or read_file immediately after the edit to review its work',
+      ).toBe(true);
+    },
+  });
+});
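
Note: the assertion in the eval above can also be read as a small pure function over the tool log. The following is a minimal sketch of that check, not code from this commit. The `ToolLog` interface, the `reviewedImmediatelyAfterEdit` helper, and the sample log are hypothetical names introduced for illustration; the sketch assumes, as the eval does, that `toolRequest.args` is a string that can be searched with `includes`.

```typescript
// Hypothetical, simplified shape of one tool-log entry.
// The real type returned by rig.readToolLogs() may carry more fields.
interface ToolLog {
  toolRequest: { name: string; args: string };
}

// Tool names the eval treats as file edits.
const EDIT_TOOLS = ['replace', 'write_file'];

/**
 * Returns true when the tool call directly after the first edit of `path`
 * is either a shell command containing `git diff` or a `read_file` of `path`.
 * Returns false when no edit is found or when the edit is the last call.
 */
function reviewedImmediatelyAfterEdit(logs: ToolLog[], path: string): boolean {
  const editIndex = logs.findIndex(
    (log) =>
      EDIT_TOOLS.includes(log.toolRequest.name) &&
      log.toolRequest.args.includes(path),
  );
  if (editIndex === -1) {
    return false; // the agent never edited the file
  }

  const next = logs[editIndex + 1];
  if (next === undefined) {
    return false; // nothing happened after the edit
  }

  const { name, args } = next.toolRequest;
  return (
    (name === 'run_shell_command' && args.includes('git diff')) ||
    (name === 'read_file' && args.includes(path))
  );
}

// Hypothetical sample log: an edit followed directly by a diff review.
const sampleLogs: ToolLog[] = [
  { toolRequest: { name: 'read_file', args: '{"path":"src/app.ts"}' } },
  { toolRequest: { name: 'replace', args: '{"path":"src/app.ts"}' } },
  { toolRequest: { name: 'run_shell_command', args: '{"command":"git diff"}' } },
];

console.log(reviewedImmediatelyAfterEdit(sampleLogs, 'src/app.ts'));
```

Like the eval, this sketch only inspects the call that follows the first matching edit, so a review that happens two calls later, or after a second edit, would not count. That strictness is presumably why the eval is marked `USUALLY_PASSES` rather than treated as a hard requirement.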