From 1b021bddab8e76a2663bcbd6d24c7e40c8e776a4 Mon Sep 17 00:00:00 2001
From: Christian Gunderman
Date: Fri, 1 May 2026 15:45:58 -0700
Subject: [PATCH] test(critique): improve prompt robustness for scale and rate
 limits

---
 tools/gemini-cli-bot/brain/critique.md | 52 +++++++++++++-------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/tools/gemini-cli-bot/brain/critique.md b/tools/gemini-cli-bot/brain/critique.md
index 25e9932dc9..06bdb3ec51 100644
--- a/tools/gemini-cli-bot/brain/critique.md
+++ b/tools/gemini-cli-bot/brain/critique.md
@@ -13,25 +13,22 @@ and logical checklist.
 
 ### Technical Robustness
 
-1. **Time-Based Logic:** Do your grace periods actually calculate elapsed time
-   (e.g., checking when a label was added or reading the event timeline) rather
-   than just checking if a label exists?
-2. **Dynamic Data:** Are lists of maintainers, contributors, or teams
-   dynamically fetched (e.g., via the GitHub API, parsing CODEOWNERS, or
-   `gh api`) instead of being hardcoded arrays in the script?
-3. **Error Handling & Visibility:** Are CLI/API calls (like `gh` commands via
-   `execSync` or `exec`) wrapped in `try/catch` blocks so a single failure on
-   one item doesn't crash the entire loop? Are file reads protected with
-   existence checks or `try/catch` blocks?
-4. **Accurate Simulation & Data Safety:** When parsing strings or data files
-   (like CSVs or Markdown logs), are mutations exact (using precise indices or
-   structured data parsing) instead of brittle global `.replace()` operations?
-5. **Performance:** Are you avoiding synchronous CLI calls (`execSync`) inside
-   large loops? Are you using asynchronous execution (`exec` or `spawn` with
-   `Promise.all` or concurrency limits) where appropriate?
-6. **Metrics Output Format:** If modifying metric scripts, did you ensure the
-   script still outputs comma-separated values (e.g.,
-   `console.log('metric_name,123')`) and NOT JSON or other formats?
+1. **Time-Based Logic:** Do grace periods correctly calculate elapsed time
+   (e.g., measuring from the timeline event when a label was added) rather than
+   just checking for the existence of a label?
+2. **Dynamic Data:** Are lists of maintainers or teams dynamically fetched
+   rather than hardcoded?
+3. **Error Handling & Fault Tolerance:** Are operations wrapped in `try/catch`
+   blocks so a single failure on one item doesn't crash an entire batch process?
+4. **Data Mutations:** Are data manipulations (like parsing CSVs or logs) robust
+   and precise, avoiding brittle global string replacements?
+5. **Scale & Rate Limits:** Will this code time out, hit API rate limits, or
+   consume excessive memory if run against a repository with 5,000 open issues?
+   You MUST reject any script that makes sequential API calls inside an
+   unbounded loop (N+1 queries) or uses excessively broad search queries (like
+   `is:open` without date or state filters).
+6. **Metrics Format:** Do metric scripts output strict comma-separated values
+   (`metric_name,value`) and not JSON or text?
 
 ### Logical & Workflow Integrity
 
@@ -82,15 +79,18 @@ and logical checklist.
    policies. Verify that the LLM is used ONLY for classification and not for
    logic or decision-making.
 
-## Systemic Simulation (MANDATORY FOR TIME-BASED LOGIC)
+## Systemic Simulation (MANDATORY)
 
-If the modified scripts or workflows involve time-based triggers (e.g., cron
-schedules), grace periods, or staleness checks:
+You MUST explicitly write out a timeline and scale simulation in your response
+to prove the logic holds up over time and at scale.
 
-- You MUST explicitly write out a timeline simulation in your response.
-- Step through the execution day by day (e.g., Day 1, Day 7, Day 14).
-- Ensure that the execution frequency (the cron schedule) aligns perfectly with
-  the logical grace periods promised in the code or comments.
+- **Timeline:** Step through the execution day by day (e.g., Day 1, Day 7, Day
+  14). Ensure the execution frequency (the cron schedule) aligns perfectly with
+  the logical grace periods promised.
+- **Scale:** Simulate running the logic against a repository with 5,000 open
+  issues. Does the script retrieve all 5,000 issues at once? If so, does it
+  iterate through them sequentially making API calls for each (N+1)? Reject the
+  change if it fails to handle scale efficiently.
 
 ## Evaluation Mandate
 
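
Reviewer note (appended after the patch, not part of the diff): the "Scale &
Rate Limits" check rejects sequential API calls inside an unbounded loop. The
bounded-concurrency alternative the checklist alludes to (`Promise.all` with a
concurrency limit) can be sketched as below; `mapWithLimit` and the doubling
worker are illustrative names, not code from this repository.

```javascript
// Run an async worker over `items` with at most `limit` calls in flight at
// once -- instead of one awaited API call per loop iteration (N+1).
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  // Each runner repeatedly claims the next unclaimed index, then awaits it.
  async function run() {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so indices never collide
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results; // order matches `items`, regardless of completion order
}

// Toy stand-in for an API call: with limit 2, even a 5,000-item input would
// never have more than 2 requests in flight.
mapWithLimit([1, 2, 3, 4], 2, async (n) => n * 2).then((out) => {
  console.log(out.join(',')); // 2,4,6,8
});
```

A real script would pass a `worker` that wraps its `gh`/API call in
`try/catch`, so one failed item surfaces as an error result rather than
crashing the whole batch.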