feat(evals): add comprehensive workflow evaluations and tune prompts (Issue #219)

- Established evals for all agent workflows (triage, dedup, refresh).
- Refactored all evals to use modern --output-format=json flag for robust validation.
- Tuned prompts for strict JSON compliance and corrected spam handling in scheduled triage.
- Expanded edge case coverage for false positives, security leaks, and mixed batches.
2026-05-13 05:12:55 -07:00 · 2026-02-03 19:17:38 -05:00
parent 2daee0d066
commit 5c2f477adf
8 changed files with 1128 additions and 71 deletions
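
The "Final Instructions" blocks added in this commit fix two output contracts: the single-issue prompt must emit one JSON object with a "labels_to_set" array, and the scheduled (batch) prompt must emit one JSON array of objects. A minimal sketch of how a downstream check could enforce those contracts is shown here. The function names are hypothetical and not part of this commit, and treating each label as a string is an assumption, since the prompt only requires an array.

```python
import json


def parse_single_issue_labels(raw: str) -> list:
    """Validate the single-issue contract: one JSON object with a "labels_to_set" array.

    Hypothetical helper for illustration; not taken from the commit.
    """
    # json.loads raises json.JSONDecodeError (a ValueError subclass) if the model
    # wrapped the JSON in introductory or concluding text.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a single JSON object")
    labels = data.get("labels_to_set")
    if not isinstance(labels, list):
        raise ValueError('"labels_to_set" must be an array')
    # Assumption: labels are plain strings such as "area/unknown".
    if not all(isinstance(label, str) for label in labels):
        raise ValueError("every label must be a string")
    return labels


def parse_batch_triage(raw: str) -> list:
    """Validate the batch contract: one JSON array whose items are all objects."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a single JSON array")
    if not all(isinstance(item, dict) for item in data):
        raise ValueError("every array item must be a JSON object")
    return data
```

Because `json.loads` rejects any non-whitespace text before or after the JSON value, a response that begins with a remark such as "Here is the JSON:" fails immediately, which is the behavior the new prompt instructions are meant to make unnecessary.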
@@ -257,6 +257,13 @@ jobs:
            area/unknown
            - Description: Issues that do not clearly fit into any other defined area/ category, or where information is too limited to make a determination. Use this when no other area is appropriate.

+            ## Final Instructions
+
+            - Output ONLY valid JSON format.
+            - Do NOT include any introductory or concluding remarks, explanations, or additional text.
+            - Do NOT include any thoughts or reasoning outside the JSON block.
+            - Ensure the output is a single JSON object with a "labels_to_set" array.
+
      - name: 'Apply Labels to Issue'
        if: |-
          ${{ steps.gemini_issue_analysis.outputs.summary != '' }}
@@ -159,7 +159,7 @@ jobs:
                 }
               ]
               ```
-              If an issue cannot be classified, do not include it in the output array.
+              If an issue cannot be classified (e.g. spam), classify it as area/unknown.
            9. For each issue please check if CLI version is present, this is usually in the output of the /about command and will look like 0.1.5
              - Anything more than 6 versions older than the most recent should add the status/need-retesting label
            10. If you see that the issue doesn't look like it has sufficient information recommend the status/need-information label and leave a comment politely requesting the relevant information, eg.. if repro steps are missing request for repro steps. if version information is missing request for version information into the explanation section below.
@@ -207,7 +207,7 @@ jobs:
            area/enterprise: Telemetry, Policy, Quota / Licensing
            area/extensions: Gemini CLI extensions capability
            area/non-interactive: GitHub Actions, SDK, 3P Integrations, Shell Scripting, Command line automation
-            area/platform: Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt
+            area/platform: Build infra, Release mgmt, Automated testing infrastructure (evals), Capacity, Quota mgmt. NOT for local test failures.
            area/security: security related issues

            Additional Context:
@@ -215,6 +215,13 @@ jobs:
            - This product is designed to use different models eg.. using pro, downgrading to flash etc.
            - When users report that they dont expect the model to change those would be categorized as feature requests.

+            ## Final Instructions
+
+            - Output ONLY valid JSON format.
+            - Do NOT include any introductory or concluding remarks, explanations, or additional text.
+            - Do NOT include any thoughts or reasoning outside the JSON block.
+            - Ensure the output is a single JSON array of objects.
+
      - name: 'Apply Labels to Issues'
        if: |-
          ${{ steps.gemini_issue_analysis.outcome == 'success' &&