feat(evals): add comprehensive workflow evaluations and tune prompts (Issue #219)

- Established evals for all agent workflows (triage, dedup, refresh).
- Refactored all evals to use the modern --output-format=json flag for robust validation.
- Tuned prompts for strict JSON compliance and corrected spam handling in scheduled triage.
- Expanded edge case coverage for false positives, security leaks, and mixed batches.
cocosheng-g
2026-02-03 19:17:38 -05:00
committed by Coco Sheng
parent 2daee0d066
commit 5c2f477adf
8 changed files with 1128 additions and 71 deletions


@@ -257,6 +257,13 @@ jobs:
area/unknown
- Description: Issues that do not clearly fit into any other defined area/ category, or where information is too limited to make a determination. Use this when no other area is appropriate.
+## Final Instructions
+- Output ONLY valid JSON format.
+- Do NOT include any introductory or concluding remarks, explanations, or additional text.
+- Do NOT include any thoughts or reasoning outside the JSON block.
+- Ensure the output is a single JSON object with a "labels_to_set" array.
- name: 'Apply Labels to Issue'
if: |-
${{ steps.gemini_issue_analysis.outputs.summary != '' }}
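The "Final Instructions" in this hunk ask for a single JSON object with a "labels_to_set" array and nothing else. A minimal sketch of how an eval or a downstream step might check that contract, assuming the raw model text is available as a string. `parse_labels` is a hypothetical helper, not code from this commit, and treating each label as a string is an assumption.

```python
import json


def parse_labels(raw: str) -> list[str]:
    """Return the labels from a strict-JSON triage response, or raise ValueError.

    Hypothetical helper for illustration; the commit itself does not show
    how the workflow parses the model output.
    """
    try:
        # json.loads rejects any prose or markdown fence around the JSON.
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not pure JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("expected a single JSON object")

    labels = data.get("labels_to_set")
    # Assumption: every label is a plain string such as "area/unknown".
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError('"labels_to_set" must be an array of strings')
    return labels
```

A response with any introductory remark before the object fails at the `json.loads` step, which is the failure mode the new prompt lines are meant to prevent.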


@@ -159,7 +159,7 @@ jobs:
}
]
```
-If an issue cannot be classified, do not include it in the output array.
+If an issue cannot be classified (e.g. spam), classify it as area/unknown.
9. For each issue please check if CLI version is present, this is usually in the output of the /about command and will look like 0.1.5
- Anything more than 6 versions older than the most recent should add the status/need-retesting label
10. If you see that the issue doesn't look like it has sufficient information recommend the status/need-information label and leave a comment politely requesting the relevant information, eg.. if repro steps are missing request for repro steps. if version information is missing request for version information into the explanation section below.
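The version rule in step 9 (more than 6 versions older than the most recent) can be sketched in code. This is an illustration under stated assumptions, not part of the workflow: `needs_retesting` and the `releases` list are hypothetical, versions are assumed to be purely numeric dotted strings like 0.1.5, and "versions older" is read as the number of releases between the reported version and the newest one.

```python
def needs_retesting(reported: str, releases: list[str], max_lag: int = 6) -> bool:
    """True if `reported` is more than `max_lag` releases behind the newest one.

    Hypothetical helper; `releases` is an assumed list of known release
    versions, in any order.
    """

    def key(version: str) -> tuple[int, ...]:
        # Compare numerically so that 0.1.10 sorts after 0.1.9.
        return tuple(int(part) for part in version.split("."))

    ordered = sorted(set(releases), key=key)
    if reported not in ordered:
        return False  # unknown version: leave the decision to a human
    lag = len(ordered) - 1 - ordered.index(reported)
    return lag > max_lag
```

With releases 0.1.1 through 0.1.12, a report against 0.1.5 is 7 releases behind and would get status/need-retesting, while 0.1.6 is exactly 6 behind and would not.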
@@ -207,7 +207,7 @@ jobs:
area/enterprise: Telemetry, Policy, Quota / Licensing
area/extensions: Gemini CLI extensions capability
area/non-interactive: GitHub Actions, SDK, 3P Integrations, Shell Scripting, Command line automation
-area/platform: Build infra, Release mgmt, Testing, Eval infra, Capacity, Quota mgmt
+area/platform: Build infra, Release mgmt, Automated testing infrastructure (evals), Capacity, Quota mgmt. NOT for local test failures.
area/security: security related issues
Additional Context:
@@ -215,6 +215,13 @@ jobs:
- This product is designed to use different models eg.. using pro, downgrading to flash etc.
- When users report that they dont expect the model to change those would be categorized as feature requests.
+## Final Instructions
+- Output ONLY valid JSON format.
+- Do NOT include any introductory or concluding remarks, explanations, or additional text.
+- Do NOT include any thoughts or reasoning outside the JSON block.
+- Ensure the output is a single JSON array of objects.
- name: 'Apply Labels to Issues'
if: |-
${{ steps.gemini_issue_analysis.outcome == 'success' &&