3 Commits

Author SHA1 Message Date
cocosheng-g
9da1542071 feat(ci): isolate workflow evals into independent nightly job
- Splits 'Evals: Nightly' into 'evals' (general capabilities) and 'workflow-evals' (specific workflow simulations).
- 'workflow-evals' runs only on 'gemini-2.5-pro' (the target model).
- 'evals' excludes workflow tests to prevent noise/skewed metrics on other models.
- Removes code-level 'targetModels' restrictions in favor of CI configuration.
- Updates aggregation script to handle skipped tests correctly (though exclusion avoids them).
2026-02-03 22:37:15 -05:00
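The aggregation change in 9da1542071 is described only in prose. A minimal sketch of what "handle skipped tests correctly" might mean is shown below, assuming skipped evals are left out of the pass-rate denominator. The `EvalResult` type and `passRate` function are hypothetical names for illustration and are not taken from the repository's aggregation script.

```typescript
// Hypothetical result shape; the real aggregation script may differ.
type EvalStatus = "passed" | "failed" | "skipped";

interface EvalResult {
  name: string;
  status: EvalStatus;
}

// Returns the fraction of executed evals that passed, or null when every
// eval was skipped (so a fully skipped suite is not reported as 0%).
export function passRate(results: EvalResult[]): number | null {
  const executed = results.filter((r) => r.status !== "skipped");
  if (executed.length === 0) {
    return null;
  }
  const passed = executed.filter((r) => r.status === "passed").length;
  return passed / executed.length;
}
```

With this approach, a model that skips workflow evals is scored only on the evals it actually ran, which is the same effect the commit achieves by excluding those tests at the CI level.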
cocosheng-g
8c82777ea0 fix(evals): use direct path to tsx binary in dedup mocks
Avoids reliance on 'npx' which might be flaky or prompt for installation in CI environments.
2026-02-03 21:39:16 -05:00
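The fix in 8c82777ea0 can be illustrated with a short sketch that resolves the locally installed tsx binary instead of calling 'npx'. This assumes tsx is installed as a project dependency so that a launcher exists under node_modules/.bin. The `resolveTsxBin` helper and its parameters are hypothetical and are not the code from the commit.

```typescript
import * as path from "node:path";

// Hypothetical helper: builds the path to the tsx launcher that npm places
// in node_modules/.bin. On Windows the launcher is a .cmd shim.
export function resolveTsxBin(
  repoRoot: string,
  platform: string = process.platform,
): string {
  const binName = platform === "win32" ? "tsx.cmd" : "tsx";
  return path.join(repoRoot, "node_modules", ".bin", binName);
}
```

A mock script could then be launched with the returned path as the executable, which avoids the registry lookup and install prompt that 'npx' may trigger when the package is not cached.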
cocosheng-g
ff4e816a70 refactor(evals): isolate workflow evals and target specific models
2026-02-03 21:02:55 -05:00