fix NaN loss not caught by fast-fail check

`train_loss_f > 100` silently passes on NaN because every ordered
IEEE 754 comparison involving NaN evaluates to False. When an agent
experiment produces NaN (e.g. from an aggressive LR change), the run
wastes the full 5-minute budget instead of failing fast.
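A quick demonstration of why the old guard never fires on NaN:

```python
# IEEE 754: every ordered comparison involving NaN evaluates to False,
# so a plain "> 100" guard silently passes a NaN loss.
nan_loss = float("nan")

print(nan_loss > 100)        # False -- the old fast-fail check never fires
print(nan_loss <= 100)       # False -- so "not (x <= 100)" is True
print(nan_loss == nan_loss)  # False -- NaN is not even equal to itself
```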

`not (x <= 100)` catches both >100 and NaN with no added complexity.
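The fixed guard can be sketched as a small helper (`loss_exploded` is an illustrative name, not part of the patch):

```python
import math

def loss_exploded(loss: float, threshold: float = 100.0) -> bool:
    # "not (loss <= threshold)" is True when loss > threshold AND when
    # loss is NaN, because "NaN <= threshold" is False.
    return not loss <= threshold

print(loss_exploded(0.7))       # False -- healthy loss
print(loss_exploded(512.0))     # True  -- exploding
print(loss_exploded(math.nan))  # True  -- NaN now fails fast
```

Note `math.isnan(loss) or loss > threshold` would work equally well; the negated comparison just handles both cases in one expression.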

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: haosenwang1018
Date: 2026-03-09 23:51:02 +08:00
Parent: 068d93da75
Commit: b5ba8ac00d
+2 -2
@@ -565,8 +565,8 @@ while True:
     train_loss_f = train_loss.item()
-    # Fast fail: abort if loss is exploding
-    if train_loss_f > 100:
+    # Fast fail: abort if loss is exploding or NaN
+    if not train_loss_f <= 100:
         print("FAIL")
         exit(1)