fix NaN loss not caught by fast-fail check

`train_loss_f > 100` silently passes on NaN because every ordered
IEEE 754 comparison involving NaN evaluates to False. When an agent
experiment produces NaN (e.g. from an aggressive LR change), the run
wastes the full 5-minute budget instead of failing fast.
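A quick demonstration of why the old guard never fires on NaN:

```python
# IEEE 754: every ordered comparison involving NaN evaluates to False,
# so a plain "> 100" guard silently passes a NaN loss.
nan_loss = float("nan")

print(nan_loss > 100)        # False -- the old fast-fail check never fires
print(nan_loss <= 100)       # False -- so "not (x <= 100)" is True
print(nan_loss == nan_loss)  # False -- NaN is not even equal to itself
```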

`not (x <= 100)` catches both >100 and NaN with no added complexity.
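The fixed guard can be sketched as a small helper (`loss_exploded` is an illustrative name, not part of the patch):

```python
import math

def loss_exploded(loss: float, threshold: float = 100.0) -> bool:
    # "not (loss <= threshold)" is True when loss > threshold AND when
    # loss is NaN, because "NaN <= threshold" is False.
    return not loss <= threshold

print(loss_exploded(0.7))       # False -- healthy loss
print(loss_exploded(512.0))     # True  -- exploding
print(loss_exploded(math.nan))  # True  -- NaN now fails fast
```

Note `math.isnan(loss) or loss > threshold` would work equally well; the negated comparison just handles both cases in one expression.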

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: haosenwang1018
Date: 2026-03-09 23:51:02 +08:00
Parent: 068d93da75
Commit: b5ba8ac00d
+2 -2
@@ -565,8 +565,8 @@ while True:
     train_loss_f = train_loss.item()
-    # Fast fail: abort if loss is exploding
-    if train_loss_f > 100:
+    # Fast fail: abort if loss is exploding or NaN
+    if not train_loss_f <= 100:
         print("FAIL")
         exit(1)