The evaluator does NOT use Baby Nexus self-descriptions. It scores actual outputs, memory persistence and checkpoint deltas.
No evaluation has been run yet. Click Run evaluation to generate the first report.