
> ZeroEval's Leaderboard on hugging face [0] actually shows that it beats even Claude 3.5 Sonnet on CRUX [1] which is a code reasoning benchmark.

The previous version of 4o also beat 3.5 Sonnet on CRUX.



Which is a good hint that the benchmark sucks. No way 4o beats Sonnet 3.5.


Sonnet 3.5 has a lot of alignment issues. It has often refused to answer simple coding questions I asked, just because it considered them "unsafe". 4o is much more relaxed. On math, though, Sonnet is a bit better than 4o.


I think they have secretly released something better than 4. In our internal benchmarks, 4o mini is also performing better than 4o.


The weirdest part is that, according to these benchmarks, the best model overall is supposed to be Gemini Pro.



