Hacker News
usaar333 on Aug 7, 2024 | on: Structured Outputs in the API
> ZeroEval's leaderboard on Hugging Face [0] actually shows that it beats even Claude 3.5 Sonnet on CRUX [1], which is a code reasoning benchmark.
The previous version of 4o also beat 3.5 Sonnet on CRUX.
nunodonato on Aug 7, 2024
Which is a good hint that that benchmark sucks. No way 4o beats Sonnet 3.5.
lauralex on Aug 7, 2024
Sonnet 3.5 has a lot of alignment issues. It often refused to answer simple coding questions I asked, just because it considered them "unsafe". 4o is much more relaxed. Regarding math, though, Sonnet is a bit better than 4o.
ashu1461 on Aug 7, 2024
I think they have secretly released something that is better than 4. In our internal benchmarks, 4o mini is also performing better than 4o.
rvnx on Aug 7, 2024
The weirdest part is that, according to these benchmarks, the best model is ultimately supposed to be Gemini Pro.