I’m wondering about the benchmark too. Figuring out how it could be gamed is way above my level, but this is buried in the article:
Moreover, ARC-AGI-1 is now saturating – besides o3’s new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.
The most expensive o3 version achieved 87.5%.
But aren’t they used to dealing with VC?