The list of casual, bizarre AI benchmarks keeps growing.
Over the past few days, some in the AI community on X have become obsessed with a test of how different AI models, particularly so-called reasoning models, handle prompts like this: “Write a Python script for a bouncing yellow ball inside a shape. Make the shape slowly rotate, and make sure that the ball stays inside the shape.”
Some models handle this “ball in rotating shape” benchmark better than others. According to one user on X, Chinese AI lab DeepSeek’s freely available R1 wiped the floor with OpenAI’s o1 pro mode, which costs $200 per month as part of OpenAI’s ChatGPT Pro plan.
👀 DeepSeek R1 (right) crushed o1-pro (left) 👀
Prompt: “write a python script for a bouncing yellow ball inside a square, make sure to handle collision detection properly. make the square slowly rotate. implement it in python. make sure ball stays inside the square.” pic.twitter.com/3Sad9efpeZ
— Ivan Fioravanti ᯅ (@ivanfioravanti) January 22, 2025
Per another X poster, Anthropic’s Claude 3.5 Sonnet and Google’s Gemini 1.5 Pro models misjudged the physics, resulting in the ball escaping the shape. Other users reported that Google’s Gemini 2.0 Flash Thinking Experimental, and even OpenAI’s older GPT-4o, aced the evaluation in one go.
Tested 9 AI models on a physics simulation task: rotating triangle + bouncing ball. Results:
🥇 Deepseek-R1
🥈 Sonar Huge
🥉 GPT-4o
Worst? OpenAI o1: Completely misunderstood the task 😂
Video below ↓ First row = Reasoning models, rest = Base models. pic.twitter.com/EOYrHvNazr
— Aadhithya D (@Aadhithya_D2003) January 22, 2025
But what does it prove if an AI can or can’t code a rotating, ball-containing shape?
Well, simulating a bouncing ball is a classic programming challenge. Accurate simulations incorporate collision detection algorithms, which try to identify when two objects (e.g. a ball and the side of a shape) collide. Poorly written algorithms can affect the simulation’s performance or lead to obvious physics errors.
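To give a rough sense of what that collision detection involves, here is a minimal sketch in plain Python (the function name and structure are illustrative, not any model’s actual output): it finds the closest point on a wall segment to the ball and reflects the ball’s velocity if they overlap.

```python
import math

def reflect_off_wall(pos, vel, wall_a, wall_b, radius):
    """Reflect the ball's velocity if it touches the wall segment wall_a -> wall_b.

    pos, vel, wall_a and wall_b are (x, y) tuples; radius is the ball's radius.
    Returns the (possibly updated) velocity.
    """
    # Vector along the wall, and from the wall's start point to the ball
    wx, wy = wall_b[0] - wall_a[0], wall_b[1] - wall_a[1]
    px, py = pos[0] - wall_a[0], pos[1] - wall_a[1]

    # Project the ball's center onto the wall segment to find the closest point
    t = max(0.0, min(1.0, (px * wx + py * wy) / (wx * wx + wy * wy)))
    closest = (wall_a[0] + t * wx, wall_a[1] + t * wy)

    # Distance from the ball's center to that closest point
    dx, dy = pos[0] - closest[0], pos[1] - closest[1]
    dist = math.hypot(dx, dy)

    if 0 < dist < radius:
        # Unit normal pointing from the wall toward the ball
        nx, ny = dx / dist, dy / dist
        dot = vel[0] * nx + vel[1] * ny
        if dot < 0:  # only bounce if the ball is moving into the wall
            # Reflect velocity about the normal: v' = v - 2(v . n)n
            return (vel[0] - 2 * dot * nx, vel[1] - 2 * dot * ny)
    return vel
```

Get the projection or the normal slightly wrong and the ball can slip straight through the wall, which is the kind of obvious physics error the failing demos show.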
X user N8 Programs, a researcher in residence at AI startup Nous Research, says it took him roughly two hours to program a bouncing ball in a rotating heptagon from scratch. “One has to track multiple coordinate systems, how the collisions are done in each system, and design the code from the beginning to be robust,” N8 Programs explained in a post.
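One common way to juggle those coordinate systems (a sketch under the assumption that the shape spins by a known angle about its center; these helpers are hypothetical, not N8 Programs’ code) is to rotate the ball into the shape’s non-rotating frame, run the collision test there, and rotate the result back:

```python
import math

def world_to_shape(point, center, angle):
    """Rotate a world-space point by -angle about the shape's center,
    giving its position in the shape's own (non-rotating) frame."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (x * c + y * s + center[0], -x * s + y * c + center[1])

def shape_to_world(point, center, angle):
    """Inverse transform: rotate a shape-frame point by +angle back into world space."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s + center[0], x * s + y * c + center[1])
```

Velocities have to be rotated the same way (without the center offset), and a careful simulation also accounts for the motion of the spinning wall itself, which is easy to forget.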
But while bouncing balls and rotating shapes are a reasonable test of programming skill, they’re not a very empirical AI benchmark. Even slight variations in the prompt can — and do — yield different results. That’s why some users on X report having more luck with o1, while others say that R1 falls short.
If anything, viral tests like these point to the intractable problem of creating useful systems of measurement for AI models. It’s often difficult to tell what differentiates one model from another, outside of esoteric benchmarks that aren’t relevant to most people.
Many efforts are underway to build better tests, like the ARC-AGI benchmark and Humanity’s Last Exam. We’ll see how those fare — and in the meantime, watch GIFs of balls bouncing in rotating shapes.