Parameter Golf: A Post-Mortem on a Half-Finished Competition
Article X of X: The End
The competition ended yesterday. April 30, 2026. No announcement, no winner. The OpenAI Parameter Golf challenge concluded the way it was run -- without follow-through.
I spent roughly (redacted out of shame) of my own money on this thing. Six PRs. A first-place neural submission at 0.7406 bpb. A FlashAttention-3 wheel I built from source and open-sourced because RunPod’s templates couldn’t ship it. Six weeks of full-time engineering. A multi-article Substack series that gave the challenge more public visibility than OpenAI’s own communications did.
And on the day the competition ended, I refreshed Discord twice and went to bed.
What actually happened
The shape of the competition, for anyone who tuned in late: OpenAI announces a constrained-architecture challenge in March, partners with RunPod to distribute compute credits, asks the community to find clever ways to push bits-per-byte down on a fixed eval. The leaderboard runs in public. PRs land in the openai/parameter-golf repo. A Discord channel called #parameter-golf-announcements becomes the de facto rules forum.
By the close, the repo had taken in over 1,425 PRs from 518+ unique contributors. OpenAI had pledged a $1M compute commitment on top of the RunPod partnership. The premise was technically interesting, the eval was clean enough to compare submissions, and the people who showed up did real work. simon-marcus hit 0.0935 bpb with his BROADSIDE full-rescore. Stukenov shipped Value Residual Learning derivatives that ate two weeks of leaderboard movement. The community itself was excellent.
The operators were not.
The infrastructure indictment
I already wrote about this in Day 6, “The Pod Lottery”. The short version: identical 8xH100 SXM templates on RunPod showed up to a 3x spread in step time depending on which datacenter you drew. Same config. Same code. Same $21.52/hour leaving the account.
Japan: 216 ms/step. India: 91 ms/step. Canada: somewhere in between. The eval window is 10 minutes, and 216/91 is roughly 2.4 -- the lucky pod pushes about 2.4x more data through the same window, and more data in the window means lower bpb. The leaderboard was, in a measurable way, partly a lottery.
Nobody at OpenAI ever acknowledged this. Nobody at RunPod ever published a normalization methodology. The protocol I ended up using -- benchmark every pod for 20 steps, kill anything over 120 ms, expect to burn $2 finding a good one -- I had to invent and document myself. Other competitors picked it up from my Substack. That is not how you run a scored competition. That is community service.
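For what it is worth, the protocol is trivial to script. Here is a minimal sketch of the benchmark-and-discard loop, assuming a train_step() callable and the 120 ms cutoff; the names are mine, not anything from an official harness.

```python
import time

STEP_BUDGET_MS = 120   # anything slower gets the pod terminated
WARMUP_STEPS = 5       # let the allocator and kernels settle before timing
TIMED_STEPS = 20       # "benchmark every pod for 20 steps"

def mean_step_ms(train_step) -> float:
    """Time TIMED_STEPS training steps after a short warmup.

    If train_step launches async CUDA work, synchronize the device
    before reading the clock (e.g. torch.cuda.synchronize()).
    """
    for _ in range(WARMUP_STEPS):
        train_step()
    start = time.perf_counter()
    for _ in range(TIMED_STEPS):
        train_step()
    return (time.perf_counter() - start) * 1000.0 / TIMED_STEPS

def pod_is_worth_keeping(train_step) -> bool:
    ms = mean_step_ms(train_step)
    print(f"mean step time: {ms:.1f} ms/step")
    return ms <= STEP_BUDGET_MS   # False -> kill the pod, draw again
```

Twenty timed steps is enough to separate a 91 ms pod from a 216 ms one without burning meaningful budget.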
The FlashAttention-3 situation was worse. The competition encouraged architectures that benefited from FA3 kernels. The RunPod base templates didn’t include FA3. The CUDA toolchain version mismatch made it nontrivial to install. I spent ten hours and roughly $100 of compute building it from source, then hosted the wheel publicly at anthony-maio/openai-parameter-golf-fa3-wheel so other competitors could run competitive setups without re-doing the work. Nobody from OpenAI or RunPod ever moved that wheel into the official environment.
The compute credit application form was broken badly enough to earn its own GitHub issue (Issue #942). Participants reported $1,000+ of personal spend while waiting for credits that, in many cases, never arrived. The $1M number was real on the press release. Where it actually went is, as far as I can tell, undocumented.
When a participant has to build and distribute critical infrastructure the organizers forgot, you do not have a competition. You have a leaderboard wrapped around a community service project.
The rules fiasco
Run a scored competition. Publish rules. Stick to them. That is the order. It is not complicated.
Around March 25, an organizer pushed a rule update that retroactively disqualified Hessian-based GPTQ with calibration data. Participants -- me included -- had built strategies on top of that approach for days. Two days later, on March 27, @valerio-oai mass-closed roughly thirty PRs in a single sweep -- #769, #779, #809, #814, #824, #825, #828, #843 among them -- citing three new bans: hashed n-gram caches (improper normalization), two-pass rescoring (eval token leakage), and GPTQ calibration on eval tokens. PR #889, my own sub-1.0 submission at 0.9642 bpb, eventually moved to AT-RISK status on the same day the Substack article about it went live.
My internal council of agents -- GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro -- had to rederive an entire roadmap overnight on the GPTQ change, then again on the n-gram change. That is fine. That is the cost of moving fast. What is not fine is the organizers neither defining a clean replacement rule nor showing up in the channel where the technical disputes were being argued.
Issue #677 was supposed to be the megathread for flagging illegal submissions. @valerio-oai posted “considering options” on the eval-time cache rulings, then went silent for three-plus days with twenty-four days remaining in the competition. The thread turned into a cross-talk forum where competitors argued with each other while organizers stayed largely absent. The Copilot bot left thoughtful code reviews on my PRs. The humans did not.
The community filled the gap themselves. MatoTeziTanka spun up The Agora, a GitHub Pages site that became the de facto leaderboard and compliance engine after the official infrastructure failed. They documented two technical issues that should have been organizer-owned. Issue #897: a roughly 20% BPB underestimation bug, where tokenizers without the U+2581 token overcount byte boundaries and inflate val_byte_count, scoring custom tokenizers better than they deserve. Issue #775: an INT6 quantization design flaw where the scale clamp minimum (1.0/31.0) wastes resolution on roughly 93% of rows at higher weight decay. Both are scoreboard-affecting bugs. Neither got a public ruling.
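To see why the byte-count bug is worth a ruling, here is the arithmetic in miniature -- a toy illustration of the scoring identity with made-up numbers, not the competition's actual eval code:

```python
import math

def bits_per_byte(total_nll_nats: float, val_byte_count: int) -> float:
    """bpb = total negative log-likelihood, converted to bits, per evaluated byte."""
    return total_nll_nats / (math.log(2) * val_byte_count)

# Say a model accumulates 650,000 nats of loss over a 1,000,000-byte eval split.
honest_bpb = bits_per_byte(650_000, 1_000_000)    # ~0.94 bpb

# If a tokenizer quirk inflates val_byte_count by 25%, the reported score falls
# to ~0.75 bpb -- a ~20% "improvement" the model never earned.
inflated_bpb = bits_per_byte(650_000, 1_250_000)

print(honest_bpb, inflated_bpb, 1 - inflated_bpb / honest_bpb)  # last value: 0.2
```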
Then on April 5, @notapplica closed Issue #140 -- the community auto-commentary hub that had been doing more curation work than the official tools -- with the message “Hey guys I’m turning off this agent now.” No explanation. No successor. Twenty-four days of competition still on the clock.
You cannot run a scored competition where the scoring criteria change mid-game and the referees do not show up to explain the new rules.
The effort mismatch
Here is my personal ledger, since this is in part a complaint about being ignored and the only honest way to complain about being ignored is to lay out what was there to ignore.
What I shipped:
PR #657, my first record-breaking neural submission at 1.1229 bpb (Day 5, before any credits arrived).
PR #889, the n-gram cache submission at 0.9642 bpb. Sub-1.0 in the neural category for the first time. Moved to AT-RISK on the same day my Substack article about it went live.
PR #904 and #905, two text-diffusion model submissions. Different architecture class than what the leaderboard leaders were running.
PR #915, a fused softcap + cross-entropy megakernel CUDA submission that hit a 1.94x speedup over torch.compile on the eval (the unfused baseline it replaces is sketched just after this list).
PR #1303, SLOT-16 at 0.9462 bpb. #2 on the neural track.
PR #1313, SLOT-24 at 0.8637 bpb. #1 overall in the entire competition.
PR #1321, SLOT-48 at 0.7406 bpb. #1 by a wide margin. Held until the close.
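For context on PR #915: the unfused baseline looks roughly like the snippet below -- a tanh logit softcap followed by cross-entropy. This is my reconstruction of the baseline for readers who did not follow the PR, not the kernel itself, and the cap value is illustrative.

```python
import torch
import torch.nn.functional as F

SOFTCAP = 30.0  # illustrative; the real cap is a model hyperparameter

def softcap_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Unfused reference: smoothly clamp logits to (-SOFTCAP, SOFTCAP), then CE.

    A fused kernel computes the same loss without writing the full
    (tokens, vocab) capped-logit tensor back to memory between the two ops.
    """
    capped = SOFTCAP * torch.tanh(logits / SOFTCAP)
    return F.cross_entropy(capped.float(), targets)
```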
What I contributed besides PRs:
A Value Residual Learning implementation derived from the ResFormer paper (arXiv:2410.17897) that other competitors picked up and credited (PR #745, stukenov, others); a minimal sketch of the idea follows this list.
The FlashAttention-3 wheel and the GitHub repo that hosted it.
The Pod Lottery operational research, which exposed a fundamental flaw in the competition platform that nobody on the organizer side ever publicly addressed.
A four-article Substack series that drove more eyeballs to this challenge than OpenAI’s announcement channels did.
Roughly $(REDACTED) of my own compute, partially offset by the $1,000 RunPod credit grant that arrived on Day 16 after I had already mentally closed the chapter.
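On the Value Residual Learning item above: the core idea, in one common formulation, is to blend the first block's value vectors into every later block's values before attention. The sketch below is my paraphrase of that idea with a learnable blend weight; the exact parameterization in the ResFormer paper and in the competition PRs may differ.

```python
import torch
import torch.nn as nn

class ValueResidualMix(nn.Module):
    """Blend this layer's values with the values cached from the first layer."""

    def __init__(self):
        super().__init__()
        # learnable blend weight; sigmoid(0) = 0.5, i.e. start from an even mix
        self.raw_lambda = nn.Parameter(torch.tensor(0.0))

    def forward(self, v: torch.Tensor, v_first: torch.Tensor) -> torch.Tensor:
        # v:       values computed by this attention layer   (B, heads, T, head_dim)
        # v_first: values cached from the first layer         (same shape)
        lam = torch.sigmoid(self.raw_lambda)   # keep the weight in (0, 1)
        return lam * v + (1.0 - lam) * v_first
```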
What I got back:
Auto-merged PRs.
Copilot-bot code reviews. Useful, sometimes. Not a substitute for an organizer.
A leaderboard that quietly stopped updating.
An end date with no end ceremony.
I contributed more to this competition’s infrastructure and documentation than the organizations running it. That is not a flex. That is an indictment.
The pattern
OpenAI ran this as part of what they were calling Model Craft, a recruitment surface aimed at undergrads and “young up-and-comers.” Will DePue’s Discord posts made the talent-scouting framing pretty explicit. For OpenAI, the value extraction happened on day one: get the community to publicly demonstrate competitive techniques on a constrained problem, identify the people who could do it, move on. The competition itself, after that, was overhead.
For RunPod, the play was marketing. Brand association with OpenAI, a wave of compute spend driven by 518+ contributors chasing the leaderboard, no obligation to stand up an SLA on top of it.
Neither operator had a strong incentive to finish the competition properly. The talent had been scouted. The brand had been associated. The GPUs had been rented. The participants -- the people who built the FA3 wheels and ran the pod benchmarks and stood up The Agora when the official infrastructure went dark -- were the product, not the customer.
That framing makes the no-winner ending make sense. You do not announce a winner of a recruiting funnel. You close the funnel.
What a real competition looks like
I am not asking for a participation trophy. I am asking for the things any scored competition is supposed to ship.
Fixed published rules with a formal amendment process and a named arbiter.
Standardized hardware, or at minimum a published normalization methodology so that pod region is not a variable in the score.
Organizer presence in the channels where the technical disputes are actually being resolved.
A declared winner, with a public post-mortem of the evaluation methodology.
Acknowledgment of the community contributions that propped up the infrastructure when the operators did not.
Public accounting of where the $1M in compute credits actually went, and to whom.
None of those are exotic asks. They are the floor.
The close
A 46-year-old independent consultant funding his own GPU time is not the demographic OpenAI was aiming this at. I knew that going in. I went in anyway, because the technical problem was interesting and because I wanted to see if my agentic research stack could keep up with people half my age who do this full-time on someone else's payroll. It could. PR #1321 at 0.7406 bpb was the first-place neural submission across the entire competition -- at one point, anyway. Who knows now, right? The agents were watching the leaderboard, and on this one, they won.
I do not need a trophy. I do need OpenAI to understand that the way you wrap up a public competition you put your name on says more about the company than the competition itself did. You ran a six-week scored event, charged participants $21.52 an hour to compete, watched 518+ contributors build the infrastructure your operations team forgot, and then closed the door without picking up the phone.
If you are going to do this again -- and the Model Craft framing suggests you will -- the cost of doing it right is small. You hire one person. They sit in the Discord. They answer the rules questions in writing. They publish a normalization protocol. They publish where the credits went. They announce a winner. That is the floor. The floor is cheap.
The basement, where this one ended up, is more expensive than it looks.
-am

