Propose, Solve, Verify (PSV): Self-play for code with proofs, not tests
Most AI coding systems learn from unit tests: they write code, run it on a few examples, and get rewarded if it passes. But tests are incomplete. Code can pass all tests and still be wrong on rare inputs. So the AI can learn “cheap tricks,” and those errors can spread during training.
PSV replaces this with a much stricter judge: formal verification. Instead of checking a handful of examples, a verifier tries to prove mathematically that the program meets the specification for all possible inputs.
The PSV loop
1. Propose: write the “what”
The Proposer invents a new task by writing a specification (a precise description of what the program must do).
Here is the crucial idea: the proposer does not need to be a genius who can foresee what will stump a superhuman solver (the equivalent of designing a benchmark for Terence Tao). PSV relies on a simpler asymmetry:
- It's often easy to state constraints.
- It's often harder to satisfy them (and prove you did).
A proposer can cheaply say: “Sort this list,” or “Sort it and keep it stable,” or “Sort it, return an index mapping, and prove nothing was lost or duplicated.” Stacking constraints like this is far easier than implementing them and proving the implementation correct.
2. Solve: do the “how”
The Solver tries to write the program (and the proof-style annotations the verifier needs). It samples many attempts (like trying many “mutations”).
3. Verify: harsh selection
A formal verifier checks each attempt. Only solutions that are provably correct count as wins. This is the key difference from unit tests: passing doesn't mean “it worked on a few examples”; it means “it is correct for every input.”
4. Learn: keep the survivors
The solver then trains on those verified wins, becoming more likely to produce correct solutions next time.
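A minimal sketch of one round of this loop, assuming hypothetical callables for the proposer, solver, verifier, and training step (this illustrates the mechanism; it is not the paper's code):

```python
from typing import Callable, List, Tuple

def psv_round(
    propose_spec: Callable[[], str],               # proposer LLM (assumed interface)
    attempt_solution: Callable[[str], str],        # solver LLM: code plus proof annotations
    formally_verify: Callable[[str, str], bool],   # formal verifier (e.g. a proof checker)
    fine_tune: Callable[[List[Tuple[str, str]]], None],
    n_specs: int = 1000,
    attempts_per_spec: int = 8,
) -> List[Tuple[str, str]]:
    """One Propose-Solve-Verify round: only provably correct solutions are kept."""
    survivors: List[Tuple[str, str]] = []
    for _ in range(n_specs):
        spec = propose_spec()                      # 1. Propose: the "what"
        for _ in range(attempts_per_spec):
            program = attempt_solution(spec)       # 2. Solve: the "how"
            if formally_verify(spec, program):     # 3. Verify: correct for all inputs, or it doesn't count
                survivors.append((spec, program))
    fine_tune(survivors)                           # 4. Learn: train on the verified wins
    return survivors
```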
How problems get harder without a smarter proposer: PSV makes “hard” relative, not absolute. Instead of the proposer guessing difficulty in advance, the system measures it empirically:
- If the solver's attempts at a spec verify 90% of the time, the spec is easy.
- If they verify only 10% of the time, it's hard.
- If none of them verify, it's too hard (for now).
The proposer is shown examples labeled EASY / MEDIUM / HARD and asked to generate new specs at a target difficulty. If the solver starts succeeding too often, the system nudges the proposer toward harder specs (more conditions, tighter guarantees). If nothing succeeds, it nudges back.
So the proposer doesn’t need to “outthink” the solver. It just needs to generate many candidate specs, while the system uses feedback (pass rates) to keep the difficulty near the frontier (like a teacher who adjusts homework based on how the student actually performs).
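A rough sketch of how that empirical difficulty labeling could work; the thresholds and the retargeting prompt below are illustrative, not the paper's exact values:

```python
def difficulty_label(n_verified: int, n_attempts: int) -> str:
    """Bucket a spec by the solver's empirical pass rate (thresholds are illustrative)."""
    if n_attempts == 0:
        return "UNKNOWN"
    rate = n_verified / n_attempts
    if rate == 0.0:
        return "TOO_HARD"   # nothing verified: shelve it for now
    if rate >= 0.5:
        return "EASY"
    if rate >= 0.1:
        return "MEDIUM"
    return "HARD"

# The labeled specs then go back into the proposer's prompt, e.g.:
# "Here are recent EASY / MEDIUM / HARD specs. Write new specs likely to land
#  in the HARD bucket: add conditions and tighten the guarantees."
```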
Where does the intelligence increase come from? PSV is basically evolution with a very strict referee:
- Variation: many proposed problems, many attempted solutions.
- Selection: only solutions that pass the verifier survive.
- Inheritance: the solver trains on the survivors.
- Moving frontier: as the solver improves, yesterday’s “hard” becomes today’s “medium,” so the system keeps pushing forward.
That’s why self-play works here: the verifier prevents the loop from “learning lies,” and the proposer and feedback mechanism keep generating fresh challenges just beyond the current capability.
A sign it scales: In the paper, generating more proposed tasks per round helped. Increasing from 1,000 to 32,000 proposed questions raised MBPP pass@1 from 22.3% to 44.3%. This is consistent with the idea that more self-generated practice plus strict verification produces real capability gains.
Paper: https://arxiv.org/abs/2512.18160
Epiplexity
How can next-token prediction on human text lead to superhuman skills? How can synthetic data sometimes beat “real” data? And how did AlphaZero learn so much from nothing but the rules of chess? Classic information theory seems to say this shouldn’t happen. Yet it clearly does.
The problem is that traditional information theory assumes an observer with unlimited computing power. An unbounded observer can crack any code and reverse any function instantly. To such an observer, a cryptographically encrypted message is "simple": they can recover the seed that generated it and easily distinguish it from pure random noise. If you ignore time, ciphertext isn't "random"; it's the output of a short recipe plus a key. But if you can't afford the computation, it behaves like noise.
But AI systems don't have infinite compute. They’re bounded. And once time and compute matter, a new distinction appears:
- Time-Bounded Entropy (Randomness): Data that is computationally hard to predict. This includes true noise, but also things like encryption keys or complex hashes that look random to a neural network.
- Epiplexity (Structure): Patterns, abstractions, and rules that a model can actually learn and use to compress the data within a reasonable time.
The paper formalizes it roughly like this (see the sketch just below):
1. Find the smallest model that can predict the data within a time limit.
2. The size of that model is epiplexity. Whatever remains unpredictable is time-bounded entropy.
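A rough MDL-flavored way to read those two steps, in my own notation rather than the paper's:

```latex
% A rough sketch of the two-step definition above (my paraphrase, not the
% paper's exact notation). \mathcal{M}_t = models whose prediction of the
% data x runs within time budget t.
M^{*} \;=\; \operatorname*{arg\,min}_{M \in \mathcal{M}_t}
  \Big( \underbrace{|M|}_{\text{model size}}
        \;+\; \underbrace{-\log p_{M}(x)}_{\text{residual surprise}} \Big),
\qquad
\underbrace{|M^{*}|}_{\text{epiplexity (structure)}}
\quad\text{and}\quad
\underbrace{-\log p_{M^{*}}(x)}_{\text{time-bounded entropy}}
```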
This solves the paradox. Random noise has high entropy but low epiplexity because no amount of computing power helps you find a pattern, so the model learns nothing. Meanwhile, a strategy game or a textbook has high epiplexity. It forces the model to build complex internal circuits (shortcuts and concepts) to predict the data efficiently.
A neat example from the paper: training a model to predict chess moves is standard. But training it to predict the game in reverse (inferring moves from the final board) is computationally harder. This difficulty forces the model to learn deeper representations of the board state (higher epiplexity), which actually improves its performance on new, unseen chess puzzles. The computation "created" information by converting the implicit consequences of the rules into explicit, usable structures (epiplexity) that the model can now use to play well.
In summary:
The value of data isn’t just about how unpredictable it is. It’s about how much reusable structure it induces in a learner that has real-world limits.
Epiplexity is the amount of structure a model finds worth learning because it reduces prediction error enough to justify the added complexity under a time limit.
Read the paper: https://arxiv.org/abs/2601.03220
Spatiotemporal abstractions
Imagine trying to teach a robot to navigate a complex maze.
Traditional training uses trial and error. The robot tries random movements and gets a reward if it succeeds. The problem is that the AI model controlling the robot isn't deciding on meaningful steps like "walk to the door." Instead, it chooses tiny motor commands, similar to individual muscle twitches.
If the robot has to guess the correct sequence of millions of muscle twitches to solve a maze by random chance, it will fail every time. It flails around, never reaches the goal, and learns nothing.
To solve this, Google researchers first taught the robot simply by having it watch experts. The robot learned to predict the expert's next split-second movement.
Surprisingly, the researchers found that while the robot was learning these tiny movements, it was secretly building a map of the bigger picture. To predict the next twitch accurately, the robot internally needed to know "I am currently walking toward the red door."
The lead researcher explains this difference using coffee. Making a cup of coffee involves tiny, split-second hand movements. But it also involves massive, long-term goals (like driving to the store to buy beans). Traditional robots get stuck optimizing the hand movements. This new approach allows them to "plan the trip."
The researchers created a way to tap into these hidden internal plans. Instead of letting the AI decide every single muscle movement 100 times a second, they built a "steering wheel" that forces the AI to pick one of those high-level intentions (like "go to the red door") and stick with it for a while.
This works for the same reason humans don't plan their day by focusing on individual footsteps. Instead of searching through trillions of twitch combinations, the AI only has to choose between a few high-level plans. Because each choice lasts longer and does something useful, the robot stops flailing and actually reaches the goal, allowing it to finally learn from its success.
The researchers believe this architecture mimics human biology. The main AI model acts like the Cortex (constantly predicting what happens next based on what it sees), while the new "steering wheel" mechanism acts like the Basal Ganglia (nudging the cortex toward rewarding goals and habits).
In summary: Think of the “steering wheel” as a filter. Without it, the robot considers every possible muscle twitch at every millisecond (a search space so vast it is effectively infinite). By locking in a high-level intention, the steering wheel prunes the search space. It forces the robot to ignore the billions of random twitch combinations that don’t help reach the current sub-goal, making low-level actions goal-directed rather than random.
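A generic sketch of the "lock in an intention, hold it, then re-plan" pattern described above. This is standard temporal abstraction, not the paper's architecture; `encoder`, `intention_scorer`, `low_level_policy`, and the `env` interface are all stand-ins:

```python
def run_episode(env, encoder, intention_scorer, low_level_policy,
                n_intentions=16, hold_steps=50, max_steps=2000):
    """Pick a high-level intention, hold it for many control steps, then re-plan (illustrative)."""
    obs = env.reset()
    total_reward, t = 0.0, 0
    while t < max_steps:
        state = encoder(obs)                                   # internal representation built by prediction
        intention = max(range(n_intentions),                   # the "steering wheel": commit to one plan
                        key=lambda k: intention_scorer(state, k))
        for _ in range(hold_steps):                            # hold the intention instead of re-deciding
            action = low_level_policy(state, intention)        # per-timestep motor command
            obs, reward, done = env.step(action)
            state = encoder(obs)
            total_reward += reward
            t += 1
            if done or t >= max_steps:
                return total_reward
    return total_reward
```

Because the intention is chosen once every `hold_steps` control ticks, the search happens over a handful of plans rather than over every motor command, which is exactly the pruning described above.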
Paper: https://arxiv.org/abs/2512.20605
Talk: https://www.youtube.com/watch?v=cx_MIhvAOYM
Digital Red Queen: Adversarial Program Evolution in Core War with LLMs
What happens when you let an LLM write tiny computer programs that compete for control of a virtual computer's memory?
Setup:
- The Environment: A chaotic “toy computer” where programs live inside shared memory. Because the program’s instructions are stored in the same place as its working data, programs can overwrite (sometimes even copy) themselves, and they can try to corrupt their opponents.
- The Algorithm: A self-play algorithm inspired by the Red Queen hypothesis in evolutionary biology, which suggests organisms must constantly adapt just to maintain their relative fitness against evolving competitors.
- The Process: Instead of training against a static objective, the algorithm evolves a lineage of warriors. In each round, an LLM generates a new warrior specifically designed to defeat all previous versions in the lineage.
Outcome: As the evolutionary arms race progresses, the warriors become increasingly robust and general-purpose, capable of defeating human-designed strategies they were never explicitly trained against.
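A minimal sketch of that lineage loop, with hypothetical helpers `generate_warrior` (the LLM call) and `battle` (the Core War simulator); it is not Sakana's actual implementation:

```python
def red_queen_evolution(generate_warrior, battle, n_rounds=50, max_tries=20):
    """Evolve a lineage of warriors; each new one must beat all previous versions (sketch)."""
    lineage = []
    for _ in range(n_rounds):
        for _ in range(max_tries):
            candidate = generate_warrior(lineage)               # LLM writes Redcode targeting the lineage
            if all(battle(candidate, old) for old in lineage):  # must defeat every earlier champion
                lineage.append(candidate)
                break
    return lineage
```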
Read the paper: https://sakana.ai/drq/
Raptors gliding through a cloud of helium-filled soap bubbles, revealing their wingtip and tail vortices.
Paper: High aerodynamic lift from the tail reduces drag in gliding raptors https://journals.biologists.com/jeb/article/223/3/jeb214809/223686/High-aerodynamic-lift-from-the-tail-reduces-drag
Links for 2026-01-09
AI
1. ChatGPT for Healthcare: “Over the past two years, we’ve partnered with a global network of more than 260 licensed physicians across 60 countries of practice to evaluate model performance using real clinical scenarios. To date, this group has reviewed more than 600,000 model outputs spanning 30 areas of focus. Their continuous feedback has directly informed model training, safety mitigations, and product iteration. ChatGPT for Healthcare went through multiple rounds of physician-led red teaming to tune model behavior, trustworthy information retrieval, and other evaluations.” https://openai.com/index/openai-for-healthcare/
2. AI now predicts 130 diseases from 1 night of sleep https://www.nature.com/articles/s41591-025-04133-4
3. Scaling Open-Ended Reasoning To Predict the Future https://openforecaster.github.io/
4. Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space https://arxiv.org/abs/2512.24617
5. LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings https://arxiv.org/abs/2510.08338v3
6. Why LLMs Aren’t Scientists Yet. https://www.lesswrong.com/posts/y7TpjDtKFcJSGzunm/why-llms-aren-t-scientists-yet
7. Anthropic’s new update makes coding agents self-healing https://venturebeat.com/orchestration/claude-code-2-1-0-arrives-with-smoother-workflows-and-smarter-agents
8. Claude Code and What Comes Next https://www.oneusefulthing.org/p/claude-code-and-what-comes-next
9. An AI revolution in drugmaking is under way https://www.economist.com/science-and-technology/2026/01/05/an-ai-revolution-in-drugmaking-is-under-way [no paywall: https://archive.is/Si71E]
10. JPMorgan is cutting all ties with proxy advisory firms and replacing them with AI to help cast shareholder votes https://www.wsj.com/finance/banking/jpmorgan-cuts-all-ties-with-proxy-advisers-in-industry-first-78c43d5f [no paywall: https://archive.is/Ttc8z]
11. How Judges Are Using AI to Help Decide Your Legal Dispute https://www.wsj.com/tech/ai/how-ai-could-help-decide-your-next-legal-dispute-9cb12517 [no paywall: https://archive.is/nmelA]
12. Stack Overflow’s forum is dead thanks to AI, but the company’s still kicking... thanks to AI https://sherwood.news/tech/stack-overflow-forum-dead-thanks-ai-but-companys-still-kicking-ai/
13. First theoretical physics paper to credit an AI assistant https://arxiv.org/abs/2601.02484
14. Real poetry by AI https://gwern.net/fiction/lab-animals
Neuro(tech)
1. These Hearing Aids Will Tune in to Your Brain https://spectrum.ieee.org/hearing-aids-biosignals
2. If an event is more likely to occur at a certain point in time, the brain tracks the time until it occurs more precisely https://www.mpg.de/25980090/brain-estimates-probabilities-of-events
3. A brain-inspired approach to scientific computing https://newsreleases.sandia.gov/nature-inspired-computers-are-shockingly-good-at-math/
4. The End of Privacy: Tracking Technology is Everywhere Now https://www.youtube.com/watch?v=UYWjgceclS4
Miscellaneous
1. Explaining Cloud-9: A Celestial Object Like No Other https://www.centauri-dreams.org/2026/01/07/explaining-cloud-9-a-celestial-object-like-no-other/
2. A tiny number of hyper-prolific individuals are responsible for a massive percentage of public complaints, effectively "monopolizing" government resources and taxpayer money. https://marginalrevolution.com/marginalrevolution/2026/01/the-tyranny-of-the-complainers.html
3. In “Being Nicer than Clippy,” Joe Carlsmith argues that our approach to AI alignment should be guided by “niceness” (a specific human virtue of respecting the preferences and boundaries of others) rather than just a competitive “battle of the utility functions.” https://joecarlsmith.com/2024/01/16/being-nicer-than-clippy
Lots of people are now sold on this idea, including Musk, Bezos, and Sundar Pichai.
https://x.com/paulg/status/2009686627506065779
Linus Torvalds is now vibe-coding: https://github.com/torvalds/AudioNoise
For those who don't know him, he's been the creator and lead developer of the Linux kernel since 1991.
STACK: An open-source artificial intelligence model designed to simulate how human cells behave under different conditions.
Scientists want to know how every type of cell in the human body reacts to every possible drug or disease. However, testing every combination physically in a lab would take years and cost millions of dollars. STACK acts as a "virtual cell model" that can predict these reactions digitally, even for scenarios it has never explicitly seen before.
Unlike previous models that look at cells in isolation, STACK looks at "sets" of cells to understand their environment. Just as a word's meaning changes based on the rest of the sentence, a cell’s behavior changes based on the cells around it.
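One generic way to get that "cell in the context of its set" behavior is permutation-invariant self-attention over a set of cell embeddings. The sketch below only illustrates the idea; it is not STACK's architecture, and all layer sizes are made up:

```python
import torch.nn as nn

class CellSetEncoder(nn.Module):
    """Encode each cell in the context of the other cells in its set (illustrative only)."""
    def __init__(self, n_genes=2000, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_genes, d_model)                 # per-cell expression -> embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context = nn.TransformerEncoder(layer, n_layers)    # attention across the set of cells
        self.head = nn.Linear(d_model, n_genes)                  # e.g. predict post-perturbation expression

    def forward(self, cells):                                    # cells: (batch, n_cells, n_genes)
        h = self.embed(cells)
        h = self.context(h)                                      # each cell attends to its neighbors
        return self.head(h)
```

No positional encoding is used, so the set of cells is treated as unordered context, much like the "rest of the sentence" analogy above.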
When compared to real laboratory experiments, the model's predictions were found to capture biologically meaningful and accurate effects.
Read more: https://arcinstitute.org/news/foundation-model-stack
What if we could simulate an interactive 3D world, from a single image, in the wild, in real time?
PointWorld: A world model designed to give robots a form of "physical imagination," helping them predict how the world will change when they touch it.
Instead of processing flat images like a standard video camera, the system views the world and the robot's own body as a cloud of 3D dots (points). It predicts the "flow" or path of every single dot over time, allowing it to understand complex interactions like geometry and contact.
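A toy sketch of the "predict the flow of every point" idea: given the current point cloud and a robot action, predict a displacement for each point. A real model would condition on the whole cloud and on contact geometry; this per-point MLP and its shapes are purely illustrative, not PointWorld's model:

```python
import torch
import torch.nn as nn

class PointFlowPredictor(nn.Module):
    """Predict a displacement (flow) vector for every 3D point, given an action (toy sketch)."""
    def __init__(self, action_dim=7, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                       # per-point (dx, dy, dz)
        )

    def forward(self, points, action):                  # points: (N, 3), action: (action_dim,)
        a = action.unsqueeze(0).expand(points.shape[0], -1)
        flow = self.mlp(torch.cat([points, a], dim=-1))
        return points + flow                            # predicted point cloud after the interaction
```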
Read more: https://point-world.github.io/
Anthropic built “Cowork” in the last week and a half.
Claude Code wrote 100% of it.
Read more: Cowork: Claude Code for the rest of your work https://claude.com/blog/cowork-research-preview
Nvidia researchers developed a new method called TTT-E2E that mimics how humans learn. They use the analogy of a university lecture: years later, you likely won't remember the professor's exact words (perfect recall), but you retain the skills and intuition you learned (compressed knowledge).
Instead of just temporarily holding your conversation in a "short-term memory" buffer, this new method actually trains itself on your conversation while it is happening.
Usually, an AI stops learning once it is released to the public. This new model continues to learn and update its internal "brain" (weights) with every new sentence it reads.
By treating the current context as training data, the AI compresses that information into its permanent understanding. This allows it to "understand" a massive amount of information without needing to keep a perfect log of every single word.
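A rough sketch of the general test-time-training idea (gradient steps on the live context), assuming a Hugging Face-style causal LM that returns a loss when given labels; the chunk size and learning rate are made up, and this is not NVIDIA's TTT-E2E recipe:

```python
import torch

def test_time_train(model, tokenizer, context_text, chunk_tokens=512, lr=1e-5):
    """Take gradient steps on the conversation itself, compressing it into the weights (sketch)."""
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    ids = tokenizer(context_text, return_tensors="pt").input_ids
    for start in range(0, ids.shape[1], chunk_tokens):
        chunk = ids[:, start:start + chunk_tokens]
        if chunk.shape[1] < 2:
            continue
        loss = model(input_ids=chunk, labels=chunk).loss   # next-token loss on the live context
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    return model
```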
Read more: https://developer.nvidia.com/blog/reimagining-llm-memory-using-context-as-training-data-unlocks-models-that-learn-at-test-time/
Driverless delivery vehicles in China are going viral for plowing through everything.
The Department of War: «Military AI is going to be a race for the foreseeable future, and therefore speed wins»
The Secretary of War directs the Department of War (DoW) to accelerate "America's Military AI Dominance" by transforming into an "AI-first" warfighting force.
Read more: https://media.defense.gov/2026/Jan/12/2003855671/-1/-1/0/ARTIFICIAL-INTELLIGENCE-STRATEGY-FOR-THE-DEPARTMENT-OF-WAR.PDF
---
I am old enough to remember when people worried about AI risk would reject any mention of Skynet as Hollywood and unrealistic. And nobody would build humanoid robots, because that would be stupid. Also, everyone would obviously lock the AI into a box without an internet connection; the risk was that it might escape by convincing humans to let it out.
Turns out Hollywood was right.
But surely, we have the common sense not to hand the AI the nuclear launch codes?
Narrator: They did, in fact, hand over the codes immediately.
Google announces that a novel theorem in algebraic geometry was proved with substantial help from an internal math-specialized version of Gemini.
The solution was not in the training data: "the model outputs [particularly from FullProof] do not appear to the authors to be that close to those or (to the best of our knowledge) any other pre-existing sources. So, absent some future discovery to the contrary, the model’s contribution appears to involve a genuine combination of synthesis, retrieval, generalization and innovation of these existing techniques".
Paper: https://arxiv.org/abs/2601.07222
Source of the screenshot: Adam Brown from Google DeepMind https://x.com/A_G_I_Joe/status/2011213878395617571
Links for 2026-01-14
AI
1. The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning https://arxiv.org/abs/2601.06002
2. MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory https://arxiv.org/abs/2601.03192
3. Dr. Zero: Self-Evolving Search Agents without Training Data https://arxiv.org/abs/2601.07055
4. A method to learn latent action world models using “in-the-wild” videos (real-world footage that is diverse, messy, and uncurated). https://arxiv.org/abs/2601.05230
5. 1X World Model | From Video to Action: A New Way Robots Learn https://www.1x.tech/discover/world-model-self-learning
6. ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands https://arxiv.org/abs/2512.24965
7. Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf
8. How does scaling up neural networks change what they learn? On neural scaling and the quanta hypothesis https://ericjmichaud.com/quanta/
9. An FAQ on Reinforcement Learning Environments https://epochai.substack.com/p/an-faq-on-reinforcement-learning
10. Cowork: Claude Code for the rest of your work. Cowork lets you complete non-technical tasks much like how developers use Claude Code. https://www.lesswrong.com/posts/fm2N4cws8nbdmfGux/claude-coworks
11. On the Origins of Algorithmic Progress in AI https://mitfuturetech.substack.com/p/on-the-origins-of-algorithmic-progress
12. The Gentle Singularity; The Fast Takeoff https://www.prinzai.com/p/the-gentle-singularity-the-fast-takeoff
13. What do experts and superforecasters think about the future of AI research and development? https://forecastingresearch.substack.com/p/what-experts-and-superforecasters
AI for math
1. Lies, Damned Lies, and Proofs: Formal Methods are not Slopless https://www.lesswrong.com/posts/rhAPh3YzhPoBNpgHg/lies-damned-lies-and-proofs-formal-methods-are-not-slopless
2. From 2.8% to 100%: Automated Proof Verification with Aristotle https://igorrivin.github.io/research/polya-szego-aristotle/ [Aristotle’s waitlist is gone, and now anyone can sign up and immediately get access. https://aristotle.harmonic.fun/]
3. AxiomProver Solves All Problems at Putnam 2025: Proof Release & Commentary https://axiommath.ai/territory/from-seeing-why-to-checking-everything
4. Terence Tao on the emergence of AI-powered write–rewrite cycles for mathematical exposition. https://mathstodon.xyz/@tao/115855852706322322
5. Terence Tao: “I can honestly say I learned something from Aristotle; a minor thing to be sure, but still useful.” https://www.erdosproblems.com/forum/thread/679#post-3050
Miscellaneous
1. This simple design change could finally fix solid-state batteries https://www.sciencedaily.com/releases/2026/01/260108231331.htm
2. Pentagon bought device through undercover operation some investigators suspect is linked to Havana Syndrome https://edition.cnn.com/2026/01/13/politics/havana-syndrome-device-pentagon-hsi
3. UK to develop new deep strike ballistic missile for Ukraine https://www.gov.uk/government/news/uk-to-develop-new-deep-strike-ballistic-missile-for-ukraine
4. A new study suggests it may be possible to regenerate cartilage lost to aging or arthritis with an oral drug or local injection, rendering knee and hip replacement unnecessary. https://news.stanford.edu/stories/2025/11/joint-cartilage-aging-osteoarthritis-therapy-research
5. Scientists detect the lowest mass dark object currently measured - an exotic concentration of dark matter? https://www.mpg.de/25518363/1007-asph-astronomers-image-a-mysterious-dark-object-in-the-distant-universe-155031-x
6. To be even minimally competent in any intellectual field requires having a bedrock of core knowledge memorized. So you *don’t* have to think about it. For math, far more than times tables. https://intellectualtakeout.org/2015/09/why-its-still-important-to-memorize/
Cursor autonomously coded a web browser from scratch by running hundreds of concurrent coding agents for weeks.
They achieved this by using specialized agents they call planners, workers, and judges:
Planners: Continuously explore the codebase, create tasks, and spawn sub-planners for specific areas.
Workers: Focus purely on executing assigned tasks without worrying about broader coordination.
Judges: Determine when to continue or restart cycles to prevent tunnel vision.
Model Selection: GPT-5.2 excelled at long-running tasks and planning compared to Opus 4.5 (which took shortcuts) or GPT-5.1-codex.
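A highly simplified sketch of that planner/worker/judge division of labor; the agent objects and their methods are stand-ins, not Cursor's implementation:

```python
from collections import deque

def run_orchestration(planner, worker, judge, max_cycles=10):
    """Planner proposes tasks, workers execute them, a judge decides whether to go on or restart."""
    results = []
    for _ in range(max_cycles):
        tasks = deque(planner.plan())             # planner explores the codebase and emits tasks
        results = []
        while tasks:
            task = tasks.popleft()
            results.append(worker.execute(task))  # worker focuses on one task, no global coordination
        verdict = judge.review(results)           # judge guards against tunnel vision
        if verdict == "done":
            break
        if verdict == "restart":
            planner.reset()                       # discard the current plan and start a fresh cycle
    return results
```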
Read more: https://cursor.com/blog/scaling-agents#planners-and-workers
My gut feeling still can't quite grasp the possibility of transformative AI being just a few years away. But on a purely rational level, recent research has made me more convinced than ever that the world is going to change dramatically.
Over the next few years, we will see the emergence of adaptive, self-improving systems that can create their own tools. Training data will come from synthetic adversarial evolution among millions of agents, from proposer-solver-verifier pipelines, and from search and optimization cycles. Furthermore, these systems will learn from their interactions with experts, such as Terence Tao, by continually compressing useful structures into their weights while using inexpensive, explicit storage for facts and logs.
Grok seems to be finally catching up when it comes to math. An internal beta version of Grok 4.20 helped Paata Ivanisvili (Professor of Mathematics) with his research: https://x.com/PI010101/status/2011560477688463573
The significance of this and similar data points does not lie in the result itself. Rather, it is the fact that models are now frequently useful for people at the far-right tail of human intellectual ability. That is the noteworthy milestone here. And given that there is absolutely no reason to expect AI progress to stop here, this implies a significant probability that things could get disruptive relatively soon.