EdgeBench

An ultra-long-horizon benchmark built to measure learning from environments.

Most benchmarks score what a model already knows. EdgeBench is built to measure something else — how an agent learns from a real-world environment when it is given the time, the feedback, and the room to improve.

First benchmark to measure real-world environment learning
1st
Every workspace, feedback signal, and judge approximates real practice, so a high score reflects what an agent learns.

Ultra-long-horizon real-world tasks
≥12 h
Each task runs 12+ hours of continuous operation, long enough for experience to compound. We've pushed past 72 h.

High diversity, mostly built from scratch
134 tasks / 6 families
Tasks span science, software engineering, optimization, knowledge work, formal math, and games. Most are brand-new, built from zero.

Every task hand-reviewed by researchers
57.2 h
EdgeBench researchers reviewed and iterated every task with domain experts. A task takes an expert 57.2 h to complete on average, and up to 320 h.

We're open-sourcing an initial 51 of the 134 tasks, together with the full evaluation framework, so the community can study how agents learn from real-world environments.

Open-sourced / total tasks per family:
Scientific & ML: 4/39
Systems & SE: 13/36
Optimization: 14/19
Knowledge: 4/19
Formal: 8/13
Games: 8/8

Agent learning curves

Each panel is one task. Every model draws its own best-so-far performance curve over a 12-hour run, in its own colour — all advancing at the same pace in time. The view cycles through the paper's representative tasks, three from each of the six capability families.


134 × 3 noisy curves, one clean law.

Add the individual runs one at a time — 134 tasks × 3 runs — and average them in: the mean smooths into a log-sigmoid, \(S(t) = S_{\max}/(1 + (t_{\mathrm{mid}}/t)^{\beta})\). All five models fit it — and plotting \(\log[\,S/(S_{\max} - S)\,]\) against \(\log t\) straightens each model's law into a line of slope \(\beta\).
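The averaging and the straight-line check described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the benchmark's fitting code: the function names, the noise level, and the parameter values (`s_max=80`, `t_mid=2`, `beta=1.3`) are assumptions chosen for the example, and \(S_{\max}\) is treated as known rather than fitted.

```python
import numpy as np

def log_sigmoid(t, s_max, t_mid, beta):
    """S(t) = s_max / (1 + (t_mid / t) ** beta)."""
    t = np.asarray(t, dtype=float)
    return s_max / (1.0 + (t_mid / t) ** beta)

def fit_slope(t, s, s_max):
    """Fit log[S / (s_max - S)] against log t with a straight line.

    Returns (beta_hat, t_mid_hat). s_max is assumed known here.
    """
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    y = np.log(s / (s_max - s))
    slope, intercept = np.polyfit(np.log(t), y, 1)
    # y = beta * log t - beta * log t_mid, so t_mid = exp(-intercept / beta)
    return slope, np.exp(-intercept / slope)

# Synthetic stand-in for 134 tasks x 3 runs (values are illustrative only).
rng = np.random.default_rng(0)
hours = np.geomspace(0.25, 12.0, 40)
true_curve = log_sigmoid(hours, s_max=80.0, t_mid=2.0, beta=1.3)
runs = true_curve + rng.normal(0.0, 3.0, size=(134 * 3, hours.size))

# Averaging the noisy runs recovers a smooth curve; the line fit recovers beta.
mean_curve = runs.mean(axis=0)
beta_hat, t_mid_hat = fit_slope(hours, mean_curve, s_max=80.0)
print(f"beta ~ {beta_hat:.2f}, t_mid ~ {t_mid_hat:.2f} h")
```

The linearization works because \(S/(S_{\max} - S) = (t/t_{\mathrm{mid}})^{\beta}\), so the intercept of the fitted line also yields \(t_{\mathrm{mid}}\).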


A theory of the log-sigmoid law.

The law is not a curve we drew through the data — it falls out of how capability accumulates. Score is the sum of many small units, the nodes of a latent graph, and each stays locked until the capabilities already unlocked scaffold it into reach. A locked unit feels a field \(h = \kappa x\) proportional to everything unlocked, while only the locked mass \(1 - x\) can still yield new score. The product of those two forces is the logistic \(dx/du = \beta x(1 - x)\) in log-time \(u = \ln(t/t_{\mathrm{mid}})\) — whose solution is the log-sigmoid.
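As a worked step, the logistic can be integrated directly. The sketch below assumes that \(x\) is the normalized score \(S/S_{\max}\) and that \(x = \tfrac12\) at \(t = t_{\mathrm{mid}}\); both are implied by the formula for \(S(t)\) but not stated explicitly.

\[
\begin{aligned}
\frac{dx}{du} = \beta x(1-x)
\;&\Longrightarrow\; \int \frac{dx}{x(1-x)} = \int \beta\,du
\;\Longrightarrow\; \ln\frac{x}{1-x} = \beta u + C, \\
x(0) = \tfrac12 \;&\Longrightarrow\; C = 0
\;\Longrightarrow\; x(u) = \frac{1}{1 + e^{-\beta u}}, \\
u = \ln\frac{t}{t_{\mathrm{mid}}} \;&\Longrightarrow\;
x = \frac{1}{1 + (t_{\mathrm{mid}}/t)^{\beta}}, \qquad S(t) = S_{\max}\,x .
\end{aligned}
\]

Since \(du/dt = 1/t\), the same equation in real time reads \(dx/dt = \beta x(1-x)/t\).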

Left: a capability graph built from self-similar scales. Each scale is a scaled copy of the one before, reached only after a constant factor \(r\) more real interaction time (here \(r \approx 2.3\)), so scale \(d\) sits at real time \(t_d \propto r^{\,d}\). How many new units a scale adds follows the mean-field rate \(\beta x(1 - x)\), a bell, so the shells are fat in the middle and thin at the ends \([1,2,4,8,16,24,30,\dots]\). The frontier sweeps outward, one scale at a time.

Right: the score is the CDF of the units' actual unlock times, not of times assigned in advance. Because equal scales are equal slabs of \(\ln t\) and each carries the bell's worth of units, the log-sigmoid \(x(u) = 1/(1 + e^{-\beta u})\) (\(\beta = 1.0\)) emerges, here with \(R^2 > 0.999\); pull the slider toward 10 and watch it roughen. The faint vertical bands are those scales: a uniform ladder only on this \(\ln t\) axis. Toggle Real time to unpack them into linear \(t\): the bands crush to the left and progress decelerates as \(1/t\). That constant time factor per self-similar scale is the whole reason the law lives on \(\ln t\).
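The mechanism can be imitated with a small mean-field simulation. This is a simplified sketch, not the page's interactive model: there is no explicit graph, a single seed unit is assumed so the field starts non-zero, and `n_units`, `du`, and the function names are hypothetical. Each locked unit unlocks during a log-time step `du` with probability `beta * x * du`. The empirical CDF of unlock times is then compared with the log-sigmoid, and the times are binned into equal slabs of \(\ln t\) of width \(\ln r\).

```python
import numpy as np

def simulate_unlock_times(n_units=20000, beta=1.0, du=0.01, seed=0):
    """Mean-field unlocking in log-time u; returns sorted unlock times."""
    rng = np.random.default_rng(seed)
    unlocked = 1              # assumed seed unit so the field is non-zero
    u = 0.0
    times = [0.0]
    while unlocked < n_units:
        x = unlocked / n_units                       # fraction unlocked
        # Each of the locked units unlocks with probability beta * x * du.
        new = int(rng.binomial(n_units - unlocked, beta * x * du))
        u += du
        times.extend([u] * new)
        unlocked += new
    return np.array(times)

def r_squared(observed, predicted):
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

BETA = 1.0
R = 2.3   # time factor between scales, the value quoted in the text

times = simulate_unlock_times(beta=BETA)
u = times - np.median(times)                          # u = 0 at the half-way point
score = np.arange(1, times.size + 1) / times.size     # empirical CDF of unlocks
law = 1.0 / (1.0 + np.exp(-BETA * u))                 # log-sigmoid in log-time
r2 = r_squared(score, law)

# One bin per scale d = -15..15, each an equal slab of ln t of width ln R.
edges = np.log(R) * np.arange(-15.5, 16.5)
shell_counts, _ = np.histogram(u, bins=edges)
print(f"R^2 = {r2:.4f}; largest shell index = {shell_counts.argmax()}")
```

In this toy version the shell counts peak near the central scale and fall away on both sides, which is the bell shape described above.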

AI learns from environments roughly twice as fast every three months.

To isolate environment learning from prior knowledge, we selected 18 tasks where models start from similar initial performance. We then evaluated model releases from September 2025 to May 2026 for two hours each, using the score gain within that 2-hour window as the learning-speed metric. The frontier trend shows that AI learning speed from environments roughly doubles every three months.
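The doubling claim translates into simple arithmetic. The sketch below is illustrative only: the function names are hypothetical and the data points are synthetic, generated from an exact three-month doubling rather than taken from the benchmark. It shows how a doubling time could be estimated from a line fit of \(\log_2\)(gain) against release date.

```python
import math

def growth_factor(months, doubling_months=3.0):
    """Multiplicative speed-up after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)

def doubling_time(months, gains):
    """Least-squares fit of log2(gain) = a + months / T; returns T in months."""
    n = len(months)
    ys = [math.log2(g) for g in gains]
    mean_m = sum(months) / n
    mean_y = sum(ys) / n
    cov = sum((m - mean_m) * (y - mean_y) for m, y in zip(months, ys))
    var = sum((m - mean_m) ** 2 for m in months)
    return var / cov          # T = 1 / slope

# Synthetic 2-hour gains at releases 0, 2, 4, 6, and 8 months after the first.
release_months = [0, 2, 4, 6, 8]
synthetic_gains = [1.5 * growth_factor(m) for m in release_months]

print(round(doubling_time(release_months, synthetic_gains), 2))
print(round(growth_factor(8), 2))
```

If the trend held exactly across the roughly eight months between September 2025 and May 2026, it would compound to a factor of \(2^{8/3} \approx 6.35\).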

Inside a single 12-hour run.

One GPT-5.5 run on the gravitational-wave task, traced submission by submission. Across 247 scored attempts the best-so-far score climbs from 42.8 to 67.0, with seven turning points where the agent reframes the problem rather than just tuning.
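A best-so-far curve is a running maximum over the scored attempts. The sketch below uses made-up scores, not the run's data, and its function names are hypothetical. It also lists the attempts that set a new best; these are only a mechanical proxy, since the turning points described above are qualitative reframings and need not coincide with every new best.

```python
from itertools import accumulate

def best_so_far(scores):
    """Running maximum of the scores, one value per attempt."""
    return list(accumulate(scores, max))

def new_best_indices(scores):
    """Indices of attempts that strictly improve on every earlier score."""
    best = float("-inf")
    indices = []
    for i, score in enumerate(scores):
        if score > best:
            best = score
            indices.append(i)
    return indices

# Toy scores for illustration only.
toy_scores = [10.0, 9.0, 12.5, 11.0, 12.5, 20.0, 18.0]
print(best_so_far(toy_scores))       # [10.0, 10.0, 12.5, 12.5, 12.5, 20.0, 20.0]
print(new_best_indices(toy_scores))  # [0, 2, 5]
```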

42.8→67.0 Performance · GPT-5.5 · 12-hour run

Gravitational Wave: reconstruct a gravitational-wave signal from LIGO strain with waveform, spectrogram, and source dynamics.

Legend: best learned behavior; best-so-far envelope.

Benchmark leaderboard.