DeepSeek-R1 and the Rise of RL Reasoning: Inference Compute Overturns the Pretraining Scaling Race

It was no accident that 'Learning to Reason with LLMs' and 'DeepSeek-R1' topped Hacker News side by side. As brute-force pretraining scaling hits diminishing returns, rewarding reasoning directly through reinforcement learning and letting models think longer at answer time has become the new battleground. The shift reshapes HBM demand curves, datacenter power design, and the odds of an open-weight catch-up.

When two posts climbed the Hacker News front page on the same day, the pairing read like a thesis statement. One described training a model to unfurl a long chain of thought before committing to an answer. The other was DeepSeek-R1, claiming that this very capability could be coaxed out through reinforcement learning rewards alone, without an elaborate scaffold of supervised reasoning traces. They pointed in the same direction: the formula that had governed the field for years (bigger models, more data, longer pretraining) had reached a point where it no longer pushed capability upward in a straight line, and a different lever was quietly taking its place.

The curve that flattened, and the one that replaced it

The logic of pretraining scaling was seductive precisely because it was simple. Loss fell predictably as parameters and tokens grew together, and that empirical regularity became a coordinate system for pouring capital in a single direction. But the curve grew steep on the cost axis and gentle on the capability axis. High-quality text is finite, and doubling a cluster did not double a model's ability to reason through a hard problem. What DeepSeek-R1 demonstrated is that a base model of fixed size can leap forward on verifiable tasks like mathematics and code when reaching a checkable correct answer is itself turned into a reward signal. Left to optimize against that signal, the model began generating longer deliberations on its own, second-guessing intermediate steps and revisiting its own conclusions. The locus of capability shifted from knowledge frozen into pretrained weights toward computation summoned in the moment of answering.
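
To make the idea of a verifiable reward concrete, here is a minimal sketch of a rule-based checker for math-style answers. It assumes the model is asked to put its final answer inside \boxed{...}; that convention, the function names, and the exact-match comparison are illustrative assumptions, not a description of DeepSeek's actual reward code.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the completion, or None.

    Assumption for illustration: the final answer is wrapped in \\boxed{...}
    and contains no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer exactly matches the reference, else 0.0.

    The reward looks only at the final answer, not at the reasoning text,
    so the model is free to deliberate for as long as it likes beforehand.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0
```

Because the checker is a plain function rather than a learned judge, it can score large numbers of sampled completions cheaply. A real system would likely need more forgiving answer normalization than exact string equality.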

The economics of this move run opposite to those of pretraining. Pretraining front-loads an enormous one-time cost, after which inference is comparatively cheap. Inference-time reasoning inverts that ledger: every time the model meets a hard question it generates thousands or tens of thousands of fresh tokens of thought, so cost accumulates not at the training stage but at the serving stage, scaling with usage rather than being paid once. A genuinely new design variable enters the picture: trade more thinking time for more accuracy. With it comes the prospect that reasoning demand expands without a natural ceiling as adoption grows.
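
The inverted ledger can be sketched with simple arithmetic. The snippet below compares a one-time training cost with a serving cost that grows with usage. Every figure in it (the training budget, the per-token price, the token counts) is a hypothetical placeholder chosen for round numbers, not a measured or reported value.

```python
def serving_cost_usd(queries: int, tokens_per_query: int,
                     usd_per_million_tokens: float) -> float:
    """Cumulative serving cost, which grows linearly with usage."""
    return queries * tokens_per_query * usd_per_million_tokens / 1_000_000


def queries_to_match_training(training_cost_usd: float, tokens_per_query: int,
                              usd_per_million_tokens: float) -> float:
    """Number of queries at which cumulative serving cost equals training cost."""
    return training_cost_usd * 1_000_000 / (tokens_per_query * usd_per_million_tokens)


if __name__ == "__main__":
    # All figures below are hypothetical placeholders, not measured values.
    training_cost = 1_000_000.0   # one-time cost, USD
    price = 2.0                   # USD per million generated tokens
    for label, tokens in [("short answer", 500), ("long reasoning", 10_000)]:
        n = queries_to_match_training(training_cost, tokens, price)
        print(f"{label}: serving matches training cost after {n:,.0f} queries")
```

Under these assumptions, a model that thinks for twenty times as many tokens reaches the break-even point after one twentieth as many queries, which is the sense in which cost moves from the training stage to the serving stage.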

The chain reaction through memory, power, and open weights

The first thing to wobble downstream is the character of memory demand. Producing a long chain of thought means holding a long context and piling key-value caches into memory, which strains bandwidth and capacity before it strains compute cores. If the story propping up HBM demand was once memory to train ever-larger models, the reasoning era reframes it as memory to serve the long deliberations of many concurrent users. Training demand concentrates in a handful of giant clusters; inference demand spreads broad and thick across the entire service frontier. The very shape of the demand curve changes.
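
A back-of-the-envelope calculation shows why long deliberations press on memory first. The sketch below uses the standard key-value cache size for a transformer that caches uncompressed keys and values for every layer. The model dimensions are hypothetical, and architectures that compress or share the cache would need a different formula.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   concurrent_sequences: int = 1) -> int:
    """Approximate KV cache size for uncompressed per-layer key/value caching.

    The factor of 2 covers one key vector and one value vector per token,
    per layer, per KV head. bytes_per_value=2 corresponds to 16-bit storage.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * concurrent_sequences


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to give round numbers.
    one_user = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                              seq_len=32_768)
    many_users = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                                seq_len=32_768, concurrent_sequences=64)
    print(f"one sequence: {one_user / 2**30:.0f} GiB")
    print(f"64 sequences: {many_users / 2**30:.0f} GiB")
```

The cache grows linearly in both sequence length and the number of concurrent sequences, so a service that lets many users think at length multiplies the two.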

Datacenter power design feels the same pressure from a different angle. Training is close to a predictable, sustained load, whereas inference is a volatile one that swells and recedes with user traffic and question difficulty, and a model tuned to think harder spends more tokens, and therefore more energy, on the same prompt. With power already the real bottleneck on datacenter expansion, operators now weigh the marginal kilowatt-hour against marginal accuracy. The better the answer you want, the larger the electricity bill and the carbon footprint that come attached.
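
The trade between kilowatt-hours and accuracy can be framed as a simple ratio. The sketch below assumes energy scales linearly with generated tokens at a fixed joules-per-token figure, which is a simplification. The function names and any numbers passed in are hypothetical and would have to come from an operator's own measurements.

```python
JOULES_PER_KWH = 3_600_000


def energy_kwh(tokens: int, joules_per_token: float) -> float:
    """Energy for generating `tokens` tokens, assuming a fixed cost per token."""
    return tokens * joules_per_token / JOULES_PER_KWH


def extra_kwh_per_accuracy_point(base_tokens: int, long_tokens: int,
                                 base_accuracy_pct: float, long_accuracy_pct: float,
                                 joules_per_token: float, queries: int) -> float:
    """Extra energy spent per percentage point of accuracy gained.

    Compares a short-thinking setting with a long-thinking setting over the
    same number of queries. All inputs are supplied by the caller.
    """
    gained_points = long_accuracy_pct - base_accuracy_pct
    if gained_points <= 0:
        raise ValueError("longer thinking must improve accuracy for this ratio")
    extra_tokens = (long_tokens - base_tokens) * queries
    return energy_kwh(extra_tokens, joules_per_token) / gained_points
```

A ratio like this gives operators one number to compare against the price of power when deciding how long a model should be allowed to think.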

The same shift, almost paradoxically, hands the open-weight camp a credible route to catch up. The pretraining scaling race was a game for the few who could marshal astronomical capital, but reinforcement-learned reasoning can be layered onto a published base model with comparatively modest resources. The proof arrived quickly: once DeepSeek released R1's weights and recipe, a wave of follow-on work reproduced and remixed the same approach. If the key to reasoning lies more in well-designed verifiable rewards than in a proprietary data pipeline, the moat the closed labs have enjoyed may be shallower than assumed. What DeepSeek-R1 opened was not merely one smarter model, but a new phase that relocated the axis of competition, the center of gravity of cost, and the height of the barrier to entry all at once.
