The World's RL Gym

‍I want to extend my deep gratitude to Travis Good at Ambient, Ben Fielding at Gensyn, Yuchen Jin at Hyperbolic, Roger Jin at Nous Research, Alexander Long at Pluralis Research, Johannes Hagemann at Prime Intellect, YB from Terminally Onchain, and Henry Freed, an Independent AI/ML Researcher. Your conversations and review were extremely helpful in writing this piece.

Introduction

“There are decades where nothing happens; and there are weeks where decades happen.” Nowhere does this quote feel more relevant than in today’s modern AI landscape. Seemingly every day there is a new breakthrough model, training method, or company bursting onto the scene that forces us to reorient our understanding of what is possible within the world of AI. Earlier this year it was DeepSeek, next it was project Stargate, now it’s Qwen, Manus, MCP, etc. Who knows what will happen next?

At a macro level, one of the most exciting trends in LLM development has been how approaches to improving model performance and functionality have rapidly evolved in just the past couple of years. At present, scaling via pre-training and, more recently, test-time compute, guide much of the industry’s approaches to building better models. But recently, with the release of DeepSeek-R1 and R1-Zero, a greater appreciation for a different approach to scaling models is starting to emerge – reinforcement-learning (RL). The goal of this article is to explore the implications of RL-based model improvement with a specific focus on how the RL process may or may not lend itself well to decentralization.

By the end of this piece, I’ll hope to leave you with three key takeaways:

An understanding of the rough timeline of AI model improvement techniques and how different approaches have evolved over time.
An appreciation for the burgeoning ‘RL Renaissance’ through highlighting the techniques used to post-train DeepSeek-R1 and R1-Zero.
Why some (but perhaps not all) of the components of RL post-training can benefit from decentralization.

Before diving into the nitty gritty of how DeepSeek employed RL to train R1, we’ll walk through a (greatly condensed) timeline of events to understand how we got to where we are today.

A (Very) Brief History of AI/ML Scaling

2020 - Early 2023: Pre-training Scaling Laws, Understanding the Importance of Data in Training

In 2020, researchers at OpenAI published “Scaling Laws for Neural Language Models.” The paper was significant because it explicitly articulated the tradeoffs in model size, data, and compute while scaling LLMs. Later in 2022, DeepMind researchers helped expand the scaling laws, with “Training Compute-Optimal Large Language Models.”

This paper crystallized what is now referred to as the “Chinchilla Scaling Law,” which showed that many existing models at the time were undertrained relative to their parameter count. That is, they had too many parameters relative to the amount of data used to train the model. This work helped researchers understand the optimal ratio of data to parameters (roughly 20 tokens per parameter), which resulted in much greater amounts of data being used in training than had been previously.

*The original* *Scaling Laws* *paper (which also entertainingly reads as a current who's who of LLM pioneers).*

With the crystallization of pre-training scaling laws around 2022-23, the era of “more data + more compute = better models” was ushered in. So long as we could get enough data and compute to throw into pre-training for models, we’d end up with more performant models. The OpenAIs, Metas, and Anthropics of the world hyper-fixated on securing the massive amounts of data and compute needed to meet the demands of training ever-larger frontier models. And in doing so, they were able to consistently release better and better groundbreaking models. But then, in late 2024, OpenAI’s reasoning models introduced a new approach to scaling model performance.

2024: Reasoning Models and Test-Time Compute Scaling

When OpenAI released their o1 models in early September 2024, they were some of the first models demonstrating systematic chain-of-thought reasoning made accessible to the public. These models use deliberate step-by-step reasoning approaches, evaluating multiple potential solutions before arriving at a final answer. Reasoning models showed a massive increase in capabilities on abstract reasoning tasks, illustrated by the incredible increase in scores on the ARC-AGI reasoning tasks:

A *graph from Riley Goodside* *(@goodside) showing the ARC-AGI scoring breakthroughs that came with the release of OpenAI's reasoning models.*

Further, with this release came the understanding that you can make models better after they’ve been trained through increasing test-time compute (the amount of compute used when a model is attempting to solve a problem).

Concretely, researchers at Google DeepMind showed that smaller models, when given enough compute at time of inference, could reliably outperform larger models that had received much more compute at pre-training in the aptly titled, "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." Want a model to give you a better answer? Give it more time to think about the problem and reason its way to the best solution. This marked a new emphasis on developing approaches around scaling time-test compute to achieve better models.

Late 2024 - Early 2025: Cracks in the Armor of Pre-Training

With TTC scaling, we now had two levers we could pull to make our models better. One when initially training the model, and another after the model had been trained. And this second approach could not have come at a better time. As the TTC scaling law was taking shape, there was growing concern that we were running out of the data needed to continue to push the frontiers of pre-training...

In December of 2024, Ilya Sutskever took the stage to deliver his keynote at NeurIPS 2024. His 20min presentation gave a broad overview of the past decade of AI research, and also outlined where he saw the field going in the future. However, there was one soundbite that sent shockwaves around the industry. Early in his talk, Ilya declared, “Pre-training as we know it will unquestionably end.”

With this declaration, Ilya argued that we had quickly exhausted the internet-scraped data we had been using as the 'fuel' of pre-training. “We have but one internet,” he stated. And our data-hungry models had consumed every last token available.¹

2025: A Newfound Appreciation for RL & The DeepSeek Moment

Unless you’ve been firmly nestled under a rock for the past several months, you’ve likely heard of a Chinese AI company called DeepSeek in the news. With the release of their R1 model, DeepSeek proved the viability of a novel approach to training better models and sparked great excitement to explore model imrovement via RL.²

*The* *DeepSeek-R1 paper, which, among many things, brought a newfound appreciation for RL-based improvement of LLMs.*

Most of us might have heard about RL in the context of AlphaGo – the AI model that mastered the famously complex game of Go, ultimately beating the world’s best human players. Initially trained on a database of games consisting of 30 million human moves, AlphaGo was then made even more performant by using self-play RL.³ It was allowed to simulate thousands of games, improving itself by being rewarded (i.e. “reinforced”) when it made moves that led to success.

Now, RL is not exactly anything new in LLMs. Reinforcement learning from human feedback (RLHF) is used extensively by leading companies like Anthropic and OpenAI. The novelty of DeepSeek was that their R1-Zero model showed you could use RL with extremely limited human intervention and end up with a performant reasoning model.

With the DeepSeek moment, we might now have three overlapping ways to make models better. We can scale pre-training, we can scale TTC, and now we can scale RL in fine-tuning to make our models even better. However, this third approach, RL-based fine-tuning, could be more than just another knob to turn as it unlocks a powerful self-improving feedback loop.

DeepSeek’s innovation lies in its ability to use models to generate their own reasoning traces, refine them with lightweight RL, and then loop those improved outputs back into training. The upgraded model then produces even better traces, which are further refined, and so on. Each turn of the cycle strengthens the model’s reasoning ability across domains. This recursive refinement process – where synthetic data continuously improves the model that generated it – breaks the traditional dependency on fresh human data, pushing performance forward.

*A rough timeline highlighting the key moments in developing new LLM scaling approaches.*

With these categories outlined, we’ll next dive into how DeepSeek-R1-Zero and R1 were trained, exploring their innovation to give color to why we’ve experienced this newfound appreciation for RL. Next we’ll transition to the world of DeAI and theorize how decentralized networks might be able to handle this process, before ending with some final thoughts on the future of this space.

The DeepSeek Stack

The recent string of DeepSeek model releases brought many advancements in the world of LLMs, but likely the most exciting was how they utilized reinforcement learning to create DeepSeek-R1-Zero. We’ll use DeepSeek’s R1 paper to dig into how RL can be used to train models, but before we dive in, it’s important to isolate three distinct DeepSeek models relevant to this section:

DeepSeek-V3: V3 is a 671B parameter sparse Mixture-of-Experts (MoE) model released in December of 2024. Unlike in dense models, MoE models have subsets of model parameters (experts) that activate when processing different types of inputs. This was the model that later made markets freak out about its low training costs.
DeepSeek-R1-Zero: R1-Zero is a reasoning model that DeepSeek trained using V3 as the base model. Importantly, they fine-tuned it using RL without SFT or any human data (a concept explained in further detail later). It is performant but not practical for everyday use because it had issues with generating human-readable outputs and would often mix its output languages. Still, it’s valuable in showing how you can generate technically performant reasoning models through RL with hard-coded verifiers.
DeepSeek-R1: R1 is a “cleaned up” version of R1-Zero. It followed a similar training process to R1-Zero but made use of limited SFT to polish its outputs and make it more fit for everyday use.

*Illustration showing the relationships between V3, R1, and R1-Zero.*

With that, let’s talk about how the DeepSeek team used RL to create R1-Zero before understanding how it might translate to a decentralized setting.

The R1-Zero Process

In RL, a common post-training set-up would be as follows:

Supervised Fine-tuning (SFT) - SFT involves training the model on a carefully curated dataset of high-quality input-output pairs, where the outputs demonstrate desired behaviors such as step-by-step reasoning or following specific instructions. These are things like robust answers to questions, instruction sets and rule following, and/or prompts and chain-of-thought examples. The idea is that through providing a collection of extremely high-quality data to the model, it can best learn to mimic this type of behavior.
Reinforcement Learning with Human Feedback (RLHF) - RLHF usually follows a small amount of SFT. As SFT requires high quality human data, RLHF is able to complement this process by using human preferences to train a reward model, which in turns creates a framework for the model to train itself from its own responses.

But DeepSeek-R1-Zero deviated from this process in several key ways.

Dropping SFT

Instead of taking V3 and running through the two-step process of first using SFT and then RL, DeepSeek’s research team dropped the SFT process altogether. Essentially, DeepSeek took V3 and, with limited guardrails, gave it as much time and compute as it needed to learn how to reason through problems.

Removing the SFT step had several interesting benefits, but also some downsides:

Upside

Reduced the computational needs of training through taking out an entire step of the training process.
Allowed the model a wider window of exploration during RL given it had not been previously influenced by human-based fine-tuning data.⁴

Downside

R1-Zero exhibited poor readability and would often mix languages within answers. It had great reasoning capabilities but essentially wasn’t fit for interfacing with humans. That’s why human-centric data was reintroduced in the training of R1.

GRPO Instead of PPO

Another major difference in how DeepSeek was trained was the use of group relative policy optimization (GRPO) as their RL framework instead of the more common proximal policy optimization (PPO). Here again, this resulted in a much simpler, less computationally intensive approach to RL. We’ll walk through the basic differences between GRPO and PPO, but a full technical discussion is beyond the scope of this paper.⁵ Still, briefly:

Proximal Policy Optimization (PPO)

In RL with PPO, you have three components:

Policy Model - The ‘policy’ is the core model, the model that you are ultimately trying to train.
Reward Model - The reward model is a model that has been trained on human preferences to evaluate the policy model’s output. In practice, humans rate a small portion of an LLM’s outputs, then those ratings are used to train a reward model to reflect the preferences of the humans. The reward model evaluates the policy model so that the policy model can learn to optimize for better responses.
Value Model - The value model (or 'critic') is a neural network that estimates the expected sum of future rewards for a given state, helping guide the policy model by providing value estimates for partial completions.

Let’s use a metaphor to illustrate how these components work together. Imagine you're writing an essay. The value model is like having a tutor looking over your shoulder who can predict your final grade based on what you've written so far. This is useful because you don't want to wait until the entire essay is done to know if you're on the right track. Think about it going like this:

You (the policy model) start writing, "The impact of climate change..."

A tutor looking over your shoulder (the value model), "Good opening, probably heading toward a B+ based on similar essays"

You add, "...is devastating to polar regions..."

Tutor, "Even better now, looking more like an A- trajectory"

You continue, "...because ice cream sales decrease."

Tutor, "Oh no, prediction dropping to a D, that connection doesn't make sense"

[Final essay completed and submitted to your teacher]

Teacher (Reward model), "D+, the essay started strong but made illogical connections."

This example illustrates how the policy, value, and reward models work together to analyze and improve the behavior of an LLM.

Now, state more clearly, the process runs as follows:

The policy model is given a prompt and starts to reason out an answer.
The value model evaluates the current state at each step and predicts the expected future reward, helping guide the policy's decisions as it generates responses.
The reward model evaluates the full response, assigning a score to the final product so that the policy can learn to make better outputs.
For a given response, the predicted score from the value model and the actual score from the reward model are compared (there is much more robust math being used here, but again, not in the scope of this article). This information is then used to improve the policy model.

*A simplified diagram explaining the PPO process.*

There's an important takeaway to have here. The use of a value model in addition to a reward model in PPO was generally thought to be critical in RL because researchers thought you needed to be able to evaluate intermediate model reasoning to train the best models. As the core function of an LLM is choosing the best next token (word) in sequence, it would make sense to have a model that is able to understand how each piece of a response affected the final outcome. For example, the sentence “the cat ran,” involves three decisions (the, cat, and ran). If the reward model was to score this sentence highly, the value model would allow us to understand which specific words were optimal and if any in the three were suboptimal. Maybe “the” and “cat” were great, but opting for “sat” would have gotten the full response an even higher score. It allows the feedback during training to be much more granular. Seems logical, right? It is, but DeepSeek’s use of GRPO showed that this might not be the case.

Group Relative Policy Optimization (GRPO)

GRPO is a different approach to RL post-training. A core difference from PPO is that GRPO drops the value model completely. Instead, you have just two primary components, 1) the policy model and 2) the reward model.

However, furthering DeepSeek’s commitment to simplifying the RL process, their reward model is also not a neural network trained on human preferences. Instead, their reward model is an extremely simple reward framework that focused on verifiable rewards (i.e. 1 or 0 if something was right or wrong). With later iterations of this process, they added additional checks for consistent language use and proper formatting for human-readable reasoning text.

The flow of GRPO is as such:

For a given single prompt, the policy model generates multiple outputs.
The reward model scores all responses.
GRPO calculates a normalized average score for the group of outputs, and evaluates each individual response based on its score compared to the average.
The model uses the highest scoring complete outputs to learn which overall patterns of responses work better.

The below, taken from DeepSeek’s latest paper on their math-focused model, contrasts the PPO and GRPO approaches:

*A diagram from the* *DeepSeekMath paper* *illustrating the differences in PPO and the GRPO approach they implemented.*

The result of this GRPO process is quite profound. GRPO massively reduces memory and compute overhead by drastically simplifying the reward process and removing the critic model altogether, which is typically as large as the policy model itself and needs to be updated throughout the RL process. DeepSeek estimates their overhead reduction from this alone was around 50%!⁶

Now that we’ve walked through SFT and the differences between PPO and GRPO, we can more clearly see just how simple DeepSeek’s R1-Zero training process really was. They started with a performant MoE base model (DeepSeek-V3), implemented a lightweight, hard-coded GRPO framework, and essentially let the model learn through trial and error.

The graphs below show the result: over time, R1-Zero learned to think longer and generate more accurate answers. This wasn’t driven by human labels or curated datasets, but by a closed-loop process: generate reasoning traces, evaluate them, reinforce the best ones, and repeat. That feedback cycle pushed the model forward without the need for new external data, skirting the issues Ilya highlighted in gathering source data for pre-training.

*Graphs from the DeepSeek-R1 paper that show the model learning to think longer (left) and became more accurate (right) as training progressed.*

This approach, while stripped-down, produced a highly capable reasoning model. More importantly, it points to a new path for scaling: models that improve themselves by learning from their own outputs, from synthetic data they generate on their own. That is a key takeaway to understand; it's unlocking a whole new paradigm for model improvement.

*A very simple diagram illustrating the virtuous cycle of model improvement unlocked by GRPO-style RL.*

While this result should in no way be taken lightly, it also must be noted that R1-Zero is not a model that is fit for everyday use. It often mixes languages in its outputs, making it unreadable for humans. To address these issues, DeepSeek followed a slightly more elaborate process to fine-tune R1, their more accessible reasoning model.

The R1 Process

For R1, instead of doing straight GRPO RL on V3, DeepSeek broke the fine-tuning into four stages:

Step 1: Cold Start SFT

To ensure that they would end up with a human-readable model, DeepSeek used a “cold start” SFT process. They basically gave the model a set of data to help set the direction for how they wanted it to ultimately reason. While the full nature of this data was not made available, DeepSeek researchers reference that “thousands” of cold-start data were collected in the form of “few-shot prompting with a long CoT” and “readable DeepSeek-R1-Zero outputs,” but also incorporated some “post-processing by human annotators.” At the very least, here we know that human intervention was needed.

Step 2 - RL using GRPO

This is the same GRPO RL step used to train R1-Zero.

Step 3 - Rejection Sampling SFT

Rejection sampling in this context refers to a selection process where the model's outputs are filtered and ranked according to a reward model's criteria, with only the highest-scoring samples being used for subsequent fine-tuning. For DeepSeek, this was done in two rounds with a collection of 800,000 data points. These data were a combination of about 600,000 reasoning-related samples and 200,000 non-reasoning data samples like writing or self-cognition. The goal here was to expose the model to a set of curated data to ensure that it generalized well across different modalities and that it was not overfitting to just one domain, like math or coding for example.

Step 4 - RL Part 2

Another round of RL is done here, with a focus on prompting and learning to make the model more human-aligned. Specifically, DeepSeek’s goal was to increase the "helpfulness and harmlessness” of the model. DeepSeek reports that they used multiple reward models to encourage the holistic set of human-aligned behaviors they were looking for.

R1-Zero v R1

If you take all of that together and contrast it with the R1-Zero approach, you get a process that looks something like this:

*A diagram contrasting how DeepSeek used V3 as their starting model then employed distinct fine-tuning approaches to arrive at R1-Zero (left) and R1 (right).*

As we transition to thinking about what this might mean for the world of decentralized AI, I want to highlight a few key takeaways from DeepSeek:

Brutally simple RL can elicit complex and performant reasoning behavior in standard LLMs.
This RL process relies heavily on inference-time compute to generate reasoning traces.
This RL process benefits from generating many reasoning traces in parallel for a given prompt.
This style of RL is heavily reliant on the ability to reliably and robustly verify outputs to shape the model's behavior.

We'll explore these implications and why I think they bode well for decentralization in the next section.

Architecting a Decentralized RL Network

DeepSeek did not just show us the value of pure RL with GRPO, it also made apparent the need for vast amounts of reasoning data for and environments in which to generate this data. Indeed, this sentiment was reinforced by two titans of AI. First with an Andrej Karpathy tweet made in the wake of R1’s release:

For friends of open source: imo the highest leverage thing you can do is help construct a high diversity of RL environments that help elicit LLM cognitive strategies. To build a gym of sorts. This is a highly parallelizable task, which favors a large community of collaborators.
— Andrej Karpathy (@karpathy) January 29, 2025

And second, with Andrej’s point being reinforced by Yann LeCun:

I've been arguing for something like this for over a year: crowdsourced distributed fine-tuning. https://t.co/CAQDg3xqyI
— Yann LeCun (@ylecun) January 29, 2025

Interest from these two is certainly exciting for the world of decentralized AI, but I’ve seen limited efforts made to actually walk through how these ideas might work in practice. Let’s talk about what a truly decentralized approach to DeepSeek's RL approach might look like.

Components of Decentralized RL

In designing a decentralized approach to RL, we’re trying to come up with a network that is able to handle the requisite parts of generating a base model, collecting data for SFT, and then performing post-training in order to create state-of-the-art reasoning models. This gives us three main components to focus on, with admittedly cutesy names employed to, at the very least, make this next section easier to follow:

A) “The Foundation” - Base models + decentralized network to train them
B) “The Gym” - Environments to generate diverse, high quality reasoning data + decentralized network to coordinate contribution
C) “The Refinery” - Decentralized network to perform fine-tuning

Recall our diagram explaining how DeepSeek-R1 was ultimately trained. My goal is to try to isolate each component with a specific eye towards discussing the challenges and potential upside to decentralizing each. The basic components will look like this:

*This diagram aims to highlight the different components of our network that will be employed to tackle different pieces of the R1 training process.*

A) The Foundation: Pre-Training Base Models

An important note about DeepSeek's process for generating R1 is that they needed to start with a highly performant base model (V3) in order for their elegant RL process to work. It was only by starting with an extremely capable 673B MoE model that they were able to derive the benefits of GRPO's simplicity. You wouldn't get the same results starting with a distilled version of V3 or a worse model.⁹ And so, while DeepSeek is bringing greater attention to the viability of scaling via more bare-bones RL, it must not obscure the fact that it's still critically important to be able to pre-train better and better models. Just listen to this conversation between the Anthropic team where Dario reminisces on their need to scale models to a sufficient size because the older, smaller models, "weren't smart enough to do RLHF on top of."

Now, as I’ve previously written about decentralized training at length, I’m not going to spend too much time in this section. If you want to read about the challenges and technical considerations in decentralizing pre-training, I would suggest you read this piece.

What I will do is reiterate the fact that being able to pre-train a state-of-the-art base model in a decentralized way is the hardest part of this entire equation by far. The general communication overhead in pre-training is immense and the tricks or shortcuts you can take to get away with compute or memory-constrained collaborators are in short supply.

The easiest path to take to achieve decentralized RL would be to start with a base model trained in a centralized way and introduce decentralization only at fine-tuning. You could take any open-source model (DeepSeek-V3, the latest LLaMa or Qwen models, etc.) and then introduce decentralization only in post-training. This would work great and it would make our lives much, much easier. But it would also defeat the purpose of trying to create an end-to-end trustless process that can produce frontier models.

This might read as more of a philosophical point, but I think decentralizing RL would be somewhat pointless if we’re still reliant on the benevolence of centralized actors to provide us with the base models. As such, I argue for the need to create decentralized pre-training networks.

B) The Gym: Generating Reasoning Data

Fine-tuning R1 took a lot of data. They needed cold start data to begin the fine-tuning, and then over 800k data points at the intermediate stage to steer the model towards better generalizability. The question must now be asked, can we decentralize the generation of this data? Absolutely. And in fact, a distributed environment is quite well suited to this type of task.

Environments and Traces

Recalling our Karpathy tweet, making this process more open and distributed is ideal because the goal is massive scale of data. To achieve this, we’d want to build a framework to allow for anyone to contribute to a massive library of different reasoning samples (called “traces”) for a diverse set of tasks. Contributors should be able to not only submit traces, but also to create new environments to generate different types of data in a standardized way. That is, we’d want standardized environments for generating traces across math reasoning, physics, medicine, engineering, writing, and more. A robust spectrum of environments to generate and collect these traces would lead to a massive database for anyone to tap into for fine-tuning.

This approach is not necessarily novel, but has gained a new sense of importance now that DeepSeek has shown the efficacy of their approach. Back in the early days of OpenAI, the company released a platform called OpenAI Gym which provided an environment for developers to test out different RL algorithms to complete basic tasks. Similarly, SWE-Gym is a popular environment for testing agent software engineering capabilities, CARLA for self-driving vehicles, and Pybullet for physics simulations.

Verifiers

Of course, there would also need to be reliable ways to evaluate the correctness of this reasoning data. In DeepSeek, when unable to verify the output programmatically (for something like a math problem), they used LLM-based evaluation by feeding the samples into DeepSeek-V3 and having it judge them (e.g. to evaluate quality of writing samples). For our gym, it’s not just good to have the environments, but we also need to have verifiers for many, many different types of data – what good is reasoning data if you cannot reliably and consistently verify correct answers? The necessity for robust verification to scale RL is so fundamental, that Rich Sutton, a forefather of AI/ML and author of the “Bitter Lesson,” even wrote about this concept in 2001:

*One of Rich Sutton's lesser known posts, "Verification, The Key to AI," that is highly relevant to today's newly forming RL paradigm.*

And more recently, Sasha Rush, a professor at Cornell and new Cursor researcher, had this reaction to the DeepSeek paper which makes clear the need for robust verification:

*Sasha Rush had this takeaway when* *discussing the implications* *of the DeepSeek RL approach.*

Taking all of the above considerations together, here is an example of what our reasoning data could look like:

*An example of full reasoning data from the open-source project* *General Reasoning*. We have the question "How many milligrams are in 2 kilograms?" and the correct answer, the model's entire reasoning trace, and verification comparing the trace to the given correct answer (in this case verification is provided by a two-part LLM eval).

‍Reward-Shaping Complexity

To add nuance to this need for developing robust verifiers, innovation will need to happen beyond what DeepSeek implemented with R1 and R1-Zero. Their GRPO setup worked extremely well because many of the problems had simple binary verification (e.g. 1 or 0 for a correct answer on a math equation). But how do we account for more nuanced, complex scenarios? How do we handle rewards for requests that are cross-domain? How can we assign partial credit on tasks like coding where we’d want to reward proper syntax even when the output wasn’t perfect? What if the domain itself is ambiguous and we don’t have a reward policy neatly suited to it? How well does model proficiency in more objective domains like math and coding generalize to subjective domains like writing and language? Moving forward, much innovation will certainly happen as further exploration is made into designing the best possible reasoning environments.⁹ I believe the collaboration and open-experimentation inherent to decentralized networks will be key to driving progress here.

Current Innovation

Prime Intellect is a team from the world of decentralized AI making significant progress in this domain. One of their latest endeavors is SYNTHETIC-1, a collaboratively generated library of over 2 million reasoning traces. With these traces they plan to replicate DeepSeek’s SFT and RL process in order to fully recreate the R1 training procedure in an open-source way. One of the core pillars of SYNTHETIC-1 is GENESYS – an open-source library to standardize the generation and verification of synthetic data for RL. Their library allows anyone to create their own data and then run a standardized verification of their data via binary evaluation, LLM judges, or containerized code execution environments. Moving forward, the Prime Intellect team plans to allow individual contributors to add their own environments and verifiers in order to massively scale their library of reasoning traces.

Releasing SYNTHETIC-1: The largest open dataset of 2M reasoning traces from DeepSeek-R1, created by compute contributors across the globe:

- SYNTHETIC-1: Verified math, coding and science reasoning traces
- SYNTHETIC-1-SFT-7B: Fine-tuned on 800k sampleshttps://t.co/OE1lm4v7OY
— Prime Intellect (@PrimeIntellect) February 21, 2025

Started by ex-Meta RL engineer Ross Taylor, General Reasoning is building an open-source, collaborative platform for generating and verifying reasoning data across a wide variety of domains. At present, they have over 2.5m questions and 600k reasoning data across mathematics, medicine, biology, language and more. The platform allows for automated binary verification (1 or 0) but also has areas where humans can score more subjective domains. This is an extremely exciting project and most comprehensively encompasses the full stack of resources needed to generate a wide assortment of high quality environments and verified data for accessible, open RL. I will also note that General Reasoning is explicitly not a decentralized project (and definitely not a crypto company). Contributions are expected to be made on a voluntary basis and not incentivized basis.

*General Reasoning's open platform with diverse reasoning data.*

While not implemented yet, in both of these endeavors, blockchain could serve two purposes. First, blockchains could be used to incentivize and record contribution of reasoning traces. Individuals could earn tokens for the volume of data contributed or based on the downstream usage of their data in post-training other models. Second, it would be highly important to use cryptography to prove the verification of tasks. That is, it is not just enough to have data with accuracy measures attached, but a way to prove that the verification was done correctly. The Prime Intellect team has their own framework, TOPLOC, which uses a very lightweight cryptography scheme for verification of inference, but some concerns have been highlighted in regard to its robustness. Other teams like EZKL are working on verifiable inference via zero-knowledge cryptography, Gensyn through a delegated decentralized verification system called Verde, and, with an eye towards the not so distant future, Ambient will soon be releasing a new low overhead, high security approach

Takeaway

If you are the type of person who views decentralized AI through a skeptical lens, first I must thank you. We need more skeptics in this space. In fact, I think it is totally reasonable to be skeptical of the viability of model-parallel pre-training or zkML verification, but through my writing I’d aim to convince you otherwise.

Still, if you are one of these people, I would isolate this section, “The Gym,” as the part of the entire RL stack that most clearly and neatly benefits from decentralization. Here, we are designing a marketplace of contribution and experimentation rather than a coordinated training network. Decentralization does not introduce the same performance challenges like you encounter in pre-training or the fine-tuning process.

Further, as Karpathy says, the task of creating many different, verified environments to generate RL strategies is “highly parallelizable.” We want many participants contributing to this effort in parallel to generate the best possible strategies. I believe a platform with open participation, incentives for quality contribution, and cryptographic verification will most efficiently achieve the scale we need for global open RL. Open platforms with voluntary contribution and up/down voting are great, but I have hesitation that they will be able to achieve truly frontier scale without the verifiable, incentivized contribution that blockchains are uniquely well-equipped to facilitate.

C) The Refinery: Distributed RL

With sufficient data to handle SFT, the final step to training a fully formed reasoning model is to be able to actually perform the RL. Would it be possible to decentralize the RL fine-tuning process? At a high level, decentralized GRPO-based RL should be much easier to achieve than decentralized pre-training.

One of the most overlooked unlocks in GRPO-style RL is just how central inference becomes.

"[Inference-time compute] is about to up by a billion times." -NVIDIA CEO Jensen Huang. While said in the context of scaling reasoning, the point still holds in the new RL Renaissance. Since training involves evaluating multiple completions per prompt and reinforcing only the best, we might not want to rely on centralized training runs. Here, we can push performance by massively scaling inference, generating millions of reasoning traces, then filtering them down through verification or concordance scoring. In this setup, massively parallel inference isn’t just how we use models, it becomes how we train them. That shift opens the door to distributed systems where high-volume and efficient inference across many, many nodes drives ongoing model improvement.

With that framing, let’s now look at some considerations in decentralizing RL before isolating some concrete approaches:

‍Communication Volume

In the pre-training scenario, the amount of information that must be calculated and communicated over the course of training is significantly higher compared to in fine-tuning. For pre-training, on a token-by-token basis you need to calculate scores for every possible next token and calculate the gradient. For example, a 1,000 token sequence with a 50,000 token vocabulary would need to calculate 50 million values to compute and then backprop. In RL, you’re more simply calculating the advantage score for a set of full string responses – not needing to score at every token step. This makes for a much less memory intensive process.

GRPO’s Efficiency

With DeepSeek showing the viability of GRPO, we have an RL approach that is much better suited to decentralization than PPO. Not only did we see that GRPO massively reduces the overall amount of compute power needed in RL, but recall that DeepSeek also dropped the critic model and made use of a very lightweight reward system. This makes for an RL process that needs much less coordination in the decentralization process. The lack of a critic model means we don’t need to have a decentralized network updating both the policy and the critic during a run. And a lightweight reward model means we also need to spend less computational resources training that model as well.

Quantization

Quantization is a process used to reduce the size of a model for easier deployment. Given that this section is slightly more technical and meatier than the preceding ones, we'll break it into three subsections to help explain.

Overview: Quantization works by representing model weights and activations with lower precision data types like 8-bit integer or 16-bit floating point instead of 32-bit floating point.

To leverage a metaphor to explain quantization, if you imagine models as paintings, a full precision model would be like a painting made with an artist’s full array of paints, every shade and hue. A quantized model would be like trying to make the same painting with a much more restricted set of colors. Say, just black and white. You can still end up with something that clearly represents the original, but the end result has lower fidelity and loses some of its finer details.

*A simple image illustrating the effects of quantization. 10*

This metaphor points to a tradeoff in quantization. While it can allow you to get a more lightweight model, you also end up with a version that is potentially less accurate. If the model has less information in each parameter, the mathematical calculations it performs are naturally going to be less accurate.

Current state of innovation: Quantization is widely used in inference, generally thought to be unsuited to the pre-training context, and underexplored in RL. However, a collaborative study from Harvard and Google DeepMind researchers was able to show effective use of 8-bit quantization in PPO-based RL to achieve significant speed ups in training time.¹⁰The basic set up was to have the quantized “actor” models generating outputs and a full precision “learner” model making updates. With their setup, they reported speed ups of between 1.5-2.5x over full-precision training.

*A diagram illustrating the Learner, Quantizer, Actor setup in QuaRL.*

Beyond this, the DeepSeek actually trained much of V3 in FP8, showing that full precision is not necessary for all pre-training operations. How they pulled this off could be its own entire article, but essentially, the DeepSeek isolated components of pre-training where FP32 or BF16 were critical, and others where the accuracy decrease of FP8 was fine.¹¹

While there is exciting research happening to better incorporate quantization into the full AI/ML stack, current hardware limitations still present a barrier to progress. At present, only 4000 series and newer NVIDIA cards support FP8 quantization natively. This means that only the more high-end consumer cards can take advantage of quantization. Still, with time and a greater proliferation of quantization support in consumer cards, we can expect quantization to be more routinely leveraged.

Takeway: While more research is needed in this area, early signs of progress bode well for decentralization. Why? Spreading compute over a diverse, heterogeneous network of compute often means not every participant in the compute network is going to have clusters of multiple GPUs or even state-of-the-art individual GPUs. Here, memory constraints come into play and those with limited hardware might be excluded from network participation. With the ability to quantize however, we can achieve faster performance while also getting away with shrinking models down to smaller sizes, better facilitating the participation of individual actors with memory-constrained hardware.

‍Distributed Communication Techniques

With the more lightweight nature of RL compared to pre-training, decentralizing the fine-tuning process should be quite possible.

At a very high level, in a decentralized RL training network, you could have very lightweight “inference nodes” and then more robust “worker” nodes collaborating. Inference nodes could be individual participants that download small, quantized models locally or even pieces of the model if implementing a model-parallel approach. These nodes could run inference and calculate rewards, then send results back to trainer models at infrequent intervals, who would then do more computationally intensive gradient updates. Much of the work would be in isolating how and when to coordinate policy updates when handling rollouts across a massive network of parallel workers.

To facilitate this, an efficient routing scheme would be essential to route requests to the inference nodes across the globe. One existing approach to this is Ryabinin et. al's SWARM parallel framework, which, in the pre-training context, is able to take into account geographic distance and a specific node’s compute efficiency when serving training work to geographically dispersed GPUs.¹²

Again, the key would be designing an extremely efficient routing algorithm that can be sure to not overload specific workers, adjust to even out worker completion times, handle fault tolerance, and of course a synchronization algorithm that massively reduces the frequency of advantage and gradient synchronizations. While this is by no means an easy challenge, it presents as much more easily solvable than pre-training.

Below are three approaches tailored to the fine-tuning setting:

PETALS by Borzunov, Baranchuk, Dettmers, Ryabinin, Belkada, et. al¹³

PETALS presents an interesting approach to democratizing access to large language models through collaborative inference and fine-tuning. The system was developed to address a key challenge in the LLM space: while there is a suite of highly performant open-source models available for download, quite often the memory for inference (and significantly more for fine-tuning), put them out of reach for most researchers and practitioners.

PETALS enables collaborative use of large models by distributing computation across multiple participants. In this system, there are two main actors: servers and clients. Each server stores a subset of a model's layers (typically consecutive transformer blocks) and handles requests from clients.

*Diagram from PETALS showing a model being split across servers.*

Clients can call chains of pipeline-parallel servers to run inference across the entire model, with each server holding only as many blocks as its available GPU memory allows.

*Diagram showing requests coming from clients being routed through a chain of servers.*

The system's architecture is particularly clever in how it handles both inference and training. During inference, clients store only the model's token embeddings locally (which comprise a small fraction of the total parameters) and rely on servers to process the transformer blocks. When a client initiates an inference session, it first establishes a chain of servers that collectively hold all model layers. The client then uses its local embedding layer to process input tokens, sends the resulting vectors through the server chain, and receives the final output representations to compute next token probabilities.

A key innovation in PETALS is its approach to fine-tuning. Rather than requiring full model materialization, PETALS enables distributed parameter-efficient training where clients "own" their trained parameters while servers host the original pre-trained layers. Servers can perform backpropagation through their layers and return gradients with respect to activations, but they do not update the server-side parameters. This allows multiple clients to simultaneously run different training tasks on the same set of servers without interference.

For efficiency, PETALS incorporates several optimizations. It uses dynamic blockwise quantization to compress communication buffers between pipeline stages to 8-bit, reducing bandwidth requirements without noticeably affecting generation quality. The system also employs sophisticated routing algorithms to help clients find optimal server chains, taking into account factors like network latency and server load.¹⁴

In practice, PETALS achieved impressive performance for interactive use - running inference of a 176B model on consumer GPUs at approximately 1 step (forward pass) per second. This makes it practical for many interactive applications while maintaining the flexibility researchers need to access model internals and experiment with fine-tuning approaches.

DiPaCo by Douillard, Feng, Rusu, et. al¹⁵

Another promising approach specifically relevant to MoE models is Distributed Path Composition (DiPaCo) from researchers at Google DeepMind. DiPaCo introduces a novel way to distribute and fine-tune MoE models that could be particularly valuable for decentralized networks. Traditional MoE training requires each node to store the entire model in memory - a significant barrier for decentralized networks where participants may have limited resources. DiPaCo takes a different approach by breaking the model into "paths." Each path represents a carefully constructed route through the network that includes a subset of expert modules from each MoE layer, along with the corresponding routing components and necessary layer normalization components.

The key innovation of DiPaCo lies in how it handles training and inference. During training, data is pre-sharded and distributed by path, meaning each worker only needs to process data through its specific path configuration. This is enabled by making routing decisions at the document level rather than per token, allowing batching computation across all tokens of a sequence without needing to swap modules in and out. Each path is designed to be small enough (approximately 150M parameters) to fit on modest GPU hardware, making it feasible for broader participation in a decentralized network.

*A diagram from DiPaCo showing shards of data being routed through the relevant paths hosted on geographically dispersed GPUs.*

In DeepMind's experiments, DiPaCo demonstrated remarkable efficiency - a network of 256 paths of size 150M parameters was able to match the performance of a dense 1.3B parameter model while requiring 45% less wall-clock training time. On the other hand however, this approach proved to be extremely FLOP inefficient as presented; DiPaCo required significantly more compute to achieve similar perplexity scores to the same dense model.

Still, DiPaCo has interesting implications for the decentralized implementation. In DiPaCo, neither during training nor at evaluation time does the entire network need to be materialized in one place. The full model exists only as a virtual composition of paths across dispersed hardware, with each path capable of being served independently. Further, DiPaCo's architecture naturally supports heterogeneous hardware (a mixture of A100s and TPUs across the USA, Japan, and the UK were used in the experiment), allows for elastic resource utilization, and provides built-in fault tolerance through path redundancy. The underlying principles of distributing computation by path could be valuable for decentralized networks, where the ability to participate with limited hardware resources and minimal communication overhead is crucial.

RL Swarm by The Gensyn AI Team¹⁶

Built by researchers at Gensyn, a leading decentralized AI company, RL Swarm is a collaborative approach to distributed reinforcement learning that directly builds on top of DeepSeek's GRPO process for R1, now live on Gensyn's testnet. While we’ve highlighted how DeepSeek showed that models could self-improve through reinforcement learning without SFT or a critic model, RL Swarm takes this concept further by enabling multiple policy models to learn collaboratively in a distributed environment.

The key innovation of RL Swarm lies in its peer-to-peer learning structure where models not only self-assess but also evaluate and learn from each other's reasoning processes. This takes the RL dynamic from a solitary endeavor to a collaborative one, where models benefit from the exploration and insights of their peers.

Gensyn’s experimental setup for RL Swarm leveraged smaller Qwen-2.5b-1.5B models and learned on a mathematics reasoning dataset (GMS8K). They follow a three-step process which, as the Gensyn team highlights, mirrors a collaborative study group:

Answering Stage: Multiple policy models are loaded into separate hardware and then the models independently generate multiple responses to a given prompt (typically eight answers per question), calculate rewards, determine advantages, computes loss, and perform gradient updates following the GRPO methodology. After this individual work, each model shares its best answers with other models in the swarm.
Critiquing Stage: Models examine the answers provided by their peers and offer structured feedback. This creates a dynamic where models are incentivized to both produce high-quality answers and develop skills in evaluating others' responses.
Resolving Stage: Each model votes on what it believes the majority will consider the best answer for each question. Then, based on this collective evaluation, models produce their final revised answers to the original prompts.

An image illustrating the three-step proces of RL Swarm.17

The RL Swarm approach demonstrated several improvements over a solo-trained model for comparison. First, experiments showed that models trained within RL Swarm generally obtained higher rewards than those trained solo (e.g. they consistently produced more optimal outputs). Second, the peer-review process consistently led to more human-readable outputs, as demonstrated by the swarm-trained models producing responses more human-readable outputs with deeper reasoning. Specifically, Swarm models produced longer, more structured responses with better formatting, including proper use of bullet points, spacing, and LaTeX for mathematical notation. This suggests that the collaborative critique process creates an emergent behavior where models optimize not just for correctness but also for clarity and comprehensibility.

*A graph from the RL Swarm paper showing the gap in reponse length between RL Swarm-trained and solo-trained models.*

Given the lightweight nature of the communication required between models and the elimination of complex critic networks, RL Swarm represents a promising approach for scaling distributed reinforcement learning while maintaining training efficiency. The peer learning framework is open-source and already live, leveraging Ryabinin et. al’s Hivemind library to handle cross-node communication. While early in its development, RL Swarm is quite an exciting release for this field – it’s the most concrete distributed RL framework we have today.

Future Areas of Exploration

In a recent appearance on the Dwarkesh Patel podcast, legendary Google engineers Jeff Dean and Noam Shazeer speculated about future approaches to building highly modular models. I thought some of their ideas were quite compelling for applications in decentralized training and fine-tuning. And because this field of decentralized training is so young, I wanted to incorporate some of this speculation into this report. I think it can serve as a useful guide for what type of network we might want to build towards.

At the tail end of their conversation, Dean and Shazeer discuss a future state of AI/ML development. Seemingly influenced from their work on Pathways, they imagine a world where a sparse MoE LLM could be split into modular subdivisions of experts, with each piece being able to be trained and improved individually. The pieces would then be able to be swapped in-and-out to a larger model to expand its capabilities.

While this is by no means feasible today, it imagines an exciting future where you could get away with splitting a model into its smaller expert pieces, use RL to make those expert blocks better at their one task, and then fit them back together into one larger model. This process would be highly parallelizable, as groups of people could be working on refining and updating modules at the same time all over the world. This would obviously translate incredibly well to decentralized RL at scale.

Gensyn has taken one step towards making this future a reality. In their recent paper, HDEE: Heterogeneous Domain Expert Ensemble, they showed that you can train small, heterogeneous and modular expert models in parallel and then connect them in an ensemble via a technique called ELMForest.¹⁸ Researchers showed that these ensembles, while less efficient at inference, outperformed models trained with less heterogeneity. Now, this is not Dean and Shazeer's dream brought to life – the resulting ensemble is not a single model, but separate networks that produce separate outputs which are combined into a uniform answer after inference. While a full deep dive on the differences and future direction is beyond the scope of this piece (though perhaps the focus of a future one…), this is quite an exciting development to follow, and begs the question if it could be merged with RL Swarm to create more performant domain experts. I'm incredibly excited to see how this research evolves over time.

‍Looking Ahead

While some of this work around decentralized RL may seem far-fetched, exciting experimentation is already happening now. Hugging Face is working on Open R1, a project to create a fully open-source version of R1, datasets, training procedures and all. Prime Intellect is already hard at work replicating DeepSeek-R1’s training in a semi-distributed way with their SYNTHETIC-1 run. They’ve already finished their distributed data collection and are moving into the training phase.

We started this paper talking about how DeepSeek brought attention to a new scaling approach in GRPO-based RL. But while there are seminal papers laying the foundation for specific, generally agreed upon scaling principles for both training and TTC, we don’t yet know the limitations of scaling RL. How much data and what type of data do we need to get the most efficient SFT? How massively can we scale up GRPO-based RL to push model performance to the limits? How performant does a base model have to be to derive the benefits of RL? We’re not yet sure of the answers to these questions, but it seems we are entering into a new phase of AI innovation that will put RL to the test in LLM-scaling. One thing I am confident in is that decentralized, crowd-incentivized networks will play a part in it.✦

‍

^{1. Now, I think it’s important to note that scaling via pre-training is not dead. Just look at the Stargate project or Grok’s Memphis supercluster of 100,000 H100s to see that there is still appetite for massive investment in compute infrastructure. Still, Ilya’s words did reinforce the idea that pre-training scaling laws were not going to be the cure-all approach to making better models.}

^{2. There was a lot more significance to the string of DeepSeek papers outside of their RL approach. The}^{reported ~$6m cost of training V3}^,^{their distillation process}^,^{the meta implications for the US / China AI arms race}^{– all of these are big topics in their own right, but are beyond the scope of this paper, which will primarily focus on RL.}

³^{. https://research.google/blog/alphago-mastering-the-ancient-game-of-go-with-machine-learning/}

^{4. This is another note that would be beyond the scope of the paper to discuss in full, but one interesting implication of DeepSeek R1-Zero not employing SFT data is that it could allow for a more exploratory or curiosity-based fine-tuning process. Their approach allowed R1-Zero to perform more '}^{open-ended learning}^{,' where the model could generate responses without being preemptively influenced by human preferences. An interesting way to further develop this could be to create incentives for outside of the box thinking, to}^{reward the model for more creative behavior}^.

^{5. But this is a great resource if interested -}^{https://yugeten.github.io/posts/2025/01/ppogrpo/}

^6.^{https://arxiv.org/pdf/2501.12948}

^7.^{https://www.youtube.com/watch?v=0eMzc-WnBfQ}

^{8. H/T to Nathan Lambert who wonderfully laid out these concepts of reward-shaping and more global RL implications in this presentation, '}^{An Unexpected RL Renaissance.}^'

^9. ^{https://stackoverflow.blog/2023/08/23/fitting-ai-models-in-your-pocket-with-quantization/}^‍

^10.^{https://openreview.net/forum?id=xwWsiFmUEs}

^11.^{https://arxiv.org/html/2412.19437v1}

^12.^{https://arxiv.org/pdf/2301.11913}

^13.^{https://arxiv.org/pdf/2209.01188}

^{14. This Hivemind approach was further developed in later work on SWARM, a collaborative approach to fine-tuning which was detailed}^here^.

^15.^{https://arxiv.org/abs/2403.10616}

^16.^{https://github.com/gensyn-ai/paper-rl-swarm/blob/main/latest.pdf}

^17.^{https://www.gensyn.ai/articles/rl-swarm}

^18.^{https://arxiv.org/abs/2502.19385}

^{Legal Disclosure: This document, and the information contained herein, has been provided to you by Hyperedge Technology LP and its affiliates (“Symbolic Capital”) solely for informational purposes. This document may not be reproduced or redistributed in whole or in part, in any format, without the express written approval of Symbolic Capital. Neither the information, nor any opinion contained in this document, constitutes an offer to buy or sell, or a solicitation of an offer to buy or sell, any advisory services, securities, futures, options or other financial instruments or to participate in any advisory services or trading strategy. Nothing contained in this document constitutes investment, legal or tax advice or is an endorsement of any of the digital assets or companies mentioned herein. You should make your own investigations and evaluations of the information herein. Any decisions based on information contained in this document are the sole responsibility of the reader. Certain statements in this document reflect Symbolic Capital’s views, estimates, opinions or predictions (which may be based on proprietary models and assumptions, including, in particular, Symbolic Capital’s views on the current and future market for certain digital assets), and there is no guarantee that these views, estimates, opinions or predictions are currently accurate or that they will be ultimately realized. To the extent these assumptions or models are not correct or circumstances change, the actual performance may vary substantially from, and be less than, the estimates included herein. None of Symbolic Capital nor any of its affiliates, shareholders, partners, members, directors, officers, management, employees or representatives makes any representation or warranty, express or implied, as to the accuracy or completeness of any of the information or any other information (whether communicated in written or oral form) transmitted or made available to you. Each of the aforementioned parties expressly disclaims any and all liability relating to or resulting from the use of this information. Certain information contained herein (including financial information) has been obtained from published and non-published sources. Such information has not been independently verified by Symbolic Capital and, Symbolic Capital, does not assume responsibility for the accuracy of such information. Affiliates of Symbolic Capital may have owned or may own investments in some of the digital assets and protocols discussed in this document. Except where otherwise indicated, the information in this document is based on matters as they exist as of the date of preparation and not as of any future date, and will not be updated or otherwise revised to reflect information that subsequently becomes available, or circumstances existing or changes occurring after the date hereof. This document provides links to other websites that we think might be of interest to you. Please note that when you click on one of these links, you may be moving to a provider’s website that is not associated with Symbolic Capital. These linked sites and their providers are not controlled by us, and we are not responsible for the contents or the proper operation of any linked site. The inclusion of any link does not imply our endorsement or our adoption of the statements therein. We encourage you to read the terms of use and privacy statements of these linked sites as their policies may differ from ours. The foregoing does not constitute a “research report” as defined by FINRA Rule 2241 or a “debt research report” as defined by FINRA Rule 2242 and was not prepared by Symbolic Capital Partners LLC. For all inquiries, please email info@symbolic.capital. © Copyright Hyperedge Capital LP 2024. All rights reserved.}

‍