Thanks to Alexander Long at Pluralis Research, Jason Morton of EZKL, and the Nous Research team for their thoughtful discussion and feedback on this piece.
Introduction
Frontier AI models are rapidly embedding themselves in all facets of our lives, promising to transform industries from healthcare to finance. However, these models are incredibly expensive to develop and deploy, with only a few large companies possessing the financial and computational power to support their training. Training frontier models currently requires enormous data centers housing tens of thousands of interconnected GPUs. Current projections estimate training costs for frontier models reaching $10 billion in the near future.1 This high cost of development has positioned large-scale model training as a force of centralization, consolidating power among the few companies with the infrastructure and capital to support such massive computational demands.
Decentralized training has emerged as a potential counter to this paradigm. Decentralized training is a new way to produce AI models by distributing workloads across a network of GPUs that are not physically co-located in a single data center. This approach promises to lower costs and make large-scale AI training accessible to more players. In decentralized training, multiple participants contribute computing resources, each handling parts of the training process, allowing for a shared cost structure and reduced dependency on monolithic data centers. However, as promising as this concept sounds, decentralized AI training faces three critical hurdles: technical feasibility, achieving trustless and private handling of data and model weights, and scaling networks to compete with centralized solutions. In this report, we attempt to address these challenges. First, we overview how centralized training currently works and why it has evolved to require such massive amounts of co-located compute. Next, we detail various approaches to decentralized training and the latest advancements in the field before weighing the potential upside and risks of decentralized training. We then highlight some of the companies at the forefront of this space, and conclude by outlining key areas of future research.
Centralized Training
To motivate the need for decentralized training, we first need to understand the current paradigm for training models in a centralized manner. Training frontier models like OpenAI’s GPT-4, Meta’s LLaMA, and Google’s Gemini involves a multi-phase, resource-intensive process that makes these projects extremely costly. For instance, the estimated training cost for GPT-4 is approximately $63 million, excluding researcher salaries.2 The following section outlines the training process step-by-step, with examples from these leading models.
Step 1: Collecting and Preparing Data
Training starts by gathering huge amounts of text data from a variety of sources, including books, articles, and web pages. This data is cleaned to ensure high-quality inputs by removing duplicates, irrelevant content, and errors. The size and diversity of these datasets are critical—GPT-4, for instance, used a significantly larger dataset than its predecessor GPT-3, and Meta’s LLaMA incorporated both publicly available and licensed data for broader language processing abilities.
Step 2: Building and Training the Model
Once the data is ready, the AI model itself is built using a “transformer” architecture, a type of design that helps the AI analyze relationships between words and predict what comes next in a sequence. This phase involves feeding the cleaned data into the model and teaching it to predict patterns. There are numerous ways for both the data and model to be processed and divided across GPUs in these clusters. A full explanation of centralized training methodologies is beyond the scope of this article. If you would like to learn more, we suggest this detailed piece.
The scale of this task is enormous. Google’s Gemini used advanced TPU v4 pods (a specialized type of hardware) to handle its training needs, and GPT-4 relied on Microsoft Azure’s AI-optimized supercomputers. These setups allow billions of computations to happen simultaneously, a requirement for models with billions or even trillions of parameters (think of parameters as the “neurons” of the AI brain). To date, centralized data centers are key because they allow GPUs to be physically connected via extremely high speed interconnects. For example, NVIDIA NVLink interconnects allow for up to 1,800 GB/s of GPU-to-GPU bandwidth. For contrast, the average internet connection runs from around 100 Mb/s up to 1 Gb/s for higher performance consumer internet.
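To make that gap concrete, here is a back-of-the-envelope comparison (our own illustrative numbers: a 10B-parameter model with gradients held in 16-bit precision; real synchronization traffic depends heavily on topology and compression):

```python
# Rough time to move one full set of gradients for a 10B-parameter model,
# assuming 2 bytes per parameter (bf16). Illustrative only.
params = 10e9
payload_gb = params * 2 / 1e9          # ~20 GB per synchronization

nvlink_gbps = 1800                     # NVLink-class interconnect, in GB/s
internet_gbps = 1 / 8                  # a 1 Gb/s consumer link is ~0.125 GB/s

print(f"NVLink:   {payload_gb / nvlink_gbps:.3f} s per sync")    # ~0.011 s
print(f"Internet: {payload_gb / internet_gbps:.0f} s per sync")  # ~160 s
```

The difference of four orders of magnitude per synchronization is exactly why co-location has been the default, and why reducing how often and how much you synchronize is the central problem for decentralized training.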
Step 3: Fine-Tuning with Human Feedback
After the model learns the basics during training, it undergoes a fine-tuning phase where humans play a role. In this step, humans review the AI’s outputs, providing feedback on what is correct or helpful and what isn’t. The first widely adopted version of this process was dubbed Reinforcement Learning from Human Feedback (RLHF), but newer techniques like Direct Preference Optimization (DPO) are gaining popularity for their simplicity.
OpenAI fine-tuned GPT-4 with input from over 50 domain experts, while Google integrated real-world user interactions into Gemini’s training, ensuring it was better suited to practical applications. This step makes the models more reliable and adaptable to complex tasks.
Step 4: Iterative Updates and Continuous Improvement
The work doesn’t stop after the model is deployed. AI developers constantly update these models based on real-world user feedback and ongoing research. This continuous improvement process demands robust infrastructure to manage uninterrupted cycles of training, testing, and deployment. Centralized data centers provide the necessary foundation, supporting the large-scale operations required to keep these models state-of-the-art.
Implications
This process as outlined above has become an incredibly expensive endeavor, with costs skyrocketing in recent years. According to a recent study by Epoch AI, the costs of hardware and energy for training leading models have grown by an average of 2.4 times per year since 2016. These high costs have concentrated AI development among a few tech giants, creating significant barriers for smaller competitors.
A key reason for these escalating costs is the centralized nature of training. As alluded to in the pre-training section, modern AI systems rely on clusters of GPUs that need constant, high-speed communication to keep every piece of hardware in sync. This tightly coupled setup ensures model parameters are updated in unison, but it also makes scaling harder. Adding more GPUs to a data center increases the burden on networking infrastructure, which inflates costs and makes it increasingly complex to build and operate these systems.
If we continue with this centralized approach, the only way forward will be to build ever-larger data centers, adding more compute power and investing heavily in advanced networking to avoid bottlenecks. But it’s unclear if this model of scaling will even hold. Recent comments from tech leaders like Marc Andreessen, Ben Horowitz, and Ilya Sutskever suggest that leading AI companies are no longer getting linear improvements from throwing more data and power at their training runs, and are instead shifting their focus to scaling compute at time of inference.3,4
But what if training didn’t need constant synchronization? Decentralized approaches that we will dive into further in this report are exploring how training tasks could be handled more independently, with nodes working asynchronously. This shift could allow geographically distributed systems to work together without the same bandwidth or latency requirements, easing the load on centralized data centers. As the next section explores, whether decentralized AI can live up to its promise will depend on addressing its three main challenges.
Decentralized Training
Decentralized training is a promising but extremely early-stage alternative approach to the pre-training of foundation models. In the following section, we’ll dive into the latest research in this field and cover the most compelling new approaches to decentralized training. Importantly, we’ll also spend some time distinguishing distributed training from decentralized training, a nuanced but critical distinction.
Distributed Training ≠ Decentralized Training
Before diving into the specifics of decentralized training, we want to distinguish between two terms that are often used interchangeably but should not be: distributed training and decentralized training.
In our view, distributed training is a more general term that refers to the process of training via hardware that is not physically co-located. At present, this most consistently looks like pre-training being run across several data centers, stocked with uniform hardware, that are located in the same or neighboring states. In fact, many of the leading centralized AI giants like Google and OpenAI are employing distributed training techniques right now.5
Decentralized training is different. Decentralized training is similar to distributed training in that the hardware being used for pre-training is not co-located, but it differs in that the hardware being used is heterogeneous and not trusted. Truly decentralized training is training that can be done by non-trusting parties. Anyone with hardware to lend to the training network should be able to do so, without needing to supply a mini cluster of H100s. Further, in the decentralized setting, anyone is able to enter and exit the network and process training data in a trustless and permissionless manner. Overall, decentralized training aligns with the ethos of web3, whereas distributed training merely reflects geographic diversity in hardware locations.
For the avoidance of doubt, no decentralized training network is live to date. A handful of companies building in this space have performed distributed training at scale, but none have launched fully decentralized training networks.
Considerations in Constructing a Decentralized Training Network
With our definition of decentralized training detailed, we’ll spend the next section outlining what we believe to be the most important considerations when evaluating decentralized training networks. Namely, we’ll highlight the various technical approaches to solving the challenges imposed by training in a decentralized manner, considerations in achieving cost competitiveness with centralized solutions, security considerations, and how crypto incentives can be used to bootstrap and maintain these networks.
Technical Approaches: DiLoCo, Open DiLoCo, DisTrO, DeMo, & SWARM Parallel
Recall that the leading paradigm in centralized training is that massive amounts of data need to be communicated across islands of hardware at every step of the pre-training process. This introduces significant bandwidth requirements, necessitating physically co-located GPUs linked via high speed interconnects. As such, the core technical challenge decentralized training must overcome is how to effectively pre-train models when working with poorly connected devices in low bandwidth, high latency environments. What follows is a selection of some of the most promising approaches to solving this issue.
DiLoCo6
While not the very first publication in distributed training, Google DeepMind’s 2023 paper “DiLoCo: Distributed Low-Communication Training of Language Models” is where we’ll start our survey of the decentralized training landscape, as it has had an outsized influence on current decentralized training techniques. In essence, DiLoCo is an optimization algorithm that allows for significantly less communication during pre-training (500 times less, to be exact) while still producing models that perform as well as models trained with full communication. So how does it work?
In the centralized setting, GPUs synchronize after every training step: each GPU shares its gradients with all the others, those gradients are averaged and applied so every copy of the model stays identical, and only then does the next step of pre-training occur. This process repeats, with a synchronization at every step of the way, until training is complete.
With DiLoCo, every GPU receives a copy of the model and a selection of the data used to train the model (this is what is referred to as “data parallelism”). As training is done on each GPU, updates need to be shared across devices for the model to optimize and take shape. DiLoCo divides these optimizations into inner and outer optimizations:
Inner Optimization: Each GPU has an inner optimizer that allows it to sample data from its shard and update its local version of the model parameters as it learns. To use a metaphor, imagine each GPU as an explorer in a foreign land. As it makes new discoveries, it updates its personal map, but it doesn't need to share every single new discovery with the explorers in different lands (the other GPUs).
Outer Optimization: The outer optimization is when communication between workers occurs. At this step, the updates from each worker are collected and averaged to create a new, shared set of parameters, which is then sent back to the workers as the starting point for the next round of inner optimization. Importantly, only the change each worker has accumulated since the last synchronization, known as the “pseudo-gradient,” needs to be shared, which is a much more resource efficient approach than redundantly transmitting the entire model state. Here, this would be like all of the explorers coming back together, sharing just the new findings of each group, and then collectively updating one master version of their map to arrive at a more detailed version. They then take this new and improved map and go back out on their isolated expeditions.
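A minimal, single-process sketch of this inner/outer structure follows (our own simplification of the DiLoCo recipe, with the workers simulated sequentially rather than on separate machines; hyperparameters are illustrative):

```python
import copy
import torch
import torch.nn.functional as F

def diloco_round(global_model, outer_opt, worker_loaders, inner_steps=50):
    """One outer round: each worker trains locally, then only the averaged
    pseudo-gradient (its drift from the shared weights) is communicated and applied."""
    global_params = [p.detach().clone() for p in global_model.parameters()]
    pseudo_grad = [torch.zeros_like(p) for p in global_params]

    for loader in worker_loaders:                      # one iterable of (x, y) batches per worker
        local = copy.deepcopy(global_model)            # start from the shared weights
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-4)
        for _, (x, y) in zip(range(inner_steps), loader):
            inner_opt.zero_grad()
            F.cross_entropy(local(x), y).backward()
            inner_opt.step()
        for acc, g, p in zip(pseudo_grad, global_params, local.parameters()):
            acc += (g - p.detach()) / len(worker_loaders)   # average drift across workers

    outer_opt.zero_grad()
    for p, g in zip(global_model.parameters(), pseudo_grad):
        p.grad = g                                     # treat the pseudo-gradient as a gradient
    outer_opt.step()                                   # e.g. SGD with Nesterov momentum

# The outer optimizer is constructed once and persists across rounds, e.g.:
# outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
```

In a real deployment each worker sits on its own machine and only the pseudo-gradients cross the slow network, once every few hundred inner steps instead of every step.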
DiLoCo’s approach was impactful because it showed that communication at every step during pre-training is not necessary to achieve performant models. Instead, the authors found that, with clever optimizations, you can train comparable models with 500x less communication. While DiLoCo presents a promising framework for distributed training, its inner/outer optimization scheme was more of a research project than a real-world implementation. Since its original publication in 2023, however, DiLoCo has been leveraged by teams explicitly building decentralized training networks.
INTELLECT-1 & Open DiLoCo7,8
Open DiLoCo, published by the Prime Intellect team, is a practical implementation of DiLoCo. Through their work, they were able to not just replicate DiLoCo in the wild, but also scale its utility. Further, they adapted the DiLoCo approach to PyTorch and open sourced it, allowing other research teams to much more easily implement their own versions of DiLoCo and reproduce the results.
While the original DiLoCo research produced models in the range of 60 to 400 million parameters, Open DiLoCo scaled this to a 1B parameter model with its first implementation, leveraging four clusters of GPUs in the US, Canada, and Finland.
However, with their second, more recent implementation dubbed INTELLECT-1, the Prime Intellect team was able to train a 10B parameter model on clusters of H100s across three continents. For this implementation, they took a hybrid approach to training by using DiLoCo for communication across nodes, Fully Sharded Data Parallelism (FSDP) for communication within nodes, and their load-balancing architecture titled PRIME to handle devices joining and leaving the network. While we’ve already detailed DiLoCo’s approach to inner / outer optimization, we’ll detail FSDP and PRIME below.
Fully Sharded Data Parallelism:9 Developed by researchers at Meta AI, FSDP is an approach to data parallelism that attempts to avoid a key drawback of the method, namely that the entire model needs to be stored in memory on each device. FSDP “shards” the model parameters, gradients, and optimizer states across GPUs, which allows each device to hold only a smaller portion of the model. During training, FSDP gathers parameters across shards when needed, performs calculations, then offloads parameters once calculations are done.
This dynamic approach is extremely memory efficient, but does require fast interconnects between nodes to facilitate communication during synchronization operations. For Prime Intellect, this was well suited to their distributed training setup which consisted of 30 nodes across three continents. Each node was a cluster of 8 physically interconnected H100s. FSDP was a viable approach to be used within these nodes of GPUs (because of their high speed interconnects) and DiLoCo was used to synchronize across the 30 nodes.
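For reference, this is roughly how FSDP is applied within a single, well-connected node using PyTorch's built-in wrapper (a minimal sketch: `build_model` and `get_dataloader` are placeholders, and a production setup would add sharding policies, mixed precision, and checkpointing):

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Launched with torchrun, which sets RANK / LOCAL_RANK / WORLD_SIZE.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)   # placeholder: your transformer
    model = FSDP(model)                    # shard params, grads, and optimizer state across GPUs

    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for x, y in get_dataloader():          # placeholder: this node's data shard
        x, y = x.to(local_rank), y.to(local_rank)
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()

if __name__ == "__main__":
    main()
```

The key design choice in INTELLECT-1 was layering: bandwidth-hungry sharding like this stays inside each node's fast interconnects, while the slow links between nodes only carry DiLoCo's infrequent pseudo-gradient exchanges.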
PRIME: In an effort to make DiLoCo more robust in the context of truly decentralized training, the Prime Intellect team developed PRIME, their own architecture for handling fault tolerance and robustness of nodes in their network. PRIME is focused on two main problems in distributed training – new nodes entering the network and active nodes dropping out of the network.
On the entry of new nodes, PRIME leverages “blocking synchronization.” In blocking synchronization, when a new node joins the network, the training run is temporarily halted so that the new node can download the model and optimizer state to “catch up” to the other nodes before training resumes. In training INTELLECT-1, new nodes were added relatively infrequently (only every couple of days), so this approach worked well. However, in a more dynamic network where nodes join with greater frequency, a non-blocking approach might be preferred, trading a higher risk of loss spikes for more consistent utilization.
On nodes dropping out of the network, PRIME uses a clever monitoring technique. Every node emits a signal or “heartbeat” to the network at two second intervals. If a node has not sent a signal in six seconds, it is assumed to have experienced a failure and is removed from the network. In the case of a planned exit from the network, a node can send a “deathrattle” to the network which signals its intent to drop, allowing the network to plan resources accordingly.
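A toy sketch of this liveness-tracking logic (our own illustration, not Prime Intellect's actual code; the intervals come from the description above):

```python
import time

HEARTBEAT_INTERVAL = 2.0   # each node emits a heartbeat this often, in seconds
FAILURE_TIMEOUT = 6.0      # silence longer than this is treated as a failure

class NodeRegistry:
    def __init__(self):
        self.last_seen = {}      # node_id -> time of last heartbeat
        self.leaving = set()     # nodes that sent a planned-exit "deathrattle"

    def heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def deathrattle(self, node_id):
        self.leaving.add(node_id)    # planned exit: the network can rebalance ahead of time

    def evict_failed(self):
        now = time.monotonic()
        failed = [n for n, t in self.last_seen.items() if now - t > FAILURE_TIMEOUT]
        for n in failed:
            self.last_seen.pop(n)
            self.leaving.discard(n)
        return failed                # the caller reassigns these nodes' work
```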
Taken together, PRIME’s handling of network entries and exits makes for a much more robust approach to distributed training.
DisTrO & DeMo10,11
Developed by the Nous Research team, Distributed Training Over-The-Internet, or “DisTrO,” is a family of optimizers that allows for massive reductions in communication during pre-training. Importantly, while Nous has run a public distributed training run of a 15B model, they have yet to publish a comprehensive breakdown of how they achieved it. However, they have published a detailed report on a specific optimizer they developed called Decoupled Momentum Optimization, or DeMo for short.
If you recall from the DiLoCo approach, an inner optimization allowed workers to update their own copies of the model. The specific optimization algorithm used was AdamW, a variant of the Adam optimizer introduced by researcher Diederik P. Kingma in 2014.12 With DeMo, Nous identified several shortcomings of AdamW in this setting. Namely, standard training with AdamW relies on transmitting large amounts of data within clusters of GPUs, and is therefore generally implemented where high speed communication is available via interconnects like InfiniBand or NVLink. DeMo solves this with an optimization algorithm that requires significantly less communication and is therefore better suited to low bandwidth environments. So how does it work?
While the full technical details are beyond the scope of this paper, in essence, DeMo achieves a several-orders-of-magnitude decrease in communication per step by only sharing what the authors refer to as the most important momentum components. So what is momentum? In pre-training, momentum refers to the model keeping a running average of past gradients to help guide how it updates its parameters at subsequent steps of training. Momentum allows the model to retain some of its prior history in order to achieve smoother optimization that doesn’t get stuck or make overly severe updates at each step.
Using a mathematical technique called the Discrete Cosine Transform during optimization, Nous showed that you can compress the amount of data that needs to be communicated to other nodes by separating the principal momentum components into slow-moving and fast-moving buckets.
- Fast-moving components: These components represent changes in the model that have a high, immediate impact on the model function. Importantly, these updates can be synchronized across GPUs quite efficiently. Recalling our explorer metaphor, these would be pressing discoveries, like a massive sinkhole or a landslide covering a road, that you’d want to know about ASAP.
- Slow-moving components: These components are more gradual trends in the model’s behavior and require more work to synchronize. These tend to only develop over the course of the model’s run and would vary from run to run if the training was to be done multiple times. Here, this would be tracking the greater topology of the terrain and more global patterns of the land.
Nous realized that by isolating these two components, you could achieve similar model outcomes by only synchronizing the fast-moving components at each optimization step and saving the synchronization of the slow-moving components to the end of the run. By replacing AdamW with DeMo, Nous was able to show that the data communicated per step decreased by several orders of magnitude without any loss in the performance of the final model (and in some cases saw improvements!). Leveraging DeMo, Nous recently publicly trained a 15B parameter model on a collection of heterogeneous devices spread across the globe.13
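To make the decomposition more concrete, here is a toy sketch of the underlying idea (our own simplification, not Nous's implementation; DeMo operates chunk-wise on each parameter tensor with considerably more machinery):

```python
import numpy as np
from scipy.fft import dct, idct

def split_momentum(momentum, k):
    """Project momentum into a DCT basis, share only the k largest-magnitude
    ("fast") components, and keep the remainder as a local residual."""
    coeffs = dct(momentum, norm="ortho")
    top = np.argsort(np.abs(coeffs))[-k:]
    shared = np.zeros_like(coeffs)
    shared[top] = coeffs[top]                        # this is all that gets communicated
    residual = idct(coeffs - shared, norm="ortho")   # the "slow" part stays on the worker
    return shared, residual

# Example: a 1M-element momentum buffer reduced to 1,000 transmitted coefficients.
momentum = np.random.randn(1_000_000).astype(np.float32)
shared, residual = split_momentum(momentum, k=1_000)
print(shared.nonzero()[0].size)   # ~1,000 values shared instead of 1,000,000
```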
SWARM Parallelism14
Stochastically Wired Adaptively Rebalanced Model (SWARM) Parallelism is an altogether different approach to distributed training from those outlined above. Led by Max Ryabinin (an AI researcher currently at Together.AI), SWARM Parallel focuses on a decentralized model-parallel approach to training. Unlike DeMo, which focused on optimizing how gradients are synchronized in a data parallel setup, SWARM’s model-parallel framework derives its performance benefits from developing a new approach to how the model and data are shared across and passed through a network of heterogeneous devices. Here’s how it works:
SWARM is a targeted approach to pipeline-parallel training. Pipeline-parallel training is a technique in pre-training that splits layers of the model across nodes in the training network. In this setup, data sequentially flows through different nodes with each node handling a different part of the model's layers. Importantly, this massively reduces communication needs because instead of requiring all-to-all communication across devices to relay the updated gradient, devices only need to communicate to the devices handling model layers before and after its location in the pipeline. Again, this is different from data-parallelism in which every node maintains a copy of the model and processes a different subset of data.
Pipeline-parallelism has a distinct advantage over data-parallelism in the decentralized setting in that it allows for larger models to be trained across networks of RAM-constrained consumer devices. As models increase in size, consumer grade GPUs are likely to encounter memory limitations in being able to store the entirety of large models (an issue in the data-parallel scenario). In pipeline-parallel however, you can break the model into smaller pieces and spread them across devices, resulting in less storage overhead.
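As a simple illustration of why this eases memory pressure, here is a toy partitioning of a layer stack across pipeline stages (our own sketch; real systems also split the embedding and output layers and schedule microbatches between stages):

```python
import torch.nn as nn

def split_into_stages(layers, num_stages):
    """Divide a model's layers into contiguous chunks, one per participant,
    so no single device has to hold the full model in memory."""
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage])
            for i in range(0, len(layers), per_stage)]

# A hypothetical 24-layer transformer split across 4 participants: each holds 6 layers,
# and only activations flow between neighboring stages during training.
layers = [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(24)]
stages = split_into_stages(layers, num_stages=4)
print([sum(p.numel() for p in s.parameters()) for s in stages])
```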
However, even with all of these advantages, pipeline-parallel approaches to pre-training have a major flaw when it comes to training via decentralized, heterogeneous hardware. Given the sequential nature of the process, any training run is going to be bottlenecked by the slowest hardware in the network. Imagine training over a decentralized network where one model layer is being handled by state-of-the-art H100s with 1 Gb/s connectivity while the subsequent layer is hosted on a consumer grade GPU with a 100 Mb/s connection. The run will only be able to move as fast as the weakest link. SWARM addresses this limitation in pipeline-parallel training by introducing dynamic or “stochastic” pipelines that spin up and down as computational resources are needed at different stages.
The key innovation with SWARM is its approach to handling the flow of data through a messy network of devices during training. SWARM’s algorithm handles two critical tasks: input routing and load balancing.
- Stochastic wiring: Within a layer, SWARM dynamically routes data through the group of devices based on each device’s performance. More performant devices are routed proportionally more work, and less performant devices proportionally less. This approach also takes into account the proximity of nodes, favoring devices closer to each other, which can transmit data to one another faster (see the sketch after this list). This helps address the bottleneck issues outlined above.
- Adaptive swarm rebalancing: SWARM is also able to allocate underutilized devices based on the needs of various layers at any given time. If you imagine a heterogeneous network of devices across the world, it’s inevitable that some GPUs may go online and offline during a run – they could get unplugged, lose network connectivity, experience a hardware defect, or any other number of faults. SWARM addresses this issue by allowing for devices to move across stages of the pipeline based on computational need at any given time. If one layer of the training is experiencing high faults and/or has more computational demand than it can handle, devices from underutilized layers can be reassigned to the layer in need. The same goes for any devices that join the network in the middle of a training run. This dynamic rebalancing ensures that computational resources are allocated in the most efficient manner possible.
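A toy sketch of the performance-weighted routing idea from the stochastic wiring item above (our own simplification; SWARM's actual wiring also weighs latency, queue depth, and failure history):

```python
import random

def pick_next_stage_peer(peers):
    """Route a microbatch to one of the peers serving the next pipeline stage,
    with probability proportional to each peer's recent throughput."""
    ids = list(peers)
    weights = [peers[p] for p in ids]
    return random.choices(ids, weights=weights, k=1)[0]

# Hypothetical peers serving the next stage, keyed to measured microbatches/sec.
next_stage_peers = {"h100-node": 40.0, "rtx-4090": 9.0, "rtx-3060": 3.0}
print(pick_next_stage_peer(next_stage_peers))   # "h100-node" is chosen ~77% of the time
```

Because the routing is probabilistic rather than fixed, slow or failed devices simply receive less traffic instead of stalling the whole pipeline.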
Perhaps most importantly, Ryabinin et al. identified what they call the “Square-Cube Law” of distributed training, which shows that computation time grows faster than communication time as models get larger. This means that if you compare compute versus communication overhead, larger models are more communication efficient and therefore proportionally better suited to distributed training.
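A rough intuition for why this holds in transformer-style models (our own simplification of the paper's argument): for a pipeline stage with hidden size $h$ processing $b$ tokens,

$$\text{compute per stage} = O(b \cdot h^{2}), \qquad \text{activations passed between stages} = O(b \cdot h),$$

so doubling the hidden size roughly quadruples each device's local work while only doubling the data it must send onward, and the compute-to-communication ratio improves as models grow.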
With the approach outlined above, SWARM was used to train a 1B parameter model across a network of cloud-based T4 GPUs with less than 200 Mb/s of network bandwidth, and the authors showed that the results were comparable to more traditional approaches like data parallelism.
Summary: "The models yearn for the swarm"
As we conjecture about the future of decentralized training techniques, many researchers in the industry have expressed particular enthusiasm for SWARM’s approach. Why? Modern models are proving to be too big to store in the VRAM of consumer hardware.15
For example, an H100 GPU has 80GB of VRAM, while high-end consumer graphics cards sit in the 10-24GB range. The VRAM requirements of Meta's Llama family of models illustrate the challenge of running anything beyond the smallest models on memory-constrained consumer GPUs.
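As a rough rule of thumb (our own arithmetic, assuming 16-bit weights and ignoring activations, gradients, and optimizer state, which add substantially more):

$$\text{weight memory} \approx 2\ \text{bytes/parameter} \times N_{\text{params}},$$

so an 8B-parameter model needs roughly 16 GB just to hold its weights, already borderline for a high-end consumer card, while a 70B-parameter model needs roughly 140 GB, more than even a single 80 GB H100.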
As previously mentioned, data parallel techniques require the entire model to be stored on each device or split and stored across 'nodes' of interconnected devices. Data-parallel or sharded data-parallel techniques like those used in Nous and Prime Intellect’s research are thus not entirely well suited for a fully decentralized training network where individual users should be able to participate via lending compute power from individual consumer GPUs. DiLoCo works well when you are able to store an entire model on a single GPU. FSDP works when you have interconnected GPUs that can form a node collectively and shard the model within the node of GPUs. Indeed, Prime Intellect's run leveraged nodes of 8 interconnected H100s to shard models, and Nous used tensor parallelism across 2 interconnected H100s. In contrast, SWARM Parallelism does not necessitate the same need for node interconnectivity or high storage capacity.
Beyond this however, SWARM is quite appealing given its ability to produce an extremely efficient and fault tolerant process for allocating compute resources. And on top of this, the Square-Cube law positions SWARM to only become more efficient at larger scales.
Looking ahead, what would be quite interesting is to see the performance of an approach that leverages the architecture and fault tolerance of SWARM with DeMo’s massive reduction in communication overhead, or approaches that are able to leverage FSDP across clusters of heterogeneous consumer hardware over the internet. It will come as no surprise that some of the most recent discussion among researchers in this space has been centered on these topics.
Security & Sybil Resistance
To date, all implementations of the above methods have been done in environments that assume a certain amount of trust. In decentralized training, there must be ways to protect against adversarial actors who would participate in the network with intentions to tamper with models, exploit user data, or take over the network. What are some potential solutions for achieving a trustless and secure decentralized training network? Below is a selection of techniques being explored:
Cryptographic Security: ZK, FHE, & TEEs
One approach to ensuring valid participation in a decentralized training network would be to leverage applied cryptography to verify that computation has been done correctly during training. There are a handful of different approaches being experimented with to date, all with varying degrees of viability.
Zero Knowledge Proofs (ZKPs)
ZKPs have already been deployed in the context of AI inference. Zero-knowledge machine learning (ZKML) has been used by teams to attest to the performance of a model without revealing the model’s underlying weights.16 Teams like EZKL have developed libraries that allow for developers to implement this type of attestation for any ML model.
While we are not aware of any practical implementation in the context of pre-training, it is possible to apply a similar approach to attest to the correct calculation of a gradient without revealing the underlying data or model weights to a verifier in a decentralized training network. EZKL has recently published work on applying proofs of convex optimization to training, a promising step toward the marriage of ZK and training. To date, however, ZKML’s adoption has been limited due to its high computational overhead. The process of generating ZKPs can be quite resource intensive, and while significant progress is being made to reduce the additional compute required for practical implementations, in the context of decentralized training, where reducing latency and communication is the name of the game, ZK will need further development before it’s practical.
Fully Homomorphic Encryption (FHE)
FHE allows for computation on encrypted data, which would appear to be an ideal solution in the context of decentralized training, where malicious nodes could gain access to and leak sensitive training data. Indeed, FHE has been labeled the holy grail of privacy-preserving computation given the magic of being able to perform complex operations on encrypted data. However, this potential upside brings with it a major drawback in the form of even greater computational intensity and complexity than ZK. To date, pre-training on encrypted data has simply been unfeasible. However, companies like Lattica.AI are developing novel FHE schemes specific to AI/ML applications in order to massively reduce this overhead. Time will tell if these become practical in the near future.
Trusted Execution Environments (TEEs)
TEEs are a hardware-based approach to private compute demonstrating great promise in the domain of AI/ML. A full explanation of TEEs is beyond the scope of this paper, but the basic idea is that a physically isolated enclave on a processor can process data while guaranteeing the information remains private. The upside of utilizing TEEs versus the above cryptographic solutions is that TEEs are much more computationally efficient. Model weights could be held and processed in TEEs to ensure privacy without introducing significant overhead. Further, because TEEs are hardware-based, they offer performance comparable to ordinary execution rather than introducing new computational overhead from generating proofs or performing operations on encrypted data.
Companies like Phala Network and Sentient have been researching the application of TEEs in AI. Most recently, researchers at Phala published some initial work on fine-tuning in a TEE and saw little performance degradation compared to a non-TEE approach.17
One downside to TEEs is that, at present, they are only available on specialized hardware. Many AI-tailored GPUs like NVIDIA H100s boast them, but the average card in a consumer laptop or gaming PC is not going to have a dedicated secure enclave. By requiring all participants in a decentralized training network to provide hardware with TEEs, you are setting the barrier to participation extremely high. An ideal decentralized training network with a wide spectrum of heterogeneous, consumer hardware is not going to be one full of TEEs, or at least we won’t see one for quite some time, as TEE-enabled hardware takes time to proliferate at the consumer level.
Economic Security
An altogether different solution to achieving security in decentralized training would be economics-based approaches. Instead of relying on cryptographic schemes to enforce privacy, decentralized networks could design financial incentives such that it would be economically irrational to tamper with training runs. One type of setup, proposed in different forms by companies like Gensyn and Pluralis Research, would be to require all participants in the training network to stake a certain amount of capital in order to join.18,19 If a node was determined to be producing improper gradients or otherwise subverting runs, it would be financially penalized by having its stake slashed. This would mirror many proven staking and slashing setups in other parts of crypto, but it also brings drawbacks. One is that requiring a stake to participate could exclude the everyday user from a network that should be designed to be open to all participants.
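A toy sketch of what this bookkeeping might look like (our own illustration, not Gensyn's or Pluralis's actual design; the parameters are arbitrary):

```python
class StakeRegistry:
    """Participants bond capital to join; stake is burned when their
    submitted work fails verification."""

    def __init__(self, min_stake=100.0, slash_fraction=0.5):
        self.min_stake = min_stake
        self.slash_fraction = slash_fraction
        self.stakes = {}                       # node_id -> bonded amount

    def join(self, node_id, amount):
        if amount < self.min_stake:
            raise ValueError("stake below the minimum required to participate")
        self.stakes[node_id] = amount

    def report_invalid_work(self, node_id):
        """Called when a verifier shows the node's contribution was incorrect."""
        penalty = self.stakes[node_id] * self.slash_fraction
        self.stakes[node_id] -= penalty
        if self.stakes[node_id] < self.min_stake:
            del self.stakes[node_id]           # ejected from the network
        return penalty
```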
In conclusion, nearly all research in this space has been focused on reducing communication overhead to make the training process feasible. Moving forward, we expect to see more innovation at the security level.
Cost
Another question that needs answering in the context of decentralized training is – even if it becomes technically feasible to train models at scale in a decentralized manner, who will care if it's prohibitively expensive compared to a centralized solution? Here again, there are no live networks offering real-time pricing to evaluate. Instead, we’ll pursue a heuristics-based approach highlighting potential pros and cons of decentralized training.
For simplicity’s sake, we can understand the costs of decentralized training via the basic formula below:
Centralized Training → Cost of compute = training cost
Decentralized Training → Cost of compute * [“decentralization multiplier”] = decentralized training cost
While simplistic, this formula captures the idea that decentralization adds significant overhead to the efficiency of pre-training. This “decentralization multiplier” reflects the inefficiencies of low bandwidth communication, additional computation needed for verification and security, fault tolerance, etc. If you assume that decentralized training has the same cost of compute as centralized offerings, decentralized training will always be a less price-performance efficient solution than the centralized version. However, there are actually some scenarios in which decentralized training might provide lower costs and in turn greater efficiency than a centralized counterpart.
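To make the break-even condition concrete (our own framing, with $m$ as the decentralization multiplier and $c$ as the per-unit cost of compute):

$$c_{\text{dec}} \cdot m < c_{\text{cen}} \iff \frac{c_{\text{dec}}}{c_{\text{cen}}} < \frac{1}{m},$$

so with a 2x overhead ($m = 2$), decentralized compute must cost less than half as much per GPU-hour as centralized compute before it wins on price alone. The scenarios below are the levers that can push $c_{\text{dec}}$, and eventually $m$, down.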
Cooling: In the centralized training paradigm, a significant amount of work is put into cooling GPUs that generate massive amounts of heat in close proximity to each other. For example, major data centers located in US desert regions are estimated to consume anywhere from 3-5 million gallons of water every day to liquid cool their facilities.20 In a decentralized setting, where a vast network of individual devices across the world are leveraged, cooling is no longer an issue. If the GPUs aren’t in the same physical location throwing massive amounts of heat, you don’t need to introduce additional cooling overhead. Here, this represents an increase in efficiency in the decentralized paradigm.
Data Center Builds: The current path to scaling centralized training has been to construct larger and larger power-hungry data centers to handle greater compute needs. Data centers require massive capital outlays to construct, with OpenAI quoting up to $100B for their latest 5GW center and on average data centers costing in the $1-4B range.21 This detailed article from Dylan Patel outlines the numerous costs that go into building a data center. Of these costs, there are many outside of pure compute. We have already discussed cooling, but there are myriad other factors like permitting, staffing, new fiber and utilities installation, and more.
Crypto Incentives or “Protocol Learning": Coined by the Pluralis Research team, “Protocol Learning” is an approach to the decentralized training of models that eschews cost considerations altogether.22 Through blockchain-based fractional ownership, network participants can receive, in lieu of upfront payment, ownership in models proportional to the amount of work performed to train said model. In turn, any revenue generated by the model in the future can flow back to the network participants who had a hand in creating it. Taken from the other side, this allows researchers and developers to train their models without massive cash outlays upfront. This approach opens up an entirely new path for model creation and is one that allows for the viability of decentralized training while technical advancements are made to increase its price-performance efficiency in contrast to centralized approaches.
Scale & Crypto Incentives
If there is one thing crypto is good at, it’s using financial incentives to bootstrap large networks of resources. From general purpose chains like Bitcoin and Ethereum to specialized networks like Bittensor and Akash, crypto has shown that decentralized networks can achieve massive scale through providing thoughtfully designed incentives.
Let’s use an example to illustrate why we think bootstrapping a decentralized training network at frontier-model scale is feasible. GPT-4 is rumored to have taken anywhere from 50,000 to 72,000 MWh to train over the course of 5-6 months.23 Over the course of 2023, Bitcoin mining is estimated to have consumed 120 TWh. That is over 1,500x more energy provided by a decentralized network than is needed to train the latest frontier models. With the right optimizations, a decentralized network with even a fraction of Bitcoin’s compute would be able to train useful models at scale.
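The arithmetic behind that comparison, taking the upper end of the GPT-4 estimate:

$$\frac{120\ \text{TWh}}{72{,}000\ \text{MWh}} = \frac{120{,}000{,}000\ \text{MWh}}{72{,}000\ \text{MWh}} \approx 1{,}667\times$$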
Market Landscape
As outlined above, decentralized training is a young but fast-emerging corner of deAI. Only two of the four companies covered in the following section (Nous and Prime Intellect) have published on distributed training runs, and none have released robust technical documentation on their approaches to incentivization or privacy and security. As such, what follows is an analysis of hard metrics and benchmarks where feasible; in the absence of such information, we describe what each company has proposed or expects to do in the future.
Gensyn
Gensyn has published quite little about their approach to achieving communication efficiency in the decentralized setting. Their recent publications point to admiration for the achievements of SWARM Parallel, DiLoCo, and DisTrO, but little has been revealed about their own specific technical approaches. They do envision their network being open, trustless and agnostic to the type of compute (consumer or professional).24
One thing that Gensyn has been more open about than other companies is their approach to privacy and security. Gensyn’s 2022 litepaper outlines their envisioned security model: a network of several actors in a verification-and-challenge setup. When work is submitted, a “solver” performs the work of training and generates a cryptographic proof of the work done. Next, a “verifier” checks the work of the solver by replicating parts of the solver’s proof to ensure the model performs as expected. Last, a “whistleblower” checks the verification work of the verifier. If they catch a fault in the verifier’s work, they can submit a challenge and receive a reward (those familiar with optimistic rollups might see some parallels here). This is an interesting approach that combines both cryptographic and economic approaches to security.
Pluralis Research
Similar to Gensyn, Pluralis Research has published very little about their own technical approaches to facilitating decentralized training. However, founder Alexander Long has highlighted SWARM parallelism, with its unique ability to become more efficient as it scales, as a promising approach.
One novel approach that Pluralis Research has been vocal about is the “Protocol Learning” design described earlier. Their network is envisioned as one full of heterogeneous compute, where nodes are rewarded with model ownership and those using the protocol’s services can train their models without massive upfront expenditure.25 This revenue-sharing model is an elegant approach that drastically reduces the barriers to entry for developers looking to train their own models.
Additionally, Pluralis Research has been the first of these companies to isolate some of the ethical / existential risks of decentralized training networks. In their recent publication, they identify what they call the “No-Off” Problem. In essence, Pluralis Research’s team argues that, in the decentralized setting, the risk of harmful AGI is much higher as there is no centralized party that can pull the plug on the model. Decentralized networks are much harder to control once they have scaled – just look at the challenges of making big changes to the Bitcoin network. How might decentralized training networks develop kill switches or controls on the potential for harmful models to be developed and deployed on their networks? While no clear answer has been given yet, this will be a major area of future research for the industry.
Nous Research
Nous has been one of the foremost companies in advancing the technical capabilities of decentralized training. They’ve published a light, preliminary report on DisTrO, their family of optimizers designed for distributed settings, and a more in-depth report on DeMo, a specific optimizer they developed that is >2x faster than AdamW and requires 100x less bandwidth. Most recently, they finished their distributed training run of a 15B parameter model on hardware distributed across the US and Europe. Nous has also teased the release of Psyche, the term they’re using to refer to what appears to be their incentivized decentralized training network.26
Nous’s approach differs from something like SWARM in that it leverages data parallel techniques with a tweak. Instead of storing the entire 15B model on a single GPU, they used tensor parallelism across nodes of two H100 GPUs to split their 15B model across the memory of both devices.27 It will be interesting to see how Nous handles the question of GPU memory and model vs. data parallelism as they release more information on their decentralized network.
Prime Intellect
Prime Intellect was the first of these four companies to publish on a distributed training run. As discussed above, their work in implementing Open DiLoCo was a first for the industry. To be clear, however, they did not merely take DeepMind’s work and replicate it; they leveraged Hivemind to make the framework more fault tolerant and robust for training in the decentralized setting. Leveraging this, they facilitated a 1B parameter training run on four clusters of 8 H100s across the US, Canada, and Finland, followed by the 10B parameter INTELLECT-1 run described earlier. In their report on Open DiLoCo, Prime Intellect highlighted more efficient compute methods and model merging techniques as key areas for future research, which seems to dovetail nicely with Nous’ work on DeMo.
Future Areas of Research & Open Questions
The innovation in decentralized training over just the past year has been astounding. Yet, there remains much progress still to be made. What follows are some of the most pressing areas of future research from our perspective:
Is model parallel the future?
- If decentralized networks are to run on consumer hardware, it would seem to follow that modern models will simply be too large to store on consumer GPUs or even nodes of consumer GPUs. In a fully decentralized training network where we might not have the luxury of creating homogeneous nodes of interconnected devices, could data parallel approaches still hold? Alternatively, are there sharded data parallel approaches that would be both low communication and allow for the grouping of heterogeneous hardware?
What would a shift towards test-time compute-based scaling mean for decentralized training?
- Recently, much has been made about centralized pre-training hitting a scaling wall due to a lack of quality data to feed into larger and larger training runs.28 In light of this, researchers might now be shifting towards greater innovation at the time of inference to further push the performance of models. If this pattern holds, what would it mean for decentralized training? Would less of a focus by centralized players on hyperscaling at pre-training allow models trained in a decentralized way to catch up? Would smaller models trained on (at least initially) less competitive decentralized networks suddenly become higher utility if additional performance can be squeezed out via test-time compute? Will massive amounts of compute directed at inference recreate the same have / have-nots dynamic we saw in pre-training?
How might decentralized pre-training and inference coexist on the same network?
- Many industry experts believe inference-focused compute is going to be a much larger market than compute for training.29,30 From a business model standpoint, a decentralized network that can handle both pre-training of models and inference after the model has been trained is more attractive than just a specialized training network. What technical innovations might be needed for a decentralized network to handle both training and inference?
Which approaches to privacy and security will hold?
- From our perspective, the sheer amount of research needed to make decentralized training technically feasible has led to a dearth of research on making these networks trustless and secure. We’re looking forward to seeing how both different cryptographic and economic approaches are employed to make the jump from distributed training to truly decentralized training.
What is the moral case for developing trustless, decentralized approaches to developing frontier models?
- This piece intentionally stayed away from discussing the moral imperative of creating trustless networks to train models; thoughtfully discussing such a topic would necessitate its own article. However, on our mind are questions like what would happen if Meta stopped open sourcing their LLaMa models and developers could no longer fine tune custom versions? What if a government decided to take control over all model development (as was alluded to in a recent Marc Andreessen interview31)?
Where is demand for decentralized training going to come from?
- Taking the other side of the above point, if Meta continues to open source frontier models and allow individuals to easily fine tune these models to their needs, where is the demand for pre-training models on decentralized networks going to come from? Put another way, why would a developer pay to train something from scratch when they could just fine-tune an off-the-shelf model?
We welcome any insights or comments on the above questions, and look forward to more research in these key categories.
Conclusion
Throughout this paper, we have examined some of the major innovations in and pending challenges to decentralized training. We explored the nuances of emerging techniques like DiLoCo and SWARM Parallelism, highlighted the need for memory-efficient approaches, and discussed some of the most pressing issues yet to be resolved like security and economic incentivization.
Still, with all of the innovation in this space over just the past year, one cannot help but feel tremendously optimistic about where decentralized training is heading. In our opinion, decentralized training at scale has become not a question of if, but when. Of course, there is still significant work that needs to be done for companies in this space to make the jump from distributed training to decentralized training. The entire purpose of decentralized training is to push back against the centralizing and extractive forces of big AI; trustlessness and verifiable privacy are essential to any decentralized training solution. While we have yet to see such a network exist, we are confident that one will be launched in the near future, and with it, we hope to see a new era of open and permissionless AI arrive. ✦
1. https://epochai.org/blog/how-much-does-it-cost-to-train-frontier-ai-models
2. Ibid.
4. https://x.com/tsarnick/status/1853898866464358795
5. https://semianalysis.com/2024/09/04/multi-datacenter-training-openais/
6. https://arxiv.org/pdf/2311.08105
7. https://arxiv.org/pdf/2407.07852
8. https://github.com/PrimeIntellect-ai/prime/blob/main/INTELLECT_1_Technical_Report.pdf
9. https://arxiv.org/pdf/2304.11277
10. https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf
11. https://arxiv.org/pdf/2411.19870
12. Importantly, Kingma is a collaborator on Nous’s DeMo research which builds on his work on AdamW.
13. https://x.com/NousResearch/status/1863622813317464157
14. https://arxiv.org/pdf/2301.11913
15. https://www.youtube.com/watch?v=Ichh_3gQF94
16. https://arxiv.org/pdf/2402.02675
17. https://x.com/tolak_eth/status/1866873443955904826
18. https://arxiv.org/pdf/2412.07890
19. https://docs.gensyn.ai/litepaper
20. This usage was so extreme that new build data centers are being forced to switch to air cooling in the region. https://www.theatlantic.com/technology/archive/2024/03/ai-water-climate-microsoft/677602/, https://www.washingtonpost.com/climate-environment/2023/04/25/data-centers-drought-water-use/
21. https://www.bain.com/insights/ai-changes-big-and-small-computing-tech-report-2024/
22. https://www.pluralisresearch.com/p/article-2-protocol-learning-protocol
24. https://mirror.xyz/gensyn.eth/_K2v2uuFZdNnsHxVL3Bjrs4GORu3COCMJZJi7_MxByo
25. https://www.pluralisresearch.com/p/article-2-protocol-learning-protocol
26. https://x.com/Teknium1/status/1870248999221113214
27. https://x.com/bloc97_/status/1863631013953388695
28. https://www.youtube.com/watch?v=WQQdd6qGxNs
29. https://www.youtube.com/watch?v=Z77jZkYDpIE
30. https://joincolossus.com/episode/chetan-puttagunta-and-modest-proposal-capital-compute-ai-scaling/
31. https://www.youtube.com/watch?v=sgTeZXw-ytQ
Legal Disclosure: This document, and the information contained herein, has been provided to you by Hyperedge Technology LP and its affiliates (“Symbolic Capital”) solely for informational purposes. This document may not be reproduced or redistributed in whole or in part, in any format, without the express written approval of Symbolic Capital. Neither the information, nor any opinion contained in this document, constitutes an offer to buy or sell, or a solicitation of an offer to buy or sell, any advisory services, securities, futures, options or other financial instruments or to participate in any advisory services or trading strategy. Nothing contained in this document constitutes investment, legal or tax advice or is an endorsement of any of the digital assets or companies mentioned herein. You should make your own investigations and evaluations of the information herein. Any decisions based on information contained in this document are the sole responsibility of the reader. Certain statements in this document reflect Symbolic Capital’s views, estimates, opinions or predictions (which may be based on proprietary models and assumptions, including, in particular, Symbolic Capital’s views on the current and future market for certain digital assets), and there is no guarantee that these views, estimates, opinions or predictions are currently accurate or that they will be ultimately realized. To the extent these assumptions or models are not correct or circumstances change, the actual performance may vary substantially from, and be less than, the estimates included herein. None of Symbolic Capital nor any of its affiliates, shareholders, partners, members, directors, officers, management, employees or representatives makes any representation or warranty, express or implied, as to the accuracy or completeness of any of the information or any other information (whether communicated in written or oral form) transmitted or made available to you. Each of the aforementioned parties expressly disclaims any and all liability relating to or resulting from the use of this information. Certain information contained herein (including financial information) has been obtained from published and non-published sources. Such information has not been independently verified by Symbolic Capital and, Symbolic Capital, does not assume responsibility for the accuracy of such information. Affiliates of Symbolic Capital may have owned or may own investments in some of the digital assets and protocols discussed in this document. Except where otherwise indicated, the information in this document is based on matters as they exist as of the date of preparation and not as of any future date, and will not be updated or otherwise revised to reflect information that subsequently becomes available, or circumstances existing or changes occurring after the date hereof. This document provides links to other websites that we think might be of interest to you. Please note that when you click on one of these links, you may be moving to a provider’s website that is not associated with Symbolic Capital. These linked sites and their providers are not controlled by us, and we are not responsible for the contents or the proper operation of any linked site. The inclusion of any link does not imply our endorsement or our adoption of the statements therein. We encourage you to read the terms of use and privacy statements of these linked sites as their policies may differ from ours. 
The foregoing does not constitute a “research report” as defined by FINRA Rule 2241 or a “debt research report” as defined by FINRA Rule 2242 and was not prepared by Symbolic Capital Partners LLC. For all inquiries, please email info@symbolic.capital. © Copyright Hyperedge Capital LP 2024. All rights reserved.