Accelerating Mixture of Experts Training With Rail Optimized InfiniBand Networking in Crusoe Cloud

855 views · Feb 12, 2025 · 17:45 min

Takeaway

Rail-optimized InfiniBand on green-powered Crusoe Cloud cuts the collective-communication penalty (all-reduce, plus the all-to-all traffic of MoE) that otherwise leaves GPUs idle for 25-30% of distributed training step time.

Summary

Ievgen (Crusoe PM, GPU networking) introduces Crusoe Cloud, an AI infrastructure provider whose data centers (Texas, central US, and geothermal-powered Iceland) run on stranded, wasted, or renewable energy, with the aim of net-zero training.
Three pillars: high performance, AI-developer-focused UX (CLI/API/GUI hiding hyperscaler complexity), and climate-aligned energy sourcing.
Distributed training spends 25-30% of step time in network all-reduce. Overlapping computation with communication yields only a ~10% reduction, so GPUs still sit idle at significant cost, which motivates dedicated GPU networking.
Crusoe runs a rail-optimized InfiniBand cluster fabric separate from the front-end VPC, optimizing topology for GPU-to-GPU communication; particularly important for Mixture-of-Experts training with heavy all-to-all traffic.
Partners include Together AI (using Crusoe for training/fine-tuning/inference) and code-gen labs training new foundation models on the infrastructure.
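
To put the idle-GPU figure from the summary in concrete terms, here is a minimal back-of-the-envelope sketch. Only the 25-30% communication share comes from the talk; the cluster size, run length, hourly price, and the function name `idle_gpu_cost` are hypothetical values chosen for illustration.

```python
def idle_gpu_cost(num_gpus, run_hours, price_per_gpu_hour, comm_fraction):
    """Estimate GPU-hours and cost spent waiting on the network.

    comm_fraction is the share of each training step in which GPUs
    are stalled on communication (0.25-0.30 per the talk summary).
    """
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be between 0 and 1")
    idle_hours = num_gpus * run_hours * comm_fraction
    return idle_hours, idle_hours * price_per_gpu_hour


if __name__ == "__main__":
    # Hypothetical job: 1,000 GPUs for 100 hours at $2.00 per GPU-hour.
    for fraction in (0.25, 0.30):
        hours, cost = idle_gpu_cost(1000, 100, 2.0, fraction)
        print(f"{fraction:.0%} stalled: {hours:,.0f} idle GPU-hours, ${cost:,.0f}")
```

Because the stalled share scales linearly with cluster size and run length, any reduction in communication time translates directly into recovered GPU-hours.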

infiniband · gpu-training · moe

Original description

State-of-the-art machine learning models are increasingly using techniques like mixture of experts that enable larger-scale models to be trained more efficiently by distributing layers of the model across multiple neural networks. This sparse distribution of model state puts increasing pressure on cluster-level networking while training. At Crusoe Cloud, we’ve built a high-performance InfiniBand network that's designed to provide the highest possible performance for these state-of-the-art training techniques. We use a “rail-optimized” design, reducing the number of hops between any set of GPUs in our cluster, accelerating all2all performance, and reducing training time. Learn more about how to utilize Crusoe Cloud rail-optimized networks to accelerate your training workloads.
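
The hop-count argument in the description can be illustrated with a toy model. This is a sketch under stated assumptions, not Crusoe's actual fabric: it assumes a two-tier leaf/spine network in which the GPU at local index k on every node attaches to leaf switch ("rail") k, that every node fits under a single leaf per rail, and that cross-rail traffic goes leaf, spine, leaf. The names `Gpu`, `switch_hops`, and `hop_histogram` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Gpu:
    node: int        # which server the GPU sits in
    local_rank: int  # index within the server; also its rail in this model


def switch_hops(src: Gpu, dst: Gpu) -> int:
    """Number of network switches crossed between two GPUs (toy model)."""
    if src.node == dst.node:
        return 0  # stays inside the server (intra-node interconnect)
    if src.local_rank == dst.local_rank:
        return 1  # same rail: both NICs hang off the same leaf switch
    return 3      # different rails: leaf -> spine -> leaf


def hop_histogram(num_nodes: int, gpus_per_node: int) -> Counter:
    """Count ordered GPU pairs by hop count for a full all-to-all exchange."""
    gpus = [Gpu(n, r) for n, r in product(range(num_nodes), range(gpus_per_node))]
    return Counter(switch_hops(a, b) for a in gpus for b in gpus if a != b)


if __name__ == "__main__":
    # Hypothetical cluster: 4 nodes with 8 GPUs each.
    for hops, pairs in sorted(hop_histogram(4, 8).items()):
        print(f"{hops} switch hop(s): {pairs} GPU pairs")
```

The model ignores software techniques that relay cross-rail traffic over intra-node links, so it should be read only as an illustration of why same-rail placement shortens paths.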

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Ievgen
Ievgen is a product manager at Crusoe, focused on building reliable and scalable AI-cloud infrastructure. He defines and guides the design of large and ultra-large-scale, multi-tenant GPU clusters, enabling customers to use thousands of GPUs simultaneously for ML training and inference. Before joining Crusoe, Ievgen held several different technical and product positions at networking vendors.