From Mixture of Experts to Mixture of Agents with Super Fast Inference - Daniel Kim & Daria Soboleva

4.1K views · Jun 27, 2025 · 53:15

Takeaway

Mixture-of-Agents on fast Cerebras inference lets teams approximate MoE-style specialization at the application layer without pretraining from scratch.

Summary

Cerebras workshop explaining Mixture of Experts (MoE): the monolithic FFN is replaced with a router plus multiple FFN experts, and because only a few experts run per token, parameter count scales without a proportional increase in inference cost
Cerebras hardware reportedly runs Llama 3.3 70B ~15.5x faster than the fastest GPU inference provider
Workshop builds a Mixture-of-Agents app: combine pre-trained models as agents instead of pre-training a new MoE
References GPT-3 (13B) → Llama 3 (400B) → DeepSeek V3 (~600B) scaling trajectory enabled by data and architecture innovations
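
To make the router-plus-experts idea from the summary concrete, here is a minimal pure-Python sketch of one MoE layer for a single token. It is an illustration under stated assumptions, not the implementation shown in the workshop: the top-k routing with a softmax over the selected logits, the ReLU experts, the layer sizes, and every name (`MoELayer`, `make_ffn`, `ffn_param_counts`) are hypothetical choices made for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_ffn(rng, d_model, d_hidden):
    """Build one expert: a two-layer ReLU feed-forward network (no biases)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        hidden = [max(0.0, v) for v in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return ffn


class MoELayer:
    """Hypothetical MoE layer: a linear router picks top_k experts per token."""

    def __init__(self, d_model, d_hidden, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                       for _ in range(num_experts)]
        self.experts = [make_ffn(rng, d_model, d_hidden)
                        for _ in range(num_experts)]

    def route(self, x):
        """Return the indices of the top_k experts and their mixing weights."""
        logits = matvec(self.router, x)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = ranked[:self.top_k]
        # Assumption: renormalize with a softmax over the selected logits only.
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def __call__(self, x):
        chosen, weights = self.route(x)
        out = [0.0] * len(x)
        for i, w in zip(chosen, weights):
            # Only the chosen experts are evaluated; the rest cost nothing.
            y = self.experts[i](x)
            out = [o + w * v for o, v in zip(out, y)]
        return out


def ffn_param_counts(d_model, d_hidden, num_experts, top_k):
    """Total vs. per-token active expert weights (router and biases ignored)."""
    per_expert = 2 * d_model * d_hidden
    return num_experts * per_expert, top_k * per_expert


layer = MoELayer(d_model=8, d_hidden=32, num_experts=4, top_k=2)
token = [0.1 * i for i in range(8)]
chosen, weights = layer.route(token)
output = layer(token)
total, active = ffn_param_counts(d_model=8, d_hidden=32, num_experts=4, top_k=2)
```

With these toy sizes the layer stores 2,048 expert weights but touches only 1,024 of them for any one token, which is the sense in which parameters grow faster than per-token compute.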

inference · moe · cerebras

Original description

Our hands-on workshop will walk you through how to build your own Mixture of Agents (MoA) system using the fastest and most capable open models available: Qwen3-32B and Llama 3.3 70B. MoA is an emerging architecture that combines the strengths of multiple large language models in a layered, agent-based design. This approach delivers superior performance by enabling specialized agents to collaborate across layers, outperforming today’s frontier models in both accuracy and efficiency.
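
The layered, agent-based design described above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's code: agents are modeled as plain callables from prompt to text, the stub agents stand in for real model calls (in practice each would wrap an inference API request), and the names `mixture_of_agents`, `build_aggregation_prompt`, and `make_stub_agent`, along with the prompt wording, are assumptions made for this example.

```python
from typing import Callable, List

# An agent is any function that maps a prompt to a text response.
Agent = Callable[[str], str]


def make_stub_agent(name: str) -> Agent:
    """Hypothetical stand-in for a model call; reports how many drafts it saw."""

    def agent(prompt: str) -> str:
        drafts_seen = prompt.count("Draft ")
        return f"{name}: response ({drafts_seen} drafts seen)"

    return agent


def build_aggregation_prompt(question: str, drafts: List[str]) -> str:
    """Pack the previous layer's answers into the next prompt."""
    lines = [
        "Synthesize the candidate answers below into one improved answer.",
        f"Question: {question}",
    ]
    for i, draft in enumerate(drafts, start=1):
        lines.append(f"Draft {i}: {draft}")
    return "\n".join(lines)


def mixture_of_agents(question: str,
                      layers: List[List[Agent]],
                      aggregator: Agent) -> str:
    """Run each layer of proposer agents, then let the aggregator finalize."""
    drafts: List[str] = []
    for proposers in layers:
        # The first layer sees the raw question; later layers also see the
        # drafts produced by the layer before them.
        if drafts:
            prompt = build_aggregation_prompt(question, drafts)
        else:
            prompt = question
        drafts = [agent(prompt) for agent in proposers]
    return aggregator(build_aggregation_prompt(question, drafts))


# The model names here are labels only; no real model is called.
qwen = make_stub_agent("qwen3-32b")
llama = make_stub_agent("llama-3.3-70b")
final_answer = mixture_of_agents(
    "What does a router do in a mixture of experts?",
    layers=[[qwen, llama], [qwen, llama]],
    aggregator=make_stub_agent("aggregator"),
)
```

Each layer costs one model call per agent and the layers run sequentially, so end-to-end latency grows with depth. That is one reason the talk pairs this design with fast inference.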

To ground this new paradigm in its roots, we’ll also explore how Mixture of Experts (MoE) architectures continue to push the boundaries of scale and specialization. Learn how Cerebras trains state-of-the-art MoEs from Daria Soboleva, Head Research Scientist.

About Daniel Kim
I'm currently the Head of Growth at Cerebras Systems, the world's fastest provider of AI Inference built on the Cerebras Wafer-Scale Engine. I live in sunny and foggy San Francisco, CA. You can find me relaxing in the park, eating spicy noodles, and recently running!

About Daria Soboleva
Daria Soboleva is a Head Research Scientist at Cerebras working on efficient AI systems. Prior to Cerebras, Daria worked at Google, building expertise in research and engineering. She's the creator of SlimPajama (627B token dataset with 1M+ downloads) and BTLM-3B-8K, a model achieving 7B-level performance with less compute. Daria specializes in optimizing LLM architectures with a focus on mixture-of-experts models and hardware-efficient training.

Recorded at the AI Engineer World's Fair in San Francisco. Stay up to date on our upcoming events and content by joining our newsletter here: https://www.ai.engineer/newsletter