
Llamafile: bringing AI to the masses with fast CPU inference: Stephen Hood and Justine Tunney

48.5K views · Jul 16, 2024 · 17:24 min
Takeaway

Llamafile turns LLM weights into a portable, hardware-agnostic executable while pushing CPU inference performance close to GPU territory through targeted matmul optimizations.

Summary

  • Llamafile is a Mozilla open-source project that packages LLM weights as a single Cosmopolitan executable, which runs on six OSes (Linux, macOS, Windows, FreeBSD, OpenBSD, NetBSD) without installation.
  • Reported CPU inference speed-ups of 30–500% over llama.cpp, depending on hardware and model, achieved by unrolling the outer loops of matrix multiplication and by exploiting AVX-512 where the CPU supports it; hardware cited includes AMD Threadripper and Intel Alder Lake.
  • Includes tinyBLAS, so GPU inference works on Windows with only the driver installed, removing the need to ship 500MB+ of proprietary CUDA blobs.
  • Models run fully locally with no network access; Hugging Face lists llamafile as a filterable file type, and Mozilla publishes its own llamafiles there.
  • Framing: in an era of Nvidia GPU dependence, CPU inference unlocks a planet of existing affordable hardware and keeps open-source AI competitive.
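
The outer-loop unrolling mentioned in the summary can be illustrated with a minimal C sketch. This is not llamafile's actual kernel: the function names (`matmul_naive`, `matmul_unrolled_2x2`, `dot_cell`, `kernels_agree`), the 2x2 tile size, and the row-major layout are all assumptions made for illustration, and real kernels would typically add SIMD and larger tiles. The idea shown is only that unrolling the two outer loops lets each value loaded from A and B feed several accumulators that can stay in registers.

```c
#include <assert.h>
#include <stdlib.h>

/* Reference kernel: C = A * B, all row-major.
   A is m x k, B is k x n, C is m x n. */
void matmul_naive(const float *A, const float *B, float *C,
                  int m, int n, int k) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += A[i * k + l] * B[l * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* One output cell; used for the edges a 2x2 tile cannot cover. */
float dot_cell(const float *A, const float *B, int i, int j, int n, int k) {
    float acc = 0.0f;
    for (int l = 0; l < k; ++l)
        acc += A[i * k + l] * B[l * n + j];
    return acc;
}

/* Same product with the two OUTER loops unrolled by 2 (hypothetical tile
   size). Each inner iteration does 4 loads and 4 multiply-adds, versus
   2 loads per multiply-add in the naive kernel. */
void matmul_unrolled_2x2(const float *A, const float *B, float *C,
                         int m, int n, int k) {
    int i = 0;
    for (; i + 1 < m; i += 2) {
        int j = 0;
        for (; j + 1 < n; j += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (int l = 0; l < k; ++l) {
                float a0 = A[i * k + l];        /* reused for c00, c01 */
                float a1 = A[(i + 1) * k + l];  /* reused for c10, c11 */
                float b0 = B[l * n + j];        /* reused for c00, c10 */
                float b1 = B[l * n + j + 1];    /* reused for c01, c11 */
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * n + j] = c00;
            C[i * n + j + 1] = c01;
            C[(i + 1) * n + j] = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
        for (; j < n; ++j) { /* last column when n is odd */
            C[i * n + j] = dot_cell(A, B, i, j, n, k);
            C[(i + 1) * n + j] = dot_cell(A, B, i + 1, j, n, k);
        }
    }
    for (; i < m; ++i) /* last row when m is odd */
        for (int j = 0; j < n; ++j)
            C[i * n + j] = dot_cell(A, B, i, j, n, k);
}

/* Returns 1 if both kernels agree on small-integer test data,
   0 on a mismatch, -1 if allocation fails. */
int kernels_agree(int m, int n, int k) {
    assert(m > 0 && n > 0 && k > 0);
    float *A = malloc(sizeof(float) * m * k);
    float *B = malloc(sizeof(float) * k * n);
    float *C1 = malloc(sizeof(float) * m * n);
    float *C2 = malloc(sizeof(float) * m * n);
    int ok = -1;
    if (A && B && C1 && C2) {
        /* Small integers keep every float sum exact, so == is safe. */
        for (int x = 0; x < m * k; ++x) A[x] = (float)((x % 7) - 3);
        for (int x = 0; x < k * n; ++x) B[x] = (float)((x % 5) - 2);
        matmul_naive(A, B, C1, m, n, k);
        matmul_unrolled_2x2(A, B, C2, m, n, k);
        ok = 1;
        for (int x = 0; x < m * n; ++x)
            if (C1[x] != C2[x]) ok = 0;
    }
    free(A); free(B); free(C1); free(C2);
    return ok;
}
```

Calling `kernels_agree(5, 7, 9)` from a `main` function exercises both the tiled path and the odd-sized edges. The sketch checks correctness only; any speed-up depends on the compiler, the tile size, and the CPU, so it should be measured rather than assumed.
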
llamafile · cpu-inference · open-source
Original description
Mozilla's Llamafile open source project democratizes access to AI not only by making open models easier to use, but also by making them run fast on consumer CPUs. Lead developer Justine Tunney will share the insights, tricks, and hacks that she and the project community are using to deliver these performance breakthroughs, and project leader Stephen Hood will discuss Mozilla's approach to supporting open source AI.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule

About Stephen
Open source AI at Mozilla. Formerly of del.icio.us, Yahoo Search. Co-founder of Storium (AI-assisted storytelling game) and Blockboard.

About Justine
Justine is a founder of Mozilla’s Llamafile project, a Google Brain alumna, and the owner of the Cosmopolitan C Library. She's focusing on democratizing access to open source AI software while elevating its performance and quality.