Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

9.3K views · Apr 20, 2026 · 10:50 min

Takeaway

Modern iPhones plus MLX make on-device LLMs production-viable (Gemma 4 runs at about 40 tok/s). For the fastest path, stay within 4-bit to 8-bit quantization and start from the pre-quantized weights published by MLX Community.

Summary

Adrien Grondin (Locally AI, recently acquired by LM Studio) demos Gemma 4 running on an iPhone at ~40 tok/s via Apple's MLX framework, using MLX Swift LM.
Pipeline: pick quantized weights (4-bit to 8-bit) from MLX Community on Hugging Face (roughly 4,000 to 5,000 models), pass the model ID to MLX Swift LM, and integrate it into an iOS app in under 10 minutes of coding.
Quantization guidance: stay between 4-bit and 8-bit; below 4-bit, output quality degrades sharply.
Small models: 300-350M-parameter Liquid models are light enough to run inside iOS Shortcuts for text-processing automation.
Ecosystem: MLX is expanding to MLX VLM (vision), MLX Audio, and MLX Video; tool calling is supported, while structured generation is still maturing.
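
Some back-of-the-envelope arithmetic makes the 4-bit to 8-bit guidance concrete. The Python sketch below is not from the talk. It assumes group-wise affine quantization in which every group of 64 weights stores one 16-bit scale and one 16-bit bias; the group size, the metadata widths, and all function names are illustrative assumptions. It uses the 350M figure from the summary as a parameter count and estimates two things: weight storage, and the worst-case rounding error of a uniform quantizer over a normalized value range.

```python
# Rough estimates for group-wise quantized weights.
# ASSUMPTIONS (not from the talk): groups of 64 weights, each group carrying
# a 16-bit scale and a 16-bit bias next to the packed low-bit weights.

def effective_bits_per_weight(bits: int, group_size: int = 64,
                              meta_bits: int = 32) -> float:
    """Packed weight bits plus the per-group scale/bias overhead."""
    return bits + meta_bits / group_size


def weight_megabytes(num_params: float, bits: int,
                     group_size: int = 64) -> float:
    """Approximate size of the quantized weights alone, in MB (10**6 bytes)."""
    return num_params * effective_bits_per_weight(bits, group_size) / 8 / 1e6


def fp16_megabytes(num_params: float) -> float:
    """Size of unquantized 16-bit weights, in MB, as a baseline."""
    return num_params * 16 / 8 / 1e6


def max_rounding_error(bits: int, value_range: float = 1.0) -> float:
    """Worst-case error of a uniform quantizer: half of one step.

    With 2**bits levels spread over value_range, the step size is
    value_range / (2**bits - 1).
    """
    return value_range / (2 ** bits - 1) / 2


if __name__ == "__main__":
    params = 350e6  # the small-model size mentioned in the summary
    print(f"fp16 baseline: {fp16_megabytes(params):.1f} MB")
    for bits in (8, 4, 3, 2):
        size = weight_megabytes(params, bits)
        err = max_rounding_error(bits)
        print(f"{bits}-bit: {size:.1f} MB, max rounding error {err:.4f}")
```

Under these assumptions, a 350M-parameter model needs roughly 197 MB of weights at 4-bit versus 700 MB at 16-bit. Each bit removed roughly doubles the rounding step, which is one intuition for why quality falls off quickly below 4-bit. Actual quality loss depends on the model and is not captured by this estimate.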

on-device · mlx · gemma

Original description

See more: https://x.com/adrgrondin/status/2040512861953270226

Speaker info:
- https://x.com/adrgrondin