Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI
Takeaway
Modern iPhones plus MLX make on-device LLMs production-viable (Gemma 4 at ~40 tok/s); quantize to between 4 and 8 bits and use MLX Community weights for the fastest path.
Summary
- Adrien Grondin (Locally AI, recently acquired by LM Studio) demos Gemma 4 running on iPhone at ~40 tok/s via Apple's MLX framework using MLX Swift LM.
- Pipeline: grab quantized weights (4-bit to 8-bit) from MLX Community on Hugging Face (~4-5k models), pass the model ID to MLX Swift LM, and integrate in under 10 minutes of iOS code.
- Quantization guidance: stay between 4-bit and 8-bit; below 4-bit, quality degrades sharply.
- Smaller 300-350M-parameter Liquid models run in iOS Shortcuts for text-processing automation.
- MLX ecosystem expanding to MLX VLM (vision), MLX Audio, MLX Video; tool calling supported, structured generation still maturing.
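
To make the 4-bit versus 8-bit guidance concrete, here is a rough sketch of how weight storage scales with bit width for the 300-350M-parameter models mentioned above. The group size of 64 and the 32 bits of per-group metadata (a 16-bit scale plus a 16-bit bias) are assumptions about a typical group-wise quantization scheme, not figures from the talk, and the function name is hypothetical. The estimate covers weights only; it ignores the KV cache, activations, and any layers left unquantized.

```python
def quantized_weight_bytes(num_params: int, bits: int,
                           group_size: int = 64,
                           overhead_bits_per_group: int = 32) -> float:
    """Approximate weight storage in bytes for group-wise quantization.

    Assumption (not from the talk): each group of `group_size` weights
    shares `overhead_bits_per_group` bits of metadata (scale and bias).
    """
    # Metadata is amortized across every weight in the group.
    effective_bits = bits + overhead_bits_per_group / group_size
    return num_params * effective_bits / 8


if __name__ == "__main__":
    for params in (300_000_000, 350_000_000):
        for bits in (4, 8):
            megabytes = quantized_weight_bytes(params, bits) / 1e6
            print(f"{params / 1e6:.0f}M params at {bits}-bit: ~{megabytes:.0f} MB")
```

Under these assumptions, moving from 8-bit to 4-bit roughly halves the weight footprint, which is why the 4-bit end of the range is attractive on a memory-constrained phone.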
on-device, mlx, gemma
Original description
See more: https://x.com/adrgrondin/status/2040512861953270226
Speaker info: https://x.com/adrgrondin