Running LLMs locally: Practical LLM Performance on DGX Spark - Mozhgan Kabiri Chimeh, NVIDIA

3.8K views · Apr 10, 2026 · 10:15 min

Takeaway

DGX Spark with NVFP4 quantization lets developers run 14B-class models locally at user-facing speeds and scale the same stack to production.

Summary

  • NVIDIA DGX Spark (GB10 Grace Blackwell superchip, 128GB unified memory, NVLink) targets under-desk local development with models of up to ~200B parameters.
  • The benchmarking harness runs vLLM in Docker with three mandatory warm-up runs, GPU metrics logged every second, and versioned artifacts for each run to keep results reproducible.
  • The 1.5B instruct model reaches 61.73 tok/s; the 14B NVFP4-quantized model still reaches 20.19 tok/s, which is above human reading speed, with a 3.4x faster TTFT (time to first token) than the unquantized 14B base.
  • NVFP4, a 4-bit floating-point quantization format on Blackwell, is the engineering sweet spot: the quantization format matters as much as the hardware.
  • Same NVIDIA software stack runs on Spark and in datacenters, so workflows move from desktop to cloud with minimal change.
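
The warm-up-then-measure procedure described in the summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's harness: `measure_once`, `benchmark`, and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in that sleeps instead of calling a vLLM server, so the numbers it prints say nothing about DGX Spark. The sketch only shows how TTFT and decode tok/s can be separated, and where the discarded warm-up runs fit.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Iterator


def measure_once(stream: Iterable[str]) -> dict:
    """Consume one token stream; return TTFT and decode throughput."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            first = time.perf_counter()  # arrival time of the first token
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Decode rate counts only the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"ttft_s": first - start, "tok_per_s": rate, "tokens": count}


def benchmark(make_stream: Callable[[], Iterable[str]],
              warmups: int = 3, runs: int = 5) -> dict:
    """Discard `warmups` runs, then average TTFT and tok/s over `runs` runs."""
    for _ in range(warmups):
        for _ in make_stream():  # warm-up output is consumed and ignored
            pass
    results = [measure_once(make_stream()) for _ in range(runs)]
    return {
        "ttft_s": mean(r["ttft_s"] for r in results),
        "tok_per_s": mean(r["tok_per_s"] for r in results),
        "runs": runs,
    }


def fake_stream(n_tokens: int = 20, prefill_s: float = 0.02,
                per_token_s: float = 0.002) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response (assumption)."""
    time.sleep(prefill_s)  # simulated prompt processing before the first token
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulated per-token decode latency
        yield f"tok{i}"


if __name__ == "__main__":
    summary = benchmark(fake_stream, warmups=3, runs=5)
    print(f"TTFT {summary['ttft_s'] * 1000:.1f} ms, "
          f"decode {summary['tok_per_s']:.1f} tok/s over {summary['runs']} runs")
```

To measure a real deployment, `fake_stream` would be replaced by a function that yields tokens from a streaming response. GPU metrics logging and artifact versioning are left out of this sketch.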

Tags: inference, hardware, nvidia

Original description
Moving LLM workloads from the cloud to local infrastructure requires a shift in engineering strategy. In this talk, I share my journey of serving and benchmarking open-source models (1.5B to 14B) on an NVIDIA DGX Spark workstation. Using a reproducible methodology with vLLM, I analyze real-world trade-offs in throughput, latency, and the benefits of the 128GB Grace Blackwell unified memory architecture. You will leave with a clear framework for local model sizing, an understanding of quantization performance like NVFP4, and a guide for when local compute is the right choice for your AI stack.

Speaker info:
- LinkedIn https://www.linkedin.com/in/mozhgankch/