Customized, production ready inference with open source models: Dmytro (Dima) Dzhulgakov

1.6K views · Feb 16, 2025 · 18:54 min

Takeaway

Customize and fine-tune open models on a dedicated serving stack to beat frontier models on cost and latency for narrow production workloads.

Summary

Fireworks AI's founders are former PyTorch leads from Meta; they built a custom serving stack from the kernels up and tune it per customer for latency, throughput, and cost.
Smaller fine-tuned open models can match GPT-4 quality on narrow domains while running ~10x faster and cheaper; a Berkeley benchmark is cited for function calling.
Long-prompt RAG workloads benefit from runtime tuning and caching; Fireworks tops the Artificial Analysis long-prompt benchmarks.
Supports multi-modality: SDXL, SD3 (powers Stability's own API), ASR/TTS; multi-LoRA serving lets thousands of fine-tunes share a GPU with serverless per-token pricing.
Argues the future is many specialized models orchestrated together, not one giant model serving everything.
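
To make the multi-LoRA point concrete, here is a minimal sketch of the general idea: one set of base weights stays loaded, and each request selects a small low-rank adapter applied on top. It is not Fireworks' implementation. The names `LoraAdapter` and `MultiLoraLayer`, the adapter names, and the toy 2x2 weights are all hypothetical, and plain Python lists stand in for GPU tensors.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass
class LoraAdapter:
    """Hypothetical low-rank update: delta_W = scale * (b @ a)."""
    a: Matrix           # shape: rank x d_in
    b: Matrix           # shape: d_out x rank
    scale: float = 1.0


class MultiLoraLayer:
    """One shared base weight plus many small per-tenant adapters."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight            # loaded once, shared by all requests
        self.adapters: Dict[str, LoraAdapter] = {}

    def register(self, name: str, adapter: LoraAdapter) -> None:
        self.adapters[name] = adapter

    def forward(self, x: Vector, adapter_name: Optional[str] = None) -> Vector:
        y = matvec(self.base_weight, x)           # base model path
        if adapter_name is None:
            return y
        ad = self.adapters[adapter_name]
        # Low-rank path: project down with a, back up with b, never forming b @ a.
        delta = matvec(ad.b, matvec(ad.a, x))
        return [yi + ad.scale * di for yi, di in zip(y, delta)]


# Toy setup: identity base weight and two rank-1 adapters (made-up values).
layer = MultiLoraLayer([[1.0, 0.0], [0.0, 1.0]])
layer.register("support", LoraAdapter(a=[[1.0, 0.0]], b=[[0.0], [2.0]]))
layer.register("legal", LoraAdapter(a=[[0.0, 1.0]], b=[[1.0], [0.0]], scale=0.5))

x = [3.0, 4.0]
print(layer.forward(x))             # base model only
print(layer.forward(x, "support"))  # same base weights, "support" fine-tune
print(layer.forward(x, "legal"))    # same base weights, "legal" fine-tune
```

Each adapter stores only `rank * (d_in + d_out)` numbers per layer rather than a full `d_in * d_out` matrix, which is why many fine-tunes can plausibly sit next to a single copy of the base model.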

fireworks · inference · fine-tuning

Original description

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Dmytro
Dmytro (Dima) Dzhulgakov is the co-founder and CTO of @fireworksai, which focuses on the transition to AI-powered business via interactive experimentation and a production platform centered around PyTorch technologies. Fireworks.ai offers a high-performance, low-cost LLM inference service that helps users try out and productionize large models.

Dmytro is one of the PyTorch core maintainers. Previously, he helped bring PyTorch from a research framework to numerous production applications across Meta's AI use cases and the broader industry.