Making Open Models 10x faster and better for Modern Application Innovation: Dmytro (Dima) Dzhulgakov

1.8K views · Oct 09, 2024 · 18:54 min

Takeaway

Open-model inference can be roughly 10x cheaper and faster than frontier APIs when you co-design the serving stack, the optimization knobs, and per-tenant LoRA packing.

Summary

Dima (Fireworks AI co-founder and former PyTorch core maintainer at Meta) argues that open models will dominate inference for two reasons: they can be adapted to a specific domain, and they cost less. A moderate-traffic app can spend millions of dollars per year on GPT-4o.
A tuned Llama 3 can match GPT-4-class models on function calling at ~10x the speed (Berkeley benchmark). Customization at the runtime and deployment level enables long-prompt RAG and image generation (SDXL and SD3; Stability's API routes to Fireworks).
Fireworks built a custom serving stack from the CUDA kernels up; LoRA tricks pack thousands of fine-tuned variants onto the same GPU with serverless, pay-per-token economics.
Fireworks is the fastest provider on Artificial Analysis benchmarks for long prompts, and it supports many modalities, including ASR/TTS and Voyage-AI-style embeddings.
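The LoRA packing idea in the summary can be illustrated with a toy sketch. This is not Fireworks' implementation, which the talk describes only at a high level. It assumes the standard LoRA formulation (a shared base weight plus a per-tenant low-rank update B·A), and the names `LoraAdapter`, `MultiLoraLinear`, and `tenant_id` are hypothetical. It shows why thousands of variants can share one GPU: the large base matrix is stored once, and each tenant adds only two thin matrices.

```python
"""Toy multi-LoRA layer: one shared base weight, many small per-tenant adapters.

Illustrative only; all class and parameter names here are hypothetical.
"""
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass
class LoraAdapter:
    """Low-rank update delta_W = B @ A, with rank r much smaller than the layer size."""
    a: Matrix  # shape (r, d_in)
    b: Matrix  # shape (d_out, r)
    scale: float = 1.0

    def delta(self, x: Vector) -> Vector:
        # Two thin products cost O(r * (d_in + d_out)) instead of O(d_in * d_out).
        return [self.scale * y for y in matvec(self.b, matvec(self.a, x))]


class MultiLoraLinear:
    """Linear layer whose base weight is shared by every tenant."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight  # stored once, shape (d_out, d_in)
        self.adapters: Dict[str, LoraAdapter] = {}

    def register(self, tenant_id: str, adapter: LoraAdapter) -> None:
        self.adapters[tenant_id] = adapter

    def forward(self, batch: List[Tuple[Optional[str], Vector]]) -> List[Vector]:
        """Serve a mixed batch: each request names its tenant (or None for the base model)."""
        outputs = []
        for tenant_id, x in batch:
            y = matvec(self.base_weight, x)  # shared computation
            adapter = self.adapters.get(tenant_id) if tenant_id is not None else None
            if adapter is not None:
                # Per-tenant correction added on top of the shared result.
                y = [yi + di for yi, di in zip(y, adapter.delta(x))]
            outputs.append(y)
        return outputs


if __name__ == "__main__":
    layer = MultiLoraLinear(base_weight=[[1.0, 0.0], [0.0, 1.0]])
    layer.register("tenant-a", LoraAdapter(a=[[1.0, 0.0]], b=[[0.0], [2.0]]))
    print(layer.forward([("tenant-a", [3.0, 4.0]), (None, [3.0, 4.0])]))
```

A production stack would presumably fuse the per-request adapter math into batched GPU kernels rather than loop in Python; the sketch only conveys the memory argument, since adding a tenant adds r·(d_in + d_out) parameters rather than a full copy of the model.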

inference · open-models · fireworks-ai

Original description

Generative AI powers the next generation of real-time applications. The key to success in modern application development in the Gen AI era is a secure, latency-sensitive, and low-cost LLM serving solution, which Fireworks' enterprise-grade deployment provides. Fireworks AI accelerates innovation through its SaaS platform for low-latency inference and high-quality fine-tuning of 100+ models, across state-of-the-art LLMs and image/video/audio generation, embedding, and multimodal models. These advantages are delivered through Fireworks' proprietary FireAttention technology, which is 4x-15x faster than the OSS alternatives. To bring the totality of knowledge together, Fireworks tuned its own FireFunction model to integrate hundreds of models and API calling. Fireworks' adoption is the fastest in the industry, and it also enables a software stack capable of extracting the most from different hardware and deployment options.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Dmytro
Dmytro (Dima) Dzhulgakov is the co-founder and CTO of @fireworksai, which focuses on the transition to AI-powered business via interactive experimentation and a production platform centered around PyTorch technologies. Fireworks.ai offers a high-performance, low-cost LLM inference service that helps users try out and productionize large models.

Dmytro is one of the PyTorch core maintainers. Previously he helped bring PyTorch from a research framework to numerous production applications across Meta's AI use cases and the broader industry.