Build enterprise generative AI apps using Llama 3 at 1,000 tokens/s on the SambaNova AI platform

3.8K views · Sep 11, 2024 · 54:34 min

Takeaway

SambaNova bundles its own chip, system, and software into a turnkey enterprise inference stack that reaches 1,000 tokens/s on Llama-3 while keeping data private for "Sovereign AI" deployments.

Summary

SambaNova (Stanford spinout, founded 2017, $1B+ funding from BlackRock, Google Ventures, Intel, GIC) pitches a full-stack AI platform on its 4th-generation chip targeting 1000+ tokens/sec Llama-3 inference.
Claims to outperform other inference providers on throughput in Artificial Analysis benchmarks; the demo's 'One Turbo' (formerly Sambaverse) provides a live benchmark.
Targets enterprise and government 'Sovereign AI' customers who need data privacy, ownership, and cost control as an alternative to per-token GPT/Claude pricing.
Workshop covered fine-tuning, pre-training, and deployment on the integrated stack, so customers do not have to pick a chip, OS, and model individually.
Argues that the future is many smaller expert models fine-tuned on proprietary data, not monolithic closed APIs.
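To put the 1,000 tokens/s figure in perspective, the arithmetic below converts a decode rate into the time needed to stream a response. It is a back-of-the-envelope sketch only: it assumes a steady decode rate and ignores time to first token, network latency, and batching effects. The 50 tokens/s baseline is a hypothetical comparison point, not a number from the talk.

```python
def decode_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a steady decode rate (prefill ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# 1,000 tokens/s is the rate quoted in the talk; 50 tokens/s is a hypothetical baseline.
for rate in (50.0, 1000.0):
    seconds = decode_seconds(500, rate)
    print(f"500-token answer at {rate:.0f} tokens/s: {seconds:.1f} s")
```

The gap matters most for multi-step pipelines, where several model calls run back to back and their decode times add up.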

inference · sambanova · hardware

Original description

In this workshop, you will learn how to build LLM-based apps, such as a question-answering system with RAG, in LangChain using Llama-3 at 1,000 tokens per second on the SambaNova AI Platform.

Level: Intermediate

SambaNova delivers generative AI capabilities to the enterprise. In this workshop, you will learn:

● About SambaNova’s full-stack generative AI platform, powered by the SN40L AI chip and delivering unparalleled performance for training and inference
● Samba-1, a trillion-parameter composition of experts (CoE) model, and how it can be used in enterprise settings
● How to build and deploy a question-answering app end-to-end with retrieval augmented generation (RAG) for enterprise search using the following suite: LangChain as the framework, Unstructured for pre-processing text documents, E5-large-v2 embeddings, a ChromaDB vector store, and Llama-3-8B-Instruct running at a record speed of 1,000 tokens per second via SambaNova.
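The RAG flow named in the last bullet (pre-process, embed, store, retrieve, generate) can be sketched with the standard library alone. This is not the workshop's code and it calls none of the real LangChain, Unstructured, E5, ChromaDB, or SambaNova APIs. Every function and class below is a hypothetical stand-in: `chunk` for Unstructured, `embed` for E5-large-v2 (here a sparse bag-of-words vector rather than a dense embedding), `VectorStore` for ChromaDB, and the `llm` callable for the Llama-3-8B-Instruct endpoint.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into fixed-size word windows (stand-in for Unstructured)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> dict[str, float]:
    """L2-normalised bag-of-words vector (toy stand-in for E5-large-v2)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


@dataclass
class VectorStore:
    """In-memory store with cosine-similarity search (stand-in for ChromaDB)."""
    chunks: list[str] = field(default_factory=list)
    vectors: list[dict[str, float]] = field(default_factory=list)

    def add(self, texts: list[str]) -> None:
        for text in texts:
            self.chunks.append(text)
            self.vectors.append(embed(text))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(w * v.get(tok, 0.0) for tok, w in q.items()), c)
            for v, c in zip(self.vectors, self.chunks)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [c for _, c in scored[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a grounded prompt from the retrieved chunks."""
    joined = "\n\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question: str, store: VectorStore, llm: Callable[[str], str], k: int = 2) -> str:
    """Retrieve the top-k chunks and pass the assembled prompt to the model."""
    return llm(build_prompt(question, store.search(question, k)))


# Demo with a placeholder "model" that just echoes the prompt it receives.
demo_store = VectorStore()
for doc in (
    "The SN40L chip powers the SambaNova platform.",
    "ChromaDB stores embedding vectors for retrieval.",
    "Streamlit renders the demo user interface.",
):
    demo_store.add(chunk(doc))

print(answer("Which chip powers the platform?", demo_store, llm=lambda p: p, k=1))
```

Keeping the model behind a plain callable means the echo placeholder can be swapped for a real endpoint client without touching the retrieval code; the repo linked in this description contains the actual notebooks.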

This workshop is designed for tech professionals, engineers, and anyone interested in enterprise generative AI applications.

Prerequisites: Programming experience (ideally in Python), a GitHub account, and a laptop

Assets: We will provide a link to the GitHub repo with step-by-step instructions on how to install the required libraries and how to run the Jupyter notebooks and Streamlit apps. We will also provide SambaNova API keys for the CoE and Llama-3 endpoints.

GitHub Repo: https://github.com/sambanova/ai-starter-kit/tree/main/workshops/ai_engineer_2024/
Dev Setup for Exercise 1: https://github.com/sambanova/ai-starter-kit/blob/main/workshops/ai_engineer_2024/basic_examples/README.md
Dev Setup for Exercise 2: https://github.com/sambanova/ai-starter-kit/blob/main/workshops/ai_engineer_2024/ekr_rag/README.md

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Varun
Varun is a Sr Principal AI Solutions Engineer at SambaNova Systems. He is currently investigating the benefits of fine-tuning embedding & decoder LLMs in retrieval augmented generation (RAG). Previously, he led the deployment of AI/ML applications across CRM, e-commerce, healthcare, finance, energy, manufacturing, fraud detection, and cyber security at Fortune 500 enterprises. He has worked at C3.ai, Cisco, IBM Research, ABB, and research institutes in Singapore. He holds a Ph.D. in Computer Engineering from the University of Illinois at Urbana-Champaign.

About Petro
Petro is a Principal Engineer in the AI Solutions team at SambaNova Systems, where he is currently working on developing applications powered by large language models. His expertise spans AI for Science and Generative AI. He obtained his PhD and MS from the Georgia Institute of Technology and previously interned at Argonne National Laboratory. He has given several tutorials at conferences, e.g., Supercomputing, and in 2023, he was selected to participate in the prestigious US Frontiers of Engineering organized by the National Academy of Engineering.