The Hierarchy of Needs for Training Dataset Development: Chang She and Noah Shpak

3.0K views · Oct 15, 2024 · 16:32 min

Takeaway

Training-data infrastructure is the bottleneck. The Lance format plus a materialization service enable Character.AI's iteration speed by supporting filtering, shuffling, and blob streaming simultaneously.

Summary

Chang She (LanceDB CEO and an original pandas contributor) and Noah Shpak (Character.AI data platform lead) describe the data hierarchy of needs for foundation-model training: clean data, evals, mixtures, analytics, classification, then synthetic data and human labeling at the top.
Pre-training cares about breadth (domains, token counts, scaling laws); post-training cares about granular per-example difficulty (math problem hardness, code function count, multiple-choice ease).
Character.AI uses Spark/Trino plus GPU services for prompting and embedding to enrich data: quality scoring via in-house classifiers or prompt-based classification, plus synthetic preference pairs to bootstrap method exploration.
A materialization service separates training-job logic from data prep (a YAML file with a SQL block describes the outputs), allowing iteration without rewriting training loops, which is critical at multimodal data volumes.
The Lance format addresses an 'AI data CAP theorem': workloads need fast scans (filter), fast random access (shuffle), and large-blob streaming all at once, while existing formats handle at most two.
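
To make the materialization idea concrete, here is a minimal sketch of a spec that declares an output with a SQL filter, with the training code seeing only the materialized rows. It is not Character.AI's service or the Lance API: the spec fields (`name`, `sql`, `shuffle_seed`), the `examples` table, and every function name are hypothetical, a plain dict stands in for the YAML file, and SQLite stands in for the real query engine.

```python
import random
import sqlite3

# Hypothetical spec. The talk describes YAML with a SQL block; a plain dict
# stands in here so the sketch needs no third-party dependencies.
SPEC = {
    "name": "math_hard_v1",  # hypothetical output name
    "sql": "SELECT id, text FROM examples "
           "WHERE domain = 'math' AND difficulty >= 0.7",
    "shuffle_seed": 0,
}


def materialize(conn, spec):
    """Run the spec's SQL filter, then shuffle the rows deterministically."""
    rows = conn.execute(spec["sql"]).fetchall()
    random.Random(spec["shuffle_seed"]).shuffle(rows)
    return rows


def build_demo_db():
    """Create a tiny in-memory table standing in for an enriched dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE examples "
        "(id INTEGER, domain TEXT, difficulty REAL, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO examples VALUES (?, ?, ?, ?)",
        [
            (1, "math", 0.9, "prove the identity"),
            (2, "math", 0.2, "add two numbers"),
            (3, "code", 0.8, "refactor the module"),
            (4, "math", 0.75, "solve the recurrence"),
        ],
    )
    return conn


def train_step(batch):
    """Placeholder for a training loop; it only sees materialized rows."""
    return len(batch)


if __name__ == "__main__":
    conn = build_demo_db()
    rows = materialize(conn, SPEC)
    print(SPEC["name"], sorted(r[0] for r in rows))
    print("rows seen by trainer:", train_step(rows))
```

Changing the data mixture in this sketch means editing only `SPEC["sql"]`; `train_step` stays untouched, which is the separation the summary attributes to the materialization service.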

datasets · lance · training

Original description

Training and fine-tuning models depends critically on how you construct your dataset. Part art, part science, we’ll share with you practical lessons in dataset construction at Character AI and how to build a data platform to support rapid iterative refinement of training data. For LLMs, data scale is much larger and workloads are more diverse. This is especially true for multimodal datasets. To deal with these challenges, we'll show you how LanceDB is used in production to solve many pain-points around the storage, management, and querying of large scale AI data.

Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at https://www.ai.engineer/worldsfair/2024/schedule & join us at the AI Engineer World's Fair in 2025! Get your tickets today at https://ai.engineer/2025

About Chang
Chang She is the CEO and cofounder of LanceDB, the developer-friendly, open-source database for multi-modal AI. A serial entrepreneur, Chang has been building DS/ML tooling for nearly two decades and is one of the original contributors to the pandas library. Prior to founding LanceDB, Chang was VP of Engineering at TubiTV, where he focused on personalized recommendations and ML experimentation.

About Noah
Noah is a Research Engineer with a passion for building data systems and ML platforms from the ground up.

He leads the Data Platform team at Character, focusing on accelerating foundation model research, alignment, and product development through internet-scale data mining, prompting tools, and retrieval systems. Making data go vroom while GPUs go brrrr is what makes him (and the team) tick!