🏗️ AI Infrastructure
GPU clusters, training stacks, autoscaling inference, data pipelines, feature stores, observability for AI workloads.
The workflow
flowchart LR
A[Workload] --> B{Stage}
B -->|Train| C[Multi-GPU cluster<br/>NVLink / RDMA]
B -->|Serve| D[Autoscaling fleet<br/>spot + reserved]
B -->|Data| E[Lakehouse +<br/>feature store]
C --> F[Observability<br/>traces + metrics]
D --> F
E --> F
F --> G[Cost / perf<br/>tuning loop]
Training is bursty; serving is steady; the bill is paid by serving.
Key takeaways
Videos (36)
The Geopolitics of AI Infrastructure - Dylan Patel, SemiAnalysis
Export controls have not stopped Chinese compute scale-up, and Middle East gigawatt-scale builds are reshaping where frontier training happens.
CI/CD Is Dead, Agents Need Continuous Compute and Computers — Hugo Santos and Madison Faulkner
Agent-driven development needs a continuous-compute substrate that replaces PR-centric CI/CD with high-throughput, machine-paced merging.
How to Build Your Own AI Data Center in 2025 — Paul Gilbert, Arista Networks
Enterprise AI data centers need rail-optimized, isolated backend GPU networks tuned for job-completion time, sized differently for training vs inference.
Efficient Reinforcement Learning – Rhythm Garg & Linden Li, Applied Compute
Applied Compute scales enterprise RL by trading off policy staleness against throughput in asynchronous pipeline RL — the efficient frontier between speed and learning stability.
Arrakis: How To Build An AI Sandbox From Scratch - Abhishek Bhardwaj, OpenAI
Run AI-agent-generated code inside lightweight microVMs with snapshot/restore rather than containers — speed and isolation both matter at agent scale.
AI Platform Engineering: Patrick Debois
Build a centralized AI platform (models, vectors, connectors, observability, eval monitoring) so feature teams don't each reinvent rag-ops and governance.
Rishabh Garg, Tesla Optimus — Challenges in High Performance Robotics Systems
High-performance robotics depends on pipelined, synchronized comms threads — and most 'bad policy' bugs are actually CAN/thread timing bugs you can only see with bus-level logging.
Productionizing GenAI Models – Lessons from the world's best AI teams: Lukas Biewald
In GenAI, the learnings (not the code) are your IP — track every experiment automatically so iteration time, not feature velocity, becomes your competitive edge.
We accidentally made an AI platform: Jamie Turner
A reactive backend platform turned out to be the right substrate for shipping AI apps with confidence, with vector indexes and component libraries as natural extensions.
Two Roads to Durable Agents: Replay vs. Snapshot — Eric Allam, CEO, Trigger.dev
Durable agents need either replay-style journaling (Temporal) or snapshot-style state capture; replay's determinism constraints make it awkward for LLM-driven workflows.
Unlocking Developer Productivity across CPU and GPU with MAX: Chris Lattner
Framework fragmentation is the bottleneck for production GenAI; MAX targets one runtime spanning CPU and GPU to bridge research and prod.
Building LinkedIn's GenAI Platform — Xiaofeng Wang
LinkedIn's GenAI platform evolved through three generations to support multi-agent systems with distributed orchestration, a skill registry, layered memory, and OpenTelemetry-based observability.
AI Kernel Generation: What's working, what's not, what's next – Natalie Serrino, Gimlet Labs
AI agents can already deliver double-digit kernel speedups on real workloads by iterating compile-run-profile loops, but struggle on the most complex kernels — promising for cross-hardware porting.
GPU-less, Trust-less, Limit-less: Reimagining the Confidential AI Cloud - Mike Bursell
TEE-based confidential AI plus a decentralized marketplace (Super Protocol) enables training and inference on sensitive data without trusting any cloud provider.
Why, and how you need to sandbox AI-Generated Code? — Harshil Agrawal, Cloudflare
Always sandbox AI-generated code with capability-based security — V8 isolates for fast lightweight execution, containers when you need a full Linux environment.
Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai
Autonomous incident debugging requires fusing causal ML, semantics and custom agent control flow — neither AIOps, plain LLMs nor ReAct agents alone can do it.
Compilers in the Age of LLMs — Yusuf Olokoba, Muna
A Python compiler with LLM-assisted verification can turn AI inference code into portable native binaries that run anywhere, sidestepping container-based deployment.
Why We Don't Need More Data Centers - Dr. Jasper Zhang, Hyperbolic
Distributed GPU marketplaces are a faster, cheaper way to satisfy AI compute demand than waiting 7+ years for new hyperscale data centers.
Scaling AI Agents Without Breaking Reliability — Preeti Somal, Temporal
Temporal's decade-old durable workflow engine maps cleanly onto agent reliability needs and is now seeing Python SDK overtake others as agents go to production.
AX is the only Experience that Matters - Ivan Burazin, Daytona
Build tools for agents (Agent Experience) not for humans-with-AI — speed, API-first, agent-readable docs, and autonomy-by-default are the bar.
The Hierarchy of Needs for Training Dataset Development: Chang She and Noah Shpak
Training-data infra is the bottleneck; Lance format + materialization service enables Character.AI's iteration speed by satisfying filter+shuffle+blob-stream simultaneously.
Platforms for Humans and Machines: Engineering for the Age of Agents — Juan Herreros Elorza
To make coding agents productive in enterprise, redesign internal platforms around self-service APIs with shift-left feedback so agents (and humans) can iterate without humans-in-the-loop.
OpenLLMetry is all you need
OpenLLMetry brings the OpenTelemetry standard and ecosystem to LLM apps so you get vendor-neutral, drop-in observability across providers, vector DBs and agent frameworks.
Infrastructure for the Singularity — Jesse Han, Morph
Future agentic infra needs sub-second VM snapshot/branch/replicate primitives so agents can fork environments faster than humans can deploy them.
Context Platform Engineering to Reduce Token Anxiety — Val Bercovici, WEKA
Treat agent context as a tiered storage problem and maximize KV-cache hit rate at the platform layer — that beats prompt-cache arbitrage and most other inference optimizations.
Keynote: The AI developer experience doesn't have to suck – why and how we built Modal
Sub-second container start + Python-native serverless makes Modal feel like local iteration while scaling to thousands of GPUs.
Conquering Agent Chaos — Rick Blalock, Agentuity
Agents need agent-specific infra — long runtime, stateful routing, framework-agnostic deployment — not stateless serverless.
Infra that fixes itself, thanks to coding agents — Mahmoud Abdelwahab, Railway
Combine durable workflows with a headless coding agent (OpenCode) so infrastructure issues become reviewable pull requests instead of pages.
Accelerating Mixture of Experts Training With Rail Optimized InfiniBand Networking in Crusoe Cloud
Rail-optimized InfiniBand on green-powered Crusoe Cloud cuts the all-reduce communication penalty that otherwise idles GPUs for 25-30% of MoE training time.
Luminal - Search-Based Deep Learning Compilers - Joe Fioti
Reduce deep learning to ~12 primitives and let search-based compilers generate the fast code — a path to vastly simpler ML stacks that still hit peak hardware performance.
How agents broke app-level infrastructure - Evan Boyle
Agentic workloads break Web 2.0 infrastructure assumptions about latency and reliability; we need durable execution layers built for seconds-to-hours requests.
Building Agentic Applications w/ Heroku Managed Inference and Agents — Julián Duque & Anush Dsouza
Heroku now ships managed inference + MCP-based agents + pgvector so apps can attach AI and tools the same way they attach Postgres.
Substrate Launch: the API for modular AI
Substrate runs multi-model computation graphs as a coordinated cluster, replacing many slow API calls with microsecond inter-node hops and reliable structured outputs.
Continuous Profiling for GPUs — Matthias Loibl, Polar Signals
Always-on sampled profiling with eBPF + NVML + GPU time attribution finally gives the GPU equivalent of CPU flame charts in production.
Building agent fleet architectures your CISO doesn't hate — Lou Bichard, Gitpod
For regulated buyers, the right agent-fleet architecture is a substrate: customer owns the workload + source code on their cloud, vendor manages the control plane via minimal telemetry — not pure SaaS or pure self-hosted.
Accelerate your AI journey with Azure AI model catalog: Sharmila Chokalingam
Azure positions itself as a unified catalog + serving layer letting enterprises prototype, optimize, and operationalize across 1,600+ models with consistent APIs and data privacy.