Context Platform Engineering to Reduce Token Anxiety — Val Bercovici, WEKA
Takeaway
Treat agent context as a tiered storage problem and maximize KV-cache hit rate at the platform layer — that beats prompt-cache arbitrage and most other inference optimizations.
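To make the hit-rate metric concrete, here is a minimal sketch that treats a KV-cache hit as prefix reuse: a prompt token counts as a hit when an earlier request already produced the same token prefix. It assumes an unbounded cache and token-level matching. Real engines typically match in fixed-size blocks and evict entries, so measured rates would be lower. The function names and the toy token IDs are illustrative and do not come from the WEKA toolkit.

```python
from typing import Iterable, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(requests: Iterable[Sequence[int]]) -> float:
    """Share of prompt tokens whose KV entries an earlier request already computed.

    Assumption: nothing is ever evicted, so this is an upper bound.
    """
    seen: list[Sequence[int]] = []
    hit = total = 0
    for tokens in requests:
        # Best reusable prefix among all earlier requests.
        best = max((common_prefix_len(tokens, s) for s in seen), default=0)
        hit += best
        total += len(tokens)
        seen.append(tokens)
    return hit / total if total else 0.0


# Toy agent loop: each turn appends two tokens to the previous prompt.
turns = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6, 7, 8]]
rate = prefix_hit_rate(turns)  # 10 of 18 prompt tokens are reusable
```

Append-only prompts like the toy loop above are what keep this number high. Editing anything near the start of the context invalidates every cached token after it.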
Summary
- WEKA open-sourced a Context Platform Engineering toolkit with a load generator that simulates agent swarms with SLOs, deterministic/random prompts and disaggregated prefill/decode
- Echoing Manus's context-engineering blog: KV cache hit rate is the single most important production metric for agentic AI
- Without platform-level context engineering, teams resort to 'context financial engineering': arbitraging input/output tokens against the 5-minute and 1-hour cache write/read tiers
- Agent workloads show a massive cadence mismatch: the median time between requests is 10–15s, while the mean stretches to minutes or hours because of human approvals. Most context is tool calls and tool results, not user text
- Multi-agent orchestrator+subagent patterns share common context; SLAs translate to SLOs via KV-cache slots in tiered token storage rather than just compute
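The last two bullets can be sketched together. The idle gap between requests decides which tier has to hold a context, and the KV bytes per token decide how many contexts ("slots") fit in that tier. The 300s and 3600s thresholds mirror the 5-minute and 1-hour cache tiers mentioned above. The gap values, model shape and tier capacities are made-up illustrations, not measurements from the talk. The per-token formula is the standard one for attention KV caches (keys plus values, per layer, per KV head).

```python
from statistics import mean, median


def cadence_summary(gaps_s, ttls_s=(300, 3600)):
    """Median and mean gap, plus the share of gaps that outlive each cache TTL."""
    out = {"median_s": median(gaps_s), "mean_s": mean(gaps_s)}
    for ttl in ttls_s:
        out[f"share_over_{ttl}s"] = sum(g > ttl for g in gaps_s) / len(gaps_s)
    return out


def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Keys and values (the factor 2) for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def slots_in_tier(tier_bytes, context_tokens, bytes_per_token):
    """Whole agent contexts that fit in the KV budget of one tier."""
    return tier_bytes // (context_tokens * bytes_per_token)


# Illustrative gaps in seconds: mostly tool-call cadence, two human approvals.
gaps = [8, 10, 11, 12, 12, 13, 14, 15, 1800, 7200]
stats = cadence_summary(gaps)

# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

# Hypothetical KV budgets per tier, for 100k-token agent contexts.
tiers = {"hbm": 40 * 2**30, "dram": 1 * 2**40}
slots = {name: slots_in_tier(size, 100_000, per_token) for name, size in tiers.items()}

print(stats)
print(per_token, slots)
```

With these numbers the median gap stays near the tool-call cadence while two long approvals drag the mean up by orders of magnitude. Sizing on the mean alone would therefore misjudge how long most contexts actually sit idle.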
kv-cache, context-engineering, inference
Original description
Context Platform Engineering is the set of skills and tools to design, size, and configure systems optimized for Agent Swarm Context, at any scale. “KV-cache hit rate is the single most important metric for a production-stage AI agent,” according to Manus AI. Context platform engineering simplifies the maximization of KV-cache hit rates.

This talk covers WEKA’s new open source context platform engineering toolkit, which helps translate the Service Level Agreement (SLA) requirements of AI agents into Service Level Objectives (SLOs) for the Agent+LLM inference platform that meet those SLAs. We present research results from WEKA Labs that provide new observability into both unit and aggregate KV-cache hit rates for agent swarms of various leading AI coding agents.

The talk concludes with benchmark results for sizing agent swarm context for arbitrary working sets, including context window sizes and latency, concurrency, and throughput SLOs per agent unit (swarm or sub-task). The benchmarks span modern GPU memory hierarchies and support KV-cache offloading plug-ins such as vLLM/LMCache, SGLang HiCache, and NVIDIA Dynamo KVBM/NIXL.

Callan Fox is the product leader for Context Platforms at WEKA, following a series of technical and leadership roles at Dell/EMC, CGI and HPE.

Val Bercovici is the Chief AI Officer at WEKA. Previously he was CTO of NetApp/SolidFire and a founding governing board member of the CNCF, the Linux Foundation body that hosts Kubernetes.

Resources:
- https://www.linkedin.com/pulse/visual-guide-how-ai-agents-use-inference-inside-llm-callan-fox-q9brc
- https://medium.com/@callan.j.fox/evaluating-management-of-kv-cache-within-an-inference-system-2d7c3d266c3a
- https://www.linkedin.com/pulse/importance-context-platform-engineering-callan-fox-i81wc/

Socials:
- LinkedIn: https://www.linkedin.com/in/valentinbercovici
- X (Twitter): https://x.com/AccBalanced
- GitHub: https://github.com/weka/LMCache
- Website: https://www.weka.io/product/augmented-memory-grid/
- Company: WEKA (https://weka.io)