Will Agent Evaluation via MCP Stabilize Agent Networks? - Ari Heljakka

457 views · Jun 03, 2025 · 14:11 min

Takeaway

Exposing evaluators via MCP lets any agent request a score with an explanation inline, turning evals from a one-off harness into a continuous stabilization loop for agent networks.

Summary

Ari Heljakka (CEO of Root Signals) asks whether exposing eval engines via MCP can stabilize agent swarms attempting complex tasks.
Adding an eval stack is not enough on its own: the evaluations themselves must be continuously improved alongside the agents to stay aligned with business requirements.
The evaluation landscape covers reality-grounding (left side) and behavioral aspects (right side: goal inference, progress, tool selection). Example: a hotel reservation agent gets evaluators for policy adherence, output accuracy, and behavior.
Demos Root Signals' open-source MCP server: from Cursor's UI you list judges/evaluators, score an agent's output, get a numeric score plus an explanation, and feed that back to drive a stabilization loop.
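
The score, explain, and feed-back cycle described in the demo can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Root Signals API: `conciseness_evaluator`, `revise`, and `stabilize` are hypothetical names, and the evaluator is a local stand-in for what would be an MCP tool call to a real evaluation engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    score: float         # normalized to the range 0.0 to 1.0
    justification: str   # human-readable explanation of the score


def conciseness_evaluator(text: str) -> Evaluation:
    """Hypothetical stand-in for an evaluator reached through an MCP tool call."""
    words = len(text.split())
    score = min(1.0, 10 / max(words, 1))
    return Evaluation(score, f"{words} words; target is 10 or fewer")


def revise(draft: str, feedback: str) -> str:
    """Hypothetical stand-in for the agent: a real agent would read the feedback.

    Here the draft is simply cut in half on every revision.
    """
    words = draft.split()
    return " ".join(words[: max(1, len(words) // 2)])


def stabilize(
    draft: str,
    evaluate: Callable[[str], Evaluation],
    revise_fn: Callable[[str, str], str],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> Tuple[str, List[float]]:
    """Score the draft, feed the explanation back, and stop once it passes."""
    history: List[float] = []
    for _ in range(max_rounds):
        result = evaluate(draft)
        history.append(result.score)
        if result.score >= threshold:
            break
        draft = revise_fn(draft, result.justification)
    return draft, history


if __name__ == "__main__":
    final, scores = stabilize(" ".join(["word"] * 40), conciseness_evaluator, revise)
    print(scores)
```

The loop keeps the evaluator outside the agent, so the same scoring criteria persist across rounds; the threshold and the round limit are illustrative choices that bound how long the agent may iterate.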

evals · mcp · agent-networks

Original description

Exposing complex AI evaluation frameworks to AI agents via MCP enables a new paradigm in which agents self-improve in a controllable manner. Unlike the often unstable, straightforward self-criticism loops, MCP-accessible evaluation frameworks can provide the persistence layer that stabilizes and standardizes the measure of agents' progress towards plan fulfillment.

In this talk, we show how an MCP-enabled evaluation engine already allows agents to self-improve in a way that is independent of agent architectures and frameworks, and holds promise to become a cornerstone of rigorous agent development.