We’ve been spending a lot of time recently talking to teams deploying AI agents — in healthcare, financial services, energy. Across the board, the same question comes up: how do we know what our agents are actually doing?
It sounds simple. It isn’t. And one reason it isn’t is surprisingly mundane: agent frameworks don’t agree on what to call things.
The Tower of Babel problem
If you’ve worked across more than one agent framework, you’ve hit this. LangChain organizes execution into chains and nodes. LangGraph also uses nodes, but means something subtly different — a node in LangGraph is a stateful computation unit, whereas in earlier LangChain parlance it was closer to a transformation step. CrewAI introduces tasks and crews, where a task is something a crew member is assigned to accomplish. Microsoft’s recently launched Agent SDK (and Copilot Studio underneath it) speaks in terms of activities and turns, a vocabulary borrowed from the Bot Framework. Claude Code describes what it’s doing as tool calls and steps.
None of these map cleanly onto each other.
This isn’t just a naming inconvenience. It means there is currently no shared language for agent behavior. Ask five teams what an “agent step” is and you’ll get five different answers depending on which framework they’re using.
Why this makes governance structurally hard
Governance requires a stable vocabulary. You can’t write a policy over things that don’t have consistent names.
Suppose you want to enforce: “an agent must not write to an external system without first performing a data validation step.” In LangGraph, you’d express this in terms of node transitions. In CrewAI, you’d be reasoning about task ordering within a crew. In Copilot Studio, you’d be working with activity sequences. You’re describing the same constraint, but you’d have to implement it three separate times, in three different mental models, with no guarantee they actually cover the same behavioral ground.
The deeper problem: when you’re running agents across multiple frameworks — which is increasingly common in enterprise environments — you lose the ability to reason about behavior as a whole. Incidents that span a LangGraph orchestrator and a Copilot Studio sub-agent don’t fit neatly into either framework’s vocabulary. You can’t write a unified audit trail. You can’t enforce a consistent policy. You’re flying blind.
What a shared language needs to do
The frameworks aren’t wrong, exactly. They made reasonable choices for their own purposes. LangGraph’s node abstraction is great for building stateful graphs. CrewAI’s task/crew model is intuitive for multi-agent collaboration. The problem is that these are implementation vocabularies, not behavioral vocabularies.
What governance needs is a layer above the implementation — a semantic layer that asks: what is the agent actually trying to accomplish, and what discrete actions did it take toward that goal?
At Kyvvu, we call these Semantic Templates. The idea is that for every agent framework, you write a thin mapping — a template — that translates framework-native events into a common, semantically meaningful vocabulary. The canonical vocabulary is simple on purpose: tasks (what an agent is working toward) and steps (the discrete actions it takes). From there, the policy engine works in that shared language, not in framework-specific dialect.
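To make the idea concrete, here is a minimal sketch of what such a template mapping could look like. Everything here is illustrative — the event shapes, field names, and canonical schema are invented for this post and are not Kyvvu's actual implementation or any framework's real event format:

```python
from dataclasses import dataclass

# Canonical vocabulary: every framework-native event is normalized
# into a Step record. Field names are illustrative, not a real schema.
@dataclass
class Step:
    agent: str    # which agent took the action
    action: str   # e.g. "tool_call", "data_validation", "external_write"
    detail: dict  # framework-specific residue, kept for audit trails

# A Semantic Template is, at its core, a translation function from one
# framework's event shape into the canonical one. The input dicts below
# are hypothetical stand-ins for framework-native events.

def from_langgraph(event: dict) -> Step:
    # Hypothetical LangGraph-style event: keyed by graph and node.
    return Step(agent=event["graph_id"],
                action=event["kind"],
                detail={"node": event["node"]})

def from_crewai(event: dict) -> Step:
    # Hypothetical CrewAI-style event: keyed by crew and task.
    return Step(agent=event["crew"],
                action=event["task_type"],
                detail={"task": event["task"]})

TEMPLATES = {"langgraph": from_langgraph, "crewai": from_crewai}

def normalize(framework: str, event: dict) -> Step:
    """Translate a framework-native event into the shared vocabulary."""
    return TEMPLATES[framework](event)
```

The payoff is that everything downstream — policies, audit logs, incident timelines — consumes `Step` records and never needs to know which framework emitted them.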
One policy, written once, enforced everywhere — regardless of what the underlying framework calls things.
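As a sketch of what "written once" means in practice, here is the validate-before-write constraint from earlier, expressed over a canonical step stream rather than any framework's native events. The action names and record shape are assumptions carried over from the illustration above, not a real policy language:

```python
def check_validate_before_write(steps):
    """Return the external-write steps that were not preceded by a
    data-validation step from the same agent.

    `steps` is an ordered stream of canonical step records, assumed to
    be dicts with illustrative "agent" and "action" fields.
    """
    validated = set()   # agents that have performed validation so far
    violations = []
    for step in steps:
        if step["action"] == "data_validation":
            validated.add(step["agent"])
        elif step["action"] == "external_write" and step["agent"] not in validated:
            violations.append(step)
    return violations
```

Because the check runs over the shared vocabulary, the same function covers a LangGraph node transition, a CrewAI task, or a Copilot Studio activity — whatever produced the step is irrelevant once it has been normalized.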
This is not a solved problem
We want to be honest about where we are. Writing good Semantic Templates for a new framework takes real work. The mapping isn’t always obvious — especially for frameworks that bundle multiple behavioral concepts into a single abstraction. And the canonical vocabulary, however clean, will eventually run into edge cases.
But the alternative — continuing to write governance logic in framework-specific dialect, per deployment, with no shared ground — doesn’t scale. As the number of deployed agent frameworks grows, and as enterprises run agents from multiple vendors side-by-side, the absence of a shared behavioral language becomes an increasingly serious operational risk.
The frameworks solved the problem of building agents well. The field still needs to solve the problem of describing what agents do in a way that’s stable enough to reason over. That’s what we’re working on.
If you’re deploying agents across more than one framework and thinking about this problem, we’d genuinely like to compare notes — or nodes.