The Hot Path Tax: Why Runtime Governance Has to Be Sub-Millisecond
Most of the past week at Kyvvu was spent on a refactor we’d been putting off: moving the policy engine fully into the agent runtime. No new feature, no new rule type. Just a cleaner separation between where policies are evaluated and where policies are authored.
The refactor is worth a post, because it forces a question we should have written down earlier: why is the engine in-process at all? Why not run it as a service, the way most policy-as-code tooling does?
The answer is uncomfortably simple. If runtime governance is slow, it doesn’t get deployed. Or rather: it gets deployed, then quietly disabled the first time the agent feels sluggish in a demo. So the architecture isn’t really about features — it’s about not paying a hot-path tax that makes governance optional in practice.
Here’s where we landed.
The architecture, in five layers

Reading left to right, the picture splits cleanly into two halves: an integration layer owned by the developer who runs the agent, and a policy plane owned by whoever specifies how the agent should behave (typically the CISO or legal/compliance). Five components, each doing one thing:
- Agent → events. The agent runs on whatever framework it runs on: plain Python, LangChain, LangGraph, CrewAI, Claude Code, the Microsoft Agent SDK. We don’t care, and the engine doesn’t know.
- Semantic Templates → atomic behaviours. The framework’s events are translated into our canonical vocabulary of 12 atomic behaviours: task lifecycle events plus eight kinds of step (step.model, step.resource, step.exec, step.gate, …). One language, regardless of where the agent came from.
- Kyvvu Engine: in-process policy evaluation. Every atomic behaviour is evaluated against the active policy set before it executes. The decision is allow, warn, or block. This is the part that has to be fast.
- Kyvvu Policy Library: external. Policies are authored, versioned, and managed on the platform. The engine fetches them on a TTL and caches them locally. Policy authors don’t touch the runtime; runtime authors don’t touch the policies.
- Logs and incidents: optional sinks. Every evaluation produces a structured log line and, on a violation, an incident. Both are emitted to whatever you point them at: our platform, your SIEM, stdout, a file. The engine doesn’t depend on them.
The dashed boxes in the diagram (incidents, structured observability) are optional on purpose: an agent can run with the engine alone and never call home, and the governance still works. That property matters more than it sounds — it means the integration layer has no critical-path network dependency on us.
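To make the template layer concrete, here is a minimal sketch of what translating framework events into atomic behaviours could look like. Only the four step names (step.model, step.resource, step.exec, step.gate) come from this post; the framework event names, the `AtomicBehaviour` type, and the `translate` function are hypothetical and are not the kyvvu-engine API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Step types named in this post; the full vocabulary has more entries.
KNOWN_STEP_TYPES = {"step.model", "step.resource", "step.exec", "step.gate"}

# Hypothetical template: framework-specific event names -> canonical step types.
# The left-hand names are invented for illustration.
EXAMPLE_TEMPLATE = {
    "llm_call": "step.model",
    "file_read": "step.resource",
    "tool_run": "step.exec",
    "human_approval": "step.gate",
}


@dataclass
class AtomicBehaviour:
    """Canonical, framework-independent description of one agent step."""
    step_type: str
    task_id: str
    attrs: dict = field(default_factory=dict)


def translate(event: dict, template: dict) -> Optional[AtomicBehaviour]:
    """Map one raw framework event to an atomic behaviour, or None if unmapped."""
    step_type = template.get(event.get("kind"))
    if step_type is None:
        return None
    # Everything except the routing keys is carried along as attributes.
    attrs = {k: v for k, v in event.items() if k not in ("kind", "task_id")}
    return AtomicBehaviour(step_type, event["task_id"], attrs)


behaviour = translate(
    {"kind": "tool_run", "task_id": "t1", "cmd": "ls"}, EXAMPLE_TEMPLATE
)
```

The point of the shape is that everything downstream of `translate` sees only canonical step types, so a policy written against step.exec applies whichever framework produced the event.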
Why in-process
Most policy engines run as a sidecar or a remote service. A common variant in the AI space is the LLM gateway pattern: route every model call through a proxy that inspects prompts and responses and applies policy in flight. That’s the wrong shape for AI agents — and not only for performance reasons.
An agent step is small and frequent. A long-horizon task can easily emit hundreds or thousands of atomic behaviours. If each one pays a network round-trip to a policy service — even on the same VPC, even with a warm connection pool — you’re adding milliseconds per step that compound into seconds per task. Worse, you’ve introduced a dependency that can fail open (silent loss of governance) or fail closed (the agent halts when the policy service hiccups). Neither is acceptable.
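The compounding is easy to check with back-of-envelope arithmetic. In the sketch below the round-trip costs (1 ms and 2 ms) and the step count (1,000) are assumed for illustration, not measured; the 0.296 ms figure is the in-process p99 reported under “The numbers”.

```python
def added_seconds(steps: int, per_step_ms: float) -> float:
    """Total latency added to one task by a fixed per-step cost."""
    return steps * per_step_ms / 1000.0


STEPS = 1000  # assumed size of a long-horizon task

# Assumed network round-trip costs per policy decision (illustrative only).
for rtt_ms in (1.0, 2.0):
    print(f"remote, {rtt_ms} ms/step: {added_seconds(STEPS, rtt_ms):.2f} s per task")

# In-process decision at the reported p99 of 0.296 ms.
print(f"in-process, 0.296 ms/step: {added_seconds(STEPS, 0.296):.2f} s per task")
```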
The gateway pattern has a deeper problem on top of the latency one: it only sees LLM calls. It doesn’t see the tool invocations, the resource reads, the credential fetches, or the gates the agent crosses between model calls — which is precisely where the governable behaviour lives. You can’t write a policy on a path if you only observe one kind of node on that path.
So the engine runs in the agent’s process. Policies are fetched in the background and cached. Evaluation is pure CPU, zero I/O, no shared state across agents. The wire boundary is the policy fetch (rare, async, non-blocking) and the log/incident emission (fire-and-forget). The hot path — the decision itself — never leaves the process.
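A minimal sketch of the “fetched in the background and cached” part, assuming a hypothetical `PolicyCache` class and a caller-supplied `fetch` function; neither is taken from kyvvu-engine. It also assumes that a failed fetch keeps serving the last good policy set, which is one possible choice rather than the engine’s documented behaviour.

```python
import threading
import time
from typing import Callable, List


class PolicyCache:
    """Holds the active policy set and refreshes it off the hot path."""

    def __init__(self, fetch: Callable[[], List[dict]], ttl_seconds: float):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._policies: List[dict] = []
        self._fetched_at = float("-inf")  # forces a refresh on first use
        self._lock = threading.Lock()
        self._refreshing = False

    def refresh_now(self) -> bool:
        """Blocking fetch; returns False and keeps the old set on failure."""
        try:
            policies = self._fetch()
        except Exception:
            return False
        with self._lock:
            self._policies = policies
            self._fetched_at = time.monotonic()
        return True

    def _refresh_in_background(self) -> None:
        try:
            self.refresh_now()
        finally:
            with self._lock:
                self._refreshing = False

    def get(self) -> List[dict]:
        """Hot-path read: returns the cached set and never waits on the network."""
        with self._lock:
            stale = time.monotonic() - self._fetched_at > self._ttl
            if stale and not self._refreshing:
                # At most one refresh in flight; the caller does not wait for it.
                self._refreshing = True
                threading.Thread(
                    target=self._refresh_in_background, daemon=True
                ).start()
            return self._policies
```

In this sketch the only thing a decision ever waits on is a short in-memory lock. The fetch itself runs on a daemon thread, so a slow or unavailable policy service delays policy updates but not agent steps.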
This is why we keep saying “behavioural firewall.” Not because firewall is a fashionable analogy, but because the topology is the same: enforcement sits at the boundary of the thing it’s protecting, in the data path, and the decision is local. A firewall that phones home to ask permission for every packet is not a firewall.
The numbers
Sub-millisecond policy evaluation. 100 policies against a 50-step history in under 300µs (p99). The engine adds zero perceptible latency to your agent’s hot path — an LLM call takes about 1,000× longer.
Here are the measurements from the current release (Apple Silicon M-series, Python 3.12):
| Scenario | p50 | p95 | p99 |
|---|---|---|---|
| evaluate: 100 policies, 50-step history (200 samples) | 0.181 ms | 0.232 ms | 0.296 ms |
| evaluate: 10 policies, empty history | n/a | n/a | 0.034 ms |
| evaluate: 0 policies | n/a | n/a | 0.003 ms |
| record: single completed step (50 samples) | n/a | n/a | 0.002 ms |
| load_policies: 100 policies | n/a | n/a | 0.134 ms |
Reproduce with pytest tests/test_latency.py -v -s against the public kyvvu-engine package.
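For readers who want to sanity-check percentile figures like these on their own workload, here is a generic sketch of how p50/p95/p99 can be collected with the standard library. It is not the contents of tests/test_latency.py; the `latency_percentiles` helper and the stand-in workload are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def latency_percentiles(fn: Callable[[], object], samples: int = 200) -> Dict[str, float]:
    """Time `fn` repeatedly and return p50/p95/p99 in milliseconds."""
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        fn()
        timings_ms.append((time.perf_counter_ns() - start) / 1e6)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(timings_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Stand-in workload; replace with a call into the code you want to measure.
result = latency_percentiles(lambda: sum(range(1000)))
print({name: f"{value:.4f} ms" for name, value in result.items()})
```

With only 50 or 200 samples, p99 is determined by the slowest one or two runs, so it is worth repeating the measurement a few times before comparing against a budget.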
A few things to read out of this table:
- The floor is essentially free. With no policies, an evaluate() call costs about 3µs; that’s the dispatch overhead. Adding the engine to an agent that has no policies configured is not a performance decision.
- History scales gracefully. Going from 10 policies on an empty history to 100 policies on a 50-step history costs roughly 260µs at p99 (0.034 ms to 0.296 ms). We do this with linear scans over per-task history, indexed by step_type, with rules dispatched through a flat lookup table. There’s no compiled DSL, no JIT, nothing exotic: the engine is straightforward Python that happens to do very few unnecessary things on the hot path.
- The asymmetry between evaluate and record is intentional. Recording a completed step is purely an append; the heavy work (rule evaluation) happens at decision time, not at logging time.
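The two mechanisms named above, a per-task history indexed by step type and rules dispatched through a flat lookup table, can be sketched as follows. This is an illustration under assumptions, not the engine’s code: the `TaskHistory` class, the two rule functions, and the policy dictionary layout are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

ALLOW, WARN, BLOCK = "allow", "warn", "block"
SEVERITY = {ALLOW: 0, WARN: 1, BLOCK: 2}


class TaskHistory:
    """Per-task record of completed steps, bucketed by step type."""

    def __init__(self) -> None:
        self.steps: List[dict] = []
        self.by_type: Dict[str, List[dict]] = defaultdict(list)

    def record(self, step: dict) -> None:
        # Recording is just two appends; no rules run here.
        self.steps.append(step)
        self.by_type[step["step_type"]].append(step)


def max_count(policy: dict, step: dict, history: TaskHistory) -> str:
    """Hypothetical rule: cap how many steps of this type a task may take."""
    seen = len(history.by_type.get(step["step_type"], ()))
    return policy["on_violation"] if seen >= policy["limit"] else ALLOW


def deny_attr(policy: dict, step: dict, history: TaskHistory) -> str:
    """Hypothetical rule: flag a step whose attribute has a denied value."""
    hit = step.get(policy["attr"]) in policy["values"]
    return policy["on_violation"] if hit else ALLOW


# Flat lookup table: rule name -> handler, one dict lookup per policy.
RULES: Dict[str, Callable[[dict, dict, TaskHistory], str]] = {
    "max_count": max_count,
    "deny_attr": deny_attr,
}


def evaluate(step: dict, policies: List[dict], history: TaskHistory) -> str:
    """Return the most severe decision among policies for this step type."""
    decision = ALLOW
    for policy in policies:
        if policy["step_type"] != step["step_type"]:
            continue
        outcome = RULES[policy["rule"]](policy, step, history)
        if SEVERITY[outcome] > SEVERITY[decision]:
            decision = outcome
            if decision == BLOCK:
                break  # nothing can outrank a block
    return decision
```

Note how the asymmetry shows up in the sketch: `record` does no rule work at all, while `evaluate` does one dictionary lookup per applicable policy plus whatever the rule itself needs from the history bucket.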
For context: a typical LLM call is 200–800ms end-to-end. A policy decision at p99 (about 0.3ms) is roughly three orders of magnitude smaller. You will not measure the engine in your agent’s latency budget.
Why this shape, specifically
The architecture above isn’t an accident; it falls out of taking “policies on paths” seriously as a model.
If governance is pathwise — if the decision about whether the agent can take its next step depends on the ordered history of what it has already done in the current task — then the engine needs per-task state. That state can live in one of two places: locally in the agent’s process, or remotely in a policy service. Locally is faster, simpler, and avoids a class of consistency problems (what happens when two engine replicas disagree about the path?). Remotely is necessary if you’re trying to share state across agents, which we explicitly aren’t: each agent gets its own engine, its own history, its own decisions.
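To illustrate what “pathwise” means in practice, here is a sketch of a rule whose answer depends on the order of earlier steps, not just on the next step. The rule itself, the `sensitive` attribute, and the function name are hypothetical; only the step type names come from this post.

```python
from typing import List


def exec_after_sensitive_read(history: List[dict], next_step: dict) -> str:
    """Hypothetical path rule: after a sensitive resource read, a step.exec
    is blocked unless a step.gate has been crossed since that read."""
    if next_step["step_type"] != "step.exec":
        return "allow"
    gated = True  # no ungated sensitive read seen yet
    for step in history:  # linear scan in task order
        if step["step_type"] == "step.resource" and step.get("sensitive"):
            gated = False
        elif step["step_type"] == "step.gate":
            gated = True
    return "allow" if gated else "block"


read = {"step_type": "step.resource", "sensitive": True}
gate = {"step_type": "step.gate"}
run = {"step_type": "step.exec"}

# Same two steps in the history, different order, different decision.
print(exec_after_sensitive_read([read, gate], run))
print(exec_after_sensitive_read([gate, read], run))
```

A rule like this cannot be evaluated from the next step alone, which is why the engine needs the ordered per-task history close at hand.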
Cross-agent reasoning still happens (rate limits across executions, fleet-wide rules), but it happens through pre-fetched aggregate counts on the EvalContext, not through shared mutable state. The hot path stays local. The aggregation stays on the slow path.
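A sketch of how a fleet-wide limit can be checked against a local snapshot. The post names EvalContext but does not describe its fields, so the `EvalContextSketch` class, its `aggregate_counts` field, and the `fleet_rate_limit` function below are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvalContextSketch:
    """Hypothetical stand-in for EvalContext; field names are invented."""
    agent_id: str
    # Snapshot of fleet-wide counts, refreshed out of band (not on the hot path).
    aggregate_counts: Dict[str, int] = field(default_factory=dict)


def fleet_rate_limit(ctx: EvalContextSketch, key: str, limit: int) -> str:
    """Decide from the local snapshot only: no network call, no shared lock."""
    return "block" if ctx.aggregate_counts.get(key, 0) >= limit else "allow"


ctx = EvalContextSketch(
    agent_id="agent-1",
    aggregate_counts={"emails_sent_today": 480},
)
decision = fleet_rate_limit(ctx, "emails_sent_today", limit=500)
```

The trade-off in this shape is staleness: the decision is only as current as the last snapshot, so limits enforced this way are approximate between refreshes.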
The split between integration layer (developer) and policy plane (CISO/legal) follows the same logic. Authoring a policy is not a hot-path operation. Evaluating it is. Putting them in different planes lets each side optimise for its own constraint: the policy plane optimises for human review, version control, and approval workflows; the engine optimises for cycles per decision.
What’s next
The refactor unblocks a few things we’ve been wanting to ship: a standalone kyvvu serve mode for harnesses written in languages other than Python, tighter latency budgets for the next release, and broader template coverage (we’re working through CrewAI and a Claude Code proxy at the moment).
But the more important point is that the boring architectural work — the kind that doesn’t show up as a feature on the changelog — is what makes the rest of the pitch credible. “Runtime enforcement” is only a real claim if the runtime cost is one you can ignore. We think 300µs at p99 against 100 policies is comfortably in that territory.
If you’re deploying agents and the governance story you’re being sold lives at reporting time, or behind a network round-trip per step, that’s worth a closer look. The hot path is where the agent actually decides what to do. That’s where governance has to live.
The kyvvu-engine package is on PyPI; benchmarks are reproducible from the README. Full architecture documentation at docs.kyvvu.com.