Event-Driven Multi-Agent Orchestration for Proactive Nudging

June 30, 2026 · 7 min read

Most AI products are reactive by design. A user sends a message, the system responds, the interaction ends. That model is well understood and relatively straightforward to build on top of modern language models.

Proactive AI is a different problem entirely. The system has to decide, on its own, when to act and how to act. It has to determine what to say, whether now is even the right moment to say anything at all, and why. And it has to do this continuously, across many users, while staying current with what is actually happening in each person’s life.

Getting this right is not primarily a model problem. The model can generate a thoughtful message, however you tune it. The harder question is the infrastructure that decides when to invoke the model, which piece of context to surface, how to recover when something goes wrong, and how to cancel a planned action the moment new information makes it irrelevant. That is an orchestration problem.

Why isolated agents break under real conditions

The instinct when building a proactive AI system is to assign a dedicated agent to each type of nudge. One agent handles follow-ups. Another handles re-engagement. A third watches for calendar events. A fourth handles curiosity-driven check-ins meant to break the ice. Each runs on its own schedule, pulling its own context, making its own decisions.

In testing, this looks fine. In production, it produces three compounding problems.

The first is silent failure. When an agent fails, there is no visible symptom. The nudge simply does not arrive. The user does not notice. The system logs an error, but nobody knows something was supposed to happen and did not.

The second is lack of coordination. Multiple isolated agents making independent decisions will overlap. Two agents can both decide a user needs outreach and deliver separate messages within minutes of each other. No individual agent did anything wrong. But the combined effect is an experience that feels noisy, unaware of itself, and ultimately untrustworthy.

The third is inability to cancel. An agent schedules a follow-up based on something a user mentioned. Two days later the user mentions that the situation resolved itself. The isolated agent has no way to know. The follow-up fires anyway, perfectly on schedule, and proves to the user that nothing was actually paying attention.

These failures do not come from weak models. They come from an architecture that was not designed for coordination.

Event-driven architecture as the foundation

The shift that fixes all three problems is moving from schedule-driven agents to event-driven agents. Instead of each agent operating on a timer, the entire system responds to events. A user opens the app: an event. A conversation ends: an event. A message mentions something significant: an event. The passage of time itself becomes one event source among many, not the primary mechanism for triggering agent work.

This matters because events create a shared representation of what is happening. When agents respond to the same event stream, coordination becomes possible. A central orchestration layer can see all pending events at once, decide which agents should run, and prevent two agents from making conflicting decisions simultaneously.

Events also create the foundation for cancellation. If an agent schedules future work in response to an event, a later event can invalidate that work before it executes. The system can reason about what has changed and adjust accordingly. That capability does not exist in a schedule-driven model, because nothing connects the earlier decision to the later information.

For this to work at the infrastructure level, events need to be durable. In environments where processes can shut down between the time an event is created and the time it is consumed, an in-memory dispatch model is not sufficient. The event needs to be persisted before it is dispatched, so that a restart or timeout cannot cause it to disappear silently. This is the architectural principle that keeps the event stream reliable under real operating conditions.
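
A minimal sketch of persist-before-dispatch, assuming a SQLite table as the durable store. The table layout and the open_store, publish, and drain helpers are hypothetical names chosen for illustration, not part of any particular system.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create the event table if needed. Use a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " user_id TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def publish(conn, user_id, kind, payload):
    """Persist the event first; dispatch happens later, from the stored row."""
    cur = conn.execute(
        "INSERT INTO events (user_id, kind, payload) VALUES (?, ?, ?)",
        (user_id, kind, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid


def drain(conn, handler):
    """Dispatch unprocessed events in order; mark each one only after its handler returns."""
    rows = conn.execute(
        "SELECT id, user_id, kind, payload FROM events"
        " WHERE processed = 0 ORDER BY id"
    ).fetchall()
    handled = 0
    for event_id, user_id, kind, payload in rows:
        # If the handler raises or the process dies here, the row stays pending.
        handler(user_id, kind, json.loads(payload))
        conn.execute("UPDATE events SET processed = 1 WHERE id = ?", (event_id,))
        conn.commit()
        handled += 1
    return handled
```

Because a row is marked processed only after its handler returns, a crash mid-dispatch means the event is delivered again on the next drain. Handlers in a design like this would need to tolerate that repeat delivery.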

Defining agents with clear, bounded responsibilities

In a well-designed proactive system, each agent owns exactly one kind of decision. Not a category of decisions. One decision, scoped precisely.

A follow-up agent decides whether to send a follow-up on a specific topic the user raised. A re-engagement agent decides whether the user has been quiet long enough to warrant a nudge. A digest agent decides whether this week’s summary is worth sending. Each agent has a single intent and a single responsibility.

This is not just a code organization principle. It is what makes orchestration tractable. When each agent’s scope is clear, the orchestration layer can reason about which agents apply to a given situation, prevent the same intent from being triggered twice, and make it obvious when two agents might conflict. Agents with broad, overlapping mandates produce systems that are difficult to reason about and hard to debug when something goes wrong.

Each agent also needs to know when not to act. Before doing anything, an agent should evaluate consent, timing, and relevance. Has the user opted into this type of outreach? Is this a reasonable hour? Is the thing it wants to say actually worth saying given what has happened since the last check? Most of the time the answer is no, and the right outcome is to stand down cleanly. An agent that stands down gracefully is doing its job as much as one that sends a message.
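
The three checks can be sketched as a gate that runs before any other agent work. The field names, the reason strings, and the 9-to-21 window for a reasonable hour are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class NudgeContext:
    opted_in: bool              # has the user consented to this type of outreach?
    local_time: datetime        # the user's local time right now
    new_since_last_check: bool  # has anything relevant happened since the last check?


def should_act(ctx, earliest_hour=9, latest_hour=21):
    """Return (decision, reason). The default hours are an illustrative assumption."""
    if not ctx.opted_in:
        return False, "no_consent"
    if not (earliest_hour <= ctx.local_time.hour < latest_hour):
        return False, "outside_hours"
    if not ctx.new_since_last_check:
        return False, "not_relevant"
    return True, "ok"
```

Returning a reason alongside the decision makes a stand-down a recorded outcome rather than an absence, which helps when tracing why a nudge did not go out.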

The orchestration layer: one pass, one decision

Coordination across agents requires a central orchestration layer that sees the full picture before any agent acts. When a batch of events arrives, the orchestrator reads them all, determines which agents are relevant, and dispatches work in a single coherent pass. No agent acts in isolation. The orchestrator is the point where competing signals get resolved before they reach any individual agent.

The orchestrator also enforces exclusivity. Only one orchestration pass should run per user at any given moment. If two events arrive close together and trigger two concurrent passes, the result is two independent decisions that may contradict each other. A single-pass model eliminates that class of bug entirely.
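
One way to sketch per-user exclusivity is a non-blocking lock keyed by user. This version assumes a single process; the PassGuard class is a hypothetical name, and a deployment with several workers would need a lock held in shared storage instead.

```python
import threading
from collections import defaultdict


class PassGuard:
    """Allows at most one orchestration pass per user within this process."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._registry_lock = threading.Lock()

    def run_exclusive(self, user_id, orchestrate):
        """Run orchestrate(user_id) unless a pass for that user is already running.

        Returns True if the pass ran, False if it was skipped.
        """
        with self._registry_lock:
            lock = self._locks[user_id]  # fetch or create under a lock to avoid races
        if not lock.acquire(blocking=False):
            return False  # a pass is in flight; pending events wait for a later pass
        try:
            orchestrate(user_id)
            return True
        finally:
            lock.release()
```

Skipping rather than queuing a second pass only works if events are stored durably, so that whatever the skipped pass would have handled is still there for the next one.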

For most routing decisions, the orchestrator does not need to call a model. Each agent declares which events it cares about, and the orchestrator builds a work list from those subscriptions mechanically. It knows that the follow-up agent should run when a follow-up intent matures, that the re-engagement agent should run when a long-silence signal arrives, and so on. Deterministic routing keeps the control plane fast, cheap, and reproducible. When a routing decision goes wrong, you can trace exactly why.
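
Subscription-based routing can be as small as a lookup table. The agent names and event kinds below are hypothetical, picked to mirror the examples in this post.

```python
# Each agent declares the event kinds it cares about (names are illustrative).
SUBSCRIPTIONS = {
    "follow_up": {"follow_up_intent_matured"},
    "re_engagement": {"long_silence"},
    "digest": {"week_ended"},
}


def build_work_list(events, subscriptions=SUBSCRIPTIONS):
    """Map a batch of event kinds to the agents subscribed to them.

    Each agent appears at most once, in a stable alphabetical order,
    so a batch with repeated events cannot trigger the same agent twice.
    """
    kinds = set(events)
    return [agent for agent in sorted(subscriptions) if subscriptions[agent] & kinds]
```

The same batch always yields the same work list, which is what makes a bad routing decision traceable after the fact.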

Self-healing: what agents need to behave reliably in production

Even well-designed agents encounter failure conditions. A model call times out. A response comes back but the quality is not good enough to send. An external service is slow. In a naive agent design, these situations produce silence: the agent fails, the error is logged, and the user receives nothing.

Production agents need a recovery discipline that goes beyond standard error handling. After each attempt, the agent should evaluate whether the result was actually good enough to deliver. Not just whether it ran without throwing an error, but whether a person should receive it. A response can be technically valid and still be the wrong thing to send.

When a result does not meet the bar, the agent should try again with a different approach. If the quality was low, it might reframe the prompt. If the service was unavailable, it might retry with backoff. But there has to be a hard limit on how many times it tries. An agent that retries indefinitely creates worse problems than one that fails cleanly.

This is the piece that most proactive AI systems get wrong: they lack a bounded recovery loop. Without a cap on attempts, an agent can get stuck in a cycle that consumes resources, delays other work, and eventually produces a stale or inappropriate result when it finally exits. Two attempts is typically enough. If the agent cannot produce something worth sending after two tries, the honest move is to stand down and let the next natural opportunity take over.

Each agent should also have a deterministic fallback: something it can say that does not depend on a successful model call. A simpler version of the message, a safe generic alternative, or a graceful silence. The fallback is not the ideal outcome, but it prevents the worst outcome, which is an escalating failure that the user experiences as broken behavior.
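
The recovery discipline described in this section can be sketched as a loop with a hard cap and a deterministic exit. The function name and its parameters are assumptions for illustration; backoff between attempts is left out to keep the example short.

```python
def produce_message(strategies, is_good_enough, fallback=None, max_attempts=2):
    """Try at most max_attempts strategies; return (message, source).

    strategies: callables that return a candidate message or raise on failure.
    is_good_enough: callable deciding whether a candidate is worth delivering.
    fallback: a deterministic message, or None to stay silent.
    """
    for attempt, generate in enumerate(strategies[:max_attempts], start=1):
        try:
            candidate = generate()
        except Exception:
            continue  # a failed call still counts as a used attempt
        if is_good_enough(candidate):
            return candidate, f"attempt_{attempt}"
    if fallback is not None:
        return fallback, "fallback"
    return None, "stood_down"
```

Each strategy can be a different approach, such as a reframed prompt on the second try. Whatever happens, the loop exits after the cap with either a message, the fallback, or an explicit stand-down.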

Revocable plans: when future context should cancel scheduled work

One of the most important properties of a proactive AI system is one that users never see directly: the ability to cancel a planned action before it executes.

When an agent decides to follow up on something a user mentioned, that decision is a plan, not a commitment. The world will continue to change between the moment the plan is made and the moment it is scheduled to execute. The user might resolve the situation. The topic might become irrelevant. A new conversation might make the planned follow-up feel tone-deaf.

A system that cannot update its plans in response to new information is not proactive in any meaningful sense. It is just executing a queue. Users feel the difference immediately. A follow-up that arrives after the situation has already resolved does not feel thoughtful. It feels like proof that nothing was listening.

The architectural primitive that enables this is the revocable intent: a scheduled action that can be invalidated by a later event. When an agent decides to follow up in two days, it does not enqueue a fixed task. It creates an intent that remains open and subject to cancellation until the moment it fires. When new context arrives, the system checks whether any open intents are affected and cancels the ones that are no longer relevant.
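
A revocable intent can be sketched as a record with a small state machine. The Intent and IntentStore names, the status values, and the in-memory dictionary are assumptions for this example; a real system would persist intents alongside its events.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Intent:
    intent_id: int
    subject: str
    fire_at: datetime
    status: str = "open"  # open -> fired, or open -> cancelled


class IntentStore:
    def __init__(self):
        self._intents = {}
        self._next_id = 1

    def schedule(self, subject, fire_at):
        """Record an open intent and return its id."""
        intent = Intent(self._next_id, subject, fire_at)
        self._intents[intent.intent_id] = intent
        self._next_id += 1
        return intent.intent_id

    def cancel(self, intent_id):
        """Cancel an intent if it is still open; return whether anything changed."""
        intent = self._intents[intent_id]
        if intent.status != "open":
            return False
        intent.status = "cancelled"
        return True

    def due(self, now):
        """Return open intents whose time has come and mark them fired."""
        ready = [
            i for i in self._intents.values()
            if i.status == "open" and i.fire_at <= now
        ]
        for intent in ready:
            intent.status = "fired"
        return ready
```

The status check happens at fire time, so a cancellation that arrives one second before the scheduled moment still wins.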

Detecting resolution correctly requires semantic understanding, not string matching. A user mentioning “the surgery went fine” resolves a follow-up tagged as “upcoming procedure.” Those strings do not match, but the meaning does. The right approach is to ask the model to classify new messages against a list of open intent subjects and determine which, if any, the message resolves. That keeps resolution detection robust to surface-level phrasing.
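
The classification step might look like the sketch below. The ask_model parameter is a hypothetical callable that takes a prompt and returns the model's text reply, standing in for whatever model client is in use; the prompt wording is illustrative only.

```python
def build_prompt(message, subjects):
    """Build a classification prompt listing the open subjects by number."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(subjects, start=1))
    return (
        "Open follow-up subjects:\n"
        f"{numbered}\n\n"
        f"New message: {message!r}\n"
        "Reply with the numbers of the subjects this message resolves, "
        "comma-separated, or 'none'."
    )


def resolved_subjects(message, subjects, ask_model):
    """Return the open subjects the message resolves, according to ask_model.

    Anything in the reply that is not a number within range is ignored,
    so a malformed reply cancels nothing.
    """
    if not subjects:
        return []
    reply = ask_model(build_prompt(message, subjects))
    picked = set()
    for token in reply.replace(",", " ").split():
        if token.isdecimal() and 1 <= int(token) <= len(subjects):
            picked.add(int(token))
    return [subjects[i - 1] for i in sorted(picked)]
```

Asking for numbers rather than free text keeps the model's answer checkable: the code only ever cancels a subject that was actually in the open list.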

What the system produces, from the user’s perspective

When all of this works together, the user does not see the architecture. They see something that feels like a thoughtful presence. They get a check-in at the right moment about the right thing. They do not get two messages in ten minutes. They do not get a follow-up about something that already resolved. They do not notice when something went wrong underneath.

That invisibility is the target. A proactive AI system that is doing its job well should feel effortless to receive. The complexity sits underneath: the event bus, the orchestration pass, the recovery loops, the revocable intents, the standing down. All of that infrastructure exists so the user gets a single, well-timed, relevant message instead of a poorly coordinated stream of automated noise.

The hardest part of building proactive AI is not making it reach out. It is making it reach out in a way that earns the trust to keep doing it. That requires getting coordination, recovery, and cancellation right at the infrastructure level, not just at the model level. The model generates the words. The architecture determines whether those words arrive at the right moment or not at all.