ADR-002

Synchronous vs Async Communication Patterns

Context

The system has multiple components that need to communicate: the Gateway calls the Policy Engine, policies write to the Audit Logger, escalations go to the HITL Queue, and so on. Each interaction could be sync (the caller waits for a response) or async (fire-and-forget or an eventual response).

The core question: when should we block waiting for a response vs. when should we publish an event and move on?

Decision

Use sync for the main request flow, async for everything else.

Synchronous (caller waits):

Gateway → Policy Engine (input evaluation)
Gateway → Model Router
Gateway → Policy Engine (output evaluation)
HITL Service → Gateway (when human makes decision and request resumes)

Asynchronous (fire-and-forget):

Any component → Audit Logger
Policy Engine → HITL Queue (when escalation happens)
All background jobs (metrics, cleanup, archiving)

Why this split:

Main request flow is sync because the user is waiting. Can't return a response without knowing if policies passed.

Logging is async because it shouldn't slow down responses. If audit write fails, retry in background.

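HITL escalation is enqueued async because human review takes minutes to hours, and the Policy Engine shouldn't wait on the reviewer. In v1 the original request is still held open until the HITL Service returns the decision to the Gateway, which is the cost listed under Consequences.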
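
A minimal sketch of the split, assuming an in-process queue and a worker thread in place of the real message broker. The function names (evaluate_input, route_to_model, evaluate_output, handle_request) and the event shapes are hypothetical and stand in for the Policy Engine, Model Router, and Audit Logger:

```python
import queue
import threading

# Hypothetical stand-ins for the real services; names and return shapes are assumptions.
def evaluate_input(prompt: str) -> bool:
    return "forbidden" not in prompt

def route_to_model(prompt: str) -> str:
    return f"model-output({prompt})"

def evaluate_output(text: str) -> bool:
    return True

audit_queue: queue.Queue = queue.Queue()  # stands in for the message queue
audit_log: list = []                      # stands in for durable audit storage

def audit_worker() -> None:
    # Background consumer: request threads never wait on the audit write.
    while True:
        event = audit_queue.get()
        audit_log.append(event)
        audit_queue.task_done()

def handle_request(prompt: str) -> str:
    # Sync: the user is waiting, so each step must finish before the next starts.
    if not evaluate_input(prompt):
        audit_queue.put({"event": "input_blocked"})  # async: fire-and-forget
        return "blocked"
    output = route_to_model(prompt)
    if not evaluate_output(output):
        audit_queue.put({"event": "output_blocked"})
        return "blocked"
    audit_queue.put({"event": "allowed"})
    return output

threading.Thread(target=audit_worker, daemon=True).start()
```

The request path only pays for an in-memory put; everything after that happens off the request thread.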

Alternatives Considered

All sync:

Rejected because logging and escalations would block request threads. A slow database write blocks the entire request. Trade-off: Simpler code for worse performance.

All async:

Rejected because the main flow requires answers before proceeding. You can't route to a model without knowing which model to use. Trade-off: Lower latency for broken semantics.

Async HITL with webhooks:

Rejected for v1 because it requires the caller to provide a callback URL and handle async responses. Enterprise clients aren't ready for this. Trade-off: Simpler client integration in exchange for request threads that stay blocked during review.
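
For contrast with the webhook approach, the v1 behavior might look roughly like this: the request thread parks on an event until a reviewer decides or a timeout expires. The registry, the function names, and the timeout are assumptions for illustration; this ADR does not specify a review timeout:

```python
import threading

class PendingReview:
    """One escalated request waiting on a human decision."""
    def __init__(self) -> None:
        self.done = threading.Event()
        self.verdict = None

# Hypothetical in-memory registry keyed by request id.
pending_reviews: dict = {}

def escalate_and_wait(request_id: str, timeout_s: float) -> str:
    review = PendingReview()
    pending_reviews[request_id] = review  # stands in for publishing to the HITL Queue
    try:
        # The request thread blocks here for the whole review.
        if review.done.wait(timeout=timeout_s):
            return review.verdict
        return "timed_out"
    finally:
        pending_reviews.pop(request_id, None)

def submit_decision(request_id: str, verdict: str) -> bool:
    # Called when the reviewer decides; wakes the blocked request thread.
    review = pending_reviews.get(request_id)
    if review is None:
        return False  # request already timed out or was never escalated
    review.verdict = verdict
    review.done.set()
    return True
```

Each pending review occupies one thread for its full duration, which is why this shape only works while review volume stays low.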

Consequences

Wins:

Main request flow is fast (only blocks on necessary decisions)
Audit logging never slows down responses
Clear separation: sync = user is waiting, async = background work
Can scale audit and HITL independently from request handling

Costs:

Dual communication patterns (both REST and message queue)
More infrastructure (need Redis/Kafka for async)
Failure modes are different (sync = retry immediately, async = dead letter queue)
HITL escalations block request threads (concurrent reviews are limited by the number of available request threads)
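
The difference in failure handling can be sketched as follows. This is an illustration under assumptions, not the system's actual retry policy: the attempt counts, the ConnectionError failure type, and both function names are hypothetical.

```python
from collections import deque
from typing import Callable

def call_sync_with_retry(call: Callable[[], str], attempts: int = 3) -> str:
    # Sync path: the caller is waiting, so retry right away and fail loudly.
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"sync call failed after {attempts} attempts") from last_error

def drain_async(events: deque, handler: Callable[[dict], None],
                dead_letters: list, max_attempts: int = 3) -> None:
    # Async path: failed events are requeued, then parked in a dead letter list.
    while events:
        event = events.popleft()
        try:
            handler(event)
        except Exception:
            event["attempts"] = event.get("attempts", 0) + 1
            if event["attempts"] >= max_attempts:
                dead_letters.append(event)  # give up; keep for later inspection
            else:
                events.append(event)        # retry after the rest of the queue
```

A sync failure surfaces to the user immediately, while an async failure is invisible to the request and has to be found by monitoring the dead letter queue.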

When to revisit:

If HITL review volume grows to thousands/day (need async with webhooks)

If audit writes become a bottleneck (need faster async pipeline)

If policy evaluation becomes slow enough that sync calls are noticeable (> 100 ms)
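
One way to watch the last trigger is to time each sync call against the 100 ms threshold. A minimal sketch, where timed_call and SYNC_BUDGET_MS are hypothetical names and a production setup would more likely export this as a latency metric:

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")
SYNC_BUDGET_MS = 100.0  # threshold taken from this ADR

def timed_call(call: Callable[[], T]) -> Tuple[T, float, bool]:
    # Returns the call's result, its elapsed time in ms, and whether it exceeded the budget.
    start = time.perf_counter()
    result = call()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, elapsed_ms > SYNC_BUDGET_MS
```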

Status

Status: Accepted

Date: 2026

Number: ADR-002