Behind the scenes: building a modern RMM stack
What goes into an RMM that runs on 10,000 endpoints without blinking? Here's a look under the hood at the architectural shape of LynxTrac, for the curious.
Top-level view
Three planes:
- Data plane. Agents on endpoints talk to regional relays via persistent outbound TLS connections. Relays broker sessions, accept telemetry, and fan out to the control plane.
- Control plane. Identity, policy, scopes, tags, audit — the slow-changing state that drives behavior.
- Analytics plane. Metrics, logs, events — the fast-changing data that gets queried.
Separating these is how you scale each independently.
The agent
One static Go binary, ~15 MB, described in detail elsewhere. The key property for scale: it does almost nothing in steady state, so 10,000 agents cost almost nothing in aggregate.
The relays
Horizontally scaled, multi-region. Each relay handles thousands of persistent agent connections and brokers sessions on demand. State is minimal — session brokering is intentionally stateless so a relay restart doesn’t drop active sessions (only pending ones).
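The "minimal state" claim boils down to this: the only thing a relay really tracks is which agent tunnels are attached right now, and brokering is just pairing a request with a live tunnel. A hedged in-memory sketch (illustrative types, not the real relay):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// registry holds a relay's only real state: the set of currently
// attached agent tunnels. Nothing durable is kept, which is what makes
// relays cheap to restart and scale out horizontally.
type registry struct {
	mu      sync.RWMutex
	tunnels map[string]chan []byte // agentID -> inbound frame channel
}

func newRegistry() *registry {
	return &registry{tunnels: make(map[string]chan []byte)}
}

// attach registers an agent's tunnel and returns its frame channel.
func (r *registry) attach(agentID string) chan []byte {
	r.mu.Lock()
	defer r.mu.Unlock()
	ch := make(chan []byte, 16)
	r.tunnels[agentID] = ch
	return ch
}

// broker forwards a frame to a live tunnel, or fails fast if the agent
// is not attached to this relay.
func (r *registry) broker(agentID string, frame []byte) error {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ch, ok := r.tunnels[agentID]
	if !ok {
		return errors.New("no live tunnel for " + agentID)
	}
	ch <- frame
	return nil
}

func main() {
	r := newRegistry()
	in := r.attach("agent-42")
	if err := r.broker("agent-42", []byte("open shell")); err != nil {
		panic(err)
	}
	fmt.Printf("agent-42 received: %s\n", <-in)
	fmt.Println(r.broker("agent-99", nil)) // no tunnel -> error
}
```

A fresh relay process starts with an empty map; agents re-attach and brokering resumes, with no session table to replicate or recover.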
The control plane
Postgres + Redis. Postgres for durable state (identities, policies, scopes). Redis for session locks and pub/sub. Multi-region active/passive; failover measured in seconds.
The analytics plane
ClickHouse for metrics and logs. Immutable, column-oriented, scales to billions of rows per day without sweating. We shard by tenant for large customers; smaller tenants share shards.
Query latency: p50 of 100 ms on 30-day windows, p95 of 500 ms on 90-day windows.
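Tenant-to-shard routing can be as simple as: large tenants are pinned to a dedicated shard, everyone else hashes into a shared pool. A minimal sketch, assuming made-up tenant names and shard counts:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// dedicated pins large tenants to their own shard (illustrative data).
var dedicated = map[string]int{"bigcorp": 0}

// shardFor routes a tenant to a shard: dedicated tenants get their
// pinned shard; small tenants hash deterministically into a shared
// pool that starts after the dedicated shards.
func shardFor(tenant string, sharedShards int) int {
	if s, ok := dedicated[tenant]; ok {
		return s
	}
	h := fnv.New32a()
	h.Write([]byte(tenant))
	return len(dedicated) + int(h.Sum32())%sharedShards
}

func main() {
	for _, t := range []string{"bigcorp", "acme", "initech"} {
		fmt.Printf("tenant %q -> shard %d\n", t, shardFor(t, 8))
	}
}
```

Deterministic hashing keeps routing stateless: any query node can compute a tenant's shard without a lookup table.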
The session model
Three concepts:
- Agent tunnel. Long-lived outbound TLS from agent to relay.
- Operator session. Short-lived, SSO-authenticated, scoped to one endpoint and one protocol (shell, RDP, file transfer).
- Audit stream. Every operator session produces a keystroke/screen log routed to the audit store.
Separation lets us scale each layer — tunnels scale with endpoint count, operator sessions scale with team size, audit storage scales independently of both.
Failure modes we design for
- Agent loses connectivity. Resumes on its own; no manual intervention.
- Relay fails. Agents reconnect to a sibling relay; pending sessions retry.
- Control plane degraded. Agents and sessions keep working; new policy changes pause.
- Analytics plane degraded. Alerts may delay; agent and access paths unaffected.
The "agent and access paths unaffected" property is critical. You still want to be able to reach production during a partial outage of the RMM itself.
What we don’t do
- No embedded Kubernetes. The agent doesn’t run containers.
- No built-in backups. We integrate with your backup tools; we don’t replace them.
- No endpoint AV. That’s a different product category.
Our opinion is boring: do one thing (IT operations) well, integrate cleanly with adjacent tools, don’t try to be everything.
Try it yourself
LynxTrac is free forever for 2 servers — no credit card, no sales call. Start in under 2 minutes →
Related posts
Lightweight RMMs vs enterprise tools: what small teams need
Small teams do not benefit from enterprise-scale RMM — they are paying for friction. Here is how to choose tooling that moves with you.
Designing an RMM agent that doesn't slow systems down
Every RMM agent is a tax. Here is how we designed ours to stay under 1% CPU and under 50 MB RSS without dropping signal.
Lightweight RMM for DevOps teams
DevOps teams do not want a tool that behaves like 2010 enterprise software. Here's what a lightweight, CI-friendly RMM looks like in practice.