Behind the scenes: building a modern RMM stack
What goes into an RMM that runs on 10,000 endpoints without blinking? Here's a look under the hood at the architectural shape of LynxTrac, for the curious.
Top-level view
Three planes:
- Data plane. Agents on endpoints talk to regional relays via persistent outbound TLS connections. Relays broker sessions, accept telemetry, and fan out to the control plane.
- Control plane. Identity, policy, scopes, tags, audit — the slow-changing state that drives behavior.
- Analytics plane. Metrics, logs, events — the fast-changing data that gets queried.
Separating these is how you scale each independently.
The agent
One static Go binary, ~15 MB, described in detail elsewhere. The key property for scale: it does almost nothing in steady state, so 10,000 agents cost almost nothing in aggregate.
The relays
Horizontally scaled, multi-region. Each relay handles thousands of persistent agent connections and brokers sessions on demand. State is minimal — session brokering is intentionally stateless so a relay restart doesn’t drop active sessions (only pending ones).
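The "minimal state" claim boils down to this: the only thing a relay really tracks is which agent tunnels are attached right now, and brokering is just pairing a request with a live tunnel. A hedged in-memory sketch (illustrative types, not the real relay):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// registry holds a relay's only real state: the set of currently
// attached agent tunnels. Nothing durable is kept, which is what makes
// relays cheap to restart and scale out horizontally.
type registry struct {
	mu      sync.RWMutex
	tunnels map[string]chan []byte // agentID -> inbound frame channel
}

func newRegistry() *registry {
	return &registry{tunnels: make(map[string]chan []byte)}
}

// attach registers an agent's tunnel and returns its frame channel.
func (r *registry) attach(agentID string) chan []byte {
	r.mu.Lock()
	defer r.mu.Unlock()
	ch := make(chan []byte, 16)
	r.tunnels[agentID] = ch
	return ch
}

// broker forwards a frame to a live tunnel, or fails fast if the agent
// is not attached to this relay.
func (r *registry) broker(agentID string, frame []byte) error {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ch, ok := r.tunnels[agentID]
	if !ok {
		return errors.New("no live tunnel for " + agentID)
	}
	ch <- frame
	return nil
}

func main() {
	r := newRegistry()
	in := r.attach("agent-42")
	if err := r.broker("agent-42", []byte("open shell")); err != nil {
		panic(err)
	}
	fmt.Printf("agent-42 received: %s\n", <-in)
	fmt.Println(r.broker("agent-99", nil)) // no tunnel -> error
}
```

A fresh relay process starts with an empty map; agents re-attach and brokering resumes, with no session table to replicate or recover.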
The control plane
Postgres + Redis. Postgres for durable state (identities, policies, scopes). Redis for session locks and pub/sub. Multi-region active/passive; failover measured in seconds.
The analytics plane
ClickHouse for metrics and logs. Immutable, column-oriented, scales to billions of rows per day without sweating. We shard by tenant for large customers; smaller tenants share shards.
Query latency: p50 of 100 ms on 30-day windows, p95 of 500 ms on 90-day windows.
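Tenant-to-shard routing can be as simple as: large tenants are pinned to a dedicated shard, everyone else hashes into a shared pool. A minimal sketch, assuming made-up tenant names and shard counts:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// dedicated pins large tenants to their own shard (illustrative data).
var dedicated = map[string]int{"bigcorp": 0}

// shardFor routes a tenant to a shard: dedicated tenants get their
// pinned shard; small tenants hash deterministically into a shared
// pool that starts after the dedicated shards.
func shardFor(tenant string, sharedShards int) int {
	if s, ok := dedicated[tenant]; ok {
		return s
	}
	h := fnv.New32a()
	h.Write([]byte(tenant))
	return len(dedicated) + int(h.Sum32())%sharedShards
}

func main() {
	for _, t := range []string{"bigcorp", "acme", "initech"} {
		fmt.Printf("tenant %q -> shard %d\n", t, shardFor(t, 8))
	}
}
```

Deterministic hashing keeps routing stateless: any query node can compute a tenant's shard without a lookup table.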
The session model
Three concepts:
- Agent tunnel. Long-lived outbound TLS from agent to relay.
- Operator session. Short-lived, SSO-authenticated, scoped to one endpoint and one protocol (shell, RDP, file transfer).
- Audit stream. Every operator session produces a keystroke/screen log routed to the audit store.
Separation lets us scale each layer — tunnels scale with endpoint count, operator sessions scale with team size, audit storage scales independently of both.
Failure modes we design for
- Agent loses connectivity. Resumes on its own; no manual intervention.
- Relay fails. Agents reconnect to a sibling relay; pending sessions retry.
- Control plane degraded. Agents and sessions keep working; new policy changes pause.
- Analytics plane degraded. Alerts may delay; agent and access paths unaffected.
The "agent and access paths unaffected" property is critical. You still want to be able to reach production during a partial outage of the RMM itself.
What we don’t do
- No embedded Kubernetes. The agent doesn’t run containers.
- No built-in backups. We integrate with your backup tools; we don’t replace them.
- No endpoint AV. That’s a different product category.
Our opinion is boring: do one thing (IT operations) well, integrate cleanly with adjacent tools, don’t try to be everything.
Try it yourself
LynxTrac is free forever for 2 servers — no credit card, no sales call. Start in under 2 minutes →
Related posts
Lightweight RMMs vs enterprise tools: what small teams need
Small teams do not benefit from enterprise-scale RMM — they are paying for friction. Here is how to choose tooling that moves with you.
Designing an RMM agent that doesn't slow systems down
Every RMM agent is a tax. Here is how we designed ours to stay under 1% CPU and under 50 MB RSS without dropping signal.
Lightweight RMM for DevOps teams
DevOps teams do not want a tool that behaves like 2010 enterprise software. Here's what a lightweight, CI-friendly RMM looks like in practice.