Designing an RMM agent that doesn't slow systems down

Every RMM agent is a tax on the systems it manages. The best agents make that tax nearly free. Here are the design decisions that kept ours under 1% CPU and under 50 MB RSS at the median without dropping signal.

Principle 1: event-driven, not polled

Most legacy agents poll: every second, every 10 seconds, every minute. Polling burns CPU confirming that nothing changed, and it can miss short-lived events that start and finish between two polls.

Modern agents should be event-driven where the OS supports it:

File changes → inotify (Linux), FSEvents (macOS), ReadDirectoryChangesW (Windows)
Process changes → proc connector over netlink (Linux), ETW (Windows)
Network state → netlink sockets (AF_NETLINK, NETLINK_ROUTE) delivering messages such as RTM_NEWLINK (Linux)
Metric snapshots → still polled, but at the minimum necessary rate
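
A minimal sketch of the event-driven shape in Python, assuming each event source is exposed as a readable file descriptor. Inotify descriptors and netlink sockets both are; here a socketpair stands in for them so the example runs without OS-specific setup. The names wait_for_events and demo, the "file-change" label, and the payload are illustrative only.

```python
import selectors
import socket


def wait_for_events(sel, timeout=None):
    """Block until a registered source is readable, then drain it.

    With timeout=None the process sleeps in the kernel and uses no CPU
    until something actually happens.
    """
    events = []
    for key, _mask in sel.select(timeout):
        source_name = key.data              # label given at registration
        payload = key.fileobj.recv(4096)    # read what the source delivered
        events.append((source_name, payload))
    return events


def demo():
    sel = selectors.DefaultSelector()
    # Stand-in for an inotify or netlink descriptor (assumption for the demo).
    producer, consumer = socket.socketpair()
    consumer.setblocking(False)
    sel.register(consumer, selectors.EVENT_READ, data="file-change")

    # Nothing has happened yet: a non-blocking check reports no events.
    idle = wait_for_events(sel, timeout=0)

    # The "kernel" delivers one event; the next wait returns it.
    producer.send(b"/etc/hosts modified")
    fired = wait_for_events(sel, timeout=1.0)

    sel.unregister(consumer)
    producer.close()
    consumer.close()
    sel.close()
    return idle, fired
```

The point of the design is that the agent never asks "did anything change?" on a timer. It registers interest once and is woken only when the answer is yes.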

Principle 2: batch everything

Per-item operations are expensive: every flush pays a fixed cost for the syscall, the packet, and the wake-up. An agent that flushes each metric individually can do 10-100x more work than one that batches.

Our rule: batch for at least 100 ms, and up to 1 second, before flushing. Nobody watching a dashboard notices the added latency; the CPU and the network notice the savings.
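
One way to express that rule in code is sketched below. It assumes a batch is flushed once its oldest item is 1 second old, or earlier if the batch is already large and at least 100 ms have passed. The size trigger (max_items) is an assumption added for the sketch, and the Batcher class and its parameter names are hypothetical.

```python
import time


class Batcher:
    """Collects items and hands them to `flush` in batches."""

    def __init__(self, flush, min_delay=0.1, max_delay=1.0,
                 max_items=500, clock=time.monotonic):
        self._flush = flush
        self._min_delay = min_delay    # never flush sooner than this
        self._max_delay = max_delay    # always flush by this age
        self._max_items = max_items    # assumed size trigger
        self._clock = clock            # injectable so tests can fake time
        self._items = []
        self._first_at = 0.0

    def add(self, item):
        if not self._items:
            # The batch window starts when the first item arrives.
            self._first_at = self._clock()
        self._items.append(item)
        return self.maybe_flush()

    def maybe_flush(self):
        """Flush if the window rules allow it; return True if flushed."""
        if not self._items:
            return False
        age = self._clock() - self._first_at
        full = len(self._items) >= self._max_items
        if age >= self._max_delay or (full and age >= self._min_delay):
            batch, self._items = self._items, []
            self._flush(batch)
            return True
        return False
```

In use, add() is called from collectors and maybe_flush() from the agent's timer, so a quiet batch still goes out when it reaches the 1 second limit.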

Principle 3: compress before ship

For an agent, network bandwidth usually costs more than CPU cycles, often by an order of magnitude. Gzip everything that goes over the wire. The CPU cost is in the noise, and for text-heavy telemetry the bandwidth savings are 80-95%.
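
A small sketch of compress-before-ship using Python's standard gzip and json modules. It assumes batches are lists of JSON-serializable records; the function names encode_batch and decode_batch are hypothetical. The actual ratio depends entirely on the data, so measure it on your own payloads.

```python
import gzip
import json


def encode_batch(records, level=6):
    """Serialize a batch compactly, then gzip it for the wire.

    Returns (raw_bytes, compressed_bytes) so callers can log the ratio.
    """
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return raw, gzip.compress(raw, compresslevel=level)


def decode_batch(blob):
    """Inverse of encode_batch, as the receiving side would run it."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Compressing a whole batch rather than each record matters: repeated keys and hostnames across records are exactly what gzip removes.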

Principle 4: no background busy work

If nothing is happening, the agent should consume nothing. No heartbeats firing every second, no log scans running on empty files, no metric poll waking the CPU from idle.

Steady-state idle should read 0.0% in top. Anything above that is a bug.
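
One way to avoid a fixed tick is to sleep until the next real deadline, and to sleep indefinitely when there is none. The sketch below assumes timers are kept in a heap keyed by due time; the Timers class and the task names in the usage note are hypothetical.

```python
import heapq


class Timers:
    """Deadline queue: tells the main loop how long it may sleep."""

    def __init__(self):
        self._heap = []

    def schedule(self, due, name):
        heapq.heappush(self._heap, (due, name))

    def next_timeout(self, now):
        """Seconds until the next deadline, or None if nothing is scheduled.

        None means "block until an event arrives", so an idle agent
        causes no wake-ups at all.
        """
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - now)

    def pop_due(self, now):
        """Remove and return the names of all timers due at `now`."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

The value from next_timeout() is meant to be passed straight to whatever blocking wait the agent uses (a selector, epoll, or similar), so timers and events share one sleep.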

Principle 5: shared frameworks

Inside the agent, frameworks are shared: one serialization layer, one transport layer, one scheduler. Adding a new capability means reusing existing primitives, not stamping out another copy of the boilerplate.

This keeps the binary small and the behavior consistent.
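
As a rough illustration of that structure, the sketch below has capabilities register a small collect function with a shared core, which owns serialization and transport. The AgentCore class, its methods, and the capability names are hypothetical and much simpler than a real agent.

```python
import json


class AgentCore:
    """Owns the shared layers; capabilities only supply data."""

    def __init__(self, send):
        self._send = send          # the one transport, injected
        self._collectors = {}

    def register(self, name, collect):
        """Add a capability: a name plus a zero-argument collect function."""
        self._collectors[name] = collect

    def run_once(self):
        """Collect from every capability and ship one serialized payload."""
        payload = {name: collect() for name, collect in self._collectors.items()}
        # The one serialization path, used by every capability.
        self._send(json.dumps(payload, sort_keys=True).encode("utf-8"))
```

A new capability is then a single register() call and a function, with no serializer or socket code of its own.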

Principle 6: no GUI, ever

GUIs are enormous. Adding Qt, GTK, or (worst of all) Electron to an agent doubles or triples its footprint. If your agent ships a GUI, it’s not lightweight.

What we measured

Benchmark: 100 endpoints, typical IT workload, 30-day observation window.

Our agent: p50 CPU 0.2%, p99 CPU 1.8% (during log bursts)
Legacy Agent A: p50 CPU 2.1%, p99 CPU 8.4%
Legacy Agent B: p50 CPU 1.5%, p99 CPU 12.1%

Memory:

Our agent: p50 RSS 38 MB, p99 RSS 52 MB
Legacy Agent A: p50 RSS 212 MB, p99 RSS 480 MB

These aren’t cherry-picked numbers; they’re steady-state under realistic load.
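
For readers who want to produce comparable figures, one common way to reduce raw samples to p50 and p99 is the nearest-rank percentile. The sketch below assumes that method and integer percentiles; the function name is hypothetical, and any sample values you feed it are your own, not data from this benchmark.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Sorts the samples and returns the value at rank ceil(p/100 * n),
    computed with integer arithmetic to avoid float rounding surprises.
    """
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100   # ceiling division
    return ordered[rank - 1]
```

Nearest-rank always returns a value that was actually observed, which makes a p99 easy to trace back to the interval that produced it.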

Why this matters to operators

A heavy agent is an alerting target all its own. You end up monitoring the monitor. You get paged when the agent consumes too much memory. You schedule maintenance to restart the monitoring agent. That’s not an RMM — that’s a second job.

Light agents get out of the way and let operators focus on the real systems.