First 30 minutes of an IT incident: what great teams do
The first 30 minutes of an IT incident make or break MTTR, and high-performing teams give those minutes a specific shape. Here are the concrete moves we see them make, and the anti-patterns that stretch incidents out everywhere else.
Minute 0-2: triage
The pager fires. Great teams:
- Acknowledge in under 60 seconds
- Open the affected dashboards before opening the chat
- Read the alert details twice before posting
Anti-pattern: opening Slack first. Every second you spend asking “is anyone else seeing this?” is a second not spent figuring out what’s happening.
Minute 2-5: confirm and assess
Great teams:
- Confirm the alert is real (not a monitor flap)
- Assess user impact (is this visible to customers yet?)
- Declare an incident with a severity tier
Anti-pattern: skipping the severity declaration. Without one, nobody knows whether this is a 3-person war room or an “I’ll handle it.”
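A severity declaration only helps if it’s mechanical to make. As a sketch, here’s one way to encode the call so nobody debates it at minute 3 — the tier names and criteria below are illustrative assumptions, not a standard:

```shell
#!/bin/sh
# Illustrative severity rubric. Tier names and criteria are assumptions,
# not a standard -- the point is that the call should be mechanical.
severity() {
  customer_visible=$1   # yes | no
  scope=$2              # all | partial | single
  if [ "$customer_visible" = "yes" ] && [ "$scope" = "all" ]; then
    echo "SEV1"   # full customer-facing outage: war room
  elif [ "$customer_visible" = "yes" ]; then
    echo "SEV2"   # partial customer impact: small response group
  else
    echo "SEV3"   # internal only: single responder
  fi
}

severity yes all      # prints SEV1
severity no single    # prints SEV3
```

Two inputs, one answer. The exact rubric matters less than the fact that everyone applies the same one.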
Minute 5-10: context gathering
Great teams:
- Pull the last hour of relevant metrics
- Check recent deploys (the #1 cause of incidents is a change)
- Scan the affected service’s logs for the first minute of the spike
- Identify likely subsystems
Anti-pattern: going deep on one hypothesis without considering others. Confirmation bias eats hours.
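The gathering steps above lend themselves to a checklist script, so nobody improvises them at 3 a.m. A minimal sketch, dry-run style — it prints the commands rather than running them, so they can be reviewed before anyone touches production. The service path, log unit, and metrics endpoint are placeholders for whatever your stack uses:

```shell
#!/bin/sh
# Dry-run context checklist: emit the commands for review (and pasting)
# instead of running them blind. Paths and endpoints are placeholders.
gather_context() {
  svc=$1
  since="${2:-1 hour ago}"
  echo "git -C /srv/$svc log --oneline --since='$since'"        # recent deploys
  echo "journalctl -u $svc --since '$since' | head -n 200"      # start of the spike
  echo "curl -s 'http://prometheus:9090/api/v1/query?query=rate(${svc}_errors_total[5m])'"  # error rate
}

gather_context checkout
```

One command, three leads: the change log, the logs at the moment of the spike, and the metric that fired. That’s usually enough to shortlist subsystems.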
Minute 10-20: hypothesis formation and first action
Great teams:
- State a hypothesis explicitly
- Identify the smallest safe action that would confirm or falsify it
- Take that action and observe
Anti-pattern: “let’s just restart it.” Restarting without a hypothesis is how you lose the state needed to understand the root cause.
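One lightweight way to enforce hypothesis-first action is a wrapper that refuses to run anything until a hypothesis is on record. This is a sketch of team discipline encoded in shell — the wrapper and its messages are ours, not a real tool:

```shell
#!/bin/sh
# Illustrative guard: refuse an action unless a hypothesis is stated.
# Team discipline as a wrapper, not a real tool.
act() {
  hypothesis=$1
  shift
  if [ -z "$hypothesis" ]; then
    echo "refusing: state a hypothesis first" >&2
    return 1
  fi
  echo "hypothesis: $hypothesis"
  echo "action: $*"
  # "$@" would be executed here, once the hypothesis is on record
}

act "connection pool exhausted after 14:02 deploy" kubectl rollout undo deploy/checkout
```

The side effect is the record itself: when the action doesn’t fix it, the stated hypothesis is what you falsified, and the next one is better informed.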
Minute 20-30: first mitigation
Great teams:
- Apply a mitigation (not necessarily the root-cause fix)
- Monitor for 5 minutes before claiming recovery
- Keep the incident open even after mitigation; root cause is still pending
Anti-pattern: closing the incident at mitigation time. Premature closure means the post-mortem never happens.
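The “monitor for 5 minutes” rule is easy to automate. A minimal sketch: hold the “recovered” claim until a health check — any command that exits 0 when healthy; yours will differ — stays green for a full window. The timings and the example endpoint are placeholders:

```shell
#!/bin/sh
# Illustrative recovery gate: only claim recovery after the health check
# passes for the whole window. Check command and timings are placeholders.
confirm_recovery() {
  check=$1              # command that exits 0 when healthy (word-split on purpose)
  window=${2:-300}      # seconds of sustained health required
  interval=${3:-15}     # seconds between checks
  elapsed=0
  while [ "$elapsed" -lt "$window" ]; do
    if ! $check; then
      echo "regressed after ${elapsed}s -- not recovered"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "healthy for ${window}s -- mitigation holding"
}

# e.g. confirm_recovery "curl -sf http://localhost:8080/healthz" 300 15
```

A single green check at minute 21 is not recovery; 20 consecutive green checks over 5 minutes is a claim you can defend in the channel.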
What great teams don’t do
- They don’t have 12 people on the call. They have 3-5 people with tightly scoped roles.
- They don’t speculate in the ticket. They state observations.
- They don’t skip the post-mortem because “we fixed it.”
- They don’t send in the same hero every time. Practice across the rotation is how the team gets better.
The rotation practice
The best teams run incident rotations where juniors lead under a senior’s supervision. The senior is there to stop a catastrophe, not to take over. This is how you scale incident response capability across the team, instead of relying on one or two firefighters.
The platform piece
A tool that removes “get to the affected system” from minutes 5-10 compresses the timeline more than anything else on this list. When access is ambient (click the alert, get a shell), context gathering collapses from ten minutes to one.
Try it yourself
LynxTrac is free forever for 2 servers — no credit card, no sales call. Start in under 2 minutes →