First 30 minutes of an IT incident: what great teams do

The first 30 minutes of an IT incident largely decide its MTTR (mean time to resolution). Great teams spend those minutes in a consistent, recognizable pattern. Here’s what we see, and the anti-patterns that make incidents worse.

Minute 0-2: triage

The pager fires. Great teams:

Acknowledge in under 60 seconds
Open the affected dashboards before opening the chat
Read the alert details twice before posting

Anti-pattern: opening Slack first. Every second you spend asking “is anyone else seeing this?” is a second not spent figuring out what’s happening.

Minute 2-5: confirm and assess

Great teams:

Confirm the alert is real (not a monitor flap)
Assess user impact (is this visible to customers yet?)
Declare an incident with a severity tier

Anti-pattern: skipping the severity declaration. Without one, nobody knows whether this is a 3-person war room or an “I’ll handle it.”
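The confirm-and-assess steps can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the three-tier severity scheme, and the default thresholds (three consecutive failures, two allowed state changes) are all assumptions made for the example.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical three-tier scheme; use whatever tiers your team defines.
    SEV1 = "customer-visible, widespread"
    SEV2 = "customer-visible, limited"
    SEV3 = "internal only"


def is_flapping(states: list[bool], max_transitions: int = 2) -> bool:
    """Return True when the monitor changed state more often than allowed.

    `states` holds recent check results, oldest first (True = failing).
    """
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return transitions > max_transitions


def is_real_alert(states: list[bool], min_consecutive_failures: int = 3) -> bool:
    """Treat the alert as real when the most recent checks all failed."""
    recent = states[-min_consecutive_failures:]
    return len(recent) == min_consecutive_failures and all(recent)


def assess_severity(customer_visible: bool, widespread: bool) -> Severity:
    """Map the impact assessment to a tier."""
    if not customer_visible:
        return Severity.SEV3
    return Severity.SEV1 if widespread else Severity.SEV2
```

A monitor that reports `[False, True, True, True]` passes `is_real_alert`, while one that alternates on every check is caught by `is_flapping`. Either way, the output of the assessment is an explicit tier, which is what tells everyone how big the response should be.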

Minute 5-10: context gathering

Great teams:

Pull the last hour of relevant metrics
Check recent deploys (a recent change is among the most common causes of incidents)
Scan the affected service’s logs for the first minute of the spike
Identify likely subsystems

Anti-pattern: going deep on one hypothesis without considering others. Confirmation bias eats hours.
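Checking recent deploys is easy to automate. The sketch below assumes you can get a list of deploy records from your CI/CD system; the `Deploy` type, its fields, and the one-hour lookback default are hypothetical and only mirror the “last hour” window used for metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    # Hypothetical record shape; adapt to what your deploy tooling exposes.
    service: str
    version: str
    finished_at: datetime


def recent_deploys(
    alert_start: datetime,
    deploys: list[Deploy],
    lookback: timedelta = timedelta(hours=1),
) -> list[Deploy]:
    """Return deploys that finished in the lookback window before the alert.

    Results are ordered most recent first, since the latest change is
    usually the first one worth inspecting.
    """
    window_start = alert_start - lookback
    candidates = [d for d in deploys if window_start <= d.finished_at <= alert_start]
    return sorted(candidates, key=lambda d: d.finished_at, reverse=True)
```

Listing every deploy in the window, across all services, also works against the anti-pattern above: it puts more than one candidate cause in front of you before you commit to a single hypothesis.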

Minute 10-20: hypothesis formation and first action

Great teams:

State a hypothesis explicitly
Identify the smallest safe action that would confirm or falsify it
Take that action and observe

Anti-pattern: “let’s just restart it.” Restarting without a hypothesis is how you lose the state needed to understand the root cause.
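One way to make “state a hypothesis explicitly” concrete is to give it a fixed shape. The record below is a hypothetical structure, not a standard: it forces you to write down the claim, the smallest safe action that tests it, and what you expect to see if the claim is true, before you touch anything.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    UNTESTED = "untested"
    CONFIRMED = "confirmed"
    FALSIFIED = "falsified"


@dataclass
class Hypothesis:
    statement: str          # what you believe is happening
    action: str             # smallest safe action that tests it
    expected_if_true: str   # what you should observe if the statement holds
    outcome: Outcome = Outcome.UNTESTED

    def resolve(self, observed_expected: bool) -> None:
        """Record the result after taking the action and observing."""
        self.outcome = Outcome.CONFIRMED if observed_expected else Outcome.FALSIFIED

    def as_note(self) -> str:
        """Render a one-line note suitable for the incident ticket."""
        return (
            f"Hypothesis: {self.statement} | Action: {self.action} | "
            f"Expect if true: {self.expected_if_true} | Outcome: {self.outcome.value}"
        )
```

A restart does not fit this shape: it has no statement and no expected observation, which is a quick signal that it is not yet a test of anything.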

Minute 20-30: first mitigation

Great teams:

Apply a mitigation (not necessarily the root-cause fix)
Monitor for 5 minutes before claiming recovery
Keep the incident open even after mitigation; root cause is still pending

Anti-pattern: closing the incident at mitigation time. Premature closure means the post-mortem never happens.
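The “monitor for 5 minutes” rule can be expressed as a simple check on the metric that triggered the alert. This sketch assumes an error-rate style metric where values at or below a threshold are healthy; the function name and sample format are made up for the example.

```python
from datetime import datetime, timedelta


def is_recovered(
    samples: list[tuple[datetime, float]],
    threshold: float,
    now: datetime,
    soak: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True only if the metric has stayed healthy for the full soak period.

    `samples` is a list of (timestamp, value) pairs in any order.
    Assumption: a value at or below `threshold` counts as healthy.
    """
    streak_start = None
    # Walk backwards from the newest sample until the first unhealthy one.
    for timestamp, value in sorted(samples, key=lambda s: s[0], reverse=True):
        if value > threshold:
            break
        streak_start = timestamp
    return streak_start is not None and now - streak_start >= soak
```

A single healthy reading right after the mitigation returns `False` here. Recovery is claimed only once the healthy streak is at least as long as the soak period, and even then the incident stays open for root cause.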

What great teams don’t do

They don’t have 12 people on the call. They have 3-5 people with tightly scoped roles.
They don’t speculate in the ticket. They state observations.
They don’t skip the post-mortem because “we fixed it.”
They don’t reuse the same hero every time. Practice is how the team gets better.

The rotation practice

The best teams run incident rotations where juniors lead under a senior’s supervision. The senior is there to stop a catastrophe, not to take over. This is how you scale incident response capability across the team, instead of relying on one or two firefighters.

The platform piece

A tool that removes the “get to the affected system” step from minute 5-10 compresses the timeline considerably. When access is ambient (click the alert, get a shell), context gathering collapses from 10 minutes to 1.

Try it yourself

LynxTrac is free forever for 2 servers — no credit card, no sales call. Start in under 2 minutes →
