First 30 minutes of an IT incident: what great teams do

The first 30 minutes of an IT incident decide the MTTR. Great teams have a specific shape to how they spend them. Here’s what we see, and the anti-patterns that make incidents worse.

Minute 0-2: triage

The pager fires. Great teams:

  • Acknowledge in under 60 seconds
  • Open the affected dashboards before opening the chat
  • Read the alert details twice before posting

Anti-pattern: opening Slack first. Every second you spend asking “is anyone else seeing this?” is a second not spent figuring out what’s happening.

Minute 2-5: confirm and assess

Great teams:

  • Confirm the alert is real (not a monitor flap)
  • Assess user impact (is this visible to customers yet?)
  • Declare an incident with a severity tier

Anti-pattern: skipping the severity declaration. Without one, nobody knows whether this is a 3-person war room or an “I’ll handle it.”
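The confirm-and-declare steps above can be sketched in a few lines of Python. This is only an illustration: the function names, the "three consecutive samples" rule, and the SEV1/SEV2/SEV3 cutoffs are all assumptions, not a standard. Use whatever tiers your team has agreed on.

```python
def is_sustained_breach(samples, threshold, min_consecutive=3):
    """Return True if the most recent `min_consecutive` samples all exceed
    the threshold. A single spike that already recovered (a flap) fails.
    Hypothetical helper; the default of 3 samples is an assumption."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def declare_severity(customer_visible, fraction_users_affected):
    """Map assessed user impact to a severity tier.
    The tier names and the 25% cutoff are illustrative assumptions."""
    if not customer_visible:
        return "SEV3"
    if fraction_users_affected >= 0.25:
        return "SEV1"
    return "SEV2"


# Example: error rate (%) sampled once per minute, alert threshold 5%.
flap = [1.0, 9.0, 1.2, 0.8, 1.1]   # one spike, then back to normal
real = [1.0, 6.5, 7.2, 8.9, 9.4]   # stays above the threshold

print(is_sustained_breach(flap, threshold=5.0))
print(is_sustained_breach(real, threshold=5.0))
print(declare_severity(customer_visible=True, fraction_users_affected=0.4))
```

Writing the tier rules down, even this crudely, means the severity call takes seconds rather than a debate.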

Minute 5-10: context gathering

Great teams:

  • Pull the last hour of relevant metrics
  • Check recent deploys (the #1 cause of incidents is a change)
  • Scan the affected service’s logs for the first minute of the spike
  • Identify likely subsystems

Anti-pattern: going deep on one hypothesis without considering others. Confirmation bias eats hours.
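As a rough sketch of the "check recent deploys" step, the snippet below filters a deploy log down to changes that landed in the hour before the incident started, newest first. The `deploys_before_incident` helper, the tuple layout, and the sample service names are hypothetical. In practice the list would come from your own CI/CD or deploy tracker.

```python
from datetime import datetime, timedelta


def deploys_before_incident(deploys, incident_start, window=timedelta(hours=1)):
    """Return deploys that landed within `window` before the incident began,
    newest first. `deploys` is a list of (service, deployed_at) tuples;
    this shape is an assumption made for the example."""
    window_start = incident_start - window
    recent = [d for d in deploys if window_start <= d[1] <= incident_start]
    return sorted(recent, key=lambda d: d[1], reverse=True)


# Made-up data for illustration only.
incident_start = datetime(2024, 1, 15, 14, 30)
deploys = [
    ("checkout-api", datetime(2024, 1, 15, 14, 12)),
    ("search", datetime(2024, 1, 15, 9, 5)),         # too old, excluded
    ("auth", datetime(2024, 1, 15, 13, 45)),
    ("checkout-api", datetime(2024, 1, 15, 14, 40)),  # after the incident began, excluded
]

for service, deployed_at in deploys_before_incident(deploys, incident_start):
    minutes_before = int((incident_start - deployed_at).total_seconds() // 60)
    print(f"{service}: deployed {minutes_before} min before the incident")
```

Sorting newest first puts the most likely suspect at the top, but every deploy in the window is still a candidate. This is the same point the anti-pattern makes about fixing on one hypothesis.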

Minute 10-20: hypothesis formation and first action

Great teams:

  • State a hypothesis explicitly
  • Identify the smallest safe action that would confirm or falsify it
  • Take that action and observe

Anti-pattern: “let’s just restart it.” Restarting without a hypothesis is how you lose the state needed to understand the root cause.

Minute 20-30: first mitigation

Great teams:

  • Apply a mitigation (not necessarily the root-cause fix)
  • Monitor for 5 minutes before claiming recovery
  • Keep the incident open even after mitigation; root cause is still pending

Anti-pattern: closing the incident at mitigation time. Premature closure means the post-mortem never happens.
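The "monitor for 5 minutes before claiming recovery" rule can be made mechanical. Here is a minimal sketch, assuming the metric arrives as (seconds since mitigation, value) pairs in time order. The `is_recovered` name and the sample layout are hypothetical.

```python
def is_recovered(samples, threshold, hold_seconds=300):
    """Return True only if the metric has stayed at or below `threshold`
    for at least `hold_seconds` without interruption, up to the latest sample.
    `samples` is a time-ordered list of (seconds_since_mitigation, value);
    this layout is an assumption made for the example."""
    if not samples:
        return False
    healthy_since = None
    for t, value in samples:
        if value <= threshold:
            if healthy_since is None:
                healthy_since = t   # start of the current healthy streak
        else:
            healthy_since = None    # any relapse resets the clock
    latest_time = samples[-1][0]
    return healthy_since is not None and latest_time - healthy_since >= hold_seconds


# Error rate (%) sampled every 60 seconds after the mitigation, threshold 5%.
steady = [(0, 9.0), (60, 3.0), (120, 2.0), (180, 1.5), (240, 1.0), (300, 1.2), (360, 1.1)]
relapse = [(0, 1.0), (60, 1.0), (120, 8.0), (180, 1.0), (240, 1.0), (300, 1.0), (360, 1.0), (420, 1.0)]

print(is_recovered(steady[:4], threshold=5.0))   # only 2 healthy minutes so far
print(is_recovered(steady, threshold=5.0))       # 5 uninterrupted healthy minutes
print(is_recovered(relapse, threshold=5.0))      # the spike at 120s reset the clock
```

A passing check only supports saying "mitigated." The incident itself stays open until the root cause is understood.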

What great teams don’t do

  • They don’t have 12 people on the call. They have 3-5 people, each with a tightly scoped role.
  • They don’t speculate in the ticket. They state observations.
  • They don’t skip the post-mortem because “we fixed it.”
  • They don’t reuse the same hero every time. Practice is how the team gets better.

The rotation practice

The best teams run incident rotations where juniors lead under a senior’s supervision. The senior is there to stop a catastrophe, not to take over. This is how you scale incident response capability across the team, instead of relying on one or two firefighters.

The platform piece

A tool that removes the “get to the affected system” step from minutes 5-10 compresses the timeline considerably. When access is ambient (click the alert, get a shell), the access part of context gathering can shrink from several minutes to about one.
