Day 22: Production Debugging: When It's On Fire
Picture the worst debugging scenario. Something’s broken in production. You can’t add console.log and redeploy. You can’t attach a debugger. You can’t reproduce it locally because you don’t know what’s causing it.
All you have is what’s already there: logs, error messages, metrics if you were smart enough to add them.
This is where AI earns its keep. Not because it has magic access to your systems, but because it’s good at exactly the thing you need: analyzing information and finding patterns. You paste in logs, stack traces, and error messages, and AI can often spot the anomaly you’d miss after staring at the screen for an hour.
The Incident Response Prompt
When things are on fire, start here:
Production incident in progress. Help me triage.
Error message:
[paste the error]
What I'm seeing:
[describe symptoms - errors, timeouts, user reports]
What changed recently:
[recent deploys, config changes, traffic patterns]
System overview:
[relevant architecture - what talks to what]
Help me:
1. What's the most likely cause?
2. What should I check first?
3. What's the fastest way to confirm?
4. What's the quickest mitigation (even if not a fix)?
Keep it short. You’re in a hurry.
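If you want this prompt ready before the pager goes off, one option is to keep it as a template and fill it in during the incident. Here is a minimal Python sketch of that idea. The names `TRIAGE_TEMPLATE` and `build_triage_prompt` are made up for illustration; the template text simply mirrors the prompt above.

```python
from textwrap import dedent

# The template mirrors the triage prompt above; the {placeholders} are filled in later.
TRIAGE_TEMPLATE = dedent("""\
    Production incident in progress. Help me triage.

    Error message:
    {error}

    What I'm seeing:
    {symptoms}

    What changed recently:
    {changes}

    System overview:
    {overview}

    Help me:
    1. What's the most likely cause?
    2. What should I check first?
    3. What's the fastest way to confirm?
    4. What's the quickest mitigation (even if not a fix)?
    """)


def build_triage_prompt(error: str, symptoms: str, changes: str, overview: str) -> str:
    """Fill the template. Empty fields become "[unknown]" so gaps stay visible."""
    fields = {
        "error": error,
        "symptoms": symptoms,
        "changes": changes,
        "overview": overview,
    }
    cleaned = {name: (value.strip() or "[unknown]") for name, value in fields.items()}
    return TRIAGE_TEMPLATE.format(**cleaned)
```

Marking empty fields as "[unknown]" rather than dropping them tells the AI (and you) what you haven't checked yet.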
Stack Trace Analysis Under Pressure
When you have a stack trace:
Quick analysis of this production error:
[paste stack trace]
Tell me:
1. What's the actual error (one sentence)?
2. Is this our code or a library?
3. What file and line to look at first?
4. Most likely cause?
No elaborate explanations needed. You need direction, not education.
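The "our code or a library" question can also get a rough first answer from a script. The sketch below assumes a Python-style traceback and assumes that any path containing `site-packages` or `dist-packages` is library code; both are assumptions about your environment, and `summarize_traceback` is a hypothetical helper, not part of any tool.

```python
import re

# Matches frame lines in a Python-style traceback: File "path", line N, in func
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')

# Assumption: third-party packages live under one of these directories.
LIBRARY_MARKERS = ("site-packages", "dist-packages")


def is_library(path: str) -> bool:
    """Guess whether a file path belongs to an installed library."""
    return any(marker in path for marker in LIBRARY_MARKERS)


def summarize_traceback(trace: str) -> dict:
    """Return the error line, whether it was raised in a library, and the last frame in our code."""
    frames = [
        {"path": m["path"], "line": int(m["line"]), "func": m["func"]}
        for m in FRAME_RE.finditer(trace)
    ]
    own = [f for f in frames if not is_library(f["path"])]
    text_lines = [ln.strip() for ln in trace.splitlines() if ln.strip()]
    return {
        "error": text_lines[-1] if text_lines else "",
        "raised_in_library": bool(frames) and is_library(frames[-1]["path"]),
        "look_first": own[-1] if own else None,
    }
```

The last frame in your own code is usually the better place to start reading, even when the exception was raised deep inside a library.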
Log Pattern Recognition
When you have logs but can’t see the pattern:
Find the pattern in these production logs.
These are from the last hour. Something is wrong.
[paste logs]
What I'm looking for:
- When did the problem start?
- What's different before vs after?
- Any correlation with specific users, endpoints, or data?
- What's the error rate pattern?
AI can often surface patterns in a large log dump faster than you can by scrolling through it manually.
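Before pasting an hour of logs, a quick count of errors per minute can answer "when did the problem start?" on its own, or at least tell you which slice of the logs to paste. This is a minimal sketch that assumes a made-up log format of `<ISO timestamp>Z <LEVEL> <rest>`; adjust the parsing to whatever your logs actually look like.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(log_lines):
    """Count ERROR lines per minute. Lines that don't match the assumed format are skipped."""
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2 or parts[1] != "ERROR":
            continue
        try:
            # Assumed timestamp format, e.g. 2024-05-01T12:01:10Z
            ts = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%SZ")
        except ValueError:
            continue
        counts[ts.replace(second=0)] += 1
    return dict(sorted(counts.items()))


def first_spike(counts, threshold):
    """Return the first minute with at least `threshold` errors, or None."""
    for minute, n in counts.items():
        if n >= threshold:
            return minute
    return None
```

The threshold is a judgment call: pick something clearly above your normal background error rate.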
The Correlation Prompt
When multiple things seem broken:
Multiple things are failing. Help me find the root cause.
Symptom 1: [describe]
Symptom 2: [describe]
Symptom 3: [describe]
Started at: [time]
Recent changes: [deploys, configs]
What single cause could explain all of these?
What should I check to confirm?
Often multiple symptoms have one root cause. AI helps find the connection.
Quick Mitigation Strategies
When you need to stop the bleeding:
Production is broken. I need mitigation options.
The problem: [describe]
The impact: [who's affected, how badly]
Rollback possible: [yes/no/partial]
Give me mitigation options ranked by:
1. Speed to implement
2. Risk of making things worse
3. Effectiveness
I need to stop the bleeding, then I can fix properly.
Sometimes the right answer is a workaround that buys you time.
Database Issue Diagnosis
Database problems need specific analysis:
Production database issue.
Symptoms:
[slow queries / connection errors / deadlocks / ???]
Current metrics:
- Connections: [number]
- Active queries: [number]
- Slow query log: [paste if relevant]
Recent database changes:
[migrations, new queries, data growth]
What's the likely cause?
What query would show me the problem?
What's the quick fix?
The Rollback Decision
When you’re not sure whether to roll back:
Deciding whether to roll back.
Current situation: [describe the problem]
Last deploy: [what changed, when]
Rollback would: [describe what gets reverted]
Rollback risks: [data migration issues, etc.]
Help me decide:
1. Is the problem likely caused by the deploy?
2. What would rollback fix?
3. What would rollback break?
4. Should I roll back or fix forward?
Post-Incident Analysis Prompt
After the fire is out:
We had an incident. Help me analyze it.
What happened:
[timeline of events]
Impact:
[duration, users affected, severity]
Root cause:
[what we found]
Help me create:
1. Timeline of events with gaps identified
2. What monitoring would have caught this sooner
3. What would have prevented this
4. Action items for follow-up
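The "gaps identified" part of the timeline is easy to check mechanically before you ask for help interpreting it. Here is a minimal sketch, assuming you have the timeline as (timestamp, description) pairs; the function name `find_gaps` and the 10-minute default are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(events, max_gap=timedelta(minutes=10)):
    """Return (before, after, gap) for consecutive events more than max_gap apart.

    events: iterable of (datetime, description) pairs, in any order.
    """
    ordered = sorted(events, key=lambda event: event[0])
    gaps = []
    for (t1, before), (t2, after) in zip(ordered, ordered[1:]):
        if t2 - t1 > max_gap:
            gaps.append((before, after, t2 - t1))
    return gaps
```

Each gap is a question for the post-mortem: what was happening between those two events, and why isn't it recorded?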
The “I Have No Idea” Prompt
Sometimes you’re truly stuck:
I'm stuck. Production is broken and I don't know why.
What I see: [symptoms]
What I've checked: [what you ruled out]
What I've tried: [attempted fixes]
I'm out of ideas. Help me:
1. What haven't I checked?
2. What assumptions might be wrong?
3. What's a completely different angle?
Admitting you’re stuck is the first step to getting unstuck.
Information Gathering Under Pressure
When you need to collect more data quickly:
I need more information to debug this. Generate the commands.
System: [Linux, AWS, Kubernetes, etc.]
Problem: [what's failing]
Generate commands to check:
1. System resources (CPU, memory, disk, network)
2. Process status
3. Recent logs
4. Database connections
5. Network connectivity
6. Service health
Just the commands, I'll run them.
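Once you have the commands, running them one by one and copying the output back is slow. A small wrapper can run a set of checks and collect the output in one place, ready to paste. This sketch assumes a Linux host where `uptime`, `df`, and `free` are installed; the `DEFAULT_CHECKS` names and the `run_checks` function are hypothetical, so swap in whatever commands fit your system.

```python
import subprocess

# Assumption: a Linux host with these standard tools available.
DEFAULT_CHECKS = {
    "load": ["uptime"],
    "disk": ["df", "-h"],
    "memory": ["free", "-m"],
}


def run_checks(checks=None, timeout=10):
    """Run each command and return {name: output}. Failures are reported, not raised."""
    checks = DEFAULT_CHECKS if checks is None else checks
    report = {}
    for name, cmd in checks.items():
        try:
            done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            # Fall back to stderr so a failing command still tells you why.
            report[name] = done.stdout.strip() or done.stderr.strip()
        except FileNotFoundError:
            report[name] = f"command not found: {cmd[0]}"
        except subprocess.TimeoutExpired:
            report[name] = f"timed out after {timeout}s"
    return report
```

Only put read-only commands in the list. During an incident you don't want a diagnostic script that changes anything.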
The Customer Impact Prompt
When you need to communicate:
Help me write a status update for customers.
What happened: [technical description]
Current status: [ongoing / resolved / monitoring]
Impact: [what customers experienced]
ETA: [if known]
Write a customer-facing update that is:
- Honest without being alarming
- Non-technical but not condescending
- Includes what we're doing about it
- Includes when we'll update next
Communication matters during incidents. AI can help you write clearly when you’re stressed.
Building Your Incident Playbook
After enough incidents, you have patterns. Document them:
I want to create an incident response playbook.
Common incident types we see:
1. [type 1]
2. [type 2]
3. [type 3]
For each type, help me create:
- Symptoms to look for
- First things to check
- Common causes
- Mitigation options
- Resolution steps
- Verification steps
Then when an incident happens, you have a starting point.
What AI Can’t Do in Incidents
AI can’t:
- Access your systems
- See your actual metrics
- Know your specific architecture
- Make decisions for you
- Tell you what changed recently
- Know your organizational context
You still need to:
- Gather the information
- Run the commands
- Make the judgment calls
- Communicate with stakeholders
- Execute the fixes
AI is your thinking partner, not your incident commander.
The Incident Checklist
Keep this handy:
□ Acknowledge the incident
□ Assess severity and impact
□ Notify stakeholders
□ Check recent changes
□ Gather logs and errors
□ Form hypothesis
□ Test hypothesis
□ Mitigate or fix
□ Verify resolution
□ Communicate resolution
□ Schedule post-mortem
Tomorrow
Incidents often come from edge cases you didn’t anticipate. Tomorrow I’ll show you how to use AI to find edge cases before they find you. Proactive problem-finding instead of reactive firefighting.
Try This Today
Think about your last production incident.
- What information did you have?
- What questions did you need answered?
- How would you have prompted AI?
Having prompts ready before incidents happen means you can move faster when they do.