Incident Response + Troubleshooting

Introduction

It is 3 a.m., your phone is buzzing nonstop, and the production service is completely down. What do you do? For an Internet team, the question is not whether failures will happen but when. A good team is not one that never has failures; it is one that responds quickly when they occur, recovers efficiently, and learns enough not to repeat them.

You will learn:

Severity tiers: P0 to P4
Response flow: from detection to recovery
Organizational collaboration: who plays which role
Alert escalation: making sure critical issues are never missed
Postmortem: the 5 Whys and a no-blame culture

Chapter	Content
1	Severity tiers
2	Response timeline
3	Incident command
4	Alert escalation
5	Postmortem

0. The big picture: failure is the best teacher

Netflix runs Chaos Monkey, a tool that randomly terminates servers in production. It sounds crazy, but the logic is clear: instead of waiting for failures to arrive, create them deliberately to train the team.

Incident response is not improvisation; it rests on three things together: process, roles, and tools. A fire department is not assembled when the fire starts: it trains, drills, and maintains its equipment every day.

4 core elements

Detect fast: good monitoring and alerting find problems before users do
Collaborate efficiently: clear roles and sync mechanisms prevent chaos
Recover fast: prioritize restoring service over finding the root cause. Stop the bleeding first, treat the wound later
Continuous improvement: every failure is a lesson, and the postmortem improves both the system and the process

1. Severity tiers

A wrong color on a page and a dead payment system do not belong to the same tier. Severity tiering lets the team respond with the right intensity: no overreaction that wastes resources, and no underreaction that lets the problem spread.

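Fatal incident (P0)

Definition: core business is completely unavailable, a large share of users is affected, and there is a risk of serious financial loss or data loss.

Response time: respond immediately, in place within 5 minutes

Notification channels: phone, SMS, instant messaging, email

Real-world examples:
Primary database down, all read and write requests fail
Payment system completely unavailable, users cannot place orders
Large-scale leak of user data

Response requirements:
The incident commander must be in place within 5 minutes
Management is updated on progress every 15 minutes
All relevant teams cancel leave and support immediately
The postmortem report is completed within 24 hours of the incident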

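Tier comparison at a glance

Tier	User impact	Response time	On-call requirement
P0	All users	Immediate, in place within 5 minutes	All hands
P1	Many users	Respond within 15 minutes	Core team
P2	Some users	Respond within 1 hour	On-call engineer
P3	Very few users	Acknowledge the same day, handle this week	Normal scheduling
P4	No direct impact	Scheduled by priority	No on-call needed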

Tier	Name	Impact	Response	Example
P0	Fatal	Core business completely unusable	Immediate, all hands	Payment down, data breach
P1	Critical	Core functionality severely degraded	Within 15 minutes	Login failure rate >50%, mass API timeouts
P2	Major	Some functionality abnormal	Within 1 hour	Wrong search results, HTTP 500 on some pages
P3	Minor	Non-core functionality abnormal	During working hours	Avatars fail to load, non-critical notifications delayed
P4	Trivial	UX issue	Planned into an iteration	UI misalignment, typo

Tiering principles

User impact: a P2 that affects 100% of users can be more urgent than a P1 that affects 1% of users
Business loss: anything that affects revenue (payment, orders) gets higher priority
Degradable: if a workaround exists, the tier can be downgraded
Dynamic: the tier can be adjusted as the investigation goes deeper
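
The tiering principles above can be sketched as a small classification function. This is only an illustration under stated assumptions: the `Impact` type, the scope names, and the rule of shifting exactly one tier for revenue impact or for an available workaround are hypothetical choices, not a standard. A real team would tune these rules to its own business.

```python
from dataclasses import dataclass

# Base tier per impact scope, following the tier table (P0 = 0 ... P4 = 4).
# The scope names are hypothetical labels chosen for this sketch.
BASE_LEVEL = {
    "core_down": 0,      # core business completely unusable
    "core_degraded": 1,  # core functionality severely degraded
    "partial": 2,        # some functionality abnormal
    "non_core": 3,       # non-core functionality abnormal
    "cosmetic": 4,       # UX issue
}


@dataclass
class Impact:
    scope: str
    revenue_impacting: bool = False  # "Business loss" principle
    has_workaround: bool = False     # "Degradable" principle


def classify_severity(impact: Impact) -> str:
    """Return a tier from P0 to P4 for the given impact."""
    level = BASE_LEVEL[impact.scope]
    if impact.revenue_impacting:
        level -= 1  # assumption: revenue impact raises urgency by one tier
    if impact.has_workaround:
        level += 1  # assumption: a workaround lowers urgency by one tier
    level = max(0, min(4, level))  # clamp to the P0..P4 range
    return f"P{level}"
```

For example, `classify_severity(Impact("partial", revenue_impacting=True))` returns `"P1"`. Because the tier is dynamic, the function can simply be called again whenever the investigation changes what is known about the impact.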

2. Response timeline

Incident response works like a relay race: each stage has a clear goal and a clear handoff.

Detect (T+0) → Triage (T+5min) → Mitigate (T+15min) → Resolve (T+1h) → Postmortem (T+48h)

5 stages

Detection: monitoring, users, or internal staff notice an anomaly. Goal: detect early and reduce MTTD (Mean Time To Detect).
Response: confirm the incident, evaluate severity, call in the team, and set up a communication channel. Goal: organize the response force.
Mitigation: take temporary action to restore service, such as rollback, switching to a backup, rate limiting, or degradation. Goal: stop the bleeding and restore the user experience.
Resolution: find the root cause and fix it. Goal: eliminate the problem and prevent recurrence.
Postmortem: review the incident, analyze the root cause, and define improvements. Goal: learn and make the system more robust.
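
The five stages above can be modeled as a small, strictly ordered state machine, which makes each handoff explicit. The sketch below is illustrative only: `Stage` and `Incident` are hypothetical names, and the strict one-step-forward rule is a simplifying assumption (real incidents sometimes fall back from resolution to mitigation).

```python
from enum import Enum


class Stage(Enum):
    DETECTION = 1
    RESPONSE = 2
    MITIGATION = 3
    RESOLUTION = 4
    POSTMORTEM = 5


class Incident:
    """Tracks which stage an incident is in and the path it took."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECTION
        self.history = [Stage.DETECTION]

    def advance(self) -> Stage:
        """Hand off to the next stage; the postmortem is the final stage."""
        if self.stage is Stage.POSTMORTEM:
            raise ValueError("incident has already reached the postmortem stage")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage
```

Calling `advance()` once per handoff keeps a record in `history` that can later feed the postmortem timeline.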

Metric	Meaning	How to improve
MTTD	Mean Time To Detect	Improve monitoring coverage, lower alert thresholds
MTTR	Mean Time To Recovery	Automated recovery, runbook drills
MTBF	Mean Time Between Failures	Improve reliability, eliminate single points of failure (SPOF)
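
A minimal sketch of how these three metrics could be computed from incident records. The `IncidentRecord` type and its fields are hypothetical, and the definitions are assumptions that vary between teams: here MTTD is measured from failure start to detection, MTTR from failure start to recovery, and MTBF as the average healthy time between one recovery and the next failure.

```python
from datetime import datetime
from statistics import mean
from typing import Iterable, List, NamedTuple


class IncidentRecord(NamedTuple):
    started: datetime    # when the failure actually began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def _mean_minutes(deltas: Iterable) -> float:
    """Average a collection of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def compute_metrics(records: List[IncidentRecord]) -> dict:
    """Compute MTTD, MTTR, and MTBF in minutes. Expects at least one record."""
    ordered = sorted(records, key=lambda r: r.started)
    mttd = _mean_minutes(r.detected - r.started for r in ordered)
    mttr = _mean_minutes(r.recovered - r.started for r in ordered)
    # Healthy time between the end of one incident and the start of the next.
    gaps = [nxt.started - prev.recovered for prev, nxt in zip(ordered, ordered[1:])]
    mtbf = _mean_minutes(gaps) if gaps else None  # undefined with a single incident
    return {"mttd_min": mttd, "mttr_min": mttr, "mtbf_min": mtbf}
```

Tracking these numbers per quarter shows whether monitoring changes shorten detection and whether runbook drills shorten recovery.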

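3. Incident Command: who is in charge?

In a large incident, the biggest danger is not technical difficulty but chaos: dozens of people debugging at once, nobody knowing what anyone else is doing, and key information scattered across many chat groups.

Incident Commander
Communications Lead
Operations Lead
Development Lead

Incident Commander

Core responsibilities:
1. Coordinate the entire incident response
2. Make key decisions (rollback, traffic switching, degradation, and so on)
3. Keep the roles working together efficiently and avoid chaos
4. Control the pace of the response and sync progress at regular intervals

Key skills: big-picture view, decision making, communication and coordination, stress management

Typical phrasing:

"Current status: payment service unavailable. Ops is investigating the database, the backend team is preparing a rollback plan, and comms will sync every 10 minutes."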

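Simulated scenario: a P0 incident in the payment system

14:02 Monitoring: payment success rate plunges from 99.9% to 12%, triggering a P0 alert
14:03 Commander: confirms the P0 incident, opens the incident channel, and calls in all roles
14:05 Comms: notifies management and updates the status page to "service degraded"
14:08 Ops: finds the primary database node at 100% CPU with the connection pool exhausted
14:10 Dev: traces the root cause to a slow query shipped yesterday
14:12 Commander: decides to roll back yesterday's change immediately and fail over the database to the replica
14:15 Ops: database failover complete, connections restored
14:18 Dev: code rollback deployed
14:20 Comms: payment success rate back to 99.8%, notifies all parties that service is restored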
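
The scenario above can be captured as a structured timeline, which makes it easy to measure how long each phase took and to reuse the log in the postmortem. This is a sketch only: the `Entry` type and helper names are hypothetical, the notes are condensed from the scenario, and all times are assumed to fall on the same day.

```python
from datetime import datetime
from typing import List, NamedTuple


class Entry(NamedTuple):
    time: str  # "HH:MM"; all entries are assumed to be on the same day
    role: str
    note: str


TIMELINE = [
    Entry("14:02", "Monitoring", "Payment success rate drops from 99.9% to 12%; P0 alert fires"),
    Entry("14:03", "Commander", "Confirms P0, opens incident channel, calls in all roles"),
    Entry("14:05", "Comms", "Notifies management, sets status page to 'service degraded'"),
    Entry("14:08", "Ops", "Primary database node at 100% CPU, connection pool exhausted"),
    Entry("14:10", "Dev", "Root cause traced to a slow query shipped yesterday"),
    Entry("14:12", "Commander", "Decides to roll back the change and fail over the database"),
    Entry("14:15", "Ops", "Database failover complete, connections restored"),
    Entry("14:18", "Dev", "Code rollback deployed"),
    Entry("14:20", "Comms", "Payment success rate back to 99.8%, all parties notified"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from one "HH:MM" timestamp to a later one on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def entries_for(role: str) -> List[Entry]:
    """All timeline entries logged by one role."""
    return [entry for entry in TIMELINE if entry.role == role]
```

Here `minutes_between(TIMELINE[0].time, TIMELINE[-1].time)` gives 18, the total outage duration in this scenario, and `minutes_between("14:02", "14:12")` gives 10, the time from detection to the commander's rollback decision.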

3 core roles

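Incident Commander (IC): has overall responsibility. Decides, coordinates, and sets the pace. The IC does not need to be the strongest engineer in the room, but must be the calmest and must keep the global view.
Communication Lead: handles external communication, including status page updates, customer notifications, and syncing with management. This frees the IC and the engineers to focus on solving the problem.
Tech Lead: leads the technical investigation and the fix. Organizes the team and reports progress and proposed solutions to the IC.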

4. Alert escalation

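The alert system is the "eyes" of incident response. Too few alerts and you miss problems; too many and you get alert fatigue (with hundreds of alerts a day, the truly important ones are drowned out).

Monitoring system (T+0s): Prometheus detects that the database connection pool is exhausted and all queries are timing out; a P0 alert fires automatically
On-call engineer (T+30s): the on-call DBA is notified by phone, SMS, and instant messaging at the same time
Team leads (T+5min): automatic escalation to the database team lead and the backend team lead
Technical director (T+15min): the problem is still not mitigated, so it escalates automatically to the technical director
VP / CTO (T+30min): the major incident escalates to the executive level, which prepares external communication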

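Escalation rules

P3/P4 alerts: notify the on-call engineer only, no escalation
P2 alerts: escalate to the team lead if not acknowledged within 15 minutes
P1 alerts: escalate if not acknowledged within 5 minutes; escalate to the director if not resolved within 30 minutes
P0 alerts: notify the whole chain immediately; escalate to VP/CTO if not mitigated within 15 minutes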
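
The escalation rules above can be expressed as a pure function that decides who should be notified at a given moment. This is a sketch under assumptions, not a real paging integration: the function and constant names are hypothetical, "the whole chain" for P0 is assumed to mean on-call engineer, team lead, and director, and the P1 "not resolved" rule is approximated with a single `mitigated` flag.

```python
from typing import List

ON_CALL = "on-call engineer"
TEAM_LEAD = "team lead"
DIRECTOR = "director"
EXECUTIVE = "VP/CTO"


def notify_targets(severity: str, minutes_elapsed: int,
                   acked: bool, mitigated: bool) -> List[str]:
    """Return everyone who should have been notified by now for this alert."""
    if severity in ("P3", "P4"):
        return [ON_CALL]  # never escalates
    if severity == "P2":
        targets = [ON_CALL]
        if not acked and minutes_elapsed >= 15:
            targets.append(TEAM_LEAD)
        return targets
    if severity == "P1":
        targets = [ON_CALL]
        if not acked and minutes_elapsed >= 5:
            targets.append(TEAM_LEAD)
        if not mitigated and minutes_elapsed >= 30:
            targets.append(DIRECTOR)
        return targets
    if severity == "P0":
        # Assumption: "the whole chain" means these three roles at once.
        targets = [ON_CALL, TEAM_LEAD, DIRECTOR]
        if not mitigated and minutes_elapsed >= 15:
            targets.append(EXECUTIVE)
        return targets
    raise ValueError(f"unknown severity: {severity!r}")
```

Keeping the policy as a side-effect-free function makes it easy to test the thresholds separately from whatever tool actually sends the phone calls and messages.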

3-tier escalation

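L1 (first responder): an alert fires and the on-call engineer is notified. If it is not acknowledged within 15 minutes, it escalates automatically.
L2: the team lead and domain experts are notified. If the incident is not mitigated within 30 minutes, escalation continues.
L3: the CTO and management are notified, and full emergency response begins.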

Tier	Notify via	SLA	Escalation condition
Warning	IM	During working hours	Not recovered within 30 minutes
Critical	Phone + IM	Acknowledge within 15 minutes	Not acknowledged or not mitigated
Fatal	Repeated phone calls + SMS	Respond within 5 minutes	Automatically escalated to management

5. Postmortem: learning from failure

After recovery, the most important step is the postmortem. Its purpose is not to assign blame but to find systemic improvements. Google, Meta, and other large tech companies follow the "blameless postmortem" practice: the focus is on "how did the system allow this error", not "who caused it".

Symptom: the payment system was completely unavailable for 18 minutes during peak hours.

Postmortem report template

1. Incident summary
2. Timeline
3. Impact assessment
4. Root cause analysis
5. Improvement actions
6. Lessons learned

"5 Whys"

Starting from the symptom, keep asking "why" until you reach the root cause:

Why did the service go down? → The DB connection pool was exhausted
Why was the pool exhausted? → Slow queries held connections without releasing them
Why were there slow queries? → A missing index caused full table scans
Why was the index missing? → The new table launched without a DBA review
Why was there no review? → There is no SQL audit process

The root cause is not "one person forgot to add an index" but "there is no SQL audit process". Fixing the root cause is what prevents recurrence.
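
The why chain above can be stored as data and rendered into the root cause section of a postmortem report. This is a minimal sketch: the names `FIVE_WHYS`, `root_cause`, and `render_five_whys` are hypothetical, and treating the last answer in the chain as the root cause is a simplifying assumption.

```python
from typing import List, Tuple

WhyChain = List[Tuple[str, str]]  # (question, answer) pairs, in order

FIVE_WHYS: WhyChain = [
    ("Why did the service go down?", "The DB connection pool was exhausted"),
    ("Why was the pool exhausted?", "Slow queries held connections without releasing them"),
    ("Why were there slow queries?", "A missing index caused full table scans"),
    ("Why was the index missing?", "The new table launched without a DBA review"),
    ("Why was there no review?", "There is no SQL audit process"),
]


def root_cause(chain: WhyChain) -> str:
    """Assumption: the answer to the last 'why' is the systemic root cause."""
    if not chain:
        raise ValueError("the why chain is empty")
    return chain[-1][1]


def render_five_whys(symptom: str, chain: WhyChain) -> str:
    """Format the chain as plain text for a postmortem report."""
    lines = [f"Symptom: {symptom}"]
    for depth, (question, answer) in enumerate(chain, start=1):
        lines.append(f"{depth}. {question} -> {answer}")
    lines.append(f"Root cause: {root_cause(chain)}")
    return "\n".join(lines)
```

Note that every answer in the chain describes the system or the process, never a person, which keeps the output consistent with a blameless postmortem.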

Summary

Incident response is a capability every team needs. It does not depend on individual heroes but on a systematic process, clear roles, and continuous improvement.

Tiered response: P0 to P4 ensure the response matches the severity
Clear timeline: detect → respond → mitigate → resolve → postmortem
Command: IC, Comms, and Tech split the work to avoid chaos
Alert escalation: tiers plus automatic escalation ensure critical issues are never missed
Blameless postmortem: 5 Whys, with a focus on system improvement

2026 notes for developers in Vietnam

Modern tools in 2026:
- PagerDuty: the mainstream choice for incident management
- Opsgenie / FireHydrant: alternatives to PagerDuty
- Rootly / incident.io: incident management with generative AI features
- Datadog / New Relic Incident Intelligence: AIOps anomaly detection
Vietnam context:
- Banks in Vietnam are subject to strict incident reporting regulation from the State Bank of Vietnam (SBV)
- E-commerce in Vietnam (Shopee, Tiki): mature SRE culture
- Startups: Sentry plus Slack alerts is enough
AI era:
- AIOps: ML detects anomalies before an alert fires
- AI postmortems: an LLM drafts the postmortem from logs and Slack
- ChatOps: a Slack bot drives incident response through chat
Exercise: run one chaos engineering exercise per quarter
