Must-know terminologies for SRE

Blameless Postmortem

Updated October 8, 2025

•3 min read

Abhishek Reddy A N

A blameless postmortem is a retrospective held after an incident, where the focus is on understanding what happened and why rather than on assigning blame.

🔍 Definition

A Blameless Postmortem is a detailed, objective report created after an incident or outage, focusing on:

What happened
Why it happened
How we can prevent it from happening again
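As a rough illustration, the three questions above can be captured in a small record type. This is a minimal sketch: the `Postmortem` class and its field names are hypothetical, not a standard template. Note that there is deliberately no field for "who did it".

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical postmortem record; field names are illustrative only."""

    title: str
    what_happened: str
    why_it_happened: str
    prevention_actions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the report as plain text, one section per question."""
        lines = [
            f"Postmortem: {self.title}",
            f"What happened: {self.what_happened}",
            f"Why it happened: {self.why_it_happened}",
            "How we prevent it:",
        ]
        lines += [f"  - {action}" for action in self.prevention_actions]
        return "\n".join(lines)


# Example values are made up for illustration.
report = Postmortem(
    title="Production namespace deleted",
    what_happened="A Kubernetes namespace was deleted in production.",
    why_it_happened="Staging and production shared the same permissions.",
    prevention_actions=[
        "Enforce role-based access per environment",
        "Require peer review for high-risk commands",
    ],
)
print(report.render())
```

Keeping the record free of individual names nudges the write-up toward systems and processes rather than people.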

👉 The key idea:

No one gets blamed.
Instead of pointing fingers, the team focuses on improving systems and processes.

⚙️ Why It’s Important

When incidents happen (and they will), the goal is to:

Learn from the failure
Improve reliability
Encourage open, honest communication

If people fear punishment, they’ll hide mistakes, and the organization won’t learn.
A blameless culture ensures everyone can talk openly about what went wrong — and why — without fear.

🧩 Real-World Example

🧨 Scenario:

Your web service went down for 45 minutes because a DevOps engineer accidentally ran a command that deleted a Kubernetes namespace in production.

🚫 What not to do:

“It’s Rahul’s fault — he made a mistake. He should be more careful.”

This creates fear and resentment, and it encourages people to hide future mistakes.

✅ Blameless Postmortem approach:

“A manual deletion command was executed in production because the same permissions were used for staging and production. We’ll fix this by enforcing role-based access and requiring peer review for high-risk commands.”

Focus shifts from “Who did it?” → “Why was it possible?”
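One way to picture the fix described above is a small policy check that runs before a command is executed. This is only a sketch under stated assumptions: the role name `prod-operator`, the keyword list, and the `may_run` function are all hypothetical, and a real setup would more likely rely on the platform's own access controls.

```python
from typing import Optional

# Assumed policy values, chosen for illustration only.
HIGH_RISK_KEYWORDS = ("delete", "drop", "destroy")
PROD_ROLES = {"prod-operator"}


def is_high_risk(command: str) -> bool:
    """Treat a command as high risk if any token is a destructive keyword."""
    return any(token in HIGH_RISK_KEYWORDS for token in command.lower().split())


def may_run(command: str, environment: str, role: str,
            reviewer: Optional[str] = None) -> bool:
    """Decide whether a command may run.

    Outside production everything is allowed. In production the caller needs
    a production role, and high-risk commands also need a named reviewer.
    """
    if environment != "production":
        return True
    if role not in PROD_ROLES:
        return False
    if is_high_risk(command) and not reviewer:
        return False
    return True


cmd = "kubectl delete namespace payments"
print(may_run(cmd, "staging", "developer"))                              # True
print(may_run(cmd, "production", "developer"))                           # False
print(may_run(cmd, "production", "prod-operator"))                       # False
print(may_run(cmd, "production", "prod-operator", reviewer="teammate"))  # True
```

The point of the sketch is that the safeguard lives in the system, so it does not depend on any one engineer being careful.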

⚙️ Postmortem vs. Blameless Postmortem

| Feature | Postmortem | Blameless Postmortem |
| --- | --- | --- |
| Definition | A report or analysis written after an incident or outage to understand what happened and how to prevent it in the future. | A postmortem that intentionally avoids blaming individuals, and instead focuses on systemic causes, process improvements, and learning. |
| Goal | To document and analyze the cause of an outage or issue. | To learn from incidents without fear or punishment, and make systems and processes more resilient. |
| Tone / Culture | Can sometimes be punitive: individuals might be blamed for mistakes. | Psychologically safe: everyone can admit errors openly because no one is punished. |
| Focus | “Who did this?” or “Whose fault was it?” | “Why did this happen?” and “How can we improve so it doesn’t happen again?” |
| Outcome | Fixes are often short-term or reactive (e.g., “Don’t do that again”). | Fixes are systemic and preventive (e.g., “Add automation or access control to avoid manual errors”). |
| Team Behavior | Engineers may hide mistakes to avoid criticism. | Engineers feel safe to report issues quickly and honestly. |
| Example | “The outage occurred because Ramesh pushed a bad config file.” | “The outage occurred because our deployment process allowed an untested config to go to production. We’ll add CI/CD validation to prevent this.” |
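The "CI/CD validation" mentioned in the last row could look something like the check below. It is a minimal sketch, assuming the config is JSON and that the required keys are `service`, `replicas`, and `image`; both the schema and the `validate_config` function are hypothetical.

```python
import json

# Assumed schema for illustration; a real service would define its own.
REQUIRED_KEYS = {"service", "replicas", "image"}


def validate_config(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the config may ship."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top-level value must be an object"]

    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    if "replicas" in config:
        replicas = config["replicas"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(replicas, int) and not isinstance(replicas, bool)
        if not is_int or replicas < 1:
            problems.append("replicas must be a positive integer")
    return problems


good = '{"service": "payments", "replicas": 3, "image": "payments:1.2"}'
bad = '{"service": "payments", "replicas": 0}'
print(validate_config(good))
print(validate_config(bad))
```

A pipeline step could call this function and fail the build whenever the returned list is not empty, so an untested config never reaches production.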

🧩 Real-World Example

🔴 Traditional Postmortem:

Incident: The payment API went down for 1 hour.
Root Cause: A developer ran the wrong command in production.
Action: Warned the developer and asked them to be more careful next time.

🔸 Problem: No systemic improvement. The same mistake could happen again.

🟢 Blameless Postmortem:

Incident: The payment API went down for 1 hour.
Root Cause: Production and staging environments share the same credentials, allowing manual access without approval.
Action:

Implemented separate credentials for prod/staging.

Added pre-deployment checks and role-based access.

Updated documentation for all engineers.

✅ Focus: Process improvement, not human fault.
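To keep the first action item from regressing, a pre-deployment check could flag any environments that share a credential. The sketch below assumes a mapping from environment name to a credential identifier (such as a service account name, never the secret itself); the function name and the sample identifiers are hypothetical.

```python
def shared_credentials(credentials: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of environments that use the same credential identifier."""
    envs = sorted(credentials)
    return [
        (first, second)
        for index, first in enumerate(envs)
        for second in envs[index + 1:]
        if credentials[first] == credentials[second]
    ]


# Made-up identifiers for illustration.
before = {"staging": "svc-deploy", "production": "svc-deploy"}
after = {"staging": "svc-deploy-staging", "production": "svc-deploy-prod"}

print(shared_credentials(before))  # [('production', 'staging')]
print(shared_credentials(after))   # []
```

Running a check like this on every deployment turns the lesson from the incident into something the process enforces automatically.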
