<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Must known terminologies for SRE]]></title><description><![CDATA[Must known terminologies for SRE]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 25 Jun 2026 08:06:27 GMT</lastBuildDate><atom:link href="https://must-known-terminologies-for-sre.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Enterprise governance.]]></title><description><![CDATA[Enterprise Governance is the framework of policies, procedures, and controls that ensures an organization’s IT and business operations are:

Aligned with business objectives

Compliant with regulations and standards (like ISO 27001, SOC 2, GDPR)

Sec...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/enterprise-governance</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/enterprise-governance</guid><category><![CDATA[Enterprise governance]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 05:44:19 GMT</pubDate><content:encoded><![CDATA[<p>Enterprise Governance is the <strong>framework of policies, procedures, and controls</strong> that ensures an organization’s IT and business operations are:</p>
<ul>
<li><p><strong>Aligned with business objectives</strong></p>
</li>
<li><p><strong>Compliant with regulations and standards</strong> (like ISO 27001, SOC 2, GDPR)</p>
</li>
<li><p><strong>Secure, auditable, and risk-managed</strong></p>
</li>
</ul>
<p>In IT and cloud environments, enterprise governance covers:</p>
<ul>
<li><p>Security policies (access control, incident response)</p>
</li>
<li><p>Compliance enforcement (CIS Benchmarks, vulnerability remediation)</p>
</li>
<li><p>Risk management and reporting</p>
</li>
<li><p>Cloud resource usage policies</p>
</li>
</ul>
<hr />
<h3 id="heading-tools-amp-practices">⚙️ <strong>Tools &amp; Practices</strong></h3>
<ul>
<li><p><strong>Cloud Governance</strong>: AWS Organizations, Azure Policy, GCP Organization Policy</p>
</li>
<li><p><strong>Compliance Tools</strong>: CIS-CAT, Qualys, Cloud Security Posture Management (CSPM) tools</p>
</li>
<li><p><strong>Monitoring &amp; Reporting</strong>: Logging, auditing, and dashboards for security &amp; compliance</p>
</li>
</ul>
<h3 id="heading-real-world-example"><strong>Real-World Example</strong></h3>
<ul>
<li><p>Defining <strong>cloud policies</strong> for multi-account AWS environments to enforce encryption, IAM restrictions, and tagging standards</p>
</li>
<li><p>Auditing access and usage of cloud resources for compliance</p>
</li>
<li><p>Implementing automated <strong>policy enforcement</strong> to ensure CIS Benchmark compliance and security governance.</p>
</li>
</ul>
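<p>To make the idea of automated policy enforcement concrete, here is a minimal Python sketch that checks resource records against an encryption and tagging policy. The tag names, the record layout, and the <code>policy_violations</code> helper are assumptions made for illustration; a real setup would read resources from the cloud provider and would more likely rely on the provider's own policy services.</p>

```python
# Hypothetical policy: every resource must be encrypted and carry these tags.
REQUIRED_TAGS = {"owner", "environment", "cost-center"}


def policy_violations(resource):
    """Return human-readable policy violations for one resource record."""
    violations = []
    if not resource.get("encrypted", False):
        violations.append(f"{resource['name']}: encryption is not enabled")
    missing = sorted(REQUIRED_TAGS - set(resource.get("tags", {})))
    if missing:
        violations.append(f"{resource['name']}: missing tags {', '.join(missing)}")
    return violations


# Illustrative inventory; a real check would pull this from the cloud provider.
inventory = [
    {"name": "orders-db", "encrypted": True,
     "tags": {"owner": "payments", "environment": "prod", "cost-center": "cc-42"}},
    {"name": "scratch-bucket", "encrypted": False, "tags": {"owner": "data"}},
]

for item in inventory:
    for violation in policy_violations(item):
        print(violation)
```

<p>Running a check like this on a schedule, or as a pipeline gate, is one way to turn a written policy into something that is actually enforced.</p>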
]]></content:encoded></item><item><title><![CDATA[Secrets Management]]></title><description><![CDATA[Secrets Management is the process of securely storing, accessing, and rotating sensitive information such as:

API keys

Passwords

SSH keys

Certificates

Tokens


It ensures that sensitive data is not hardcoded in code or configuration files, reduc...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/secrets-management</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/secrets-management</guid><category><![CDATA[secrets management]]></category><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 05:42:59 GMT</pubDate><content:encoded><![CDATA[<p>Secrets Management is the <strong>process of securely storing, accessing, and rotating sensitive information</strong> such as:</p>
<ul>
<li><p>API keys</p>
</li>
<li><p>Passwords</p>
</li>
<li><p>SSH keys</p>
</li>
<li><p>Certificates</p>
</li>
<li><p>Tokens</p>
</li>
</ul>
<p>It ensures that sensitive data is <strong>not hardcoded in code or configuration files</strong>, reducing the risk of unauthorized access.</p>
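<p>As a small illustration of keeping secrets out of code, the Python sketch below reads a credential from the environment at runtime and refuses to start without it. The variable name <code>DB_PASSWORD</code> and the helpers <code>get_secret</code> and <code>masked</code> are hypothetical; in practice the value would usually be injected by a secrets manager or the deployment platform.</p>

```python
import os


def get_secret(name):
    """Read a secret from the environment and fail loudly if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name!r} is not set in the environment")
    return value


def masked(value):
    """Return a log-safe placeholder that never reveals the secret itself."""
    return "*" * 8 if value else "(empty)"


if __name__ == "__main__":
    # Demo value only, so the sketch runs; never commit real secrets.
    os.environ.setdefault("DB_PASSWORD", "example-only")
    password = get_secret("DB_PASSWORD")
    print("DB_PASSWORD loaded:", masked(password))
```

<p>Failing fast on a missing secret is a deliberate choice here: a clear startup error is easier to debug than a connection failure later on.</p>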
<h3 id="heading-cloud-and-enterprise-tools">⚙️ <strong>Cloud and Enterprise Tools</strong></h3>
<ul>
<li><p><strong>AWS Secrets Manager</strong></p>
</li>
<li><p><strong>Azure Key Vault</strong></p>
</li>
<li><p><strong>HashiCorp Vault</strong></p>
</li>
<li><p><strong>GCP Secret Manager</strong></p>
</li>
<li><p><strong>Kubernetes Secrets</strong> (only base64-encoded by default, so enable encryption at rest and restrict access with RBAC)</p>
</li>
</ul>
<p>Best practices include:</p>
<ul>
<li><p>Encrypting secrets at rest and in transit</p>
</li>
<li><p>Automating secret rotation</p>
</li>
<li><p>Limiting access via role-based access control (RBAC)</p>
</li>
<li><p>Auditing access logs</p>
</li>
</ul>
<h3 id="heading-real-world-example">💼 <strong>Real-World Example</strong></h3>
<ul>
<li><p>Storing database credentials in <strong>AWS Secrets Manager</strong> instead of code</p>
</li>
<li><p>Configuring <strong>CI/CD pipelines</strong> to retrieve secrets dynamically for deployments</p>
</li>
<li><p>Implementing <strong>periodic secret rotation</strong> and logging access for compliance</p>
</li>
</ul>
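<p>Periodic rotation can be monitored with a simple age check. The sketch below assumes a 90-day rotation policy and a hypothetical <code>needs_rotation</code> helper; the managed services listed above can rotate secrets on their own, so this only shows the logic behind such a check.</p>

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rotation policy: rotate every 90 days.
MAX_SECRET_AGE = timedelta(days=90)


def needs_rotation(last_rotated, now=None):
    """Return True when a secret is older than the rotation policy allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_rotated > MAX_SECRET_AGE


# Fixed timestamps keep the example deterministic.
today = datetime(2025, 10, 8, tzinfo=timezone.utc)
print(needs_rotation(datetime(2025, 9, 1, tzinfo=timezone.utc), now=today))
print(needs_rotation(datetime(2025, 1, 1, tzinfo=timezone.utc), now=today))
```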
]]></content:encoded></item><item><title><![CDATA[Vulnerability Scanning]]></title><description><![CDATA[🔍 What is Vulnerability Scanning?
Vulnerability scanning is the process of identifying security weaknesses in systems, applications, or networks using automated tools. It helps organizations detect misconfigurations, missing patches, outdated softwar...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/vulnerability-scanning</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/vulnerability-scanning</guid><category><![CDATA[vulnerability scanning ]]></category><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 05:42:12 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-what-is-vulnerability-scanning">🔍 <strong>What is Vulnerability Scanning?</strong></h3>
<p>Vulnerability scanning is the <strong>process of identifying security weaknesses</strong> in systems, applications, or networks using automated tools.<br />It helps organizations <strong>detect misconfigurations, missing patches, outdated software, and other potential security risks</strong> before attackers can exploit them.</p>
<p>In cloud environments (AWS, Azure, GCP), vulnerability scanning is the <strong>process of detecting security risks across cloud resources</strong> — including virtual machines, containers, storage, and network configurations.</p>
<h3 id="heading-common-tools">⚙️ <strong>Common Tools</strong></h3>
<ul>
<li><p><strong>Nessus</strong></p>
</li>
<li><p><strong>OpenVAS</strong></p>
</li>
<li><p><strong>Qualys</strong></p>
</li>
<li><p><strong>Rapid7 InsightVM</strong></p>
</li>
<li><p><strong>Nmap (with vulnerability scripts)</strong></p>
</li>
</ul>
<p>When scanning cloud resources, teams typically:</p>
<ul>
<li><p>Detect <strong>unpatched OS images</strong>, <strong>misconfigured cloud services</strong>, <strong>exposed endpoints</strong>, or <strong>non-compliant IAM policies</strong>.</p>
</li>
<li><p>Use <strong>cloud-native tools</strong> like <strong>Amazon Inspector, Microsoft Defender for Cloud (formerly Azure Defender), or Google Cloud's Web Security Scanner</strong>, along with third-party tools like Nessus or Qualys.</p>
</li>
<li><p>Maintain <strong>compliance standards</strong> (CIS Benchmarks, NIST, SOC 2) and <strong>reduce the cloud attack surface</strong>.</p>
</li>
</ul>
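<p>At its core, a patch-level scan compares what is installed against the version in which a known issue was fixed. The Python sketch below shows that comparison on made-up data; the package names, versions, and the <code>find_outdated</code> helper are all hypothetical, and real scanners such as those listed above use curated vulnerability feeds and far richer version rules.</p>

```python
def parse_version(text):
    """Turn '1.2.10' into (1, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def find_outdated(installed, minimum_patched):
    """Return (name, installed, fixed_in) for packages below the patched version."""
    findings = []
    for name, version in sorted(installed.items()):
        fixed_in = minimum_patched.get(name)
        if fixed_in and parse_version(fixed_in) > parse_version(version):
            findings.append((name, version, fixed_in))
    return findings


# Illustrative data only; these are not real packages or advisories.
installed = {"examplelib": "1.2.9", "othertool": "3.0.1"}
minimum_patched = {"examplelib": "1.2.10"}

for name, version, fixed_in in find_outdated(installed, minimum_patched):
    print(f"{name} {version} is below patched version {fixed_in}")
```

<p>Parsing versions into number tuples matters: compared as plain strings, "1.2.9" would wrongly sort after "1.2.10".</p>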
]]></content:encoded></item><item><title><![CDATA[CIS Benchmarks]]></title><description><![CDATA[Center for Internet Security (CIS).
CIS Benchmarks are well-defined, consensus-based best practices to securely configure operating systems, network devices, applications, and cloud services. They help organizations reduce vulnerabilities, improve sec...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/cis-benchmarks</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/cis-benchmarks</guid><category><![CDATA[CIS]]></category><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 05:40:20 GMT</pubDate><content:encoded><![CDATA[<p><strong>CIS</strong> stands for the <strong>Center for Internet Security</strong>.</p>
<p>CIS Benchmarks are <strong>well-defined, consensus-based best practices</strong> to securely configure operating systems, network devices, applications, and cloud services.<br />They help organizations <strong>reduce vulnerabilities</strong>, <strong>improve security posture</strong>, and <strong>maintain compliance</strong> with frameworks like ISO 27001, NIST, and SOC 2.</p>
<h3 id="heading-example">🧩 Example:</h3>
<p>Let’s say you’re managing <strong>Linux servers</strong> or <strong>network switches</strong>:</p>
<ul>
<li><p>CIS Benchmark for <strong>Ubuntu Linux</strong> recommends disabling root SSH login, enforcing password complexity, and configuring auditd rules.</p>
</li>
<li><p>CIS Benchmark for <strong>Dell/Network Devices (or Cisco IOS)</strong> might suggest disabling unused services (like CDP/LLDP), applying secure SNMP configurations, or using SSH instead of Telnet.</p>
</li>
</ul>
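<p>A recommendation like "disable root SSH login" can be verified automatically. The following Python sketch inspects the text of an <code>sshd_config</code> file for an explicit <code>PermitRootLogin no</code>. It is a simplified illustration: the <code>root_login_disabled</code> helper is hypothetical, and it deliberately ignores <code>Match</code> blocks and <code>Include</code> directives, which a real audit tool would need to handle.</p>

```python
def root_login_disabled(sshd_config_text):
    """Return True only if the config explicitly sets 'PermitRootLogin no'."""
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if parts[0].lower() == "permitrootlogin" and len(parts) > 1:
            # sshd uses the first value it obtains for a keyword.
            return parts[1].lower() == "no"
    return False  # not set explicitly, so do not assume it is hardened


sample = """
# Example config for illustration
Port 22
PermitRootLogin no
"""
print(root_login_disabled(sample))
```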
<h3 id="heading-why-its-important-for-sre-network-devops-roles">💼 Why It’s Important for SRE / Network / DevOps Roles:</h3>
<ul>
<li><p>Ensures <strong>secure baseline configurations</strong> for servers, containers, and network devices.</p>
</li>
<li><p>Helps meet <strong>security &amp; compliance audit</strong> requirements.</p>
</li>
<li><p>Enables <strong>proactive vulnerability management</strong> by hardening systems before deployment.</p>
</li>
</ul>
<hr />
<h3 id="heading-real-time-example">🛠️ Real-World Example:</h3>
<p><strong>Scenario:</strong><br />You’re deploying EC2 instances on AWS for production workloads.<br />Before going live, you use the <strong>CIS AWS Foundations Benchmark</strong> to:</p>
<ul>
<li><p>Ensure CloudTrail is enabled across all regions.</p>
</li>
<li><p>Restrict root account usage.</p>
</li>
<li><p>Enforce MFA for console logins.</p>
</li>
<li><p>Encrypt S3 buckets with KMS.</p>
</li>
</ul>
<p>These checks align your cloud setup with CIS security standards.</p>
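<p>The checklist above maps naturally onto a table of named checks. The sketch below runs such checks against a snapshot of account settings. The snapshot keys and the exact predicates (for instance, treating "restrict root account usage" as "the root account has no access keys") are assumptions for illustration, not the benchmark's wording; a real audit would query AWS APIs or use a tool such as CIS-CAT.</p>

```python
# Each check pairs a description with a predicate over an account snapshot.
# The snapshot keys are hypothetical; a real audit would query AWS APIs.
CHECKS = [
    ("CloudTrail enabled in all regions", lambda a: a["cloudtrail_all_regions"]),
    ("Root account has no access keys", lambda a: not a["root_access_keys"]),
    ("MFA enforced for console users", lambda a: a["console_users_without_mfa"] == 0),
    ("S3 buckets encrypted with KMS", lambda a: a["buckets_without_kms"] == 0),
]


def run_checks(account):
    """Return {description: passed} for every check in CHECKS."""
    return {description: bool(check(account)) for description, check in CHECKS}


snapshot = {
    "cloudtrail_all_regions": True,
    "root_access_keys": False,
    "console_users_without_mfa": 2,
    "buckets_without_kms": 0,
}

for description, passed in run_checks(snapshot).items():
    print("PASS" if passed else "FAIL", description)
```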
<h2 id="heading-benchmark-levels">Benchmark Levels</h2>
<p>Each CIS Benchmark provides two main security levels:</p>
<ul>
<li><p><strong>Level 1:</strong> Basic, essential settings that have minimal impact on functionality and are widely applicable.</p>
<ul>
<li>Typical use: everyday enterprise systems that need baseline protection</li>
</ul>
</li>
<li><p><strong>Level 2:</strong> Advanced, stringent recommendations for highly sensitive environments, which may reduce system functionality but offer stronger security.</p>
<ul>
<li>Typical use: highly secure or regulated environments (banks, defense, healthcare)</li>
</ul>
</li>
<li><p><strong>A third level</strong>, the STIG profile, may appear for benchmarks aligned to U.S. Defense Information Systems Agency (DISA) guidelines.</p>
<ul>
<li><p>STIG stands for <strong>Security Technical Implementation Guide</strong>.</p>
</li>
<li><p>STIGs define <em>how to configure systems securely</em> so they meet strict military-grade security requirements.</p>
</li>
</ul>
</li>
</ul>
<h3 id="heading-key-takeaways">✅ Key Takeaways:</h3>
<ul>
<li><p><strong>CIS Benchmarks</strong> = standardized hardening guides.</p>
</li>
<li><p>Applied across <strong>OS</strong>, <strong>Databases</strong>, <strong>Cloud</strong>, <strong>Network Devices</strong>.</p>
</li>
<li><p>Enforced with <strong>automation tools</strong> like <em>Ansible, Chef, and Terraform</em>, and assessed with scanners such as <em>CIS-CAT</em>.</p>
</li>
<li><p>Enhances <strong>incident prevention</strong> and <strong>reduces attack surface</strong>.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Error Budget]]></title><description><![CDATA[🔍 Definition

An Error Budget is the maximum amount of unreliability (downtime, errors, or failed requests) your system is allowed to have without breaching your SLO.
It’s literally the “budget” for failure that you can spend in a given period.

🧠 ...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/error-budget</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/error-budget</guid><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 05:05:50 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-definition">🔍 Definition</h3>
<blockquote>
<p>An <strong>Error Budget</strong> is the <strong>maximum amount of unreliability (downtime, errors, or failed requests)</strong> your system is allowed to have <strong>without breaching your SLO</strong>.</p>
<p>It’s literally the “budget” for <strong>failure</strong> that you can spend in a given period.</p>
</blockquote>
<h3 id="heading-why-it-exists">🧠 Why It Exists</h3>
<p>You can’t have <strong>100% reliability</strong> — it’s too expensive and slows innovation.<br />So companies define <strong>SLOs</strong> (targets like 99.9% uptime), and whatever remains is the <strong>Error Budget</strong>.</p>
<blockquote>
<p><strong>Reliability target (SLO)</strong> + <strong>Error Budget</strong> = 100%</p>
</blockquote>
<p>The Error Budget represents how much risk or failure your service can tolerate.</p>
<h2 id="heading-real-world-example">💡 <strong>Real-world Example</strong></h2>
<p>Let’s say you’re an SRE managing a <strong>payment API</strong>.</p>
<ul>
<li><p><strong>SLO:</strong> 99.9% uptime per month</p>
</li>
<li><p><strong>Error Budget:</strong> 0.1% downtime = 43.2 minutes/month (assuming a 30-day month: 30 × 24 × 60 × 0.001)</p>
</li>
</ul>
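<p>The arithmetic behind that figure is easy to reproduce. This small Python sketch assumes a 30-day month, which is where 43.2 minutes comes from; the function name <code>error_budget_minutes</code> is just an illustrative choice.</p>

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Allowed downtime in minutes for an availability SLO over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


print(f"{error_budget_minutes(99.9):.1f} minutes per 30-day month")
print(f"{error_budget_minutes(99.99):.2f} minutes per 30-day month")
```

<p>Note how quickly the budget shrinks: moving from 99.9% to 99.99% cuts the monthly allowance from about 43 minutes to just over 4.</p>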
<h3 id="heading-scenario-1-normal-month">Scenario 1: Normal Month</h3>
<p>If your API was down for only <strong>20 minutes</strong>,<br />✅ You’re <strong>within budget</strong> (23.2 minutes remaining).<br />You can safely continue releasing new features or deployments.</p>
<h3 id="heading-scenario-2-bad-month">Scenario 2: Bad Month</h3>
<p>If your API was down for <strong>60 minutes</strong>,<br />❌ You <strong>exceeded your error budget</strong> by 16.8 minutes.<br />You should:</p>
<ul>
<li><p><strong>Freeze risky deployments</strong></p>
</li>
<li><p>Focus on improving reliability (bug fixes, better monitoring)</p>
</li>
<li><p>Conduct a <strong>postmortem</strong> to understand what went wrong</p>
</li>
</ul>
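<p>Both scenarios boil down to comparing downtime against the budget. A minimal sketch, assuming the 43.2-minute budget from the example and a hypothetical <code>budget_status</code> helper:</p>

```python
def budget_status(budget_minutes, downtime_minutes):
    """Summarize error budget consumption and the suggested release policy."""
    remaining = budget_minutes - downtime_minutes
    if remaining >= 0:
        return f"within budget, {remaining:.1f} min remaining: keep releasing"
    return f"budget exceeded by {-remaining:.1f} min: freeze risky deployments"


print(budget_status(43.2, 20))  # Scenario 1
print(budget_status(43.2, 60))  # Scenario 2
```

<p>In practice teams often act before the budget hits zero, for example by slowing releases once most of it is spent, but the exact thresholds are a policy decision.</p>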
<p>🧠 Quick Summary</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Term</td><td>Meaning</td><td>Example</td></tr>
</thead>
<tbody>
<tr>
<td><strong>SLO</strong></td><td>Target reliability</td><td>99.9% uptime</td></tr>
<tr>
<td><strong>Error Budget</strong></td><td>Tolerable failure allowance</td><td>0.1% downtime (≈43 min/month)</td></tr>
<tr>
<td><strong>If Within Budget</strong></td><td>Deploy normally</td><td>Release features faster</td></tr>
<tr>
<td><strong>If Exceeded</strong></td><td>Freeze releases, fix issues</td><td>Improve reliability</td></tr>
</tbody>
</table>
</div>]]></content:encoded></item><item><title><![CDATA[SLA, SLO, SLI]]></title><description><![CDATA[📜1️⃣ SLA – Service Level Agreement
🔍 Definition

SLA is a formal contract between a service provider and the customer. It defines what level of service is promised and what happens if that level isn’t met.

It often includes penalties or compensatio...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/sla-slo-sli</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/sla-slo-sli</guid><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 04:55:10 GMT</pubDate><content:encoded><![CDATA[<h2 id="heading-1-sla-service-level-agreement">📜1️⃣ <strong>SLA – Service Level Agreement</strong></h2>
<h3 id="heading-definition">🔍 Definition</h3>
<blockquote>
<p><strong>SLA</strong> is a <strong>formal contract</strong> between a service provider and the customer.<br />It defines what level of service is promised and what happens if that level isn’t met.</p>
</blockquote>
<p>It often includes <strong>penalties or compensation</strong> for not meeting the target.</p>
<p>⚙️ <strong>Real-world Example</strong></p>
<p>If AWS promises <strong>99.99% uptime per month</strong>, that’s an SLA.</p>
<p>If uptime drops below 99.99%, AWS might:</p>
<ul>
<li><p>Credit you part of your monthly bill.</p>
</li>
<li><p>Publish an incident RCA.</p>
</li>
</ul>
<h2 id="heading-2-slo-service-level-objective">🎯 2️⃣ <strong>SLO – Service Level Objective</strong></h2>
<h3 id="heading-definition-1">🔍 Definition</h3>
<blockquote>
<p><strong>SLO</strong> is the <strong>target goal or threshold</strong> for your SLI — the performance level you <em>want to maintain</em>.</p>
</blockquote>
<p>It’s what you aim for internally as a reliability objective.</p>
<h3 id="heading-example">💡 Example</h3>
<ul>
<li><p><strong>SLI:</strong> 99.95% availability</p>
</li>
<li><p><strong>SLO:</strong> “Our goal is to maintain <strong>99.9% or higher</strong> availability for the payment API per month.”</p>
</li>
</ul>
<p>If your SLI drops below 99.9%, it’s a signal that reliability is degrading and you need to take corrective actions.</p>
<h3 id="heading-real-world-example">⚙️ Real-world Example</h3>
<p>A video streaming service may define:</p>
<ul>
<li><p><strong>SLI:</strong> Percentage of video plays without buffering.</p>
</li>
<li><p><strong>SLO:</strong> 99.5% of all videos should play without buffering.</p>
</li>
</ul>
<p>If buffering increases due to CDN issues and fewer than 99.5% of plays are buffer-free, you’re breaching your SLO.</p>
<h2 id="heading-3-sli-service-level-indicator">⚙️ 3️⃣ <strong>SLI – Service Level Indicator</strong></h2>
<h3 id="heading-definition-2">🔍 Definition</h3>
<blockquote>
<p><strong>SLI</strong> is a <em>measurable metric</em> that shows how a system is performing — basically, “How are we doing?”</p>
</blockquote>
<p>It’s a <strong>quantitative measurement</strong> of a specific aspect of reliability, such as:</p>
<ul>
<li><p><strong>Availability</strong> (uptime)</p>
</li>
<li><p><strong>Latency</strong> (response time)</p>
</li>
<li><p><strong>Error rate</strong> (failed requests)</p>
</li>
<li><p><strong>Throughput</strong> (requests handled per second)</p>
</li>
</ul>
<h3 id="heading-formula-example">📈 Formula Example</h3>
<p><strong>Availability SLI = (Number of successful requests) / (Total requests)</strong></p>
<p>If you had:</p>
<ul>
<li>999,500 successful requests out of 1,000,000 total<br />  → <strong>SLI = 99.95% availability</strong></li>
</ul>
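<p>The formula translates directly into code. The sketch below computes the availability SLI from the request counts above and compares it with the 99.9% SLO from the earlier example; the function name <code>availability_sli</code> is an illustrative choice.</p>

```python
def availability_sli(successful, total):
    """Availability SLI as a percentage of successful requests."""
    if total == 0:
        raise ValueError("total requests must be greater than zero")
    return 100 * successful / total


sli = availability_sli(999_500, 1_000_000)
slo = 99.9  # target from the SLO example
print(f"SLI = {sli:.2f}%, meets SLO: {sli >= slo}")
```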
<h3 id="heading-real-world-example-1">🧩 Real-world Example</h3>
<p>In a <strong>payment gateway API</strong>, an SLI could be:</p>
<blockquote>
<p>“Percentage of successful transactions in the last 30 days.”</p>
</blockquote>
<p>If 0.1% of transactions failed, SLI = 99.9%.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Concept</td><td>Stands For</td><td>Defines</td><td>Example</td></tr>
</thead>
<tbody>
<tr>
<td><strong>SLI</strong></td><td>Service Level Indicator</td><td>What you measure</td><td>“99.95% uptime”</td></tr>
<tr>
<td><strong>SLO</strong></td><td>Service Level Objective</td><td>What you aim for</td><td>“We target ≥99.9% uptime”</td></tr>
<tr>
<td><strong>SLA</strong></td><td>Service Level Agreement</td><td>What you promise (to customer)</td><td>“99.9% uptime guaranteed, or 10% refund”</td></tr>
</tbody>
</table>
</div>]]></content:encoded></item><item><title><![CDATA[Blameless Postmortem]]></title><description><![CDATA[A blameless postmortem is a retrospective held after an incident, where the focus is on understanding what happened and why rather than on assigning blame.
🔍 Definition
A Blameless Postmortem is a detailed, objective report created after an incident or outage, focusing on:

What...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/blameless-postmortem</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/blameless-postmortem</guid><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 04:53:38 GMT</pubDate><content:encoded><![CDATA[<p>A blameless postmortem is a retrospective held after an incident, where the focus is on understanding what happened and why rather than on assigning blame.</p>
<h3 id="heading-definition">🔍 Definition</h3>
<p>A <strong>Blameless Postmortem</strong> is a <strong>detailed, objective report</strong> created <strong>after an incident or outage</strong>, focusing on:</p>
<ul>
<li><p><strong>What happened</strong></p>
</li>
<li><p><strong>Why it happened</strong></p>
</li>
<li><p><strong>How we can prevent it from happening again</strong></p>
</li>
</ul>
<p>👉 The key idea:</p>
<blockquote>
<p><strong>No one gets blamed.</strong><br />Instead of pointing fingers, the team <strong>focuses on improving systems and processes</strong>.</p>
</blockquote>
<h2 id="heading-why-its-important">⚙️ Why It’s Important</h2>
<p>When incidents happen (and they <em>will</em>), the goal is to:</p>
<ul>
<li><p>Learn from the failure</p>
</li>
<li><p>Improve reliability</p>
</li>
<li><p>Encourage <strong>open, honest communication</strong></p>
</li>
</ul>
<p>If people fear punishment, they’ll <strong>hide mistakes</strong>, and the organization won’t learn.<br />A <strong>blameless culture</strong> ensures everyone can talk openly about what went wrong — and why — without fear.</p>
<h2 id="heading-real-world-example">🧩 Real-World Example</h2>
<h3 id="heading-scenario">🧨 Scenario:</h3>
<p>Your web service went down for 45 minutes because a DevOps engineer accidentally ran a command that deleted a Kubernetes namespace in production.</p>
<h3 id="heading-what-not-to-do">🚫 What <em>not</em> to do:</h3>
<blockquote>
<p>“It’s Rahul’s fault — he made a mistake. He should be more careful.”</p>
</blockquote>
<p>This creates fear and resentment, and it encourages people to hide future mistakes.</p>
<h3 id="heading-blameless-postmortem-approach">✅ Blameless Postmortem approach:</h3>
<blockquote>
<p>“A manual deletion command was executed in production because the same permissions were used for staging and production. We’ll fix this by enforcing role-based access and requiring peer review for high-risk commands.”</p>
</blockquote>
<p><strong>Focus shifts</strong> from <em>“Who did it?”</em> → <em>“Why was it possible?”</em></p>
<p>⚙️ <strong>Postmortem vs. Blameless Postmortem</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td><strong>Postmortem</strong></td><td><strong>Blameless Postmortem</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Definition</strong></td><td>A report or analysis written after an incident or outage to understand what happened and how to prevent it in the future.</td><td>A postmortem that <em>intentionally avoids blaming individuals</em>, and instead focuses on <strong>systemic causes</strong>, <strong>process improvements</strong>, and <strong>learning</strong>.</td></tr>
<tr>
<td><strong>Goal</strong></td><td>To document and analyze the cause of an outage or issue.</td><td>To learn from incidents <strong>without fear or punishment</strong>, and make systems and processes more resilient.</td></tr>
<tr>
<td><strong>Tone / Culture</strong></td><td>Can sometimes be <em>punitive</em> — individuals might be blamed for mistakes.</td><td><em>Psychologically safe</em> — everyone can admit errors openly because no one is punished.</td></tr>
<tr>
<td><strong>Focus</strong></td><td>“Who did this?” or “Whose fault was it?”</td><td>“Why did this happen?” and “How can we improve so it doesn’t happen again?”</td></tr>
<tr>
<td><strong>Outcome</strong></td><td>Fixes are often short-term or reactive (e.g., “Don’t do that again”).</td><td>Fixes are <strong>systemic</strong> and <strong>preventive</strong> (e.g., “Add automation or access control to avoid manual errors”).</td></tr>
<tr>
<td><strong>Team Behavior</strong></td><td>Engineers may <strong>hide mistakes</strong> to avoid criticism.</td><td>Engineers feel <strong>safe to report issues</strong> quickly and honestly.</td></tr>
<tr>
<td><strong>Example</strong></td><td>“The outage occurred because Ramesh pushed a bad config file.”</td><td>“The outage occurred because our deployment process allowed an untested config to go to production. We’ll add CI/CD validation to prevent this.”</td></tr>
</tbody>
</table>
</div><h2 id="heading-real-world-example-1">🧩 Real-World Example</h2>
<h3 id="heading-traditional-postmortem">🔴 Traditional Postmortem:</h3>
<blockquote>
<p><em>Incident:</em> The payment API went down for 1 hour.<br /><em>Root Cause:</em> A developer ran the wrong command on production.<br /><em>Action:</em> Warned the developer and asked them to be more careful next time.</p>
</blockquote>
<p>🔸 <strong>Problem:</strong> No systemic improvement. The same mistake could happen again.</p>
<hr />
<h3 id="heading-blameless-postmortem">🟢 Blameless Postmortem:</h3>
<blockquote>
<p><em>Incident:</em> The payment API went down for 1 hour.<br /><em>Root Cause:</em> Production and staging environments share the same credentials, allowing manual access without approval.<br /><em>Action:</em></p>
<ul>
<li><p>Implemented separate credentials for prod/staging.</p>
</li>
<li><p>Added pre-deployment checks and role-based access.</p>
</li>
<li><p>Updated documentation for all engineers.</p>
</li>
</ul>
</blockquote>
<p>✅ <strong>Focus:</strong> Process improvement, not human fault.</p>
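<p>Systemic fixes like these can be expressed as guardrails in code. The Python sketch below models the two action items, separate production access and peer approval for high-risk commands, as a single permission check. The role name, the action names, and the <code>is_allowed</code> helper are hypothetical; real environments would enforce this through their IAM or Kubernetes RBAC configuration.</p>

```python
# Hypothetical rules mirroring the action items: separate production access
# and a second approver for high-risk commands.
HIGH_RISK_ACTIONS = {"delete-namespace", "drop-database"}


def is_allowed(action, environment, role, requester, approver=None):
    """Return True if the action may run under the illustrative rules."""
    if environment != "production":
        return True  # staging stays low-friction
    if role != "prod-operator":
        return False  # staging credentials must not work in production
    if action in HIGH_RISK_ACTIONS:
        # Peer review: someone other than the requester must approve.
        return approver is not None and approver != requester
    return True


print(is_allowed("delete-namespace", "production", "staging-dev", "alice"))
print(is_allowed("delete-namespace", "production", "prod-operator", "alice"))
print(is_allowed("delete-namespace", "production", "prod-operator", "alice", "bob"))
```

<p>The point of the design is that a single tired engineer can no longer take down production alone, which is exactly the question a blameless postmortem asks: why was it possible?</p>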
]]></content:encoded></item><item><title><![CDATA[What is a Sev-1 Outage?]]></title><description><![CDATA[🔍 Definition

Sev-1 (Severity-1) is the highest level of incident severity — a critical outage that severely impacts business or customer experience.

It usually means:

The production system is completely down

Critical functionality is unavailable...]]></description><link>https://must-known-terminologies-for-sre.hashnode.dev/what-is-a-sev-1-outage</link><guid isPermaLink="true">https://must-known-terminologies-for-sre.hashnode.dev/what-is-a-sev-1-outage</guid><category><![CDATA[SRE devops]]></category><dc:creator><![CDATA[Abhishek Reddy A N]]></dc:creator><pubDate>Wed, 08 Oct 2025 04:48:59 GMT</pubDate><content:encoded><![CDATA[<h3 id="heading-definition">🔍 Definition</h3>
<blockquote>
<p><strong>Sev-1 (Severity-1)</strong> is the <strong>highest level of incident severity</strong> — a <strong>critical outage</strong> that severely impacts business or customer experience.</p>
</blockquote>
<p>It usually means:</p>
<ul>
<li><p>The production system is <strong>completely down</strong></p>
</li>
<li><p><strong>Critical functionality</strong> is unavailable</p>
</li>
<li><p><strong>Revenue or SLAs</strong> are affected</p>
</li>
<li><p>Requires <strong>immediate response</strong>, 24×7</p>
</li>
</ul>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Severity</td><td>Impact</td><td>Typical Example</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Sev-1</strong></td><td>Critical – total outage</td><td>Website or payment API completely down</td></tr>
<tr>
<td><strong>Sev-2</strong></td><td>Major – partial outage</td><td>Login failing for some users</td></tr>
<tr>
<td><strong>Sev-3</strong></td><td>Minor – degraded performance</td><td>Reports load slowly</td></tr>
<tr>
<td><strong>Sev-4</strong></td><td>Low – cosmetic or informational</td><td>Typo in UI, documentation bug</td></tr>
</tbody>
</table>
</div><h2 id="heading-what-happens-during-a-sev-1-outage">⚙️ <strong>What happens during a Sev-1 outage</strong></h2>
<p>When a Sev-1 incident occurs, the SRE (or on-call engineer) follows a structured <strong>Incident Response</strong> process.</p>
<h3 id="heading-typical-steps">🧭 Typical steps:</h3>
<ol>
<li><p><strong>Detection</strong></p>
<ul>
<li><p>Alerts from monitoring (Prometheus, Datadog, CloudWatch, etc.)</p>
</li>
<li><p>Example: “Payment API error rate &gt; 80%”</p>
</li>
</ul>
</li>
<li><p><strong>Acknowledgment</strong></p>
<ul>
<li><p>SRE <strong>acknowledges</strong> the incident immediately.</p>
</li>
<li><p>Paging tools like <strong>PagerDuty</strong>, <strong>Opsgenie</strong>, or <strong>VictorOps</strong> (now Splunk On-Call) notify on-call engineers.</p>
</li>
</ul>
</li>
<li><p><strong>Communication</strong></p>
<ul>
<li><p>Create a Slack/Teams “war room” channel.</p>
</li>
<li><p>Inform key stakeholders (engineering, product, management, customers if needed).</p>
</li>
<li><p>Update the incident ticket in tools like <strong>Jira</strong> or <strong>ServiceNow</strong>, and post customer-facing updates on <strong>Statuspage</strong>.</p>
</li>
</ul>
</li>
<li><p><strong>Mitigation</strong></p>
<ul>
<li><p>Quickly <strong>restore service</strong>, even temporarily.</p>
</li>
<li><p>Example: Roll back a bad deployment, switch traffic to standby servers, or disable a faulty feature flag.</p>
</li>
</ul>
</li>
<li><p><strong>Resolution</strong></p>
<ul>
<li><p>Apply permanent fixes after the service is stable.</p>
</li>
<li><p>Collect logs, metrics, traces for analysis.</p>
</li>
</ul>
</li>
<li><p><strong>Post-Incident (RCA/Postmortem)</strong></p>
<ul>
<li>Conduct a <strong>Blameless Postmortem</strong> to find root cause and add preventive actions.</li>
</ul>
</li>
</ol>
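<p>The detection step usually rests on a threshold rule. The sketch below mirrors the example alert ("error rate above 80%") in plain Python. It is only an illustration with hypothetical helper names; in practice this rule would live in the alerting configuration of a monitoring system such as those named above.</p>

```python
PAGE_THRESHOLD = 0.80  # matches the example alert: error rate above 80%


def error_rate(failed, total):
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return failed / total if total else 0.0


def should_page(failed, total, threshold=PAGE_THRESHOLD):
    """Return True when the error rate crosses the paging threshold."""
    return error_rate(failed, total) > threshold


print(should_page(failed=850, total=1000))
print(should_page(failed=20, total=1000))
```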
<h2 id="heading-example-scenario-real-world-sev-1-outage">🧠 Example Scenario: Real-World Sev-1 Outage</h2>
<h3 id="heading-incident">🔴 Incident</h3>
<p>Your company’s <strong>e-commerce website</strong> is not processing any payments.<br />Monitoring shows 100% failure rate on the <strong>checkout API</strong>.</p>
<h3 id="heading-what-you-do">⚙️ What you do</h3>
<ol>
<li><p>Alert fired → you acknowledge within 2 minutes.</p>
</li>
<li><p>Check deployment history — notice a new microservice version was released 10 minutes ago.</p>
</li>
<li><p>Roll back the deployment → service recovers.</p>
</li>
<li><p>Communicate resolution to stakeholders.</p>
</li>
<li><p>Later in RCA:</p>
<ul>
<li><p>Root cause: New code broke payment API authentication.</p>
</li>
<li><p>Fix: Add automated integration tests before deployment.</p>
</li>
</ul>
</li>
</ol>
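<p>Step 2 above, checking deployment history, can be scripted so the on-call engineer sees rollback candidates right away. This sketch filters a list of deployments to those made shortly before the alert. The record layout, the 30-minute window, and the <code>recent_deployments</code> helper are assumptions for illustration; a real version would read from the CI/CD system's history.</p>

```python
from datetime import datetime, timedelta


def recent_deployments(deployments, alert_time, window=timedelta(minutes=30)):
    """Return deployments made within `window` before the alert, newest first."""
    suspects = [
        d for d in deployments
        if timedelta(0) <= alert_time - d["deployed_at"] <= window
    ]
    return sorted(suspects, key=lambda d: d["deployed_at"], reverse=True)


# Illustrative history: one release 10 minutes before the alert, one hours earlier.
alert_time = datetime(2025, 1, 1, 12, 0)
history = [
    {"service": "checkout", "version": "v42", "deployed_at": datetime(2025, 1, 1, 11, 50)},
    {"service": "search", "version": "v7", "deployed_at": datetime(2025, 1, 1, 9, 0)},
]

for d in recent_deployments(history, alert_time):
    print("rollback candidate:", d["service"], d["version"])
```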
<h3 id="heading-outcome">✅ Outcome</h3>
<ul>
<li><p><strong>MTTR (Mean Time To Recovery)</strong> improved.</p>
</li>
<li><p>Documentation updated.</p>
</li>
<li><p>Confidence in incident management increased.</p>
</li>
</ul>
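<p>To claim that MTTR improved, you need to measure it. A minimal sketch, assuming MTTR is computed as the average of (resolved time minus detected time) across incidents; the timestamps below are made up for illustration and the <code>mttr_minutes</code> helper is hypothetical.</p>

```python
from datetime import datetime


def mttr_minutes(incidents):
    """Mean time to recovery: average of (resolved - detected), in minutes."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [
        (i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents
    ]
    return sum(durations) / len(durations)


# Illustrative incidents lasting 20 and 40 minutes.
incidents = [
    {"detected": datetime(2025, 1, 1, 12, 0), "resolved": datetime(2025, 1, 1, 12, 20)},
    {"detected": datetime(2025, 2, 1, 8, 0), "resolved": datetime(2025, 2, 1, 8, 40)},
]
print(f"MTTR: {mttr_minutes(incidents):.1f} minutes")
```

<p>Tracking this number per quarter, rather than per incident, gives a steadier picture of whether incident response is actually getting faster.</p>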
<h2 id="heading-interview-talking-points-example">🎯 Interview Talking Points Example</h2>
<p>When asked <strong>“Have you handled Sev-1 incidents?”</strong>, you can respond like this:</p>
<blockquote>
<p>“Yes, I’ve been on the on-call rotation and handled multiple Sev-1 incidents. For example, once our core API was down due to a failed deployment. I led the incident bridge — identified the rollback plan, coordinated with the Dev team, restored the service within 15 minutes, and later drove the blameless postmortem to improve our CI/CD rollback automation.”</p>
</blockquote>
<p>That’s the <strong>STAR format (Situation, Task, Action, Result)</strong> — perfect for interviews.</p>
<h2 id="heading-real-life-example-to-mention-in-interview">🧠 Real-Life Example (To Mention in Interview)</h2>
<blockquote>
<p>“Once during a Sev-1 outage, our API gateway started returning 5xx errors due to a bad config push.<br />I was the incident commander — I coordinated with the DevOps team to roll back the config, updated leadership every 15 minutes, and restored service within 25 minutes. Later, I led a blameless postmortem and added CI/CD validation to prevent config pushes without syntax checks.”</p>
</blockquote>
]]></content:encoded></item></channel></rss>