Must-know terminology for SRE

Enterprise governance.

Abhishek Reddy A N — Wed, 08 Oct 2025 05:44:19 GMT

Enterprise Governance is the framework of policies, procedures, and controls that ensures an organization’s IT and business operations are:

Aligned with business objectives
Compliant with regulations and standards (like ISO 27001, SOC 2, GDPR)
Secure, auditable, and risk-managed

In IT and cloud environments, enterprise governance covers:

Security policies (access control, incident response)
Compliance enforcement (CIS Benchmarks, vulnerability remediation)
Risk management and reporting
Cloud resource usage policies

⚙️ Tools & Practices

Cloud Governance: AWS Organizations, Azure Policy, GCP Organization Policy
Compliance Tools: CIS-CAT, Qualys, Cloud Security Posture Management (CSPM) tools
Monitoring & Reporting: Logging, auditing, and dashboards for security & compliance

Real-World Example

Defining cloud policies for multi-account AWS environments to enforce encryption, IAM restrictions, and tagging standards
Auditing access and usage of cloud resources for compliance
Implementing automated policy enforcement to ensure CIS Benchmark compliance and security governance.
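Automated policy enforcement like this can be pictured as a simple check over a resource inventory. Here is a minimal sketch, assuming a hypothetical tagging standard (`REQUIRED_TAGS`) and a simplified `Resource` record; neither is a real AWS type, and a real setup would usually rely on the cloud provider's own policy engine.

```python
from dataclasses import dataclass, field

# Hypothetical tagging standard: every resource must carry these tag keys.
REQUIRED_TAGS = {"owner", "environment", "cost-center"}


@dataclass
class Resource:
    """Simplified stand-in for a cloud resource record (not a real AWS type)."""
    resource_id: str
    encrypted: bool
    tags: dict[str, str] = field(default_factory=dict)


def find_violations(resource: Resource) -> list[str]:
    """Return human-readable policy violations for one resource."""
    violations = []
    missing = sorted(REQUIRED_TAGS - set(resource.tags))
    if missing:
        violations.append(f"missing tags: {', '.join(missing)}")
    if not resource.encrypted:
        violations.append("encryption at rest is disabled")
    return violations


if __name__ == "__main__":
    # Illustrative inventory; the IDs and tag values are made up.
    inventory = [
        Resource("bucket-a", True, {"owner": "payments", "environment": "prod",
                                    "cost-center": "cc-100"}),
        Resource("bucket-b", False, {"owner": "payments"}),
    ]
    for res in inventory:
        for violation in find_violations(res):
            print(f"{res.resource_id}: {violation}")
```

Running a check like this on a schedule and reporting the violations is one simple way to turn a written policy into something auditable.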

Secrets Management

Abhishek Reddy A N — Wed, 08 Oct 2025 05:42:59 GMT

Secrets Management is the process of securely storing, accessing, and rotating sensitive information such as:

API keys
Passwords
SSH keys
Certificates
Tokens

It ensures that sensitive data is not hardcoded in code or configuration files, reducing the risk of unauthorized access.

⚙️ Cloud and Enterprise Tools

AWS Secrets Manager
Azure Key Vault
HashiCorp Vault
GCP Secret Manager
Kubernetes Secrets

Best practices include:

Encrypting secrets at rest and in transit
Automating secret rotation
Limiting access via role-based access control (RBAC)
Auditing access logs

💼 Real-World Example

Storing database credentials in AWS Secrets Manager instead of code
Configuring CI/CD pipelines to retrieve secrets dynamically for deployments
Implementing periodic secret rotation and logging access for compliance
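The "not in code" part can be sketched in a few lines. This minimal example assumes the secret has already been injected into the process environment (for instance by a CI/CD pipeline or a secrets manager integration); the variable name `DB_PASSWORD` and the helpers `get_secret` and `redact` are hypothetical names used only for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def get_secret(name: str) -> str:
    """Read a secret from the environment, failing loudly if absent or empty."""
    value = os.environ.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


def redact(value: str, visible: int = 2) -> str:
    """Mask a secret for log output, keeping only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

An application would call `get_secret("DB_PASSWORD")` at startup and log only `redact(...)` of the value, so the secret itself never appears in source control or in logs.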

Vulnerability Scanning

Abhishek Reddy A N — Wed, 08 Oct 2025 05:42:12 GMT

🔍 What is Vulnerability Scanning?

Vulnerability scanning is the process of identifying security weaknesses in systems, applications, or networks using automated tools.
It helps organizations detect misconfigurations, missing patches, outdated software, and other potential security risks before attackers can exploit them.

In cloud environments (AWS, Azure, GCP), vulnerability scanning is the process of detecting security risks across cloud resources — including virtual machines, containers, storage, and network configurations.

⚙️ Common Tools

Nessus
OpenVAS
Qualys
Rapid7 InsightVM
Nmap (with vulnerability scripts)

Detect unpatched OS images, misconfigured cloud services, exposed endpoints, or non-compliant IAM policies.
Use cloud-native tools like Amazon Inspector, Microsoft Defender for Cloud (formerly Azure Defender), or Google Cloud Web Security Scanner, along with third-party tools like Nessus or Qualys.
Helps maintain compliance standards (CIS Benchmarks, NIST, SOC 2) and reduce the cloud attack surface.

CIS Benchmarks

Abhishek Reddy A N — Wed, 08 Oct 2025 05:40:20 GMT

CIS stands for the Center for Internet Security.

CIS Benchmarks are well-defined, consensus-based best practices to securely configure operating systems, network devices, applications, and cloud services.
They help organizations reduce vulnerabilities, improve security posture, and maintain compliance with frameworks like ISO 27001, NIST, and SOC 2.

🧩 Example:

Let’s say you’re managing Linux servers or network switches:

CIS Benchmark for Ubuntu Linux recommends disabling root SSH login, enforcing password complexity, and configuring auditd rules.
CIS Benchmark for Dell/Network Devices (or Cisco IOS) might suggest disabling unused services (like CDP/LLDP), applying secure SNMP configurations, or using SSH instead of Telnet.
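A recommendation such as "disable root SSH login" can be verified automatically. The sketch below checks the text of an `sshd_config` file for `PermitRootLogin no`. It is a simplified illustration, not how CIS-CAT works: the function name `root_login_disabled` is hypothetical, it only handles the whitespace-separated `keyword value` form, and it ignores `Match` blocks and `Include` files.

```python
def root_login_disabled(sshd_config_text: str) -> bool:
    """Return True if the first PermitRootLogin directive is set to 'no'.

    Simplifications: ignores Match blocks and Include files, and treats a
    missing directive as non-compliant so the setting must be explicit.
    """
    for raw_line in sshd_config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)
        # sshd keywords are case-insensitive; the first value obtained wins.
        if parts[0].lower() == "permitrootlogin":
            return len(parts) == 2 and parts[1].strip().lower() == "no"
    return False


if __name__ == "__main__":
    sample = "Port 22\nPermitRootLogin no\nPasswordAuthentication no\n"
    print("compliant" if root_login_disabled(sample) else "non-compliant")
```

Treating a missing directive as a failure is a deliberate choice here: a hardening check is easier to audit when the secure value is stated explicitly rather than inherited from a default.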

💼 Why It’s Important for SRE / Network / DevOps Roles:

Ensures secure baseline configurations for servers, containers, and network devices.
Helps meet security & compliance audit requirements.
Enables proactive vulnerability management by hardening systems before deployment.

🛠️ Real-time Example:

Scenario:
You’re deploying EC2 instances on AWS for production workloads.
Before going live, you use the CIS AWS Foundations Benchmark to:

Ensure CloudTrail is enabled across all regions.
Restrict root account usage.
Enforce MFA for console logins.
Encrypt S3 buckets with KMS.

These checks align your cloud setup with CIS security standards.

Benchmark Levels

Each CIS Benchmark provides two main security levels:

Level 1: Basic, essential settings that have minimal impact on functionality and are widely applicable.
- Everyday enterprise systems needing baseline protection
Level 2: Advanced, stringent recommendations for highly sensitive environments, which may reduce system functionality but offer stronger security.
- Highly secure or regulated environments (banks, defense, healthcare)
A third profile, STIG, may appear in benchmarks aligned to US Defense Information Systems Agency (DISA) guidelines.
- STIG stands for Security Technical Implementation Guide.
- STIGs define how to configure systems securely so they meet strict military-grade security requirements.

✅ Key Takeaways:

CIS Benchmarks = standardized hardening guides.
Applied across OS, Databases, Cloud, Network Devices.
Used in security automation tools like Ansible, Chef, Terraform, and CIS-CAT for compliance checks.
Enhances incident prevention and reduces attack surface.

Error Budget

Abhishek Reddy A N — Wed, 08 Oct 2025 05:05:50 GMT

🔍 Definition

An Error Budget is the maximum amount of unreliability (downtime, errors, or failed requests) your system is allowed to have without breaching your SLO.

It’s literally the “budget” for failure that you can spend in a given period.

🧠 Why It Exists

You can’t have 100% reliability — it’s too expensive and slows innovation.
So companies define SLOs (targets like 99.9% uptime), and whatever remains is the Error Budget.

Reliability target (SLO) + Error Budget = 100%

The Error Budget represents how much risk or failure your service can tolerate.

💡 Real-world Example

Let’s say you’re an SRE managing a payment API.

SLO: 99.9% uptime per month
Error Budget: 0.1% downtime = 43.2 minutes/month (assuming a 30-day month: 30 × 24 × 60 = 43,200 minutes)

Scenario 1: Normal Month

If your API was down for only 20 minutes,
✅ You’re within budget (23.2 minutes remaining).
You can safely continue releasing new features or deployments.

Scenario 2: Bad Month

If your API was down for 60 minutes,
❌ You exceeded your error budget by 16.8 minutes.
You should:

Freeze risky deployments
Focus on improving reliability (bug fixes, better monitoring)
Conduct a postmortem to understand what went wrong
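The arithmetic behind both scenarios is easy to script. This is a minimal sketch assuming a 30-day month; the function names `error_budget_minutes` and `budget_remaining` are illustrative, not from any library.

```python
def error_budget_minutes(slo_percent: float, days_in_period: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     days_in_period: int = 30) -> float:
    """Remaining budget in minutes; a negative value means it was exceeded."""
    return error_budget_minutes(slo_percent, days_in_period) - downtime_minutes


if __name__ == "__main__":
    # The two scenarios from the example: 20 and 60 minutes of downtime.
    for downtime in (20, 60):
        remaining = budget_remaining(99.9, downtime)
        status = "within budget" if remaining >= 0 else "budget exceeded"
        print(f"downtime={downtime} min, remaining={remaining:.1f} min, {status}")
```

For a 99.9% SLO this gives a 43.2-minute budget, with 23.2 minutes left after 20 minutes of downtime and a 16.8-minute overrun after 60 minutes, matching the numbers in the scenarios.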

🧠 Quick Summary

Term	Meaning	Example
SLO	Target reliability	99.9% uptime
Error Budget	Tolerable failure allowance	0.1% downtime (≈43 min/month)
If Within Budget	Deploy normally	Release features faster
If Exceeded	Freeze releases, fix issues	Improve reliability

SLA, SLO, SLI

Abhishek Reddy A N — Wed, 08 Oct 2025 04:55:10 GMT

📜1️⃣ SLA – Service Level Agreement

🔍 Definition

SLA is a formal contract between a service provider and the customer.
It defines what level of service is promised and what happens if that level isn’t met.

It often includes penalties or compensation for not meeting the target.

⚙️ Real-world Example

If AWS promises 99.99% uptime per month, that’s an SLA.

If uptime drops below 99.99%, AWS might:

Credit you part of your monthly bill.
Publish an incident RCA.

🎯 2️⃣ SLO – Service Level Objective

🔍 Definition

SLO is the target goal or threshold for your SLI — the performance level you want to maintain.

It’s what you aim for internally as a reliability objective.

💡 Example

SLI: 99.95% availability
SLO: “Our goal is to maintain 99.9% or higher availability for the payment API per month.”

If your SLI drops below 99.9%, it’s a signal that reliability is degrading and you need to take corrective actions.

⚙️ Real-world Example

A video streaming service may define:

SLI: Percentage of video plays without buffering.
SLO: 99.5% of all videos should play without buffering.

If CDN issues push buffering high enough that fewer than 99.5% of plays are buffer-free, you’re breaching your SLO.

⚙️ 3️⃣ SLI – Service Level Indicator

🔍 Definition

SLI is a measurable metric that shows how a system is performing — basically, “How are we doing?”

It’s a quantitative measurement of a specific aspect of reliability, such as:

Availability (uptime)
Latency (response time)
Error rate (failed requests)
Throughput (requests handled per second)

📈 Formula Example

Availability SLI = (Number of successful requests) / (Total requests)

If you had:

999,500 successful requests out of 1,000,000 total
→ SLI = 99.95% availability
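The same formula in code, as a small sketch. The function names `availability_sli` and `meets_slo` are illustrative; the request counts are the ones from the example, and the 99.9% target is the SLO used earlier for the payment API.

```python
def availability_sli(successful: int, total: int) -> float:
    """Availability SLI as a percentage of successful requests."""
    if total <= 0:
        raise ValueError("total requests must be positive")
    return 100.0 * successful / total


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    sli = availability_sli(999_500, 1_000_000)
    print(f"SLI = {sli:.2f}%, meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```

Keeping the measurement (SLI) and the comparison against the target (SLO) as separate functions mirrors the distinction between the two terms.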

🧩 Real-world Example

In a payment gateway API, an SLI could be:

“Percentage of successful transactions in the last 30 days.”

If 0.1% of transactions failed, SLI = 99.9%.

Concept	Stands For	Defines	Example
SLI	Service Level Indicator	What you measure	“99.95% uptime”
SLO	Service Level Objective	What you aim for	“We target ≥99.9% uptime”
SLA	Service Level Agreement	What you promise (to customer)	“99.9% uptime guaranteed, or 10% refund”

Blameless Postmortem

Abhishek Reddy A N — Wed, 08 Oct 2025 04:53:38 GMT

A blameless postmortem is a retrospective held after an incident, where the focus is on understanding what happened and why rather than on assigning blame.

🔍 Definition

A Blameless Postmortem is a detailed, objective report created after an incident or outage, focusing on:

What happened
Why it happened
How we can prevent it from happening again

👉 The key idea:

No one gets blamed.
Instead of pointing fingers, the team focuses on improving systems and processes.

⚙️ Why It’s Important

When incidents happen (and they will), the goal is to:

Learn from the failure
Improve reliability
Encourage open, honest communication

If people fear punishment, they’ll hide mistakes, and the organization won’t learn.
A blameless culture ensures everyone can talk openly about what went wrong — and why — without fear.

🧩 Real-World Example

🧨 Scenario:

Your web service went down for 45 minutes because a DevOps engineer accidentally ran a command that deleted a Kubernetes namespace in production.

🚫 What not to do:

“It’s Rahul’s fault — he made a mistake. He should be more careful.”

This creates fear and resentment, and it encourages people to hide future mistakes.

✅ Blameless Postmortem approach:

“A manual deletion command was executed in production because the same permissions were used for staging and production. We’ll fix this by enforcing role-based access and requiring peer review for high-risk commands.”

Focus shifts from “Who did it?” → “Why was it possible?”

⚙️ Postmortem vs. Blameless Postmortem

Feature	Postmortem	Blameless Postmortem
Definition	A report or analysis written after an incident or outage to understand what happened and how to prevent it in the future.	A postmortem that intentionally avoids blaming individuals, and instead focuses on systemic causes, process improvements, and learning.
Goal	To document and analyze the cause of an outage or issue.	To learn from incidents without fear or punishment, and make systems and processes more resilient.
Tone / Culture	Can sometimes be punitive — individuals might be blamed for mistakes.	Psychologically safe — everyone can admit errors openly because no one is punished.
Focus	“Who did this?” or “Whose fault was it?”	“Why did this happen?” and “How can we improve so it doesn’t happen again?”
Outcome	Fixes are often short-term or reactive (e.g., “Don’t do that again”).	Fixes are systemic and preventive (e.g., “Add automation or access control to avoid manual errors”).
Team Behavior	Engineers may hide mistakes to avoid criticism.	Engineers feel safe to report issues quickly and honestly.
Example	“The outage occurred because Ramesh pushed a bad config file.”	“The outage occurred because our deployment process allowed an untested config to go to production. We’ll add CI/CD validation to prevent this.”

🧩 Real-World Example

🔴 Traditional Postmortem:

Incident: The payment API went down for 1 hour.
Root Cause: A developer ran the wrong command on production.
Action: Warned the developer and asked them to be more careful next time.

🔸 Problem: No systemic improvement. The same mistake could happen again.

🟢 Blameless Postmortem:

Incident: The payment API went down for 1 hour.
Root Cause: Production and staging environments share the same credentials, allowing manual access without approval.
Action:

Implemented separate credentials for prod/staging.

Added pre-deployment checks and role-based access.

Updated documentation for all engineers.

✅ Focus: Process improvement, not human fault.

What is a Sev-1 Outage?

Abhishek Reddy A N — Wed, 08 Oct 2025 04:48:59 GMT

🔍 Definition

Sev-1 (Severity-1) is the highest level of incident severity — a critical outage that severely impacts business or customer experience.

It usually means:

The production system is completely down
Critical functionality is unavailable
Revenue or SLAs are affected
An immediate response is required, 24×7

Severity	Impact	Typical Example
Sev-1	Critical – total outage	Website or payment API completely down
Sev-2	Major – partial outage	Login failing for some users
Sev-3	Minor – degraded performance	Reports load slowly
Sev-4	Low – cosmetic or informational	Typo in UI, documentation bug

⚙️ What happens during a Sev-1 outage

When a Sev-1 incident occurs, the SRE (or on-call engineer) performs a structured Incident Response process.

🧭 Typical steps:

Detection
- Alerts from monitoring (Prometheus, Datadog, CloudWatch, etc.)
- Example: “Payment API error rate > 80%”
Acknowledgment
- SRE acknowledges the incident immediately.
- Paging tools like PagerDuty, Opsgenie, or VictorOps notify on-call engineers.
Communication
- Create a Slack/Teams “war room” channel.
- Inform key stakeholders (engineering, product, management, customers if needed).
- Update incident ticket in tools like Jira, ServiceNow, or Statuspage.
Mitigation
- Quickly restore service, even temporarily.
- Example: Roll back a bad deployment, switch traffic to standby servers, or disable a faulty feature flag.
Resolution
- Apply permanent fixes after the service is stable.
- Collect logs, metrics, traces for analysis.
Post-Incident (RCA/Postmortem)
- Conduct a Blameless Postmortem to find root cause and add preventive actions.
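The detection step usually boils down to comparing a metric against a threshold. Here is a minimal sketch of that idea, using the 80% error-rate figure from the example alert above; `WindowStats` and `should_page` are hypothetical names, and real monitoring tools express this as an alert rule rather than application code.

```python
from dataclasses import dataclass

# Threshold taken from the example alert ("error rate > 80%").
PAGE_THRESHOLD = 0.80


@dataclass
class WindowStats:
    """Request counts for one monitoring window (hypothetical structure)."""
    total_requests: int
    failed_requests: int


def error_rate(stats: WindowStats) -> float:
    """Fraction of failed requests in the window."""
    if stats.total_requests == 0:
        return 0.0  # no traffic: nothing to measure in this window
    return stats.failed_requests / stats.total_requests


def should_page(stats: WindowStats, threshold: float = PAGE_THRESHOLD) -> bool:
    """True when the error rate is strictly above the paging threshold."""
    return error_rate(stats) > threshold


if __name__ == "__main__":
    window = WindowStats(total_requests=1000, failed_requests=850)
    print(f"error rate={error_rate(window):.0%}, page on-call: {should_page(window)}")
```

In practice the threshold is usually combined with a duration (for example, "above the threshold for several minutes") so that a brief spike does not page anyone.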

🧠 Example Scenario: Real-World Sev-1 Outage

🔴 Incident

Your company’s e-commerce website is not processing any payments.
Monitoring shows 100% failure rate on the checkout API.

⚙️ What you do

Alert fired → you acknowledge within 2 minutes.
Check deployment history — notice a new microservice version was released 10 minutes ago.
Roll back the deployment → service recovers.
Communicate resolution to stakeholders.
Later in RCA:
- Root cause: New code broke payment API authentication.
- Fix: Add automated integration tests before deployment.

✅ Outcome

MTTR (Mean Time To Recovery) improved.
Documentation updated.
Confidence in incident management increased.
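MTTR is just the average time between detecting an incident and restoring service. A small sketch of the calculation follows; the function name `mean_time_to_recovery` and the incident timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (recovered - detected) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Illustrative incidents as (detected, recovered) pairs.
    incidents = [
        (datetime(2025, 1, 10, 9, 0), datetime(2025, 1, 10, 9, 15)),
        (datetime(2025, 2, 3, 14, 30), datetime(2025, 2, 3, 14, 55)),
    ]
    print(f"MTTR: {mean_time_to_recovery(incidents)}")
```

Tracking this number across incidents over time is what lets you say whether MTTR actually improved.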

🎯 Interview Talking Points Example

When asked “Have you handled Sev-1 incidents?”, you can respond like this:

“Yes, I’ve been on the on-call rotation and handled multiple Sev-1 incidents. For example, once our core API was down due to a failed deployment. I led the incident bridge — identified the rollback plan, coordinated with the Dev team, restored the service within 15 minutes, and later drove the blameless postmortem to improve our CI/CD rollback automation.”

That’s the STAR format (Situation, Task, Action, Result) — perfect for interviews.

🧠 Real-Life Example (To Mention in Interview)

“Once during a Sev-1 outage, our API gateway started returning 5xx errors due to a bad config push.
I was the incident commander — I coordinated with the DevOps team to roll back the config, updated leadership every 15 minutes, and restored service within 25 minutes. Later, I led a blameless postmortem and added CI/CD validation to prevent config pushes without syntax checks.”