DevOpsInterviews

Naveen Teja

4/29/2026

The 48 Most Common DevOps Interview Questions



Welcome to the definitive guide for DevOps engineering interviews. This comprehensive list covers DevOps culture, modern CI/CD pipelines, Infrastructure as Code, Containerization, Observability, and Site Reliability Engineering practices.


Section 1: DevOps Culture & Platform Engineering


Q1: How is Platform Engineering different from traditional DevOps? Traditional DevOps often required developers to manage their own infrastructure scripts. Platform Engineering (a major 2026 standard) provides an Internal Developer Platform (IDP): a 'golden path' self-service portal where developers can spin up environments without needing to know the underlying Kubernetes or Terraform complexities.


Q2: Explain the 'Shift-Left' approach in DevSecOps. Shift-left means moving testing, security, and performance evaluations as early in the software development lifecycle (SDLC) as possible, rather than waiting for the deployment phase. It reduces costs and prevents vulnerabilities from reaching production.


Q3: What are the core principles of GitOps? GitOps uses a Git repository as the single source of truth for declarative infrastructure and applications. The core principles are: the system is described declaratively, the state is versioned in Git, changes are automatically applied, and software agents (like ArgoCD) continuously ensure the cluster matches the Git state.


Q4: Describe a common DevOps 'Anti-Pattern'. A common anti-pattern is having a 'DevOps Team' that acts as a silo between Dev and Ops, where developers still throw code over the wall to the DevOps engineers to deploy. DevOps is a culture of shared responsibility, not just a job title.


Q5: How do you measure DevOps success (DORA metrics)? Success is measured using the four key DORA metrics: Deployment Frequency, Lead Time for Changes (time from commit to production), Mean Time to Recovery (MTTR, the time to recover from a failure), and Change Failure Rate (percentage of deployments causing failures).
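As a toy illustration of the last two metrics, Change Failure Rate and MTTR can be computed directly from a deployment log and incident records. All data below is invented for the example:

```python
from datetime import datetime, timedelta

# Hypothetical deployment log: when each deploy happened and
# whether it caused a production failure.
deployments = [
    {"at": datetime(2026, 1, 1), "failed": False},
    {"at": datetime(2026, 1, 2), "failed": True},
    {"at": datetime(2026, 1, 3), "failed": False},
    {"at": datetime(2026, 1, 4), "failed": False},
]

# Change Failure Rate: fraction of deployments that caused a failure.
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

# Hypothetical incident windows (start, end) for computing MTTR.
incidents = [
    (datetime(2026, 1, 2, 10, 0), datetime(2026, 1, 2, 10, 45)),
    (datetime(2026, 1, 9, 14, 0), datetime(2026, 1, 9, 14, 15)),
]

# Mean Time to Recovery: average incident duration.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)
```

In practice these numbers come from your CI/CD system and incident tracker, not a hand-written list, but the arithmetic is exactly this simple.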


Section 2: Continuous Integration & Delivery (CI/CD)


Q6: Compare Jenkins with modern CI tools like GitHub Actions or GitLab CI. Jenkins is highly customizable but requires significant maintenance (plugins, master/worker nodes). GitHub Actions and GitLab CI are cloud-native, deeply integrated with the repo, use declarative YAML, and offer fully managed runners, making them the preferred choice for 2026 modern stacks.


Q7: How do you handle database migrations in an automated CI/CD pipeline? Database changes should be version-controlled using tools like Flyway or Liquibase. Migrations are executed automatically during the CI/CD pipeline before the application code is deployed, using backward-compatible changes to prevent downtime.
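As a sketch of this pattern (the filename, table, and column are invented for illustration), a Flyway versioned migration is just a SQL file whose name encodes its version, which Flyway applies in order and records in a history table:

```sql
-- V2__add_last_login.sql  (Flyway versioned migration; names are illustrative)
-- Backward-compatible: adds a nullable column, so the previous app
-- version keeps working while the new version rolls out.
ALTER TABLE users ADD COLUMN last_login TIMESTAMP NULL;
```

The pipeline runs `flyway migrate` before deploying the new application version, so the schema is always ready for the code that depends on it.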


Q8: Explain the difference between Canary and Blue/Green deployments. Blue/Green requires two identical environments; you switch traffic from the old (Blue) to the new (Green) all at once. Canary routes a small percentage of real user traffic (e.g., 5%) to the new version, monitors for errors, and gradually ramps up to 100%.
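One common way to implement the canary side of this on Kubernetes is a second Ingress carrying canary annotations, as supported by the NGINX Ingress controller. This is a hedged sketch; the host, Service, and resource names are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary   # hypothetical name for the canary route
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # send 5% of traffic here
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-v2   # Service fronting the new version's pods
                port:
                  number: 80
```

Ramping up is then just editing `canary-weight` (5 → 25 → 100) while watching error rates; service meshes and tools like Argo Rollouts automate the same idea.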


Q9: What is the purpose of an Artifact Repository? Tools like JFrog Artifactory or AWS ECR store compiled binaries, Docker images, and package dependencies. They ensure builds are immutable, repeatable, and securely scanned for vulnerabilities before deployment.


Q10: How do you roll back a failed deployment in a GitOps workflow? In a true GitOps workflow, you do not manually run a 'rollback' script. You simply git revert the bad commit in the infrastructure repository. The GitOps agent (e.g., ArgoCD) detects the state change and automatically syncs the cluster back to the previous stable state.


Section 3: Infrastructure as Code (IaC)


Q11: Why is Terraform State crucial, and how do you secure it? The state file maps real-world resources to your configuration. It must be secured because it contains sensitive data (passwords, IPs) in plaintext. In production, state is stored remotely (e.g., AWS S3) with encryption enabled, and state-locking (via DynamoDB) is used to prevent concurrent modifications.
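A minimal remote-backend configuration implementing this looks like the following; the bucket, key, and table names are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # placeholder bucket name
    key            = "prod/network.tfstate"   # path of this stack's state
    region         = "us-east-1"
    encrypt        = true                     # server-side encryption at rest
    dynamodb_table = "terraform-locks"        # state locking via DynamoDB
  }
}
```

With this in place, two engineers running `terraform apply` at the same time cannot corrupt the state: the second run blocks until the DynamoDB lock is released.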


Q12: What is 'Configuration Drift' and how do you resolve it? Drift occurs when infrastructure is manually changed outside of the IaC tool (e.g., someone clicking around the AWS console). To resolve it, you run a Terraform plan to detect the drift, and either update your code to match reality or run an apply to overwrite the manual changes.


Q13: How does Pulumi differ from Terraform? Terraform uses HCL (HashiCorp Configuration Language), a domain-specific language. Pulumi allows engineers to write IaC using general-purpose programming languages like Python, TypeScript, or Go, enabling better looping, testing, and integration with standard software engineering practices.


Q14: Explain the use of Terraform Modules. Modules are self-contained packages of Terraform configurations that manage a specific set of related resources (e.g., a standard VPC setup). They promote code reusability, standardization, and simplify complex infrastructure deployments.
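Consuming a module is a single block; the source path, variables, and output below are illustrative assumptions about how such a VPC module might be written:

```hcl
# Reuse a standardized VPC definition instead of copy-pasting resources.
module "vpc" {
  source     = "./modules/vpc"   # local path; could also be a registry source
  cidr_block = "10.0.0.0/16"
  env        = "prod"
}

# Values the module exports can be referenced elsewhere,
# e.g. module.vpc.vpc_id
```

Teams typically publish such modules to a private registry so every environment gets the same vetted network layout.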


Q15: How do you test Infrastructure as Code? Use static analysis tools like Checkov or tfsec for security scanning before deployment. For integration testing, tools like Terratest (written in Go) can deploy the infrastructure to a sandbox, validate it works, and then tear it down.


Section 4: Containerization & Docker


Q16: How do you reduce the size of a Docker image? Use Multi-Stage Builds to separate the build environment from the runtime environment. Also, use minimal base images like Alpine or 'Distroless' images, remove unnecessary package caches, and combine RUN commands to reduce the number of image layers.
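A minimal multi-stage Dockerfile illustrating this (Go is used here purely as an example of a language that compiles to a static binary; paths are assumptions):

```dockerfile
# Build stage: full toolchain, never shipped to production.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app ./cmd/server

# Runtime stage: only the static binary lands in the final image.
FROM gcr.io/distroless/static-debian12
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```

The final image contains the binary and little else, often shrinking a gigabyte-scale build image to a few tens of megabytes.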


Q17: What are 'Distroless' container images and why use them? Distroless images contain only your application and its exact runtime dependencies. They do not contain package managers, shells (no bash), or standard Linux utilities. This drastically reduces the attack surface and improves security.


Q18: Explain the difference between CMD and ENTRYPOINT in a Dockerfile. ENTRYPOINT sets the primary executable that the container will run and is harder to override. CMD provides default arguments to the ENTRYPOINT. If you want a container to behave strictly like a specific executable, use ENTRYPOINT.
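A tiny example of the two working together; the container acts as the `ping` executable, with CMD supplying a default argument:

```dockerfile
FROM alpine:3.20
# The container always runs ping; CMD is just the default target.
ENTRYPOINT ["ping", "-c", "4"]
CMD ["localhost"]
```

`docker run image` pings localhost, while `docker run image example.com` replaces only the CMD part and pings example.com; the ENTRYPOINT stays fixed unless explicitly overridden with `--entrypoint`.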


Q19: How do you handle secrets (passwords, API keys) in Docker? Never bake secrets into the Dockerfile or commit them to the image. Pass them at runtime using environment variables, or better yet, mount them securely at runtime using a secrets manager or Docker Swarm/Kubernetes secrets.
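For the Kubernetes case, mounting a secret as a read-only file volume looks like this sketch (the pod, image, and secret names are placeholders; the Secret object itself is created separately and never committed):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo               # hypothetical pod
spec:
  containers:
    - name: app
      image: myapp:1.0     # placeholder image
      volumeMounts:
        - name: db-creds
          mountPath: /etc/secrets   # app reads credentials from files here
          readOnly: true
  volumes:
    - name: db-creds
      secret:
        secretName: db-credentials  # pre-existing Kubernetes Secret
```

File mounts are generally preferred over environment variables because env vars leak easily into logs, crash dumps, and child processes.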


Q20: What is the Container Runtime Interface (CRI)? CRI is a standard API that allows Kubernetes to use different container runtimes (like containerd or CRI-O) instead of being hardcoded to Docker. This decoupling is why the dockershim (Kubernetes' built-in Docker integration) was removed in v1.24, shifting clusters to the lighter containerd.


Section 5: Kubernetes Core & Architecture


Q21: Describe the components of the Kubernetes Control Plane. The API Server (front-end), etcd (distributed key-value store for cluster state), Scheduler (assigns pods to nodes), and Controller Manager (maintains cluster state, like ensuring the right number of pod replicas are running).


Q22: What is the difference between a Deployment and a StatefulSet? Deployments are for stateless applications where pods are identical and interchangeable. StatefulSets are for stateful applications (like databases) where pods require persistent storage, unique network identifiers, and ordered, graceful deployment and scaling.


Q23: How does a Kubernetes Service differ from an Ingress? A Service (like ClusterIP or NodePort) exposes an application running on a set of Pods within the cluster. An Ingress exposes HTTP and HTTPS routes from outside the cluster to Services within the cluster, acting as a smart, path-based reverse proxy.
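A sketch of that path-based routing; hosts, Service names, and ports are illustrative assumptions:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-routes         # hypothetical
spec:
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-svc        # ClusterIP Service for the API pods
                port:
                  number: 8080
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-svc   # everything else goes to the frontend
                port:
                  number: 80
```

The Services stay internal (ClusterIP); only the Ingress controller is exposed to the outside world.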


Q24: What is the Gateway API in Kubernetes? The Gateway API is the modern (2026 standard) evolution of Ingress. It provides a more expressive, extensible, and role-oriented way to route traffic into a cluster, separating the responsibilities of infrastructure providers, cluster operators, and application developers.


Q25: Explain the purpose of a Sidecar container. A sidecar is a secondary container that runs alongside the main application container within the same Pod. It enhances the main app without changing its code, commonly used for log forwarding, proxying (like Envoy in a Service Mesh), or fetching secrets.


Section 6: Advanced Cloud & Infrastructure


Q26: What is a Service Mesh (e.g., Istio) and why use it? A service mesh is a dedicated infrastructure layer for managing service-to-service communication. It abstracts networking logic (mTLS encryption, retries, circuit breaking, tracing) away from application code using sidecar proxies.


Q27: How does Karpenter improve upon the standard Cluster Autoscaler? Karpenter (highly popular in AWS/EKS) is a high-performance, flexible node provisioning tool. Instead of relying on rigid Auto Scaling Groups, Karpenter observes unschedulable pods, calculates the exact compute needed, and provisions right-sized instances directly, typically in seconds rather than the minutes an Auto Scaling Group round-trip can take.


Q28: What is eBPF and how is it changing DevOps? eBPF (Extended Berkeley Packet Filter) allows running sandboxed programs within the Linux kernel without changing kernel source code. In 2026, it is revolutionizing DevOps by enabling highly efficient, zero-instrumentation network observability and security (e.g., Cilium).


Q29: How do you achieve High Availability across multiple Cloud Regions? Deploy workloads independently in multiple regions, replicate databases asynchronously or use globally distributed databases (like Spanner or DynamoDB Global Tables), and use a global DNS routing service (like Route53) with health checks to route traffic to healthy regions.


Q30: What is KEDA? Kubernetes Event-driven Autoscaling (KEDA) allows you to drive the scaling of any container based on the number of events needing to be processed (e.g., scaling based on the length of a Kafka queue or an AWS SQS queue, rather than just CPU/Memory).
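A sketch of the SQS case as a KEDA ScaledObject; the Deployment name, queue URL, and limits are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-scaler          # hypothetical
spec:
  scaleTargetRef:
    name: orders-worker        # Deployment to scale
  minReplicaCount: 0           # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/orders  # placeholder
        queueLength: "10"      # target messages per replica
        awsRegion: us-east-1
```

The ability to scale to zero is a key difference from the plain Horizontal Pod Autoscaler, which cannot go below one replica on CPU/memory metrics.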


Section 7: Observability & Reliability


Q31: What is OpenTelemetry? OpenTelemetry is the 2026 industry standard framework for generating, collecting, and exporting telemetry data (metrics, logs, and traces). It provides a vendor-neutral standard, allowing you to switch observability backends (like Datadog, New Relic, or Grafana) without rewriting application code.


Q32: Explain the difference between SLI, SLO, and SLA. SLI (Service Level Indicator) is a quantitative measurement of service behavior (e.g., the percentage of requests served in under 200 ms). SLO (Service Level Objective) is the internal target your team sets for that SLI (e.g., 99% of requests < 200 ms). SLA (Service Level Agreement) is the external, legal contract with customers that dictates penalties if the commitment is not met.


Q33: What is an Error Budget? An error budget is the allowable threshold of failure (100% minus your SLO). If your SLO is 99.9% uptime, your error budget is 0.1%. If you consume the budget, you must halt new feature deployments and focus entirely on reliability.
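The arithmetic from the answer above can be made concrete in a few lines; the SLO and window are the example's own numbers:

```python
def error_budget_minutes(slo: float, window_minutes: int) -> float:
    """Allowed downtime over a time window, given an availability SLO.

    Example: a 99.9% SLO over 30 days leaves (1 - 0.999) of the window
    as budget, i.e. about 43.2 minutes of acceptable downtime.
    """
    return (1 - slo) * window_minutes

# 99.9% SLO over a 30-day window.
budget = error_budget_minutes(0.999, 30 * 24 * 60)
```

Teams track budget burn rate continuously: burning the 43 minutes in a day triggers a feature freeze long before the month is over.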


Q34: How does Prometheus pull metrics? Prometheus uses a 'pull-based' architecture. Applications expose a /metrics HTTP endpoint containing their current state in plain text. The Prometheus server periodically scrapes (pulls) this data, which is highly efficient for dynamic microservices environments.
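The pull side is configured in `prometheus.yml`; this fragment is a sketch, with the job name and target address invented for illustration:

```yaml
# prometheus.yml fragment: scrape one service's /metrics endpoint.
scrape_configs:
  - job_name: "orders-service"      # hypothetical job
    scrape_interval: 15s
    metrics_path: /metrics          # the default exposition path
    static_configs:
      - targets: ["orders:8080"]    # placeholder host:port
```

In Kubernetes, `static_configs` is usually replaced by service discovery (`kubernetes_sd_configs`), so new pods are scraped automatically.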


Q35: What is Synthetic Monitoring? Synthetic monitoring simulates user interactions (like logging in or adding an item to a cart) at regular intervals from different global locations. It proactively alerts you if a critical user journey is broken, even if overall server metrics look healthy.


Section 8: DevSecOps & AI Operations


Q36: Explain the difference between SAST, DAST, and SCA. SAST analyzes source code for flaws without running it. DAST tests the running application from the outside for vulnerabilities (like SQL injection). SCA (Software Composition Analysis) scans third-party open-source libraries for known vulnerabilities (CVEs).


Q37: What is Zero Trust Architecture in a Kubernetes cluster? Zero Trust assumes the internal network is already compromised. Within Kubernetes, this means explicitly denying all pod-to-pod communication by default using Network Policies, and requiring mutual TLS (mTLS) authentication and encryption for all internal traffic.
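The default-deny starting point is a NetworkPolicy with an empty pod selector; the namespace name is a placeholder:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod            # policies are namespaced; name is illustrative
spec:
  podSelector: {}            # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress                 # deny all inbound and outbound traffic
```

Every legitimate flow then needs its own explicit allow policy, which is exactly the Zero Trust posture: nothing talks to anything without being named.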


Q38: How do you secure the Software Supply Chain (SLSA framework)? By cryptographically signing commits, signing Docker images (using tools like Cosign/Sigstore), generating SBOMs (Software Bill of Materials), and locking down CI/CD pipelines to ensure code cannot be tampered with between development and production.


Q39: How is AI/LLM being integrated into DevOps in 2026 (AIOps)? AI is heavily used for anomaly detection in logs, auto-remediating common incidents without human intervention, summarizing complex alerts into plain English for on-call engineers, and generating IaC templates via tools like GitHub Copilot for CLI.


Q40: What are the challenges of managing MLOps (Machine Learning Ops)? MLOps differs from DevOps because it requires versioning large datasets and ML models, not just code. The infrastructure requires specialized hardware (GPUs/TPUs), and deployments must handle 'model drift,' where AI accuracy degrades over time as real-world data changes.


Section 9: Incident Management & SRE


Q41: What is a Blameless Post-Mortem? A cultural practice where teams investigate an outage to understand how the system failed, not who caused it. It assumes everyone acted with the best intentions based on the information they had, focusing on improving system resilience and processes.


Q42: Describe Chaos Engineering. The practice of intentionally injecting failures into a production or staging system (like killing random pods, simulating network latency, or dropping availability zones) to verify that the system's fault-tolerance mechanisms work as expected.


Q43: What is 'Toil' in SRE terminology? Toil is manual, repetitive, tactical work tied to running a production service that scales linearly with service growth (like manually resetting passwords or manually expanding disk volumes). SREs aim to eliminate toil through automation.


Q44: How do you handle a 'Thundering Herd' problem? A thundering herd occurs when many clients retry a failed request simultaneously, overwhelming the system. Mitigation includes implementing exponential backoff with 'jitter' (randomizing retry intervals) in the client applications, and aggressive rate-limiting at the API gateway.
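A minimal sketch of "full jitter" backoff as described above; the function name and default base/cap values are illustrative choices, not a library API:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff.

    Each retry waits a random delay drawn uniformly from
    [0, min(cap, base * 2^attempt)], so clients spread out instead of
    retrying in lockstep and re-stampeding the server.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)

# Delays for the first 8 retry attempts of one hypothetical client.
delays = [backoff_delay(n) for n in range(8)]
```

The jitter is the crucial part: pure exponential backoff without randomization still synchronizes all clients onto the same retry schedule.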


Q45: What is the role of an Incident Commander? During a major outage, the Incident Commander controls the response. They do not debug systems; instead, they coordinate communication, assign tasks, make executive decisions, and ensure engineers have the focus and resources they need to resolve the issue.


Section 10: Final Prep & Behavioral


Q46: How do you approach learning a completely new tool or cloud service? Demonstrate a structured approach: read official documentation/architecture overviews, run a local sandbox or small proof-of-concept, integrate it into a CI pipeline, and review security best practices before proposing it for production.


Q47: Describe a time you brought down a production system. How did you handle it? (Behavioral) The interviewer wants honesty, accountability, and a focus on resolution. Detail the mistake, the immediate actions taken to restore service (rollback), the blameless post-mortem process, and the permanent guardrails you implemented to prevent recurrence.


Q48: What is your recommended study plan to master these 48 questions? Understand the 'Why' behind the tools, not just the 'How'. Build a hands-on project that incorporates an end-to-end GitOps pipeline, deploy a microservice to Kubernetes using Terraform, and monitor it with Prometheus. Practical experience solidifies theoretical answers.