Kubernetes — Advanced: Operators & Beyond
The Operator Pattern
An Operator is a controller that encodes human operational knowledge about a stateful application into Kubernetes-native automation. It watches custom resources, compares desired vs actual state, and reconciles.
The Control Loop (Reconciliation)
Watch → Detect drift → Reconcile → Repeat
Every built-in Kubernetes controller (Deployment, ReplicaSet) runs this loop. Operators extend it with your own resources and logic.
User applies CR → Operator watches → Compares desired state vs actual → Takes action → Updates status
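The loop above can be sketched in plain Go with no Kubernetes libraries. The `State` type and `reconcile` function are hypothetical names used only for illustration; a real operator would use controller-runtime and be driven by watch events rather than a bare `for` loop.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of a workload: just a replica count.
type State struct {
	Replicas int
}

// reconcile compares desired and actual state and takes one step that moves
// actual toward desired. It is idempotent: once the states match, every
// further call returns "noop" and changes nothing.
func reconcile(desired State, actual *State) string {
	switch {
	case actual.Replicas < desired.Replicas:
		actual.Replicas++
		return "scale-up"
	case actual.Replicas > desired.Replicas:
		actual.Replicas--
		return "scale-down"
	default:
		return "noop"
	}
}

func main() {
	desired := State{Replicas: 3}
	actual := &State{Replicas: 1}
	// Repeat until no drift remains; a real controller would block on watch events.
	for {
		action := reconcile(desired, actual)
		fmt.Println(action, actual.Replicas)
		if action == "noop" {
			break
		}
	}
}
```

The point of the sketch is that the function only looks at current state, never at what it did last time, which is what makes repeated calls safe.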
Custom Resource Definitions (CRDs)
CRDs extend the Kubernetes API with your own resource types.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.mycompany.io
spec:
  group: mycompany.io
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: [postgres, mysql]
                replicas:
                  type: integer
                  minimum: 1
                storageGB:
                  type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                readyReplicas:
                  type: integer
      subresources:
        status: {} # enables /status subresource
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames: [db]
Once the CRD is installed, you use it like any built-in resource:
kubectl get databases
kubectl describe database my-postgres
Custom Resource (CR) instance
apiVersion: mycompany.io/v1
kind: Database
metadata:
  name: my-postgres
  namespace: production
spec:
  engine: postgres
  replicas: 3
  storageGB: 100
Building an Operator
Option 1: kubebuilder (recommended, Go)
# Bootstrap
kubebuilder init --domain mycompany.io --repo github.com/mycompany/db-operator
kubebuilder create api --group mycompany --version v1 --kind Database
# Generates:
# api/v1/database_types.go — CRD struct
# controllers/database_controller.go — reconcile loop
# config/crd/ — CRD manifests
# config/rbac/ — RBAC for operator SA
Option 2: Operator SDK (supports Go, Ansible, Helm)
operator-sdk init --domain mycompany.io --repo github.com/mycompany/db-operator
operator-sdk create api --group mycompany --version v1 --kind Database --resource --controller
The Reconcile loop (Go)
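import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	mycompanyv1 "github.com/mycompany/db-operator/api/v1"
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)
	// 1. Fetch the CR (NotFound means it was deleted: nothing left to do)
	db := &mycompanyv1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// 2. Compute desired state (buildStatefulSet is a helper not shown here) and set the CR as owner
	desired := buildStatefulSet(db)
	if err := ctrl.SetControllerReference(db, desired, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}
	// 3. Fetch actual state
	actual := &appsv1.StatefulSet{}
	err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, actual)
	switch {
	case errors.IsNotFound(err):
		// 4a. Create if missing
		logger.Info("creating StatefulSet", "name", desired.Name)
		if err := r.Create(ctx, desired); err != nil {
			return ctrl.Result{}, err
		}
	case err != nil:
		// Any other read error: return it so the request is retried with backoff
		return ctrl.Result{}, err
	default:
		// 4b. Overwrite the live spec with the desired one
		actual.Spec = desired.Spec
		if err := r.Update(ctx, actual); err != nil {
			return ctrl.Result{}, err
		}
	}
	// 5. Update status through the /status subresource
	db.Status.Phase = "Running"
	if err := r.Status().Update(ctx, db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}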
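Key points:
- Idempotent: reconcile can be called any number of times and must converge on the same result
- Return ctrl.Result{} to stop; ctrl.Result{RequeueAfter: ...} to requeue after a delay; a non-nil error to retry with backoff
- client.IgnoreNotFound: turns a NotFound error into nil, so a deleted CR ends the reconcile without a retry (cleanup is left to owner references or finalizers)
- ctrl.SetControllerReference: ties child resources to the parent (owner references → GC)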
Owner References (cascading delete)
ctrl.SetControllerReference(db, statefulSet, r.Scheme)
// When the Database CR is deleted, the StatefulSet is automatically GC'd
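How cascading deletion follows owner references can be sketched with the standard library alone. The `Object` type and `collectGarbage` function below are hypothetical simplifications: real objects carry a list of owner references keyed by UID, and the garbage collector runs inside the controller manager. The sketch assumes a single owner per object and no ownership cycles.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for a Kubernetes object: its name plus
// the name of its owner ("" means it has no owner).
type Object struct {
	Name  string
	Owner string
}

// collectGarbage returns the objects that become garbage once `deleted` is
// removed: its direct dependents and, transitively, their dependents.
func collectGarbage(objects []Object, deleted string) []string {
	var doomed []string
	queue := []string{deleted}
	for len(queue) > 0 {
		owner := queue[0]
		queue = queue[1:]
		for _, o := range objects {
			if o.Owner == owner {
				doomed = append(doomed, o.Name)
				queue = append(queue, o.Name)
			}
		}
	}
	return doomed
}

func main() {
	objects := []Object{
		{Name: "statefulset/my-postgres", Owner: "database/my-postgres"},
		{Name: "pod/my-postgres-0", Owner: "statefulset/my-postgres"},
		{Name: "configmap/unrelated", Owner: ""},
	}
	// Deleting the Database CR takes the StatefulSet and its pod with it.
	fmt.Println(collectGarbage(objects, "database/my-postgres"))
}
```

Objects without an owner, such as the ConfigMap in the example, are never collected this way and need explicit cleanup.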
Well-Known Operators
| Operator | What it manages |
|---|---|
| cert-manager | TLS certs via Certificate CR; integrates Let's Encrypt, Vault |
| prometheus-operator | ServiceMonitor, PrometheusRule CRs — no manual scrape config editing |
| postgres-operator (Zalando) | HA Postgres clusters with failover, backups, users |
| strimzi | Apache Kafka clusters on k8s |
| ArgoCD | GitOps — Application CRs sync Git repos to cluster state |
| Flux | GitOps — HelmRelease, Kustomization CRs |
| Velero | Cluster backup/restore |
| keda | Event-driven autoscaling — scale on queue depth, cron, custom metrics |
| crossplane | Infrastructure as CRs — provision cloud resources (RDS, S3) from k8s |
Horizontal Pod Autoscaler (HPA)
Scales Deployment replicas based on CPU, memory, or custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 200Mi
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 30
kubectl get hpa
kubectl describe hpa my-app-hpa
Requires: metrics-server installed in cluster.
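The replica count the HPA aims for follows a documented formula: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (10% by default). A minimal sketch of that arithmetic, with a hypothetical function name and without the stabilization windows or scaling policies from the manifest:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA formula and clamps the result to
// [minReplicas, maxReplicas]. Stabilization windows are not modelled.
func desiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	const tolerance = 0.1 // default tolerance of the HPA controller
	ratio := currentMetric / targetMetric
	if math.Abs(ratio-1.0) <= tolerance {
		return current // close enough to target: no scaling
	}
	desired := int(math.Ceil(float64(current) * ratio))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// 4 replicas averaging 90% CPU against the 70% target from the manifest.
	fmt.Println(desiredReplicas(4, 90, 70, 2, 20))
	// 72% is within tolerance of 70%, so the count stays put.
	fmt.Println(desiredReplicas(4, 72, 70, 2, 20))
}
```

When several metrics are configured, as in the manifest above, the HPA computes a count per metric and uses the largest.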
KEDA (Kubernetes Event-Driven Autoscaling)
Extends HPA to scale on anything — queue depth, Kafka lag, cron, custom metrics, even down to zero.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaler
spec:
  scaleTargetRef:
    name: my-app
  minReplicaCount: 0 # scale to zero!
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs
        queueLength: "5" # 1 replica per 5 messages
    - type: cron
      metadata:
        timezone: Europe/London
        start: "0 8 * * 1-5" # scale up Mon-Fri 8am
        end: "0 18 * * 1-5"
        desiredReplicas: "5"
Vertical Pod Autoscaler (VPA)
Recommends (or auto-applies) right-sized CPU/memory requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off" # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 50Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
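Off: recommendations only (read with kubectl describe vpa).
Auto: evicts pods and recreates them with the new requests, so expect restarts; use with care for stateful workloads and pair it with a PDB.
VPA and HPA acting on the same metric conflict. If VPA manages CPU/memory requests, drive HPA from custom or external metrics rather than CPU or memory utilization.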
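The effect of minAllowed and maxAllowed is simply to bound whatever the recommender computes. A small sketch of that clamping, using a hypothetical `clamp` helper and CPU expressed in millicores:

```go
package main

import "fmt"

// clamp bounds a raw recommendation to the [minAllowed, maxAllowed] range
// of a container policy. Units are up to the caller (here: millicores).
func clamp(recommended, minAllowed, maxAllowed int64) int64 {
	if recommended < minAllowed {
		return minAllowed
	}
	if recommended > maxAllowed {
		return maxAllowed
	}
	return recommended
}

func main() {
	// Bounds from the manifest above: cpu 100m to 2 cores (2000m).
	for _, raw := range []int64{40, 750, 3500} {
		fmt.Printf("raw=%dm applied=%dm\n", raw, clamp(raw, 100, 2000))
	}
}
```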
Pod Disruption Budgets (PDB)
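Limits how many pods can be down at once during voluntary disruptions that go through the eviction API (node drain, cluster autoscaler scale-down). Deployment rolling updates are governed by the rollout strategy (maxUnavailable/maxSurge), not by the PDB.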
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2 # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
kubectl get pdb
kubectl drain <node> --ignore-daemonsets # respects PDBs — blocks if it would violate
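Whether a drain blocks comes down to one subtraction. The sketch below mirrors the disruptionsAllowed and currentHealthy fields reported in the PDB status for a minAvailable budget; the function names are hypothetical, and a maxUnavailable budget is not modelled.

```go
package main

import "fmt"

// disruptionsAllowed returns how many healthy pods may be evicted right now
// without dropping below minAvailable.
func disruptionsAllowed(currentHealthy, minAvailable int) int {
	if allowed := currentHealthy - minAvailable; allowed > 0 {
		return allowed
	}
	return 0
}

// canEvict reports whether a single eviction request would be admitted.
func canEvict(currentHealthy, minAvailable int) bool {
	return disruptionsAllowed(currentHealthy, minAvailable) > 0
}

func main() {
	// minAvailable: 2 from the manifest above.
	fmt.Println(canEvict(3, 2)) // three healthy pods: one eviction is admitted
	fmt.Println(canEvict(2, 2)) // already at the floor: the drain waits
}
```

A consequence worth noting: with minAvailable equal to the replica count, no eviction is ever admitted and node drains stall until the budget or the replica count changes.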
Priority Classes
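Sets scheduling order and preemption: when no node has room, the scheduler can evict lower-priority pods to place a higher-priority one. Priority is also taken into account during kubelet node-pressure eviction.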
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
# In pod spec
priorityClassName: high-priority
Higher value = scheduled first and harder to evict. The built-in classes system-cluster-critical and system-node-critical use 2000000000 and 2000001000; user-defined classes can go up to 1000000000.
Admission Webhooks
Intercept API requests before they persist. Two types:
| Type | Can modify? | Can reject? | Use for |
|---|---|---|---|
| MutatingAdmissionWebhook | Yes | Yes | Inject sidecars, set defaults |
| ValidatingAdmissionWebhook | No | Yes | Policy enforcement |
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: inject.mycompany.io
    clientConfig:
      service:
        name: sidecar-injector-svc
        namespace: kube-system
        path: /mutate
      caBundle: <base64-ca-cert>
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        injection: enabled
    failurePolicy: Fail # Fail | Ignore
    admissionReviewVersions: ["v1"]
    sideEffects: None
The webhook receives an AdmissionReview JSON object and must return one with allowed: true/false and optionally a JSON patch for mutations.
cert-manager can auto-rotate the webhook TLS cert — recommended.
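The AdmissionReview exchange can be sketched with only encoding/json. The structs below are a hand-written minimal subset of the admission.k8s.io/v1 types (a real webhook would import k8s.io/api/admission/v1 and serve this over HTTPS), and the patch adds a hypothetical label rather than a sidecar container. It assumes the pod already has a labels map, since a JSON patch "add" fails when the parent path is missing.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the AdmissionReview types, declared locally so the
// sketch needs only the standard library.
type AdmissionRequest struct {
	UID string `json:"uid"`
}

type AdmissionResponse struct {
	UID       string `json:"uid"` // must echo the request UID
	Allowed   bool   `json:"allowed"`
	PatchType string `json:"patchType,omitempty"`
	Patch     []byte `json:"patch,omitempty"` // encoding/json base64-encodes []byte
}

type AdmissionReview struct {
	APIVersion string             `json:"apiVersion"`
	Kind       string             `json:"kind"`
	Request    *AdmissionRequest  `json:"request,omitempty"`
	Response   *AdmissionResponse `json:"response,omitempty"`
}

// mutate allows the request and attaches a JSON patch (hypothetical label).
func mutate(req AdmissionRequest) AdmissionResponse {
	patch := `[{"op":"add","path":"/metadata/labels/injected","value":"true"}]`
	return AdmissionResponse{UID: req.UID, Allowed: true, PatchType: "JSONPatch", Patch: []byte(patch)}
}

// review decodes an incoming AdmissionReview body and encodes the reply.
func review(body []byte) ([]byte, error) {
	var in AdmissionReview
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, err
	}
	if in.Request == nil {
		return nil, fmt.Errorf("admission review has no request")
	}
	resp := mutate(*in.Request)
	return json.Marshal(AdmissionReview{APIVersion: in.APIVersion, Kind: in.Kind, Response: &resp})
}

func main() {
	in := `{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","request":{"uid":"abc-123"}}`
	out, err := review([]byte(in))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A validating webhook would follow the same shape but leave patchType and patch empty and set allowed according to its policy.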
Service Mesh (Istio / Linkerd)
A service mesh adds a sidecar proxy to every pod (envoy for Istio, linkerd-proxy for Linkerd). The control plane manages proxy config; you get:
| Feature | How |
|---|---|
| mTLS between pods | Automatic cert rotation per service identity |
| Traffic splitting | VirtualService weight routing (canary, A/B) |
| Retry / timeout / circuit breaker | Per-route policy, no code changes |
| Observability | Automatic metrics, traces, access logs per request |
| Rate limiting | EnvoyFilter or RateLimitService |
Istio — key CRDs
# VirtualService — traffic routing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts: [my-app]
  http:
    - match:
        - headers:
            x-canary: { exact: "true" }
      route:
        - destination: { host: my-app, subset: v2 }
    - route:
        - destination: { host: my-app, subset: v1 }
          weight: 90
        - destination: { host: my-app, subset: v2 }
          weight: 10
---
# DestinationRule — defines subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
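The 90/10 split in the VirtualService amounts to a weighted choice per request. The sketch below shows the idea with hypothetical `Route` and `pick` names; it is not how Envoy implements it, only the arithmetic behind the weights.

```go
package main

import "fmt"

// Route pairs a subset name with its weight, as in the VirtualService above.
type Route struct {
	Subset string
	Weight int
}

// pick maps a number n in [0, total weight) to a subset, so each subset
// receives a share of values proportional to its weight.
func pick(routes []Route, n int) string {
	for _, r := range routes {
		if n < r.Weight {
			return r.Subset
		}
		n -= r.Weight
	}
	return "" // n was outside [0, total weight)
}

func main() {
	routes := []Route{{Subset: "v1", Weight: 90}, {Subset: "v2", Weight: 10}}
	counts := map[string]int{}
	// Walk every value in [0, 100) once; a real proxy would draw n at random per request.
	for n := 0; n < 100; n++ {
		counts[pick(routes, n)]++
	}
	fmt.Println(counts["v1"], counts["v2"])
}
```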
GitOps with ArgoCD
ArgoCD watches a Git repo and syncs it to the cluster. Drift is detected and can be auto-corrected.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/mycompany/my-app
    targetRevision: main
    path: deploy/k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true # delete resources removed from Git
      selfHeal: true # revert manual changes to cluster
    syncOptions:
      - CreateNamespace=true
argocd app list
argocd app sync my-app
argocd app diff my-app
argocd app history my-app
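How prune and selfHeal shape an automated sync can be approximated in a few lines. This is a simplified model, not ArgoCD's implementation: resources are reduced to a name and a content hash, both functions are hypothetical, and `live` is assumed to contain only resources tracked by the Application.

```go
package main

import (
	"fmt"
	"sort"
)

// shouldAutoSync models when automated sync fires: on a new Git revision,
// and on live drift only when selfHeal is enabled.
func shouldAutoSync(gitChanged, liveDrifted, selfHeal bool) bool {
	return gitChanged || (liveDrifted && selfHeal)
}

// syncActions lists what one sync would do. Both maps are keyed by resource
// name and hold a content hash standing in for a real manifest diff.
func syncActions(git, live map[string]string, prune bool) []string {
	var actions []string
	for name, want := range git {
		have, exists := live[name]
		if !exists {
			actions = append(actions, "create "+name)
		} else if have != want {
			actions = append(actions, "update "+name)
		}
	}
	if prune {
		for name := range live {
			if _, inGit := git[name]; !inGit {
				actions = append(actions, "delete "+name)
			}
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

func main() {
	git := map[string]string{"deployment/my-app": "h1", "service/my-app": "h2"}
	live := map[string]string{"deployment/my-app": "h1-edited", "configmap/old": "h3"}
	// Git is unchanged but the cluster drifted; selfHeal lets the sync run.
	if shouldAutoSync(false, true, true) {
		fmt.Println(syncActions(git, live, true))
	}
}
```

With prune set to false the stale ConfigMap would be left in place and only reported as out of sync.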
Multi-tenancy Patterns
| Pattern | Isolation level | Tool |
|---|---|---|
| Namespace per team | Soft — shared API server | NetworkPolicy + RBAC + ResourceQuota |
| vCluster | Medium — virtual cluster per tenant | vcluster (Loft Labs) |
| Separate clusters | Hard — full isolation | Cluster API, EKS, GKE |
Hierarchical Namespaces (HNC)
# Create child namespace inheriting parent RBAC/NetworkPolicy
kubectl hns create staging --namespace production
Cluster API (CAPI)
Manage cluster lifecycle (create, upgrade, delete) using Kubernetes CRs — clusters as code.
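apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2 # depends on the provider release
    kind: AWSCluster
    name: prod-aws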
Infrastructure providers: AWS, GCP, Azure, vSphere, OpenStack.
Useful Advanced kubectl
# Force-replace (delete + create — breaks connections)
kubectl replace --force -f manifest.yaml
# Server-side apply (tracks field ownership)
kubectl apply --server-side -f manifest.yaml
# Strategic merge patch
kubectl patch deployment my-app -p '{"spec":{"replicas":5}}'
# JSON patch
kubectl patch pod my-pod --type='json' \
-p='[{"op":"replace","path":"/spec/containers/0/image","value":"nginx:1.26"}]'
# Get with go-template
kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'
# Wait for condition
kubectl wait --for=condition=Ready pod -l app=my-app --timeout=120s
kubectl wait --for=condition=complete job/my-job --timeout=300s
# Debug with ephemeral container (k8s 1.23+)
kubectl debug -it my-pod --image=busybox --target=app
# Copy a running pod for debugging (adds a busybox container to the copy)
kubectl debug my-pod -it --copy-to=debug-pod --image=busybox --share-processes
# Check RBAC
kubectl auth can-i create deployments --as=system:serviceaccount:default:my-sa -n production
kubectl auth whoami
Security Hardening
Pod Security Standards (enforced by Pod Security Admission, which replaces PSP; PSP was removed in k8s 1.25)
# Enforce restricted standard on a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Three levels: privileged → baseline → restricted.
Secure pod spec
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - mountPath: /tmp
          name: tmp-dir # writable scratch if readOnlyRootFilesystem
  volumes:
    - name: tmp-dir
      emptyDir: {}
Image scanning
# Trivy (most common)
trivy image nginx:1.25
trivy k8s --report summary cluster # scan whole cluster
etcd — What's Under the Hood
All cluster state lives in etcd (distributed key-value store). Nodes, pods, secrets, CRDs — everything.
# Direct etcd read (from control plane node)
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
get /registry/pods/default --prefix --keys-only
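# Backup (snapshot save needs the same --endpoints/--cacert/--cert/--key flags as the read above)
etcdctl snapshot save backup.db
# Restore into a fresh data dir (etcd 3.5+ deprecates this in favor of: etcdutl snapshot restore)
etcdctl snapshot restore backup.db --data-dir /var/lib/etcd-restore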
Secrets are stored in etcd — enable encryption at rest (EncryptionConfiguration) if not using a secrets manager.
Summary — Complexity Ladder
Pods + Deployments + Services ← baseline cheat sheet
↓
CRDs + Operators ← extend the API; encode operational knowledge
↓
HPA / VPA / KEDA + PDB ← autoscaling + resilience
↓
Admission Webhooks ← policy enforcement + defaults injection
↓
Service Mesh (Istio/Linkerd) ← L7 observability + mTLS + traffic control
↓
GitOps (ArgoCD/Flux) ← cluster state as Git truth
↓
Multi-cluster / Cluster API ← fleet management