# Kubernetes Advanced: Operators & Beyond

## The Operator Pattern

An **Operator** is a controller that encodes human operational knowledge about a stateful application into Kubernetes-native automation. It watches custom resources, compares desired state with actual state, and reconciles the difference.

### The Control Loop (Reconciliation)

```
Watch → Detect drift → Reconcile → Repeat
```

Every built-in Kubernetes controller (Deployment, ReplicaSet) runs this loop. Operators extend it with your own resources and logic.

```
User applies CR → Operator watches → Compares desired state vs actual → Takes action → Updates status
```

---

## Custom Resource Definitions (CRDs)

CRDs extend the Kubernetes API with your own resource types.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.mycompany.io
spec:
  group: mycompany.io
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                engine:
                  type: string
                  enum: [postgres, mysql]
                replicas:
                  type: integer
                  minimum: 1
                storageGB:
                  type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                readyReplicas:
                  type: integer
      subresources:
        status: {}   # enables /status subresource
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames: [db]
```

Once the CRD is installed, you use it like any built-in resource:

```bash
kubectl get databases
kubectl describe database my-postgres
```

### Custom Resource (CR) instance

```yaml
apiVersion: mycompany.io/v1
kind: Database
metadata:
  name: my-postgres
  namespace: production
spec:
  engine: postgres
  replicas: 3
  storageGB: 100
```

---

## Building an Operator

### Option 1: kubebuilder (recommended, Go)

```bash
# Bootstrap
kubebuilder init --domain mycompany.io --repo github.com/mycompany/db-operator
kubebuilder create api --group mycompany --version v1 --kind Database

# Generates:
#   api/v1/database_types.go             CRD struct
#   controllers/database_controller.go   reconcile loop (internal/controller/ in newer layouts)
#   config/crd/                          CRD manifests
#   config/rbac/                         RBAC for operator SA
```

kubebuilder composes the API group as `<group>.<domain>`, so these flags produce `mycompany.mycompany.io` rather than the `mycompany.io` group used in the hand-written CRD above.

### Option 2: Operator SDK (supports Go, Ansible, Helm)

```bash
operator-sdk init --domain mycompany.io --repo github.com/mycompany/db-operator
operator-sdk create api --group mycompany --version v1 --kind Database --resource --controller
```

### The Reconcile loop (Go)

```go
import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	mycompanyv1 "github.com/mycompany/db-operator/api/v1"
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the CR; NotFound means it was deleted, so there is nothing to do
	db := &mycompanyv1.Database{}
	if err := r.Get(ctx, req.NamespacedName, db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Compute desired state and mark the CR as its owner (enables GC)
	desired := buildStatefulSet(db)
	if err := ctrl.SetControllerReference(db, desired, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Fetch actual state
	actual := &appsv1.StatefulSet{}
	err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, actual)
	switch {
	case apierrors.IsNotFound(err):
		// 4a. Create if missing
		logger.Info("creating StatefulSet", "name", desired.Name)
		if err := r.Create(ctx, desired); err != nil {
			return ctrl.Result{}, err
		}
	case err != nil:
		// Any other read error: return it so the request is retried
		return ctrl.Result{}, err
	default:
		// 4b. Converge the existing object onto the desired spec
		actual.Spec = desired.Spec
		if err := r.Update(ctx, actual); err != nil {
			return ctrl.Result{}, err
		}
	}

	// 5. Update status (uses the /status subresource)
	db.Status.Phase = "Running"
	if err := r.Status().Update(ctx, db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```

Key points:

- **Idempotent**: reconcile can be called any number of times and must converge on the same result
- **Return `ctrl.Result{}`** (with a nil error) to stop until the next watch event; `ctrl.Result{RequeueAfter: ...}` to requeue after a delay
- **IgnoreNotFound**: the CR no longer exists, so there is nothing left to reconcile; return nil (use a finalizer if cleanup must run before deletion)
- **ctrl.SetControllerReference**: tie child resources to parent (owner references → GC)

### Owner References (cascading delete)

```go
if err := ctrl.SetControllerReference(db, statefulSet, r.Scheme); err != nil {
	return ctrl.Result{}, err
}
// When the Database CR is deleted, the StatefulSet is automatically GC'd
```

---

## Well-Known Operators

| Operator | What it manages |
|---|---|
| **cert-manager** | TLS certs via `Certificate` CR; integrates Let's Encrypt, Vault |
| **prometheus-operator** | `ServiceMonitor`, `PrometheusRule` CRs; no manual scrape config editing |
| **postgres-operator (Zalando)** | HA Postgres clusters with failover, backups, users |
| **strimzi** | Apache Kafka clusters on k8s |
| **ArgoCD** | GitOps: `Application` CRs sync Git repos to cluster state |
| **Flux** | GitOps: `HelmRelease`, `Kustomization` CRs |
| **Velero** | Cluster backup/restore |
| **keda** | Event-driven autoscaling: scale on queue depth, cron, custom metrics |
| **crossplane** | Infrastructure as CRs: provision cloud resources (RDS, S3) from k8s |

---

## Horizontal Pod Autoscaler (HPA)

Scales the replica count of a workload (Deployment, StatefulSet) based on CPU, memory, or custom metrics.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: AverageValue
          averageValue: 200Mi
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 30
```

```bash
kubectl get hpa
kubectl describe hpa my-app-hpa
```

**Requires:** metrics-server installed in the cluster (it supplies the CPU and memory metrics).

### KEDA (Kubernetes Event-Driven Autoscaling)

Extends HPA to scale on almost anything: queue depth, Kafka lag, cron, custom metrics, even down to zero.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaler
spec:
  scaleTargetRef:
    name: my-app
  minReplicaCount: 0    # scale to zero!
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        queueName: jobs
        mode: QueueLength   # replaces the deprecated queueLength field
        value: "5"          # 1 replica per 5 messages
    - type: cron
      metadata:
        timezone: Europe/London
        start: "0 8 * * 1-5"    # scale up Mon-Fri 8am
        end: "0 18 * * 1-5"
        desiredReplicas: "5"
```

---

## Vertical Pod Autoscaler (VPA)

Recommends (or auto-applies) right-sized CPU/memory requests.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
      - containerName: app
        minAllowed:
          cpu: 100m
          memory: 50Mi
        maxAllowed:
          cpu: 2
          memory: 2Gi
```

`Off`: recommendations only (read with `kubectl describe vpa`). `Auto`: evicts pods and recreates them with new requests. Use with care for stateful workloads, since every adjustment restarts pods.

**VPA and HPA on same metric = conflict.** If VPA manages CPU/memory requests, drive the HPA from custom or external metrics rather than CPU or memory utilization.


---

## Pod Disruption Budgets (PDB)

Limits how many pods of an application can be down at once during voluntary disruptions that go through the Eviction API (node drain, cluster autoscaler scale-down). Rolling updates of a Deployment follow its own `maxUnavailable`/`maxSurge` settings and are not limited by the PDB.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2        # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

```bash
kubectl get pdb
kubectl drain <node-name> --ignore-daemonsets   # respects PDBs; blocks if it would violate
```

---

## Priority Classes

Determines scheduling order and which pods are preempted or evicted first when nodes run out of resources.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
```

```yaml
# In pod spec
priorityClassName: high-priority
```

Higher value = scheduled first and harder to evict. The built-in `system-node-critical` class uses `2000001000` (`system-cluster-critical` uses `2000000000`); user-defined classes must stay at or below `1000000000`.

---

## Admission Webhooks

Intercept API requests before they persist. Two types:

| Type | Can modify? | Can reject? | Use for |
|---|---|---|---|
| **MutatingAdmissionWebhook** | Yes | Yes | Inject sidecars, set defaults |
| **ValidatingAdmissionWebhook** | No | Yes | Policy enforcement |

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: inject.mycompany.io
    clientConfig:
      service:
        name: sidecar-injector-svc
        namespace: kube-system
        path: /mutate
      caBundle: <base64-encoded CA bundle>
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    namespaceSelector:
      matchLabels:
        injection: enabled
    failurePolicy: Fail   # Fail | Ignore
    admissionReviewVersions: ["v1"]
    sideEffects: None
```

The webhook receives an `AdmissionReview` JSON object and must return one whose `response` sets `allowed: true/false` and, for mutations, a base64-encoded JSON patch with `patchType: JSONPatch`.

**cert-manager** can auto-rotate the webhook TLS cert (recommended).

---

## Service Mesh (Istio / Linkerd)

A service mesh adds a sidecar proxy to every pod (Envoy for Istio, linkerd-proxy for Linkerd).
The control plane manages proxy config; you get:

| Feature | How |
|---|---|
| mTLS between pods | Automatic cert rotation per service identity |
| Traffic splitting | `VirtualService` weight routing (canary, A/B) |
| Retry / timeout / circuit breaker | Per-route policy, no code changes |
| Observability | Automatic metrics, traces, access logs per request |
| Rate limiting | `EnvoyFilter` or `RateLimitService` |

### Istio: key CRDs

```yaml
# VirtualService: traffic routing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts: [my-app]
  http:
    - match:
        - headers:
            x-canary: { exact: "true" }
      route:
        - destination: { host: my-app, subset: v2 }
    - route:
        - destination: { host: my-app, subset: v1 }
          weight: 90
        - destination: { host: my-app, subset: v2 }
          weight: 10
---
# DestinationRule: defines subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```

---

## GitOps with ArgoCD

ArgoCD watches a Git repo and syncs it to the cluster. Drift is detected and can be auto-corrected.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/mycompany/my-app
    targetRevision: main
    path: deploy/k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual changes to cluster
    syncOptions:
      - CreateNamespace=true
```

```bash
argocd app list
argocd app sync my-app
argocd app diff my-app
argocd app history my-app
```

---

## Multi-tenancy Patterns

| Pattern | Isolation level | Tool |
|---|---|---|
| Namespace per team | Soft: shared API server | NetworkPolicy + RBAC + ResourceQuota |
| vCluster | Medium: virtual cluster per tenant | vcluster (Loft Labs) |
| Separate clusters | Hard: full isolation | Cluster API, EKS, GKE |

### Hierarchical Namespaces (HNC)

```bash
# Create child namespace inheriting parent RBAC/NetworkPolicy
kubectl hns create staging --namespace production
```

---

## Cluster API (CAPI)

Manage cluster lifecycle (create, upgrade, delete) using Kubernetes CRs: clusters as code.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2   # depends on the provider release
    kind: AWSCluster
    name: prod-aws
```

Infrastructure providers: AWS, GCP, Azure, vSphere, OpenStack.


---

## Useful Advanced kubectl

```bash
# Force-replace (delete + create; breaks connections)
kubectl replace --force -f manifest.yaml

# Server-side apply (tracks field ownership)
kubectl apply --server-side -f manifest.yaml

# Strategic merge patch
kubectl patch deployment my-app -p '{"spec":{"replicas":5}}'

# JSON patch
kubectl patch pod my-pod --type='json' \
  -p='[{"op":"replace","path":"/spec/containers/0/image","value":"nginx:1.26"}]'

# Get with go-template
kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'

# Wait for condition
kubectl wait --for=condition=Ready pod -l app=my-app --timeout=120s
kubectl wait --for=condition=complete job/my-job --timeout=300s

# Debug with ephemeral container (k8s 1.23+)
kubectl debug -it my-pod --image=busybox --target=app

# Copy a running pod for debugging (kubectl run has no -f flag; use debug --copy-to)
kubectl debug my-pod -it --copy-to=debug-pod --container=app -- sh

# Check RBAC
kubectl auth can-i create deployments --as=system:serviceaccount:default:my-sa -n production
kubectl auth whoami
```

---

## Security Hardening

### Pod Security Standards (enforced by Pod Security Admission; replaces PSP in k8s 1.25+)

```yaml
# Enforce restricted standard on a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Three levels, from least to most restrictive: `privileged` → `baseline` → `restricted`.
### Secure pod spec

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - mountPath: /tmp
          name: tmp-dir   # writable scratch if readOnlyRootFilesystem
  volumes:
    - name: tmp-dir
      emptyDir: {}
```

### Image scanning

```bash
# Trivy (most common)
trivy image nginx:1.25
trivy k8s --report summary cluster   # scan whole cluster
```

---

## etcd: What's Under the Hood

All cluster state lives in etcd (distributed key-value store). Nodes, pods, secrets, CRDs: everything.

```bash
# Direct etcd read (from control plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  get /registry/pods/default --prefix --keys-only

# Backup (pass the same --endpoints/--cacert/--cert/--key flags as above)
etcdctl snapshot save backup.db

# Restore (offline; newer etcd releases move this to etcdutl)
etcdctl snapshot restore backup.db --data-dir /var/lib/etcd-restore
```

Secrets are stored in etcd base64-encoded but unencrypted by default; enable encryption at rest (`EncryptionConfiguration`) if not using an external secrets manager.

---

## Summary: Complexity Ladder

```
Pods + Deployments + Services    ← baseline cheat sheet
        ↓
CRDs + Operators                 ← extend the API; encode operational knowledge
        ↓
HPA / VPA / KEDA + PDB           ← autoscaling + resilience
        ↓
Admission Webhooks               ← policy enforcement + defaults injection
        ↓
Service Mesh (Istio/Linkerd)     ← L7 observability + mTLS + traffic control
        ↓
GitOps (ArgoCD/Flux)             ← cluster state as Git truth
        ↓
Multi-cluster / Cluster API      ← fleet management
```