# Kubernetes — Advanced: Operators & Beyond

## The Operator Pattern

An **Operator** is a controller that encodes human operational knowledge about a stateful application into Kubernetes-native automation. It watches custom resources, compares desired vs actual state, and reconciles.

### The Control Loop (Reconciliation)

```
Watch → Detect drift → Reconcile → Repeat
```

Every built-in Kubernetes controller (Deployment, ReplicaSet) runs this loop. Operators extend it with your own resources and logic.

```
User applies CR → Operator watches → Compares desired state vs actual → Takes action → Updates status
```
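
The loop can be sketched without any Kubernetes libraries. The following is a minimal, illustrative Go sketch; `State` and `reconcile` are hypothetical names, and the in-memory `actual` value stands in for what a real controller would read from the API server.

```go
package main

import "fmt"

// State is a hypothetical stand-in for a resource's replica count.
type State struct {
	Replicas int
}

// reconcile compares desired and actual state and returns the corrected
// actual state plus a description of the action taken. It is idempotent:
// calling it again once the states match is a no-op.
func reconcile(desired, actual State) (State, string) {
	switch {
	case actual.Replicas < desired.Replicas:
		return desired, fmt.Sprintf("scale up by %d", desired.Replicas-actual.Replicas)
	case actual.Replicas > desired.Replicas:
		return desired, fmt.Sprintf("scale down by %d", actual.Replicas-desired.Replicas)
	default:
		return actual, "no-op"
	}
}

func main() {
	desired := State{Replicas: 3}
	actual := State{Replicas: 1}

	// Each iteration models one pass of Watch -> Detect drift -> Reconcile.
	for i := 0; i < 2; i++ {
		var action string
		actual, action = reconcile(desired, actual)
		fmt.Printf("pass %d: %s (actual=%d)\n", i+1, action, actual.Replicas)
	}
}
```

The second pass finds no drift and does nothing, which is the property that lets a controller safely re-run the loop on every event or timer tick.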

---

## Custom Resource Definitions (CRDs)

CRDs extend the Kubernetes API with your own resource types.

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.mycompany.io
spec:
  group: mycompany.io
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              engine:
                type: string
                enum: [postgres, mysql]
              replicas:
                type: integer
                minimum: 1
              storageGB:
                type: integer
          status:
            type: object
            properties:
              phase:
                type: string
              readyReplicas:
                type: integer
    subresources:
      status: {}        # enables /status subresource
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames: [db]
```

Once the CRD is installed, you use it like any built-in resource:

```bash
kubectl get databases
kubectl describe database my-postgres
```

### Custom Resource (CR) instance

```yaml
apiVersion: mycompany.io/v1
kind: Database
metadata:
  name: my-postgres
  namespace: production
spec:
  engine: postgres
  replicas: 3
  storageGB: 100
```
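
To make the schema's effect concrete, here is a rough Go sketch of the checks the API server applies to `spec` for this CRD. The `DatabaseSpec` type and `validate` function are hypothetical; the real validation is done by the API server from the OpenAPI schema, not by code you write. One simplification: a missing `replicas` shows up here as `0` and fails, whereas the API server skips validation of absent optional fields.

```go
package main

import (
	"errors"
	"fmt"
)

// DatabaseSpec mirrors the spec fields declared in the CRD schema.
type DatabaseSpec struct {
	Engine    string
	Replicas  int
	StorageGB int
}

// validate applies the constraints the schema declares:
// engine must be one of the enum values and replicas must be >= 1.
func validate(s DatabaseSpec) error {
	if s.Engine != "postgres" && s.Engine != "mysql" {
		return fmt.Errorf("spec.engine: unsupported value %q", s.Engine)
	}
	if s.Replicas < 1 {
		return errors.New("spec.replicas: must be >= 1")
	}
	return nil
}

func main() {
	// The CR instance from the manifest passes.
	fmt.Println(validate(DatabaseSpec{Engine: "postgres", Replicas: 3, StorageGB: 100}))
	// These two would be rejected at admission time.
	fmt.Println(validate(DatabaseSpec{Engine: "oracle", Replicas: 3}))
	fmt.Println(validate(DatabaseSpec{Engine: "mysql", Replicas: 0}))
}
```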

---

## Building an Operator

### Option 1: kubebuilder (recommended, Go)

```bash
# Bootstrap
kubebuilder init --domain mycompany.io --repo github.com/mycompany/db-operator
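kubebuilder create api --group mycompany --version v1 --kind Database
# Note: the API group becomes <group>.<domain>, here mycompany.mycompany.io.
# Choose --group/--domain so the result matches the group you want in the CRD.

# Generates:
# api/v1/database_types.go             CRD Go types
# controllers/database_controller.go   reconcile loop (internal/controller/ in the go/v4 layout)
# config/crd/                          CRD manifests
# config/rbac/                         RBAC for the operator ServiceAccount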
```

### Option 2: Operator SDK (supports Go, Ansible, Helm)

```bash
operator-sdk init --domain mycompany.io --repo github.com/mycompany/db-operator
operator-sdk create api --group mycompany --version v1 --kind Database --resource --controller
```

### The Reconcile loop (Go)

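```go
import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    mycompanyv1 "github.com/mycompany/db-operator/api/v1"
)

func (r *DatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // 1. Fetch the CR. NotFound means it was deleted, so there is nothing to do.
    db := &mycompanyv1.Database{}
    if err := r.Get(ctx, req.NamespacedName, db); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Compute desired state (buildStatefulSet is a helper you write)
    //    and make the CR the owner so the child is garbage-collected with it.
    desired := buildStatefulSet(db)
    if err := ctrl.SetControllerReference(db, desired, r.Scheme); err != nil {
        return ctrl.Result{}, err
    }

    // 3. Fetch actual state
    actual := &appsv1.StatefulSet{}
    err := r.Get(ctx, types.NamespacedName{Name: db.Name, Namespace: db.Namespace}, actual)

    switch {
    case apierrors.IsNotFound(err):
        // 4a. Create if missing
        logger.Info("creating StatefulSet", "name", desired.Name)
        if err := r.Create(ctx, desired); err != nil {
            return ctrl.Result{}, err
        }
    case err != nil:
        // 4b. Any other read error: return it so the request is retried with backoff
        return ctrl.Result{}, err
    default:
        // 4c. Overwrite the spec with the desired one
        actual.Spec = desired.Spec
        if err := r.Update(ctx, actual); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 5. Update status (simplified; a real operator derives the phase
    //    from the StatefulSet's observed status)
    db.Status.Phase = "Running"
    if err := r.Status().Update(ctx, db); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```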

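Key points:
- **Idempotent**: reconcile can be called any number of times and must converge to the same result
- **Return values**: `ctrl.Result{}, nil` waits for the next watch event; `ctrl.Result{RequeueAfter: ...}` re-runs after a delay; a non-nil error requeues with backoff
- **IgnoreNotFound**: the CR is already gone, so return nil. Cleanup of external resources needs a finalizer; in-cluster children are removed via owner references
- **ctrl.SetControllerReference**: ties child resources to the parent (owner references → GC)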

### Owner References (cascading delete)

```go
if err := ctrl.SetControllerReference(db, statefulSet, r.Scheme); err != nil {
    return ctrl.Result{}, err
}
// When the Database CR is deleted, the StatefulSet is garbage-collected automatically
```

---

## Well-Known Operators

| Operator | What it manages |
|---|---|
| **cert-manager** | TLS certs via `Certificate` CR; integrates Let's Encrypt, Vault |
| **prometheus-operator** | `ServiceMonitor`, `PrometheusRule` CRs — no manual scrape config editing |
| **postgres-operator (Zalando)** | HA Postgres clusters with failover, backups, users |
| **strimzi** | Apache Kafka clusters on k8s |
| **ArgoCD** | GitOps — `Application` CRs sync Git repos to cluster state |
| **Flux** | GitOps — `HelmRelease`, `Kustomization` CRs |
| **Velero** | Cluster backup/restore |
| **keda** | Event-driven autoscaling — scale on queue depth, cron, custom metrics |
| **crossplane** | Infrastructure as CRs — provision cloud resources (RDS, S3) from k8s |

---

## Horizontal Pod Autoscaler (HPA)

Scales the replica count of a workload (a Deployment, StatefulSet, or anything else with a `scale` subresource) based on CPU, memory, or custom metrics.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 200Mi
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 30
```

```bash
kubectl get hpa
kubectl describe hpa my-app-hpa
```

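**Requires:** metrics-server installed in the cluster for CPU/memory metrics. Custom and external metrics need a metrics adapter (for example prometheus-adapter or KEDA).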
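
The scaling decision follows the documented HPA formula, `desired = ceil(current * currentMetric / targetMetric)`, skipped when the ratio is within the default 10% tolerance. Below is a small Go sketch of that arithmetic for the CPU target above; `desiredReplicas` is a hypothetical helper, and it leaves out stabilization windows, pod readiness handling, and multi-metric selection.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the HPA formula for a single utilization metric
// and clamps the result to [minR, maxR]. The 0.1 tolerance matches the
// controller's default.
func desiredReplicas(current int, currentUtil, targetUtil float64, minR, maxR int) int {
	ratio := currentUtil / targetUtil
	if math.Abs(ratio-1.0) <= 0.1 {
		return current // close enough to target: no change
	}
	want := int(math.Ceil(float64(current) * ratio))
	if want < minR {
		want = minR
	}
	if want > maxR {
		want = maxR
	}
	return want
}

func main() {
	// 4 replicas at 90% CPU against a 70% target, bounds 2..20.
	fmt.Println(desiredReplicas(4, 90, 70, 2, 20))
	// 73% is within tolerance of 70%, so the replica count stays at 4.
	fmt.Println(desiredReplicas(4, 73, 70, 2, 20))
}
```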

### KEDA (Kubernetes Event-Driven Autoscaling)

Extends HPA to scale on anything — queue depth, Kafka lag, cron, custom metrics, even down to zero.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-scaler
spec:
  scaleTargetRef:
    name: my-app
  minReplicaCount: 0      # scale to zero!
  maxReplicaCount: 50
  triggers:
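  - type: rabbitmq
    metadata:
      queueName: jobs
      mode: QueueLength     # trigger on queue depth (older releases used `queueLength`)
      value: "5"            # target 5 messages per replica
      # connection details (hostFromEnv or a TriggerAuthentication) omitted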
  - type: cron
    metadata:
      timezone: Europe/London
      start: "0 8 * * 1-5"   # scale up Mon-Fri 8am
      end: "0 18 * * 1-5"
      desiredReplicas: "5"
```
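
As a rough illustration of the queue trigger's arithmetic, the sketch below maps queue depth to a replica count using a target of 5 messages per replica and the 0..50 bounds from the manifest. `replicasForQueue` is a hypothetical helper; it ignores KEDA's activation thresholds, cooldown period, and the cron trigger, so treat it as the idea rather than KEDA's actual algorithm.

```go
package main

import "fmt"

// replicasForQueue returns ceil(messages / perReplica), clamped to [minR, maxR].
func replicasForQueue(messages, perReplica, minR, maxR int) int {
	want := (messages + perReplica - 1) / perReplica // integer ceiling
	if want < minR {
		want = minR
	}
	if want > maxR {
		want = maxR
	}
	return want
}

func main() {
	for _, m := range []int{0, 12, 1000} {
		fmt.Printf("%d messages -> %d replicas\n", m, replicasForQueue(m, 5, 0, 50))
	}
}
```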

---

## Vertical Pod Autoscaler (VPA)

Recommends (or auto-applies) right-sized CPU/memory requests.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"     # Off | Initial | Recreate | Auto
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
```

`Off`: recommendations only (read with `kubectl describe vpa`).  
`Auto`: evicts pods and recreates them with new requests. Risky for stateful workloads; pair it with a PDB.  
**VPA and HPA on the same metric = conflict.** If VPA manages CPU/memory requests, drive HPA from custom or external metrics rather than CPU/memory utilization.

---

## Pod Disruption Budgets (PDB)

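Limits how many pods of an application can be down at once during voluntary disruptions that go through the eviction API (node drain, cluster autoscaler scale-down). A Deployment's own rolling update is not limited by the PDB; that is governed by `maxUnavailable`/`maxSurge`.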

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2         # or maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

```bash
kubectl get pdb
kubectl drain <node> --ignore-daemonsets   # respects PDBs — blocks if it would violate
```
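
The blocking behaviour comes down to simple arithmetic on healthy pods. A minimal Go sketch for a `minAvailable` budget follows; `disruptionsAllowed` is a hypothetical helper that mirrors the idea behind the PDB's `status.disruptionsAllowed` field, not the controller's actual code.

```go
package main

import "fmt"

// disruptionsAllowed returns how many more pods may be evicted right now
// for a budget expressed as minAvailable.
func disruptionsAllowed(healthyPods, minAvailable int) int {
	if allowed := healthyPods - minAvailable; allowed > 0 {
		return allowed
	}
	return 0
}

func main() {
	// 3 healthy pods, minAvailable: 2 -> one eviction may proceed.
	fmt.Println(disruptionsAllowed(3, 2))
	// 2 healthy pods, minAvailable: 2 -> the drain waits for another healthy pod.
	fmt.Println(disruptionsAllowed(2, 2))
}
```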

---

## Priority Classes

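Sets scheduling order and preemption: when a high-priority pod cannot be scheduled, the scheduler may evict lower-priority pods to make room. Priority is also one factor in the kubelet's node-pressure eviction order.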

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
```

```yaml
# In pod spec
priorityClassName: high-priority
```

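Higher value = scheduled first and harder to preempt. The built-in `system-cluster-critical` and `system-node-critical` classes use `2000000000` and `2000001000`; user-defined classes must be at most `1000000000`.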

---

## Admission Webhooks

Intercept API requests before they persist. Two types:

| Type | Can modify? | Can reject? | Use for |
|---|---|---|---|
| **MutatingAdmissionWebhook** | Yes | Yes | Inject sidecars, set defaults |
| **ValidatingAdmissionWebhook** | No | Yes | Policy enforcement |

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
- name: inject.mycompany.io
  clientConfig:
    service:
      name: sidecar-injector-svc
      namespace: kube-system
      path: /mutate
    caBundle: <base64-ca-cert>
  rules:
  - operations: ["CREATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
  namespaceSelector:
    matchLabels:
      injection: enabled
  failurePolicy: Fail    # Fail | Ignore
  admissionReviewVersions: ["v1"]
  sideEffects: None
```

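The webhook receives an `AdmissionReview` JSON object and must return an `AdmissionReview` whose `response` echoes the request `uid` and sets `allowed: true/false`. For mutations, the response also carries `patchType: JSONPatch` and a base64-encoded JSON Patch in `patch`.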
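
A hedged Go sketch of building such a response with only the standard library is shown below. The struct types are a simplified, hand-written subset of the `admission.k8s.io/v1` shape (real webhooks normally use the types from `k8s.io/api/admission/v1`), and the function names, the `example-uid` value, and the label path are illustrative assumptions.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value any    `json:"value,omitempty"`
}

// admissionResponse is a simplified subset of the AdmissionReview v1 response.
type admissionResponse struct {
	UID       string `json:"uid"`
	Allowed   bool   `json:"allowed"`
	PatchType string `json:"patchType,omitempty"`
	Patch     string `json:"patch,omitempty"` // base64-encoded JSON Patch
}

type admissionReview struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Response   admissionResponse `json:"response"`
}

// buildMutatingResponse allows the request and attaches the patch.
// uid must be copied from the incoming request.
func buildMutatingResponse(uid string, ops []patchOp) ([]byte, error) {
	raw, err := json.Marshal(ops)
	if err != nil {
		return nil, err
	}
	return json.Marshal(admissionReview{
		APIVersion: "admission.k8s.io/v1",
		Kind:       "AdmissionReview",
		Response: admissionResponse{
			UID:       uid,
			Allowed:   true,
			PatchType: "JSONPatch",
			Patch:     base64.StdEncoding.EncodeToString(raw),
		},
	})
}

// decodePatch reverses the encoding, as the receiving side would.
func decodePatch(reviewJSON []byte) ([]patchOp, error) {
	var review admissionReview
	if err := json.Unmarshal(reviewJSON, &review); err != nil {
		return nil, err
	}
	raw, err := base64.StdEncoding.DecodeString(review.Response.Patch)
	if err != nil {
		return nil, err
	}
	var ops []patchOp
	err = json.Unmarshal(raw, &ops)
	return ops, err
}

func main() {
	// Assumes metadata.labels already exists on the pod; otherwise the
	// patch would need to add the labels map first.
	ops := []patchOp{{Op: "add", Path: "/metadata/labels/injected", Value: "true"}}
	out, err := buildMutatingResponse("example-uid", ops)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```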

**cert-manager** can auto-rotate the webhook TLS cert — recommended.

---

## Service Mesh (Istio / Linkerd)

A service mesh adds a sidecar proxy to every pod (envoy for Istio, linkerd-proxy for Linkerd). The control plane manages proxy config; you get:

| Feature | How |
|---|---|
| mTLS between pods | Automatic cert rotation per service identity |
| Traffic splitting | `VirtualService` weight routing (canary, A/B) |
| Retry / timeout / circuit breaker | Per-route policy, no code changes |
| Observability | Automatic metrics, traces, access logs per request |
| Rate limiting | `EnvoyFilter` (local limits) or an external Envoy rate limit service (global limits) |

### Istio — key CRDs

```yaml
# VirtualService — traffic routing
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts: [my-app]
  http:
  - match:
    - headers:
        x-canary: { exact: "true" }
    route:
    - destination: { host: my-app, subset: v2 }
  - route:
    - destination: { host: my-app, subset: v1 }
      weight: 90
    - destination: { host: my-app, subset: v2 }
      weight: 10
---
# DestinationRule — defines subsets
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
  - name: v1
    labels: { version: v1 }
  - name: v2
    labels: { version: v2 }
  trafficPolicy:
    connectionPool:
      tcp: { maxConnections: 100 }
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```
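
The routing rules in the `VirtualService` can be read as: header match first, then a weighted choice. The Go sketch below mimics that decision for illustration only; `route`, `weighted`, and `pickSubset` are hypothetical names, and the real proxy picks randomly per request rather than from a supplied `roll`.

```go
package main

import "fmt"

// route is a destination subset with its share of traffic.
type route struct {
	Subset string
	Weight int
}

// weighted mirrors the 90/10 split in the second http rule.
var weighted = []route{
	{Subset: "v1", Weight: 90},
	{Subset: "v2", Weight: 10},
}

// pickSubset returns the subset for a request. roll is a number in
// [0, 100) standing in for the proxy's random choice.
func pickSubset(headers map[string]string, roll int) string {
	// First rule: requests with x-canary: true always go to v2.
	if headers["x-canary"] == "true" {
		return "v2"
	}
	// Second rule: walk the weights until the roll falls inside a bucket.
	for _, r := range weighted {
		if roll < r.Weight {
			return r.Subset
		}
		roll -= r.Weight
	}
	return ""
}

func main() {
	counts := map[string]int{}
	for roll := 0; roll < 100; roll++ {
		counts[pickSubset(nil, roll)]++
	}
	fmt.Println(counts["v1"], counts["v2"]) // 90 10
	fmt.Println(pickSubset(map[string]string{"x-canary": "true"}, 0))
}
```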

---

## GitOps with ArgoCD

ArgoCD watches a Git repo and syncs it to the cluster. Drift is detected and can be auto-corrected.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/mycompany/my-app
    targetRevision: main
    path: deploy/k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual changes to cluster
    syncOptions:
    - CreateNamespace=true
```
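
Conceptually, a sync compares what Git declares with what is live and derives three sets of actions. The sketch below shows that comparison in plain Go; `syncPlan` and the resource names are hypothetical, and the string values stand in for rendered manifests. It is not how ArgoCD is implemented, only the shape of the decision that `prune` and `selfHeal` control.

```go
package main

import (
	"fmt"
	"sort"
)

// plan lists the actions needed to make the cluster match Git.
type plan struct {
	Create []string // in Git, missing from the cluster
	Update []string // present in both but different (what selfHeal reverts)
	Prune  []string // in the cluster, removed from Git
}

// syncPlan diffs desired (Git) against live (cluster) state.
// Map keys are resource names; values stand in for manifest content.
func syncPlan(git, live map[string]string, prune bool) plan {
	var p plan
	for name, want := range git {
		have, ok := live[name]
		switch {
		case !ok:
			p.Create = append(p.Create, name)
		case have != want:
			p.Update = append(p.Update, name)
		}
	}
	if prune {
		for name := range live {
			if _, ok := git[name]; !ok {
				p.Prune = append(p.Prune, name)
			}
		}
	}
	// Sort for stable output; map iteration order is random.
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Prune)
	return p
}

func main() {
	git := map[string]string{"deploy/my-app": "replicas=3", "svc/my-app": "port=80"}
	live := map[string]string{"deploy/my-app": "replicas=5", "cm/old": "x"}
	fmt.Printf("%+v\n", syncPlan(git, live, true))
}
```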

```bash
argocd app list
argocd app sync my-app
argocd app diff my-app
argocd app history my-app
```

---

## Multi-tenancy Patterns

| Pattern | Isolation level | Tool |
|---|---|---|
| Namespace per team | Soft — shared API server | NetworkPolicy + RBAC + ResourceQuota |
| vCluster | Medium — virtual cluster per tenant | vcluster (Loft Labs) |
| Separate clusters | Hard — full isolation | Cluster API, EKS, GKE |

### Hierarchical Namespaces (HNC)

```bash
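# Requires the HNC controller and the kubectl-hns plugin.
# Create a child namespace; RBAC objects propagate by default, other types (e.g. NetworkPolicy) must be enabled in the HNC config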
kubectl hns create staging --namespace production
```

---

## Cluster API (CAPI)

Manage cluster lifecycle (create, upgrade, delete) using Kubernetes CRs — clusters as code.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
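  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2   # depends on the installed AWS provider version
    kind: AWSCluster
    name: prod-aws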
```

Infrastructure providers: AWS, GCP, Azure, vSphere, OpenStack.

---

## Useful Advanced kubectl

```bash
# Force-replace (delete + create — breaks connections)
kubectl replace --force -f manifest.yaml

# Server-side apply (tracks field ownership)
kubectl apply --server-side -f manifest.yaml

# Strategic merge patch
kubectl patch deployment my-app -p '{"spec":{"replicas":5}}'

# JSON patch
kubectl patch pod my-pod --type='json' \
  -p='[{"op":"replace","path":"/spec/containers/0/image","value":"nginx:1.26"}]'

# Get with go-template
kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'

# Wait for condition
kubectl wait --for=condition=Ready pod -l app=my-app --timeout=120s
kubectl wait --for=condition=complete job/my-job --timeout=300s

# Debug with ephemeral container (beta in k8s 1.23, stable in 1.25)
kubectl debug -it my-pod --image=busybox --target=app

# Copy a running pod for debugging (adds a busybox container to the copy)
kubectl debug my-pod -it --copy-to=debug-pod --image=busybox --share-processes

# Check RBAC
kubectl auth can-i create deployments --as=system:serviceaccount:default:my-sa -n production
kubectl auth whoami    # kubectl 1.27+; the cluster must serve the SelfSubjectReview API
```

---

## Security Hardening

### Pod Security Standards (replaces PSP in k8s 1.25+)

```yaml
# Enforce restricted standard on a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Three levels: `privileged` → `baseline` → `restricted`.

### Secure pod spec

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
    volumeMounts:
    - mountPath: /tmp
      name: tmp-dir     # writable scratch if readOnlyRootFilesystem
  volumes:
  - name: tmp-dir
    emptyDir: {}
```

### Image scanning

```bash
# Trivy (most common)
trivy image nginx:1.25
trivy k8s --report summary cluster    # scan whole cluster (newer Trivy releases drop the `cluster` argument; check `trivy k8s --help`)
```

---

## etcd — What's Under the Hood

All cluster state lives in etcd (distributed key-value store). Nodes, pods, secrets, CRDs — everything.

```bash
# Direct etcd read (from control plane node)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  get /registry/pods/default --prefix --keys-only

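# Backup (pass the same --endpoints/--cacert/--cert/--key flags as above)
ETCDCTL_API=3 etcdctl snapshot save backup.db

# Restore into a fresh data dir (etcd 3.5+ prefers `etcdutl snapshot restore`)
etcdctl snapshot restore backup.db --data-dir /var/lib/etcd-restore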
```

Secrets are stored in etcd unencrypted by default; enable encryption at rest (`EncryptionConfiguration`) unless an external secrets manager keeps them out of etcd entirely.

---

## Summary — Complexity Ladder

```
Pods + Deployments + Services          ← baseline cheat sheet
    ↓
CRDs + Operators                       ← extend the API; encode operational knowledge
    ↓
HPA / VPA / KEDA + PDB                ← autoscaling + resilience
    ↓
Admission Webhooks                     ← policy enforcement + defaults injection
    ↓
Service Mesh (Istio/Linkerd)           ← L7 observability + mTLS + traffic control
    ↓
GitOps (ArgoCD/Flux)                   ← cluster state as Git truth
    ↓
Multi-cluster / Cluster API            ← fleet management
```