Kubernetes in Production: Hard-Won Lessons from Enterprise Deployments
Practical insights from deploying and operating Kubernetes clusters in enterprise environments. Covering security, networking, observability, and disaster recovery.
Kubernetes is powerful but complex. After deploying and operating clusters at Vattenfall, Vitrifi, and other enterprises, I've accumulated hard-won lessons about what it takes to run Kubernetes reliably in production.
Security First
Security in Kubernetes is multi-layered. A breach at any layer can compromise your entire infrastructure.
Pod Security Standards
The old Pod Security Policies (PSPs) are deprecated. Pod Security Standards (PSS) are the replacement:
- Privileged: Unrestricted policy for trusted workloads
- Baseline: Minimally restrictive, prevents known privilege escalations
- Restricted: Heavily restricted, follows security best practices
Apply at namespace level:
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
Network Policies Are Not Optional
By default, all pods can communicate with all other pods. This is unacceptable in production.
Implement network policies that:
- Default deny all ingress and egress
- Explicitly allow only required communication paths
- Separate concerns by namespace (frontend, backend, databases)
- Log denied connections for security monitoring
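As a concrete starting point, the default-deny baseline is a NetworkPolicy with an empty pod selector and no allow rules; additional policies then whitelist only the required paths. A minimal sketch (the namespace name is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # no ingress or egress rules are listed, so all traffic is denied until
  # further policies explicitly allow specific paths

Once egress is denied by default, remember to add an explicit allow for DNS (CoreDNS/kube-dns), or in-cluster service discovery will break.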
RBAC Requires Careful Planning
Role-Based Access Control determines who can do what in your cluster:
- Principle of least privilege: Grant only the permissions needed
- Service accounts per workload: Don't share service accounts across deployments
- Audit regularly: Review who has cluster-admin and why
- Use namespaced roles where possible: ClusterRoles should be rare
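As a minimal sketch of a namespaced, least-privilege grant (names and the resource list are illustrative), a Role bound to a dedicated service account looks like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]    # read-only; no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployment-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ci-pipeline                  # dedicated service account for this workload
    namespace: production
roleRef:
  kind: Role
  name: deployment-reader
  apiGroup: rbac.authorization.k8s.io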
Secrets Management Needs a Strategy
Kubernetes Secrets are only base64-encoded, not encrypted; unless you enable encryption at rest, they sit in etcd readable by anyone with etcd or sufficient API access. This is insufficient for sensitive data.
Options for proper secrets management:
- External Secrets Operator: Sync secrets from HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault
- Sealed Secrets: Encrypt secrets that can only be decrypted by the cluster
- SOPS: Encrypt secret files in Git with automatic decryption during deployment
We used External Secrets Operator with Azure Key Vault at Lloyds Banking Group, enabling centralized secret rotation and audit trails.
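As an illustration of that pattern, an ExternalSecret resource tells the operator which vault entry to sync into a native Secret. The sketch below assumes the External Secrets Operator is installed and a ClusterSecretStore named azure-key-vault has already been configured; all names are illustrative:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db-credentials
  namespace: production
spec:
  refreshInterval: 1h                # re-sync so rotations in Key Vault propagate
  secretStoreRef:
    name: azure-key-vault            # assumed pre-configured ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: payments-db-credentials    # Kubernetes Secret the operator creates and keeps in sync
  data:
    - secretKey: password            # key inside the generated Secret
      remoteRef:
        key: payments-db-password    # secret name in Azure Key Vault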
Observability Stack
Without observability, you're operating blind. Kubernetes generates massive amounts of data—the challenge is making it useful.
Metrics with Prometheus
Prometheus is the de facto standard for Kubernetes metrics:
- Cluster metrics: Node CPU, memory, disk, network
- Kubernetes metrics: Pod status, replica counts, resource usage
- Application metrics: Custom metrics exposed via /metrics endpoints
Key practices:
- Use ServiceMonitor resources for automatic scraping
- Implement alerting rules for critical conditions
- Retain metrics long enough for capacity planning
- Consider long-term storage with Thanos or Cortex
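For example, with the Prometheus Operator a ServiceMonitor declares what to scrape and a PrometheusRule carries alerting rules. The sketch below assumes the operator is installed and that the application's Service exposes a port named http-metrics; names and thresholds are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-api
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames: ["production"]
  selector:
    matchLabels:
      app: checkout-api              # must match labels on the target Service
  endpoints:
    - port: http-metrics             # named port on the Service
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-api-alerts
  namespace: monitoring
spec:
  groups:
    - name: checkout-api
      rules:
        - alert: CheckoutHighErrorRate
          expr: sum(rate(http_requests_total{job="checkout-api", code=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: critical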
Logging with Loki or Elasticsearch
Centralized logging is essential for debugging:
- Loki: Prometheus-native log aggregation, cost-effective, label-based queries
- Elasticsearch: Full-text search, more powerful queries, higher operational overhead
Regardless of choice:
- Include structured metadata (namespace, pod, container)
- Implement log retention policies
- Create dashboards for common debugging scenarios
Distributed Tracing
Traces show how requests flow through your services:
- Jaeger: Popular, Kubernetes-native, good Prometheus integration
- OpenTelemetry: Vendor-neutral standard, supports metrics, logs, and traces
Tracing reveals:
- Which services are slow
- Where errors originate
- How load distributes across services
- Dependency relationships between services
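A common pattern is to run the OpenTelemetry Collector as the single ingest point and forward traces to Jaeger over OTLP. A minimal collector pipeline sketch (the Jaeger endpoint and namespace are assumptions):

# OpenTelemetry Collector config: receive OTLP from instrumented services,
# batch, and forward to Jaeger's OTLP gRPC endpoint.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc.cluster.local:4317   # assumed in-cluster Jaeger OTLP port
    tls:
      insecure: true               # acceptable inside the cluster; use TLS/mTLS across trust boundaries
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]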
Unified Observability
Connect your observability tools:
- Link from alerts to relevant dashboards
- Correlate logs with traces using trace IDs
- Enable drill-down from high-level metrics to specific pods
Resource Management
Kubernetes resource management is crucial for stability and cost optimization.
Requests and Limits
Every container should specify resource requests and limits:
- Requests: Guaranteed resources; the scheduler uses these for placement decisions
- Limits: Maximum resources; exceeding limits causes throttling (CPU) or OOMKill (memory)
Common mistakes:
- Not setting any limits (resource contention)
- Setting limits too high (wasted capacity)
- Setting requests equal to limits (no burst capacity)
Start with requests at expected usage and limits at 2x requests, then tune based on actual behavior.
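In a container spec, that starting point looks like the following sketch (image and values are illustrative; requests below limits gives the Burstable QoS class described later):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: api
          image: registry.example.com/checkout-api:1.4.2   # illustrative image
          resources:
            requests:
              cpu: 250m          # expected steady-state usage
              memory: 256Mi
            limits:
              cpu: 500m          # 2x requests leaves burst headroom
              memory: 512Mi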
Vertical Pod Autoscaler
VPA automatically adjusts resource requests based on actual usage:
Benefits:
- Right-sized resources without manual tuning
- Reduced waste from over-provisioning
- Better bin-packing efficiency
Limitations:
- Requires pod restarts to apply changes
- May conflict with Horizontal Pod Autoscaler
- Needs enough historical data to make good recommendations
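A common way to start is recommendation-only mode, applying the suggestions manually, which sidesteps the restart limitation. A sketch, assuming the VPA components are installed (the target name is illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Off"            # recommend only; "Auto" lets VPA evict and resize pods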
Horizontal Pod Autoscaler
HPA scales the number of pods based on metrics:
- CPU-based scaling: Simple but often insufficient
- Custom metrics: Scale on queue depth, request latency, or business metrics
- External metrics: Scale based on external systems (message queue length, database connections)
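A sketch of an HPA that combines CPU utilization with a custom per-pod metric (the custom metric assumes a metrics adapter such as the Prometheus Adapter is serving it; names and targets are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70           # scale out above 70% average CPU
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "100"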
Quality of Service Classes
Kubernetes assigns QoS classes based on resource specifications:
- Guaranteed: Requests equal limits for all containers; highest priority, last to be evicted
- Burstable: Requests less than limits; medium priority
- BestEffort: No requests or limits; lowest priority, first to be evicted
Critical workloads should be Guaranteed; development workloads can be Burstable or BestEffort.
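For a Guaranteed workload, every container sets requests equal to limits; a minimal sketch (names and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payments-worker
  namespace: production
spec:
  containers:
    - name: worker
      image: registry.example.com/payments-worker:2.1.0   # illustrative image
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"             # requests == limits for every resource gives Guaranteed QoS
          memory: 1Gi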
Networking Deep Dive
Kubernetes networking is notoriously complex.
CNI Selection
The Container Network Interface (CNI) determines how pods communicate:
- Calico: Feature-rich, strong network policy support, BGP routing
- Cilium: eBPF-based, excellent performance, advanced observability
- Flannel: Simple overlay, limited features, easy to understand
For enterprise deployments, Calico or Cilium provide the security features you'll need.
Service Mesh Considerations
Service meshes (Istio, Linkerd) add capabilities but also complexity:
When to use:
- You need mutual TLS between all services
- Complex traffic management (canary releases, traffic splitting)
- Advanced observability requirements
When to avoid:
- Simple architectures with few services
- Teams without dedicated platform engineers
- When the overhead isn't justified by benefits
We successfully operated at Vitrifi without a service mesh by using NATS for service communication and implementing application-level resilience patterns.
Disaster Recovery
Hope is not a strategy. Plan for failure and test your recovery procedures.
etcd Backups
etcd stores all cluster state. Losing etcd means losing your cluster.
Backup strategies:
- Automated snapshots with etcdctl
- Velero for cluster-wide backup including persistent volumes
- Off-cluster storage for backup resilience
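With Velero, scheduled backups are themselves Kubernetes resources. The sketch below assumes Velero is installed with an off-cluster object storage location already configured; the schedule and retention values are illustrative:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  template:
    includedNamespaces:
      - "*"                      # back up every namespace
    snapshotVolumes: true        # include persistent volume snapshots
    ttl: 720h0m0s                # keep backups for 30 days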
Test restores regularly. Backups you haven't tested are backups you can't trust.
Multi-Cluster Strategies
For critical workloads, consider multi-cluster architectures:
- Active-Passive: Secondary cluster ready to take over on failure
- Active-Active: Workloads spread across clusters with load balancing
- Federation: Unified control plane across clusters (complex, maturing)
Tested Runbooks
Document procedures for common failure scenarios:
- Node failure and replacement
- etcd recovery from backup
- Network partition handling
- Storage failure recovery
Run game days to practice these procedures. The first time you execute a disaster recovery should not be during an actual disaster.
Key Takeaways
- Security is foundational: Retrofit security is expensive; build it in from the start
- Observability enables operations: You can't fix what you can't see
- Resource management affects stability: Uncontrolled resource usage causes cascading failures
- Networking complexity is real: Invest in understanding your CNI and network policies
- Disaster recovery requires practice: Untested recovery plans will fail when you need them
- Start simpler than you think: Not every organization needs a service mesh or multi-cluster federation