Kubernetes for GPU workloads — what nobody tells you
Why default K8s scheduling will starve your GPU jobs. How to use Kueue + the NVIDIA device plugin for proper gang scheduling and fractional GPU sharing.
The default scheduler is wrong for GPU work
Stock Kubernetes schedules pods one at a time, greedily, packing them onto whichever node has resources. That's exactly wrong for distributed training where 8 GPU pods need to land on the same fabric, at the same time, or none of them should land at all. The default scheduler will happily place 6 of your 8 workers, leaving the job stuck waiting forever while it holds onto valuable GPUs.
This is the first lesson everyone learns the hard way.
Step 1 · Install the NVIDIA device plugin
You need the NVIDIA device plugin (installed standalone as below, or via the GPU Operator, which bundles it along with drivers) so K8s sees GPUs as schedulable resources. Without it, `nvidia.com/gpu` doesn't exist as a resource and your pods sit in Pending forever.
```bash
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --version=0.15.0 \
  --namespace nvidia-device-plugin \
  --create-namespace
```
Then your pod spec can request:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
```
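A quick way to verify the plugin end to end is a throwaway pod that claims one GPU and prints it; a minimal sketch, with the pod name and image tag being illustrative choices:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]      # should list exactly the one GPU granted to this pod
    resources:
      limits:
        nvidia.com/gpu: 1        # extended resources are requested via limits
```

If this pod stays Pending, the plugin isn't advertising GPUs on any node.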
Step 2 · Gang scheduling with Kueue
Kueue is the Kubernetes-native job queueing system we wish came in the box. It sits in front of the scheduler and supports:
- Gang scheduling: all-or-nothing for jobs that need N pods.
- Queues with quotas: reserve capacity per team or per priority class.
- Preemption: a high-priority training job can evict a low-priority experiment.
- Topology-aware: place pods on nodes that share an InfiniBand switch for tight collective communication.
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-train
spec:
  namespaceSelector: {}     # admit Workloads from any namespace
  resourceGroups:
  - coveredResources: [nvidia.com/gpu]
    flavors:
    - name: a100            # a matching ResourceFlavor must exist
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 32
```
A 32-GPU job submitted as a Kueue Workload will queue if 32 GPUs aren't free, then schedule atomically when they are. No partial scheduling, no held resources.
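As a sketch of the submission side (the namespace, queue name, and image are illustrative): the ClusterQueue needs a matching ResourceFlavor, a namespaced LocalQueue feeds into it, and a plain batch Job opts in via a label and starts suspended so Kueue controls admission.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100                 # referenced by the ClusterQueue's flavors list
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-ml
  namespace: research
spec:
  clusterQueue: gpu-train
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-32gpu
  namespace: research
  labels:
    kueue.x-k8s.io/queue-name: team-ml   # routes this Job through Kueue
spec:
  suspend: true              # Kueue unsuspends it only when the full quota fits
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: ghcr.io/example/train:latest
        resources:
          limits:
            nvidia.com/gpu: 8    # 4 pods x 8 GPUs = the full 32-GPU gang
```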
Step 3 · Fractional GPU sharing
For inference workloads, an A100 80GB is overkill for a 7B model that uses 14 GB. Three options:
- Time-slicing: NVIDIA device plugin can advertise multiple "virtual" GPUs per physical card. Pods context-switch on the GPU. Cheap, but no isolation.
- MPS (Multi-Process Service): spatial sharing with weak isolation. Better for cooperative workloads.
- MIG (Multi-Instance GPU): hardware-level partitioning on A100/H100. Each MIG instance is a real, isolated GPU. Best for multi-tenant inference.
We default to MIG for production inference (1g.10gb / 2g.20gb / 3g.40gb / 7g.80gb partitions), and time-slicing for dev environments where isolation isn't critical.
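For the dev-environment side, a minimal sketch of time-slicing config, assuming the device plugin's Helm chart is pointed at this ConfigMap (the `any` key is the plugin's catch-all config name; exact wiring depends on how you deploy the chart):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: nvidia-device-plugin
data:
  any: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4        # each physical GPU is advertised as 4 schedulable GPUs
```

With MIG in the mixed strategy, pods instead request a specific slice, e.g. `nvidia.com/mig-1g.10gb: 1`.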
Step 4 · Topology-aware scheduling
NCCL all-reduce performance depends on which physical interconnect carries the traffic. NVLink > PCIe > InfiniBand > Ethernet. If a worker pod lands on a node that only has Ethernet to its peers, your 32-GPU job can run at one-tenth the speed of the same job sitting entirely on NVLinked nodes.
The fix: pod affinity over a topology label. We label every node with glixy.dev/fabric: ib-rack-01 and require workers to share a fabric domain.
```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels: { app: train }
      topologyKey: glixy.dev/fabric
```
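Nodes get the fabric label out of band, e.g. `kubectl label node gpu-a01 glixy.dev/fabric=ib-rack-01` (node name illustrative). One consequence of required pod affinity: the first `app: train` worker to schedule anchors the whole group to its fabric domain, and any peer that can't fit there stays Pending, which is exactly why this belongs behind Kueue's all-or-nothing admission.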
Step 5 · GPU monitoring
Stock Prometheus metrics tell you nothing about GPU health. You need dcgm-exporter from NVIDIA, which exposes:
- GPU utilization (compute and memory) — the metric you actually care about
- VRAM usage per process
- Temperature, power draw, ECC errors
- NVLink throughput per direction
- PCIe replay events (a sign of cabling problems)
Build a dashboard that shows util-vs-allocation across your fleet. If you're allocating 80 GPUs but only utilizing 40%, your jobs are over-provisioned and should run on fewer GPUs at higher per-GPU utilization.
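A minimal sketch of that util-vs-allocation comparison as Prometheus recording rules; DCGM_FI_DEV_GPU_UTIL is the standard dcgm-exporter metric, while the allocation side assumes kube-state-metrics is installed (it sanitizes the resource name to nvidia_com_gpu):

```yaml
groups:
- name: gpu-fleet
  rules:
  - record: fleet:gpu_util_pct:avg
    expr: avg(DCGM_FI_DEV_GPU_UTIL)    # mean compute utilization across all GPUs
  - record: fleet:gpu_allocated:total
    expr: sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu"})
```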
Step 6 · Spot/preemptible nodes for training
Training jobs that checkpoint can run on cheaper preemptible nodes. The trick: configure auto-restart with checkpoint restoration. We use Kueue's preemption + a sidecar that uploads checkpoints to NVMe-oF every N steps. Job gets evicted, comes back on different nodes, resumes from last checkpoint, total downtime under 90 seconds.
Real cost saving: 30-50% on the training line.
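A sketch of the pod-side plumbing under those assumptions (the taint key shown is GKE's spot taint, other clouds differ; the checkpoint PVC name and resume flag are illustrative):

```yaml
spec:
  template:
    spec:
      tolerations:
      - key: cloud.google.com/gke-spot       # allow scheduling onto spot nodes
        operator: Equal
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: ghcr.io/example/train:latest
        args: ["--resume-from", "/ckpt/latest"]   # picks up the last uploaded checkpoint
        volumeMounts:
        - name: ckpt
          mountPath: /ckpt
      volumes:
      - name: ckpt
        persistentVolumeClaim:
          claimName: train-ckpt              # backed by the NVMe-oF checkpoint target
```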
What we learned the hard way
- Don't run pods as root in GPU containers. Use a dedicated UID + seccomp profile. NVIDIA driver vulnerabilities exist.
- Pre-pull large images. A 30 GB CUDA-PyTorch image pulled cold delays your job by 5+ minutes per node. Use a registry mirror inside your cluster.
- Set --shm-size / mount a bigger /dev/shm. PyTorch DataLoaders need shared memory and the default is too small; in Kubernetes, use a volumes entry to mount tmpfs at /dev/shm with 8-16 GB (see the sketch after this list).
- Watch for thermal throttling. A100s in poorly-cooled racks throttle to 70% performance under sustained load. Always check power and temp dashboards before assuming your code is slow.
- Don't rely on nvidia-smi alone. It shows current state, not history. Trend monitoring is what catches degradation.
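For the shared-memory fix, a minimal sketch of the /dev/shm mount (16Gi matches the 8-16 GB range above; tune per workload):

```yaml
spec:
  containers:
  - name: trainer
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm       # replaces the runtime's tiny default shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory            # tmpfs, backed by node RAM
      sizeLimit: 16Gi
```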