DEVOPS

Kubernetes for GPU workloads — what nobody tells you

Why default K8s scheduling will starve your GPU jobs. How to use Kueue + the NVIDIA device plugin for proper gang scheduling and fractional GPU sharing.

Ravi Mehta · Principal DevOps · 10 min read · 12 Mar 2026

The default scheduler is wrong for GPU work

Stock Kubernetes schedules pods one at a time, greedily, packing them onto whichever node has resources. That's exactly wrong for distributed training where 8 GPU pods need to land on the same fabric, at the same time, or none of them should land at all. The default scheduler will happily place 6 of your 8 workers, leaving the job stuck waiting forever while it holds onto valuable GPUs.

This is the first lesson everyone learns the hard way.

Step 1 · Install the NVIDIA device plugin

You need the NVIDIA device plugin (standalone, or via the GPU Operator, which also manages drivers and the container toolkit) so K8s sees GPUs as schedulable resources. Without it, the nvidia.com/gpu resource doesn't exist and your pods sit in Pending forever.

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --version=0.15.0 \
  --namespace nvidia-device-plugin \
  --create-namespace

Then your pod spec can request:

resources:
  limits:
    nvidia.com/gpu: 1
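
For a quick smoke test, a minimal standalone pod that claims one GPU might look like this (the pod name and CUDA image tag are illustrative, not prescriptive):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA-capable image works
    command: ["nvidia-smi"]                       # prints the GPU the plugin handed the pod
    resources:
      limits:
        nvidia.com/gpu: 1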

Step 2 · Gang scheduling with Kueue

Kueue is the Kubernetes-native batch queueing layer we wish came in the box: quotas, resource flavors, priorities, preemption, and, crucially, all-or-nothing admission for multi-pod jobs. A minimal ClusterQueue for an A100 pool:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata: { name: gpu-train }
spec:
  resourceGroups:
  - coveredResources: [nvidia.com/gpu]
    flavors:
    - name: a100
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 32

A 32-GPU job submitted to a queue backed by this ClusterQueue waits while 32 GPUs aren't free, then gets admitted atomically once they are. No partial scheduling, no held resources.
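
To wire this up, the ClusterQueue needs the a100 ResourceFlavor it references and a namespaced LocalQueue that workloads point at; a Job then opts in with a queue-name label. A minimal sketch, where the queue name, namespace, and image are hypothetical:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata: { name: a100 }              # must match the flavor named in the ClusterQueue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-train-queue               # placeholder
  namespace: team-ml                  # placeholder
spec:
  clusterQueue: gpu-train
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-32gpu
  namespace: team-ml
  labels:
    kueue.x-k8s.io/queue-name: gpu-train-queue   # Kueue manages this Job's admission
spec:
  suspend: true                       # stays suspended until the full quota is available
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: ghcr.io/example/trainer:latest    # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 8         # 4 pods x 8 GPUs = the 32-GPU job described above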

Step 3 · Fractional GPU sharing

For inference workloads, an A100 80GB is overkill for a 7B model that uses 14 GB. Three options:

  1. Time-slicing: NVIDIA device plugin can advertise multiple "virtual" GPUs per physical card. Pods context-switch on the GPU. Cheap, but no isolation.
  2. MPS (Multi-Process Service): spatial sharing with weak isolation. Better for cooperative workloads.
  3. MIG (Multi-Instance GPU): hardware-level partitioning on A100/H100. Each MIG instance is a real, isolated GPU. Best for multi-tenant inference.

We default to MIG for production inference (1g.10gb / 2g.20gb / 3g.40gb / 7g.80gb partitions), and time-slicing for dev environments where isolation isn't critical.
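
For option 1, the device plugin takes a sharing config that advertises multiple replicas per physical card; this is a sketch of the config file format (how you feed it to the Helm chart, typically via a ConfigMap, depends on your chart version):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4                     # each physical GPU is advertised 4 times

With MIG in the mixed strategy, slices instead show up under their own resource names (for example nvidia.com/mig-1g.10gb), so an inference pod requests a partition rather than a whole card.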

Step 4 · Topology-aware scheduling

NCCL all-reduce performance depends on which physical interconnect carries the traffic: NVLink beats PCIe within a node, and InfiniBand beats Ethernet between nodes. If a worker pod lands on a node that only reaches its peers over Ethernet, your 32-GPU job can run at a tenth of the speed of one that sits entirely on NVLink-connected, InfiniBand-attached nodes.

The fix: topology labels plus required pod affinity. We label every node with glixy.dev/fabric: ib-rack-01 and require workers of the same job to land in the same fabric domain.

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels: { app: train }
      topologyKey: glixy.dev/fabric
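
The fabric labels themselves are ordinary node labels applied per rack or interconnect island (the node name here is a placeholder):

kubectl label node gpu-node-07 glixy.dev/fabric=ib-rack-01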

Step 5 · GPU monitoring

Stock Prometheus metrics tell you nothing about GPU health. You need dcgm-exporter from NVIDIA, which exposes per-GPU utilization, framebuffer memory usage, temperature, power draw, and XID error counts as Prometheus metrics (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_GPU_TEMP, DCGM_FI_DEV_POWER_USAGE, DCGM_FI_DEV_XID_ERRORS, among others).
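
A minimal install sketch, assuming Helm and NVIDIA's published chart repo (check the dcgm-exporter docs for the current repo URL and chart values):

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring \
  --create-namespace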

Build a dashboard that shows utilization versus allocation across your fleet. If you're allocating 80 GPUs but they only average 40% utilization, you're paying for roughly twice the hardware the work needs: resize those jobs to fewer GPUs running at higher per-GPU utilization.

Step 6 · Spot/preemptible nodes for training

Training jobs that checkpoint can run on cheaper preemptible nodes. The trick is auto-restart with checkpoint restoration: we use Kueue's preemption plus a sidecar that uploads checkpoints to NVMe-oF every N steps. The job gets evicted, comes back on different nodes, resumes from the last checkpoint, and total downtime stays under 90 seconds.
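
The scheduling side is a node selector plus a toleration so only checkpoint-aware jobs land on the tainted spot pool; the pool label and taint key below are hypothetical (cloud providers ship their own equivalents, e.g. GKE's cloud.google.com/gke-spot):

spec:
  nodeSelector:
    glixy.dev/pool: spot              # hypothetical label on the preemptible node group
  tolerations:
  - key: glixy.dev/preemptible        # hypothetical taint keeping other workloads off
    operator: Exists
    effect: NoSchedule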

Real cost saving: 30-50% on the training line.

What we learned the hard way

  1. Don't run pods as root in GPU containers. Use a dedicated UID + seccomp profile. NVIDIA driver vulnerabilities exist.
  2. Pre-pull large images. A 30 GB CUDA-PyTorch image pulled cold delays your job by 5+ minutes per node. Use a registry mirror inside your cluster.
  3. Fix shared memory. PyTorch DataLoaders need /dev/shm, there is no --shm-size flag in a pod spec, and the runtime default (typically 64 MB) is far too small. Mount a memory-backed emptyDir at /dev/shm with 8-16 GB (see the sketch after this list).
  4. Watch for thermal throttling. A100s in poorly-cooled racks throttle to 70% performance under sustained load. Always check power and temp dashboards before assuming your code is slow.
  5. Don't rely on nvidia-smi alone. It shows current state, not history. Trend monitoring is what catches degradation.
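
For item 3, the usual fix is a memory-backed emptyDir mounted over /dev/shm; a minimal pod-spec fragment:

spec:
  containers:
  - name: trainer
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi                 # 8-16 GB covers most DataLoader setups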

📞 Want us to run K8s + GPUs for you? →

Related: DevOps service · GPU clusters