Project-based guide

Kubernetes for AI Workloads, explained the way a tutor teaches an engineer

You already know core Kubernetes concepts: Pods, Deployments, Services, and logs. This guide shows how the same Kubernetes foundation extends to AI workloads: GPU nodes, device plugins, model serving, storage, scaling, observability, and safe troubleshooting.

Student

I understand normal Kubernetes applications. But AI workloads sound different. Are they completely different from web apps?

Tutor

Not completely. The core Kubernetes ideas remain the same: workloads run inside containers, Pods are scheduled onto nodes, Services expose traffic, logs and events drive troubleshooting, and rollouts manage change. What changes is the workload profile. AI workloads often need GPUs, larger memory, model files, specialized runtimes, model-serving APIs, longer startup times, more expensive nodes, and tighter observability.

Simple mental model: Kubernetes does not automatically make AI reliable. Kubernetes gives you scheduling, isolation, rollout, networking and automation primitives. The engineer still designs resource requests, probes, storage, monitoring, access control, rollback strategy and cost boundaries.

End-to-end scenario: deploy a model-serving API on Kubernetes

We will use one project scenario throughout the page. Imagine your team wants to expose an internal text-classification or log-summary model as an API for DevOps tooling.

Prepare nodes: Identify whether the model needs CPU-only or GPU-backed nodes.
Install device plugin: If GPUs are used, expose GPU resources to Kubernetes through a vendor device plugin.
Package model server: Containerize the model-serving application.
Define resources: Set CPU, memory and GPU limits carefully.
Expose safely: Use an internal Service or Ingress depending on who should access it (a minimal Service sketch follows this list).
Observe it: Monitor latency, errors, GPU usage, memory, request volume and model-specific failures.
Troubleshoot: Use events, logs, metrics and rollout history before assuming the model is the problem.
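For the "expose safely" step, an internal ClusterIP Service is often enough. A minimal sketch, assuming the Deployment name, label, and port used later in this guide:

apiVersion: v1
kind: Service
metadata:
  name: ai-log-summary-api
  namespace: ai-platform
spec:
  type: ClusterIP        # internal only; no external exposure
  selector:
    app: ai-log-summary-api
  ports:
    - port: 8080
      targetPort: 8080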

Foundation theory: how GPUs appear inside Kubernetes

Student

In a normal Deployment I request CPU and memory. How does a Pod request a GPU?

Tutor

Kubernetes itself has a device plugin framework. A vendor-specific plugin runs on GPU nodes and advertises GPU resources to the kubelet. After that, Pods can request those GPU resources in their container limits. The scheduler then places the Pod on a node where that GPU resource is available.

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "8Gi"
    cpu: "4"
Important: GPU scheduling is not like ordinary CPU overcommit. GPUs are extended resources and cannot be overcommitted: you request them in limits (if you also set requests, the two values must be equal), and the Pod is scheduled only onto a node where that device resource is available. Do not treat GPU nodes like unlimited shared compute.
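Before debugging a Pod spec, confirm the node actually advertises the GPU resource. Both checks below assume the NVIDIA resource name used in this guide:

kubectl describe node <gpu-node-name> | grep -A8 Allocatable
# Allocatable should list nvidia.com/gpu next to cpu and memory
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"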
Concept | Normal app | AI workload
Compute | CPU and memory are usually enough | May require GPU, high memory or model acceleration
Startup | Usually fast | May be slow because model files load into memory/GPU
Storage | Config, uploads, DB connections | Model weights, tokenizer files, vector data, cache
Observability | HTTP latency, errors, CPU/memory | Plus GPU usage, queue time, token latency, model errors
Cost | Moderate nodes | GPU nodes can be expensive and scarce

Practical project: minimal AI model-serving Deployment

This is a simplified teaching example. In real environments, the image, model loading path and runtime depend on the serving framework you choose.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-log-summary-api
  namespace: ai-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ai-log-summary-api
  template:
    metadata:
      labels:
        app: ai-log-summary-api
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ai-log-summary-api:v1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 10

Why each part matters

GPU limit

The GPU limit tells Kubernetes that this Pod needs a GPU resource exposed by the device plugin. Without available GPU capacity, the Pod remains Pending.

Readiness probe

AI model servers can take time to load models. A readiness probe prevents traffic from reaching the Pod before the model is ready.

Memory limit

Model files and inference memory can be large. A low memory limit may cause OOMKilled failures during startup or heavy requests.

Single replica first

Start with one replica in a lab. After proving resource behavior, scaling and observability, increase replicas carefully.

Troubleshooting scenario 1: Pod is Pending

Student

My model-serving Pod is stuck in Pending. What should I check first?

Tutor

Do not start by changing YAML randomly. First read scheduler events. Pending usually means Kubernetes cannot find a node that satisfies the Pod's requirements: GPU not available, taint not tolerated, node selector mismatch, insufficient memory, or missing device plugin.

kubectl get pod -n ai-platform
kubectl describe pod ai-log-summary-api-xxxxx -n ai-platform
kubectl get nodes -o wide
kubectl describe node <gpu-node-name>
kubectl get events -n ai-platform --sort-by=.lastTimestamp

AI prompt for this scenario

You are assisting with Kubernetes AI workload troubleshooting. Analyze the following kubectl describe pod, node details and events. Return:
1. Why the Pod is likely Pending
2. Evidence from events
3. Which resource or scheduling rule is blocking placement
4. Validation commands
5. What not to change until confirmed
Do not suggest deleting workloads or removing taints without explaining the risk.
Engineer rule: AI can summarize the events and suggest hypotheses. The engineer must still validate node capacity, device plugin status, taints, tolerations and resource limits before changing anything.

Troubleshooting scenario 2: Pod starts but model API is not ready

This is common with AI workloads because the container may start, but the model may still be loading or may fail to initialize.

kubectl logs deployment/ai-log-summary-api -n ai-platform --tail=100
kubectl describe pod -n ai-platform -l app=ai-log-summary-api
kubectl get endpoints -n ai-platform ai-log-summary-api
kubectl port-forward svc/ai-log-summary-api 8080:8080 -n ai-platform
curl -v http://localhost:8080/ready

Possible cause: slow model load

Increase readiness initial delay or optimize model loading. Do not route traffic until the model responds to health checks.
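If load time varies, a startupProbe is usually a better fit than a very long initial delay: it gives the model a generous loading window, then hands off to the normal readiness check. A sketch using the /ready endpoint from the Deployment above:

startupProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # up to 30 x 10s = 5 minutes for the model to load
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 10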

Possible cause: missing model file

Check mounted paths, object storage access and container environment variables.
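A quick way to check from inside the running container; the /models path is an assumption, so adjust it to your image:

kubectl exec -n ai-platform deploy/ai-log-summary-api -- ls -lh /models
kubectl exec -n ai-platform deploy/ai-log-summary-api -- env | grep -i model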

Possible cause: insufficient memory

Look for OOMKilled, container restarts or model load failures in logs.
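One way to spot this across all replicas, using the app label from the Deployment above:

kubectl get pods -n ai-platform -l app=ai-log-summary-api \
  -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
# "OOMKilled" in the output means the container hit its memory limit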

Possible cause: GPU/runtime mismatch

Verify GPU driver, runtime and device plugin health before blaming the application.
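Checks for the plugin itself; the label and namespace below match the standard NVIDIA device plugin DaemonSet manifest and may differ in your cluster (for example, when deployed via the GPU Operator):

kubectl get daemonset -n kube-system | grep -i nvidia
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=50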

Best practices for Kubernetes AI workloads

Start CPU-only when possible

Not every AI workload needs GPU. Start with CPU for small models, batch jobs and learning labs. Move to GPU only when latency, throughput or model size requires it.
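In practice, going CPU-only just means leaving the GPU entry out of the resources block:

resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "4"
    memory: "8Gi"
    # no nvidia.com/gpu entry: the Pod can schedule onto any CPU node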

Use namespaces and quotas

GPU nodes are expensive. Use namespaces, ResourceQuota and RBAC to prevent accidental overuse.
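A sketch of a quota for the ai-platform namespace; the CPU and memory numbers are placeholders, and the GPU entry uses the extended-resource syntax ResourceQuota supports:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-platform-quota
  namespace: ai-platform
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    requests.nvidia.com/gpu: "4"   # at most 4 GPUs requested in this namespace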

Design health checks carefully

A container can be running while the model is not ready. Separate liveness and readiness checks.
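A sketch of that separation: liveness answers "is the process alive", readiness answers "is the model loaded and able to serve". The endpoint paths are assumptions; use whatever your serving framework exposes:

livenessProbe:
  httpGet:
    path: /healthz   # process-level check; should not depend on model state
    port: 8080
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready     # succeeds only once the model is loaded and can serve
    port: 8080
  periodSeconds: 10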

Monitor GPU and app signals

Do not monitor only Pod status. Track latency, errors, queue depth, GPU usage, memory and request patterns.

Keep model changes controlled

Treat model version changes like application releases. Use rollout history, version labels and rollback planning.
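With the Deployment from this guide, a model version bump then looks like any other rollout:

kubectl set image deployment/ai-log-summary-api \
  model-server=registry.example.com/ai-log-summary-api:v2 -n ai-platform
kubectl rollout status deployment/ai-log-summary-api -n ai-platform
kubectl rollout history deployment/ai-log-summary-api -n ai-platform
# roll back if the new model misbehaves
kubectl rollout undo deployment/ai-log-summary-api -n ai-platform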

Protect data

AI workloads may process logs, prompts, documents or internal data. Apply access control and avoid exposing model endpoints publicly without controls.
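One option, assuming your CNI enforces NetworkPolicy, is to restrict the endpoint to Pods in the same namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-log-summary-api-internal-only
  namespace: ai-platform
spec:
  podSelector:
    matchLabels:
      app: ai-log-summary-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # only Pods in the ai-platform namespace may connect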

Interview scenario answer

Interviewer

How would you deploy and troubleshoot an AI workload on Kubernetes?

Strong candidate answer

I would first understand whether the workload needs CPU or GPU, the model size, latency requirement and traffic pattern. For GPU workloads, I would ensure GPU nodes are prepared and the vendor device plugin exposes GPU resources to Kubernetes. I would define CPU, memory and GPU limits, readiness probes that wait for model load, and internal Services for controlled access. For troubleshooting, I would check Pod events, node capacity, device plugin status, logs, readiness endpoints, GPU metrics and rollout history. I would not blindly restart or scale the workload until I confirm whether the issue is scheduling, model loading, resource pressure, networking or application behavior.

Continue the AI in DevOps path

Use this Kubernetes knowledge together with local AI setup, AIOps concepts and scenario-based interview practice.
