I understand normal Kubernetes applications. But AI workloads sound different. Are they completely different from web apps?
Not completely. The core Kubernetes ideas remain the same: a workload runs inside containers, Pods are scheduled to nodes, Services expose traffic, logs and events support troubleshooting, and rollouts manage change. What changes is the workload profile. AI workloads often need GPUs, larger memory, model files, specialized runtimes, model-serving APIs, longer startup times, more expensive nodes, and tighter observability.
End-to-end scenario: deploy a model-serving API on Kubernetes
We will use one project scenario throughout the page. Imagine your team wants to expose an internal text-classification or log-summarization model as an API for DevOps tooling.
Foundation theory: how GPUs appear inside Kubernetes
In a normal Deployment I request CPU and memory. How does a Pod request a GPU?
Kubernetes itself does not have built-in GPU support; it provides a device plugin framework. A vendor-specific plugin (for example, the NVIDIA device plugin) runs on GPU nodes and advertises GPU resources to the kubelet as extended resources such as nvidia.com/gpu. Pods can then request those GPU resources in their container limits, and the scheduler places the Pod on a node where that GPU resource is available.
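For example, a Pod that needs one GPU declares it under resources.limits. A minimal sketch, assuming the NVIDIA device plugin is installed and exposes the nvidia.com/gpu resource (the image tag is a placeholder; pick one that matches your driver):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder CUDA base image
    command: ["nvidia-smi"]                     # print visible GPUs and exit
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resource advertised by the device plugin
```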
| Concept | Normal app | AI workload |
|---|---|---|
| Compute | CPU and memory are usually enough | May require GPU, high memory or model acceleration |
| Startup | Usually fast | May be slow because model files load into memory/GPU |
| Storage | Config, uploads, DB connections | Model weights, tokenizer files, vector data, cache |
| Observability | HTTP latency, errors, CPU/memory | Plus GPU usage, queue time, token latency, model errors |
| Cost | General-purpose nodes at moderate cost | GPU nodes can be expensive and scarce |
Practical project: minimal AI model-serving Deployment
This is a simplified teaching example. In real environments, the image, model loading path and runtime depend on the serving framework you choose.
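A minimal sketch is shown below. The image name, port, and health endpoint are placeholders, and the nvidia.com/gpu resource assumes the NVIDIA device plugin is running on the GPU nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-api
  labels:
    app: model-api
spec:
  replicas: 1                 # single replica first; scale once behavior is proven
  selector:
    matchLabels:
      app: model-api
  template:
    metadata:
      labels:
        app: model-api
    spec:
      containers:
      - name: model-api
        image: registry.example.com/model-api:1.0  # placeholder serving image
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "1"
            memory: 4Gi
          limits:
            memory: 8Gi         # leave headroom for model files and inference
            nvidia.com/gpu: 1   # GPU exposed by the device plugin
        readinessProbe:
          httpGet:
            path: /healthz      # assumed endpoint that returns 200 only after the model is loaded
            port: 8080
          initialDelaySeconds: 60  # model load can be slow
          periodSeconds: 10
```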
Why each part matters
GPU limit
The GPU limit tells Kubernetes that this Pod needs a GPU resource exposed by the device plugin. Extended resources such as GPUs are specified only in limits; Kubernetes uses the limit as the request. Without available GPU capacity on a schedulable node, the Pod remains Pending.
Readiness probe
AI model servers can take time to load models. A readiness probe prevents traffic from reaching the Pod before the model is ready.
Memory limit
Model files and inference memory can be large. A low memory limit may cause OOMKilled failures during startup or under heavy request load.
Single replica first
Start with one replica in a lab. After proving resource behavior, scaling and observability, increase replicas carefully.
Troubleshooting scenario 1: Pod is Pending
My model-serving Pod is stuck in Pending. What should I check first?
Do not start by changing YAML randomly. First read scheduler events. Pending usually means Kubernetes cannot find a node that satisfies the Pod's requirements: GPU not available, taint not tolerated, node selector mismatch, insufficient memory, or missing device plugin.
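A minimal set of first checks, with placeholder names:

```bash
# Read the scheduler's own explanation in the Pod's Events section
kubectl describe pod <pod-name>

# Recent events across the namespace, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp

# Do any nodes actually advertise the GPU resource?
kubectl describe nodes | grep -A 8 "Allocatable:"

# Is the device plugin running? (name and namespace depend on your install)
kubectl get pods -n kube-system | grep -i device-plugin
```

The Events section usually names the failing requirement directly, for example a message such as Insufficient nvidia.com/gpu or an untolerated taint.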
AI prompt for this scenario
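One way to phrase it (an illustrative sketch; paste real command output rather than a summary):

```text
My Kubernetes Pod is stuck in Pending. It requests 1 nvidia.com/gpu.
Here is the Events section from `kubectl describe pod <pod-name>`:
<paste output>
Here is the Allocatable section from `kubectl describe node <node-name>`:
<paste output>
Based only on this output, which scheduling requirement is failing
(GPU capacity, taints, node selector, memory), and what should I check next?
```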
Troubleshooting scenario 2: Pod starts but model API is not ready
This is common with AI workloads because the container may start while the model is still loading or has failed to initialize. Work through the possible causes below; a sketch of the matching checks follows them.
Possible cause: slow model load
Increase the readiness probe's initial delay or optimize model loading. Do not route traffic until the model responds to health checks.
Possible cause: missing model file
Check mounted paths, object storage access and container environment variables.
Possible cause: insufficient memory
Look for OOMKilled, container restarts or model load failures in logs.
Possible cause: GPU/runtime mismatch
Verify GPU driver, runtime and device plugin health before blaming the application.
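A minimal sketch of checks that narrows these causes down (names are placeholders):

```bash
# Was the container OOMKilled or restarted?
kubectl describe pod <pod-name> | grep -A 5 "Last State"

# What does the model server report during startup?
kubectl logs <pod-name>
kubectl logs <pod-name> --previous   # logs from the last crashed container, if any

# Which probe is failing, and why?
kubectl describe pod <pod-name> | grep -i -A 2 "readiness"

# Can the container see the GPU? (only works if nvidia-smi exists in the image)
kubectl exec <pod-name> -- nvidia-smi
```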
Best practices for Kubernetes AI workloads
Start CPU-only when possible
Not every AI workload needs a GPU. Start with CPU for small models, batch jobs, and learning labs. Move to GPU only when latency, throughput, or model size requires it.
Use namespaces and quotas
GPU nodes are expensive. Use namespaces, ResourceQuota and RBAC to prevent accidental overuse.
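A minimal sketch, assuming a placeholder namespace named ml-serving; note that quota items for extended resources such as GPUs use the requests. prefix:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-serving            # placeholder namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # cap the total GPUs this namespace can request
    requests.memory: 64Gi          # also bound memory on expensive nodes
    limits.memory: 96Gi
```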
Design health checks carefully
A container can be running while the model is not ready. Separate liveness and readiness checks.
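A minimal sketch of the split, assuming a cheap process-level endpoint (/livez, a placeholder) and a model-aware endpoint (/healthz) that only returns 200 once the model is loaded:

```yaml
# container-level fragment for the Deployment above
livenessProbe:
  httpGet:
    path: /livez            # answers as soon as the process is up
    port: 8080
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /healthz          # answers 200 only once the model is loaded
    port: 8080
  initialDelaySeconds: 60   # leave room for slow model loads
  periodSeconds: 10
```

This way a slow model load keeps the Pod unready without triggering liveness restarts that would interrupt the load.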
Monitor GPU and app signals
Do not monitor only Pod status. Track latency, errors, queue depth, GPU usage, memory and request patterns.
Keep model changes controlled
Treat model version changes like application releases. Use rollout history, version labels and rollback planning.
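For example, with the Deployment from the practical project (the image tag is a placeholder):

```bash
# Check rollout history before changing the model version
kubectl rollout history deployment/model-api

# Roll out a new model version like any application release
kubectl set image deployment/model-api model-api=registry.example.com/model-api:1.1
kubectl rollout status deployment/model-api

# If the new model misbehaves, roll back
kubectl rollout undo deployment/model-api
```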
Protect data
AI workloads may process logs, prompts, documents or internal data. Apply access control and avoid exposing model endpoints publicly without controls.
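One way to keep the endpoint internal is a ClusterIP Service combined with a NetworkPolicy that only admits intended clients. A minimal sketch (labels and namespace are placeholders, and enforcement requires a CNI plugin that supports NetworkPolicy):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: model-api-allow-clients
  namespace: ml-serving        # placeholder namespace
spec:
  podSelector:
    matchLabels:
      app: model-api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: devops-tools   # placeholder label for allowed client Pods
    ports:
    - protocol: TCP
      port: 8080
```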
Interview scenario answer
How would you deploy and troubleshoot an AI workload on Kubernetes?
I would first understand whether the workload needs CPU or GPU, the model size, latency requirement and traffic pattern. For GPU workloads, I would ensure GPU nodes are prepared and the vendor device plugin exposes GPU resources to Kubernetes. I would define CPU, memory and GPU limits, readiness probes that wait for model load, and internal Services for controlled access. For troubleshooting, I would check Pod events, node capacity, device plugin status, logs, readiness endpoints, GPU metrics and rollout history. I would not blindly restart or scale the workload until I confirm whether the issue is scheduling, model loading, resource pressure, networking or application behavior.
Continue the AI in DevOps path
Use this Kubernetes knowledge together with local AI setup, AIOps concepts and scenario-based interview practice.
Official references
- Kubernetes documentation: Schedule GPUs
- Kubernetes documentation: Device Plugins
- NVIDIA Kubernetes device plugin
References are included so learners can verify the Kubernetes and GPU concepts from official or primary project documentation.