| Aspect | AIOps | MLOps | LLMOps |
| --- | --- | --- | --- |
| Main focus | IT operations, incidents, alerts, logs, metrics, events and automation. | Machine learning model lifecycle from data preparation to deployment and monitoring. | Large language model application lifecycle: prompts, retrieval, evaluation, guardrails and serving. |
| Primary users | NOC, DevOps, SRE, platform and operations teams. | Data scientists, ML engineers, data engineers, platform teams. | AI app developers, platform engineers, DevOps/SRE teams, security teams. |
| Example problem | Too many alerts during an incident; need correlation and summary. | A fraud detection model must be retrained and redeployed safely. | A support chatbot gives inconsistent answers and needs evaluation, RAG improvement and monitoring. |
| Common data/signals | Metrics, logs, traces, Kubernetes events, incidents, runbooks, topology and change history. | Training data, features, labels, model artifacts, experiments, pipelines and prediction metrics. | Prompts, documents, embeddings, vector search results, completions, traces, feedback and evaluation datasets. |
| Operational risks | False correlation, unsafe auto-remediation, noisy or missing signals. | Data drift, model drift, poor reproducibility, bad training data, deployment rollback issues. | Hallucination, prompt injection, data leakage, high token cost, latency, weak evals and unsafe tool use. |
| DevOps contribution | Observability pipelines, automation guardrails, incident workflows and safe remediation. | CI/CD for ML pipelines, model registry integration, infra provisioning and monitoring. | Model serving, API gateways, vector DB operations, eval pipelines, observability and access control. |
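The AIOps example problem above (too many alerts during an incident) can be sketched as a simple correlation pass: group alerts that share a service label and arrive within a short time window, then emit one summary line per group. This is a minimal illustration, not a production approach; the alert fields (`service`, `summary`, `ts`) and the fixed five-minute window are assumptions for the sketch.

```python
from datetime import datetime, timedelta

# Hypothetical alert records; the field names are assumptions for this sketch.
alerts = [
    {"service": "checkout", "summary": "high latency", "ts": datetime(2024, 5, 1, 10, 0)},
    {"service": "checkout", "summary": "error rate spike", "ts": datetime(2024, 5, 1, 10, 2)},
    {"service": "payments", "summary": "pod restart", "ts": datetime(2024, 5, 1, 10, 1)},
    {"service": "checkout", "summary": "CPU saturation", "ts": datetime(2024, 5, 1, 11, 30)},
]

def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts sharing a service that fall within `window` of the group's first alert."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for group in groups:
            if (group[0]["service"] == alert["service"]
                    and alert["ts"] - group[0]["ts"] <= window):
                group.append(alert)
                break
        else:
            # No matching group: this alert starts a new correlated group.
            groups.append([alert])
    return groups

for group in correlate(alerts):
    summaries = "; ".join(a["summary"] for a in group)
    print(f"{group[0]['service']}: {len(group)} alert(s): {summaries}")
```

Real AIOps platforms correlate on richer signals (topology, change history, trace context) rather than a single label and a fixed window, but the grouping-and-summarizing shape is the same.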