LLMOps Guide
Model Deployment
Ray Serve Configuration
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: llm-inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveDeployments:
    - name: llm-deployment
      numReplicas: 2
      rayStartParams:
        num-cpus: "16"
        num-gpus: "1"
      containerConfig:
        image: llm-server:latest
        env:
          - name: MODEL_NAME
            value: "llama2-7b"
          - name: BATCH_SIZE
            value: "4"
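
The manifest above passes `MODEL_NAME` and `BATCH_SIZE` to the container as environment variables. As a minimal sketch of how the server process might read and validate them at startup: `ServerSettings` and `load_settings` are hypothetical names chosen for illustration, not part of Ray Serve, and the default batch size of 1 is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSettings:
    """Settings the container receives through its environment."""
    model_name: str
    batch_size: int


def load_settings(env=None):
    """Read MODEL_NAME and BATCH_SIZE from a mapping (os.environ by default)."""
    env = os.environ if env is None else env

    model_name = env.get("MODEL_NAME", "").strip()
    if not model_name:
        raise ValueError("MODEL_NAME must be set")

    # Environment values are always strings, so BATCH_SIZE needs parsing.
    raw_batch = env.get("BATCH_SIZE", "1")  # default of 1 is an assumption
    try:
        batch_size = int(raw_batch)
    except ValueError as exc:
        raise ValueError(f"BATCH_SIZE must be an integer, got {raw_batch!r}") from exc
    if batch_size < 1:
        raise ValueError("BATCH_SIZE must be at least 1")

    return ServerSettings(model_name=model_name, batch_size=batch_size)
```

Failing fast on a missing or malformed value keeps a misconfigured replica from starting and serving with unintended settings.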
Model Monitoring
Prometheus Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-monitoring
spec:
  groups:
    - name: LLMMetrics
      rules:
        - alert: HighLatency
          expr: |
            rate(llm_inference_duration_seconds_sum[5m])
            /
            rate(llm_inference_duration_seconds_count[5m])
            > 1.0
          for: 5m
          labels:
            severity: warning
        - alert: HighErrorRate
          expr: |
            rate(llm_inference_errors_total[5m])
            /
            rate(llm_inference_requests_total[5m])
            > 0.01
          for: 5m
          labels:
            severity: critical
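
Both alert expressions above divide one counter's rate by another's: summed duration over request count gives mean latency in seconds, and errors over requests gives the error fraction. The sketch below reproduces that arithmetic from two counter samples taken 5 minutes apart. It is a simplification: `per_second_rate` and `ratio_of_rates` are hypothetical helpers, and unlike PromQL's `rate()` they ignore counter resets and extrapolation.

```python
def per_second_rate(start_value, end_value, window_seconds):
    """Average per-second increase of a counter across the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (end_value - start_value) / window_seconds


def ratio_of_rates(num_start, num_end, den_start, den_end, window_seconds=300.0):
    """Ratio of two counter rates, or None when the denominator did not move."""
    denominator = per_second_rate(den_start, den_end, window_seconds)
    if denominator == 0:
        # No requests in the window: there is nothing to compare to a threshold.
        return None
    return per_second_rate(num_start, num_end, window_seconds) / denominator


# Made-up sample values, for illustration only.
# Duration sum grew by 360 s while 300 requests completed: 1.2 s mean latency.
mean_latency = ratio_of_rates(100.0, 460.0, 200, 500)
# 6 errors across 300 requests: a 2% error fraction.
error_fraction = ratio_of_rates(10, 16, 1000, 1300)

high_latency = mean_latency is not None and mean_latency > 1.0
high_error_rate = error_fraction is not None and error_fraction > 0.01
```

With these sample values both conditions hold; in Prometheus the alerts would fire only after the condition stays true for the full `for: 5m` period.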
Performance Optimization
Triton Inference Server
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-triton
spec:
  predictor:
    # containerConcurrency is set on the predictor, not on the triton container
    containerConcurrency: 4
    triton:
      storageUri: s3://models/llm
      resources:
        limits:
          cpu: "8"
          memory: "16Gi"
          nvidia.com/gpu: "1"
      env:
        - name: TRITON_CACHE_CONFIG
          value: |
            {
              "cache_size": "8GB",
              "cache_policy": "LRU"
            }
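
The cache settings above name an LRU policy. To illustrate what least-recently-used eviction means, here is a minimal sketch. It does not describe Triton's internal cache: `LRUCache` is a hypothetical class, and it is bounded by entry count rather than by bytes to keep the example short.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            # Drop the entry at the front, which was used longest ago.
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)


cache = LRUCache(capacity=2)
cache.put("prompt-a", "response-a")
cache.put("prompt-b", "response-b")
cache.get("prompt-a")                # touch a, so b becomes the oldest
cache.put("prompt-c", "response-c")  # evicts prompt-b
```

For inference workloads the cache key would typically be derived from the full request payload, so only byte-identical requests hit the cache.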
Best Practices
- Model Management
  - Version control
  - A/B testing
  - Canary deployment
  - Model registry
- Observability
  - Performance metrics
  - Token usage
  - Response quality
  - Cost tracking
- Optimization
  - Quantization
  - Batching
  - Caching
  - Load balancing
- Security
  - Input validation
  - Output filtering
  - Rate limiting
  - Access control
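
Of the practices listed, rate limiting is one of the simplest to sketch. The example below uses a token bucket, one common approach among several. `TokenBucket` is a hypothetical class for illustration, and the injectable `clock` parameter is an assumption added to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        if rate_per_second <= 0 or capacity <= 0:
            raise ValueError("rate_per_second and capacity must be positive")
        self._rate = float(rate_per_second)
        self._capacity = float(capacity)
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never beyond the bucket capacity.
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Usage with a fake clock so the outcome does not depend on wall time.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_refill = [bucket.allow(), bucket.allow()]
```

For LLM endpoints, `cost` could be set to an estimated token count per request, so the limit tracks usage rather than request count.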