LLMOps Guide

Model Deployment

Ray Serve Configuration

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: llm-inference
spec:
  serviceUnhealthySecondThreshold: 300
  deploymentUnhealthySecondThreshold: 300
  serveDeployments:
    - name: llm-deployment
      numReplicas: 2
      rayStartParams:
        num-cpus: "16"
        num-gpus: "1"
      containerConfig:
        image: llm-server:latest
        env:
          - name: MODEL_NAME
            value: "llama2-7b"
          - name: BATCH_SIZE
            value: "4"
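
The `MODEL_NAME` and `BATCH_SIZE` variables in the manifest only take effect if the serving process reads them. The following is a minimal sketch of that step, assuming a Python entrypoint inside the `llm-server` image. `ServerConfig` and `load_config` are hypothetical names used for illustration, not part of Ray Serve, and the fallback defaults are an assumption that simply mirrors the manifest values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerConfig:
    """Settings the (hypothetical) server entrypoint needs at startup."""
    model_name: str
    batch_size: int


def load_config(env: Optional[Mapping[str, str]] = None) -> ServerConfig:
    """Build the config from environment variables, validating BATCH_SIZE."""
    env = os.environ if env is None else env
    # Assumption: fall back to the manifest values when a variable is unset.
    model_name = env.get("MODEL_NAME", "llama2-7b")
    raw_batch = env.get("BATCH_SIZE", "4")
    try:
        batch_size = int(raw_batch)
    except ValueError:
        raise ValueError(f"BATCH_SIZE must be an integer, got {raw_batch!r}") from None
    if batch_size < 1:
        raise ValueError(f"BATCH_SIZE must be >= 1, got {batch_size}")
    return ServerConfig(model_name=model_name, batch_size=batch_size)


if __name__ == "__main__":
    print(load_config())
```

Failing fast on a malformed `BATCH_SIZE` surfaces a bad manifest at container start instead of at the first request.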

Model Monitoring

Prometheus Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-monitoring
spec:
  groups:
  - name: LLMMetrics
    rules:
    - alert: HighLatency
      expr: |
        rate(llm_inference_duration_seconds_sum[5m])
        /
        rate(llm_inference_duration_seconds_count[5m])
        > 1.0
      for: 5m
      labels:
        severity: warning
    - alert: HighErrorRate
      expr: |
        rate(llm_inference_errors_total[5m])
        /
        rate(llm_inference_requests_total[5m])
        > 0.01
      for: 5m
      labels:
        severity: critical
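
To make the two alert expressions concrete, the sketch below reproduces their arithmetic on made-up counter samples taken 5 minutes apart. It is a simplification: `per_second_rate` is a hypothetical helper that ignores the counter resets and extrapolation that Prometheus's real `rate()` handles, `safe_ratio` returns 0.0 on zero traffic where PromQL would yield NaN, and the `for: 5m` hold period is not modeled.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def per_second_rate(first: Sample, last: Sample) -> float:
    """Simplified rate(): counter increase divided by elapsed seconds."""
    (t0, v0), (t1, v1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide two rates; treat zero traffic as a ratio of 0.0 (assumption)."""
    return numerator / denominator if denominator > 0 else 0.0


WINDOW = 300.0  # the [5m] range in the rules

# Illustrative samples only; these numbers are not measurements.
requests_rate = per_second_rate((0.0, 800.0), (WINDOW, 1100.0))      # 300 requests
duration_rate = per_second_rate((0.0, 1000.0), (WINDOW, 1450.0))     # 450 s of inference
errors_rate = per_second_rate((0.0, 10.0), (WINDOW, 16.0))           # 6 errors

avg_latency = safe_ratio(duration_rate, requests_rate)  # 1.5 seconds per request
error_ratio = safe_ratio(errors_rate, requests_rate)    # 6 / 300 = 2% of requests

high_latency = avg_latency > 1.0       # HighLatency threshold
high_error_rate = error_ratio > 0.01   # HighErrorRate threshold
```

Dividing the `_sum` rate by the `_count` rate gives the mean latency over the window, not a percentile, so a few very slow requests can be hidden by many fast ones.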

Performance Optimization

Triton Inference Server

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-triton
spec:
  predictor:
    containerConcurrency: 4
    triton:
      storageUri: s3://models/llm
      resources:
        limits:
          cpu: "8"
          memory: "16Gi"
          nvidia.com/gpu: "1"
      env:
        - name: TRITON_CACHE_CONFIG
          value: |
            {
              "cache_size": "8GB",
              "cache_policy": "LRU"
            }
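
The `"cache_policy": "LRU"` setting names a least-recently-used eviction policy. The sketch below shows that policy with a byte budget, as a rough illustration only. `LRUByteCache` is a hypothetical class and is not how Triton implements its response cache.

```python
from collections import OrderedDict
from typing import Optional


class LRUByteCache:
    """Least-recently-used cache bounded by total payload size in bytes."""

    def __init__(self, capacity_bytes: int) -> None:
        if capacity_bytes <= 0:
            raise ValueError("capacity_bytes must be positive")
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self._entries.get(key)
        if value is None:
            return None
        # A hit makes the entry the most recently used one.
        self._entries.move_to_end(key)
        return value

    def put(self, key: str, value: bytes) -> None:
        if len(value) > self.capacity_bytes:
            return  # larger than the whole cache; skip caching it
        old = self._entries.pop(key, None)
        if old is not None:
            self.used_bytes -= len(old)
        self._entries[key] = value
        self.used_bytes += len(value)
        # Evict from the least recently used end until the budget fits.
        while self.used_bytes > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)


cache = LRUByteCache(capacity_bytes=10)
cache.put("prompt-a", b"aaaa")
cache.put("prompt-b", b"bbbb")
cache.get("prompt-a")            # touch "prompt-a" so "prompt-b" is now oldest
cache.put("prompt-c", b"cccc")   # exceeds 10 bytes, so "prompt-b" is evicted
```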

Best Practices

Model Management

- Version control
- A/B testing
- Canary deployment
- Model registry
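
One common way to implement canary deployment and A/B testing is a deterministic traffic split keyed on a request or user ID. The sketch below assumes that approach. `pick_version` and the version labels `v1` and `v2` are hypothetical, and in practice the split is often done by the serving platform or a service mesh instead of application code.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int,
                 stable: str = "v1", canary: str = "v2") -> str:
    """Route a fixed share of IDs to the canary model version.

    Hashing the ID keeps the decision stable, so the same caller always
    sees the same version while the percentage is unchanged.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable


# Roughly 10% of distinct IDs should land on the canary.
assignments = [pick_version(f"user-{i}", canary_percent=10) for i in range(10_000)]
canary_share = assignments.count("v2") / len(assignments)
```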

Observability

- Performance metrics
- Token usage
- Response quality
- Cost tracking
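
Token usage and cost tracking can share one accumulator, since cost is usually derived from token counts. A minimal sketch under that assumption follows. `UsageTracker` is a hypothetical class and the per-1,000-token prices are placeholders, not real pricing.

```python
from collections import defaultdict
from typing import Dict


class UsageTracker:
    """Accumulates token counts per caller and converts them to cost."""

    def __init__(self, prompt_price_per_1k: float,
                 completion_price_per_1k: float) -> None:
        self.prompt_price_per_1k = prompt_price_per_1k
        self.completion_price_per_1k = completion_price_per_1k
        self.prompt_tokens: Dict[str, int] = defaultdict(int)
        self.completion_tokens: Dict[str, int] = defaultdict(int)

    def record(self, caller: str, prompt_tokens: int,
               completion_tokens: int) -> None:
        if prompt_tokens < 0 or completion_tokens < 0:
            raise ValueError("token counts must be non-negative")
        self.prompt_tokens[caller] += prompt_tokens
        self.completion_tokens[caller] += completion_tokens

    def cost(self, caller: str) -> float:
        # .get avoids creating empty entries for unknown callers.
        prompt = self.prompt_tokens.get(caller, 0)
        completion = self.completion_tokens.get(caller, 0)
        return (prompt / 1000 * self.prompt_price_per_1k
                + completion / 1000 * self.completion_price_per_1k)


tracker = UsageTracker(prompt_price_per_1k=0.5, completion_price_per_1k=1.5)
tracker.record("team-a", prompt_tokens=1000, completion_tokens=500)
tracker.record("team-a", prompt_tokens=1000, completion_tokens=500)
team_a_cost = tracker.cost("team-a")
```

Tracking prompt and completion tokens separately matters because many providers price them differently.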

Optimization

- Quantization
- Batching
- Caching
- Load balancing
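
Batching groups several requests into one forward pass to improve GPU utilization. The sketch below shows the simplest form, static batching with a fixed size such as the `BATCH_SIZE` of 4 used earlier. `make_batches` is a hypothetical helper, and production servers typically use dynamic or continuous batching with a wait timeout, which this does not model.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches; the final batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = [f"prompt-{i}" for i in range(10)]
batches = list(make_batches(prompts, batch_size=4))
batch_sizes = [len(batch) for batch in batches]  # [4, 4, 2]
```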

Security

- Input validation
- Output filtering
- Rate limiting
- Access control
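
Rate limiting is often implemented with a token bucket, which allows short bursts while capping the sustained request rate. The following is a minimal single-process sketch of that technique. `TokenBucket` is a hypothetical class, and a multi-replica deployment would need shared state (for example in a gateway or external store), which is outside this sketch.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate_per_sec <= 0 or capacity <= 0:
            raise ValueError("rate_per_sec and capacity must be positive")
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock  # injectable so tests can control time
        self._tokens = capacity
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the request may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.rate_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Usage with a fake clock: 1 request per second, bursts of up to 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third is rejected
fake_now[0] = 1.0                                         # one second passes
after_refill = [bucket.allow(), bucket.allow()]           # one token refilled
```

For LLM endpoints, passing the request's token count as `cost` limits throughput by tokens instead of by request count.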