AI Platform

Vertex AI (the successor to Google Cloud AI Platform, and the product this guide refers to as "AI Platform") is a machine learning platform that lets developers and data scientists train and deploy ML models. This guide focuses on practical deployment scenarios using Terraform and the gcloud CLI.

Key Concepts

Custom Models: Train and deploy your own ML models using custom training
AutoML: Automated machine learning with minimal expertise required
Pipelines: ML workflow orchestration
Model Registry: Version control, lineage tracking, and metadata management for models
Endpoints: Deploy models for online prediction
Feature Store: Store, share and serve ML features
Managed Notebooks: Jupyter notebook environments for ML development

Deploying AI Platform Resources with Terraform

Example: Setting up an AI Platform Notebook Instance

resource "google_project_service" "notebooks_api" {
  service = "notebooks.googleapis.com"
  disable_on_destroy = false
}

resource "google_notebooks_instance" "ml_instance" {
  name = "ml-notebook-instance"
  location = "us-central1-a"
  machine_type = "n1-standard-4"
  
  vm_image {
    project = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }
  
  install_gpu_driver = false
  boot_disk_type = "PD_SSD"
  boot_disk_size_gb = 100
  
  no_public_ip = true
  no_proxy_access = false
  
  network = "default"
  subnet = "default"
  
  depends_on = [google_project_service.notebooks_api]
}

Example: Creating a Model Endpoint with Terraform

resource "google_project_service" "aiplatform_api" {
  service = "aiplatform.googleapis.com"
  disable_on_destroy = false
}

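resource "google_vertex_ai_endpoint" "prediction_endpoint" {
  # The provider requires a numeric endpoint ID (no leading zeros, at most
  # 10 digits); the value below is only a placeholder.
  name         = "1234567890"
  display_name = "sample-prediction-endpoint"
  location     = "us-central1"
  
  depends_on = [google_project_service.aiplatform_api]
}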

# Note: at the time of writing, the hashicorp/google provider does not appear
# to offer google_vertex_ai_model or google_vertex_ai_model_deployment
# resources, so Terraform manages only the endpoint and the artifact bucket
# here. Upload the model and deploy it to the endpoint with
# `gcloud ai models upload` and `gcloud ai endpoints deploy-model`
# (see "Deploying a Model" in the gcloud section), using:
#   artifact URI:    gs://<model bucket name>/model
#   container image: us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest
#   machine type:    n1-standard-2
#   replicas:        min 1, max 2

resource "google_storage_bucket" "model_bucket" {
  name          = "model-artifacts-${random_id.bucket_suffix.hex}"
  location      = "US"
  force_destroy = true
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
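
The configuration above still has to be initialised and applied. One possible workflow is sketched below: it saves a plan to a file and applies exactly that plan, so what was reviewed is what gets created. The function name `tf_deploy` and the `TERRAFORM` override variable are illustrative assumptions, not part of Terraform.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the standard init / plan / apply sequence.
# Set TERRAFORM=echo for a dry run that only prints the commands.
TERRAFORM="${TERRAFORM:-terraform}"

# Initialise providers, save a plan, then apply exactly that plan.
# $1: plan file name (defaults to "tfplan")
tf_deploy() {
  local plan_file="${1:-tfplan}"
  "$TERRAFORM" init -input=false &&
    "$TERRAFORM" plan -input=false -out="$plan_file" &&
    "$TERRAFORM" apply -input=false "$plan_file"
}
```

Run `tf_deploy` from the directory that holds the .tf files. Each step runs only if the previous one succeeded.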

Deploying AI Platform Resources with gcloud CLI

Creating a Notebook Instance

# Enable the Notebooks API
gcloud services enable notebooks.googleapis.com

# Create a Notebook instance
gcloud notebooks instances create ml-notebook \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=tf-latest-cpu \
  --machine-type=n1-standard-4 \
  --location=us-central1-a \
  --boot-disk-size=100 \
  --no-public-ip

Training a Custom Model

# Create a custom training job from a Python package.
# The machine type, replica count, image, and module are keys of
# --worker-pool-spec rather than standalone flags.
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name=my-training-job \
  --python-package-uris=gs://my-bucket/trainer.tar.gz \
  --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,executor-image-uri=gcr.io/cloud-aiplatform/training/tf-cpu.2-2:latest,python-module=trainer.task
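
After submission, gcloud reports the job's full resource name. As a rough sketch of how to follow the job from there, the helpers below pull the job ID out of that name and wrap the `describe` and `stream-logs` subcommands. The function names and the `GCLOUD` override variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for following a custom training job.
# Set GCLOUD=echo for a dry run that only prints the command.
GCLOUD="${GCLOUD:-gcloud}"

# Extract the trailing job ID from a resource name such as
# projects/123/locations/us-central1/customJobs/456
job_id_from_name() {
  printf '%s\n' "${1##*/}"
}

# Print the job's current state.
# $1: job ID, $2: region
job_state() {
  "$GCLOUD" ai custom-jobs describe "$1" --region="$2" --format="value(state)"
}

# Stream the training logs for the job.
# $1: job ID, $2: region
job_logs() {
  "$GCLOUD" ai custom-jobs stream-logs "$1" --region="$2"
}
```

For example, `job_logs "$(job_id_from_name "$JOB_NAME")" us-central1` streams the logs of the job whose resource name is stored in `JOB_NAME`.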

Deploying a Model

# Upload model to Model Registry
gcloud ai models upload \
  --region=us-central1 \
  --display-name=my-model \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest \
  --artifact-uri=gs://my-bucket/model/

# Create an endpoint
gcloud ai endpoints create \
  --region=us-central1 \
  --display-name=my-endpoint

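# Deploy model to the endpoint.
# ENDPOINT_ID and MODEL_ID are the numeric IDs, not the display names;
# look them up with `gcloud ai endpoints list` and `gcloud ai models list`.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=my-deployment \
  --machine-type=n1-standard-2 \
  --min-replica-count=1 \
  --max-replica-count=2 \
  --traffic-split=0=100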
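
Once the deployment finishes, a quick online prediction confirms that the endpoint is serving. The sketch below builds a request body in the `{"instances": [...]}` shape and sends it with `gcloud ai endpoints predict`. The helper names, the `GCLOUD` override, and the example instance values are assumptions; the real instance format depends on your model's input signature.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for smoke-testing a deployed endpoint.
# Set GCLOUD=echo for a dry run that only prints the command.
GCLOUD="${GCLOUD:-gcloud}"

# Wrap one or more JSON instances (one per argument) in a request body
build_request() {
  local joined
  joined=$(IFS=,; echo "$*")
  printf '{"instances": [%s]}\n' "$joined"
}

# Send a request file to the endpoint.
# $1: endpoint ID, $2: region, $3: path to the JSON request file
predict() {
  "$GCLOUD" ai endpoints predict "$1" \
    --region="$2" \
    --json-request="$3"
}
```

A typical use is `build_request '[1.0, 2.0]' > request.json` followed by `predict ENDPOINT_ID us-central1 request.json`.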

Best Practices

Cost Management:
- Use preemptible VMs for non-critical training jobs
- Scale endpoints based on traffic patterns
- Delete unused resources promptly

Security:
- Use VPC Service Controls to restrict data access
- Apply IAM roles with least privilege
- Enable audit logging for AI Platform operations

Performance Optimization:
- Select appropriate machine types for your workloads
- Use GPU/TPU accelerators for deep learning tasks
- Implement batch prediction for high-throughput, non-realtime needs

MLOps:
- Implement CI/CD pipelines for model training and deployment
- Use Vertex AI Pipelines for end-to-end ML workflows
- Implement model monitoring for detecting drift and ensuring quality
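
To make the least-privilege point concrete, the sketch below creates a dedicated service account for ML workloads and grants it only `roles/aiplatform.user` instead of a broad role such as Editor. The account name, the helper names, and the `GCLOUD` override are assumptions; pick a narrower predefined or custom role if your workload allows it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a dedicated, narrowly scoped service account.
# Set GCLOUD=echo for a dry run that only prints the commands.
GCLOUD="${GCLOUD:-gcloud}"

# Build the service account email from its name and the project ID
sa_email() {
  printf '%s@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Create the account and bind a single Vertex AI role to it.
# $1: account name, $2: project ID
create_ml_service_account() {
  local name="$1" project="$2"
  "$GCLOUD" iam service-accounts create "$name" --project="$project" &&
    "$GCLOUD" projects add-iam-policy-binding "$project" \
      --member="serviceAccount:$(sa_email "$name" "$project")" \
      --role="roles/aiplatform.user"
}
```

Calling `create_ml_service_account ml-runner my-project` would create `ml-runner@my-project.iam.gserviceaccount.com` and grant it that one role on the project.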

Common Issues and Troubleshooting

Model Deployment Failures

Check container compatibility with the selected machine type
Ensure model artifacts are properly structured
Verify IAM permissions for service accounts
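
The artifact check can be partly automated. The prebuilt TensorFlow serving containers expect a SavedModel, so the artifact directory should contain a `saved_model.pb` file. The sketch below lists the directory and looks for that file; the helper names and the `GCLOUD` override are assumptions, and other frameworks expect different file names.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deployment checks for a TensorFlow SavedModel.
# Set GCLOUD=echo for a dry run that only prints the commands.
GCLOUD="${GCLOUD:-gcloud}"

# Read an object listing on stdin; succeed if it contains saved_model.pb
has_saved_model() {
  grep -q '/saved_model\.pb$'
}

# List the artifact directory and check it for a SavedModel.
# $1: artifact URI, for example gs://my-bucket/model/
check_artifacts() {
  "$GCLOUD" storage ls "$1" | has_saved_model
}

# Show the endpoint, including any deployed models.
# $1: endpoint ID, $2: region
show_endpoint() {
  "$GCLOUD" ai endpoints describe "$1" --region="$2"
}
```

`check_artifacts gs://my-bucket/model/ || echo "no saved_model.pb found"` gives a quick signal before a deployment is attempted.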

Performance Issues

Monitor endpoint metrics for CPU/memory usage
Check for bottlenecks in preprocessing steps
Consider tuning autoscaling (the endpoint's minimum and maximum replica counts)

Cost Overruns

Set budget alerts for AI Platform resources
Review usage regularly to identify idle resources
Use spot/preemptible instances for training when possible
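
A budget alert and a periodic endpoint inventory cover the first two points. The sketch below is one way to script them; the budget display name, the 50% and 90% thresholds, the helper names, and the `GCLOUD` override are assumptions. As written, the budget applies to the whole billing account rather than to Vertex AI alone.

```shell
#!/usr/bin/env bash
# Hypothetical cost-control helpers.
# Set GCLOUD=echo for a dry run that only prints the commands.
GCLOUD="${GCLOUD:-gcloud}"

# Create a budget that alerts at 50% and 90% of the amount.
# $1: billing account ID, $2: amount with currency, for example 500USD
create_budget() {
  "$GCLOUD" billing budgets create \
    --billing-account="$1" \
    --display-name="vertex-ai-budget" \
    --budget-amount="$2" \
    --threshold-rule=percent=0.5 \
    --threshold-rule=percent=0.9
}

# List endpoints in a region so idle ones can be reviewed.
# $1: region
list_endpoints() {
  "$GCLOUD" ai endpoints list --region="$1"
}
```

Endpoints with deployed models keep their minimum replicas running, so anything returned by `list_endpoints` that no longer receives traffic is a candidate for undeployment.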

Real-World Example: Sentiment Analysis Pipeline

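This example provisions the infrastructure for an end-to-end sentiment analysis pipeline: a storage bucket, the required APIs, a development notebook, a prediction endpoint, and a CI/CD trigger that launches training: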

# Terraform for sentiment analysis pipeline

# Project ID, used below to keep the bucket name unique
variable "project_id" {
  type = string
}

# 1. Create Cloud Storage bucket for data and artifacts
resource "google_storage_bucket" "ml_bucket" {
  name          = "sentiment-analysis-${var.project_id}"
  location      = "US"
  force_destroy = true
}

# 2. Enable required APIs
resource "google_project_service" "required_apis" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "notebooks.googleapis.com",
    "container.googleapis.com",
    "cloudbuild.googleapis.com"
  ])
  
  service = each.key
  disable_on_destroy = false
}

# 3. Create Notebook for development
resource "google_notebooks_instance" "sentiment_notebook" {
  name = "sentiment-notebook"
  location = "us-central1-a"
  machine_type = "n1-standard-4"
  
  vm_image {
    project = "deeplearning-platform-release"
    image_family = "tf-latest-cpu"
  }
  
  depends_on = [google_project_service.required_apis]
}

# 4. Create AI Platform Endpoint
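resource "google_vertex_ai_endpoint" "sentiment_endpoint" {
  # The provider requires a numeric endpoint ID (no leading zeros, at most
  # 10 digits); the value below is only a placeholder.
  name         = "1234567891"
  display_name = "sentiment-analysis-endpoint"
  location     = "us-central1"
  
  depends_on = [google_project_service.required_apis]
}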

# 5. Create Cloud Build trigger for CI/CD
resource "google_cloudbuild_trigger" "ml_pipeline_trigger" {
  name = "ml-pipeline-trigger"
  location = "global"
  
  github {
    owner = "owner-name"
    name  = "repo-name"
    push {
      branch = "main"
    }
  }
  
  build {
    step {
      name = "gcr.io/cloud-builders/gcloud"
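      args = [
        "ai", "custom-jobs", "create",
        "--region=us-central1",
        "--display-name=sentiment-training-job",
        "--python-package-uris=gs://${google_storage_bucket.ml_bucket.name}/trainer.tar.gz",
        # python-module is a key of --worker-pool-spec, not a standalone flag.
        # The machine type and executor image here are example values.
        "--worker-pool-spec=machine-type=n1-standard-4,replica-count=1,executor-image-uri=gcr.io/cloud-aiplatform/training/tf-cpu.2-2:latest,python-module=trainer.task"
      ]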
    }
  }
  
  depends_on = [google_project_service.required_apis]
}
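
The build step reads the training package from `trainer.tar.gz` at the root of the bucket, so that object has to exist before the trigger fires. The sketch below shows one way to upload a locally built source distribution to that location; the helper names, the `GCLOUD` override, and the local archive path in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for publishing the trainer package.
# Set GCLOUD=echo for a dry run that only prints the command.
GCLOUD="${GCLOUD:-gcloud}"

# Object URI that the Cloud Build step passes to --python-package-uris.
# $1: bucket name
trainer_uri() {
  printf 'gs://%s/trainer.tar.gz\n' "$1"
}

# Copy a local archive to that URI.
# $1: local archive path, $2: bucket name
upload_trainer() {
  "$GCLOUD" storage cp "$1" "$(trainer_uri "$2")"
}
```

For example, `upload_trainer dist/trainer-0.1.tar.gz sentiment-analysis-my-project` copies a locally built archive to the path the trigger expects.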