AI Platform
Google AI Platform (formerly known as Vertex AI) is a machine learning platform that allows developers and data scientists to train and deploy ML models. This guide focuses on practical deployment scenarios using Terraform and gcloud CLI.
Key Concepts
- Custom Models: Train and deploy your own ML models using custom training
- AutoML: Automated machine learning with minimal expertise required
- Pipelines: ML workflow orchestration
- Model Registry: Version control, lineage tracking, and metadata management for models
- Endpoints: Deploy models for online prediction
- Feature Store: Store, share and serve ML features
- Managed Notebooks: Jupyter notebook environments for ML development
Deploying AI Platform Resources with Terraform
Example: Setting up an AI Platform Notebook Instance
resource "google_project_service" "notebooks_api" {
service = "notebooks.googleapis.com"
disable_on_destroy = false
}
resource "google_notebooks_instance" "ml_instance" {
name = "ml-notebook-instance"
location = "us-central1-a"
machine_type = "n1-standard-4"
vm_image {
project = "deeplearning-platform-release"
image_family = "tf-latest-cpu"
}
install_gpu_driver = false
boot_disk_type = "PD_SSD"
boot_disk_size_gb = 100
no_public_ip = true
no_proxy_access = false
network = "default"
subnet = "default"
depends_on = [google_project_service.notebooks_api]
}
Example: Creating a Model Endpoint with Terraform
resource "google_project_service" "aiplatform_api" {
service = "aiplatform.googleapis.com"
disable_on_destroy = false
}
resource "google_vertex_ai_endpoint" "prediction_endpoint" {
display_name = "sample-prediction-endpoint"
location = "us-central1"
depends_on = [google_project_service.aiplatform_api]
}
resource "google_vertex_ai_model" "sample_model" {
display_name = "sample-model"
metadata {
artifact_uri = "gs://${google_storage_bucket.model_bucket.name}/model"
container_spec {
image_uri = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest"
}
}
region = "us-central1"
depends_on = [google_project_service.aiplatform_api]
}
resource "google_vertex_ai_model_deployment" "default" {
endpoint = google_vertex_ai_endpoint.prediction_endpoint.id
model = google_vertex_ai_model.sample_model.id
display_name = "sample-deployment"
dedicated_resources {
machine_type = "n1-standard-2"
min_replica_count = 1
max_replica_count = 2
}
depends_on = [
google_vertex_ai_model.sample_model,
google_vertex_ai_endpoint.prediction_endpoint
]
}
resource "google_storage_bucket" "model_bucket" {
name = "model-artifacts-${random_id.bucket_suffix.hex}"
location = "US"
force_destroy = true
}
resource "random_id" "bucket_suffix" {
byte_length = 4
}
Deploying AI Platform Resources with gcloud CLI
Creating a Notebook Instance
# Enable the Notebooks API
gcloud services enable notebooks.googleapis.com
# Create a Notebook instance
gcloud notebooks instances create ml-notebook \
--vm-image-project=deeplearning-platform-release \
--vm-image-family=tf-latest-cpu \
--machine-type=n1-standard-4 \
--location=us-central1-a \
--boot-disk-size=100GiB \
--no-public-ip
Training a Custom Model
# Create a custom training job
gcloud ai custom-jobs create \
--region=us-central1 \
--display-name=my-training-job \
--python-package-uris=gs://my-bucket/trainer.tar.gz \
--python-module=trainer.task \
--container-image-uri=gcr.io/cloud-aiplatform/training/tf-cpu.2-2:latest \
--replica-count=1 \
--machine-type=n1-standard-4
Deploying a Model
# Upload model to Model Registry
gcloud ai models upload \
--region=us-central1 \
--display-name=my-model \
--container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest \
--artifact-uri=gs://my-bucket/model/
# Create an endpoint
gcloud ai endpoints create \
--region=us-central1 \
--display-name=my-endpoint
# Deploy model to the endpoint
gcloud ai endpoints deploy-model my-endpoint \
--region=us-central1 \
--model=my-model \
--display-name=my-deployment \
--machine-type=n1-standard-2 \
--min-replica-count=1 \
--max-replica-count=2 \
--traffic-split=0=100
Best Practices
- Cost Management:
- Use preemptible VMs for non-critical training jobs
- Scale endpoints based on traffic patterns
- Delete unused resources promptly
- Security:
- Use VPC Service Controls to restrict data access
- Apply IAM roles with least privilege
- Enable audit logging for AI Platform operations
- Performance Optimization:
- Select appropriate machine types for your workloads
- Use GPU/TPU accelerators for deep learning tasks
- Implement batch prediction for high-throughput, non-realtime needs
- MLOps:
- Implement CI/CD pipelines for model training and deployment
- Use Vertex AI Pipelines for end-to-end ML workflows
- Implement model monitoring for detecting drift and ensuring quality
Common Issues and Troubleshooting
Model Deployment Failures
- Check container compatibility with the selected machine type
- Ensure model artifacts are properly structured
- Verify IAM permissions for service accounts
Performance Issues
- Monitor endpoint metrics for CPU/memory usage
- Check for bottlenecks in preprocessing steps
- Consider using AutoScaling policies
Cost Overruns
- Set budget alerts for AI Platform resources
- Review usage regularly to identify idle resources
- Use spot/preemptible instances for training when possible
Real-World Example: Sentiment Analysis Pipeline
This example demonstrates a complete end-to-end ML pipeline for sentiment analysis:
# Terraform for sentiment analysis pipeline
# 1. Create Cloud Storage bucket for data and artifacts
resource "google_storage_bucket" "ml_bucket" {
name = "sentiment-analysis-${var.project_id}"
location = "US"
force_destroy = true
}
# 2. Enable required APIs
resource "google_project_service" "required_apis" {
for_each = toset([
"aiplatform.googleapis.com",
"notebooks.googleapis.com",
"container.googleapis.com",
"cloudbuild.googleapis.com"
])
service = each.key
disable_on_destroy = false
}
# 3. Create Notebook for development
resource "google_notebooks_instance" "sentiment_notebook" {
name = "sentiment-notebook"
location = "us-central1-a"
machine_type = "n1-standard-4"
vm_image {
project = "deeplearning-platform-release"
image_family = "tf-latest-cpu"
}
depends_on = [google_project_service.required_apis]
}
# 4. Create AI Platform Endpoint
resource "google_vertex_ai_endpoint" "sentiment_endpoint" {
display_name = "sentiment-analysis-endpoint"
location = "us-central1"
depends_on = [google_project_service.required_apis]
}
# 5. Create Cloud Build trigger for CI/CD
resource "google_cloudbuild_trigger" "ml_pipeline_trigger" {
name = "ml-pipeline-trigger"
location = "global"
github {
owner = "owner-name"
name = "repo-name"
push {
branch = "main"
}
}
build {
step {
name = "gcr.io/cloud-builders/gcloud"
args = [
"ai", "custom-jobs", "create",
"--region=us-central1",
"--display-name=sentiment-training-job",
"--python-package-uris=gs://${google_storage_bucket.ml_bucket.name}/trainer.tar.gz",
"--python-module=trainer.task"
]
}
}
depends_on = [google_project_service.required_apis]
}