Cloud Storage
Google Cloud Storage is a globally unified, scalable, and highly durable object storage service for storing and accessing any amount of data. It provides industry-leading availability, performance, security, and management features.
Key Features
- Global Accessibility: Access data from anywhere in the world
- Scalability: Store and retrieve any amount of data at any time
- Durability: 11 9's (99.999999999%) durability for stored objects
- Storage Classes: Standard, Nearline, Coldline, and Archive storage tiers
- Object Versioning: Maintain history and recover from accidental deletions
- Object Lifecycle Management: Automatically transition and delete objects
- Strong Consistency: Read-after-write and list consistency
- Customer-Managed Encryption Keys (CMEK): Control encryption keys
- Object Hold and Retention Policies: Enforce compliance requirements
- VPC Service Controls: Add security perimeter around sensitive data
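To make the durability figure in the list above more concrete, here is a back-of-the-envelope sketch. It assumes, as a simplification, that the 11 nines figure can be read as an independent per-object annual survival probability; the function names are illustrative and not part of any Google API.

```python
from fractions import Fraction

# 11 nines of durability, taken from the feature list above
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the simplified model."""
    return object_count * (1 - ANNUAL_DURABILITY)


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)
```

Under this model, a bucket holding ten million objects would be expected to lose one object roughly every ten thousand years. Exact fractions are used to avoid floating-point rounding at this scale.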
Cloud Storage Classes
| Storage Class | Purpose | Minimum Storage Duration | Typical Use Cases |
|---|---|---|---|
| Standard | High-performance, frequent access | None | Website content, active data, mobile apps |
| Nearline | Low-frequency access | 30 days | Data accessed less than once a month |
| Coldline | Very low-frequency access | 90 days | Data accessed less than once a quarter |
| Archive | Data archiving, online backup | 365 days | Long-term archive, disaster recovery |
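The minimum storage durations in the table matter for cost: deleting, replacing, or moving an object before its minimum duration incurs an early deletion charge, billed as if the object had been stored for the full minimum. A minimal sketch of that rule follows; `billable_storage_days` is a hypothetical helper, and real billing is prorated more finely than whole days.

```python
# Minimum storage durations in days, copied from the table above
MIN_STORAGE_DAYS = {
    "STANDARD": 0,
    "NEARLINE": 30,
    "COLDLINE": 90,
    "ARCHIVE": 365,
}


def billable_storage_days(storage_class: str, days_stored: int) -> int:
    """Return the number of days billed for an object deleted after days_stored."""
    minimum = MIN_STORAGE_DAYS[storage_class.upper()]
    # An object removed early is still charged for the full minimum duration
    return max(days_stored, minimum)
```

For example, a Nearline object deleted after 10 days would be billed for 30 days under this rule, which is why short-lived data usually belongs in Standard.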
Deploying Cloud Storage with Terraform
Basic Bucket Creation
resource "google_storage_bucket" "static_assets" {
name = "my-static-assets-bucket"
location = "US"
storage_class = "STANDARD"
labels = {
environment = "production"
department = "engineering"
}
# Enable versioning for recovery
versioning {
enabled = true
}
# Use uniform bucket-level access (recommended)
uniform_bucket_level_access = true
# Public access prevention (recommended security setting)
public_access_prevention = "enforced"
}
# Grant access to a service account
resource "google_storage_bucket_iam_member" "viewer" {
bucket = google_storage_bucket.static_assets.name
role = "roles/storage.objectViewer"
member = "serviceAccount:my-service-account@my-project.iam.gserviceaccount.com"
}
Advanced Configuration with Lifecycle Policies
resource "google_storage_bucket" "data_lake" {
name = "my-datalake-bucket"
location = "US-CENTRAL1"
storage_class = "STANDARD"
# Enable versioning
versioning {
enabled = true
}
# Enable object lifecycle management
lifecycle_rule {
condition {
age = 30 # days
}
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
}
lifecycle_rule {
condition {
age = 90 # days
}
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
}
lifecycle_rule {
condition {
age = 365 # days
}
action {
type = "SetStorageClass"
storage_class = "ARCHIVE"
}
}
# Delete non-current versions 30 days after they become non-current
lifecycle_rule {
condition {
days_since_noncurrent_time = 30 # "age" would count from object creation instead
with_state = "ARCHIVED" # non-current versions
}
action {
type = "Delete"
}
}
# Use Customer-Managed Encryption Key (CMEK)
encryption {
default_kms_key_name = google_kms_crypto_key.bucket_key.id
}
# Other security settings
uniform_bucket_level_access = true
public_access_prevention = "enforced"
}
# Create KMS key for CMEK
resource "google_kms_key_ring" "storage_keyring" {
name = "storage-keyring"
location = "us-central1"
}
resource "google_kms_crypto_key" "bucket_key" {
name = "bucket-key"
key_ring = google_kms_key_ring.storage_keyring.id
}
# Grant Cloud Storage service account access to use KMS key
data "google_storage_project_service_account" "gcs_account" {}
resource "google_kms_crypto_key_iam_binding" "crypto_key_binding" {
crypto_key_id = google_kms_crypto_key.bucket_key.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
members = [
"serviceAccount:${data.google_storage_project_service_account.gcs_account.email_address}",
]
}
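The three `SetStorageClass` rules in the data lake bucket above can be reasoned about with a small simulation. This sketch assumes the documented precedence that, when several `SetStorageClass` rules match, the colder (cheaper at rest) class wins; `storage_class_at_age` is a hypothetical helper, and real lifecycle actions run asynchronously, so actual transitions can lag behind these thresholds.

```python
# (minimum age in days, target class), mirroring the SetStorageClass rules above
TRANSITIONS = [(30, "NEARLINE"), (90, "COLDLINE"), (365, "ARCHIVE")]

# Classes ordered from warmest to coldest
COLDNESS = ["STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE"]


def storage_class_at_age(age_days: int, initial_class: str = "STANDARD") -> str:
    """Return the class a live object would have once all matching rules applied."""
    matching = [cls for min_age, cls in TRANSITIONS if age_days >= min_age]
    # Objects never move back to a warmer class, so keep the coldest candidate
    candidates = [initial_class] + matching
    return max(candidates, key=COLDNESS.index)
```

A quick check such as `storage_class_at_age(100)` is a cheap way to confirm that a rule set produces the intended tiering before applying it.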
Static Website Hosting Configuration
resource "google_storage_bucket" "website" {
name = "my-static-website-bucket"
location = "US"
storage_class = "STANDARD"
# Enable website serving
website {
main_page_suffix = "index.html"
not_found_page = "404.html"
}
# Set CORS configuration
cors {
origin = ["https://example.com"]
method = ["GET", "HEAD", "OPTIONS"]
response_header = ["Content-Type", "Access-Control-Allow-Origin"]
max_age_seconds = 3600
}
# Allow Terraform to delete the bucket even if it still contains objects
force_destroy = true
}
# Make objects publicly readable
resource "google_storage_bucket_iam_member" "public_read" {
bucket = google_storage_bucket.website.name
role = "roles/storage.objectViewer"
member = "allUsers"
}
# Upload index page
resource "google_storage_bucket_object" "index" {
name = "index.html"
bucket = google_storage_bucket.website.name
source = "./website/index.html"
# Set content type
content_type = "text/html"
}
# Upload 404 page
resource "google_storage_bucket_object" "not_found" {
name = "404.html"
bucket = google_storage_bucket.website.name
source = "./website/404.html"
content_type = "text/html"
}
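The website objects above set `content_type` by hand. When a site has many files, the content type can be derived from the file extension instead. The following is a minimal sketch using only the Python standard library; `content_type_for` and `upload_plan` are hypothetical helper names, and the fallback type is an assumption.

```python
import mimetypes


def content_type_for(object_name: str, default: str = "application/octet-stream") -> str:
    """Guess a Content-Type from the file extension, falling back to a default."""
    guessed, _encoding = mimetypes.guess_type(object_name)
    return guessed or default


def upload_plan(object_names):
    """Map each object name to the Content-Type it should be uploaded with."""
    return {name: content_type_for(name) for name in object_names}
```

The resulting mapping could feed a Terraform `for_each` input or an upload script, so that browsers render pages rather than downloading them.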
Managing Cloud Storage with gsutil
Basic Bucket Commands
# Create a bucket
gsutil mb -l us-central1 gs://my-bucket
# List buckets
gsutil ls
# List objects in a bucket
gsutil ls gs://my-bucket/
# Get bucket information
gsutil ls -L gs://my-bucket
# Enable bucket versioning
gsutil versioning set on gs://my-bucket
# Set default storage class
gsutil defstorageclass set NEARLINE gs://my-bucket
Object Operations
# Upload file(s)
gsutil cp file.txt gs://my-bucket/
# Upload directory
gsutil cp -r ./local-dir gs://my-bucket/dir/
# Upload with specific content type
gsutil -h "Content-Type:text/html" cp index.html gs://my-bucket/
# Download file(s)
gsutil cp gs://my-bucket/file.txt ./
# Download directory
gsutil cp -r gs://my-bucket/dir/ ./local-dir/
# Move/Rename objects
gsutil mv gs://my-bucket/old-name.txt gs://my-bucket/new-name.txt
# Delete object
gsutil rm gs://my-bucket/file.txt
# Delete all objects in a bucket
gsutil rm gs://my-bucket/**
# Delete bucket and all its contents
gsutil rm -r gs://my-bucket
Access Control
# Make object public (object ACLs only work while uniform bucket-level access is disabled)
gsutil acl ch -u AllUsers:R gs://my-bucket/file.txt
# Set bucket-level IAM policy
gsutil iam ch serviceAccount:my-service@my-project.iam.gserviceaccount.com:objectViewer gs://my-bucket
# Get IAM policy
gsutil iam get gs://my-bucket
# Set uniform bucket-level access (recommended)
gsutil uniformbucketlevelaccess set on gs://my-bucket
# Enforce public access prevention
gsutil pap set enforced gs://my-bucket
Lifecycle Management
# Create a lifecycle policy JSON file
cat > lifecycle.json << EOF
{
"lifecycle": {
"rule": [
{
"action": {
"type": "SetStorageClass",
"storageClass": "NEARLINE"
},
"condition": {
"age": 30,
"matchesStorageClass": ["STANDARD"]
}
},
{
"action": {
"type": "Delete"
},
"condition": {
"age": 365
}
}
]
}
}
EOF
# Apply lifecycle policy to bucket
gsutil lifecycle set lifecycle.json gs://my-bucket
# View current lifecycle policy
gsutil lifecycle get gs://my-bucket
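Hand-written lifecycle JSON is easy to get wrong, so it can help to generate the file. The sketch below builds the same structure as the `lifecycle.json` shown above; the helper names are hypothetical, and only the fields used in that example are covered.

```python
import json


def set_storage_class_rule(age_days, storage_class, matches=None):
    """Build a SetStorageClass rule, optionally limited to certain current classes."""
    condition = {"age": age_days}
    if matches:
        condition["matchesStorageClass"] = list(matches)
    return {
        "action": {"type": "SetStorageClass", "storageClass": storage_class},
        "condition": condition,
    }


def delete_rule(age_days):
    """Build a Delete rule that fires once an object reaches age_days."""
    return {"action": {"type": "Delete"}, "condition": {"age": age_days}}


def render_lifecycle(rules):
    """Serialize rules into the JSON layout used by the example file above."""
    return json.dumps({"lifecycle": {"rule": list(rules)}}, indent=2)
```

Writing the output of `render_lifecycle([...])` to `lifecycle.json` yields a file that can then be applied with the `gsutil lifecycle set` command shown above.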
Real-World Example: Multi-Region Data Lake Architecture
This example sketches a four-zone data lake on Cloud Storage, secured with CMEK, a VPC Service Controls perimeter, and per-zone IAM bindings:
Architecture Overview
- Landing Zone: Raw data ingestion bucket
- Processing Zone: Data transformation and staging
- Curated Zone: Processed, high-quality data
- Archive Zone: Long-term, cold storage
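The four zones map naturally onto one bucket each. As a sketch, assuming the `<project_id>-<zone>-zone` naming pattern used by the Terraform in this example, application code could derive and parse bucket names as follows; both helper functions are hypothetical.

```python
ZONES = ("landing", "processing", "curated", "archive")


def zone_bucket_name(project_id: str, zone: str) -> str:
    """Return the bucket name for a data lake zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{project_id}-{zone}-zone"


def parse_zone_bucket(bucket_name: str) -> tuple[str, str]:
    """Split a zone bucket name back into (project_id, zone)."""
    # Split from the right because project IDs may themselves contain hyphens
    project_id, zone, suffix = bucket_name.rsplit("-", 2)
    if suffix != "zone" or zone not in ZONES:
        raise ValueError(f"not a zone bucket: {bucket_name!r}")
    return project_id, zone
```

Keeping the naming rule in one place avoids each pipeline stage re-implementing it slightly differently.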
Terraform Implementation
provider "google" {
project = var.project_id
region = "us-central1"
}
# Create VPC with private access
resource "google_compute_network" "data_lake_network" {
name = "data-lake-network"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "data_lake_subnet" {
name = "data-lake-subnet"
ip_cidr_range = "10.0.0.0/16"
region = "us-central1"
network = google_compute_network.data_lake_network.id
# Enable Private Google Access
private_ip_google_access = true
}
# Create VPC Service Controls perimeter
resource "google_access_context_manager_service_perimeter" "data_perimeter" {
parent = "accessPolicies/${google_access_context_manager_access_policy.data_policy.name}"
name = "accessPolicies/${google_access_context_manager_access_policy.data_policy.name}/servicePerimeters/data_lake_perimeter"
title = "Data Lake Perimeter"
status {
resources = ["projects/${var.project_id}"]
restricted_services = ["storage.googleapis.com"]
ingress_policies {
ingress_from {
identities = [
"serviceAccount:${google_service_account.data_processor.email}",
]
}
ingress_to {
resources = ["*"]
operations {
service_name = "storage.googleapis.com"
method_selectors {
method = "*"
}
}
}
}
}
}
resource "google_access_context_manager_access_policy" "data_policy" {
parent = "organizations/${var.organization_id}"
title = "Data Lake Access Policy"
}
# Service Account for data processing
resource "google_service_account" "data_processor" {
account_id = "data-processor"
display_name = "Data Lake Processing Service Account"
}
# KMS for encryption
resource "google_kms_key_ring" "data_lake_keyring" {
name = "data-lake-keyring"
location = "us" # CMEK keys must live in the same location as the buckets (US multi-region)
}
resource "google_kms_crypto_key" "data_lake_key" {
name = "data-lake-key"
key_ring = google_kms_key_ring.data_lake_keyring.id
# Rotation settings
rotation_period = "7776000s" # 90 days
# Protect against destruction
lifecycle {
prevent_destroy = true
}
}
# Grant KMS access to service account
resource "google_kms_crypto_key_iam_binding" "data_lake_key_binding" {
crypto_key_id = google_kms_crypto_key.data_lake_key.id
role = "roles/cloudkms.cryptoKeyEncrypterDecrypter"
members = [
"serviceAccount:${data.google_storage_project_service_account.gcs_account.email_address}",
]
}
data "google_storage_project_service_account" "gcs_account" {}
# Create buckets for the data lake zones
resource "google_storage_bucket" "landing_zone" {
name = "${var.project_id}-landing-zone"
location = "US"
storage_class = "STANDARD"
# Security settings
uniform_bucket_level_access = true
public_access_prevention = "enforced"
# Set CMEK encryption
encryption {
default_kms_key_name = google_kms_crypto_key.data_lake_key.id
}
# Lifecycle policies
lifecycle_rule {
condition {
age = 7
}
action {
type = "Delete"
}
}
# Make Terraform refuse any plan that would destroy this bucket
lifecycle {
prevent_destroy = true
}
# Logging configuration
logging {
log_bucket = google_storage_bucket.logs.name
log_object_prefix = "landing-zone"
}
}
resource "google_storage_bucket" "processing_zone" {
name = "${var.project_id}-processing-zone"
location = "US"
storage_class = "STANDARD"
uniform_bucket_level_access = true
public_access_prevention = "enforced"
encryption {
default_kms_key_name = google_kms_crypto_key.data_lake_key.id
}
# Transition to Nearline after 30 days
lifecycle_rule {
condition {
age = 30
}
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
}
# Delete after 60 days
lifecycle_rule {
condition {
age = 60
}
action {
type = "Delete"
}
}
logging {
log_bucket = google_storage_bucket.logs.name
log_object_prefix = "processing-zone"
}
}
resource "google_storage_bucket" "curated_zone" {
name = "${var.project_id}-curated-zone"
location = "US"
storage_class = "STANDARD"
uniform_bucket_level_access = true
public_access_prevention = "enforced"
# Enable versioning for data protection
versioning {
enabled = true
}
encryption {
default_kms_key_name = google_kms_crypto_key.data_lake_key.id
}
# Lifecycle management
lifecycle_rule {
condition {
age = 90
}
action {
type = "SetStorageClass"
storage_class = "NEARLINE"
}
}
lifecycle_rule {
condition {
age = 365
}
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
}
# Delete non-current versions 30 days after they become non-current
lifecycle_rule {
condition {
days_since_noncurrent_time = 30 # "age" would count from object creation instead
with_state = "ARCHIVED"
}
action {
type = "Delete"
}
}
logging {
log_bucket = google_storage_bucket.logs.name
log_object_prefix = "curated-zone"
}
}
resource "google_storage_bucket" "archive_zone" {
name = "${var.project_id}-archive-zone"
location = "US"
storage_class = "ARCHIVE"
uniform_bucket_level_access = true
public_access_prevention = "enforced"
# Retention policy: objects cannot be deleted or overwritten until they are 1 year old
retention_policy {
retention_period = 31536000 # 1 year in seconds
}
encryption {
default_kms_key_name = google_kms_crypto_key.data_lake_key.id
}
logging {
log_bucket = google_storage_bucket.logs.name
log_object_prefix = "archive-zone"
}
}
# Create bucket for access logs
resource "google_storage_bucket" "logs" {
name = "${var.project_id}-access-logs"
location = "US"
storage_class = "STANDARD"
uniform_bucket_level_access = true
public_access_prevention = "enforced"
# Set lifecycle for logs
lifecycle_rule {
condition {
age = 90
}
action {
type = "SetStorageClass"
storage_class = "COLDLINE"
}
}
lifecycle_rule {
condition {
age = 365
}
action {
type = "Delete"
}
}
}
# IAM permissions for the buckets
resource "google_storage_bucket_iam_binding" "landing_zone_writer" {
bucket = google_storage_bucket.landing_zone.name
role = "roles/storage.objectCreator"
members = [
"serviceAccount:${google_service_account.data_ingestion.email}",
]
}
# The processing service account reads raw data from the landing zone bucket
resource "google_storage_bucket_iam_binding" "processing_zone_reader" {
bucket = google_storage_bucket.landing_zone.name
role = "roles/storage.objectViewer"
members = [
"serviceAccount:${google_service_account.data_processor.email}",
]
}
resource "google_storage_bucket_iam_binding" "processing_zone_writer" {
bucket = google_storage_bucket.processing_zone.name
role = "roles/storage.objectAdmin"
members = [
"serviceAccount:${google_service_account.data_processor.email}",
]
}
resource "google_storage_bucket_iam_binding" "curated_zone_writer" {
bucket = google_storage_bucket.curated_zone.name
role = "roles/storage.objectAdmin"
members = [
"serviceAccount:${google_service_account.data_processor.email}",
]
}
resource "google_storage_bucket_iam_binding" "curated_zone_viewer" {
bucket = google_storage_bucket.curated_zone.name
role = "roles/storage.objectViewer"
members = [
"serviceAccount:${google_service_account.data_analyst.email}",
"group:data-analysts@example.com",
]
}
resource "google_storage_bucket_iam_binding" "archive_zone_writer" {
bucket = google_storage_bucket.archive_zone.name
role = "roles/storage.objectAdmin"
members = [
"serviceAccount:${google_service_account.data_processor.email}",
]
}
# Additional service accounts
resource "google_service_account" "data_ingestion" {
account_id = "data-ingestion"
display_name = "Data Ingestion Service Account"
}
resource "google_service_account" "data_analyst" {
account_id = "data-analyst"
display_name = "Data Analyst Service Account"
}
# Notification configuration for new file arrivals
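resource "google_storage_notification" "landing_zone_notification" {
bucket = google_storage_bucket.landing_zone.name
payload_format = "JSON_API_V1"
topic = google_pubsub_topic.landing_zone_notifications.id
event_types = ["OBJECT_FINALIZE"]
# The Cloud Storage service agent must be allowed to publish before the notification is created
depends_on = [google_pubsub_topic_iam_binding.landing_zone_publisher]
}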
resource "google_pubsub_topic" "landing_zone_notifications" {
name = "landing-zone-notifications"
}
resource "google_pubsub_topic_iam_binding" "landing_zone_publisher" {
topic = google_pubsub_topic.landing_zone_notifications.name
role = "roles/pubsub.publisher"
members = [
"serviceAccount:${data.google_storage_project_service_account.gcs_account.email_address}",
]
}
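With `payload_format = "JSON_API_V1"`, each notification carries the bucket and object names as Pub/Sub message attributes and the object resource as a JSON payload. The sketch below shows how a subscriber might unpack such a message, assuming the event shape delivered to a Pub/Sub-triggered background Cloud Function (attributes as a dict, data as a base64 string); `parse_notification` is a hypothetical helper.

```python
import base64
import json


def parse_notification(event: dict) -> dict:
    """Extract the fields a pipeline typically needs from a storage notification."""
    attributes = event.get("attributes", {})
    payload = {}
    if event.get("data"):
        # The message body is the base64-encoded JSON object resource
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return {
        "event_type": attributes.get("eventType"),
        "bucket": attributes.get("bucketId"),
        "object": attributes.get("objectId"),
        # The JSON API reports size as a string, so convert it when present
        "size": int(payload["size"]) if "size" in payload else None,
    }
```

Reading names from the attributes rather than the payload keeps the handler working even if the payload format is later changed to `NONE`.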
Data Lifecycle Automation Script
# data_lifecycle.py
import datetime
import logging

from google.cloud import storage

PROCESSED_SUFFIX = ".processed"


def move_processed_data(event, context):
    """Cloud Function triggered by Pub/Sub to move processed data."""
    # Bucket and object names arrive as Pub/Sub message attributes
    bucket_name = event["attributes"]["bucketId"]
    object_name = event["attributes"]["objectId"]
    if not object_name.endswith(PROCESSED_SUFFIX):
        return

    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    # get_blob() loads the object's metadata; bucket.blob() alone would leave it unset
    source_blob = source_bucket.get_blob(object_name)
    if source_blob is None:
        logging.warning("Object %s no longer exists in %s", object_name, bucket_name)
        return

    # Custom metadata is None when the object has none
    data_type = (source_blob.metadata or {}).get("data_type", "unknown")

    # Bucket names follow "<project_id>-<zone>-zone"; project IDs may contain hyphens
    prefix = bucket_name.rsplit("-", 2)[0]
    base_name = object_name[: -len(PROCESSED_SUFFIX)]
    now = datetime.datetime.now(datetime.timezone.utc)

    # Determine target bucket and path based on data type
    if data_type == "report":
        dest_bucket_name = f"{prefix}-curated-zone"
        dest_path = f"reports/{now:%Y/%m/%d}/{base_name}"
    elif data_type == "archive":
        dest_bucket_name = f"{prefix}-archive-zone"
        dest_path = f"{now:%Y/%m}/{base_name}"
    else:
        dest_bucket_name = f"{prefix}-curated-zone"
        dest_path = f"other/{base_name}"

    # Copy to the destination, then delete the original after a successful copy
    dest_bucket = storage_client.bucket(dest_bucket_name)
    source_bucket.copy_blob(source_blob, dest_bucket, dest_path)
    source_blob.delete()
    logging.info("Moved %s to %s/%s", object_name, dest_bucket_name, dest_path)
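The routing rules in the script above (reports go to the curated zone under a dated `reports/` path, archives go to the archive zone under a year and month path, everything else goes to the curated zone under `other/`) can be isolated into a pure function that is testable without any Cloud Storage access. This is a sketch under the assumption that bucket names follow the `<project_id>-<zone>-zone` pattern from the Terraform; `plan_destination` is a hypothetical name, and the prefix is derived with `rsplit` so that project IDs containing hyphens are preserved.

```python
import datetime

PROCESSED_SUFFIX = ".processed"


def plan_destination(bucket_name, object_name, data_type, now):
    """Return (destination bucket, destination path) for a processed object."""
    # Strip the trailing "-<zone>-zone" to recover the shared bucket prefix
    prefix = bucket_name.rsplit("-", 2)[0]
    # Remove only the trailing marker, not other occurrences in the name
    if object_name.endswith(PROCESSED_SUFFIX):
        base_name = object_name[: -len(PROCESSED_SUFFIX)]
    else:
        base_name = object_name
    if data_type == "report":
        return f"{prefix}-curated-zone", f"reports/{now:%Y/%m/%d}/{base_name}"
    if data_type == "archive":
        return f"{prefix}-archive-zone", f"{now:%Y/%m}/{base_name}"
    return f"{prefix}-curated-zone", f"other/{base_name}"
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the result deterministic in unit tests.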
Best Practices
- Bucket Naming and Organization
- Choose globally unique, DNS-compliant names
- Use consistent naming conventions
- Organize objects with clear prefix hierarchy
- Consider regional requirements for data storage
- Security
- Enable uniform bucket-level access
- Use VPC Service Controls for sensitive data
- Apply appropriate IAM roles with least privilege
- Enforce public access prevention
- Use CMEK for regulated data
- Enable object holds for compliance
- Cost Optimization
- Choose appropriate storage classes for data access patterns
- Implement lifecycle policies for automatic transitions
- Use composite objects for small files
- Monitor usage with Cloud Monitoring
- Consider requester pays for shared datasets
- Performance
- Store frequently accessed data in regions close to users
- Use parallel composite uploads for large files
- Avoid small, frequent operations
- Use signed URLs for temporary access
- Implement connection pooling in applications
- Data Management
- Enable object versioning for critical data
- Configure access logs for audit trails
- Use object metadata for classification
- Set up notifications for bucket events
- Implement retention policies for compliance
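As a sketch of the bucket naming guidance in the list above, the following check covers the main documented naming rules: allowed characters, length limits, no IP-address-style names, and no `goog` prefix or `google` substring. `bucket_name_problems` is a hypothetical helper and does not cover every rule, such as close misspellings of "google" or global uniqueness.

```python
import re

_ALLOWED = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")
_IP_LIKE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def bucket_name_problems(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []
    # Names with dots may be longer, but each dot-separated part is capped at 63
    max_len = 222 if "." in name else 63
    if not 3 <= len(name) <= max_len:
        problems.append(f"length must be between 3 and {max_len} characters")
    if any(len(part) > 63 for part in name.split(".")):
        problems.append("each dot-separated component must be at most 63 characters")
    if not _ALLOWED.fullmatch(name):
        problems.append("use lowercase letters, digits, '-', '_', '.'; start and end with a letter or digit")
    if _IP_LIKE.fullmatch(name):
        problems.append("must not look like an IP address")
    if name.startswith("goog") or "google" in name:
        problems.append("must not start with 'goog' or contain 'google'")
    return problems
```

Running a check like this in CI, before `terraform apply`, surfaces naming mistakes earlier than a failed bucket creation would.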
Common Issues and Troubleshooting
Access Denied Errors
- Verify IAM permissions and roles
- Check for VPC Service Controls blocking access
- Ensure service accounts have proper permissions
- Validate CMEK access for encrypted buckets
- Check organization policies for restrictions
Performance Issues
- Review network configuration for private Google access
- Ensure proper region selection for proximity to users
- Monitor request rates and throttling
- Check object naming patterns for hotspots
- Optimize upload/download processes
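One common cause of the naming hotspots mentioned above is sequential object names, such as timestamps or counters, which concentrate requests on a narrow key range. A frequently suggested mitigation is to prepend a short hash to each name. The sketch below illustrates the idea; `spread_object_name`, the separator, and the prefix length are all assumptions.

```python
import hashlib


def spread_object_name(name: str, prefix_len: int = 6) -> str:
    """Prefix a name with a short, deterministic hash to spread sequential names."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # The same input always yields the same prefix, so readers can recompute it
    return f"{digest[:prefix_len]}_{name}"
```

The trade-off is that objects no longer list in chronological order, so this suits workloads that look objects up by exact name rather than by time range.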
Cost Management
- Review storage distribution across classes
- Check lifecycle policies for effectiveness
- Look for large non-current object versions that are no longer needed
- Watch for unexpected egress charges
- Verify requester-pays configuration
Data Management
- Validate versioning is working as expected
- Check retention policy effectiveness
- Monitor object holds and legal holds
- Verify notification configurations
- Ensure backups are properly configured