freundcloud

Terraform

Create a file named auth.tf, and add the following content to the file. This configuration initializes the Databricks Terraform provider and authenticates Terraform with your workspace.

To authenticate with a Databricks CLI configuration profile, add the following content:

```hcl variable “databricks_connection_profile” { description = “The name of the Databricks connection profile to use.” type = string }

Initialize the Databricks Terraform provider.

terraform { required_providers { databricks = { source = “databricks/databricks” } } }

Use Databricks CLI authentication.

provider “databricks” { profile = var.databricks_connection_profile }

Retrieve information about the current user.

data “databricks_current_user” “me” {} ```plaintext

To authenticate with environment variables, add the following content instead.

```hcl

Initialize the Databricks Terraform provider.

terraform { required_providers { databricks = { source = “databricks/databricks” } } }

Use environment variables for authentication.

provider “databricks” {}

Retrieve information about the current user.

data “databricks_current_user” “me” {} ```plaintext

To authenticate with the Azure CLI, add the following content instead:

```hcl variable “databricks_host” { description = “The Azure Databricks workspace URL.” type = string }

Initialize the Databricks Terraform provider.

terraform { required_providers { databricks = { source = “databricks/databricks” } } }

Use Azure CLI authentication.

provider “databricks” { host = var.databricks_host }

Retrieve information about the current user.

data “databricks_current_user” “me” {} ```plaintext

Create another file named auth.auto.tfvars, and add the following content to the file. This file contains variable values for authenticating Terraform with your workspace. Replace the placeholder values with your own values.

To authenticate with a Databricks CLI configuration profile, add the following content:

Copy

hcl databricks_connection_profile = "DEFAULT" plaintext

To authenticate with the Azure CLI, add the following content instead:

Copy

hcl databricks_host = "https://<workspace-instance-name>" plaintext

To authenticate with with environment variables, you do not need an auth.auto.tfvars file.

bash terraform init plaintext

Create another file named cluster.tf, and add the following content to the file. This content creates a cluster with the smallest amount of resources allowed. This cluster uses the lastest Databricks Runtime Long Term Support (LTS) version.

For a cluster that works with Unity Catalog:

```hcl variable “cluster_name” {} variable “cluster_autotermination_minutes” {} variable “cluster_num_workers” {} variable “cluster_data_security_mode” {}

Create the cluster with the “smallest” amount

of resources allowed.

data “databricks_node_type” “smallest” { local_disk = true }

Use the latest Databricks Runtime

Long Term Support (LTS) version.

data “databricks_spark_version” “latest_lts” { long_term_support = true }

resource “databricks_cluster” “this” { cluster_name = var.cluster_name node_type_id = data.databricks_node_type.smallest.id spark_version = data.databricks_spark_version.latest_lts.id autotermination_minutes = var.cluster_autotermination_minutes num_workers = var.cluster_num_workers data_security_mode = var.cluster_data_security_mode }

output “cluster_url” { value = databricks_cluster.this.url } ```plaintext

For an all-purpose cluster:

```hcl variable “cluster_name” { description = “A name for the cluster.” type = string default = “My Cluster” }

variable “cluster_autotermination_minutes” { description = “How many minutes before automatically terminating due to inactivity.” type = number default = 60 }

variable “cluster_num_workers” { description = “The number of workers.” type = number default = 1 }

Create the cluster with the “smallest” amount

of resources allowed.

data “databricks_node_type” “smallest” { local_disk = true }

Use the latest Databricks Runtime

Long Term Support (LTS) version.

data “databricks_spark_version” “latest_lts” { long_term_support = true }

resource “databricks_cluster” “this” { cluster_name = var.cluster_name node_type_id = data.databricks_node_type.smallest.id spark_version = data.databricks_spark_version.latest_lts.id autotermination_minutes = var.cluster_autotermination_minutes num_workers = var.cluster_num_workers }

output “cluster_url” { value = databricks_cluster.this.url } ```plaintext

Create another file named cluster.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the cluster. Replace the placeholder values with your own values.

For a cluster that works with Unity Catalog:

hcl cluster_name = "My Cluster" cluster_autotermination_minutes = 60 cluster_num_workers = 1 cluster_data_security_mode = "SINGLE_USER" plaintext

For an all-purpose cluster:

hcl cluster_name = "My Cluster" cluster_autotermination_minutes = 60 cluster_num_workers = 1 plaintext

Create another file named notebook.tf, and add the following content to the file:

```hcl variable “notebook_subdirectory” { description = “A name for the subdirectory to store the notebook.” type = string default = “Terraform” }

variable “notebook_filename” { description = “The notebook’s filename.” type = string }

variable “notebook_language” { description = “The language of the notebook.” type = string }

resource “databricks_notebook” “this” { path = “${data.databricks_current_user.me.home}/${var.notebook_subdirectory}/${var.notebook_filename}” language = var.notebook_language source = “./${var.notebook_filename}” }

output “notebook_url” { value = databricks_notebook.this.url } ```plaintext


For the Python notebook a file named notebook-getting-started-lakehouse-e2e.py with the following contents:

```python

Databricks notebook source

external_location = “" catalog = ""

dbutils.fs.put(f”{external_location}/foobar.txt”, “Hello world!”, True) display(dbutils.fs.head(f”{external_location}/foobar.txt”)) dbutils.fs.rm(f”{external_location}/foobar.txt”)

display(spark.sql(f”SHOW SCHEMAS IN {catalog}”))

COMMAND ———-

from pyspark.sql.functions import col

Set parameters for isolation in workspace and reset demo

username = spark.sql(“SELECT regexp_replace(current_user(), ‘[^a-zA-Z0-9]’, ‘’)”).first()[0] database = f”{catalog}.e2e_lakehouse{username}_db” source = f”{external_location}/e2e-lakehouse-source” table = f”{database}.target_table” checkpoint_path = f”{external_location}/_checkpoint/e2e-lakehouse-demo”

spark.sql(f”SET c.username=’{username}’”) spark.sql(f”SET c.database={database}”) spark.sql(f”SET c.source=’{source}’”)

spark.sql(“DROP DATABASE IF EXISTS ${c.database} CASCADE”) spark.sql(“CREATE DATABASE ${c.database}”) spark.sql(“USE ${c.database}”)

Clear out data from previous demo execution

dbutils.fs.rm(source, True) dbutils.fs.rm(checkpoint_path, True)

Define a class to load batches of data to source

class LoadData:

def init(self, source): self.source = source

def get_date(self): try: df = spark.read.format(“json”).load(source) except: return “2016-01-01” batch_date = df.selectExpr(“max(distinct(date(tpep_pickup_datetime))) + 1 day”).first()[0] if batch_date.month == 3: raise Exception(“Source data exhausted”) return batch_date

def get_batch(self, batch_date): return ( spark.table(“samples.nyctaxi.trips”) .filter(col(“tpep_pickup_datetime”).cast(“date”) == batch_date) )

def write_batch(self, batch): batch.write.format(“json”).mode(“append”).save(self.source)

def land_batch(self): batch_date = self.get_date() batch = self.get_batch(batch_date) self.write_batch(batch)

RawData = LoadData(source)

COMMAND ———-

RawData.land_batch()

COMMAND ———-

Import functions

from pyspark.sql.functions import input_file_name, current_timestamp

Configure Auto Loader to ingest JSON data to a Delta table

(spark.readStream .format(“cloudFiles”) .option(“cloudFiles.format”, “json”) .option(“cloudFiles.schemaLocation”, checkpoint_path) .load(file_path) .select(“*”, input_file_name().alias(“source_file”), current_timestamp().alias(“processing_time”)) .writeStream .option(“checkpointLocation”, checkpoint_path) .trigger(availableNow=True) .option(“mergeSchema”, “true”) .toTable(table))

COMMAND ———-

df = spark.read.table(table_name)

COMMAND ———-

display(df) ```plaintext

For the Python notebook a file named notebook-quickstart-create-databricks-workspace-portal.py with the following contents:

```python

Databricks notebook source

blob_account_name = “azureopendatastorage” blob_container_name = “citydatacontainer” blob_relative_path = “Safety/Release/city=Seattle” blob_sas_token = r””

COMMAND ———-

wasbs_path = ‘wasbs://%s@%s.blob.core.windows.net/%s’ % (blob_container_name, blob_account_name,blob_relative_path) spark.conf.set(‘fs.azure.sas.%s.%s.blob.core.windows.net’ % (blob_container_name, blob_account_name), blob_sas_token) print(‘Remote blob path: ‘ + wasbs_path)

COMMAND ———-

df = spark.read.parquet(wasbs_path) print(‘Register the DataFrame as a SQL temporary view: source’) df.createOrReplaceTempView(‘source’)

COMMAND ———-

print(‘Displaying top 10 rows: ‘) display(spark.sql(‘SELECT * FROM source LIMIT 10’)) ```plaintext

If you are creating the notebook, create another file named notebook.auto.tfvars, and add the following content to the file. This file contains variable values for customizing the notebook configuration.

For the Python notebook:

hcl notebook_subdirectory = "Terraform" notebook_filename = "notebook-getting-started-lakehouse-e2e.py" notebook_language = "PYTHON" plaintext

For the Python notebook:

hcl notebook_subdirectory = "Terraform" notebook_filename = "notebook-quickstart-create-databricks-workspace-portal.py" notebook_language = "PYTHON" plaintext

If you are creating a notebook, in your Azure Databricks workspace, be sure to set up any requirements for the notebook to run successfully.

If you are creating a job, create another file named job.tf, and add the following content to the file. This content creates a job to run the notebook.

```hcl variable “job_name” { description = “A name for the job.” type = string default = “My Job” }

resource “databricks_job” “this” { name = var.job_name existing_cluster_id = databricks_cluster.this.cluster_id notebook_task { notebook_path = databricks_notebook.this.path } email_notifications { on_success = [ data.databricks_current_user.me.user_name ] on_failure = [ data.databricks_current_user.me.user_name ] } }

output “job_url” { value = databricks_job.this.url } ```plaintext

If you are creating the job, create another file named job.auto.tfvars, and add the following content to the file. This file contains a variable value for customizing the job configuration.

Copy

hcl job_name = "My Job" plaintext

plaintext terraform validate plaintext

plaintext terraform plan plaintext

plaintext terraform apply plaintext

\

\