
ADF + Databricks Pipeline Configuration Guide

This guide walks through every configuration step required to set up a complete BimlFlex pipeline that uses Azure Data Factory for orchestration and Databricks for compute and processing. Follow these steps in order for your first project, then use individual sections as a reference for subsequent projects.

Introduction

Prerequisites

Before you begin, confirm the following are in place:

  • Azure Subscription with permissions to create resources in Azure Data Factory and Azure Databricks
  • Databricks Workspace provisioned in the same Azure region as your storage accounts
  • Azure Blob Storage or ADLS Gen2 account for staging and landing data
  • BimlFlex installation (BimlFlex App and BimlStudio) at version 2026 or later
  • BimlCatalog database deployed to Azure SQL Database (used for orchestration logging)
  • ODBC or JDBC driver for Databricks connectivity from your local workstation (used by the Metadata Importer):
    • ODBC: Install the Simba Databricks ODBC driver and configure a system DSN. Required if you want to import metadata or test connectivity from BimlFlex directly.
    • JDBC: No local driver installation required. Choose JDBC when targeting serverless compute, since serverless clusters cannot run init scripts for ODBC driver installation.

Two Processing Approaches

BimlFlex supports two processing approaches when orchestrating Databricks through ADF:

| Approach | ADF Activity Type | When to Use |
| --- | --- | --- |
| Traditional Notebook | ADF Notebook Activity | Simpler setup; each notebook is invoked individually by ADF |
| Pushdown Processing (recommended) | ADF Databricks Job Activity | Higher performance; all transformation logic runs inside Databricks workflows with built-in dependency management and restart logic |

Pushdown Processing is new in BimlFlex 2026 and is the recommended approach for new projects. It generates Databricks Asset Bundles (DAB) for deployment and supports serverless compute when combined with JDBC mode.

Step 1: Create Your Project

In the BimlFlex App, create a new project and select the Databricks (ADF) template (IntegrationTemplateId = 5).

Connection Slots

The project template defines the following connection slots. Configure each one in the Connections editor before building.

| Connection | Required? | Notes |
| --- | --- | --- |
| Source | Yes | Your source database (SQL Server, Oracle, MySQL, etc.) |
| Landing | Conditional | Required unless PushdownExtraction is enabled. Typically Azure Blob Storage configured as flat-file delimited. |
| Stage | Yes | Databricks target for staging tables. Set System Type to Databricks Data Warehouse. |
| PSA | Optional | Persistent Staging Area for historized bronze-layer retention. |
| Target | Yes | Databricks connection for Data Vault and/or Data Mart layers. |
| Compute | Yes | Databricks workspace connection used for linked service generation. Validation PRJ_28005005 will block the build if this is missing. |

PushdownProcessing

Enable PushdownProcessing on the project to change the pipeline architecture. When enabled:

  • ADF uses a Databricks Job Activity instead of individual Notebook Activities
  • BimlFlex generates Databricks Asset Bundle (DAB) YAML files that define workflows, tasks, and dependencies
  • Cluster assignment is controlled by the DatabricksJobCluster setting (job cluster, existing cluster, or serverless)
  • Built-in restart logic skips already-completed ADF Copy Activities on retry

PushdownExtraction

Enable PushdownExtraction on the project to skip the ADF Copy Activity and read source data directly from within Databricks. When enabled:

  • The Landing connection slot is hidden (no blob staging is needed)
  • Databricks reads the source system directly, which is most useful when the source is also a Databricks catalog or a cloud-accessible database
  • Combines naturally with PushdownProcessing for a fully Databricks-native pipeline

Step 2: Configure Connections

Open the Connections editor and configure each connection assigned to your project.

Key Fields for Databricks Connections

| Field | Value | Notes |
| --- | --- | --- |
| Connection Type | ODBC | Use ODBC for all Databricks target connections |
| System Type | Databricks Data Warehouse | Required for all Databricks connections |
| Linked Service Type | Databricks | Enables the Databricks-specific linked service form |
| Catalog | Your Unity Catalog name | Maps to the Databricks catalog. When DatabricksUseUnityCatalog = Y, this value appears in all qualified table names. |
| Cloud | Checked | Must be enabled for all cloud-based connections |

ExternalLocation

When DatabricksUseUnityCatalog = Y and DatabricksUseManagedTables = N, you must specify an ExternalLocation on the connection. This tells Databricks where to store the underlying data files for external (non-managed) tables.

warning

Validation CON_21005007 will block the build if ExternalLocation is missing when Unity Catalog is enabled with non-managed tables. Set the ExternalLocation field to your ADLS Gen2 path, for example: abfss://container@storageaccount.dfs.core.windows.net/path.
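To make the effect of ExternalLocation concrete, the sketch below builds the kind of external-table DDL this configuration produces. The function name, the column list, and the per-table path layout are illustrative assumptions, not BimlFlex's actual generator; only the `LOCATION` clause pattern is the point.

```python
def external_table_ddl(catalog: str, schema: str, table: str,
                       external_location: str) -> str:
    """Sketch of a Unity Catalog external-table definition: the data
    files live under the connection's ExternalLocation rather than in
    Databricks-managed storage."""
    # Hypothetical layout: one folder per schema/table under the location.
    location = f"{external_location.rstrip('/')}/{schema}/{table}"
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} (\n"
        "  customer_code STRING\n"  # hypothetical column
        ") USING DELTA\n"
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "main", "stage", "customer",
    "abfss://container@storageaccount.dfs.core.windows.net/path",
)
```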

Compute Connection

The Compute connection represents the Databricks workspace itself (not a database within it). Configure it as follows:

| Field | Value |
| --- | --- |
| Connection Type | ODBC |
| System Type | Databricks Data Warehouse |
| Linked Service Type | Databricks |
| Cloud | Checked |

This connection is referenced by the project for generating the ADF Databricks linked service. It must include the workspace URL and authentication details in its linked service configuration.

Step 3: Configure Linked Services

Open the Connections editor, select your Compute connection, and configure its Linked Service settings.

Authentication Methods

| Auth Method | When to Use | Key Fields |
| --- | --- | --- |
| Access Token | Simple setup, development and test environments | LS_DatabricksAccessToken (plaintext) or store in Azure Key Vault (recommended) |
| Managed Identity | Production environments, no secrets to manage | LS_DatabricksWorkspaceResourceId (the Azure resource ID of the Databricks workspace) |

tip

For production deployments, use Managed Identity authentication. This removes the need to rotate access tokens and simplifies secret management. The ADF Managed Identity must be granted Contributor access to the Databricks workspace.

New Cluster Configuration Fields

When the linked service creates new clusters for notebook execution, configure these fields:

| Field | Description | Example |
| --- | --- | --- |
| Version | Databricks Runtime version | 15.4.x-scala2.12 |
| NodeType | VM size for worker nodes | Standard_DS3_v2 |
| NumOfWorker | Number of worker nodes (0 = single-node) | 2 |
| SparkConf | Spark configuration key-value pairs | spark.speculation true |
| SparkEnvVars | Environment variables for the cluster | PYSPARK_PYTHON /databricks/python3/bin/python3 |
| InitScripts | Cluster init scripts (required for ODBC mode) | dbfs:/databricks/scripts/bfx_init_odbc.sh |
| CustomTags | Azure tags applied to the cluster | CostCenter 12345 |
| LogDestination | DBFS path for cluster log delivery | dbfs:/cluster-logs |

When using an Instance Pool, specify the pool ID instead of NodeType. The pool provides pre-warmed VMs for faster cluster startup.
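These fields map onto the `newCluster*` properties of the ADF AzureDatabricks linked service. The fragment below sketches the resulting typeProperties as a Python dict; the property names follow the public ADF ARM schema as best understood here, and all values (including the pool ID) are placeholders from the table above.

```python
# Hypothetical typeProperties fragment for an ADF "AzureDatabricks"
# linked service that creates a new job cluster per activity run.
new_cluster_properties = {
    "newClusterVersion": "15.4.x-scala2.12",    # Databricks Runtime (Version)
    "newClusterNodeType": "Standard_DS3_v2",    # worker VM size (NodeType)
    "newClusterNumOfWorker": "2",               # worker count (NumOfWorker)
    "newClusterSparkConf": {"spark.speculation": "true"},
    "newClusterSparkEnvVars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
    "newClusterInitScripts": ["dbfs:/databricks/scripts/bfx_init_odbc.sh"],
    "newClusterCustomTags": {"CostCenter": "12345"},
    "newClusterLogDestination": "dbfs:/cluster-logs",
}

# With an Instance Pool, the pool ID replaces the node type:
pooled = {**new_cluster_properties, "instancePoolId": "pool-0123456789"}
pooled.pop("newClusterNodeType")
```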

note

Init scripts are required when using ODBC mode (DatabricksUtilsDriver = ODBC) because the Simba ODBC driver must be installed on every cluster node. When using JDBC mode, no init scripts are needed and you can target serverless compute.

Step 4: Azure Storage Settings

Azure Blob Storage settings define where ADF lands extracted data before Databricks processes it. These settings follow the same pattern used for Snowflake and other cloud targets.

Configure the following settings in the BimlFlex Settings editor under the Azure category:

| Setting | Description | Example |
| --- | --- | --- |
| AzureStageContainer | Container name for staging files | stage |
| AzureStageAccountName | Storage account name | bfxblobaccount |
| AzureStageAccountKey | Storage access key | <StorageAccountKey>== |
| AzureStageSasToken | SAS token (alternative to account key) | ?<SasToken> |
| AzureArchiveContainer | Container for archived files | archive |
| AzureArchiveAccountName | Storage account for archive | bfxblobaccount |
| AzureErrorContainer | Container for error files | error |
| AzureErrorAccountName | Storage account for errors | bfxblobaccount |
| AzureBlobStorageDomain | Storage domain suffix | blob.core.windows.net |

tip

You can use the same storage account for the stage, archive, and error containers. Using separate containers within a single account is the most common configuration.
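For reference, the endpoint a container resolves to is composed from the account name, the AzureBlobStorageDomain suffix, and the container name. A quick sketch using the example values above (the helper function is illustrative):

```python
def container_url(account_name: str, container: str,
                  domain: str = "blob.core.windows.net") -> str:
    """Compose the HTTPS endpoint for a blob container from the
    Azure storage settings."""
    return f"https://{account_name}.{domain}/{container}"

stage_url = container_url("bfxblobaccount", "stage")
```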

Step 5: Databricks Settings

This is the most extensive configuration area. Databricks settings are managed in the BimlFlex Settings editor and control everything from driver selection to table DDL to deployment paths. For the full settings reference, see the Databricks settings index.

Architecture Settings

These settings determine the fundamental processing architecture of your pipeline.

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksUtilsDriver | ODBC | Choose ODBC or JDBC. JDBC enables serverless compute because it uses the pure-Python pytds library, removing the dependency on cluster-level ODBC driver installation. This is a critical architectural choice that affects cluster configuration, init scripts, and deployment artifacts. |
| DatabricksJobCluster | (none) | Cluster name, cluster ID, or the literal value Serverless. Controls the DAB YAML cluster configuration. When set to Serverless, BimlFlex omits cluster assignment from task definitions and references a bfx_jdbc environment with the required Python dependencies. |
| DatabricksUseSqlScripting | N | New in 2026. Wraps SQL in BEGIN/END blocks for native SQL scripting. Produces more readable notebooks. Mutually exclusive with DatabricksUseTemporaryViews. |
| DatabricksUseTemporaryViews | N | Creates temporary views for intermediate results. Mutually exclusive with DatabricksUseSqlScripting. |
| DatabricksUseExistingCluster | N | Controls DAB YAML output: when Y, generates existing_cluster_id; when N, generates job_cluster_key. |
| DatabricksUseScriptFragments | N | Splits notebooks into smaller script fragments for easier debugging and version control. |

warning

DatabricksUseSqlScripting and DatabricksUseTemporaryViews are mutually exclusive. Enabling both triggers validation SET_28002004 and blocks the build. Choose one approach:

  • SQL Scripting (DatabricksUseSqlScripting = Y): Best for teams that prefer readable SQL and use Unity Catalog features like stored procedures.
  • Temporary Views (DatabricksUseTemporaryViews = Y): Legacy approach that creates temp views for intermediate transformations.
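The mutual-exclusion rule above can be expressed as a simple pre-build check. The sketch below mirrors the validation's intent; the settings-dictionary shape and function name are assumptions, not BimlFlex's actual validation code.

```python
def check_scripting_settings(settings: dict) -> None:
    """Raise when both mutually exclusive scripting settings are enabled,
    mirroring validation SET_28002004."""
    if (settings.get("DatabricksUseSqlScripting") == "Y"
            and settings.get("DatabricksUseTemporaryViews") == "Y"):
        raise ValueError(
            "SET_28002004: DatabricksUseSqlScripting and "
            "DatabricksUseTemporaryViews are mutually exclusive"
        )
```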

Unity Catalog Settings

Unity Catalog settings have the single largest impact on generated DDL and table naming.

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksUseUnityCatalog | Y | Critical setting. Changes table naming from underscore-concatenated (schema_table) to dot-separated (schema.table). Enables stored procedures, materialised views, and proper three-part naming (catalog.schema.table). Requires ExternalLocation on the connection when DatabricksUseManagedTables = N. |
| DatabricksUseManagedTables | N | When N, tables are created as external tables and ExternalLocation is required on each Databricks connection. When Y, Databricks manages the storage location. |
| DatabricksUseStoredProcedures | N | Generates stored procedures for transformation logic. Only active when DatabricksUseUnityCatalog = Y. |
| DatabricksMaterialiseCurrentViews | N | Materialises current-state views as tables instead of views. Only active when DatabricksUseUnityCatalog = Y. |
| DatabricksUseCreateCatalog | N | Adds CREATE CATALOG IF NOT EXISTS statements to DDL scripts. Useful for automated environment provisioning. |
| DatabricksTempTableSchema | (none) | Schema name for temporary tables in Unity Catalog. When set, temp tables are created in this schema rather than the default schema. |

Performance Settings

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksAnalyzeTable | N | Runs ANALYZE TABLE after each load to update table statistics. Improves query optimizer decisions. |
| DatabricksOptimizeTable | N | Runs OPTIMIZE after each load to compact small files. Recommended for tables with frequent small writes. |
| DatabricksUseLiquidClustering | N | Enables Liquid Clustering, which replaces traditional partitioning and ZORDER. Liquid Clustering automatically manages data layout and is the recommended approach for new Delta tables. |

DDL Settings

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksAddPrimaryKeys | N | Adds primary key constraints to Delta tables. Databricks uses these as informational constraints for query optimization. |
| DatabricksTableOwner | (none) | Sets the table owner via ALTER TABLE ... SET OWNER TO. Useful for access control in Unity Catalog. |
| DatabricksTableProperties | (none) | Appended as a TBLPROPERTIES clause on every generated CREATE TABLE statement. Example: 'delta.autoOptimize.optimizeWrite' = 'true'. |

Deployment Settings

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksNotebookPath | /Repos/BimlFlex/@@Repository/Databricks/ | Runtime path where notebooks are expected in the Databricks workspace. The @@Repository placeholder is replaced by the value of DatabricksRepositoryName. |
| DatabricksRepositoryName | YourRepository | Repository name used in the notebook path substitution. Change this to match your Databricks repo or folder structure. |
| DatabricksOutputPath | @@OutputPath\Databricks\ | Local build output directory where generated Databricks artifacts are written. @@OutputPath resolves to the BimlStudio output path. |
| DatabricksGitSource | (none) | Git source reference for DAB bundle configuration. When set, the DAB YAML includes a git source block for CI/CD integration. |
| DatabricksUseDisplayFolder | N | When Y, organizes notebooks into subfolders based on the object's display folder, creating a more structured notebook hierarchy. |
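The two placeholders resolve with straightforward string substitution. A sketch using the defaults from the table above (the function and the sample output path are illustrative assumptions):

```python
def resolve_path(setting_value: str, *,
                 repository: str = "YourRepository",
                 output_path: str = r"C:\BimlFlex\Output") -> str:
    """Resolve the @@Repository and @@OutputPath placeholders.
    The output_path default is a hypothetical BimlStudio output directory."""
    return (setting_value
            .replace("@@Repository", repository)
            .replace("@@OutputPath", output_path))

notebook_path = resolve_path("/Repos/BimlFlex/@@Repository/Databricks/")
```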

Step 6: Import Source Metadata

With connections and settings configured, import metadata from your source system.

ODBC DSN Setup

If your source is a Databricks catalog, configure an ODBC DSN on your local machine:

  1. Install the Simba Databricks ODBC driver
  2. Open ODBC Data Source Administrator and create a System DSN
  3. Enter the Databricks workspace URL, HTTP path, and authentication credentials
  4. Test the connection

For non-Databricks sources (SQL Server, Oracle, etc.), use the standard ODBC or OLE DB connection as you normally would.

Import Process

  1. Navigate to the source Connection in the BimlFlex App
  2. Click Import Metadata to launch the Metadata Importer
  3. Select the tables and views to import
  4. Review the imported schema, columns, and data types
  5. Apply any overrides or exclusions as needed
tip

For additional details on the import process, see the Metadata Importer documentation.

Step 7: Configure Business Keys

After importing metadata, define Business Keys on each source object. Business keys are essential for Data Vault modeling and are used to generate Hub, Link, and Satellite structures.

  1. Open the Objects editor
  2. For each source table, set the columns that form the natural business key
  3. BimlFlex uses these keys to generate Hubs and Links automatically

Step 8: Data Vault and Data Mart on Databricks

Data Vault Naming with Unity Catalog

The DatabricksUseUnityCatalog setting directly affects how Data Vault table names are generated:

| Setting | Generated Name | Format |
| --- | --- | --- |
| DatabricksUseUnityCatalog = Y | raw_vault.hub_customer | schema.table (dot-separated) |
| DatabricksUseUnityCatalog = N | raw_vault_hub_customer | schema_table (underscore-concatenated) |

This naming difference flows through all generated DDL, notebooks, and stored procedures. The source code uses this logic consistently across all target name resolution, including staging, delete detection, and Data Vault objects.

note

When Unity Catalog is enabled, BimlFlex generates proper three-part names (catalog.schema.table), which enables fine-grained access control through Unity Catalog grants. When disabled, the schema is concatenated into the table name because hive_metastore does not support true schemas.

Data Mart

Data Mart configuration on Databricks follows the same patterns described in the Data Mart Configuration documentation. The key differences for Databricks are:

  • Transformations use Spark SQL instead of T-SQL
  • Tables are Delta format with optional Liquid Clustering, OPTIMIZE, and ANALYZE support
  • Star schema patterns (Fact and Dimension tables) work identically at the metadata level

Step 9: System Columns and Delete Detection

BimlFlex automatically adds system columns (audit columns, hash keys, load dates) to generated tables based on your settings configuration.

Delete Detection on Databricks

When delete detection is enabled for source objects, BimlFlex generates dedicated Databricks notebooks that:

  • Compare current source data against previously loaded data
  • Identify deleted records
  • Insert delete-flagged rows into the target tables

With PushdownProcessing enabled, delete detection notebooks are included in the DAB workflow YAML as additional tasks with proper dependency ordering.
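Conceptually, the comparison reduces to a set difference on business keys: anything loaded previously but absent from the current extract is a delete. A stripped-down sketch of the idea (in the real pipelines this runs as generated Spark SQL over Delta tables, not Python sets):

```python
def detect_deleted_keys(previously_loaded: set[str],
                        current_source: set[str]) -> set[str]:
    """Business keys present in an earlier load but missing from the
    current source extract are treated as deletes and flagged in the
    target tables."""
    return previously_loaded - current_source

deleted = detect_deleted_keys({"C001", "C002", "C003"}, {"C001", "C003"})
```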

Step 10: Build and Deploy

Generated Artifacts

When you build the project in BimlStudio, the following artifacts are generated in the output directory:

| Artifact | Description |
| --- | --- |
| Tables/ | DDL scripts (CREATE TABLE, CREATE SCHEMA, etc.) for all Databricks target tables |
| Notebooks/ | Python/SQL notebooks for staging, Data Vault, and Data Mart load logic |
| bfxutils.py | Utility library used by all notebooks for BimlCatalog connectivity. Content differs between ODBC and JDBC mode: the ODBC version uses pyodbc, the JDBC version uses pytds. |
| bfx_init_odbc.sh | Cluster init script that installs the Simba ODBC driver. Generated only in ODBC mode. |
| bfx_setup_secrets.ps1 | PowerShell script with Databricks CLI commands to create the secret scope and store BimlCatalog connection credentials. |
| databricks.yml | Databricks Asset Bundle definition file. Generated only when PushdownProcessing is enabled. |
| resources/*.yml | Per-batch DAB workflow YAML files defining jobs, tasks, and dependencies. Generated only when PushdownProcessing is enabled. |
| ADF ARM templates | Azure Resource Manager templates for deploying ADF pipelines, linked services, and datasets. |

Deployment Sequence

Deploy the generated artifacts in this order:

  1. Deploy DDL scripts -- Run the table creation scripts against your Databricks workspace (via Databricks SQL editor, notebook, or CLI). This creates all schemas, tables, and stored procedures.

  2. Deploy notebooks and DAB bundles -- Use the Databricks CLI to deploy notebooks and workflow definitions:

    ```shell
    # Navigate to your Databricks output directory
    cd output/Databricks

    # Deploy using Databricks Asset Bundles
    databricks bundle deploy --target dev
    ```

  3. Deploy ADF ARM templates -- Deploy the generated ARM templates to your Azure Data Factory instance using the Azure CLI, Azure Portal, or your CI/CD pipeline.

  4. Configure secrets -- Run the generated bfx_setup_secrets.ps1 script to create the Databricks secret scope and store BimlCatalog connection credentials:

    ```powershell
    # Review and customize the generated script first
    ./bfx_setup_secrets.ps1
    ```

    This script uses databricks secrets create-scope and databricks secrets put-secret to store the server, database, username, and password that bfxutils.py uses to connect to the BimlCatalog database at runtime.

warning

The secret scope must be created before running any pipelines. Without it, notebooks will fail when attempting to log execution status to the BimlCatalog database.

tip

For JDBC mode, connection credentials can be stored as either a JSON object ({"server": "host", "database": "db", "user": "u", "password": "p", "port": 1433}) or a traditional ODBC connection string. The bfxutils.py JDBC version automatically parses both formats.
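A sketch of how such dual-format parsing might work. This is not the actual bfxutils.py code; the connection-string key names (Server, Database, Uid, Pwd, Port) are assumptions based on common ODBC conventions.

```python
import json

def parse_catalog_secret(secret_value: str) -> dict:
    """Parse a BimlCatalog credential secret in either supported format:
    a JSON object, or an ODBC-style key=value connection string."""
    try:
        return json.loads(secret_value)
    except json.JSONDecodeError:
        # Fall back to "Key=Value;Key=Value" parsing.
        pairs = dict(item.split("=", 1)
                     for item in secret_value.split(";") if "=" in item)
        return {
            "server": pairs.get("Server"),
            "database": pairs.get("Database"),
            "user": pairs.get("Uid"),       # assumed key names
            "password": pairs.get("Pwd"),
            "port": int(pairs.get("Port", 1433)),
        }

creds = parse_catalog_secret("Server=host;Database=db;Uid=u;Pwd=p")
```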

Validation Checklist

Before running your first pipeline, verify:

  • All DDL scripts have been executed successfully in Databricks
  • Notebooks are deployed to the path specified by DatabricksNotebookPath
  • The Databricks secret scope contains valid BimlCatalog credentials
  • ADF linked services can connect to both the source system and Databricks
  • Azure Blob Storage containers (stage, archive, error) exist and are accessible
  • The ADF Managed Identity (or access token) has the required permissions on the Databricks workspace

Next Steps