ADF + Databricks Pipeline Configuration Guide
Introduction

This guide walks through every configuration step required to set up a complete BimlFlex pipeline that uses Azure Data Factory for orchestration and Databricks for compute and processing. Follow these steps in order for your first project, then use individual sections as a reference for subsequent projects.
Prerequisites
Before you begin, confirm the following are in place:
- Azure Subscription with permissions to create resources in Azure Data Factory and Azure Databricks
- Databricks Workspace provisioned in the same Azure region as your storage accounts
- Azure Blob Storage or ADLS Gen2 account for staging and landing data
- BimlFlex installation (BimlFlex App and BimlStudio) at version 2026 or later
- BimlCatalog database deployed to Azure SQL Database (used for orchestration logging)
- ODBC or JDBC driver for Databricks connectivity from your local workstation (used by the Metadata Importer):
  - ODBC: Install the Simba Databricks ODBC driver and configure a system DSN. Required if you want to import metadata or test connectivity from BimlFlex directly.
  - JDBC: No local driver installation required. Choose JDBC when targeting serverless compute, since serverless clusters cannot run init scripts for ODBC driver installation.
Two Processing Approaches
BimlFlex supports two processing approaches when orchestrating Databricks through ADF:
| Approach | ADF Activity Type | When to Use |
|---|---|---|
| Traditional Notebook | ADF Notebook Activity | Simpler setup; each notebook is invoked individually by ADF |
| Pushdown Processing (recommended) | ADF Databricks Job Activity | Higher performance; all transformation logic runs inside Databricks workflows with built-in dependency management and restart logic |
Pushdown Processing is new in BimlFlex 2026 and is the recommended approach for new projects. It generates Databricks Asset Bundles (DAB) for deployment and supports serverless compute when combined with JDBC mode.
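To make the generated output concrete, a pushdown build emits DAB YAML broadly along these lines. This is an illustrative sketch only — the bundle, job, and task names are assumptions, not actual BimlFlex output:

```yaml
# Illustrative sketch only; names are assumptions, not generated output
bundle:
  name: bfx_example

resources:
  jobs:
    bfx_example_batch:
      name: bfx_example_batch
      tasks:
        - task_key: stg_customer
          notebook_task:
            notebook_path: /Repos/BimlFlex/YourRepository/Databricks/stg_customer
        - task_key: hub_customer
          # Dependency management is expressed directly in the workflow
          depends_on:
            - task_key: stg_customer
          notebook_task:
            notebook_path: /Repos/BimlFlex/YourRepository/Databricks/hub_customer
```

Because dependencies live in the workflow definition itself, Databricks (rather than ADF) handles task ordering and restart behavior.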
Step 1: Create Your Project
In the BimlFlex App, create a new project and select the Databricks (ADF) template (IntegrationTemplateId = 5).
Connection Slots
The project template defines the following connection slots. Configure each one in the Connections editor before building.
| Connection | Required? | Notes |
|---|---|---|
| Source | Yes | Your source database (SQL Server, Oracle, MySQL, etc.) |
| Landing | Conditional | Required unless PushdownExtraction is enabled. Typically Azure Blob Storage configured as flat-file delimited. |
| Stage | Yes | Databricks target for staging tables. Set System Type to Databricks Data Warehouse. |
| PSA | Optional | Persistent Staging Area for historized bronze-layer retention. |
| Target | Yes | Databricks connection for Data Vault and/or Data Mart layers. |
| Compute | Yes | Databricks workspace connection used for linked service generation. Validation PRJ_28005005 will block the build if this is missing. |
PushdownProcessing
Enable PushdownProcessing on the project to change the pipeline architecture. When enabled:
- ADF uses a Databricks Job Activity instead of individual Notebook Activities
- BimlFlex generates Databricks Asset Bundle (DAB) YAML files that define workflows, tasks, and dependencies
- Cluster assignment is controlled by the DatabricksJobCluster setting (job cluster, existing cluster, or serverless)
- Built-in restart logic skips already-completed ADF Copy Activities on retry
PushdownExtraction
Enable PushdownExtraction on the project to skip the ADF Copy Activity and read source data directly from within Databricks. When enabled:
- The Landing connection slot is hidden (no blob staging is needed)
- Databricks reads the source system directly, which is most useful when the source is also a Databricks catalog or a cloud-accessible database
- Combines naturally with PushdownProcessing for a fully Databricks-native pipeline
Step 2: Configure Connections
Open the Connections editor and configure each connection assigned to your project.
Key Fields for Databricks Connections
| Field | Value | Notes |
|---|---|---|
| Connection Type | ODBC | Use ODBC for all Databricks target connections |
| System Type | Databricks Data Warehouse | Required for all Databricks connections |
| Linked Service Type | Databricks | Enables the Databricks-specific linked service form |
| Catalog | Your Unity Catalog name | Maps to the Databricks catalog. When DatabricksUseUnityCatalog = Y, this value appears in all qualified table names. |
| Cloud | Checked | Must be enabled for all cloud-based connections |
ExternalLocation
When DatabricksUseUnityCatalog = Y and DatabricksUseManagedTables = N, you must specify an ExternalLocation on the connection. This tells Databricks where to store the underlying data files for external (non-managed) tables.
Validation CON_21005007 will block the build if ExternalLocation is missing when Unity Catalog is enabled with non-managed tables. Set the ExternalLocation field to your ADLS Gen2 path, for example: abfss://container@storageaccount.dfs.core.windows.net/path.
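If you want to sanity-check the ExternalLocation value before building, a small validator along these lines can catch malformed paths. This is a hypothetical helper, not part of BimlFlex:

```python
import re

# Hypothetical validator for the ExternalLocation value; not part of BimlFlex.
ABFSS_PATTERN = re.compile(
    r"^abfss://"                      # ADLS Gen2 scheme
    r"[a-z0-9][a-z0-9-]{2,62}@"       # container name
    r"[a-z0-9]{3,24}"                 # storage account name
    r"\.dfs\.core\.windows\.net"      # ADLS Gen2 endpoint
    r"(/.*)?$"                        # optional path within the container
)

def is_valid_external_location(path: str) -> bool:
    """Return True when the path matches the abfss:// URI shape above."""
    return ABFSS_PATTERN.match(path) is not None

print(is_valid_external_location(
    "abfss://container@storageaccount.dfs.core.windows.net/path"))  # True
```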
Compute Connection
The Compute connection represents the Databricks workspace itself (not a database within it). Configure it as follows:
| Field | Value |
|---|---|
| Connection Type | ODBC |
| System Type | Databricks Data Warehouse |
| Linked Service Type | Databricks |
| Cloud | Checked |
This connection is referenced by the project for generating the ADF Databricks linked service. It must include the workspace URL and authentication details in its linked service configuration.
Step 3: Configure Linked Services
Open the Connections editor, select your Compute connection, and configure its Linked Service settings.
Authentication Methods
| Auth Method | When to Use | Key Fields |
|---|---|---|
| Access Token | Simple setup, development and test environments | LS_DatabricksAccessToken (plaintext) or store in Azure Key Vault (recommended) |
| Managed Identity | Production environments, no secrets to manage | LS_DatabricksWorkspaceResourceId (the Azure resource ID of the Databricks workspace) |
For production deployments, use Managed Identity authentication. This removes the need to rotate access tokens and simplifies secret management. The ADF Managed Identity must be granted Contributor access to the Databricks workspace.
New Cluster Configuration Fields
When the linked service creates new clusters for notebook execution, configure these fields:
| Field | Description | Example |
|---|---|---|
| Version | Databricks Runtime version | 15.4.x-scala2.12 |
| NodeType | VM size for worker nodes | Standard_DS3_v2 |
| NumOfWorker | Number of worker nodes (0 = single-node) | 2 |
| SparkConf | Spark configuration key-value pairs | spark.speculation true |
| SparkEnvVars | Environment variables for the cluster | PYSPARK_PYTHON /databricks/python3/bin/python3 |
| InitScripts | Cluster init scripts (required for ODBC mode) | dbfs:/databricks/scripts/bfx_init_odbc.sh |
| CustomTags | Azure tags applied to the cluster | CostCenter 12345 |
| LogDestination | DBFS path for cluster log delivery | dbfs:/cluster-logs |
When using an Instance Pool, specify the pool ID instead of NodeType. The pool provides pre-warmed VMs for faster cluster startup.
Init scripts are required when using ODBC mode (DatabricksUtilsDriver = ODBC) because the Simba ODBC driver must be installed on every cluster node. When using JDBC mode, no init scripts are needed and you can target serverless compute.
Step 4: Azure Storage Settings
Azure Blob Storage settings define where ADF lands extracted data before Databricks processes it. These settings follow the same pattern used for Snowflake and other cloud targets.
Configure the following settings in the BimlFlex Settings editor under the Azure category:
| Setting | Description | Example |
|---|---|---|
| AzureStageContainer | Container name for staging files | stage |
| AzureStageAccountName | Storage account name | bfxblobaccount |
| AzureStageAccountKey | Storage access key | <StorageAccountKey>== |
| AzureStageSasToken | SAS token (alternative to account key) | ?<SasToken> |
| AzureArchiveContainer | Container for archived files | archive |
| AzureArchiveAccountName | Storage account for archive | bfxblobaccount |
| AzureErrorContainer | Container for error files | error |
| AzureErrorAccountName | Storage account for errors | bfxblobaccount |
| AzureBlobStorageDomain | Storage domain suffix | blob.core.windows.net |
You can use the same storage account for stage, archive, and error containers. Using separate containers within a single account is the most common configuration.
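As an illustration of how these settings combine, the container endpoint is simply the account name plus the storage domain plus the container. The helper below is hypothetical, shown only to clarify the relationship between AzureStageAccountName, AzureBlobStorageDomain, and AzureStageContainer:

```python
def blob_container_url(account_name: str, container: str,
                       domain: str = "blob.core.windows.net") -> str:
    """Compose a container URL from AzureStageAccountName,
    AzureStageContainer, and AzureBlobStorageDomain.
    (A hypothetical helper to show how the settings combine.)"""
    return f"https://{account_name}.{domain}/{container}"

print(blob_container_url("bfxblobaccount", "stage"))
# https://bfxblobaccount.blob.core.windows.net/stage
```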
Step 5: Databricks Settings
This is the most extensive configuration area. Databricks settings are managed in the BimlFlex Settings editor and control everything from driver selection to table DDL to deployment paths. For the full settings reference, see the Databricks settings index.
Architecture Settings
These settings determine the fundamental processing architecture of your pipeline.
| Setting | Default | Impact |
|---|---|---|
| DatabricksUtilsDriver | ODBC | Choose ODBC or JDBC. JDBC enables serverless compute because it uses the pure-Python pytds library, removing the dependency on cluster-level ODBC driver installation. This is a critical architectural choice that affects cluster configuration, init scripts, and deployment artifacts. |
| DatabricksJobCluster | (none) | Cluster name, cluster ID, or the literal value Serverless. Controls the DAB YAML cluster configuration. When set to Serverless, BimlFlex omits cluster assignment from task definitions and references a bfx_jdbc environment with the required Python dependencies. |
| DatabricksUseSqlScripting | N | New in 2026. Wraps SQL in BEGIN/END blocks for native SQL scripting. Produces more readable notebooks. Mutually exclusive with DatabricksUseTemporaryViews. |
| DatabricksUseTemporaryViews | N | Creates temporary views for intermediate results. Mutually exclusive with DatabricksUseSqlScripting. |
| DatabricksUseExistingCluster | N | Controls DAB YAML output: when Y, generates existing_cluster_id; when N, generates job_cluster_key. |
| DatabricksUseScriptFragments | N | Splits notebooks into smaller script fragments for easier debugging and version control. |
DatabricksUseSqlScripting and DatabricksUseTemporaryViews are mutually exclusive. Enabling both triggers validation SET_28002004 and blocks the build. Choose one approach:
- SQL Scripting (DatabricksUseSqlScripting = Y): Best for teams that prefer readable SQL and use Unity Catalog features like stored procedures.
- Temporary Views (DatabricksUseTemporaryViews = Y): Legacy approach that creates temp views for intermediate transformations.
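The cluster-related settings above reduce to a small decision rule. The sketch below is an illustrative approximation of how DatabricksJobCluster and DatabricksUseExistingCluster might map to DAB task fields — it is not BimlFlex source code:

```python
def task_cluster_fields(job_cluster: str, use_existing_cluster: bool) -> dict:
    """Illustrative mapping from the DatabricksJobCluster and
    DatabricksUseExistingCluster settings to DAB task fields.
    (A sketch of the behavior described above, not BimlFlex source.)"""
    if job_cluster == "Serverless":
        # Serverless: omit cluster assignment; reference the bfx_jdbc
        # environment that carries the required Python dependencies
        return {"environment_key": "bfx_jdbc"}
    if use_existing_cluster:
        # DatabricksUseExistingCluster = Y -> existing_cluster_id
        return {"existing_cluster_id": job_cluster}
    # DatabricksUseExistingCluster = N -> job_cluster_key
    return {"job_cluster_key": job_cluster}

print(task_cluster_fields("Serverless", False))  # {'environment_key': 'bfx_jdbc'}
```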
Unity Catalog Settings
Unity Catalog settings have the single largest impact on generated DDL and table naming.
| Setting | Default | Impact |
|---|---|---|
| DatabricksUseUnityCatalog | Y | Critical setting. Changes table naming from underscore-concatenated (schema_table) to dot-separated (schema.table). Enables stored procedures, materialised views, and proper three-part naming (catalog.schema.table). Requires ExternalLocation on the connection when DatabricksUseManagedTables = N. |
| DatabricksUseManagedTables | N | When N, tables are created as external tables and ExternalLocation is required on each Databricks connection. When Y, Databricks manages the storage location. |
| DatabricksUseStoredProcedures | N | Generates stored procedures for transformation logic. Only active when DatabricksUseUnityCatalog = Y. |
| DatabricksMaterialiseCurrentViews | N | Materialises current-state views as tables instead of views. Only active when DatabricksUseUnityCatalog = Y. |
| DatabricksUseCreateCatalog | N | Adds CREATE CATALOG IF NOT EXISTS statements to DDL scripts. Useful for automated environment provisioning. |
| DatabricksTempTableSchema | (none) | Schema name for temporary tables in Unity Catalog. When set, temp tables are created in this schema rather than the default schema. |
Performance Settings
| Setting | Default | Impact |
|---|---|---|
| DatabricksAnalyzeTable | N | Runs ANALYZE TABLE after each load to update table statistics. Improves query optimizer decisions. |
| DatabricksOptimizeTable | N | Runs OPTIMIZE after each load to compact small files. Recommended for tables with frequent small writes. |
| DatabricksUseLiquidClustering | N | Enables Liquid Clustering, which replaces traditional partitioning and ZORDER. Liquid Clustering automatically manages data layout and is the recommended approach for new Delta tables. |
DDL Settings
| Setting | Default | Impact |
|---|---|---|
| DatabricksAddPrimaryKeys | N | Adds primary key constraints to Delta tables. Databricks uses these as informational constraints for query optimization. |
| DatabricksTableOwner | (none) | Sets the table owner via ALTER TABLE ... SET OWNER TO. Useful for access control in Unity Catalog. |
| DatabricksTableProperties | (none) | Appended as a TBLPROPERTIES clause on every generated CREATE TABLE statement. Example: 'delta.autoOptimize.optimizeWrite' = 'true'. |
Deployment Settings
| Setting | Default | Impact |
|---|---|---|
| DatabricksNotebookPath | /Repos/BimlFlex/@@Repository/Databricks/ | Runtime path where notebooks are expected in the Databricks workspace. The @@Repository placeholder is replaced by the value of DatabricksRepositoryName. |
| DatabricksRepositoryName | YourRepository | Repository name used in the notebook path substitution. Change this to match your Databricks repo or folder structure. |
| DatabricksOutputPath | @@OutputPath\Databricks\ | Local build output directory where generated Databricks artifacts are written. @@OutputPath resolves to the BimlStudio output path. |
| DatabricksGitSource | (none) | Git source reference for DAB bundle configuration. When set, the DAB YAML includes a git source block for CI/CD integration. |
| DatabricksUseDisplayFolder | N | When Y, organizes notebooks into subfolders based on the object's display folder, creating a more structured notebook hierarchy. |
Step 6: Import Source Metadata
With connections and settings configured, import metadata from your source system.
ODBC DSN Setup
If your source is a Databricks catalog, configure an ODBC DSN on your local machine:
- Install the Simba Databricks ODBC driver
- Open ODBC Data Source Administrator and create a System DSN
- Enter the Databricks workspace URL, HTTP path, and authentication credentials
- Test the connection
For non-Databricks sources (SQL Server, Oracle, etc.), use the standard ODBC or OLE DB connection as you normally would.
Import Process
- Navigate to the source Connection in the BimlFlex App
- Click Import Metadata to launch the Metadata Importer
- Select the tables and views to import
- Review the imported schema, columns, and data types
- Apply any overrides or exclusions as needed
For additional details on the import process, see the Metadata Importer documentation.
Step 7: Configure Business Keys
After importing metadata, define Business Keys on each source object. Business keys are essential for Data Vault modeling and are used to generate Hub, Link, and Satellite structures.
- Open the Objects editor
- For each source table, set the columns that form the natural business key
- BimlFlex uses these keys to generate Hubs and Links automatically
Step 8: Data Vault and Data Mart on Databricks
Data Vault Naming with Unity Catalog
The DatabricksUseUnityCatalog setting directly affects how Data Vault table names are generated:
| Setting | Generated Name | Format |
|---|---|---|
| DatabricksUseUnityCatalog = Y | raw_vault.hub_customer | schema.table (dot-separated) |
| DatabricksUseUnityCatalog = N | raw_vault_hub_customer | schema_table (underscore-concatenated) |
This naming difference flows through all generated DDL, notebooks, and stored procedures. The source code uses this logic consistently across all target name resolution, including staging, delete detection, and Data Vault objects.
When Unity Catalog is enabled, BimlFlex generates proper three-part names (catalog.schema.table), which enables fine-grained access control through Unity Catalog grants. When disabled, the schema is concatenated into the table name because hive_metastore does not support true schemas.
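The naming rule above can be sketched as a simple function. This is illustrative only — qualified_name is a hypothetical helper, not the BimlFlex implementation:

```python
def qualified_name(schema: str, table: str, catalog: str = "",
                   use_unity_catalog: bool = True) -> str:
    """Illustrative sketch of the naming rule: dot-separated under Unity
    Catalog, underscore-concatenated under hive_metastore."""
    if use_unity_catalog:
        # Three-part name when a catalog is supplied, two-part otherwise
        return f"{catalog}.{schema}.{table}" if catalog else f"{schema}.{table}"
    # hive_metastore: the schema is folded into the table name
    return f"{schema}_{table}"

print(qualified_name("raw_vault", "hub_customer"))  # raw_vault.hub_customer
print(qualified_name("raw_vault", "hub_customer",
                     use_unity_catalog=False))      # raw_vault_hub_customer
```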
Data Mart
Data Mart configuration on Databricks follows the same patterns described in the Data Mart Configuration documentation. The key differences for Databricks are:
- Transformations use Spark SQL instead of T-SQL
- Tables are Delta format with optional Liquid Clustering, OPTIMIZE, and ANALYZE support
- Star schema patterns (Fact and Dimension tables) work identically at the metadata level
Step 9: System Columns and Delete Detection
BimlFlex automatically adds system columns (audit columns, hash keys, load dates) to generated tables based on your settings configuration.
Delete Detection on Databricks
When delete detection is enabled for source objects, BimlFlex generates dedicated Databricks notebooks that:
- Compare current source data against previously loaded data
- Identify deleted records
- Insert delete-flagged rows into the target tables
With PushdownProcessing enabled, delete detection notebooks are included in the DAB workflow YAML as additional tasks with proper dependency ordering.
Step 10: Build and Deploy
Generated Artifacts
When you build the project in BimlStudio, the following artifacts are generated in the output directory:
| Artifact | Description |
|---|---|
| Tables/ | DDL scripts (CREATE TABLE, CREATE SCHEMA, etc.) for all Databricks target tables |
| Notebooks/ | Python/SQL notebooks for staging, Data Vault, and Data Mart load logic |
| bfxutils.py | Utility library used by all notebooks for BimlCatalog connectivity. Content differs between ODBC and JDBC mode: ODBC version uses pyodbc, JDBC version uses pytds. |
| bfx_init_odbc.sh | Cluster init script that installs the Simba ODBC driver. Generated only in ODBC mode. |
| bfx_setup_secrets.ps1 | PowerShell script with Databricks CLI commands to create the secret scope and store BimlCatalog connection credentials. |
| databricks.yml | Databricks Asset Bundle definition file. Generated only when PushdownProcessing is enabled. |
| resources/*.yml | Per-batch DAB workflow YAML files defining jobs, tasks, and dependencies. Generated only when PushdownProcessing is enabled. |
| ADF ARM templates | Azure Resource Manager templates for deploying ADF pipelines, linked services, and datasets. |
Deployment Sequence
Deploy the generated artifacts in this order:

1. Deploy DDL scripts -- Run the table creation scripts against your Databricks workspace (via Databricks SQL editor, notebook, or CLI). This creates all schemas, tables, and stored procedures.

2. Deploy notebooks and DAB bundles -- Use the Databricks CLI to deploy notebooks and workflow definitions:

   ```shell
   # Navigate to your Databricks output directory
   cd output/Databricks

   # Deploy using Databricks Asset Bundles
   databricks bundle deploy --target dev
   ```

3. Deploy ADF ARM templates -- Deploy the generated ARM templates to your Azure Data Factory instance using the Azure CLI, Azure Portal, or your CI/CD pipeline.

4. Configure secrets -- Run the generated bfx_setup_secrets.ps1 script to create the Databricks secret scope and store BimlCatalog connection credentials:

   ```shell
   # Review and customize the generated script first
   ./bfx_setup_secrets.ps1
   ```

   This script uses databricks secrets create-scope and databricks secrets put-secret to store the server, database, username, and password that bfxutils.py uses to connect to the BimlCatalog database at runtime.
The secret scope must be created before running any pipelines. Without it, notebooks will fail when attempting to log execution status to the BimlCatalog database.
For JDBC mode, connection credentials can be stored as either a JSON object ({"server": "host", "database": "db", "user": "u", "password": "p", "port": 1433}) or a traditional ODBC connection string. The bfxutils.py JDBC version automatically parses both formats.
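The dual-format credential handling described above can be approximated as follows. This is a hypothetical sketch of the behavior, not the actual bfxutils.py code:

```python
import json

def parse_catalog_credentials(secret_value: str) -> dict:
    """Accept either a JSON object or an ODBC-style connection string
    ("Server=host;Database=db;..."). A sketch of the dual-format
    behavior described above, not the actual bfxutils.py code."""
    try:
        parsed = json.loads(secret_value)
        if isinstance(parsed, dict):
            return parsed
    except json.JSONDecodeError:
        pass  # not JSON; fall back to connection-string parsing
    pairs = (part.split("=", 1) for part in secret_value.split(";") if "=" in part)
    return {key.strip().lower(): value.strip() for key, value in pairs}

print(parse_catalog_credentials("Server=host;Database=db;Uid=u;Pwd=p"))
# {'server': 'host', 'database': 'db', 'uid': 'u', 'pwd': 'p'}
```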
Validation Checklist
Before running your first pipeline, verify:
- All DDL scripts have been executed successfully in Databricks
- Notebooks are deployed to the path specified by
DatabricksNotebookPath - The Databricks secret scope contains valid BimlCatalog credentials
- ADF linked services can connect to both the source system and Databricks
- Azure Blob Storage containers (stage, archive, error) exist and are accessible
- The ADF Managed Identity (or access token) has the required permissions on the Databricks workspace
Next Steps
- Implementing Databricks with ADF -- Detailed architecture overview including pushdown processing and restartability
- Databricks Configuration Overview -- Connection setup, ODBC DSN, and storage settings
- Configuring Linked Services for Databricks -- Linked service form fields and authentication
- Databricks Settings Reference -- Full list of Databricks-specific settings
- Data Vault Templates -- Data Vault modeling and generation
- Data Mart Configuration -- Star schema patterns and configuration
- Batch Orchestration for Databricks -- Batch and execution group configuration