
ADF + Databricks Pipeline Configuration Guide

This guide walks through every configuration step required to set up a complete BimlFlex pipeline that uses Azure Data Factory for orchestration and Databricks for compute and processing. Follow these steps in order for your first project, then use individual sections as a reference for subsequent projects.

Introduction

Prerequisites

Before you begin, confirm the following are in place:

  • Azure Subscription with permissions to create resources in Azure Data Factory and Azure Databricks
  • Databricks Workspace provisioned in the same Azure region as your storage accounts
  • Azure Blob Storage or ADLS Gen2 account for staging and landing data
  • BimlFlex installation (BimlFlex App and BimlStudio) at version 2026 or later
  • BimlCatalog database deployed to Azure SQL Database (used for orchestration logging)
  • ODBC or JDBC driver for Databricks connectivity from your local workstation (used by the Metadata Importer):
    • ODBC: Install the Simba Databricks ODBC driver and configure a system DSN. Required if you want to import metadata or test connectivity from BimlFlex directly.
    • JDBC: No local driver installation required. Choose JDBC when targeting serverless compute, since serverless clusters cannot run init scripts for ODBC driver installation.

Two Processing Approaches

BimlFlex supports two processing approaches when orchestrating Databricks through ADF:

| Approach | ADF Activity Type | When to Use |
| --- | --- | --- |
| Traditional Notebook | ADF Notebook Activity | Simpler setup; each notebook is invoked individually by ADF |
| Pushdown Processing (recommended) | ADF Databricks Job Activity | Higher performance; all transformation logic runs inside Databricks workflows with built-in dependency management and restart logic |

Pushdown Processing is new in BimlFlex 2026 and is the recommended approach for new projects. It generates Databricks Asset Bundles (DAB) for deployment and supports serverless compute when combined with JDBC mode.

Step 1: Create Your Project

In the BimlFlex App, create a new project and select the Databricks (ADF) template (IntegrationTemplateId = 5).

Connection Slots

The project template defines the following connection slots. Configure each one in the Connections editor before building.

| Connection | Required? | Notes |
| --- | --- | --- |
| Source | Yes | Your source database (SQL Server, Oracle, MySQL, etc.) |
| Landing | Conditional | Required unless PushdownExtraction is enabled. Typically Azure Blob Storage configured as flat-file delimited. |
| Stage | Yes | Databricks target for staging tables. Set System Type to Databricks Data Warehouse. |
| PSA | Optional | Persistent Staging Area for historized bronze-layer retention. |
| Target | Yes | Databricks connection for Data Vault and/or Data Mart layers. |
| Compute | Yes | Databricks workspace connection used for linked service generation. Validation PRJ_28005005 will block the build if this is missing. |

PushdownProcessing

Enable PushdownProcessing on the project to change the pipeline architecture. When enabled:

  • ADF uses a Databricks Job Activity instead of individual Notebook Activities
  • BimlFlex generates Databricks Asset Bundle (DAB) YAML files that define workflows, tasks, and dependencies
  • Cluster assignment is controlled by the DatabricksJobCluster setting (job cluster, existing cluster, or serverless)
  • Built-in restart logic skips already-completed ADF Copy Activities on retry

PushdownExtraction

Enable PushdownExtraction on the project to skip the ADF Copy Activity and read source data directly from within Databricks. When enabled:

  • The Landing connection slot is hidden (no blob staging is needed)
  • Databricks reads the source system directly, which is most useful when the source is also a Databricks catalog or a cloud-accessible database
  • Combines naturally with PushdownProcessing for a fully Databricks-native pipeline

Step 2: Configure Connections

Open the Connections editor and configure each connection assigned to your project.

Key Fields for Databricks Connections

| Field | Value | Notes |
| --- | --- | --- |
| Connection Type | ODBC | Use ODBC for all Databricks target connections |
| System Type | Databricks Data Warehouse | Required for all Databricks connections |
| Linked Service Type | Databricks | Enables the Databricks-specific linked service form |
| Catalog | Your Unity Catalog name | Maps to the Databricks catalog. When DatabricksUseUnityCatalog = Y, this value appears in all qualified table names. |
| Cloud | Checked | Must be enabled for all cloud-based connections |

ExternalLocation

When DatabricksUseUnityCatalog = Y and DatabricksUseManagedTables = N, you must specify an ExternalLocation on the connection. This tells Databricks where to store the underlying data files for external (non-managed) tables.

warning

Validation CON_21005007 will block the build if ExternalLocation is missing when Unity Catalog is enabled with non-managed tables. Set the ExternalLocation field to your ADLS Gen2 path, for example: abfss://container@storageaccount.dfs.core.windows.net/path.
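To make the effect of ExternalLocation concrete, the sketch below builds the kind of external-table DDL this configuration produces. The function name, the column list, and the per-table path layout are illustrative assumptions, not BimlFlex's actual generator; only the `LOCATION` clause pattern is the point.

```python
def external_table_ddl(catalog: str, schema: str, table: str,
                       external_location: str) -> str:
    """Sketch of a Unity Catalog external-table definition: the data
    files live under the connection's ExternalLocation rather than in
    Databricks-managed storage."""
    # Hypothetical layout: one folder per schema/table under the location.
    location = f"{external_location.rstrip('/')}/{schema}/{table}"
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} (\n"
        "  customer_code STRING\n"  # hypothetical column
        ") USING DELTA\n"
        f"LOCATION '{location}'"
    )

ddl = external_table_ddl(
    "main", "stage", "customer",
    "abfss://container@storageaccount.dfs.core.windows.net/path",
)
```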

Compute Connection

The Compute connection represents the Databricks workspace itself (not a database within it). Configure it as follows:

| Field | Value |
| --- | --- |
| Connection Type | ODBC |
| System Type | Databricks Data Warehouse |
| Linked Service Type | Databricks |
| Cloud | Checked |

This connection is referenced by the project for generating the ADF Databricks linked service. It must include the workspace URL and authentication details in its linked service configuration.

Step 3: Configure Linked Services

Open the Connections editor, select your Compute connection, and configure its Linked Service settings.

Authentication Methods

| Auth Method | When to Use | Key Fields |
| --- | --- | --- |
| Access Token | Simple setup, development and test environments | LS_DatabricksAccessToken (plaintext) or store in Azure Key Vault (recommended) |
| Managed Identity | Production environments, no secrets to manage | LS_DatabricksWorkspaceResourceId (the Azure resource ID of the Databricks workspace) |

tip

For production deployments, use Managed Identity authentication. This removes the need to rotate access tokens and simplifies secret management. The ADF Managed Identity must be granted Contributor access to the Databricks workspace.

New Cluster Configuration Fields

When the linked service creates new clusters for notebook execution, configure these fields:

| Field | Description | Example |
| --- | --- | --- |
| Version | Databricks Runtime version | 15.4.x-scala2.12 |
| NodeType | VM size for worker nodes | Standard_DS3_v2 |
| NumOfWorker | Number of worker nodes (0 = single-node) | 2 |
| SparkConf | Spark configuration key-value pairs | spark.speculation true |
| SparkEnvVars | Environment variables for the cluster | PYSPARK_PYTHON /databricks/python3/bin/python3 |
| InitScripts | Cluster init scripts (required for ODBC mode) | dbfs:/databricks/scripts/bfx_init_odbc.sh |
| CustomTags | Azure tags applied to the cluster | CostCenter 12345 |
| LogDestination | DBFS path for cluster log delivery | dbfs:/cluster-logs |

When using an Instance Pool, specify the pool ID instead of NodeType. The pool provides pre-warmed VMs for faster cluster startup.
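These fields map onto the `newCluster*` properties of the ADF AzureDatabricks linked service. The fragment below sketches the resulting typeProperties as a Python dict; the property names follow the public ADF ARM schema as best understood here, and all values (including the pool ID) are placeholders from the table above.

```python
# Hypothetical typeProperties fragment for an ADF "AzureDatabricks"
# linked service that creates a new job cluster per activity run.
new_cluster_properties = {
    "newClusterVersion": "15.4.x-scala2.12",    # Databricks Runtime (Version)
    "newClusterNodeType": "Standard_DS3_v2",    # worker VM size (NodeType)
    "newClusterNumOfWorker": "2",               # worker count (NumOfWorker)
    "newClusterSparkConf": {"spark.speculation": "true"},
    "newClusterSparkEnvVars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
    "newClusterInitScripts": ["dbfs:/databricks/scripts/bfx_init_odbc.sh"],
    "newClusterCustomTags": {"CostCenter": "12345"},
    "newClusterLogDestination": "dbfs:/cluster-logs",
}

# With an Instance Pool, the pool ID replaces the node type:
pooled = {**new_cluster_properties, "instancePoolId": "pool-0123456789"}
pooled.pop("newClusterNodeType")
```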

note

Init scripts are required when using ODBC mode (DatabricksUtilsDriver = ODBC) because the Simba ODBC driver must be installed on every cluster node. When using JDBC mode, no init scripts are needed and you can target serverless compute.

Step 4: Azure Storage Settings

Azure Blob Storage settings define where ADF lands extracted data before Databricks processes it. These settings follow the same pattern used for Snowflake and other cloud targets.

Configure the following settings in the BimlFlex Settings editor under the Azure category:

| Setting | Description | Example |
| --- | --- | --- |
| AzureStageContainer | Container name for staging files | stage |
| AzureStageAccountName | Storage account name | bfxblobaccount |
| AzureStageAccountKey | Storage access key | <StorageAccountKey>== |
| AzureStageSasToken | SAS token (alternative to account key) | ?<SasToken> |
| AzureArchiveContainer | Container for archived files | archive |
| AzureArchiveAccountName | Storage account for archive | bfxblobaccount |
| AzureErrorContainer | Container for error files | error |
| AzureErrorAccountName | Storage account for errors | bfxblobaccount |
| AzureBlobStorageDomain | Storage domain suffix | blob.core.windows.net |

tip

You can use the same storage account for the stage, archive, and error containers. Using separate containers within a single account is the most common configuration.
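For reference, the endpoint a container resolves to is composed from the account name, the AzureBlobStorageDomain suffix, and the container name. A quick sketch using the example values above (the helper function is illustrative):

```python
def container_url(account_name: str, container: str,
                  domain: str = "blob.core.windows.net") -> str:
    """Compose the HTTPS endpoint for a blob container from the
    Azure storage settings."""
    return f"https://{account_name}.{domain}/{container}"

stage_url = container_url("bfxblobaccount", "stage")
```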

Step 5: Databricks Settings

This is the most extensive configuration area. Databricks settings are managed in the BimlFlex Settings editor and control everything from driver selection to table DDL to deployment paths. For the full settings reference, see the Databricks settings index.

Architecture Settings

These settings determine the fundamental processing architecture of your pipeline.

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksUtilsDriver | ODBC | Choose ODBC or JDBC. JDBC enables serverless compute because it uses the pure-Python pytds library, removing the dependency on cluster-level ODBC driver installation. This is a critical architectural choice that affects cluster configuration, init scripts, and deployment artifacts. |
| DatabricksJobCluster | (none) | Cluster name, cluster ID, or the literal value Serverless. Controls the DAB YAML cluster configuration. When set to Serverless, BimlFlex omits cluster assignment from task definitions and references a bfx_jdbc environment with the required Python dependencies. |
| DatabricksUseSqlScripting | N | New in 2026. Wraps SQL in BEGIN/END blocks for native SQL scripting. Produces more readable notebooks. Mutually exclusive with DatabricksUseTemporaryViews. |
| DatabricksUseTemporaryViews | N | Creates temporary views for intermediate results. Mutually exclusive with DatabricksUseSqlScripting. |
| DatabricksUseExistingCluster | N | Controls DAB YAML output: when Y, generates existing_cluster_id; when N, generates job_cluster_key. |
| DatabricksUseScriptFragments | N | Splits notebooks into smaller script fragments for easier debugging and version control. |

warning

DatabricksUseSqlScripting and DatabricksUseTemporaryViews are mutually exclusive. Enabling both triggers validation SET_28002004 and blocks the build. Choose one approach:

  • SQL Scripting (DatabricksUseSqlScripting = Y): Best for teams that prefer readable SQL and use Unity Catalog features like stored procedures.
  • Temporary Views (DatabricksUseTemporaryViews = Y): Legacy approach that creates temp views for intermediate transformations.
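The mutual-exclusion rule above can be expressed as a simple pre-build check. The sketch below mirrors the validation's intent; the settings-dictionary shape and function name are assumptions, not BimlFlex's actual validation code.

```python
def check_scripting_settings(settings: dict) -> None:
    """Raise when both mutually exclusive scripting settings are enabled,
    mirroring validation SET_28002004."""
    if (settings.get("DatabricksUseSqlScripting") == "Y"
            and settings.get("DatabricksUseTemporaryViews") == "Y"):
        raise ValueError(
            "SET_28002004: DatabricksUseSqlScripting and "
            "DatabricksUseTemporaryViews are mutually exclusive"
        )
```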

Unity Catalog Settings

Unity Catalog settings have the single largest impact on generated DDL and table naming.

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksUseUnityCatalog | Y | Critical setting. Changes table naming from underscore-concatenated (schema_table) to dot-separated (schema.table). Enables stored procedures, materialised views, and proper three-part naming (catalog.schema.table). Requires ExternalLocation on the connection when DatabricksUseManagedTables = N. |
| DatabricksUseManagedTables | N | When N, tables are created as external tables and ExternalLocation is required on each Databricks connection. When Y, Databricks manages the storage location. |
| DatabricksUseStoredProcedures | N | Generates stored procedures for transformation logic. Only active when DatabricksUseUnityCatalog = Y. |
| DatabricksMaterialiseCurrentViews | N | Materialises current-state views as tables instead of views. Only active when DatabricksUseUnityCatalog = Y. |
| DatabricksUseCreateCatalog | N | Adds CREATE CATALOG IF NOT EXISTS statements to DDL scripts. Useful for automated environment provisioning. |
| DatabricksTempTableSchema | (none) | Schema name for temporary tables in Unity Catalog. When set, temp tables are created in this schema rather than the default schema. |

Performance Settings

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksAnalyzeTable | N | Runs ANALYZE TABLE after each load to update table statistics. Improves query optimizer decisions. |
| DatabricksOptimizeTable | N | Runs OPTIMIZE after each load to compact small files. Recommended for tables with frequent small writes. |
| DatabricksUseLiquidClustering | N | Enables Liquid Clustering, which replaces traditional partitioning and ZORDER. Liquid Clustering automatically manages data layout and is the recommended approach for new Delta tables. |

DDL Settings

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksAddPrimaryKeys | N | Adds primary key constraints to Delta tables. Databricks uses these as informational constraints for query optimization. |
| DatabricksTableOwner | (none) | Sets the table owner via ALTER TABLE ... SET OWNER TO. Useful for access control in Unity Catalog. |
| DatabricksTableProperties | (none) | Appended as a TBLPROPERTIES clause on every generated CREATE TABLE statement. Example: 'delta.autoOptimize.optimizeWrite' = 'true'. |

Deployment Settings

| Setting | Default | Impact |
| --- | --- | --- |
| DatabricksNotebookPath | /Repos/BimlFlex/@@Repository/Databricks/ | Runtime path where notebooks are expected in the Databricks workspace. The @@Repository placeholder is replaced by the value of DatabricksRepositoryName. |
| DatabricksRepositoryName | YourRepository | Repository name used in the notebook path substitution. Change this to match your Databricks repo or folder structure. |
| DatabricksOutputPath | @@OutputPath\Databricks\ | Local build output directory where generated Databricks artifacts are written. @@OutputPath resolves to the BimlStudio output path. |
| DatabricksGitSource | (none) | Git source reference for DAB bundle configuration. When set, the DAB YAML includes a git source block for CI/CD integration. |
| DatabricksUseDisplayFolder | N | When Y, organizes notebooks into subfolders based on the object's display folder, creating a more structured notebook hierarchy. |
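The two placeholders resolve with straightforward string substitution. A sketch using the defaults from the table above (the function and the sample output path are illustrative assumptions):

```python
def resolve_path(setting_value: str, *,
                 repository: str = "YourRepository",
                 output_path: str = r"C:\BimlFlex\Output") -> str:
    """Resolve the @@Repository and @@OutputPath placeholders.
    The output_path default is a hypothetical BimlStudio output directory."""
    return (setting_value
            .replace("@@Repository", repository)
            .replace("@@OutputPath", output_path))

notebook_path = resolve_path("/Repos/BimlFlex/@@Repository/Databricks/")
```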

Step 6: Import Source Metadata

With connections and settings configured, import metadata from your source system.

ODBC DSN Setup

If your source is a Databricks catalog, configure an ODBC DSN on your local machine:

  1. Install the Simba Databricks ODBC driver
  2. Open ODBC Data Source Administrator and create a System DSN
  3. Enter the Databricks workspace URL, HTTP path, and authentication credentials
  4. Test the connection

For non-Databricks sources (SQL Server, Oracle, etc.), use the standard ODBC or OLE DB connection as you normally would.

Import Process

  1. Navigate to the source Connection in the BimlFlex App
  2. Click Import Metadata to launch the Metadata Importer
  3. Select the tables and views to import
  4. Review the imported schema, columns, and data types
  5. Apply any overrides or exclusions as needed
tip

For additional details on the import process, see the Metadata Importer documentation.

Step 7: Configure Business Keys

After importing metadata, define Business Keys on each source object. Business keys are essential for Data Vault modeling and are used to generate Hub, Link, and Satellite structures.

  1. Open the Objects editor
  2. For each source table, set the columns that form the natural business key
  3. BimlFlex uses these keys to generate Hubs and Links automatically

Step 8: Data Vault and Data Mart on Databricks

Data Vault Naming with Unity Catalog

The DatabricksUseUnityCatalog setting directly affects how Data Vault table names are generated:

| Setting | Generated Name | Format |
| --- | --- | --- |
| DatabricksUseUnityCatalog = Y | raw_vault.hub_customer | schema.table (dot-separated) |
| DatabricksUseUnityCatalog = N | raw_vault_hub_customer | schema_table (underscore-concatenated) |

This naming difference flows through all generated DDL, notebooks, and stored procedures. The source code uses this logic consistently across all target name resolution, including staging, delete detection, and Data Vault objects.

note

When Unity Catalog is enabled, BimlFlex generates proper three-part names (catalog.schema.table), which enables fine-grained access control through Unity Catalog grants. When disabled, the schema is concatenated into the table name because hive_metastore does not support true schemas.

Data Mart

Data Mart configuration on Databricks follows the same patterns described in the Data Mart Configuration documentation. The key differences for Databricks are:

  • Transformations use Spark SQL instead of T-SQL
  • Tables are Delta format with optional Liquid Clustering, OPTIMIZE, and ANALYZE support
  • Star schema patterns (Fact and Dimension tables) work identically at the metadata level

Step 9: System Columns and Delete Detection

BimlFlex automatically adds system columns (audit columns, hash keys, load dates) to generated tables based on your settings configuration.

Delete Detection on Databricks

When delete detection is enabled for source objects, BimlFlex generates dedicated Databricks notebooks that:

  • Compare current source data against previously loaded data
  • Identify deleted records
  • Insert delete-flagged rows into the target tables

With PushdownProcessing enabled, delete detection notebooks are included in the DAB workflow YAML as additional tasks with proper dependency ordering.
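Conceptually, the comparison reduces to a set difference on business keys: anything loaded previously but absent from the current extract is a delete. A stripped-down sketch of the idea (in the real pipelines this runs as generated Spark SQL over Delta tables, not Python sets):

```python
def detect_deleted_keys(previously_loaded: set[str],
                        current_source: set[str]) -> set[str]:
    """Business keys present in an earlier load but missing from the
    current source extract are treated as deletes and flagged in the
    target tables."""
    return previously_loaded - current_source

deleted = detect_deleted_keys({"C001", "C002", "C003"}, {"C001", "C003"})
```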

Step 10: Build and Deploy

Generated Artifacts

When you build the project in BimlStudio, the following artifacts are generated in the output directory:

| Artifact | Description |
| --- | --- |
| Tables/ | DDL scripts (CREATE TABLE, CREATE SCHEMA, etc.) for all Databricks target tables |
| Notebooks/ | Python/SQL notebooks for staging, Data Vault, and Data Mart load logic |
| bfxutils.py | Utility library used by all notebooks for BimlCatalog connectivity. Content differs between ODBC and JDBC mode: the ODBC version uses pyodbc, the JDBC version uses pytds. |
| bfx_init_odbc.sh | Cluster init script that installs the Simba ODBC driver. Generated only in ODBC mode. |
| bfx_setup_secrets.ps1 | PowerShell script with Databricks CLI commands to create the secret scope and store BimlCatalog connection credentials. |
| databricks.yml | Databricks Asset Bundle definition file. Generated only when PushdownProcessing is enabled. |
| resources/*.yml | Per-batch DAB workflow YAML files defining jobs, tasks, and dependencies. Generated only when PushdownProcessing is enabled. |
| ADF ARM templates | Azure Resource Manager templates for deploying ADF pipelines, linked services, and datasets. |

Deployment Sequence

Deploy the generated artifacts in this order:

  1. Deploy DDL scripts -- Run the table creation scripts against your Databricks workspace (via Databricks SQL editor, notebook, or CLI). This creates all schemas, tables, and stored procedures.

  2. Deploy notebooks and DAB bundles -- Use the Databricks CLI to deploy notebooks and workflow definitions:

    ```shell
    # Navigate to your Databricks output directory
    cd output/Databricks

    # Deploy using Databricks Asset Bundles
    databricks bundle deploy --target dev
    ```

  3. Deploy ADF ARM templates -- Deploy the generated ARM templates to your Azure Data Factory instance using the Azure CLI, Azure Portal, or your CI/CD pipeline.

  4. Configure secrets -- Run the generated bfx_setup_secrets.ps1 script to create the Databricks secret scope and store BimlCatalog connection credentials:

    ```powershell
    # Review and customize the generated script first
    ./bfx_setup_secrets.ps1
    ```

    This script uses databricks secrets create-scope and databricks secrets put-secret to store the server, database, username, and password that bfxutils.py uses to connect to the BimlCatalog database at runtime.

warning

The secret scope must be created before running any pipelines. Without it, notebooks will fail when attempting to log execution status to the BimlCatalog database.

tip

For JDBC mode, connection credentials can be stored as either a JSON object ({"server": "host", "database": "db", "user": "u", "password": "p", "port": 1433}) or a traditional ODBC connection string. The bfxutils.py JDBC version automatically parses both formats.
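A sketch of how such dual-format parsing might work. This is not the actual bfxutils.py code; the connection-string key names (Server, Database, Uid, Pwd, Port) are assumptions based on common ODBC conventions.

```python
import json

def parse_catalog_secret(secret_value: str) -> dict:
    """Parse a BimlCatalog credential secret in either supported format:
    a JSON object, or an ODBC-style key=value connection string."""
    try:
        return json.loads(secret_value)
    except json.JSONDecodeError:
        # Fall back to "Key=Value;Key=Value" parsing.
        pairs = dict(item.split("=", 1)
                     for item in secret_value.split(";") if "=" in item)
        return {
            "server": pairs.get("Server"),
            "database": pairs.get("Database"),
            "user": pairs.get("Uid"),       # assumed key names
            "password": pairs.get("Pwd"),
            "port": int(pairs.get("Port", 1433)),
        }

creds = parse_catalog_secret("Server=host;Database=db;Uid=u;Pwd=p")
```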

Validation Checklist

Before running your first pipeline, verify:

  • All DDL scripts have been executed successfully in Databricks
  • Notebooks are deployed to the path specified by DatabricksNotebookPath
  • The Databricks secret scope contains valid BimlCatalog credentials
  • ADF linked services can connect to both the source system and Databricks
  • Azure Blob Storage containers (stage, archive, error) exist and are accessible
  • The ADF Managed Identity (or access token) has the required permissions on the Databricks workspace

Next Steps