
Implementing Databricks Using Azure Data Factory

BimlFlex provides an intuitive process to implement Databricks using Azure Data Factory (ADF) for cloud-based data warehousing solutions. With the BimlFlex 2026 release, significant enhancements have been introduced including Pushdown Processing and SQL Scripting options that dramatically improve performance and reduce costs.

Architecture Overview

BimlFlex supports two primary processing approaches when using Databricks with Azure Data Factory:

Traditional Notebook Approach

The traditional approach uses ADF Copy Activities to land source data in Azure Blob Storage as Parquet files, followed by ADF Notebook Activities that execute Databricks notebooks to load data into tables.

Pushdown Processing Approach

New in BimlFlex 2026, pushdown processing eliminates the need for ADF Notebook Activities by leveraging the Azure Data Factory Databricks Job Activity. This approach:

  • Pushes all transformation logic directly into Databricks workflows
  • Significantly reduces cluster spin-up overhead
  • Supports compute clusters, job clusters, and serverless compute
  • Generates native Databricks workflows handling dependencies, delta detection, and restart logic
  • Supports smart restartability that skips already-completed ADF copy activities on retry

Performance Benefits

Early benchmarks show pipelines completing in under 30 minutes on job clusters, compared with over two hours on larger dedicated clusters, a reduction of up to 75% in runtime and the associated compute cost.

Pushdown Processing

With pushdown processing enabled, all transformations are executed within Databricks workflows or jobs rather than being orchestrated externally through ADF notebook activities. This approach offers several advantages:

  • Lower Runtime Costs: Pay only for actual compute time without idle cluster overhead
  • Simplified Orchestration: Fewer ADF activities to manage and monitor
  • Better Resource Utilization: Leverage ephemeral job clusters for efficient resource usage
  • Native Artifacts: All generated pipelines, jobs, and notebooks remain fully native to Databricks and ADF

Compute and Connectivity Options

BimlFlex supports two connectivity modes for Databricks notebooks to communicate with the BimlCatalog database:

| Mode | Configuration | Compute Support | Driver Requirement |
| --- | --- | --- | --- |
| ODBC (default) | DatabricksUtilsDriver = ODBC | Compute clusters, job clusters | Requires the bfx_init_odbc.sh init script on the cluster |
| JDBC (pytds) | DatabricksUtilsDriver = JDBC | Compute clusters, job clusters, serverless compute | No driver installation required |

When JDBC mode is enabled, BimlFlex generates a bfxutils.py that uses the pure-Python pytds library instead of pyodbc. This removes the dependency on cluster-level ODBC driver installation and enables serverless compute, which does not support init scripts or custom driver installation.
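As an illustrative sketch of what this enables (the helper names below are assumptions for this example, not the actual contents of the generated bfxutils.py), the JSON credential document maps directly onto pytds.connect keyword arguments:

```python
def catalog_connection_args(creds: dict) -> dict:
    # Map the JSON secret document to pytds.connect keyword arguments.
    # pytds takes the server host as `dsn`; the port defaults to 1433.
    return {
        "dsn": creds["server"],
        "database": creds["database"],
        "user": creds["user"],
        "password": creds["password"],
        "port": creds.get("port", 1433),
    }

def connect_to_catalog(creds: dict):
    import pytds  # pure-Python TDS client; no cluster-level driver install
    return pytds.connect(**catalog_connection_args(creds))
```

Because pytds is pure Python, this style of connection works on serverless compute, where init scripts and custom ODBC driver installation are unavailable.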

To use serverless compute, set the DatabricksJobCluster setting to Serverless for the relevant objects. BimlFlex will automatically:

  • Omit cluster assignment from the Databricks Asset Bundle task definitions
  • Reference a bfx_jdbc environment with the required Python dependencies (python-tds, pyOpenSSL, certifi)
  • Set disable_auto_optimization: true on serverless tasks

For job clusters (non-serverless) with JDBC mode, the required libraries are attached as PyPI dependencies directly on each task instead.
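The serverless versus job-cluster difference can be sketched as follows. This is an illustrative model of the generated bundle task definitions, not BimlFlex's actual generator code; the job cluster key is a hypothetical placeholder:

```python
# JDBC-mode Python dependencies named in the documentation above.
JDBC_DEPS = ["python-tds", "pyOpenSSL", "certifi"]

def build_task(task_key: str, serverless: bool) -> dict:
    task = {"task_key": task_key}
    if serverless:
        # Serverless: omit any cluster assignment, reference the shared
        # bfx_jdbc environment, and disable auto-optimization.
        task["environment_key"] = "bfx_jdbc"
        task["disable_auto_optimization"] = True
    else:
        # Job cluster: attach the JDBC libraries as PyPI dependencies
        # directly on the task.
        task["job_cluster_key"] = "bfx_job_cluster"  # hypothetical key
        task["libraries"] = [{"pypi": {"package": p}} for p in JDBC_DEPS]
    return task

serverless_task = build_task("stg_customer", serverless=True)
cluster_task = build_task("stg_customer", serverless=False)
```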

Restartability

Pushdown processing includes built-in restart logic that avoids redundant work when a pipeline is re-executed after a failure. This is particularly important for long-running incremental loads where ADF copy activities may have already completed successfully before the Databricks job failed.

When a Databricks job activity fails after the preceding ADF copy activities have succeeded, the framework sets NextLoadStatus='D' (Databricks restart) instead of the standard 'R' (retry). On the next execution:

  1. The LogExecutionStart stored procedure detects the 'D' status and returns ExecutionStatus='D' along with the LastExecutionID from the previous run
  2. The ADF pipeline evaluates the ExecutionStatus — when it is 'D', the row_audit_id parameter passed to the Databricks job resolves to LastExecutionID rather than the current ExecutionID
  3. The Databricks notebooks use the original row_audit_id to locate the data that was already landed by the previous copy activities, and re-execute only the transformation logic

This means that a restart after a Databricks job failure does not re-run the ADF copy activities. The already-landed data is reused and the Databricks workflow resumes from the point of failure.
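The parameter resolution in step 2 can be sketched as a small function (an illustrative model of the ADF expression, with assumed parameter names):

```python
def resolve_row_audit_id(execution_status: str,
                         current_execution_id: int,
                         last_execution_id: int) -> int:
    # 'D' (Databricks restart): reuse the previous run's audit id so the
    # notebooks locate the data already landed by the earlier copy
    # activities. Any other status: use the current execution id.
    if execution_status == "D":
        return last_execution_id
    return current_execution_id
```

For example, if run 101 landed its files but failed in the Databricks job, the retry (run 102) would pass `resolve_row_audit_id("D", 102, 101)`, i.e. 101, to the Databricks job.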

Secrets Configuration

BimlFlex generates a bfx_setup_secrets.ps1 sample file alongside the Databricks Asset Bundle output. This file documents the Databricks CLI commands needed to create the secret scope and store connection credentials for the BimlCatalog database.

When using JDBC mode, connection credentials can be stored as either:

  • JSON format (recommended for JDBC): {"server": "host", "database": "db", "user": "u", "password": "p", "port": 1433}
  • ODBC connection string format: Existing ODBC connection strings are automatically parsed by bfxutils.py

Both formats are stored in Databricks secrets using databricks secrets put-secret. See the generated bfx_setup_secrets.ps1 file for the complete setup steps.
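The dual-format handling can be illustrated with a minimal parser. This is a sketch of the behavior described above, not the actual bfxutils.py implementation, and the ODBC key names covered here are a simplified subset:

```python
import json

def parse_catalog_secret(secret: str) -> dict:
    """Accept either the JSON credential document or an ODBC
    connection string, returning a uniform credentials dict."""
    secret = secret.strip()
    if secret.startswith("{"):
        creds = json.loads(secret)
        creds.setdefault("port", 1433)
        return creds
    # ODBC connection string: semicolon-separated Key=Value pairs.
    pairs = dict(p.split("=", 1) for p in secret.split(";") if "=" in p)
    return {
        "server": pairs.get("Server", "").replace("tcp:", "").split(",")[0],
        "database": pairs.get("Database") or pairs.get("Initial Catalog"),
        "user": pairs.get("Uid") or pairs.get("User ID"),
        "password": pairs.get("Pwd") or pairs.get("Password"),
        "port": 1433,
    }
```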

Lakehouse Medallion Architecture Support

BimlFlex supports the medallion architecture pattern for Databricks Lakehouse implementations. The pushdown processing and SQL scripting options apply across all layers:

| Layer | BimlFlex Implementation | Databricks Components |
| --- | --- | --- |
| Bronze | Staging + Persistent Staging | Landing in Blob/ADLS, Delta tables for raw data |
| Silver | Data Vault or Normal Form | Unity Catalog managed Delta tables |
| Gold | Data Mart / Dimensional | Optimized Delta tables for analytics |

Bronze Layer

Raw data lands in Azure Blob Storage or ADLS as Parquet files, then loads to Delta tables. BimlFlex manages:

  • Staging tables for current batch processing
  • Persistent Staging Area for historical retention

Silver Layer

BimlFlex supports two approaches:

  • Data Vault (recommended): Hub, Link, and Satellite patterns with full history
  • Normal Form: Traditional relational modeling

Gold Layer

Dimensional models optimized for analytics:

  • Star schema patterns with Fact and Dimension tables
  • Delta Lake optimizations (Z-ordering, partitioning)

Tip: For detailed guidance on implementing medallion architecture, see the Delivering Lakehouse documentation.

SQL Scripting Option

New in BimlFlex 2026, a configuration option enables native SQL-based scripting for Databricks workloads. This provides:

  • Greater Readability: SQL-based templates are easier to review and understand
  • Easier Debugging: Familiar SQL syntax simplifies troubleshooting
  • SQL-Centric Development: Aligns with teams preferring SQL over Python/Scala approaches

Metadata-driven templates now support generating staging, Data Vault, and Data Mart patterns directly in SQL while still leveraging Databricks' scalability and performance.
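A metadata-driven SQL template can be pictured as a parameterized pattern rendered per object. The template text, schema names, and function below are invented for illustration and are not BimlFlex's actual templates:

```python
# Hypothetical staging merge pattern in Databricks SQL; the "_landing"
# schema suffix and column metadata are assumptions for this sketch.
STAGING_MERGE_TEMPLATE = """MERGE INTO {catalog}.{schema}.{table} AS tgt
USING {catalog}.{schema}_landing.{table} AS src
ON {join_condition}
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *"""

def render_staging_merge(catalog: str, schema: str, table: str,
                         keys: list[str]) -> str:
    # Build the join predicate from the object's business keys.
    join_condition = " AND ".join(f"tgt.{k} = src.{k}" for k in keys)
    return STAGING_MERGE_TEMPLATE.format(
        catalog=catalog, schema=schema, table=table,
        join_condition=join_condition)

sql = render_staging_merge("bfx", "stg", "customer", ["CustomerID"])
```

Because the output is plain SQL, the generated pattern can be reviewed and debugged like any hand-written Databricks SQL statement.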

Prerequisites

Before implementing Databricks with ADF, ensure you have completed the following:

  1. Databricks Configuration: Complete the setup outlined in the Databricks Configuration Overview
  2. Azure Storage: Configure blob storage for landing, staging, archive, and error containers
  3. Linked Services: Create and configure the Databricks linked service in BimlFlex

Note: Detailed prerequisites and configuration steps are provided in the Databricks Configuration Overview section.

Configuring Databricks in BimlFlex

Loading Sample Metadata

BimlFlex provides sample metadata specifically designed for Databricks with Azure Data Factory. Load the sample from the Dashboard by selecting from the Load Sample Metadata dropdown.

Note: For more information on lakehouse and data modeling implementations, see the lakehouse and data modeling sections of the BimlFlex documentation.

Connection Configuration

Configure your Databricks connections from within the BimlFlex Connections editor:

Source System Connection:

  • Enable Cloud option for the source system
  • Configure Staging / Landing Environment for Blob Storage with ADF linked services

Databricks Connection:

  • Set System Type to Databricks Data Warehouse
  • Set Linked Service Type to Databricks
  • Configure Integration Template to ADF Source -> Target

Batch Configuration

Prior to building your solution, configure batches from the BimlFlex Batches editor to:

  • Assign batches to different compute resources
  • Configure scaling parameters
  • Set execution priorities

Generated Output

With metadata imported, BimlFlex generates a complete Databricks solution. All generated artifacts are fully native to Databricks and Azure Data Factory, with no proprietary runtime or execution engine required.

BimlFlex generates the following artifacts:

  • Table Definitions: DDL scripts for creating Databricks tables
  • Stored Procedures: SQL procedures for data transformation logic
  • Notebooks/Workflows: Databricks notebooks or workflow definitions (depending on processing mode)
  • ADF Pipelines: Azure Data Factory orchestration artifacts ready to deploy

Deployed Solution

Once deployed to Azure Data Factory, the solution provides:

  • Visual pipeline representation
  • Monitoring and logging capabilities
  • Error handling with automatic file archiving

Monitoring and Management

After deployment, you can:

  • Scale compute resources up or down
  • View copy command completions and errors
  • Suspend or resume solution execution
  • Monitor execution status and performance

Note: Files that encounter errors are automatically moved to an error folder. On subsequent runs, files handled by earlier executions have already been archived or moved, so they are not processed again.

Worked Example: Source to Delta Lake Staging

This example shows the key configuration for a SQL Server → Databricks staging pipeline.

Connection Configuration

| Connection | Connection Type | System Type | Integration Stage | Key Settings |
| --- | --- | --- | --- | --- |
| AWLT_SRC | OLEDB | SQL Server | Source System | Standard on-prem SQL Server source |
| BFX_LND | ADONET | Azure Blob Storage | Landing Area | Blob container for ADF Copy Activity output |
| BFX_STG_DBR | ADONET | Databricks | Staging Area | Databricks workspace URL, catalog, schema |

Project Configuration

| Field | Value |
| --- | --- |
| Project | EXT_AWLT_DBR |
| Integration Template | Databricks (DBR) |
| Source Connection | AWLT_SRC |
| Target Connection | BFX_STG_DBR |
| Pushdown Processing | Enabled |

What BimlFlex Generates

For each source object, BimlFlex produces:

  1. An ADF pipeline that orchestrates the end-to-end flow
  2. A Copy Activity that extracts data from the source and lands it in Azure Blob Storage
  3. A Databricks notebook activity that reads from the landing area and writes to a Delta table in the configured catalog and schema

The generated artifacts are organized in ADF under the project folder (EXT_AWLT_DBR), with one pipeline per source object. Databricks notebooks are placed in the workspace path configured by the DatabricksNotebookPath setting (default: /Repos/BimlFlex/@@Repository/Databricks/).

For detailed settings, see Databricks Configuration.
