Implementing Databricks Using Azure Data Factory
BimlFlex provides an intuitive process to implement Databricks using Azure Data Factory (ADF) for cloud-based data warehousing solutions. The BimlFlex 2026 release introduces significant enhancements, including Pushdown Processing and SQL Scripting options, that dramatically improve performance and reduce costs.
Architecture Overview
BimlFlex supports two primary processing approaches when using Databricks with Azure Data Factory:
Traditional Notebook Approach
The traditional approach uses ADF Copy Activities to land source data in Azure Blob Storage as Parquet files, followed by ADF Notebook Activities that execute Databricks notebooks to load data into tables.
Pushdown Processing (Recommended)
New in BimlFlex 2026, pushdown processing eliminates the need for ADF Notebook Activities by leveraging the Azure Data Factory Databricks Job Activity. This approach:
- Pushes all transformation logic directly into Databricks workflows
- Significantly reduces cluster spin-up overhead
- Supports compute clusters, job clusters, and serverless compute
- Generates native Databricks workflows handling dependencies, delta detection, and restart logic
- Supports smart restartability that skips already-completed ADF copy activities on retry
Early benchmarks demonstrate pipelines completing in under 30 minutes using job clusters, compared to over two hours on larger dedicated clusters—representing up to a 75% reduction in runtime cost.
Pushdown Processing
With pushdown processing enabled, all transformations are executed within Databricks workflows or jobs rather than being orchestrated externally through ADF notebook activities. This approach offers several advantages:
- Lower Runtime Costs: Pay only for actual compute time without idle cluster overhead
- Simplified Orchestration: Fewer ADF activities to manage and monitor
- Better Resource Utilization: Leverage ephemeral job clusters for efficient resource usage
- Native Artifacts: All generated pipelines, jobs, and notebooks remain fully native to Databricks and ADF
Compute and Connectivity Options
BimlFlex supports two connectivity modes for Databricks notebooks to communicate with the BimlCatalog database:
| Mode | Configuration | Compute Support | Driver Requirement |
|---|---|---|---|
| ODBC (default) | DatabricksUtilsDriver = ODBC | Compute clusters, job clusters | Requires bfx_init_odbc.sh init script on the cluster |
| JDBC (pytds) | DatabricksUtilsDriver = JDBC | Compute clusters, job clusters, serverless compute | No driver installation required |
When JDBC mode is enabled, BimlFlex generates a bfxutils.py that uses the pure-Python pytds library instead of pyodbc. This removes the dependency on cluster-level ODBC driver installation and enables serverless compute, which does not support init scripts or custom driver installation.
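The driver selection described above can be sketched as a small helper. This is an illustrative sketch only, not the actual BimlFlex-generated code; the function name is hypothetical, while the setting values and driver behaviors come from the table above.

```python
# Hypothetical sketch: how a generated utility module might pick its
# SQL Server driver from the DatabricksUtilsDriver setting.
def resolve_driver(databricks_utils_driver: str) -> str:
    """Return the Python module used to reach the BimlCatalog database."""
    mode = databricks_utils_driver.strip().upper()
    if mode == "JDBC":
        # pure-Python pytds: no cluster-level driver install, so it also
        # works on serverless compute
        return "pytds"
    if mode == "ODBC":
        # pyodbc requires the bfx_init_odbc.sh init script on the cluster,
        # which serverless compute does not support
        return "pyodbc"
    raise ValueError(f"Unknown DatabricksUtilsDriver value: {databricks_utils_driver!r}")
```

The key design point is that the JDBC path has no native-library dependency, which is what unlocks serverless compute.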
To use serverless compute, set the DatabricksJobCluster setting to Serverless for the relevant objects. BimlFlex will automatically:
- Omit cluster assignment from the Databricks Asset Bundle task definitions
- Reference a bfx_jdbc environment with the required Python dependencies (python-tds, pyOpenSSL, certifi)
- Set disable_auto_optimization: true on serverless tasks
For job clusters (non-serverless) with JDBC mode, the required libraries are attached as PyPI dependencies directly on each task instead.
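The serverless-versus-job-cluster branching above can be sketched as follows. This is an assumed illustration of the generator's decision, not BimlFlex source code; the dictionary key names beyond those quoted in the text (environment_key, job_cluster_key, libraries) follow common Databricks Asset Bundle conventions but are assumptions here.

```python
# Illustrative sketch of the task-level output for JDBC mode, depending
# on the DatabricksJobCluster setting.
def build_task_config(job_cluster_setting: str) -> dict:
    jdbc_libraries = ["python-tds", "pyOpenSSL", "certifi"]
    if job_cluster_setting == "Serverless":
        # Serverless: no cluster assignment; dependencies come from the
        # bfx_jdbc environment, and auto optimization is disabled.
        return {
            "environment_key": "bfx_jdbc",
            "disable_auto_optimization": True,
        }
    # Non-serverless job cluster: attach the libraries as per-task PyPI deps.
    return {
        "job_cluster_key": job_cluster_setting,
        "libraries": [{"pypi": {"package": p}} for p in jdbc_libraries],
    }
```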
Restartability
Pushdown processing includes built-in restart logic that avoids redundant work when a pipeline is re-executed after a failure. This is particularly important for long-running incremental loads where ADF copy activities may have already completed successfully before the Databricks job failed.
When a Databricks job activity fails after the preceding ADF copy activities have succeeded, the framework sets NextLoadStatus='D' (Databricks restart) instead of the standard 'R' (retry). On the next execution:
- The LogExecutionStart stored procedure detects the 'D' status and returns ExecutionStatus='D' along with the LastExecutionID from the previous run
- The ADF pipeline evaluates the ExecutionStatus; when it is 'D', the row_audit_id parameter passed to the Databricks job resolves to LastExecutionID rather than the current ExecutionID
- The Databricks notebooks use the original row_audit_id to locate the data that was already landed by the previous copy activities, and re-execute only the transformation logic
This means that a restart after a Databricks job failure does not re-run the ADF copy activities. The already-landed data is reused and the Databricks workflow resumes from the point of failure.
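The parameter resolution described above reduces to a small rule. The helper below is a hypothetical sketch of the logic (the function name is invented); the status codes and identifiers mirror the NextLoadStatus, ExecutionStatus, ExecutionID, and LastExecutionID semantics described in this section.

```python
# Sketch of the restart rule: on a Databricks restart ('D'), reuse the
# audit id of the previous run so the notebooks find the already-landed
# data; otherwise use the current execution's id.
def resolve_row_audit_id(execution_status, current_execution_id, last_execution_id=None):
    """Pick the row_audit_id passed to the Databricks job."""
    if execution_status == "D" and last_execution_id is not None:
        # 'D' = Databricks restart: copy activities already succeeded,
        # so point the workflow at the previously landed data
        return last_execution_id
    # Normal run or standard retry ('R'): use the current execution
    return current_execution_id
```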
Secrets Configuration
BimlFlex generates a bfx_setup_secrets.ps1 sample file alongside the Databricks Asset Bundle output. This file documents the Databricks CLI commands needed to create the secret scope and store connection credentials for the BimlCatalog database.
When using JDBC mode, connection credentials can be stored as either:
- JSON format (recommended for JDBC): {"server": "host", "database": "db", "user": "u", "password": "p", "port": 1433}
- ODBC connection string format: Existing ODBC connection strings are automatically parsed by bfxutils.py
Both formats are stored in Databricks secrets using databricks secrets put-secret. See the generated bfx_setup_secrets.ps1 file for the complete setup steps.
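To illustrate how a utility like bfxutils.py could accept either secret format, here is a minimal sketch. The actual parsing logic inside the generated bfxutils.py is not shown in this document, so the function below is an assumption; only the two input formats themselves come from the list above.

```python
import json

# Hypothetical sketch: normalize either secret format into one dict
# with server/database/user/password/port keys.
def parse_catalog_secret(secret: str) -> dict:
    text = secret.strip()
    if text.startswith("{"):
        # JSON format (recommended for JDBC)
        return json.loads(text)
    # ODBC-style connection string, e.g. "Server=host;Database=db;Uid=u;Pwd=p"
    keymap = {"server": "server", "database": "database",
              "uid": "user", "user": "user",
              "pwd": "password", "password": "password"}
    out = {}
    for part in text.split(";"):
        if "=" not in part:
            continue
        key, value = part.split("=", 1)
        mapped = keymap.get(key.strip().lower())
        if mapped:
            out[mapped] = value.strip()
    out.setdefault("port", 1433)  # default SQL Server port
    return out
```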
Lakehouse Medallion Architecture Support
BimlFlex supports the medallion architecture pattern for Databricks Lakehouse implementations. The pushdown processing and SQL scripting options apply across all layers:
| Layer | BimlFlex Implementation | Databricks Components |
|---|---|---|
| Bronze | Staging + Persistent Staging | Landing in Blob/ADLS, Delta tables for raw data |
| Silver | Data Vault or Normal Form | Unity Catalog managed Delta tables |
| Gold | Data Mart / Dimensional | Optimized Delta tables for analytics |
Bronze Layer
Raw data lands in Azure Blob Storage or ADLS as Parquet files, then loads to Delta tables. BimlFlex manages:
- Staging tables for current batch processing
- Persistent Staging Area for historical retention
Silver Layer
BimlFlex supports two approaches:
- Data Vault (recommended): Hub, Link, and Satellite patterns with full history
- Normal Form: Traditional relational modeling
Gold Layer
Dimensional models optimized for analytics:
- Star schema patterns with Fact and Dimension tables
- Delta Lake optimizations (Z-ordering, partitioning)
For detailed guidance on implementing medallion architecture, see the Delivering Lakehouse documentation.
SQL Scripting Option
New in BimlFlex 2026, a configuration option enables native SQL-based scripting for Databricks workloads. This provides:
- Greater Readability: SQL-based templates are easier to review and understand
- Easier Debugging: Familiar SQL syntax simplifies troubleshooting
- SQL-Centric Development: Aligns with teams preferring SQL over Python/Scala approaches
Metadata-driven templates now support generating staging, Data Vault, and Data Mart patterns directly in SQL while still leveraging Databricks' scalability and performance.
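The metadata-driven idea can be sketched as a template rendered from object metadata into a native SQL statement. The template text and metadata shape below are illustrative assumptions, not BimlFlex's actual templates; they show only the general pattern of generating staging SQL from metadata.

```python
# Hypothetical staging template: load a Delta table from landed Parquet
# files, stamping each row with a load timestamp.
STAGING_TEMPLATE = """INSERT INTO {catalog}.{schema}.{table}
SELECT {columns}, current_timestamp() AS load_datetime
FROM parquet.`{landing_path}`"""

def render_staging_sql(meta: dict) -> str:
    """Render the staging statement from one object's metadata."""
    return STAGING_TEMPLATE.format(
        catalog=meta["catalog"],
        schema=meta["schema"],
        table=meta["table"],
        columns=", ".join(meta["columns"]),
        landing_path=meta["landing_path"],
    )
```

Because the output is plain SQL, the generated statement can be reviewed and debugged directly in the Databricks SQL editor, which is the readability benefit the option targets.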
Prerequisites
Before implementing Databricks with ADF, ensure you have completed the following:
- Databricks Configuration: Complete the setup outlined in the Databricks Configuration Overview
- Azure Storage: Configure blob storage for landing, staging, archive, and error containers
- Linked Services: Create and configure the Databricks linked service in BimlFlex
Detailed prerequisites and configuration steps are provided in the Databricks Configuration Overview section.
Configuring Databricks in BimlFlex
Loading Sample Metadata
BimlFlex provides sample metadata specifically designed for Databricks with Azure Data Factory. Load the sample from the Dashboard by selecting from the Load Sample Metadata dropdown.
Connection Configuration
Configure your Databricks connections from within the BimlFlex Connections editor:
Source System Connection:
- Enable Cloud option for the source system
- Configure Staging / Landing Environment for Blob Storage with ADF linked services
Databricks Connection:
- Set System Type to Databricks Data Warehouse
- Set Linked Service Type to Databricks
- Configure Integration Template to ADF Source -> Target
Batch Configuration
Prior to building your solution, configure batches from the BimlFlex Batches editor to:
- Assign batches to different compute resources
- Configure scaling parameters
- Set execution priorities
Generated Output
With metadata imported, BimlFlex generates a complete Databricks solution. All generated artifacts are fully native to Databricks and Azure Data Factory, with no proprietary runtime or execution engine required.
BimlFlex generates the following artifacts:
- Table Definitions: DDL scripts for creating Databricks tables
- Stored Procedures: SQL procedures for data transformation logic
- Notebooks/Workflows: Databricks notebooks or workflow definitions (depending on processing mode)
- ADF Pipelines: Azure Data Factory orchestration artifacts ready to deploy
Deployed Solution
Once deployed to Azure Data Factory, the solution provides:
- Visual pipeline representation
- Monitoring and logging capabilities
- Error handling with automatic file archiving
Monitoring and Management
After deployment, you can:
- Scale compute resources up or down
- View copy command completions and errors
- Suspend or resume solution execution
- Monitor execution status and performance
Files that encounter errors are automatically moved to the error container; successfully processed files are archived so that subsequent runs do not pick them up again.
Worked Example: Source to Delta Lake Staging
This example shows the key configuration for a SQL Server → Databricks staging pipeline.
Connection Configuration
| Connection | Connection Type | System Type | Integration Stage | Key Settings |
|---|---|---|---|---|
| AWLT_SRC | OLEDB | SQL Server | Source System | Standard on-prem SQL Server source |
| BFX_LND | ADONET | Azure Blob Storage | Landing Area | Blob container for ADF Copy Activity output |
| BFX_STG_DBR | ADONET | Databricks | Staging Area | Databricks workspace URL, catalog, schema |
Project Configuration
| Field | Value |
|---|---|
| Project | EXT_AWLT_DBR |
| Integration Template | Databricks (DBR) |
| Source Connection | AWLT_SRC |
| Target Connection | BFX_STG_DBR |
| Pushdown Processing | Enabled |
What BimlFlex Generates
For each source object, BimlFlex produces:
- An ADF pipeline that orchestrates the end-to-end flow
- A Copy Activity that extracts data from the source and lands it in Azure Blob Storage
- A Databricks notebook activity that reads from the landing area and writes to a Delta table in the configured catalog and schema
The generated artifacts are organized in ADF under the project folder (EXT_AWLT_DBR), with one pipeline per source object. Databricks notebooks are placed in the workspace path configured by the DatabricksNotebookPath setting (default: /Repos/BimlFlex/@@Repository/Databricks/).
For detailed settings, see Databricks Configuration.