Preparing BimlFlex for Databricks

Before working with Databricks metadata, you should ensure that your environment is configured to use Databricks appropriately. This section will walk you through the required software and system configurations you will need to connect BimlFlex to your Databricks environment.

Getting Started with Sample Metadata

BimlFlex includes Databricks starter metadata that creates sample connections and metadata to help you start experimenting right away. To load the sample metadata:

  1. Open BimlFlex
  2. From the Dashboard, select the Load Sample Metadata dropdown
  3. Choose the Databricks sample that matches your target architecture

The sample metadata creates connections for each layer of the lakehouse:

  • Source Connection: Points to a sample source database (e.g., AdventureWorks)
  • Landing Connection: Where raw data is stored when first extracted
  • Staging Connection: Where data is processed before loading to integration layers
  • Persistent Staging Connection: Optional historized bronze layer
  • Data Vault Connection: Silver layer for historization and traceability
  • Data Mart Connection: Gold layer for analytics and reporting

Installing and Configuring a Databricks ODBC DSN

When Databricks data sources are part of the solution, an ODBC driver and DSN are required so that the Metadata Importer can bring Databricks metadata into the BimlFlex framework.

tip

For additional details on installing and configuring an ODBC driver and DSN, refer to the Databricks documentation.

Configuring BimlFlex for Databricks

BimlFlex uses Settings to adapt to specific requirements for file locations, naming conventions, data conventions, etc.

Align these settings with the organization's best practices and environmental requirements.

Databricks Settings

The following Databricks-specific settings are used to configure overall orchestration options for Databricks.

These settings define the connectivity configuration from BimlFlex to Databricks, and are used in the resulting data solution when deployed to the target environment. For example, these settings are used to define the Linked Service and file locations.

danger

Please configure and review the above settings carefully, as they are essential for correct connectivity to Databricks in the compiled data solution.

Azure Storage Processing Settings

The following Azure Settings are used to configure the blob storage destinations. Configuring them is mandatory when using Databricks.

Settings

Setting Key | Setting Description
Archive Container | The Container Name to use for the archive process
Archive Account Name | The Azure Blob Storage Account Name to use for archiving data as files in blob storage
Archive Account Key | A Storage access key to use when accessing the Blob storage
Archive SAS Token | A Storage access SAS Token to use when accessing the Blob storage
Error Container | The Container Name to use for the Error process
Error Account Name | The Azure Blob Storage Account Name to use for error files in blob storage
Error Account Key | A Storage access key to use when accessing the Blob storage
Error SAS Token | A Storage access SAS Token to use when accessing the Blob storage
Stage Container | The Container Name to use for the staging process
Stage Account Name | The Azure Blob Storage Account Name to use for staging data as files in blob storage
Stage Account Key | A Storage access key to use when accessing the Blob storage
Stage SAS Token | A Storage access SAS Token to use when accessing the Blob storage
Blob Storage Domain | The AzCopy domain to use

Examples

Setting Key | Example Value
Archive Container | archive
Archive Account Name | bfxblobaccount
Archive Account Key | <StorageAccountKey>==
Archive SAS Token | ?<SasToken>
Error Container | error
Error Account Name | bfxblobaccount
Error Account Key | <StorageAccountKey>==
Error SAS Token | ?<SasToken>
Stage Container | stage
Stage Account Name | bfxblobaccount
Stage Account Key | <StorageAccountKey>==
Stage SAS Token | ?<SasToken>
Blob Storage Domain | blob.core.windows.net

tip

For additional details, please refer to the Microsoft guide on creating an account SAS.


Object File Path Configuration

In addition to the global storage settings, BimlFlex allows granular file path configuration at the Object level. These settings are configured in the Object Editor and override the default paths for individual objects.

Settings

Setting Key | Setting Description
Source File Path | The path where source files are located. Used when the Object represents a file-based source.
Source File Pattern | The naming pattern for source files. Supports expressions for dynamic file matching.
Source File Filter | The filter applied to the Filter Activity after the Metadata Activity to select specific files.
Landing File Path | The path where extracted data files should be placed during the landing process.
Landing File Pattern | The naming pattern for landing files. Supports expressions including @@this for the object name and pipeline().parameters for dynamic values.
Persistent Landing File Path | The file path where raw landing files are stored persistently for historical tracking and reprocessing.
Persistent Landing File Pattern | The naming pattern for files stored in the Persistent Landing File Path. May include date, time, or other variable elements.

Examples

Setting Key | Example Value
Source File Path | @@rs\@@schema\@@object or \\myadls.file.core.windows.net\myshare\source
Source File Pattern | sales_*.parquet or @string('Account*.csv')
Source File Filter | @greater(item().lastModified, pipeline().parameters.LastLoadDate) or @and(startswith(item().name, 'Account'), endswith(item().name, '.csv'))
Landing File Path | @@this/Landing/
Landing File Pattern | @concat('@@this', '_', replace(replace(formatDateTime(pipeline().parameters.BatchStartTime, 'yyyy-MM-ddTHH:mm:ss'), ':', ''), '-', ''), '.parquet')
Persistent Landing File Path | @concat('@@this/Archive/', formatDateTime(pipeline().parameters.BatchStartTime, 'yyyy/MM/dd/'))
Persistent Landing File Pattern | @@this_@@timestamp.parquet

note

The @@this placeholder is automatically replaced with the Object name at runtime. These settings apply per-object and take precedence over connection-level defaults.

Databricks Table Settings

BimlFlex provides Databricks-specific settings at the Object level for controlling how tables are created in Databricks. These settings map directly to Databricks CREATE TABLE statement clauses.

Settings

Setting Key | Setting Description
Partitioned By Clause | An optional clause to partition the table by a subset of columns. Improves query performance for filtered queries.
Location Clause | An optional path to the directory where table data is stored, which could be a path on distributed storage (e.g., DBFS, ADLS, S3).
Cluster By Clause | Optionally clusters the table, or each partition, into a fixed number of hash buckets using a subset of columns. Improves join and aggregation performance.
Table Properties | Optionally sets one or more user-defined table properties using TBLPROPERTIES.
Option Clause | Sets or resets one or more user-defined table options using the OPTIONS clause.

Examples

Setting Key | Example Value
Partitioned By Clause | PARTITIONED BY (ModifiedDate)
Location Clause | LOCATION 'abfss://bfx-lnd@lakehouse.dfs.core.windows.net/awlt/Address'
Cluster By Clause | CLUSTER BY (CustomerKey)
Table Properties | TBLPROPERTIES (delta.autoOptimize.autoCompact = true)
Option Clause | OPTIONS ('delta.logRetentionDuration' = '30 days')

tip

For Delta Lake tables, it is recommended to add TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true) to enable automatic optimization.
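
To illustrate how these Object-level clauses combine, the following is a minimal sketch of a Databricks CREATE TABLE statement. The catalog, schema, table, columns, and storage path are hypothetical placeholders; the actual DDL generated by BimlFlex may differ.

    -- Minimal sketch only; bfx_stage, awlt, Address, and the columns are hypothetical placeholders.
    CREATE TABLE IF NOT EXISTS bfx_stage.awlt.Address
    (
      AddressID    INT,
      City         STRING,
      ModifiedDate TIMESTAMP
    )
    USING DELTA
    PARTITIONED BY (ModifiedDate)
    LOCATION 'abfss://bfx-lnd@lakehouse.dfs.core.windows.net/awlt/Address'
    TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true);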

Configuring a Databricks Connection

This section outlines any specific considerations needed when configuring BimlFlex to use Databricks across the various Integration Stages.

Field | Supported Values | Guide
Integration Stage | Source System*, Staging Area, Persistent Staging Area, Data Vault, Data Mart | Details
Connection Type | ODBC*, ODBC SQL Based ELT* | Details
System Type | Databricks Data Warehouse | Details
Connection String | Dsn={DSN Name};Uid={User Name};Pwd={Password};Database={Database Name};* | Details
Linked Service Type | Databricks | Details

Integration Stage

BimlFlex provides support for the use of Databricks as both a target warehouse platform and as a Source System. It is highly recommended to use a single database when using Databricks as a target warehouse platform.

Naming patterns and schemas can be used for separation as needed.
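
For example, schema-based separation of the layers within a single database could be sketched as follows; the catalog and schema names are hypothetical placeholders and should be aligned with your own naming conventions.

    -- Hypothetical schema-per-layer separation inside a single Databricks catalog.
    CREATE SCHEMA IF NOT EXISTS bfx_lakehouse.stage;   -- Staging Area
    CREATE SCHEMA IF NOT EXISTS bfx_lakehouse.pstage;  -- Persistent Staging Area
    CREATE SCHEMA IF NOT EXISTS bfx_lakehouse.rdv;     -- Data Vault
    CREATE SCHEMA IF NOT EXISTS bfx_lakehouse.dm;      -- Data Mart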

The generated DDL does not currently support USE DATABASE or USE WAREHOUSE statements.

Deployment would therefore require either 'Cherry Picking' the correct tables for each database, or using the Generate Script Options and scoping to a specific Integration Stage before each generation of the scripts.

Landing Area

Databricks is not currently supported as a Landing Area, but a Landing Area is required when using ADF to orchestrate data movement.

The recommendation is for the Landing Area to be a Connection Type of Azure Blob Storage and System Type of Flat File Delimited.

In addition to the Landing Area, it is also important that the Settings for the Azure Blob Stage Container, Azure Blob Archive Container, and Azure Blob Error Container are populated correctly.

danger

Ensure that all Azure Blob Container Settings are configured properly, as described in the Azure Storage Processing Settings section above.

Connection Type

Currently, BimlFlex only supports the Connection Type of ODBC for a Databricks Connection when it is used as the target data warehouse platform. When configuring a Databricks Connection with an Integration Stage of Source System, please also set the Connection Type to ODBC.

System Type

The System Type should always be Databricks Data Warehouse for any Databricks Connection.

Connection String

For Databricks connections, the connection string is configured through the Linked Service. See the Linked Services Section for details on configuring connection string values and Azure Key Vault secrets.

tip

For additional details on creating a Connection, refer to the BimlFlex Connections documentation.

Importing Source Metadata

Once your connections are configured, you can import metadata from your source systems. BimlFlex treats all sources the same way because everything is metadata-driven, whether you are importing 10 tables or hundreds.

To import source metadata:

  1. Navigate to the source Connection
  2. Use the Import Metadata action to bring in tables
  3. For file-based sources, you can import directly from flat files or Parquet files
  4. After import, review the schema diagram showing tables and relationships
  5. Apply any overrides or additional modeling as needed
tip

For additional details on importing metadata, refer to the Metadata Importer documentation.

Deploying the Target Warehouse Environment

When Databricks is used as the target warehouse platform, BimlFlex will generate the required SQL scripts for the deployment of all the tables, stored procedures, and the Data Vault default inserts (Ghost Keys).

Once generated, the scripts can be manually deployed to the required database.

Generating Databricks SQL Scripts

Using Databricks as the target warehouse platform requires generating the Databricks Table Script and Databricks Procedure Script options when using Generate Scripts in BimlStudio. Additionally, if Data Vault is being used, the standard Data Vault Default Insert Script can be used to generate the required Ghost Keys.
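
As an illustration only, a Data Vault default insert (Ghost Key) typically seeds each hub with a zero-key record, along the lines of the following sketch; the catalog, schema, table, columns, and hash value are hypothetical placeholders and the script generated by BimlFlex may differ.

    -- Minimal sketch of a Ghost Key insert; all names and the zero-key hash are hypothetical placeholders.
    INSERT INTO bfx_dv.rdv.HUB_Customer (CustomerHashKey, LoadDateTime, RecordSource, CustomerBusinessKey)
    SELECT '0000000000000000000000000000000000000000', current_timestamp(), 'GHOST', '-1';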

tip

For additional details on generating DDL refer to the BimlFlex DDL generation guide.

Deploying Databricks SQL Scripts

Once BimlFlex generates the scripts, they can be executed against the target database. They can be deployed by copying and pasting into the Databricks SQL Editor or a Notebook, or via the Databricks CLI.

danger

Ensure you are using the appropriate catalog and schema when executing the SQL scripts. The scripts may not contain explicit catalog/schema references and rely on the user executing the script having selected the appropriate context.
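
For example, a minimal sketch of setting the execution context before running a generated script; the catalog and schema names are hypothetical placeholders.

    -- Set the target context before executing the generated BimlFlex script.
    USE CATALOG bfx_lakehouse;  -- hypothetical catalog name
    USE SCHEMA stage;           -- hypothetical schema name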

tip

For additional details on using Databricks SQL and notebooks, refer to the Databricks documentation.

Orchestration

BimlFlex automatically generates the orchestration artifacts as part of the standard build process. The actual artifacts generated depend on the Integration Stage that is configured for the Project.

There is no difference in the process of deploying the Azure environment and ADF orchestration compared to any other target infrastructure in BimlFlex.

As a final check, please ensure the following configurations were made:

tip

For additional details on generating and deploying ADF artifacts, refer to the BimlFlex documentation.