Skip to main content

Generating Azure Data Factory Datasets and Pipelines With Biml

Why Generate ADF Assets

A typical Azure Data Factory v1 pipeline that copies an on-premises SQL Server table into Azure Data Lake Store needs a linked service for the source, a linked service for the destination, an input dataset, an output dataset, and a copy activity inside a pipeline. Each of those is a JSON file. Hand authoring the JSON for two or three tables is fine. Hand authoring it for 100 or more tables is tedious, error prone, and slow to update when a column or path convention changes.

BimlScript can read a list of tables from a spreadsheet, expand a JSON template once per row, and write the resulting files to disk. The same pattern that generates SSIS packages from metadata generates ADF JSON from metadata.

What This Note Covers

The walkthrough covers the dataset side of the pattern: an input dataset that points at a SQL Server table and an output dataset that points at a folder in Azure Data Lake Store. Pipelines and copy activities follow the same approach but are larger and are not shown here.

The version of Azure Data Factory referenced in the original work is v1. The dataset JSON shapes shown below are the v1 shapes.

Metadata Source

The driver is an Excel sheet named LoadCatalog with one row per source table. Each row carries:

  • SchemaName and TableName
  • Frequency, set to Daily or Hourly
  • A flag that decides whether the load is a full reload or incremental
  • A column list and column types for the SELECT, used by the pipeline template

The BimlScript that generates the JSON reads the spreadsheet through ExcelReader.ReadExcelQuery and works against an in-memory DataSet.

Input Dataset Template

The input dataset template is a single .biml file that produces one JSON file per call. Three properties at the top declare what the caller will pass in. The body is the JSON that ADF expects, with code nuggets where the table name and frequency vary:

<#@ property name="schemaName" type="string" #>
<#@ property name="tableName" type="string" #>
<#@ property name="frequency" type="string" #>

{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
"name": "DS_OnPremSQL_OperationsDB_<#=schemaName#>_<#=tableName#>",
"properties": {
"type": "SqlServerTable",
"linkedServiceName": "LS_OnPremSQL_OperationsDB",
"structure": [],
"external": true,
"typeProperties": {
"tableName": "<#=schemaName#>.<#=tableName#>"
},
"availability": {
<# if (frequency == "Daily") { #>
"frequency": "Day",
<# } else if (frequency == "Hourly") { #>
"frequency": "Hour",
"style": "EndOfInterval",
<# } #>
"interval": 1
}
}
}

The conditional inside availability picks one of two shapes depending on the cadence the row asks for.

Output Dataset Template

The output template has the same shape but adds a scope property that distinguishes a full reload folder from a deltas folder. The folderPath uses ADF date-time slice tokens so each slice writes to its own folder:

<#@ property name="schemaName" type="string" #>
<#@ property name="tableName" type="string" #>
<#@ property name="frequency" type="string" #>
<#@ property name="scope" type="string" #>

{
"$schema": "http://datafactories.schema.management.azure.com/schemas/2015-09-01/Microsoft.DataFactory.Table.json",
"name": "DS_DataLake_OperationsDB_<#=schemaName#>_<#=tableName#>",
"properties": {
"type": "AzureDataLakeStore",
"linkedServiceName": "LS_ADLS_LakeStore",
"structure": [],
"typeProperties": {
"folderPath": "/raw/operationsdb/<#=schemaName#>_<#=tableName#>/<#=scope.ToLower()#>/{Year}/{Month}/{Day}/",
"fileName": "<#=schemaName#>_<#=tableName#>_{Hour}{Minute}{Second}.txt",
"format": {
"type": "TextFormat",
"columnDelimiter": "|",
"quoteChar": "\"",
"nullValue": ""
},
"partitionedBy": [
{ "name": "Year", "value": { "type": "DateTime", "date": "SliceStart", "format": "yyyy" } },
{ "name": "Month", "value": { "type": "DateTime", "date": "SliceStart", "format": "%M" } },
{ "name": "Day", "value": { "type": "DateTime", "date": "SliceStart", "format": "%d" } },
{ "name": "Hour", "value": { "type": "DateTime", "date": "SliceStart", "format": "hh" } },
{ "name": "Minute", "value": { "type": "DateTime", "date": "SliceStart", "format": "mm" } },
{ "name": "Second", "value": { "type": "DateTime", "date": "SliceStart", "format": "ss" } }
]
},
"availability": {
<# if (frequency == "Daily") { #>
"frequency": "Day",
<# } else if (frequency == "Hourly") { #>
"frequency": "Hour",
"style": "EndOfInterval",
<# } #>
"interval": 1
}
}
}

Driver Script

A separate .biml file reads the Excel sheet, splits the rows into four DataViews by frequency and load type, and calls the templates once per row. The script writes JSON files directly to the ADF project folder using System.IO. CallBimlScript expands the named template with the supplied arguments and returns the resulting text:

<#@ template tier="10" #>
<#@ import namespace="System.Data" #>
<#@ import namespace="System.Text" #>
<#@ code file="LoadHelper.cs" #>
<#@ import namespace="LoadHelper" #>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

<#
string metadataPath = @"C:\Source\OperationsDW\metadata";
string metadataFile = "LoadCatalog.xlsx";
string metadataSheet = "Catalog$";
bool metadataHasHeader = true;

string adfProjectPath = @"C:\Source\OperationsDW\adf\";

DataSet catalog = ExcelReader.ReadExcelQuery(
metadataPath, metadataFile, metadataSheet, metadataHasHeader);

DataView dailyFulls = new DataView(catalog.Tables["Catalog"], "Frequency = 'Daily' and [Changes Only] = 'No'", "", DataViewRowState.CurrentRows);
DataView dailyDeltas = new DataView(catalog.Tables["Catalog"], "Frequency = 'Daily' and [Changes Only] = 'Yes'", "", DataViewRowState.CurrentRows);
DataView hourlyFulls = new DataView(catalog.Tables["Catalog"], "Frequency = 'Hourly' and [Changes Only] = 'No'", "", DataViewRowState.CurrentRows);
DataView hourlyDeltas = new DataView(catalog.Tables["Catalog"], "Frequency = 'Hourly' and [Changes Only] = 'Yes'", "", DataViewRowState.CurrentRows);

foreach (DataRowView rowView in dailyFulls)
{
DataRow r = rowView.Row;
string schemaName = r["SchemaName"].ToString();
string tableName = r["TableName"].ToString();
string frequency = r["Frequency"].ToString();
string scope = "full";

System.IO.File.WriteAllText(
adfProjectPath + "DS_OnPremSQL_OperationsDB_" + schemaName + "_" + tableName + ".json",
CallBimlScript("DS_OnPremSQL_OperationsDB_Schema_Table.biml", schemaName, tableName, frequency));

System.IO.File.WriteAllText(
adfProjectPath + "DS_DataLake_OperationsDB_" + schemaName + "_" + tableName + ".json",
CallBimlScript("DS_DataLake_OperationsDB_Schema_Table.biml", schemaName, tableName, frequency, scope));
}

// Repeat the same block for dailyDeltas, hourlyFulls, and hourlyDeltas.
#>

The four foreach blocks are structurally identical. Each one calls the input template and the output template, with full or deltas substituted into the scope argument as appropriate.

What the Run Produces

A single execution writes one input dataset JSON and one output dataset JSON per row in the spreadsheet. For 100 source tables that is 200 files, all consistent in naming, all consistent in path layout, all consistent in availability shape. The pipeline templates that consume these datasets follow the same pattern and are emitted from a similar driver block.

Where to Take It Next

The same script can write pipelines that group tables by cadence, one-time backfill pipelines that run once per source table, and a separate set of files for an environment-specific deployment. ARM templates can wrap the resulting JSON for repeatable deployment to a new factory. The center of the pattern stays the same: a metadata table or sheet, a Biml template per asset shape, and a driver script that maps one to the other.