Automating Azure Data Factory v2 With Biml
Why Generate ADF v2 Objects From Biml
The 2018 Biml release wave added native Azure Data Factory v2 tags to BimlStudio. The same metadata that drives SSIS package generation can now produce the JSON for linked services, datasets, pipelines, and triggers in a single build. BimlExpress can still emit the JSON by hand authored Biml strings, but the typed ADF tags shown below require BimlStudio.
The DataFactory Skeleton
A Data Factory project is wrapped in a 'DataFactories' element. Each 'DataFactory' has a name (required) and an optional logical display folder. Inside it sit linked services, datasets, pipelines, triggers, and annotations:
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<DataFactories>
<DataFactory>
<LinkedServices></LinkedServices>
<Datasets></Datasets>
<Pipelines></Pipelines>
<Triggers></Triggers>
<Annotations></Annotations>
</DataFactory>
</DataFactories>
</Biml>
Annotations behave the same way they do anywhere else in Biml.
Linked Services
Linked services are connections to sources and sinks. Anything available in the ADF UI is exposed as a typed Biml element, from on premises SQL Server to Amazon S3 to Azure Data Lake. A minimal linked service for an on premises SQL Server looks like this:
<SqlServer ConnectionString="myconnstring" Name="OnPremSql" />
Datasets
Each dataset represents a file or table and references a linked service. A dataset is the closest ADF equivalent of a Biml Table element. Most dataset types accept a Structure block with named columns and data types.
<SqlServer LinkedServiceName="OnPremSql" Name="OnPrem_Customer" Table="Customer">
<Structure>
<Column Name="LastName" DataType="String" />
<Column Name="FirstName" DataType="String" />
</Structure>
</SqlServer>
Pipelines
Pipelines hold one or more activities. Custom Activity and Copy Activity ship out of the box. A Copy Activity pairs a single source with a single sink, both pointing at named datasets:
<Pipelines>
<Pipeline Name="DemoPipeline">
<Activities>
<Copy>
<SqlServerSource DatasetName="OnPrem_Customer" />
<AzureSqlDatabaseSink DatasetName="Asdb_Customer" />
</Copy>
</Activities>
</Pipeline>
</Pipelines>
Triggers
Triggers are optional. Schedule and tumbling window triggers are both supported, and pipelines can also be left to run only on demand.
A Full Metadata Driven Sample
The script below reads the source database schema, then projects every table into a matching pair of datasets (one Azure SQL source, one Azure Data Lake sink) and a Copy activity that links them inside one pipeline:
<#@ template tier="2" language="VB" optionexplicit="False" #>
<#
Dim sourceConnection as AstDbConnectionNode = RootNode.DbConnections("Source")
Dim importResults = sourceConnection.GetDatabaseSchema()
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<DataFactories>
<DataFactory Name="AzureLoader">
<LinkedServices>
<AzureSqlDatabase Name="SourceData" ConnectionString="YOURCONNSTRING" />
<AzureDataLake Name="DataLake"
Url="https://YOURADLNAME.azuredatalakestore.net/webhdfs/v1"
ResourceGroup="YOURRG" SubscriptionId="YOURSUB" Tenant="YOURTENANT" />
</LinkedServices>
<Datasets>
<# for each tbl in importResults.TableNodes #>
<AzureSqlDatabase Name="ASQLDBSRC_<#= tbl.Name #>"
LinkedServiceName="SourceData"
Table="<#= tbl.Schema.Name #>.<#= tbl.Name #>">
<Structure>
<# for each column in tbl.Columns #>
<Column Name="<#= column.Name #>" DataType="<#= column.DataType #>" />
<# next #>
</Structure>
</AzureSqlDatabase>
<AzureDataLake Name="ADLS_<#= tbl.Name #>"
LinkedServiceName="DataLake"
FolderPath="adf/<#= tbl.Name #>/"
File="<#= tbl.Name #>.csv.gz">
<TextFormat ColumnDelimiter="," RowDelimiter="\n" />
<Compression CompressionType="GZip" CompressionLevel="Optimal" />
</AzureDataLake>
<# next #>
</Datasets>
<Pipelines>
<Pipeline Name="LoadDataLake">
<Activities>
<# for each tbl in importResults.TableNodes #>
<Copy Name="Copy <#= tbl.Name #>">
<AzureSqlDatabaseSource DatasetName="ASQLDBSRC_<#= tbl.Name #>" />
<AzureDataLakeSink DatasetName="ADLS_<#= tbl.Name #>"
CopyBehavior="PreserveHierarchy" />
</Copy>
<# next #>
</Activities>
</Pipeline>
</Pipelines>
</DataFactory>
</DataFactories>
</Biml>
Build and Deploy
A successful build drops the generated ADF JSON files under 'output\DataFactory'. Deployment runs through PowerShell at this point. Once the PowerShell deployment succeeds, every linked service, dataset, pipeline, and trigger is live in the target Data Factory.