Skip to main content

Change Data Capture and Biml

Why CDC Belongs in a Biml Project

Change Data Capture has been an enterprise feature of SQL Server since 2008. It collects only the changed rows from selected tables in a source system without intrusive triggers, which makes it well suited to incremental data warehouse loads. SQL Server 2012 added three SSIS components that wrap the CDC plumbing into the data flow:

  • The CDC Control task manages Log Sequence Number ranges across runs of a CDC package.
  • The CDC Source reads a range of change rows from the change tables and pushes them downstream.
  • The CDC Splitter divides a single CDC source flow into separate insert, update, and delete branches.

These components are custom tasks rather than native SSIS components, so generating them through Biml takes a little extra effort.

A Three Step Approach

The pattern that works well for any Biml solution that uses custom components has three steps:

  1. Build the package by hand in SSIS, or at least sketch the design.
  2. Create a Biml file that produces the package without script. This is the vanilla version.
  3. Add metadata gathering and BimlScript to scale the vanilla template across many tables.

This walkthrough focuses on step 2.

Preparing the Environment

Before generating any packages, set up SQL Server with:

  1. A working database, for example 'CdcLab'.
  2. A source table that holds the rows being tracked, for example 'staging.DivisionGroup_Cdc'.
  3. CDC enabled at the database level.
  4. A primary key on the source table.
  5. CDC enabled on the source table.
  6. A destination table that receives the loaded rows.
  7. A small state table that stores the CDC state value between runs.

Initial Load Package

The initial load package has a control flow with a CDC Control task that marks the start, a data flow that copies the current contents of the source table into the destination, and a second CDC Control task that marks the end. The data flow itself is a straightforward source to destination copy with no CDC components inside it; the change tracking work happens in the control flow tasks.

The 'CdcControlTask' is a custom component, which means Biml needs the 'CustomTask' element with the correct 'CreationName' and a populated 'ObjectData' element that mirrors the XML SSIS produces for that task. The fastest way to harvest those values is to build the vanilla package once in SSIS, open it through View Code (F7), find the CDC Control task block, and copy the 'CreationName' attribute and the 'ObjectData' contents into the Biml definition:

<CustomTask Name="MarkInitialLoadStart" CreationName="Microsoft.SqlServer.Dts.Tasks.CdcControlTask, Microsoft.SqlServer.CDCControlTask, Version=11.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91">
<ObjectData>
<![CDATA[
<!-- ObjectData payload copied from the vanilla package -->
]]>
</ObjectData>
</CustomTask>

The data flow that follows reads from the source table and writes to the destination table using ordinary 'OleDbSource' and 'OleDbDestination' transformations. After the data flow, a matching CDC Control task records the LSN that becomes the starting point for the first incremental run.

Incremental Load Package

The incremental load uses the CDC Source and CDC Splitter components in the data flow. The control flow wraps the work in a sequence container that holds:

  • A CDC Control task that retrieves the current processing range from the state table.
  • Steps that create the staging tables for changed rows and deleted rows.
  • A data flow that pulls the change rows, splits them by operation, and routes each branch to a staging table or directly to the destination.
  • An Execute SQL task that applies the staged updates and deletes against the destination.
  • A CDC Control task that records the new processing range.
  • Steps that drop the staging tables.

A complicating factor is that the CDC Source and CDC Splitter are also custom components, and they require column metadata that SSIS normally infers. For these components Biml needs the metadata supplied explicitly inside the 'CustomComponent' element, again copied from the XML of a vanilla package.

A typical layout for the incremental data flow looks like this:

<Dataflow Name="LoadDivisionGroupChanges">
<Transformations>

<CustomComponent Name="ReadChanges" CreationName="Microsoft.SqlServer.Dts.Pipeline.CDCSource, ..." ConnectionName="CdcSource">
<CustomProperties>
<!-- properties copied from vanilla package, including ColumnMetadata -->
</CustomProperties>
<OutputPaths>
<OutputPath Name="Output" />
</OutputPaths>
</CustomComponent>

<CustomComponent Name="SplitByOperation" CreationName="Microsoft.SqlServer.Dts.Pipeline.CDCSplitter, ...">
<InputPaths>
<InputPath OutputPathName="ReadChanges.Output" />
</InputPaths>
<OutputPaths>
<OutputPath Name="InsertOutput" />
<OutputPath Name="UpdateOutput" />
<OutputPath Name="DeleteOutput" />
</OutputPaths>
</CustomComponent>

<OleDbDestination Name="WriteInserts" ConnectionName="WarehouseTarget">
<InputPath OutputPathName="SplitByOperation.InsertOutput" />
<ExternalTableOutput Table="dim.DivisionGroup" />
</OleDbDestination>

<OleDbDestination Name="StageUpdates" ConnectionName="WarehouseTarget">
<InputPath OutputPathName="SplitByOperation.UpdateOutput" />
<ExternalTableOutput Table="stg.DivisionGroup_Updates" />
</OleDbDestination>

<OleDbDestination Name="StageDeletes" ConnectionName="WarehouseTarget">
<InputPath OutputPathName="SplitByOperation.DeleteOutput" />
<ExternalTableOutput Table="stg.DivisionGroup_Deletes" />
</OleDbDestination>

</Transformations>
</Dataflow>

The CDC Control task at the start of the container retrieves the processing range. After the data flow finishes, an Execute SQL task issues the UPDATE and DELETE statements that apply the staged operations against the destination table. A second CDC Control task then records the new state, and the staging tables are dropped to clean up.

Scaling Across Tables

A single sequence container per source table keeps the package readable. When the number of tracked tables grows, the typical refactor moves the CDC Control tasks into a master package and gives each tracked table its own child package. That structure is also where step 3 of the three step approach pays off: a metadata table listing the tracked source tables drives a BimlScript loop that emits one child package per row.