Optimizing and Balancing Unit of Work in SSIS With T-SQL and Biml
Why Balance Containers Up Front
A staging package that runs every load in parallel works on paper, but most physical sources or targets cannot keep up with unbounded parallelism. The usual response is to bucket the loads into a fixed number of parallel containers, ideally with equal total cost per container. Doing that by hand is tedious, and the right answer drifts as load durations change. The pattern below assigns tables to containers in T-SQL using a simple greedy bin packing approach, then uses Biml to materialize one container per bucket.
Metadata Tables
Two tables drive the assignment. 'meta.Tables' lists every table to load, the cost (typically the recent average load time, but row count or size on disk also work), and the container the table currently belongs to. 'meta.Container' tracks the running totals per container so the assignment routine can keep picking the cheapest one:
CREATE SCHEMA meta
GO
CREATE TABLE [meta].[Container] (
[ContainerID] [INT] NULL,
[TotalCost] [FLOAT] NULL,
[Tables] [INT] NULL
) ON [PRIMARY]
GO
ALTER TABLE [meta].[Container]
ADD CONSTRAINT [DF_Container_TotalCost] DEFAULT((0)) FOR [TotalCost]
GO
ALTER TABLE [meta].[Container]
ADD CONSTRAINT [DF_Container_Tables] DEFAULT((0)) FOR [Tables]
GO
CREATE TABLE [meta].[Tables] (
[TableName] [NVARCHAR](50) NULL,
[Cost] [FLOAT] NULL,
[Container] [INT] NULL
) ON [PRIMARY]
GO
Seed the metadata. The values below are illustrative; the more accurate the cost, the more even the buckets:
INSERT INTO [meta].[Tables] ([TableName], [Cost], [Container]) VALUES ('Person.Customer', 1, 0)
INSERT INTO [meta].[Tables] ([TableName], [Cost], [Container]) VALUES ('Person.ContactInfo', 1, 0)
INSERT INTO [meta].[Tables] ([TableName], [Cost], [Container]) VALUES ('Sales.OrderHeader', 10, 0)
INSERT INTO [meta].[Tables] ([TableName], [Cost], [Container]) VALUES ('Sales.OrderLine', 20, 0)
Stored Procedures
Three procedures handle the work. The first creates the requested number of empty container rows:
CREATE PROCEDURE [dbo].[SP_CreateContainers] (@MaxContainers INT)
AS
BEGIN
DECLARE @CurrentContainer INT = 1
TRUNCATE TABLE meta.Container
WHILE @CurrentContainer <= @MaxContainers
BEGIN
INSERT INTO meta.Container (ContainerID) VALUES (@CurrentContainer)
SET @CurrentContainer = @CurrentContainer + 1
END
END
The second refreshes the running totals on each container row:
CREATE PROCEDURE [dbo].[SP_UpdateContainerStats]
AS
BEGIN
UPDATE a SET tables = cnt, TotalCost = Cost
FROM meta.Container a
INNER JOIN (
SELECT container, COUNT(*) AS cnt, SUM(Cost) AS Cost
FROM meta.Tables
GROUP BY container
) b ON a.ContainerID = b.container
END
GO
The third does the actual assignment. It clears the existing assignments, recreates the containers, then loops over the tables in descending cost order, dropping each one into whichever container has the lowest running total:
CREATE PROCEDURE [dbo].[SP_SetContainers] (@MaxContainers INT)
AS
BEGIN
UPDATE meta.Tables SET container = 0
EXEC [SP_CreateContainers] @MaxContainers
DECLARE @TableName NVARCHAR(50)
DECLARE @Cost FLOAT
DECLARE @NextContainer BIGINT
DECLARE tbl_Cursor CURSOR FOR
SELECT TableName, Cost FROM meta.Tables ORDER BY Cost DESC
OPEN tbl_Cursor
FETCH NEXT FROM tbl_Cursor INTO @TableName, @Cost
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @NextContainer = MIN(ContainerID)
FROM (
SELECT ContainerID, TotalCost, Tables, MIN(TotalCost) OVER() AS MinCost
FROM meta.Container
) a
WHERE TotalCost = MinCost
UPDATE meta.Tables
SET container = @NextContainer
WHERE TableName = @TableName
EXEC SP_UpdateContainerStats
FETCH NEXT FROM tbl_Cursor INTO @TableName, @Cost
END
CLOSE tbl_Cursor
DEALLOCATE tbl_Cursor
EXEC SP_UpdateContainerStats
END
For two or two thousand tables, the result is a column of container assignments where the totals across containers are close to equal.
Generating the Package
The Biml template runs the assignment procedure first, then queries 'meta.Tables' to build one outer container per bucket and one inner container per table. The actual data load pattern goes inside the inner container; the example leaves it as a placeholder so the focus stays on the bucketing.
C# version:
<#@ import namespace="System.Data.SqlClient" #>
<#@ import namespace="System.Data" #>
<#
int MaxContainers = 3;
string ConnString = "Data source=localhost; Database=PerfMeta; Integrated Security=SSPI";
SqlConnection conn = new SqlConnection(ConnString);
SqlCommand cmd = new SqlCommand("exec [dbo].[SP_SetContainers] " + MaxContainers, conn);
conn.Open();
cmd.ExecuteNonQuery();
conn.Close();
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
<Package Name="LoadContainers" ConstraintMode="Parallel">
<Tasks>
<# for (int Container = 1; Container <= MaxContainers; Container++) { #>
<Container Name="LoadBucket <#= Container #>" ConstraintMode="Linear">
<Tasks>
<# foreach (DataRow row in ExternalDataAccess.GetDataTable(
"Provider=SQLNCLI11;" + ConnString,
"select * from meta.Tables where container = " + Container).Rows) { #>
<Container Name="Load <#= row["TableName"] #>" ConstraintMode="Linear">
<Tasks>
<!-- Data load pattern goes here -->
</Tasks>
</Container>
<# } #>
</Tasks>
</Container>
<# } #>
</Tasks>
</Package>
</Packages>
</Biml>
VB version:
<#@ template language="VB" optionexplicit="False" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ import namespace="System.Data" #>
<#
MaxContainers = 3
ConnString = "Data source=localhost; Database=PerfMeta; Integrated Security=SSPI"
Dim conn as new SqlConnection(ConnString)
Dim cmd = new SqlCommand("exec [dbo].[SP_SetContainers] " & MaxContainers, conn)
conn.Open()
cmd.ExecuteNonQuery()
conn.Close()
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
<Package Name="LoadContainers" ConstraintMode="Parallel">
<Tasks>
<# for Container = 1 to MaxContainers #>
<Container Name="LoadBucket <#= Container #>" ConstraintMode="Linear">
<Tasks>
<# for each row in ExternalDataAccess.GetDataTable(
"Provider=SQLNCLI11;" & ConnString,
"select * from meta.Tables where container = " & Container).Rows #>
<Container Name="Load <#= row("TableName") #>" ConstraintMode="Linear">
<Tasks>
<!-- Data load pattern goes here -->
</Tasks>
</Container>
<# next #>
</Tasks>
</Container>
<# next #>
</Tasks>
</Package>
</Packages>
</Biml>
The MaxContainers variable controls how many parallel buckets the package gets. Lower it for fragile sources, raise it on hardware that can take more concurrency. Because the assignment runs every time the script is regenerated, fluctuations in load cost get reflected automatically the next build.