Loading Tables by Foreign Key Topology with Biml: Parallel Loads
Why Run Loads in Parallel
The linear topology load walkthrough produces a working package, but every table runs one after another. On a machine with spare CPU and disk capacity that approach wastes time. Loads can run concurrently as long as every parent table has finished before its children start. The pattern below uses the same TopoSort helper and Dataflow fragment as the linear version, but the package structure groups independent tables into parallel containers.
This walkthrough assumes the TopoSort module, source and target connections, foreign key walking, and the table create package from the linear load article are already in place.
The Parallel Load Strategy
The package builder works in five stages:
- Track loaded tables in a list. The list starts empty.
- Find every table that does not reference any other table. Load each one, then recursively load tables that reference only that single table.
- Each time a table is loaded, add its name to the loaded list.
- Iterate up to ten times, finding tables whose foreign key dependencies are all already loaded, and load them in parallel inside a container.
- If any tables remain after the loop ends, load them sequentially in topology order as a safety net.
The Parallel Load Package
<#@ template tier="4" language="VB" #>
<#@ code file="../Code/TopologySort.vb" #>
<#@ import namespace="TopologySort" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
<Package Name="03_Load_Parallel" ConstraintMode="Linear" PackageSubpath="<#= OutputPath #>">
<Tasks>
<ExecutePackage Name="Recreate Tables">
<ExternalFile ExternalFilePath="<#= OutputPath #>\01_Create.dtsx" />
</ExecutePackage>
<Container Name="Load roots and immediate descendants" ConstraintMode="Parallel">
<Tasks>
<# For Each t As AstTableNode In RootNode.Tables.Where(Function(r) r.Columns.OfType(Of AstTableColumnTableReferenceNode)().Count + r.Columns.OfType(Of AstMultipleColumnTableReferenceNode)().Count = 0) #>
<Container Name="Load <#= t.Name #>" ConstraintMode="Linear">
<Tasks>
<#= CallBimlScript("Recursion.biml", t, LoadedTables) #>
</Tasks>
</Container>
<# Next #>
</Tasks>
</Container>
<#
Do While RootNode.Tables.Where(Function(e) Not LoadedTables.Contains(e.Name) And Not e.Columns.OfType(Of AstMultipleColumnTableReferenceNode).Where(Function(c) c.ForeignTable.Name <> e.Name And Not LoadedTables.Contains(c.ForeignTable.Name)).Count > 0 And Not e.Columns.OfType(Of AstTableColumnTableReferenceNode).Where(Function(m) m.ForeignTable.Name <> e.Name And Not LoadedTables.Contains(m.ForeignTable.Name)).Count > 0).Any And Level < 10
Level += 1
LoadableTables = RootNode.Tables.Where(Function(e) Not LoadedTables.Contains(e.Name) And Not e.Columns.OfType(Of AstMultipleColumnTableReferenceNode).Where(Function(c) c.ForeignTable.Name <> e.Name And Not LoadedTables.Contains(c.ForeignTable.Name)).Count > 0 And Not e.Columns.OfType(Of AstTableColumnTableReferenceNode).Where(Function(m) m.ForeignTable.Name <> e.Name And Not LoadedTables.Contains(m.ForeignTable.Name)).Count > 0).ToList
#>
<Container Name="Load level <#= Level #>" ConstraintMode="Parallel">
<Tasks>
<# For Each tbl As AstTableNode In LoadableTables #>
<#= CallBimlScript("Dataflow.biml", tbl, LoadedTables) #>
<# Next #>
</Tasks>
</Container>
<#
Loop
If LoadedTables.Count < RootNode.Tables.Count Then
#>
<Container Name="Load leftover tables" ConstraintMode="Linear">
<Tasks>
<# For Each t As AstTableNode In RootNode.Tables.TopoSort.Where(Function(r) Not LoadedTables.Contains(r.Name)) #>
<#= CallBimlScript("Dataflow.biml", t, LoadedTables) #>
<# Next #>
</Tasks>
</Container>
<# End If #>
</Tasks>
</Package>
</Packages>
</Biml>
Recursive Loading of Direct Descendants
The first parallel container delegates each root table to a recursive Biml file. That file loads the table itself, then loops over tables that reference only this single root, recursing back into itself to handle their children:
<#@ template language="VB" designerbimlpath="Biml/Packages/Package/Tasks" #>
<#@ property name="tbl" type="AstTableNode" #>
<#@ property name="LoadedTables" type="List (of String)" #>
<#= CallBimlScript("Dataflow.biml", tbl, LoadedTables) #>
<# For Each t As AstTableNode In RootNode.Tables.Where(Function(r) r.Columns.OfType(Of AstTableColumnTableReferenceNode)().Count + r.Columns.OfType(Of AstMultipleColumnTableReferenceNode)().Count = 1).Where(Function(e) e.Columns.OfType(Of AstTableColumnTableReferenceNode)().First.ForeignTable.Name = tbl.Name) #>
<#= CallBimlScript("Recursion.biml", t, LoadedTables) #>
<# Next #>
CallBimlScript inside a script that itself was called is the trick that makes recursion possible.
What the Loop Stage Does
After the first container takes care of the easy roots and their immediate descendants, the do while loop looks for any unloaded table whose foreign key references all point to already loaded tables. Self references are ignored. Tables that qualify go into one parallel container per level. The loop runs at most ten times.
Anything still left over (typically because of cycles or unusually deep graphs) is loaded sequentially in topology order so the package always completes. On a representative warehouse, the parallel package finishes considerably faster than the linear version while preserving referential integrity.