Loading Flat Files with XML Format Definitions
Why Drive Flat File Loads From Format Files
A flat file load in SSIS needs three pieces in Biml: a flat file format that describes the columns and delimiters, a flat file connection that points the format at a path, and a destination table that matches the format. Maintaining all three by hand for dozens of feeds is tedious and error prone.
The bcp utility can already produce an XML format file for any source table. Parsing that XML once and feeding the result through a small set of extension methods produces every Biml flat file format, every connection, and every staging table from the same source of truth. Adding a new feed becomes a matter of dropping a new format file into the format folder.
Producing the Source Files
Two folders hold the artifacts. One holds the comma separated data exports. The other holds the bcp XML format files. A batch script invokes bcp twice per table:
bcp SourceDb.person.PersonPhone out Data\PersonPhone.csv -T -t, -c
bcp SourceDb.person.EmailAddress out Data\EmailAddress.csv -T -t, -c
bcp SourceDb.sales.SalesOrderHeader out Data\SalesOrderHeader.csv -T -t, -c
bcp SourceDb.person.PersonPhone format nul -f Format\PersonPhone.xml -T -t, -c -x
bcp SourceDb.person.EmailAddress format nul -f Format\EmailAddress.xml -T -t, -c -x
bcp SourceDb.sales.SalesOrderHeader format nul -f Format\SalesOrderHeader.xml -T -t, -c -x
Running the script produces one CSV per table and a matching XML format file for each one.
A Pointer at the Target Database
A small environment file declares the target database that staging tables will land in:
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Connections>
<OleDbConnection Name="Target"
ConnectionString="Data Source=localhost;Initial Catalog=FlatFileLanding;Provider=SQLNCLI11;Integrated Security=SSPI" />
</Connections>
<Databases>
<Database Name="FlatFileLanding" ConnectionName="Target" />
</Databases>
<Schemas>
<Schema Name="dbo" DatabaseName="FlatFileLanding" />
</Schemas>
</Biml>
The Extension Method File
The rest of the work lives in a single VB module of extension methods plus two small converter helpers. The module declares a static folder path and a default locale, then exposes three extensions. The first turns an 'AstFlatFileFormatNode' into an 'AstTableNode' suitable for the staging schema. The second turns an 'AstFlatFileFormatNode' into a matching 'AstFlatFileConnectionNode'. The third reads a bcp XML file and returns a populated 'AstFlatFileFormatNode':
Imports Varigence.Biml.Extensions
Imports Varigence.Languages.Biml
Imports Varigence.Languages.Biml.FileFormat
Imports Varigence.Languages.Biml.Table
Imports System.IO
Imports System.Xml
Imports System.Data
Imports System.Runtime.CompilerServices
Module FlatFileExtension
Public Basepath As String = "C:\flatfiles"
Public Locale As Varigence.Languages.Biml.Cube.Language = _
Varigence.Languages.Biml.Cube.Language.Lcid1033
<Extension()>
Public Function ToAstTableNode(FlatFile As AstFlatFileFormatNode,
Schema As AstSchemaNode) As AstTableNode
Dim BimlTable As New AstTableNode(Nothing)
BimlTable.Name = "FF_" + FlatFile.Name
BimlTable.Schema = Schema
For Each flatFileColumn As AstFlatFileColumnNode In FlatFile.Columns
Dim tableColumn As New AstTableColumnNode(Nothing)
tableColumn.Name = flatFileColumn.Name
tableColumn.DataType = flatFileColumn.DataType
tableColumn.Length = flatFileColumn.Length
tableColumn.Precision = flatFileColumn.Precision
tableColumn.Scale = flatFileColumn.Scale
tableColumn.CodePage = flatFileColumn.CodePage
BimlTable.Columns.Add(tableColumn)
Next
Dim annotation As New AstAnnotationNode(Nothing)
annotation.Tag = "ConnName"
annotation.Text = FlatFile.Name
BimlTable.Annotations.Add(annotation)
Return BimlTable
End Function
<Extension()>
Public Function ToFlatfileConnection(FlatFileFormat As AstFlatFileFormatNode) _
As Connection.AstFlatFileConnectionNode
Dim Connection As New Connection.AstFlatFileConnectionNode(Nothing)
Connection.Name = FlatFileFormat.Name
Connection.FileFormat = FlatFileFormat
Connection.FilePath = Basepath & "\Data\" & FlatFileFormat.Name & ".csv"
Return Connection
End Function
<Extension()>
Public Function GetFlatFileFormatfromXML(XmlFile As String) As AstFlatFileFormatNode
Dim FlatFileFormat As New AstFlatFileFormatNode(Nothing)
Dim xmldoc As New XmlDocument
xmldoc.Load(XmlFile)
Dim records As XmlNodeList = xmldoc.GetElementsByTagName("RECORD").Item(0).ChildNodes
Dim rows As XmlNodeList = xmldoc.GetElementsByTagName("ROW").Item(0).ChildNodes
Dim row As XmlNode
FlatFileFormat.Locale = Locale
FlatFileFormat.Name = Path.GetFileNameWithoutExtension(XmlFile)
FlatFileFormat.RowDelimiter = ConvertDelimiter(
records.Item(records.Count - 1).Attributes("TERMINATOR").Value)
FlatFileFormat.ColumnNamesInFirstDataRow = False
FlatFileFormat.IsUnicode = False
FlatFileFormat.TextQualifier = "_x0022_"
For Each record As XmlNode In records
row = rows.Item(record.Attributes("ID").Value - 1)
Dim DataType As String = row.Attributes("xsi:type").Value
Dim DatatypeID As Integer = ConvertDatatype(DataType)
Dim column As New AstFlatFileColumnNode(Nothing)
column.Name = row.Attributes("NAME").Value
column.Delimiter = ConvertDelimiter(record.Attributes("TERMINATOR").Value)
If DatatypeID = Nothing Then
column.DataType = DbType.String
Else
column.DataType = DatatypeID
End If
If DatatypeID = Nothing Then
column.Length = 1000
ElseIf DatatypeID = DbType.AnsiString Or DatatypeID = DbType.String Then
column.Length = record.Attributes("MAX_LENGTH").Value
End If
If ConvertDatatype(DataType) = DbType.VarNumeric Then
column.Precision = 32
column.Scale = 16
End If
FlatFileFormat.Columns.Add(column)
Next
Return FlatFileFormat
End Function
Public Function ConvertDatatype(CSVType As String) As String
Select Case CSVType
Case "SQLINT" : Return DbType.Int32
Case "SQLSMALLINT" : Return DbType.Int16
Case "SQLVARCHAR" : Return DbType.AnsiString
Case "SQLDATETIME" : Return DbType.DateTime
Case "SQLMONEY" : Return DbType.Currency
Case "SQLNUMERIC" : Return DbType.Double
Case "SQLNVARCHAR" : Return DbType.String
Case "SQLUNIQUEID" : Return DbType.String
Case Else : Return Nothing
End Select
End Function
Public Function ConvertDelimiter(CSVDelimiter As String) As String
Select Case CSVDelimiter
Case "\r\n" : Return "CRLF"
Case Else : Return CSVDelimiter
End Select
End Function
End Module
A few details are worth pointing out. The XML produced by bcp puts physical fields under a 'RECORD' parent and logical column names under a 'ROW' parent. Each 'FIELD' under 'RECORD' carries an 'ID' attribute that maps to one 'COLUMN' under 'ROW', so the loop walks 'records' and looks up the corresponding 'rows' entry by index. The two converter helpers map the bcp data type names and terminator strings to the values Biml expects. Anything the converter does not recognize falls through to a wide string column with a one thousand character length.
Generating the Flat File Formats
With the extension methods in place, a short Biml file walks the format folder and emits one 'FlatFileFormat' per XML file:
<#@ template language="VB" tier="2" #>
<#@ import namespace="System.IO" #>
<#@ code file="../code/FlatFileExtensions.vb" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<FileFormats>
<# For Each XmlFile As String In Directory.GetFiles(Basepath + "\Format") #>
<#= XmlFile.GetFlatFileFormatfromXML().GetBiml() #>
<# Next #>
</FileFormats>
</Biml>
Each format is fed back through 'GetBiml' to produce the flat Biml that Biml Studio writes into the project.
Generating the Connections and Staging Tables
A second file builds the staging tables and the flat file connections from the formats already in 'RootNode':
<#@ template language="VB" tier="3" #>
<#@ code file="../code/FlatFileExtensions.vb" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Connections>
<# For Each ff As AstFlatFileFormatNode In RootNode.FileFormats #>
<#= ff.ToFlatfileConnection().GetBiml() #>
<# Next #>
</Connections>
<Tables>
<# Dim UseSchema As AstSchemaNode = RootNode.Schemas(0)
For Each ff As AstFlatFileFormatNode In RootNode.FileFormats #>
<#= ff.ToAstTableNode(UseSchema).GetBiml() #>
<# Next #>
</Tables>
</Biml>
Building the project produces one Biml format, one Biml connection, and one Biml staging table per XML file. Adding a new feed becomes a single drop into the format folder; everything downstream regenerates from the same metadata.