Skip to main content

Building a Data Warehouse While Source Systems Evolve

The Problem

Building a BI solution while the underlying ERP and CRM are still being changed is a common pattern. The new systems often go live with the data warehouse and reports as part of acceptance, which means the BI team is integrating against a moving target. Without automation, every source change cascades into hours of manual rework on packages, tables, views, and ETL logic.

Three Things That Keep the BI Team Agile

Three coordinated tools make the difference between sustainable and unsustainable progress:

  • An alert and reporting system that surfaces source schema changes. A daily snapshot of source tables, views, and columns makes additions, drops, and type changes visible the day they happen.
  • A data warehouse automation framework that regenerates tables, views, primary keys, foreign keys, indexes, stored procedures, and SSIS packages from metadata. Biml fits this role: change the metadata, regenerate the artifacts.
  • A testing framework that runs daily and reports any divergence between current results and the values business users expect.

Designing a Lightweight Testing Framework

A testing framework sounds elaborate but a useful one can be built in an afternoon. Start with the reports the users already trust. Ask which numbers matter and capture the expected values in a spreadsheet with this layout:

testNametestTypecriterion1criterion2expectedValueexpression
InternetSales202401DWH202401Internet125035(SQL goes here)

The 'expression' column holds a SQL statement that returns the expected value. The placeholders 'criterion1' and 'criterion2' get replaced with the values from the same row at execution time, which lets the same expression cover many similar tests by changing only the criteria values.

Example expression:

SELECT CAST(SUM(f.sales) AS int)
FROM factSales f
INNER JOIN dimDate d
ON f.DateKey = d.DateKey
INNER JOIN dimChannel ch
ON f.ChannelKey = ch.ChannelKey
WHERE d.YearMonth = [criterion1]
AND ch.ChannelName = '[criterion2]';

Tables That Hold Tests and Results

Two tables are enough. The first holds the test catalog imported from the spreadsheet:

CREATE TABLE DataTests (
testName nvarchar(64),
testType nvarchar(20),
criterion1 nvarchar(128),
criterion2 nvarchar(128),
expectedValue int,
expression nvarchar(1024)
);

The second logs every test execution:

CREATE TABLE DataTestResults (
testName nvarchar(64),
testDate datetime DEFAULT getdate(),
actualValue int,
expectedValue int
);

The SSIS Package That Runs the Tests

The package has four pieces:

  • An Execute SQL Task that returns the list of tests and their resolved queries
  • A ForEach Loop that iterates over the result set
  • An Execute SQL Task that runs each resolved query against the data warehouse
  • An Execute SQL Task that writes the actual and expected values into 'DataTestResults'

The Biml that produces this package:

<Packages>
<Package Name="PKG_TestData" ConstraintMode="Linear">
<Variables>
<Variable Name="TestList" DataType="Object" />
<Variable Name="testQuery" DataType="String" />
<Variable Name="testName" DataType="String" />
<Variable Name="expectedValue" DataType="Int32">0</Variable>
<Variable Name="actualValue" DataType="Int32">0</Variable>
</Variables>
<Tasks>

<ExecuteSQL Name="SQL_GetTestList" ConnectionName="META" ResultSet="Full">
<DirectInput>
SELECT testName,
testQuery = REPLACE(REPLACE(expression, '[criterion1]', criterion1), '[criterion2]', criterion2),
expectedValue
FROM DataTests
WHERE testType = 'DWH'
</DirectInput>
<Results>
<Result Name="0" VariableName="User.TestList" />
</Results>
</ExecuteSQL>

<ForEachAdoLoop Name="FELC_Test" SourceVariableName="User.TestList" ConstraintMode="Linear">
<VariableMappings>
<VariableMapping Name="0" VariableName="User.testName" />
<VariableMapping Name="1" VariableName="User.testQuery" />
<VariableMapping Name="2" VariableName="User.expectedValue" />
</VariableMappings>
<Tasks>

<ExecuteSQL Name="SQL_ExecuteTest" ConnectionName="DWH" ResultSet="SingleRow">
<VariableInput VariableName="User.testQuery" />
<Results>
<Result Name="0" VariableName="User.actualValue" />
</Results>
</ExecuteSQL>

<ExecuteSQL Name="SQL_LogTestResult" ConnectionName="META" ResultSet="None">
<DirectInput>
INSERT INTO DataTestResults (testName, actualValue, expectedValue)
VALUES (?, ?, ?)
</DirectInput>
<Parameters>
<Parameter Name="0" Direction="Input" DataType="String" VariableName="User.testName" />
<Parameter Name="1" Direction="Input" DataType="Int32" VariableName="User.actualValue" />
<Parameter Name="2" Direction="Input" DataType="Int32" VariableName="User.expectedValue" />
</Parameters>
</ExecuteSQL>

</Tasks>
</ForEachAdoLoop>

</Tasks>
</Package>
</Packages>

What This Buys

A daily run produces an immediate signal. Errors surface before users start testing. Source system changes that break a report are localized to specific tests. Confidence in the warehouse rises because the same numbers users care about are verified every day.

Summary

Three pieces, an alert system on source metadata, a Biml-driven automation layer, and a small testing harness, are enough to keep a data warehouse coherent while its sources continue to change. The testing framework is the one most often skipped and the one that makes the others trustworthy.