Introduction to hashing in BimlFlex

BimlFlex uses hashing in two common scenarios: for Data Vault hash-based keys, and for row checksums that identify changes to attributes.

Hashing keys in Data Vault allows integration keys to be loaded in a deterministic way from multiple sources in parallel. This also removes the need for key lookups between related entities.

Hashing row checksums allows the load process to compare a single generated hash value to identify whether any attribute has changed for a key. This allows flexible and fast loading of only the changes to the data.

BimlFlex provides two main ways of hashing:

  • Using a custom SSIS component that provides hashing of data in the SSIS Data Flow Task for SSIS-based loads
  • Using the SQL HASHBYTES() script function in ELT code
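
As a minimal sketch of the ELT approach, the following hashes a single sanitized input value with HASHBYTES() and returns it as a 40-character hexadecimal string (the input literal is arbitrary):

SELECT CONVERT(CHAR(40), HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX), N'4711')), 2) AS IntegrationKeyHash;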

Hash Keys

  • A hash of the Integration Key.
  • Not a compulsory field for Data Vault; however, it is currently compulsory in BimlFlex. Support for a no-hash implementation option will be added to all ELT (SQL) templates.

Hashing Algorithm

  • BimlFlex supports all major hash algorithms, with SHA1 as the default. Select one of the other available options by changing the HASH ALGORITHM setting to one of the following values:
    • MD5
    • SHA1
    • SHA2_256
    • SHA2_512
  • Previous SSIS versions of BimlFlex created a different hash for Unicode characters. The USE SQL COMPATIBLE HASH and USE SQL COMPATIBLE ROW HASH settings provide backward compatibility; set them to Y for new implementations.
  • Both binary and text hash representations are supported, depending on your requirements, and can be configured by changing the HASH BINARY setting. Binary hashing is recommended: the hash value is stored in its native binary form, requiring half the storage space of the hexadecimal string representation.
  • Performance setting: ADD ROW HASH KEY INDEX will add an index to the staging table on the RowHashKey to optimize the Hub ETL SSIS packages.

Hash Collision

When using hash values to represent the value of something else, such as an alternate key or a row checksum, there is a small risk that two source values generate the same hash value. This is a hash collision.

When two keys hash to the same value, a potential effect is that only the first key would be loaded to the Data Vault Hub, with the first key value. Attributes that relate to the second key would then be confused with, and co-located under, the first key.

Row hashing is used to detect changes in rows: all the attributes are combined into a single hash value, commonly referred to as a HashDiff. When that value changes, one or more of the attributes have changed; when it is the same, the rows are the same. If a collision occurs here, the new version of the row (with the new changes) would not be persisted to the target, as it would not be identified as a change.

One way to accommodate hash collision risk is to compute a reverse hash and combine it with the original hash. This is a configuration option in BimlFlex for the SSIS templates; however, it will have an impact on load times.

Another way is to run reconciliation checks in the target data warehouse. For targets that don't enforce constraints, such as Synapse and Snowflake, the hash values are inserted into the target regardless of collisions; BimlFlex can be extended with custom reconciliation queries that allow reports and analytics on key/hash comparisons. For targets that do enforce constraints, a collision will result in a potential hard failure, allowing the scenario to be analyzed and accommodated. One remediation option is to increase the hash length: if a collision happens using SHA1, upgrade to SHA2_256 or SHA2_512, which produce longer hashes with a lower risk of collisions.
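
A sketch of such a reconciliation query, assuming a hypothetical staging table stage.Customer with an Integration Key column CustomerKey: any hash key that maps to more than one distinct Integration Key indicates a collision.

-- Hypothetical collision check: each hash key should map to exactly one Integration Key
SELECT FlexRowHashKey, COUNT(DISTINCT CustomerKey) AS KeyCount
FROM stage.Customer
GROUP BY FlexRowHashKey
HAVING COUNT(DISTINCT CustomerKey) > 1;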

Hash Key Settings

The following table describes the relevant settings for Hash Keys and their default values.

Group | Section | Type | Setting | Description | Default Value
----- | ------- | ---- | ------- | ----------- | -------------
Core | Hash | Text Datatype | Hash Algorithm | The hashing algorithm to use (MD5, SHA1, SHA2_256, SHA2_512). | SHA1
Core | Hash | Boolean Datatype | Hash Binary | Should the generated hash value be stored as a binary representation rather than a string representation. | false
Core | Hash | Boolean Datatype | Hash Integration Key | Should the Integration Key be hashed. This is done automatically for any project where the destination connection integration stage is Data Vault, as it is a requirement for a Data Vault load. For other load process designs, hashing is optional and controlled by this flag as well as the hashing configuration in the configuration sheet. | false
Core | Hash | Boolean Datatype | Use SQL Compatible Hash | Should the SSIS inline hashing component use a hashing approach compatible with the SQL Server HASHBYTES() function. This is recommended so that the hashed values can be recreated using standard SQL queries when needed. | true
Core | Hash | Boolean Datatype | Use SQL Compatible Row Hash | Should the SSIS inline hashing component for Full Row Hashing use a hashing approach compatible with the SQL Server HASHBYTES() function. The default is false for backward compatibility; true is recommended for new projects to make them forward compatible with cloud deployments. | false
Core | Hash | Boolean Datatype | Cast Boolean to True False for Hash | Should the SQL inline hashing function for MSSQL, SQLDB and Synapse convert BIT (Boolean) values to True/False instead of 1/0. | false
Data Vault | Accelerator | Boolean Datatype | Use Hash Keys | Should the Data Vault use Hash Keys or Natural Keys. | true
Staging | Settings | Boolean Datatype | Add Row Hash Key Index | Enable to add a unique, non-clustered constraint on the RowHashKey and EffectiveFromDate columns in staging tables. | false

Hash configurations

Both of these approaches provide a set of optional configurations and settings. The hashing approach in the SSIS packages is controlled through the following settings:

SettingKey | SettingValue | Description
---------- | ------------ | -----------
HashAlgorithm | SHA1 | Hashing algorithm (MD5, SHA1, SHA2_256, SHA2_512)
HashBinary | N | Binary or string hash representation (Y, N)
HashIntegrationKey | N | Should the Object's Integration Key be hashed. This is always applied for loads with a Data Vault destination (Y, N)
UseSqlCompatibleHash | Y | Should the Integration Key hashing use a SQL HASHBYTES() compatible pattern (Y, N)
UseSqlCompatibleRowHash | Y | Should the row checksum hashing use a SQL HASHBYTES() compatible pattern (Y, N)

Hash representation

This defines how the resulting hash value is stored:

  • String representation: the resulting hash is stored using the text representation of the binary hash value. This makes the hash value legible, and it can easily be copied and stored in plain-text formats.
  • Binary representation: the hash is stored in its native, binary form. This requires less storage, but is not as well suited for readability or agnostic storage.

When choosing the hash data type, storage requirements and SQL Server performance are one consideration; direct joinability to other sources without converting the hash is another.
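
Both representations can be produced directly with HASHBYTES(); a small comparison (the input value is arbitrary):

SELECT
    HASHBYTES('SHA1', N'Example')                        AS BinaryHash,  -- binary, 20 bytes for SHA1
    CONVERT(CHAR(40), HASHBYTES('SHA1', N'Example'), 2)  AS StringHash;  -- char(40) hexadecimal text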

Hash algorithm

The cryptographic algorithm used to derive the hash. The different algorithms result in differently sized hash values, require different amounts of processing power, and carry different risks of key collisions.

  • MD5
  • SHA1
  • SHA2_256
  • SHA2_512

Data Types and storage requirements for hash columns

Hash Algorithm | String Representation | Binary Representation
-------------- | --------------------- | ---------------------
MD5 | char(32) | binary(16)
SHA1 | char(40) | binary(20)
SHA2_256 | char(64) | binary(32)
SHA2_512 | char(128) | binary(64)
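
These sizes can be verified directly, since DATALENGTH() returns the size of the binary hash in bytes and the string form uses two hexadecimal characters per byte:

SELECT
    DATALENGTH(HASHBYTES('MD5',      N'x')) AS MD5Bytes,      -- 16
    DATALENGTH(HASHBYTES('SHA1',     N'x')) AS SHA1Bytes,     -- 20
    DATALENGTH(HASHBYTES('SHA2_256', N'x')) AS SHA2256Bytes,  -- 32
    DATALENGTH(HASHBYTES('SHA2_512', N'x')) AS SHA2512Bytes;  -- 64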

Attribute Separators and Field Order

The hash algorithm runs its logic on a single input object. For data warehousing scenarios, the fields that are part of the hash are all concatenated together into a single string.

When hashing multiple concatenated values, it is important to separate them so that patterns remain distinguishable. This separator is sometimes called a hash or field concatenator, separator, or sanding value. The default field separator in BimlFlex is ~ and it is configurable in the settings. The separator ensures that hashing A and BC generates a different hash than hashing AB and C: without the separator, both scenarios would present the input value ABC to the hashing process and produce identical hash values, whereas with the separator the inputs become A~BC and AB~C, resulting in different hashes, as the example below shows. The separator is also used in Integration Key creation when concatenating multiple fields into a single string Integration Key representation.
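
The effect of the separator can be demonstrated directly in SQL:

-- Without the separator both inputs are 'ABC' and hash identically;
-- with '~' the inputs differ and so do the hashes
SELECT
    HASHBYTES('SHA1', N'A'  + N'BC')         AS NoSeparator1,
    HASHBYTES('SHA1', N'AB' + N'C')          AS NoSeparator2,   -- equal to NoSeparator1
    HASHBYTES('SHA1', N'A'  + N'~' + N'BC')  AS WithSeparator1,
    HASHBYTES('SHA1', N'AB' + N'~' + N'C')   AS WithSeparator2; -- differs from WithSeparator1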

The hash is dependent on the order of the included attributes. For the row hash, BimlFlex orders the columns as they are ordered in the metadata, which is controlled by the Ordinal metadata attribute. This is a flexible approach that, for instance, allows a column to be renamed without affecting the hash outcome.

Null value replacements

It is not possible to hash a null value, so the BimlFlex hashing procedure applies an optional null replacement value. The default value is NVL and it is configurable in the settings.
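
In SQL-compatible hashing this corresponds to a COALESCE() around each input column, as in the sample script at the end of this article:

-- A NULL input is replaced with the literal 'NVL' before hashing
SELECT HASHBYTES('SHA1', COALESCE(CAST(NULL AS NVARCHAR(100)), N'NVL')) AS NullReplacedHash;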

Hashing Integration Keys

A Business Key or Integration Key is used as the main identifier for entities, most commonly as the base key for Hub entities in Data Vault. BimlFlex has a built-in feature to convert source columns and concatenate them into an Integration Key. There is also an optional setting to upper-case the string. This is useful where the database is case-insensitive and keys with different casings are considered the same entity: different casings hash to different hash values, whereas the database engine would consider Integration Keys with different casings to be the same, so upper-casing provides a convenient way to automatically align all Integration Keys. BimlFlex always stores the original values in Satellites by default, so no data is lost in this process.
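
A small sketch of the casing effect (the key values are illustrative):

-- Different casings hash differently; upper-casing aligns them
SELECT
    HASHBYTES('SHA1', N'Customer-1')         AS MixedCase,    -- differs from LowerCase
    HASHBYTES('SHA1', N'customer-1')         AS LowerCase,
    HASHBYTES('SHA1', UPPER(N'Customer-1'))  AS UpperCased1,  -- equal to UpperCased2
    HASHBYTES('SHA1', UPPER(N'customer-1'))  AS UpperCased2;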

The sample script at the end of this article hashes Integration Keys for the Product table in AdventureWorksLT. The hashes it produces match those created by the SQL-compatible SSIS hashing component, so they can be recreated using SQL Server HASHBYTES().

Hashing Row Checksums

A row hash is a hash of all relevant attributes in the row and is mainly used to identify changes to records. In the default source-to-staging-to-persistent-staging pattern, a lookup compares the row hash stored in the persistent staging area with the new row hash to see whether any attribute has changed or whether the row should be ignored.
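
A sketch of this lookup pattern, assuming hypothetical staging and persistent staging tables (stage.Customer and psa.Customer) and an IsCurrent indicator:

-- Rows are loaded only when new (no stored hash) or changed (hash differs)
SELECT src.*
FROM stage.Customer AS src
LEFT JOIN psa.Customer AS tgt
    ON  tgt.FlexRowHashKey = src.FlexRowHashKey
    AND tgt.IsCurrent = 1
WHERE tgt.FlexRowHash IS NULL
   OR tgt.FlexRowHash <> src.FlexRowHash;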

Special consideration is normally required for changes in the row change type indicator. A row with the same attributes, but with a change indicator showing it has been deleted, would result in the same attribute hash yet have a completely different meaning in the data warehouse.

Key collisions

There is a risk of key collisions when using hashing, where different values generate the same key. This is a calculated risk of using hashing, and a collision mitigation strategy should be in place so that the integration process can accommodate these potential issues should they arise. Detailing the mathematical background, gauging the risk, and prescribing the required responses to that risk are out of scope for this article.

Hash compatibility

The most important part of hashing data is consistent behavior, so that the same input value always results in the same hash value. This is especially true when data from different systems is hashed using different tools and approaches, and the hashes will later be used for joining and linking. For this compatibility, BimlFlex provides a HASHBYTES()-compatible hashing approach that allows other systems to hash using the same approach so that matching is possible. This document details the hashing approach so that it is easy to replicate, either manually in SQL or in a different process.

Using deterministic representation in hash input

Some data types, such as dates and date times, have culture-specific default representations in certain scenarios. To ensure deterministic hashing, these data types are converted to strings with an explicit format applied. This allows the same input value to be derived regardless of the current culture or the default implicit string conversion of the current environment.
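
For example, date and datetime values are converted with an explicit style; style 127 (ISO 8601), as used in the sample script below, is culture independent:

-- Style 127 produces a culture-independent ISO 8601 string
SELECT CONVERT(NVARCHAR(30), CAST('20080601 10:30:00' AS DATETIME), 127) AS Iso8601Value;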

Hashing approximate and complex data types

Some data types are, by definition, approximate representations; in SQL Server these are the real and float data types. Depending on the values and the processing entity, the actual value used in the hashing might differ: for example, SQL Server and SSIS store some values slightly differently. In these scenarios, the generated hash is not deterministic and can produce different hash values for what appears to be the same input value. In most scenarios this will only lead to the occasional added row when no change has occurred. For scenarios where this is not acceptable, consider using exact numeric data types.

Some complex data types, such as geography, geometry, hierarchyid and image, don't have obvious string representations and can therefore be interpreted differently by different hashing approaches. For scenarios where these data types are used and deterministic hashing is needed, consider manually converting them to a known string format or object representation before hashing.
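
As an illustration (one possible approach, not BimlFlex's built-in handling): a float can be given a fixed scientific-notation text form with CONVERT style 2, and a geography value can be reduced to its Well-Known Text form before hashing.

-- Float: CONVERT style 2 gives a fixed scientific-notation text form
SELECT CONVERT(NVARCHAR(100), CAST(0.1 AS FLOAT), 2) AS FloatAsText;  -- 1.000000000000000e-001

-- Geography: hash a deterministic Well-Known Text representation
DECLARE @g GEOGRAPHY = GEOGRAPHY::STGeomFromText('POINT(-122.349 47.651)', 4326);
SELECT HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX), @g.STAsText())) AS GeographyHash;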

Sample comparison script

The SQL script below queries the AdventureWorksLT staging table for the AWLT.Product table and recreates the SSIS-derived hashes for the Integration Key and the row hash using SQL HASHBYTES() syntax. Note the use of CAST and CONVERT with explicit formats.

SELECT
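-- Integration Key hash: the Business Key column is trimmed, NULL-replaced with 'NVL', then hashed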

CONVERT(CHAR(40), HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
    COALESCE(LTRIM(RTRIM(CAST(Product_BK AS NVARCHAR(100)))), 'NVL')
)),2) AS FlexRowHashKeyInSql,

FlexRowHashKey,
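-- Row hash: each attribute is trimmed, NULL-replaced with 'NVL', and concatenated with the '~' separator before hashing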

CONVERT(CHAR(40), HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
    COALESCE(LTRIM(RTRIM(CAST([ProductID] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([Name] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([ProductNumber] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([Color] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(100), [StandardCost],2 ))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(100), [ListPrice],2 ))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([Size] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([Weight] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([ProductCategoryID] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([ProductModelID] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [SellStartDate], 127))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [SellEndDate], 127))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [DiscontinuedDate], 127))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST([ThumbnailPhotoFileName] AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CAST(REPLACE(REPLACE([rowguid], '{',''),'}','')  AS NVARCHAR(100)))), 'NVL') +'~'+
    COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), ModifiedDate, 127))), 'NVL')
)),2) AS FlexRowHashInSql,

FlexRowHash, *

FROM AWLT.Product ORDER BY ProductID
