# Introduction to hashing in BimlFlex
BimlFlex uses hashing in two common scenarios: for Data Vault hash-based keys, and for row checksums that identify changes to attributes.

Hashing keys in Data Vault allows Integration Keys to be loaded deterministically from multiple sources in parallel. It also removes the need for key lookups between related entities.

Hashing row checksums allows the load process to compare a single generated hash to determine whether any attribute has changed for a given key. This allows for flexible and fast loading of only the data that has changed.
BimlFlex provides two main ways of hashing:
- Using a custom SSIS component that provides hashing of data in the SSIS Data Flow Task for SSIS-based loads
- Using the SQL `HASHBYTES()` function in ELT code (sketched below)
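
As an illustration of the ELT approach, the following T-SQL sketch (simplified, not the exact BimlFlex-generated code) shows the general pattern used throughout this document: sanitize the input, convert it to `NVARCHAR(MAX)`, hash it with `HASHBYTES()`, and render the result as a hexadecimal string.

```sql
-- A minimal sketch of the ELT hashing pattern; the variable and key value
-- are illustrative only.
DECLARE @IntegrationKey NVARCHAR(100) = N'12345';

SELECT CONVERT(CHAR(40),                                         -- SHA1 renders as 40 hex characters
           HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
               COALESCE(LTRIM(RTRIM(@IntegrationKey)), 'NVL'))), -- trim and replace NULL
           2) AS FlexHashKey;                                    -- style 2 = hex without a '0x' prefix
```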
## Hash Keys
- A hash of the Integration Key.
- Not a compulsory field for Data Vault; however, it is currently compulsory in BimlFlex. We will support a no-hash implementation option in all our ELT (SQL) templates.
### Hashing Algorithm

- All major hash algorithms are supported, with `SHA1` as the default. Implement other available options by changing the HASH ALGORITHM setting to one of the following values: `MD5`, `SHA1`, `SHA2_256`, `SHA2_512`.
- Previous SSIS versions of BimlFlex created a different hash for Unicode characters. The USE SQL COMPATIBLE HASH and USE SQL COMPATIBLE ROW HASH settings provide backward compatibility; they should be set to `Y` for new implementations.
- Both `binary` and `text` hash representations are supported, depending on your requirements, and can be configured by changing the HASH BINARY setting. Binary hashing is recommended: with binary hashing, the hash value is stored in its native binary form, requiring half the storage space of the hexadecimal string representation.
- Performance setting: ADD ROW HASH KEY INDEX adds an index on the `RowHashKey` column of the staging table to optimize the Hub ETL SSIS packages.
## Hash Collision

When using hash values to represent something else, such as an alternate key or a row checksum, there is a small risk that two different source values generate the same hash value. This is a hash collision.

When two keys hash to the same value, a potential effect is that only the first key would be loaded to the Data Vault Hub, with its key value. Attributes that relate to the second key would then be confused and co-located with the first.

Row hashing is used to detect changes in rows: all the attributes are combined into a single hash value, commonly referred to as a HashDiff. When that value changes, one or more of the attributes have changed; when it is the same, the rows are the same. If a collision occurs here, the new version of the row (with the new changes) would not be persisted in the target, as it would not be identified as a change.

One way to accommodate the hash collision risk is to compute a reverse hash and combine it with the original hash. This is a configuration option in BimlFlex for the SSIS templates, but it has an impact on load times.

Another way is to run reconciliation checks in the target data warehouse. For targets that don't enforce constraints, such as Synapse and Snowflake, the hash values are inserted into the target regardless of collisions; BimlFlex can be extended with custom reconciliation queries to allow reports and analytics on key/hash comparisons. For targets that do enforce constraints, a collision will result in a hard failure, allowing the scenario to be analyzed and accommodated. One remediation option is to increase the hash length: if a collision happens using SHA1, upgrade to SHA2_256 or SHA2_512, which produce longer hashes and carry less risk of collisions.
## Hash Key Settings
### Hash configurations
Both of these approaches provide a set of optional configurations and settings. The hashing approach in the SSIS packages is controlled through the following settings:
| SettingKey | SettingValue | Description |
| --- | --- | --- |
| HashAlgorithm | SHA1 | Hashing algorithm (`MD5`, `SHA1`, `SHA2_256`, `SHA2_512`) |
| HashBinary | N | Binary or string hash representation (`Y`, `N`) |
| HashIntegrationKey | N | Should the Object's Integration Key be hashed. This is always applied for loads with a Data Vault destination (`Y`, `N`) |
| UseSqlCompatibleHash | Y | Should the Integration Key hashing use a SQL `HASHBYTES()` compatible pattern (`Y`, `N`) |
| UseSqlCompatibleRowHash | Y | Should the row checksum hashing use a SQL `HASHBYTES()` compatible pattern (`Y`, `N`) |
### Hash representation

This setting defines how the resulting hash value is stored:

- String representation: the resulting hash is stored as the text (hexadecimal) representation of the binary hash value. This makes the hash value legible, and it can easily be copied and stored in plain text format.
- Binary representation: the hash is stored in its native, binary form. This requires less storage, but is not as well suited for readability or platform-agnostic storage.

When considering the hash data type, storage requirements and SQL Server performance are one aspect, while direct joinability to other sources without converting the hash is another. A small illustration follows.
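
As a small illustration, assuming `SHA1`, the string form is simply the hexadecimal rendering of the binary form:

```sql
-- One SHA1 hash, stored two ways. Style 2 renders binary as hex without '0x'.
DECLARE @input NVARCHAR(100) = N'ExampleKey';

SELECT
    HASHBYTES('SHA1', @input)                       AS BinaryRepresentation, -- binary(20)
    CONVERT(CHAR(40), HASHBYTES('SHA1', @input), 2) AS StringRepresentation; -- char(40), hex
```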
### Hash algorithm

The cryptographic algorithm used to derive the hash. The different algorithms result in differently sized hash values, require different amounts of processing power, and carry different risks of hash collisions.

- MD5
- SHA1
- SHA2_256
- SHA2_512
### Data Types and storage requirements for hash columns

| Hash Algorithm | String Representation | Binary Representation |
| --- | --- | --- |
| MD5 | char(32) | binary(16) |
| SHA1 | char(40) | binary(20) |
| SHA2_256 | char(64) | binary(32) |
| SHA2_512 | char(128) | binary(64) |
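
The binary sizes in the table can be verified directly in SQL Server:

```sql
-- DATALENGTH returns the byte count of each algorithm's binary hash value.
SELECT
    DATALENGTH(HASHBYTES('MD5',      N'x')) AS Md5Bytes,     -- 16
    DATALENGTH(HASHBYTES('SHA1',     N'x')) AS Sha1Bytes,    -- 20
    DATALENGTH(HASHBYTES('SHA2_256', N'x')) AS Sha2256Bytes, -- 32
    DATALENGTH(HASHBYTES('SHA2_512', N'x')) AS Sha2512Bytes; -- 64
```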
## Attribute Separators and Field Order
The hash algorithm runs its logic on a single input value. For data warehousing scenarios, the fields that are part of the hash are all concatenated into a single string.
When hashing multiple concatenated values, it is important to separate them to distinguish patterns. The separator is sometimes called a hash or field concatenator, separator, or sanding value. The default field separator in BimlFlex is `~`, and it is configurable in the settings. The separator ensures that hashing `A` and `BC` generates a different hash than hashing `AB` and `C`: without the separator, both scenarios would pass the input value `ABC` to the hashing process and result in identical hash values, whereas using the separator gives either `A~BC` or `AB~C`, resulting in different hashes. The separator is also used in Integration Key creation when concatenating multiple fields into a single string Integration Key representation.
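
A minimal demonstration of the separator's effect, using literal values rather than real keys:

```sql
-- Without the separator, both field combinations feed N'ABC' to the hash
-- and collide; with the default '~' separator they stay distinct.
SELECT
    HASHBYTES('SHA1', N'A' + N'BC')        AS NoSeparator1,   -- input is N'ABC'
    HASHBYTES('SHA1', N'AB' + N'C')        AS NoSeparator2,   -- input is N'ABC', identical hash
    HASHBYTES('SHA1', N'A' + N'~' + N'BC') AS WithSeparator1, -- input is N'A~BC'
    HASHBYTES('SHA1', N'AB' + N'~' + N'C') AS WithSeparator2; -- input is N'AB~C', a different hash
```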
The hash is dependent on the order of the included attributes. For the row hash, BimlFlex orders the columns in the same order as they appear in the metadata, controlled by the `Ordinal` metadata attribute. This is a flexible approach that, for instance, allows a column to be renamed without affecting the hash outcome.
## Null value replacements

It is not possible to hash a null value, so the BimlFlex hashing procedure uses an optional null replacement value. The default value is `NVL`, and it is configurable in the settings.
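
A small sketch of the replacement, assuming the default `NVL` value: the NULL attribute is swapped for the literal string before hashing, since hashing a NULL directly would itself return NULL.

```sql
-- COALESCE substitutes the configured replacement value for NULL attributes.
DECLARE @Color NVARCHAR(100) = NULL;

SELECT HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
       COALESCE(LTRIM(RTRIM(@Color)), 'NVL'))) AS HashWithNullReplacement;
```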
## Hashing Integration Keys

A Business Key, or Integration Key, is used as the main identifier for entities, most commonly as the base key for Hub entities in Data Vault. BimlFlex has a built-in feature to convert source columns and concatenate them into an Integration Key. There is also an optional setting to upper-case the string. This is useful where the database is case insensitive and keys with different casings are considered the same entity: different casings hash to different hash values, whereas the database engine would consider Integration Keys with different casings to be the same, so upper-casing provides a convenient way to automatically align all Integration Keys. BimlFlex always stores the original values in Satellites by default, so no data is lost in this process. The effect is sketched below.
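
A minimal sketch, assuming the upper-case setting is enabled and using hypothetical key values:

```sql
-- With UPPER() applied before hashing, case-variant keys align to one hash.
SELECT
    HASHBYTES('SHA1', UPPER(N'Customer-001')) AS HashOfMixedCaseKey,
    HASHBYTES('SHA1', UPPER(N'CUSTOMER-001')) AS HashOfUpperCaseKey; -- identical results
```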
When using AdventureWorksLT and the Product table, the hashes created by the SQL-compatible SSIS hashing component can be recreated using SQL Server `HASHBYTES()`, as shown in the sample comparison script at the end of this document.
## Hashing Row Checksums

A row hash is a hash of all relevant attributes in the row and is mainly used to identify changes to records. In the default source-to-staging-to-persistent-staging pattern, a lookup compares the row hash stored in the persistent staging area with the new row hash to determine whether any attributes have changed or whether the row should be ignored.

Special consideration is normally required for changes in the row change type indicator. This is because a row with the same attributes, but with a change indicator showing it has been deleted, would result in the same attribute hash while having a completely different meaning in the data warehouse. A hypothetical illustration follows.
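
A hypothetical illustration (the column name and change type values are illustrative, not from a BimlFlex model): including the indicator in the hash input separates a delete from an otherwise unchanged row.

```sql
-- The same attribute value with change types 'U' (update) and 'D' (delete)
-- produces different row hashes once the indicator is part of the input.
DECLARE @Name NVARCHAR(100) = N'Mountain Bike';

SELECT
    HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
        COALESCE(LTRIM(RTRIM(@Name)), 'NVL') + N'~' + N'U')) AS RowHashUpdated,
    HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
        COALESCE(LTRIM(RTRIM(@Name)), 'NVL') + N'~' + N'D')) AS RowHashDeleted;
```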
## Key collisions

There is a risk of key collisions when using hashing, where different values generate the same key. This is a calculated risk in using hashing, and a collision mitigation strategy should be in place for the integration process to accommodate these potential issues should they arise. The mathematical background, the size of the risk, and the required responses to it are out of scope for this article.
## Hash compatibility

The most important part of hashing data is consistent behavior, so that the same input value always results in the same hash value. This is especially true when data from different systems is hashed using different tools and approaches, and the hashes will later be used for joining and linking. For this compatibility, BimlFlex provides a `HASHBYTES()`-compatible hashing approach that allows other systems to hash using the same approach so that matching is possible. This document details the hashing approach so that it is easy to replicate, either manually in SQL or in a different process.
## Using deterministic representation in hash input

Some data types, such as dates and datetimes, have culture-specific representations as the default in certain scenarios. To ensure deterministic hashing, these data types are converted to strings with an explicit format applied. This allows the same input value to be derived regardless of the current culture or the default implicit string conversion of the current environment.
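
For example, the sample comparison script at the end of this document converts datetime columns with style 127 (ISO 8601). A small sketch of why the explicit style matters:

```sql
-- The default string form of a datetime depends on the session language;
-- the explicit ISO 8601 style (127) is the same in every environment.
DECLARE @d DATETIME = '2024-03-01T13:45:30';

SELECT
    CONVERT(NVARCHAR(30), @d)      AS CultureDependent,     -- e.g. 'Mar  1 2024  1:45PM'
    CONVERT(NVARCHAR(30), @d, 127) AS DeterministicIso8601; -- '2024-03-01T13:45:30Z'
```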
## Hashing approximate and complex data types

Some data types are by definition approximate representations. In SQL Server they are represented by the `real` and `float` data types. Depending on the values and the processing entity, the actual value used in the hashing might differ; for example, SQL Server and SSIS store some values slightly differently. For these scenarios it is important to note that the generated hash is not deterministic and can lead to different hash values for apparently identical input. In most scenarios this will only lead to the occasional added row when a change has not occurred. For scenarios where this is not acceptable, consider using exact numeric data types.
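
The sample comparison script below handles this with an explicit `CONVERT` style. For `float` values, style 2 always emits 16 significant digits in scientific notation, giving a stable textual form; a small sketch:

```sql
-- Style 0 (the default) rounds a float to at most 6 digits; style 2 always
-- emits 16 significant digits in scientific notation.
DECLARE @price FLOAT = 1431.50;

SELECT
    CONVERT(NVARCHAR(100), @price)    AS DefaultStyle, -- '1431.5'
    CONVERT(NVARCHAR(100), @price, 2) AS Style2;       -- '1.431500000000000e+003'
```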
Some complex data types, such as `geography`, `geometry`, `hierarchyid`, and `image`, don't have obvious string representations and can therefore be interpreted differently by different hashing approaches. For scenarios where these data types are used and deterministic hashing is needed, consider manually converting them to a known string format or object representation before hashing.
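
A hedged sketch of such a conversion, using the built-in `STAsText()` method for spatial types and `ToString()` for `hierarchyid`; the sample values are illustrative:

```sql
-- Converting complex types to a well-known textual form before hashing
-- makes the hash input explicit and reproducible by other tools.
DECLARE @g geography   = geography::Point(47.6, -122.3, 4326);
DECLARE @h hierarchyid = CAST('/1/2/' AS hierarchyid);

SELECT
    HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX), @g.STAsText())) AS GeographyHash,   -- WKT: 'POINT (-122.3 47.6)'
    HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX), @h.ToString())) AS HierarchyIdHash; -- path: '/1/2/'
```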
## Sample comparison script

The SQL script below queries the AdventureWorksLT staging table for the AWLT.Product table and recreates the SSIS-derived hashes for the Integration Key and the row hash using SQL `HASHBYTES()` syntax. Note the use of `CAST` and `CONVERT` with explicit formats.
```sql
SELECT
    CONVERT(CHAR(40), HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
        COALESCE(LTRIM(RTRIM(CAST(Product_BK AS NVARCHAR(100)))), 'NVL')
    )), 2) AS FlexRowHashKeyInSql,
    FlexRowHashKey,
    CONVERT(CHAR(40), HASHBYTES('SHA1', CONVERT(NVARCHAR(MAX),
        COALESCE(LTRIM(RTRIM(CAST([ProductID] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([Name] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([ProductNumber] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([Color] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(100), [StandardCost], 2))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(100), [ListPrice], 2))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([Size] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([Weight] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([ProductCategoryID] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([ProductModelID] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [SellStartDate], 127))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [SellEndDate], 127))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [DiscontinuedDate], 127))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST([ThumbnailPhotoFileName] AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CAST(REPLACE(REPLACE([rowguid], '{', ''), '}', '') AS NVARCHAR(100)))), 'NVL') + '~' +
        COALESCE(LTRIM(RTRIM(CONVERT(NVARCHAR(30), [ModifiedDate], 127))), 'NVL')
    )), 2) AS FlexRowHashInSql,
    FlexRowHash,
    *
FROM AWLT.Product
ORDER BY ProductID;
```