Delta Lake

Delta Lake is an open-source storage layer that can be used 'on top of' a Data Lake, in this case Azure Data Lake Gen2. Delta Lake adds transaction control (Atomicity, Consistency, Isolation and Durability, or 'ACID') to the Data Lake.
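The mechanism behind this can be illustrated with a simplified sketch (pure Python, not the actual delta-spark implementation): Delta Lake records every change as a numbered JSON commit file in a `_delta_log` directory next to the data files, and a write only becomes visible once its commit file exists.

```python
import json
import os
import tempfile

def commit(table_path, actions):
    """Record a change by writing the next numbered commit file into the
    table's _delta_log directory. A simplified sketch of Delta Lake's
    transaction log protocol, not the real implementation."""
    log_dir = os.path.join(table_path, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    # The next version number is one past the highest committed version.
    version = len([f for f in os.listdir(log_dir) if f.endswith(".json")])
    commit_file = os.path.join(log_dir, f"{version:020d}.json")
    # Write to a temp file first, then rename: the rename is the atomic
    # "commit point", so readers never see a half-written log entry.
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    os.rename(tmp, commit_file)
    return version

# Usage: two writes produce commit versions 0 and 1.
table = tempfile.mkdtemp()
v0 = commit(table, [{"add": {"path": "part-0000.parquet"}}])
v1 = commit(table, [{"add": {"path": "part-0001.parquet"}}])
```

The rename-as-commit-point is the essential idea: a reader either sees the whole commit file or none of it, which is what gives writes to a plain file store atomic semantics.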

This approach supports a 'Lake House' style architecture, which offers opportunities to work with various kinds of data in a single environment: for example, combining semi-structured and unstructured data, or batch and streaming processing. This means various use cases can be supported by a single infrastructure.

Microsoft has made Delta Lake connectors available for Azure Data Factory (ADF) pipelines. Using these connectors, the Data Lake can 'act' as a typical relational database for delivering your target models, while at the same time the lake remains available for other use cases.

There are many ways to do this. Fundamentally, there is a split between where the data resides (storage) and where the processing takes place (compute). The technologies used for each of these components define the technical architecture to a large extent.

For example, you can use an existing Databricks cluster to store files in its local storage, or mount a file or directory from the Data Lake or another file storage provider. You can also use connectors to access Data Lake files from Databricks, so that other data sources (including the Data Lake) can be made available without necessarily using local storage.

When connecting to the cluster via ADF (Linked Service), you can run queries against this existing cluster and its tables through the Delta Lake layer. The cluster performs the compute, and the result can be written to a sink as defined in ADF. Delta Lake manages transaction control.

The Data Lake is fundamentally a file-based environment, but the transaction controls that Delta Lake provides allow it to be used as a relational database to some extent.
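A sketch of why that works (a hypothetical, pure-Python model; real Delta readers replay the log via Parquet checkpoints and the delta-spark library): the current 'table' is reconstructed by replaying add/remove file actions from the transaction log, which gives a collection of files table-like, versioned semantics.

```python
def snapshot(log):
    """Reconstruct the visible set of data files by replaying a transaction
    log. `log` is a list of commits; each commit is a list of actions like
    {"add": {"path": ...}} or {"remove": {"path": ...}}.
    A simplified model of a Delta Lake snapshot."""
    files = set()
    for commit in log:
        for action in commit:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

log = [
    [{"add": {"path": "part-0000.parquet"}}],    # version 0
    [{"add": {"path": "part-0001.parquet"}}],    # version 1
    [{"remove": {"path": "part-0000.parquet"}},  # version 2: an update
     {"add": {"path": "part-0002.parquet"}}],    # rewrites a file
]

current = snapshot(log)       # state at the latest version
as_of_v1 = snapshot(log[:2])  # 'time travel': state as of version 1
```

Because older commits are never modified, replaying a prefix of the log reproduces any earlier version of the table, which is how Delta Lake offers versioned, table-like reads over immutable files.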

ADF Pipelines can connect to existing Databricks clusters using an 'Azure Databricks Delta Lake' Linked Service. This way queries can be run against the Delta Lake, but only when a cluster is available. Queries cannot be run against the Data Lake directly; a cluster is always needed.
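Such a Linked Service definition might look as follows (an illustrative fragment: the property names follow the ADF 'AzureDatabricksDeltaLake' linked service schema, while the workspace domain, cluster id, and key vault references are placeholders):

```json
{
    "name": "DeltaLakeLinkedService",
    "properties": {
        "type": "AzureDatabricksDeltaLake",
        "typeProperties": {
            "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
            "clusterId": "0000-000000-example0",
            "accessToken": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "MyKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "databricks-access-token"
            }
        }
    }
}
```

Note that `clusterId` points at an existing interactive cluster: if that cluster is terminated, pipeline activities using this Linked Service will fail or wait for it to start, which is the 'cluster is always needed' constraint described above.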

Use Cases