Marko Oja
Marko helps customers to understand the endless possibilities of technology and to transform innovative ideas into technical solutions. Agile methods and process development are close to his heart.
In a previous blog post I talked about how a Databricks Data Lakehouse can be created with a low-code implementation only. That is almost true: the system needs to be set up, and some code is needed for that initial configuration. What this code does is create a mount to the storage account that will be used as storage for the Delta tables. Fortunately this code is well documented and there are multiple guides for accomplishing it, like this one: https://docs.databricks.com/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access.html
At a high level the architecture pattern is simple. Source data is loaded from a file into a staging table, a view holds the transformation rules for the data, and the "data mart layer" table is updated with a merge statement.
The steps in this load pattern go like this:
1. Truncate the staging table and copy the source file into it.
2. Apply the transformation and validation rules through a view on top of the staging table.
3. Merge the view's output into the target table.
The result is that the data is loaded into Delta tables.
The Databricks setup needed a small batch of code, but fortunately the rest of the implementation is SQL and Data Factory. Before going into the pipeline creation, let's first take a look at what we need to set up in Databricks. I had three tables in my setup, but I will share just one to make this easier to follow. First, we need to establish the actual Delta tables, like this:
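As a minimal sketch (the column list, the database names stg and bl, and the mount path are my assumptions rather than the original definitions):

```sql
-- Databases for the staging and business layers (names are assumptions)
CREATE DATABASE IF NOT EXISTS stg;
CREATE DATABASE IF NOT EXISTS bl;

-- Staging table for raw person data, stored on the mounted data lake
CREATE TABLE IF NOT EXISTS stg.person (
  id INT,
  first_name STRING,
  last_name STRING,
  modified_date TIMESTAMP
)
USING DELTA
LOCATION '/mnt/datalake/stg/person';

-- Business layer (bl) dimension table for person
CREATE TABLE IF NOT EXISTS bl.person (
  id INT,
  first_name STRING,
  last_name STRING,
  modified_date TIMESTAMP,
  updated_at TIMESTAMP
)
USING DELTA
LOCATION '/mnt/datalake/bl/person';
```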
Here I have created the staging and bl tables for person.
On top of the staging table there is a view that validates I don't have duplicate IDs.
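A hedged sketch of such a view, keeping only the latest row per ID (the view name and the modified_date ordering column are assumptions):

```sql
-- Deduplication view: keep the most recently modified row per id
CREATE OR REPLACE VIEW stg.v_person AS
SELECT id, first_name, last_name, modified_date
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date DESC) AS rn
  FROM stg.person
) AS deduplicated
WHERE rn = 1;
```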
And finally, I have the merge statement that loads the data into the actual dimension table.
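Sketched with the same assumed table and column names, the merge could look like this:

```sql
-- Upsert from the validated view into the dimension table
MERGE INTO bl.person AS target
USING stg.v_person AS source
  ON target.id = source.id
WHEN MATCHED THEN UPDATE SET
  target.first_name    = source.first_name,
  target.last_name     = source.last_name,
  target.modified_date = source.modified_date,
  target.updated_at    = current_timestamp()
WHEN NOT MATCHED THEN INSERT (id, first_name, last_name, modified_date, updated_at)
  VALUES (source.id, source.first_name, source.last_name, source.modified_date, current_timestamp());
```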
That is all we need to do in Databricks. The rest of the implementation is Data Factory based.
I first created a metadata-operated dynamic pipeline, but when writing this blog it got a bit hard to follow. So here is a simplified version without all the bells and whistles: a basic load pattern with dedicated copy activities for each table and a notebook execution to run the merge statements.
The source for the copy activities is a basic Azure Blob Storage dataset and the sink is an Azure Databricks Delta Lake dataset. Both use a parameter to specify the target file or table. To use the Delta Lake dataset, a running interactive cluster is needed; it is configured in a linked service just like it would be for a basic notebook execution.
The copy activity itself doesn't hold any magic. There is a possibility to run a pre-copy script, just like with SQL databases. Here a truncate is used for the staging tables, but more Databricks-specific commands like VACUUM could also be used to optimize the loads further.
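The pre-copy script is plain SQL against the staging table; in the real pipeline the table name would come from the dataset parameter, which I show here hardcoded as an assumption:

```sql
-- Pre-copy script: empty the staging table before the copy activity reloads it
TRUNCATE TABLE stg.person;

-- Optionally, clean up old Delta file versions to keep the staging location lean
-- (168 hours matches the default retention period)
VACUUM stg.person RETAIN 168 HOURS;
```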
The notebook activity at the end of the pipeline does nothing but execute my notebook containing the merge statements in Databricks.
The entire pattern was extremely simple and straightforward, as a low-code implementation should be. I had the smallest single-node cluster setup on Databricks, so needless to say, the load times were not impressive. I believe there would be much better-suited patterns for Databricks, even written in SQL, that would perform much better than the ones I used. This pattern, however, came straight from a traditional data warehousing implementation, and I was determined to test just that.
For keeping raw data history, Delta Lake seems very promising. In a real-life scenario the load pattern would naturally be a bit more complex, and for the Data Factory implementation a more dynamic, metadata-operated solution would be preferable. Still, this is very close to a real-life scenario, requiring only small changes, and it is a low-code and quite cost-effective solution. Delta Lake is a promising tool in the modern data platform toolset, and I expect to see a growing number of implementations utilising it in the future.