Data Vault is Now Popular!
Over the last six years (essentially since I joined Snowflake), I have witnessed a massive increase in the interest in, and implementation of, Data Vault 2.0. I have talked to literally hundreds of companies across the globe, and across all industries, about changing their approach to building an enterprise data platform. It was sort of mind-boggling how many folks wanted to speak with me about this. So why, after almost two decades of successful Data Vault implementations, have so many people “suddenly” become interested in Data Vault?
Well, a few reasons:
- They are moving to the cloud (in this case, Snowflake) and figured it was time to look at their approach to data warehousing and data lakes.
- What they had been doing for decades, on critical review, really was not working (i.e., lots of expensive re-engineering all the time) and definitely could not scale.
- Things are changing so rapidly that they needed to find a way to be more agile.
So, when they searched on Google for things like agile data warehousing or agile data modeling, BOOM, they found blog posts and books on Data Vault 2.0! As they investigated more and read about the architectural principles and the very granular, pattern-based modeling technique, they saw that Data Vault 2.0 might indeed solve some of the problems they were experiencing. (The fact that it is a highly automatable approach was also very appealing.)
In the end, my conclusion was that companies moving to the cloud were experiencing a bit of a change in mindset: having become willing to challenge all the old ways, they were now open to considering an alternative approach. As the saying goes, when the pupil is ready, the teacher will appear!
Data Mesh is Hot!
Over the last eight to eleven months, I have also been speaking a lot with organizations about Data Mesh and how it might fit into their approach to building an analytics-focused data platform. While Dan Linstedt first published the Data Vault approach in 2000, Data Mesh is a much more recent addition to the thought leadership around modern data architectures: Zhamak Dehghani published the seminal article on the concept in 2019. In a subsequent article, she laid out four basic principles:
- Domain-driven ownership
- Data as a Product
- Self-service data platform
- Federated computational governance
While there is much debate about what each of these principles really means and how to implement a data mesh, everyone agrees that the approach is about decentralizing the control and development of data products. The goal is to eliminate the bottleneck many organizations experienced when trying to build data lakes or data warehouses through a centralized IT department that controlled all the DBAs, modelers, and ETL developers.
As I talked with a number of Snowflake’s larger global customers who were looking into Data Mesh, I discovered something very interesting – many of them were the same ones I had talked to earlier about Data Vault! As they made their journey to the cloud, not only were they considering a different modeling and design approach, but they were also looking at Data Mesh as a different organizational approach to building and supporting their new global, cloud-based data platform. In the process, several concluded that they should establish some standards, and that Data Vault was the best, and maybe the easiest, way to successfully implement the vision of a decentralized Data Mesh architecture.
How Data Vault Supports Data Mesh
The main question that popped up for these organizations was how they were going to design and model their data so that it could be:
- Developed independently by separate domain data teams
- Used to produce independent data products that would not contain redundant data (i.e., no duplicate data silos) and would be easily interoperable (i.e., shared and joined to other data products)
- Governed and secured at the domain level
- Agile, flexible, and extensible for future requirements
The Data Vault 2.0 approach actually supports all of these things, and has from the very beginning!
The Data Vault approach has always emphasized looking at data from a business-semantic perspective. That meant engaging the actual business owners of the data to understand its meaning, so that we could build a source-system-agnostic model. Well, in Data Mesh, that is the whole point of domain data teams – having the data owners (i.e., the subject matter experts) take responsibility for their data: the collecting, managing, cleansing, and delivery of good, high-quality data for analytics to any and all consumers of that data, inside or outside the organization. That is a Data Vault dream come true!
In Data Vault we use Hub tables to contain the unique business keys for every business concept, once and only once. This is what allows us to integrate data from multiple sources at a business-semantic level. We use Link tables to carry the relationships (some might say “joins”) between these business concepts. That is, we have abstracted all the relationships in a way that lets us build the model very incrementally and add to it as we discover new data and new relationships. This not only keeps us agile but, more importantly for Data Mesh, it means that to join data across different concepts (or products, or domains), we just need to add a Link table with the relationships. One large bank was doing this with several dozen separate agile teams as early as a decade ago!
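To make that concrete, here is a minimal sketch of a Hub and a Link in SQL DDL. The concepts (Customer, Order) and all table and column names are hypothetical choices of mine for illustration, though the columns follow the usual Data Vault 2.0 pattern of a hash key, business key, load date, and record source:

```sql
-- Hub: one row per unique business key for the Customer concept
CREATE TABLE hub_customer (
    hub_customer_key VARCHAR(32)  NOT NULL PRIMARY KEY, -- hash of the business key
    customer_bk      VARCHAR(100) NOT NULL,             -- the business key itself
    load_dts         TIMESTAMP    NOT NULL,             -- when the key first arrived
    record_source    VARCHAR(100) NOT NULL              -- originating system
);

-- Hub for the Order concept, same pattern
CREATE TABLE hub_order (
    hub_order_key  VARCHAR(32)  NOT NULL PRIMARY KEY,
    order_bk       VARCHAR(100) NOT NULL,
    load_dts       TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL
);

-- Link: carries only the relationship between the two concepts.
-- New relationships later mean new Link tables, never re-engineering the Hubs.
CREATE TABLE link_customer_order (
    link_customer_order_key VARCHAR(32)  NOT NULL PRIMARY KEY,
    hub_customer_key        VARCHAR(32)  NOT NULL REFERENCES hub_customer,
    hub_order_key           VARCHAR(32)  NOT NULL REFERENCES hub_order,
    load_dts                TIMESTAMP    NOT NULL,
    record_source           VARCHAR(100) NOT NULL
);
```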
As for governance and security – Data Vault was invented in a highly secure, U.S. Department of Defense environment. That meant some data was highly classified and had to be segregated (physically, in those days) from the less sensitive data in order to protect national security. So the data warehouse had to be decentralized (or, as we said then, distributed), but with the ability to let the folks with classified access easily see the less classified data – without duplicating it. That sounds a lot like separate data products that can be easily federated back together. It was done by putting classified Hubs on one server and less classified Hubs on another, with Link tables on the classified side providing the joins to the less classified side. (The less classified side had no idea the classified data was even there, because they did not have access to the Links.)
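On a modern cloud platform, the same segregation can be approximated with schemas and role-based grants instead of physical servers. Here is a hedged sketch using Snowflake-style grants; the schema and role names are mine, purely for illustration:

```sql
-- Less classified side: broadly readable Hubs and Satellites
CREATE SCHEMA public_vault;
-- Classified side: restricted Hubs, plus the Link tables that join
-- across to the less classified Hubs
CREATE SCHEMA secure_vault;

CREATE ROLE analyst;         -- general access
CREATE ROLE cleared_analyst; -- classified access

-- Uncleared analysts see only the public side
GRANT USAGE ON SCHEMA public_vault TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA public_vault TO ROLE analyst;

-- Cleared analysts see both sides; because the cross-domain Links live
-- in secure_vault, uncleared users never even know they exist
GRANT USAGE ON SCHEMA public_vault TO ROLE cleared_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA public_vault TO ROLE cleared_analyst;
GRANT USAGE ON SCHEMA secure_vault TO ROLE cleared_analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA secure_vault TO ROLE cleared_analyst;
```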
In more recent years, I have seen teams separate sensitive data even within the Satellite tables. For example, to be HIPAA compliant, they put the PHI (protected health information) for a Patient (Hub) into a PHI-specific Satellite and the less sensitive data into another Satellite, with both connected to the same Hub. With this split, they could more easily control who had access to that data and either hide the PHI or mask it from prying eyes.
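As a sketch of that split, the DDL below puts PHI in its own Satellite on a shared Patient Hub and applies a Snowflake-style dynamic masking policy to one sensitive column. All the names here are hypothetical, and masking is just one of several ways to hide the PHI:

```sql
-- Patient Hub: the shared anchor for both Satellites
CREATE TABLE hub_patient (
    hub_patient_key VARCHAR(32)  NOT NULL PRIMARY KEY,
    patient_bk      VARCHAR(100) NOT NULL,
    load_dts        TIMESTAMP    NOT NULL,
    record_source   VARCHAR(100) NOT NULL
);

-- Less sensitive descriptive attributes
CREATE TABLE sat_patient_general (
    hub_patient_key    VARCHAR(32) NOT NULL REFERENCES hub_patient,
    load_dts           TIMESTAMP   NOT NULL,
    preferred_language VARCHAR(50),
    patient_status     VARCHAR(20),
    PRIMARY KEY (hub_patient_key, load_dts)
);

-- PHI isolated in its own Satellite so access can be controlled per table
CREATE TABLE sat_patient_phi (
    hub_patient_key VARCHAR(32) NOT NULL REFERENCES hub_patient,
    load_dts        TIMESTAMP   NOT NULL,
    date_of_birth   DATE,
    diagnosis_code  VARCHAR(20),
    PRIMARY KEY (hub_patient_key, load_dts)
);

-- Mask PHI columns for anyone outside a (hypothetical) care-team role
CREATE MASKING POLICY mask_phi AS (val VARCHAR) RETURNS VARCHAR ->
    CASE WHEN CURRENT_ROLE() = 'CARE_TEAM' THEN val ELSE '*** MASKED ***' END;

ALTER TABLE sat_patient_phi
    MODIFY COLUMN diagnosis_code SET MASKING POLICY mask_phi;
```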
So, using Links and Satellites in Data Vault makes it very easy to achieve the domain-level governance expected in Data Mesh.
The Data Vault approach has, for two decades, allowed organizations to build data platforms that do not grow stale or outdated. Because it is easy to add new Hubs, Links, and Satellites, there is no need to waste time and money re-engineering existing structures when things change. It is all additive. Data Vault lets you future-proof your design because the approach is indeed flexible and extensible.
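Continuing the hypothetical model from the earlier sketch, discovering a new Store concept later means adding tables only; nothing that already exists is touched:

```sql
-- A new business concept arrives: add a Hub...
CREATE TABLE hub_store (
    hub_store_key VARCHAR(32)  NOT NULL PRIMARY KEY,
    store_bk      VARCHAR(100) NOT NULL,
    load_dts      TIMESTAMP    NOT NULL,
    record_source VARCHAR(100) NOT NULL
);

-- ...and a Link to relate it to an existing concept. Existing Hubs,
-- Links, and Satellites are untouched: the change is purely additive.
CREATE TABLE link_store_order (
    link_store_order_key VARCHAR(32)  NOT NULL PRIMARY KEY,
    hub_store_key        VARCHAR(32)  NOT NULL REFERENCES hub_store,
    hub_order_key        VARCHAR(32)  NOT NULL REFERENCES hub_order,
    load_dts             TIMESTAMP    NOT NULL,
    record_source        VARCHAR(100) NOT NULL
);
```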
The Future
Based on everything I have seen and the organizations I have worked with in the last few years, I do see Data Vault becoming the primary approach considered for building agile data platforms in the cloud. I am also seeing huge interest, and several early successes, in applying Data Mesh principles in very large organizations as they move to the cloud. The combination of the proven principles and methods of Data Vault 2.0 with the domain-centric, data-product-focused approach of Data Mesh may very well be the killer combination for organizations looking to build a true enterprise analytics platform in the cloud.
About the author
Kent Graziano, the Data Warrior, is a recognized industry thought leader, author, speaker, and semi-retired Data Cloud and Data Vault evangelist with too many decades in the industry to count! He is a Data Vault Master, Knight of the Oaktable, Oracle ACE Director – Alumni, and Grandmaster of TaeKwonDo. When he is not thinking about data and the cloud you might find him kayaking lakes and streams in various places, taking foodie pictures, practicing martial arts, volunteering at his local food bank, or just sitting on a beach watching the waves roll in – looking for that next perfect wave. You can follow him on his blog kentgraziano.com or on Twitter @kentgraziano.