In part two of this blog series, we sit down with Kent Graziano and Cindi Meyersohn to dig a bit deeper into data governance and its relationship to data quality.
Cindi: Kent, we’ve already talked about data governance as it relates to security, now what about data quality?
Kent: It’s important to talk about improved data quality as one of the key aspects and benefits of data governance, because if people can’t trust the data, they’re not going to use it. No matter how sophisticated you get in loading your data, moving your data, transforming your data, and displaying your data, if people don’t trust the quality of that data, they’re not going to use it.
The Data Vault architecture and methodology, above and beyond the modeling, includes data quality in the approach, specifically through TQM (Total Quality Management), along with several other frameworks. In particular, Data Vault embraces the idea of a continuous feedback loop on the data back to the sources as part of the approach. The inclusion of a continuous feedback loop should be part of any planning and design of a Data Vault.
Cindi: You and I know how crucial data quality is, but how would you suggest exposing the business to data quality awareness?
Kent: The idea is to expose the data to business users and have them provide feedback on whether the data is correct. In many of my Data Vault projects, I used the raw vault as a prototype before any soft business rules were applied to it. I simply exposed the data to the end users in some form so that they could see the quality of the data and comment on it. This is how we could begin the journey of data quality awareness. Is the source data what they expected it to be? Is the quality there, or not?
Cindi: And that also delivers value to the business sooner, yes?
Kent: Right, in line with a disciplined agile approach, we’re delivering value to the end users sooner rather than later and doing more rapid iterations. Instead of waiting until we get all the way to the end of a full-blown production quality Information Mart, we show them the data early. This way, we can find the errors in the data sooner and start making corrections at the source.
Cindi: Much more efficient than what we did in the legacy data warehouse days with data scrubbing.
Kent: People always talked about scrubbing the data. What that meant was that we would modify the data as it entered the data warehouse. Now, there are a lot of downsides to that. One predominant issue is that you lost track of what was in the source data. That causes problems from an auditability and lineage perspective: you’re now looking at transformed data, with no way to easily replicate what the raw data was. This is one of the places where the Data Vault methodology comes into play as a benefit, both for data governance and for data quality.
Specifically, the combination of loading the raw data unchanged into a staging area followed by reformatting it into a Data Vault schema using the hubs, links, and satellite model gives us not one, but two places to look at the original data to make determinations about data quality.
Without the Data Vault methodology, if someone says the data is wrong, it’s hard for you to prove where the error came from. Was it a mistake made in building the data warehouse? Was it a mistake made in applying business rules? Or was the data already bad in the source system? You can no longer audit that aspect of the build.
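To make the two inspection points Kent describes concrete, here is a minimal Python sketch, purely illustrative: the record is landed unchanged in staging, and the same unmodified attributes are carried into a raw-vault satellite keyed by the hub’s hash key. The table and column names, and the use of MD5 for the hash key, are assumptions for the example, not a prescription.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Hash of the business key, used as the hub's surrogate key (illustrative choice)."""
    return hashlib.md5(business_key.encode("utf-8")).hexdigest()

# 1. Staging: the record exactly as it arrived from the source, plus load metadata.
source_record = {"customer_id": "C-1001", "name": "Acme Corp", "credit_limit": "50000"}
load_dts = datetime.now(timezone.utc)
staging_row = {**source_record, "load_dts": load_dts, "record_source": "CRM"}

# 2. Raw vault: the hub holds the business key; the satellite carries the same
#    unmodified attributes, keyed by the hub's hash key.
hub_customer = {
    "hub_customer_hk": hash_key(source_record["customer_id"]),
    "customer_id": source_record["customer_id"],
    "load_dts": load_dts,
    "record_source": "CRM",
}
sat_customer = {
    "hub_customer_hk": hub_customer["hub_customer_hk"],
    "load_dts": load_dts,
    "record_source": "CRM",
    "name": source_record["name"],                 # copied as-is, no cleansing yet
    "credit_limit": source_record["credit_limit"], # still the raw string from the source
}

print(staging_row)   # first place to inspect the original data
print(sat_customer)  # second place: same values, now historized in the vault
```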
Cindi: The other benefit is the reduction of technical debt, because technical debt is directly tied to how much quality cleansing and data scrubbing you’ll have to do to fix that data. You get the benefit of reduced cost in an auditable data set.
Kent: Correct, the farther through the lifecycle you are when an error is discovered, the more it costs to fix it, because of the amount of code to look through and re-engineer, or at least evaluate. And it is much harder to determine where the error occurred. With the Data Vault approach, we can look at the data on the staging area as soon as we’ve onboarded it into the platform.
Now, the concept of hard and soft rules in Data Vault is that we apply the hard rules on the way into the staging areas: things that don’t change the grain of the data and don’t pre-aggregate it. This is the first gate of data quality in a Data Vault approach.
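Here is a hedged sketch of that hard-rule versus soft-rule split: hard rules do technical alignment on the way into staging without changing the grain, while soft rules (here, an aggregation) are applied later and downstream. The specific rules, field names, and source name are illustrative assumptions, not a canonical list.

```python
from datetime import datetime, timezone

def apply_hard_rules(raw: dict) -> dict:
    """Hard rules: technical alignment only (trim, type-cast, add load metadata).
    The grain is unchanged: one input row produces one output row."""
    return {
        "customer_id": raw["customer_id"].strip().upper(),
        "order_amount": float(raw["order_amount"]),   # type standardization
        "order_date": raw["order_date"],              # left as delivered
        "load_dts": datetime.now(timezone.utc),
        "record_source": "ORDERS_API",
    }

def apply_soft_rules(staged_rows: list[dict]) -> dict:
    """Soft rules: business interpretation applied later, downstream of the raw
    vault. Here, an aggregation that changes the grain."""
    totals: dict[str, float] = {}
    for row in staged_rows:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["order_amount"]
    return totals

raw_feed = [
    {"customer_id": " c-1001", "order_amount": "120.50", "order_date": "2021-06-01"},
    {"customer_id": "C-1001 ", "order_amount": "80.00",  "order_date": "2021-06-03"},
]
staged = [apply_hard_rules(r) for r in raw_feed]
print(staged)                    # first data-quality gate: inspectable, same grain
print(apply_soft_rules(staged))  # business rule applied later, e.g. {'C-1001': 200.5}
```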
Cindi: Exactly, we’re looking at the data in the form it was collected – the data as a “fact,” and soft rules are applied further down the line, instead of what used to happen in the data warehousing and ETL world, where data was physically changed.
Kent: We’re applying the business rules later in the game: after we’ve got the data in, we’ve got it structured, we know we have the right data types, all of that. Again, this reduces the overall cost because we know we’re working with the right data when we start applying the rules. If something is wrong on the other end, we can reasonably assume we misapplied a business rule and we can look at the business rules and the interpretation of the data. Perhaps it’s not the data that’s a problem, but the business rule or the approach to aggregation, and that’s a different level of data quality from a business user perspective.
Oftentimes, we apply business rules as told to us by the businesspeople, and there are many cases where they may not understand the implication of the rule until they see the outcome, which is why having the ability to see all the correct source data at any time is such a great benefit of Data Vault.
Cindi: Speaking of business rules from a governance perspective, the methodology also calls for versioning your objects and processes, so you have data governance over the business rules as well: by maintaining a version of each rule at any moment in time, you can audit back to how a calculation was computed.
Kent: We want to know from an auditing perspective where the mistake was, and if you don’t have it before and after, then essentially, you’re left guessing. This is one of the advantages of the Data Vault structure itself, with its satellites. We always know where the data changed, even months later. It’s a great quality check.
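As a rough illustration of the audit capability Kent mentions, the sketch below shows a satellite that writes a new row only when the descriptive attributes actually change, so you can see exactly when a value changed, even months later. The column names and the hash-diff technique are assumptions for the example.

```python
import hashlib
import json

def hash_diff(attributes: dict) -> str:
    """Fingerprint of the descriptive attributes, used to detect change."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

satellite: list[dict] = []  # ordered history of rows, in-memory stand-in for a table

def load_satellite(hub_hk: str, load_dts: str, attributes: dict) -> None:
    """Insert a new satellite row only if the attributes changed since the last load."""
    new_diff = hash_diff(attributes)
    latest = next((r for r in reversed(satellite) if r["hub_hk"] == hub_hk), None)
    if latest is None or latest["hash_diff"] != new_diff:
        satellite.append({"hub_hk": hub_hk, "load_dts": load_dts,
                          "hash_diff": new_diff, **attributes})

load_satellite("abc123", "2021-01-05", {"name": "Acme Corp", "tier": "GOLD"})
load_satellite("abc123", "2021-02-05", {"name": "Acme Corp", "tier": "GOLD"})    # no change, no row
load_satellite("abc123", "2021-03-05", {"name": "Acme Corp", "tier": "SILVER"})  # change recorded

for row in satellite:
    print(row["load_dts"], row["tier"])  # audit trail: exactly when the value changed
```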
Cindi: I really liked what you were saying about checking the data in staging. You’re not only capturing the errors but you’re capturing the metrics around those errors by data source.
One of the intrinsic values that the Data Vault methodology brings to the business is the embedded feedback loop. What this means to the business and to the data teams supporting the Data Vault is that the technical team now has a consistent, repeatable, pattern-based method to report back to the business where the data teams are encountering errant or poor-quality data originating from the source system.
Knowing how many keys arrive null, and whether those keys are required or optional, provides quality metrics about the data in the source systems; it enables quantifiable measurement of the quality problems emerging from those systems, like null business keys. The number of business keys that arrive null, and the number of foreign keys that arrive null, are indicative of broken source or operational system processes.
Traditionally, the data teams are instructed to mask these kinds of problems by filtering out these types of errant records before the data is ever landed. This is a form of technical debt that remains hidden from the business as a result. When keys arrive null then there is a problem in the source system, in the operational system itself, that needs to be addressed by the business and corrected one time by the source system’s operational team – NOT passed down to the data analytics team to “fix” in the data warehouse.
Data Vault keeps discrete counts of these types of errors and enables the load processes to continue running without process failure and without stopping, all because the methodology advocates using one specific substitute value for a required business key that arrives null and a different substitute value for an optional business key (like a foreign key) that arrives null. Keeping counts of quality and system data errors like these allows the technical team to report to the business where it is receiving null foreign keys versus null primary business keys. A null value in a foreign key signals a different problem from a null value in a primary business key: missing or null foreign keys indicate that perhaps you’ve got a broken process, or that a constraint has been changed at the database level, right? And these are just a few of the data quality checks that Data Vault embeds in its methodology.
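A minimal sketch of the pattern Cindi describes might look like the following: a null required business key and a null optional foreign key each receive a distinct substitute value, the load keeps running, and the counts are kept per source so they can be reported back to the business. The sentinel values, column names, and source name are assumptions for illustration only.

```python
from collections import Counter

REQUIRED_KEY_SENTINEL = "-1"  # assumed substitute for a null required business key
OPTIONAL_KEY_SENTINEL = "-2"  # assumed substitute for a null optional / foreign key

error_counts: Counter = Counter()

def substitute_keys(row: dict, source: str) -> dict:
    """Replace null keys with sentinels and count the errors; never stop the load."""
    out = dict(row)
    if out.get("customer_id") in (None, ""):      # required business key
        out["customer_id"] = REQUIRED_KEY_SENTINEL
        error_counts[(source, "null_business_key")] += 1
    if out.get("sales_rep_id") in (None, ""):     # optional foreign key
        out["sales_rep_id"] = OPTIONAL_KEY_SENTINEL
        error_counts[(source, "null_foreign_key")] += 1
    return out

feed = [
    {"customer_id": "C-1001", "sales_rep_id": None},
    {"customer_id": None,     "sales_rep_id": "R-07"},
    {"customer_id": "C-1002", "sales_rep_id": "R-07"},
]
loaded = [substitute_keys(r, source="CRM") for r in feed]  # the load never stops
print(error_counts)  # counts per source and error type, ready to report back
```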
Kent: Correct, and all these checks are a key part of Data Vault methodology: duplicate checks, orphan checks, metric marts, error marts; this is the Data Vault 2.0 system of business intelligence: not just hubs, links, and satellites. The model gives us a lot of benefits, but the methodology that incorporates things like TQM and orphan checks and duplicate checks is a way of running your project and building your platform to ensure that you are delivering the quality data that the end users need, and in cases where the data coming in from the sources is of poor quality, you discover that sooner rather than later.
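For illustration, here are toy versions of two of the checks Kent lists, an orphan check and a duplicate check. The table shapes and key names are assumptions; in practice these would run as set-based queries against the vault and feed the error and metric marts.

```python
from collections import Counter

# Orphan check: a link row referencing a hub key that does not exist.
hub_keys = {"HK1", "HK2", "HK3"}
link_rows = [{"customer_hk": "HK1"}, {"customer_hk": "HK4"}]  # HK4 has no hub record

orphans = [r for r in link_rows if r["customer_hk"] not in hub_keys]
print("orphan link rows:", orphans)  # candidates for the error mart

# Duplicate check: the same business key staged more than once in a batch.
staged_keys = ["C-1001", "C-1002", "C-1001"]
dupes = [k for k, n in Counter(staged_keys).items() if n > 1]
print("duplicate business keys in batch:", dupes)  # metric mart material
```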
Cindi: It’s so important to remember the “why” behind the methodology. We capture the nulls for example because the standard says we capture the good, the bad and the ugly data, and we’re not filtering it out.
Kent: Right, because again, in the old world of data warehousing, if something violated a referential integrity constraint, we skipped over it, maybe dumped it off to the side somewhere, but you would then have to look somewhere else to find exactly what the problem was. Put simply, if you’re not recording it, you can’t report on it.
Cindi: Well Kent, as always, it’s been a pleasure talking with you. I really believe that when we can talk about Data Vault from the perspective of what the methodology offers the business, the business can begin to understand the value that Data Vault brings to the table. I always tell my classes, “Building hubs, links, and satellites does NOT a Data Vault solution make!” This discussion helps to open the mind into those aspects of the Data Vault that many users – both technical and business – never hear about.
Data governance is such a huge topic today, especially for companies that are gearing up to embrace the concept of a data mesh ecosystem. I genuinely believe that the vision behind the data mesh theory was cut from the Data Vault 2.0 solution cloth … but that’s probably a topic for another discussion!