Data Quality vs. Data Observability: What’s the Difference?

This is part of Solutions Review’s Premium Content Series, a collection of contributed columns written by industry experts in maturing software categories. In this submission, Datafold founding CEO Gleb Mezhanskiy answers the question: “data quality vs. data observability; what’s the difference?”


As a general rule, data quality problems have always been handled reactively. In the past, data quality issues were discovered in production by business users and were then fixed by a trained data engineer. This meant that these issues may have existed for a long time, and many insights and decisions based on that data could have been inaccurate.

As data becomes more and more central to how a business operates, we can learn from the software world how to reduce quality issues. In software, a key principle for maintaining quality is that defects are most easily addressed when they’re discovered while writing code, before anything is merged into production. We should aim for the same in data: shift testing left, away from the business users who consume the data and toward the data engineers who model it.

Today, there are a handful of data quality solutions that focus on shortening the interval between the initial appearance of a data quality issue and its eventual resolution. This typically takes the form of a rules-based monitoring system that tracks live data within a data warehouse, then sends an alert if it sees anything that falls outside of acceptable bounds. Some products use artificial intelligence (AI) to scan for anomalies, fine-tuning their selection criteria along the way. In either case, the process is very much the same: A problem is spotted in the data, then comes an intervention, things get fixed, and life goes on until the next issue shows up.
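To make the rules-based variant concrete, here is a minimal sketch in Python. It assumes a table has already been loaded as a list of dictionaries; the names `Rule`, `null_rate`, `row_count`, and `evaluate`, the sample rows, and the thresholds are all hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Row = dict[str, Optional[object]]


@dataclass
class Rule:
    """A hypothetical monitoring rule: a metric plus its acceptable bounds."""
    name: str
    metric: Callable[[Sequence[Row]], float]
    lower: float
    upper: float


def row_count(rows: Sequence[Row]) -> float:
    """Number of rows in the table."""
    return float(len(rows))


def null_rate(column: str) -> Callable[[Sequence[Row]], float]:
    """Build a metric that returns the share of rows where `column` is None."""
    def metric(rows: Sequence[Row]) -> float:
        if not rows:
            return 0.0
        return sum(1 for row in rows if row.get(column) is None) / len(rows)
    return metric


def evaluate(rules: Sequence[Rule], rows: Sequence[Row]) -> list[str]:
    """Return one alert message per rule whose metric falls outside its bounds."""
    alerts = []
    for rule in rules:
        value = rule.metric(rows)
        if not (rule.lower <= value <= rule.upper):
            alerts.append(
                f"{rule.name}: {value:.2f} outside [{rule.lower}, {rule.upper}]"
            )
    return alerts


# Illustrative data only: half of the emails are missing.
rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},
    {"order_id": 3, "email": None},
    {"order_id": 4, "email": "d@example.com"},
]
rules = [
    Rule("row_count", row_count, lower=1, upper=1_000_000),
    Rule("email_null_rate", null_rate("email"), lower=0.0, upper=0.1),
]

alerts = evaluate(rules, rows)
for alert in alerts:
    print(alert)
```

Note that the check can only fire once the bad rows are already in the table, which is the limitation the rest of this piece is concerned with.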

That’s data observability. It’s better than the business user letting you know about the problem. The question is: Can we do better? Monitoring for data quality problems is useful, but it doesn’t actually prevent problems from cropping up in the first place. In order for these tools to do their job, something must already have broken.

In an era of big data analytics, AI/ML, and automation at scale, that’s a problem. The data stored in warehouses is now being used to power a wide variety of reports, analytics and applications. That includes things like email marketing segmentation, fraud detection, or data science applications. Dashboards, data apps, notebooks, and data activation are consuming data at scale, very often in ways that may not have been planned or predicted in advance. In some cases, the data engineers overseeing a warehouse might not even be aware of all the ways their data is being used.

The key point here is that data quality issues can have very real (and far-reaching) consequences. Detecting them after the fact is often too late. Alerting systems can also be incomplete, because some anomalies are not easily detected by algorithms and fly under the radar.

When poor data quality is allowed to persist, it drives bad decisions, with potentially catastrophic consequences. As algorithms drive more and more business processes based on data, poor data quality can lead to poor automation. That means payments being denied or processed improperly, email workflows driving the wrong messages to the wrong people, and ultimately, poor customer experiences.

The term “data quality,” broadly speaking, defines a problem domain. Data observability describes one potential solution to the problem of poor data quality, but it’s not the only one — and it’s not necessarily the best one. There is a second (and better) solution, which we might refer to as preemptive data testing.

Avora co-founder David Jayatillake recently pointed out that while shift-left testing has been a driving force in software engineering for quite some time, it’s a relatively new idea within the context of data engineering and data governance. By taking a proactive approach to data quality, we can prevent a great many data quality problems from ever emerging at all. In other cases, we can at least identify and fix issues earlier, at a much lower cost.

This suggests a fundamental shift in how data stewards approach the problem of poor data quality. It calls for making data testing an integral element of the data engineering process. In other words, we should look for the conditions that will result in poor data quality before they reach production. Although many vendors in the data quality space are pursuing data observability, they’re missing out on this opportunity.

By providing data engineers with a data quality impact report, we can systematically identify potential data quality problems up front within their pull request workflow. That will allow data engineers to make better design choices and reduce the number of quality incidents that get introduced into the data.

This is a preemptive approach, aimed at prevention, so it shows up at the far left extreme of the data flow diagram. Deal with data quality in the pull request by understanding the impact that changes will have on the underlying data as well as any connected data assets and products. Armed with that information, data engineers can prevent their newly merged code from adversely impacting data quality.
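One way to picture such an impact report is as a row-level comparison between the table built by the production code and the same table built by the code in the pull request. The sketch below is an illustration under stated assumptions, not a description of any vendor's implementation: both versions are assumed to fit in memory as lists of dictionaries sharing a primary key, and the function name `diff_tables` and the sample data are hypothetical.

```python
from typing import Any


def diff_tables(
    prod: list[dict[str, Any]],
    dev: list[dict[str, Any]],
    key: str,
) -> dict[str, Any]:
    """Compare two versions of a table by primary key.

    Returns the keys that were added, the keys that were removed, and, for
    keys present in both versions, the columns whose values differ.
    """
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())

    changed = {}
    for k in sorted(prod_by_key.keys() & dev_by_key.keys()):
        before, after = prod_by_key[k], dev_by_key[k]
        columns = sorted(
            c for c in before.keys() | after.keys() if before.get(c) != after.get(c)
        )
        if columns:
            changed[k] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Illustrative data: the pull request changes one value, drops one row,
# and introduces one new row.
prod_rows = [
    {"id": 1, "revenue": 100},
    {"id": 2, "revenue": 250},
    {"id": 3, "revenue": 75},
]
dev_rows = [
    {"id": 1, "revenue": 100},
    {"id": 2, "revenue": 205},
    {"id": 4, "revenue": 10},
]

report = diff_tables(prod_rows, dev_rows, key="id")
print(report)
```

Reviewing a summary like this inside the pull request lets the engineer decide whether each difference is intended before the change is merged. A real warehouse would need the comparison pushed down into SQL or sampled rather than loaded into memory.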

Gleb Mezhanskiy