SpotX runs thousands of auction servers that generate terabytes of reporting data daily. The quality and integrity of our data is critically important to our customers because it plays a key role in their business decisions. Reporting revenue data accurately to the nearest penny is crucial for publishers, while advertisers need quality data to improve their planning capabilities and campaign efficiency via our forecasting application.
When dealing with huge volumes of time-critical data, this question of quality becomes even more important. But how do you gain the confidence to trust your data at this massive scale? As a data team within SpotX, that's exactly what we were tasked with doing.
Overview of data at SpotX
At SpotX, our modern data pipeline, shown simplified below, consists of technologies such as Apache Kafka, Spark, and Druid, backed by our Hadoop cluster. These scalable technologies help us process events in both batch and real-time fashion to generate reports and power dashboards for downstream consumers. For example, in a typical day we process billions of markets.
One situation we wish to avoid is manually discovering data issues several hours after processing — for example, writing multiple copies of the same data. This post describes how we introduced a tool, Deequ, at SpotX to test the quality of our data at a stage in our pipeline.
What is Deequ and how does it help?
Deequ is an open source library developed at AWS that is built on top of Apache Spark for defining “unit tests for data.” You define constraints on your data (such as values in a column can’t be null) and Deequ will compute the metrics needed to verify this constraint. Example metrics are given below:
- Completeness – the fraction of non-null values in a column
- ApproxCountDistinct – the approximate count of unique values in a column using a hyperloglog algorithm
Deequ then outputs the degree to which your constraints held on the data.
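As a toy illustration of these metrics (pure Python, not Deequ's Spark implementation — the column data below is invented, and the distinct count is exact rather than HyperLogLog-based):

```python
# Toy illustration of Deequ-style column metrics.
# This is NOT Deequ: Deequ computes these over Spark DataFrames, and
# ApproxCountDistinct uses a HyperLogLog sketch. Exact counts are used
# here purely for clarity.

def completeness(values):
    """Fraction of non-null values in a column."""
    if not values:
        return 0.0
    non_null = sum(1 for v in values if v is not None)
    return non_null / len(values)

def count_distinct(values):
    """Exact count of unique non-null values (Deequ approximates this
    at scale with a hyperloglog algorithm)."""
    return len({v for v in values if v is not None})

# An invented sample column: one of five values is null.
ad_format = ["video", "video", None, "display", "video"]
print(completeness(ad_format))    # 0.8
print(count_distinct(ad_format))  # 2
```

A constraint such as "completeness of ad_format must be at least 0.95" would then fail on this sample, since only 80% of the values are non-null.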
The declarative nature of Deequ lets you focus on describing how your data should look rather than writing the underlying checks yourself. Deequ can also suggest constraints for you by profiling your data. Because it is built on top of Apache Spark and designed to make as few passes over the data as necessary, Deequ scales easily to datasets with millions of rows. An overview is shown below. The variety of checks, the ease of development, and this scalability are the reasons we opted to use Deequ at SpotX.
Where Deequ fits in at SpotX
As part of our data pipeline, we run a Spark application that joins streams of market information (such as bid prices) and events (such as a user clicking on an ad) coming out of our auction process, performs some aggregations, and writes to Hive tables before being ingested into Druid. It is vital that the output of this job is correct so we use Deequ to test the data quality. Here is a sample of the checks we perform on our data.
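To give a flavour of what such checks express (a hedged pure-Python stand-in for Deequ's declarative API, not Deequ itself — the column names and sample rows here are invented for illustration):

```python
# A minimal stand-in for declarative data-quality checks, for
# illustration only. Real Deequ checks run on Spark DataFrames; here
# each check is a (description, predicate-over-rows) pair evaluated
# over a list of dicts. All column names and values are invented.

rows = [
    {"market_id": 1, "bid_price": 2.50, "ad_format": "video"},
    {"market_id": 2, "bid_price": 0.75, "ad_format": "display"},
    {"market_id": 3, "bid_price": 1.10, "ad_format": "video"},
]

checks = [
    ("market_id is complete",
     lambda rs: all(r["market_id"] is not None for r in rs)),
    ("market_id is unique",
     lambda rs: len({r["market_id"] for r in rs}) == len(rs)),
    ("bid_price is non-negative",
     lambda rs: all(r["bid_price"] >= 0 for r in rs)),
    ("ad_format is in the allowed set",
     lambda rs: all(r["ad_format"] in {"video", "display"} for r in rs)),
]

results = {desc: check(rows) for desc, check in checks}
for desc, passed in results.items():
    print(f"{desc}: {'PASS' if passed else 'FAIL'}")
```

Because each predicate scans every row, even a single bad value in a huge batch flips its check to FAIL — the property that makes full-scan checks like these worth their cost.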
We incorporated Deequ into our pipeline to run after this batch job finishes processing its hour of data. Our Deequ application computes metrics for the hundreds of constraints we defined over hundreds of millions of rows, taking approximately 15 minutes to finish. If any constraint is not met, we are alerted, allowing us to investigate in near real time and fix the data before consumers are affected.
One data issue that Deequ brought to our attention was a column containing values outside its expected set. We raised the issue with the upstream development teams, who were able to investigate further. This shows the power of automated checks: the problem appeared in just one row out of forty million for that hour of data.
In addition to constraints with lower and upper bounds, we plan to make use of Deequ's anomaly detection feature, which lets us check how a metric changes over time. For example, we can specify the maximum rate at which the number of rows in a table should grow from one hour to the next. If that rate is exceeded, Deequ marks the constraint as failed.
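The rate-of-change idea can be sketched as follows (a toy pure-Python version — Deequ ships its own anomaly-detection strategies on top of stored metric history; the threshold and row counts here are invented):

```python
# Toy version of an anomaly-detection constraint on row-count growth.
# Deequ provides strategies like this built in, comparing a metric
# against its recorded history; this standalone function only captures
# the core idea. The 2x threshold and the counts below are invented.

def row_count_growth_ok(previous_count, current_count, max_rate=2.0):
    """Pass if the row count grew by at most max_rate times from one
    hour to the next; fail otherwise."""
    if previous_count == 0:
        # No baseline: only an empty current batch is trivially OK.
        return current_count == 0
    return current_count / previous_count <= max_rate

print(row_count_growth_ok(1_000_000, 1_500_000))  # True  (1.5x growth)
print(row_count_growth_ok(1_000_000, 3_000_000))  # False (3x growth)
```

A sudden 3x jump in rows between hours would therefore fail the constraint and trigger an alert, even though every individual row might pass its per-column checks.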
To sum up, incorporating automated data quality checks into our pipeline has:
- Improved our confidence in its outputs, especially after an upstream release;
- Given us visibility into data issues before consumers are affected; and
- Brought the quality of data to the forefront while developing.
About the author
Adam Welsh is a Software Engineer at SpotX, working within a big data team based in Belfast, helping to develop and maintain a data pipeline. He is a contributor to the Apache Airflow and Apache Druid projects. Outside of work, Adam enjoys playing board games, baking and learning how to ride his unicycle.