Software Engineering Daily

Technical interviews about software topics.


Great Expectations: Data Pipeline Testing with Abe Gong

A data pipeline is a series of steps that takes large data sets and creates usable results from them. At the beginning of a data pipeline, a data set might be pulled from a database, a distributed file system, or a Kafka topic. Throughout a data pipeline, different data sets are joined, filtered, and statistically analyzed.

At the end of a data pipeline, data might be put into a data warehouse or Apache Spark for ad-hoc analysis and data science. At this point, the end-user of the data set expects that data to be clean and accurate. But how do we have any guarantees about the correctness?

Abe Gong is the creator of Great Expectations, a system for data pipeline testing. In Great Expectations, the developer creates tests called “expectations”, which verify certain characteristics of the data set at different phases in a data pipeline. This helps ensure that the end result of a multi-stage data pipeline is correct.

Abe joins the show to discuss the architecture of a data pipeline and the use cases of Great Expectations.

Sponsorship inquiries:

The post Great Expectations: Data Pipeline Testing with Abe Gong appeared first on Software Engineering Daily.

fyyd: Podcast Search Engine

 February 17, 2020  1h8m