Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

episode 111: Solving Data Lineage Tracking And Data Discovery At WeWork [transcript]


Summary

Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Marquez is?
    • What was missing in existing metadata management platforms that necessitated the creation of Marquez?
  • How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
    • How does it compare to the Amundsen platform that Lyft recently released?
  • What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
  • What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
  • What are the primary resource types that you support in Marquez?
    • What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
  • Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
    • Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
      • What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
  • How is the metadata itself stored and managed in Marquez?
    • How much up-front data modeling is necessary and what types of schema representations are supported?
  • Can you talk through the overall workflow of someone using Marquez in their environment?
    • What is involved in registering and updating datasets?
    • How do you define and track the health of a given dataset?
    • What are some of the interesting questions that can be answered from the information stored in Marquez?
  • What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
  • For someone who is interested in using Marquez what is involved in deploying and maintaining an installation of it?
  • What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
  • When is Marquez the wrong choice for a metadata repository?
  • What do you have planned for the future of Marquez?
Contact Info
  • Julien Le Dem
    • @J_ on Twitter
    • Email
    • julienledem on GitHub
  • Willy Lulciuc
    • LinkedIn
    • @wslulciuc on Twitter
    • wslulciuc on GitHub
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
  • Marquez
    • DataEngConf Presentation
  • WeWork
  • Canary
  • Yahoo
  • Dremio
  • Hadoop
  • Pig
  • Parquet
    • Podcast Episode
  • Airflow
  • Apache Atlas
  • Amundsen
    • Podcast Episode
  • Uber DataBook
  • LinkedIn DataHub
  • Iceberg Table Format
    • Podcast Episode
  • Delta Lake
    • Podcast Episode
  • Great Expectations data pipeline unit testing framework
    • Podcast.__init__ Episode
  • Redshift
  • SnowflakeDB
    • Podcast Episode
  • Apache Kafka Schema Registry
    • Podcast Episode
  • Open Tracing
  • Jaeger
  • Zipkin
  • DropWizard Java framework
  • Marquez UI
  • Cayley Graph Database
  • Kubernetes
  • Marquez Helm Chart
  • Marquez Docker Container
  • Dagster
    • Podcast Episode
  • Luigi
  • DBT
    • Podcast Episode
  • Thrift
  • Protocol Buffers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-12-16  1h1m
 
 
00:10  Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. And you work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web based transformation tool with built in collaboration features lets your analysts own the full lifecycle of data in your warehouse, featuring built in version control integration, real time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities. It's everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands on demo from one of their data experts. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem's metadata. So Willy, can you start by introducing yourself?
02:37  Willy Lulciuc
Yes, sure. So I'm Willy, I'm a software engineer at WeWork, and I've been with the company for just over a year now. Since joining WeWork I've been working on the Marquez team in San Francisco. Previously, I worked on a real time streaming data platform that was powering behavioral marketing software, and before that I designed and scaled sensor data streams at Canary, which is an IoT company based in New York City.
03:01  Tobias Macey
And Julien, how about yourself?
03:03  Julien Le Dem
Hi, I'm Julien. I've been at WeWork for about two years. I'm the principal engineer for the data platform, which means that I focus more on the architecture side of the data platform. Before that, I was at Yahoo and Twitter, and then Dremio.
03:20  Tobias Macey
And going back to you, Willy, do you remember how you first got involved in the area of data management?
03:24  Willy Lulciuc
Yeah, so I feel my involvement has been a bit unconventional. What I mean by that is, I owe a lot of my understanding of data management to Julien; I draw a lot of my inspiration on the topic from earlier conversations that we had. So before Marquez was really a thing, Marquez was this really thin data abstraction layer on a diagram that Julien and I discussed, one that really cut across multiple concerns. So you think about ingest, you think about storage and compute, and how these components interact. Back then we called it the metadata layer. I know the name wasn't as cool, but this abstraction layer would eventually come to be called Marquez and become a critical core component of WeWork's data platform. So now, over a year later since we had that discussion, we have the opportunity to tell others about our journey, why organizations invest in tooling around data management, and what we've learned building Marquez at WeWork.
04:20  Tobias Macey
And Julien, do you remember how you first got involved in the area of data management?
04:23  Julien Le Dem
Yes. So 12 years ago, I was working at Yahoo and building a platform on top of Hadoop. That was the very beginning of the Hadoop ecosystem, and we were building batch processing on top of it. That was very interesting; we built schedulers and some new things. After that, I started contributing to open source projects like Pig, and I joined Twitter. At Twitter I worked on the data platform, and also got involved with building a metadata system to improve how we share data, and also Parquet when I was over there. That was the beginning of having to deal with how we scale the organization, how we manage data at that scale, and how we build platforms on top of it. And so that's how I came to join WeWork, to work on the architecture for the data platform and to think about those data management problems and get them right from the beginning.
05:24  Tobias Macey
And for anybody interested, you were actually on a previous episode with Doug Cutting to talk about your work on Parquet and his work on Avro, so I'll link to that in the show notes as well. And so, as we mentioned, we're talking about the Marquez project that you've both been working on now. So I'm wondering if you can just start by describing a bit about what Marquez is and some of the problems you were trying to solve by creating it.
05:45  Julien Le Dem
So Marquez is a metadata management, metadata storage layer, and it's really about capturing all the jobs, all the datasets, and, for each job, which datasets it reads from and writes to. This is really about understanding operations: which version of my job consumed which version of a dataset and produced which version of a dataset, and helping with things like a job taking longer and longer over time, who do I depend on, who is depending on me, and these problems of data freshness and data quality, having better visibility and capabilities to ensure you have good quality. Around that, it also enables a bunch of use cases around data governance and data discovery, like a data catalog. So it's really about capturing the state of your data environment. That's kind of the basics of what Marquez is: it's really about the data lineage, from this bigraph perspective of jobs and datasets.
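To make that bigraph model concrete, here is a minimal sketch in Python of the entities Julien describes: jobs, datasets, and runs that tie a job version to the dataset versions it read and wrote. The class and field names are illustrative assumptions, not Marquez's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetVersion:
    dataset: str   # e.g. "analytics.room_bookings" (hypothetical name)
    version: str   # identifier for this version of the data/schema

@dataclass
class JobVersion:
    job: str       # e.g. "etl.build_room_bookings" (hypothetical name)
    git_sha: str   # pointer back to the exact source code that ran

@dataclass
class Run:
    """One execution: ties a job version to the dataset versions it consumed and produced."""
    job_version: JobVersion
    inputs: List[DatasetVersion] = field(default_factory=list)
    outputs: List[DatasetVersion] = field(default_factory=list)
```

With runs recorded this way, questions like "who do I depend on?" and "who is depending on me?" become traversals over the job/dataset edges.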
06:56  Tobias Macey
And what was missing in the existing solutions for metadata management that were available at the time you first began working on this project, that you felt you could do a better job of addressing with Marquez, rather than maybe building some supplemental resources to tie into those existing engines?
07:14  Julien Le Dem
So, I think, if you look at the tools we use at WeWork, we use Airflow, for example, which is one of the main open source schedulers out there. And Airflow focuses a lot on the job lineage and doesn't know much about datasets. If you look at other things like Atlas, they know a lot about data lineage and focus more on governance, but they don't really have this precise model connecting jobs and datasets. So the operational side of things, really having a precise model of those dependencies, is missing, and that's kind of why we started Marquez. You also have things like the Hive metastore, which knows about all the datasets and their partitions; they focus a lot on the datasets, but not too much on how jobs depend on the datasets and how people depend on each other. So I think a lot of components exist that touch on the metadata, but they don't really connect all the dots together. That's what we were trying to achieve with Marquez.
08:17  Tobias Macey
And so, in terms of the capabilities that you have built into it, I'm wondering if you can give a bit of compare and contrast with some of the other tools and services that bill themselves as data catalogs or metadata layers, and maybe talk a bit about how it compares with projects such as Amundsen from Lyft that we had on the show previously.
08:39  Willy Lulciuc
Yeah, so before we can compare and contrast the differences and similarities between the features enabled by Marquez, we first have to ask ourselves: why do organizations take on this data engineering challenge to build their own in-house data catalog solution? For example, Uber has their own internal data catalog called Databook; Lyft, which was on a previous episode, has Amundsen; and LinkedIn recently open sourced DataHub. Many of these solutions focus on three core features. One is data lineage: how do you track the transformation of your dataset over time, what are those intermediate processes that touch that data, and also derived datasets. The other core component is data discovery: how do you democratize data, how do you get to a point where employees within your organization can trust your data, and if they want to access a dataset, how do they connect and pull that data? The other component is data governance, so really understanding who can access what data and whether they have the right privileges to interact with that data. So, in a Venn diagram, if we take a few steps back and look at the intersection of those features, Marquez is at the center. But the unique thing that we built out in Marquez is this versioning capability, both for datasets and also for jobs. When I talk about Marquez, that's the real differentiator: the versioning logic that we built in. For example, for datasets, versioning ensures a historical log of changes to a dataset. With Marquez, if the schema for a dataset changes, if a column is added to a table or a column is removed, that's important and we want to track it, so we tie it to the dataset version. Similarly for jobs, when the business logic changes, maybe you're adding a filter or applying additional join logic, we want to capture that and keep a unique reference, a link to the source code, that allows us to reproduce the actual artifact of the job from the source code itself.
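As a rough illustration of the versioning logic Willy describes, here is a hedged Python sketch: a dataset version derived from its schema, and a job version derived from a pointer to its source code. The hashing scheme and all names are assumptions for illustration, not Marquez's actual implementation.

```python
import hashlib
from typing import List, Tuple

def dataset_version(name: str, schema_fields: List[Tuple[str, str]]) -> str:
    """Derive a stable version id from the dataset name and its (column, type) pairs."""
    payload = name + "|" + ",".join(f"{col}:{typ}" for col, typ in sorted(schema_fields))
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

def job_version(name: str, source_url: str, git_sha: str) -> str:
    """Derive a job version from a unique reference to the code that produced it."""
    return hashlib.sha256(f"{name}|{source_url}|{git_sha}".encode()).hexdigest()[:12]

# Adding or removing a column yields a new dataset version:
v1 = dataset_version("room_bookings", [("id", "BIGINT"), ("booked_at", "TIMESTAMP")])
v2 = dataset_version("room_bookings", [("id", "BIGINT")])
assert v1 != v2
```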
11:15  Julien Le Dem
So Amundsen focuses more on the discovery and visualization part of metadata management, and Marquez is focusing more on the operational side and the lineage of the jobs and datasets. We actually had a hack week project where we connected the two as a proof of concept of fusing them together. So I think that's an interesting thing we could approach in the future, to see how those communities can collaborate and how we can build on top of each other.
11:47  Willy Lulciuc
Yeah, exactly. So before Amundsen was open sourced, we actually had an opportunity to speak with the Amundsen team at Lyft. It was this amazing in-person jam session where we talked about metadata, and it ended with a deep technical whiteboard discussion on how those efforts could be combined. If we scan the features of Amundsen, it supports associating owners with datasets, data lineage powered by Apache Atlas, and data discovery backed by Elasticsearch. For Marquez, we do have our own UI that we use to search for datasets and explore the metadata that has been collected through our APIs. But the cool thing with Amundsen, and something that Julien touched upon, is that they have an API contract, which makes pulling metadata from a backend metadata service into the UI very easy; it becomes a pluggable component in their architecture. And one of our goals is to provide Marquez as a pluggable backend for Amundsen.
12:56  Tobias Macey
And what are some of the other integrations that you're currently using on top of Marquez, some of the ways that you're consuming the metadata, and maybe some of the downstream effects of having this available that have simplified or improved your capabilities for being able to identify and utilize these datasets for your analytics?
13:17  Willy Lulciuc
Yeah, sure. So, as Julien mentioned, at WeWork Airflow has quickly become an important component of our data platform, powering billing as well as space inventory. So internally we've naturally prioritized adding Airflow support for Marquez. The integration allows us to capture metadata for workflows managed and scheduled by Airflow, enabling data scientists and data engineers to better debug problems as they come up. One question that a lot of our data scientists and analysts really care about, a common question but one that is really hard to answer, is: why was my workflow failing? One solution to this, and one key feature of Marquez, is the data lineage graph that is maintained on the backend. The integration allows us to checkpoint the run state of a workflow, understand the run arguments to the pipeline itself, and, conveniently, keep a pointer to the workflow definition in version control. One of the other integrations that we've been focusing on is with Iceberg, a really exciting project that was open sourced by Netflix and is now incubating as an Apache project. Iceberg is a table abstraction for datasets that are stored across multiple partitions in a file system. With that, Iceberg allows us to begin to version files in S3 and capture metadata around file systems.
14:53  Tobias Macey
And as far as the capabilities that are unique to Marquez, I know that you have mentioned this idea of linking the jobs that produce given datasets to the datasets themselves and being able to version them together. I'm wondering if you can talk through the overall benefits that has for being able to consume datasets, ensure the health of the data, and ensure that you have some visibility into when a schema mismatch occurs, or some of the other information that you're able to obtain by using Marquez as this unifying layer across all of your different jobs and datasets.
15:34  Julien Le Dem
Yes, there are a couple of use cases where that becomes very handy. One is, of course, when something goes wrong. When you see data processing in companies, a lot of those frameworks and environments are designed with the best case scenario in mind: people know what happens when the job is successful, you produce the data, and you trigger downstream processing. However, when something goes wrong, or if you need to reprocess something, it becomes hard to debug. So Marquez captures very precise metadata about when the job ran, what version of the code ran, and what version of the dataset was written, especially if you use a storage layer like Iceberg or Delta Lake, where you have a precise definition of each version of the dataset. And so when your job fails, or it's taking too long, or the job is successful but the data looks wrong, you can start looking at what changed. You can see, for your particular job: has the version of the code changed since the last time it ran? Or has the shape of the input dataset changed? You could use things like Great Expectations, which is an open source framework for defining declarative properties of your dataset, and verify that they're still valid or that they didn't change significantly. And you can look at that not only for your job, but for all the upstream jobs, because you understand the dependencies. Often you have a simple thing happening, like: why is my job not running? It's not running because your input is not showing up, and your input is not showing up because the job that's producing it is not running. So you can walk that graph upstream until you find the source of your problem. It may be that some input data is wrong, it may be that a bug got introduced, and you can figure out what's going on. So first, you have a lot of information for debugging what's happening. And second, since you have a precise model, and you know for each run what version of a dataset it ran on, if you need to restate a partition in a dataset, you can improve your triggering: you know exactly what jobs need to rerun. I think the state of the industry is often that people have to do a lot of manual work when they need to restate something and rerun all the downstream jobs, and the first capability that is required is having the visibility and understanding of all the dependencies, knowing what to rerun. In the future, you could even imagine using that very precise model to automatically trigger all the things that need to be rerun, or, if something is too expensive to rerun and it's not worth it, you could flag the data as dirty and indicate it should not be used, something like that. So there are a lot of aspects like this that are important. And in a world where you see a lot more machine learning jobs happening on data, this information, that a particular training job ran on this version of the training set, using those hyperparameters, and produced that version of the model that was then used in an experiment with an experiment ID, tying everything together, has a lot of usefulness, because people need to be able to reproduce the same model.
So capturing this information matters even when the model is drifting over time: having the proper metrics and being able to get back to that version of the training set, or to understand what has changed, whether in the data or in the parameters, is really important. So that's some of the specific things we have in mind when we look at this very precise model of jobs and datasets and what's running.
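The upstream walk Julien describes can be sketched in a few lines of Python. The lineage mapping and job names here are hypothetical; in practice this information would come from a metadata service like Marquez.

```python
from typing import Dict, List, Set

def find_root_causes(job: str, upstream: Dict[str, List[str]],
                     failed: Set[str]) -> List[str]:
    """Walk the lineage graph upstream from `job`, collecting failed ancestors."""
    culprits, stack, seen = [], [job], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for dep in upstream.get(current, []):
            if dep in failed:
                culprits.append(dep)
            stack.append(dep)
    return culprits

# "Why is my job not running?" -> its input isn't showing up because a
# producer two hops upstream failed:
upstream = {"daily_report": ["build_bookings"], "build_bookings": ["ingest_events"]}
print(find_root_causes("daily_report", upstream, failed={"ingest_events"}))
# ['ingest_events']
```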
19:28  Willy Lulciuc
Yeah, and if I could add to that: a lot of what happens as a data engineer is that you work on a pipeline and you deploy changes periodically. But if you update the logic of your pipeline, usually what happens, about a week or so later, is that you start seeing downstream issues with your dashboards: hey, is the data wrong? Why do I see a sudden drop in my graph or my dashboard? And that could be related to a number of things. With Marquez, you have this highly multi-dimensional model which allows you to say: okay, which job version, at what point, introduced this bug? And also, what were the downstream jobs that were affected by the output of this particular job version? That makes backfilling a lot more straightforward than what we see now. Really, I think a lot of data engineering teams tend to avoid that and say, oh yeah, let's just write it off as something we can address when the pipeline runs again.
20:31  Tobias Macey
Yeah, being able to identify some of the downstream consumers that are going to be impacted by a job change I can see as being very valuable, because it might inform whether or not you actually want to push that job to production now, or maybe wait until somebody else is done using a particular version of a dataset, or at least, as you said, have that visibility into what all the potential impacts are. Whereas if you're just focusing on the one job, it can be easy to ignore the fact that there are downstream consumers of the data that you're dealing with. And in terms of the inputs to Marquez, we've been talking a lot about discrete jobs and batch oriented workflows, but I'm curious if there is any capability for recording metadata for things like streaming event pipelines, where you have a continuous flow of data into a data lake or a given table, or that might be fed into a batch job that's maybe doing some sort of windowing functions, and how the breakdown falls as far as batch versus streaming workloads.
21:28  Julien Le Dem
So we do have that in the model. The core entities are this notion of jobs and datasets, and they're attached to a namespace; that's our modeling for ownership and multi-tenancy. Jobs and datasets fall in the namespace of whoever is producing them. Then for each job and dataset we have types attached to them, and depending on the type we capture slightly different metadata. On the dataset side, we have the batch datasets, which could be Iceberg or Delta Lake, usually stored in a distributed file system like S3 or something similar. We have the more table-like datasets, if you use a warehouse like Redshift, or Snowflake, or Vertica. In that case we have a less precise model, because we can't really pinpoint a particular version of a dataset, we can't go back to a specific version of the table, but we can version the changes in the schema, so we do capture that. And the third type is a streaming dataset, so typically something like a Kafka topic, which has a schema as well if you're using the schema registry with Avro like we do, and we can version that. Similarly, we don't have that precise pinpointing of a version, because the job is continuously running instead of having the discrete runs that a batch dataset has. So we have those three types of datasets at the moment: SQL tables in a warehouse, streaming datasets in Kafka, or batch datasets in S3. Then on the job side, similarly, you have batch jobs and streaming jobs, and a batch job has discrete runs. For both types we capture the version of the code, and when the job started and when the job stopped. For batch jobs, you have discrete runs that are tied to a version of a dataset. For streaming jobs, you still have runs, because a streaming job starts and ends, but you have fewer of them; they're more continuous, so you don't have this tracking of dataset versions, but we do track when the schema evolved, if you update your streaming job, for example, and add a new field to the output. So we do capture those different types of information. There's a higher level model, and then depending on the type of dataset or the type of job, we try to be more precise in what we capture.
23:57  Tobias Macey
And I'm wondering if you can dig a bit more into the specifics of the data model for Marquez. I know you mentioned the different entities as far as datasets and jobs, and I'm wondering both what the lowest common denominator is as far as the attributes that are necessary for it to be useful within the metadata repository, and if there's any option for extending the data models for use cases outside of what you're particularly concerned with at WeWork.
24:26  Julien Le Dem
So we have this notion of a job and a dataset, and I think maybe job is a little bit of an overloaded term, but when you define a system like this, you always have some terms that are used with a specific meaning in one area and a different meaning in another area. By job we really mean something that consumes and produces data. So the common denominator is really this notion of inputs and outputs: having jobs that consume and produce data. That always comes with inputs and outputs, a version of the code that was deployed, and parameters. And for a dataset, there's a physical location and an owner, same as for the job. So this notion of ownership and dependencies is common to everything. Then we specialize the model: we have specialized tables for each type of dataset and job, to capture things more precisely where we can, because what we capture in a streaming environment versus a batch environment is not the same. So there's a higher level model that's similar, with the inputs and outputs. Some of the other things we've been thinking about: of course, upstream from your data processing there are services that depend on each other as well, but the model is slightly different. In our model, you always have this notion of something consuming datasets and producing datasets, so you always have the dataset in between the dependencies between components and artifacts that people build. In the service world, usually it's direct service-to-service dependencies. It's something we haven't really spent a lot of time on, but people sometimes start asking how you connect both worlds and get the dependency tracking, which people often do with OpenTracing, things like Jaeger or Zipkin, in the service world. How do we connect the dots? Because there's like a duality between the data processing world and the service world, and a lot of those concepts align.
26:49  Tobias Macey
And can you talk a bit about how Marquez itself was actually implemented, some of the overall system architecture, and maybe some of how that's evolved since you first began working on it?
26:59  Willy Lulciuc
Yeah, sure. Marquez itself is a modular system. When we first designed the original source code and also the backend data store, we wanted to make sure that, first of all, the API and the backend data model were platform agnostic. When I think of Marquez, I always talk about three system components. First, we have our metadata repository, and the repository itself stores all dataset and job metadata, but it also tracks the complete history of dataset changes. So when a system or a team updates their schema, we want to track that, so we keep a complete history of it; and when a job runs, it also updates the dataset itself, so Marquez, on the backend, stores those relationships. The other component is the REST API itself. If I can talk a little bit about the stack: it's written in Java, and we use Dropwizard pretty extensively on the project to expose the REST API, but also to interact with the backend database itself. And really, the API drives the integrations; one example that we talked about is the Airflow integration that we've done. Then finally we have the UI itself, which is used to explore and discover datasets, as well as explore the dependencies between jobs, and it allows our end users at WeWork to navigate the different sources that we've collected, as well as the datasets and jobs that Marquez has cataloged.
28:39  Tobias Macey
And when I was going through the documentation, it looks like the actual underlying storage engine, at least for your implementation, is Postgres. I'm wondering what the motivation was for relying on a relational database for this, any other supported backends that you have, and what the benefits are of using a relational engine versus a document store or graph store for this type of data.
29:04  Willy Lulciuc
Sure. For us, Postgres gets us pretty far. When we whiteboarded the data model for Marquez, it was a relational model, so we went with that. There is going to be a point where a relational database cannot get us to the scale that we need, but when we designed the system, we wanted to make sure that it was simple to operate and that there weren't too many dependencies that you had to pull in to get up and running. As we see more and more usage of Marquez internally, we will naturally transition to a graph database, because that gives us richer relationships and allows us to pinpoint a node in a graph: what are the relationships between a job and a dataset? But that doesn't mean Marquez doesn't have a graph database; we actually do. It's called Cayley, which was open sourced by Google. That's what we use to drive the data lineage graph, which is a key component and really a huge feature of the API itself. A document store, I think, would be a little hard. If you look at what we're trying to model, a document store, if you think of DynamoDB, would require you to do a lot of prefetching and filtering yourself within the application, or you push that down to the actual NoSQL database itself. So for us, it just made sense to use Postgres and then transition over to a graph database as we scale out.
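One reason a relational engine goes a long way here is that lineage walks map onto recursive SQL. Below is an illustrative sketch, assuming hypothetical reads(job, dataset) and writes(job, dataset) tables rather than Marquez's real schema, of fetching a job's entire upstream lineage with one recursive CTE in Postgres.

```python
import psycopg2

UPSTREAM_SQL = """
WITH RECURSIVE upstream(job) AS (
    SELECT %(job)s::text
  UNION
    SELECT w.job
    FROM upstream u
    JOIN reads  r ON r.job = u.job          -- datasets each job consumes
    JOIN writes w ON w.dataset = r.dataset  -- jobs producing those datasets
)
SELECT job FROM upstream WHERE job <> %(job)s;
"""

def upstream_jobs(conn, job: str) -> list:
    """All transitive upstream producers of `job` (illustrative schema)."""
    with conn.cursor() as cur:
        cur.execute(UPSTREAM_SQL, {"job": job})
        return [row[0] for row in cur.fetchall()]

# Assumes a local database containing the hypothetical tables above.
conn = psycopg2.connect("dbname=marquez_example")
print(upstream_jobs(conn, "daily_report"))
```

The UNION (rather than UNION ALL) deduplicates rows, which also keeps the recursion from looping on cyclic metadata.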
30:35  Julien Le Dem
And I think one of the obvious pieces where you can help scale that model is that, since we capture all the runs of a job, and people looking at what's happening are mostly interested in what has been happening recently, you can archive all the old runs to a more key-value-store-type model that would scale easily for storing the historical runs of all the jobs and all the old versions of datasets. We're still talking about metadata here, so it's not that much data, but it does accumulate over time. From that perspective, I think the relational database gets you pretty far for the number of jobs and datasets you have and the metadata you capture for them. And as we see people using it in larger and larger environments and data ecosystems, you can start archiving the historical runs of the jobs to a secondary storage that scales better in volume, for things that you may want to look at more in aggregate.
31:43  Tobias Macey
And for somebody who's interested in using Marquez, can you talk through some of the overall workflow of getting it set up and integrated into a data platform, and maybe some of the work involved in actually populating it with the different metadata objects and records?
32:00  Willy Lulciuc
So Marquez is open source, so you do have the option of just building the jar itself. If you have a running Postgres instance and you want to apply the Marquez data model, you just point it at that database, and Marquez will run the migration scripts that we have, which apply the schema to that database. That's one option. Another is that at WeWork we are heavily invested in Kubernetes, so we do use a Helm chart to deploy the UI as well as the backend API itself. So those are two options for someone who wants to get up and running with Marquez. We also publish a Docker image, so if your organization is in an environment that runs containers and manages them through Kubernetes or some other container management system, you can get up and running that way.
32:55  Tobias Macey
And then as far as getting the job information and everything in, I know that there's an Airflow integration, and you have native clients for Python, as well as a Dagster integration that I noticed is a fairly recent addition. So I'm wondering if you can just talk through, once you've got it up and running, the overall work of actually integrating it into the rest of the data platform to record metadata and job and dataset information, and then also, on the downstream side, setting up consumers to be able to take advantage of that information.
33:25  Willy Lulciuc
Right. So as you mentioned, we do have a Python Client on we do also have a Java client, and we're working on a go client as well. Because there's a lot of applications that are going at we work. So really the integration itself requires this Java client, these clients that really implement the REST API. So a lot of when when we do integrations with our internal platform components or integration with open source project like airflo, what we end up doing is using the REST API, so we have an API for registering source metadata days. metadata around data sets, but also an API around around jobs. So really, it comes down to just understanding when your pipeline is running or when your your application is running. What are the friction points? So really what we care about is, when does someone when does your application access data? And also when does it right? right at it so so there's a two key integration points that we care about.
34:24  Julien Le Dem
Yeah, and as those integrations are contributed to the project, there's less and less work for people to do to integrate. So today, if you use Airflow, you'll have the Airflow support available right away. But some other companies use a scheduler called Luigi, and currently we don't have Luigi support, so if someone wants to use Luigi with Marquez, they would have to write the Luigi integration to send the same information. But once that is done, everybody using the Luigi scheduler will benefit from it, and the same applies to Spark. We have an integration for Snowflake SQL, and that's something that everybody can leverage. That's one of the reasons for open sourcing Marquez: it becomes more valuable the more it's used in the open source world, because people contribute those integrations, and the more we have, the easier it is for anyone to use it right away without much work. So that's the advantage of open source in these kinds of projects.
35:36  Willy Lulciuc
Yeah, and continuing on that: one exciting integration that we've done with Airflow is that we provide a SQL parser. A lot of the time, what we see is that Airflow is used for ETL workloads, mainly reading from S3 and then writing to your warehouse. So what we ended up doing was building in a SQL parser that really understands: what are the tables that are part of your SQL statement, what are the tables that are part of your join, and what tables are you writing to. And the key thing, when we were looking at integrating with Airflow, is that we wanted it to be really easy, just drop-in: you only have to make a one-line change to modify which library you're importing, and we wanted to make that really simple, as shown in the sketch below. So it's just a one-line change, and by default you get all of this rich metadata sent to Marquez, and, by default, you get a lineage graph that cuts across multiple Airflow instances. When you plan your deployment, you could do a multi-tenancy deployment in Airflow, or you could have single instances, so there is an opportunity to stitch together the inter-dependencies between workflows.
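Here is a hedged sketch of that drop-in integration, based on how the marquez-airflow library documented it around this time: swap the DAG import and keep the rest of the pipeline unchanged. The operator, connection id, and SQL are examples.

```python
# from airflow import DAG            # before
from marquez_airflow import DAG      # after: the one-line change

from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.dates import days_ago

dag = DAG(
    dag_id="etl_room_bookings",
    schedule_interval="@daily",
    start_date=days_ago(1),
)

# The built-in SQL parser can extract the tables read (FROM/JOIN) and
# written (INSERT INTO) from a task like this, reporting them as lineage.
load = PostgresOperator(
    task_id="load_bookings",
    postgres_conn_id="analytics_db",
    sql="""
        INSERT INTO room_bookings
        SELECT b.id, b.booked_at, r.room_name
        FROM raw_bookings b JOIN rooms r ON r.id = b.room_id;
    """,
    dag=dag,
)
```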
36:56  Tobias Macey
And in terms of the actual separation there, do you have a different deployment of Marquez for production versus pre-production workflows, or do you have it all in one UI so you can view the entirety of your datasets across all of your environments?
37:12  Willy Lulciuc
Yeah, so we follow a fairly standard deployment process. We do have a staging environment for Marquez, and most of that really is sort of dummy data, but also, if someone's testing out a new pipeline, we do have that reported to the Marquez backend. And then we also have a deployment process for production. We sometimes sync metadata from production, just to provide more populated metadata in staging, so that we can start querying: okay, we added this new field, does it really make sense? Should we drop it? Does it really answer the question that we've been trying to ask? But yeah, we hook into CI and we have continuous deployment to both staging and also production.
38:00  Tobias Macey
As far as the assumptions that you made and the ideas that you had going into this project, what are some of the ways that those have been challenged or updated as you've actually started using it in production and exposed it to other organizations that have started employing it in their environments?
38:15  Julien Le Dem
One of the metrics for the success of Marquez is looking at coverage of lineage. When we look at that, sometimes it's a little bit of a moving target. In the Airflow integration, we have multiple instances of Airflow for multiple teams, so right away, as you deploy the Airflow integration, you see all the jobs, but you may not see all the lineage right away, because to capture the lineage we have extractors that figure out the lineage for each type of operator people are using inside of Airflow. So we defined targets in terms of needing to cover all the operators that people are using, and we started working on that; meanwhile, of course, people keep innovating and using more types of operators. So defining a more standardized way of working together, and making sure that as we include more operators we don't have more and more that needs to be integrated, is a challenge that we've seen. It's important to work with your users: how do you make sure that your lineage coverage target doesn't become a moving target, where the more coverage you add, the more coverage you need to have? At the beginning it was a bit challenging, but as soon as you start paying attention to it, it actually works pretty well. We've also seen some efforts like people starting to use DBT to have lineage information in their jobs, but then they have the lineage information just inside the team, while Marquez gives you the lineage across the entire organization. So working together has been important, and making sure we have aligned goals on how we build that. That's been a little bit challenging from that aspect.
40:14  Willy Lulciuc
Yeah. And you know, it's funny: we do version the DB schema that we have for Marquez, and I think we're on version maybe 21. If you look back at what we initially had, it was just, I think, three entities, where you had jobs, datasets, and runs. If you fast forward to where we are now, we have a far richer data model, where we capture not only the run args but also the context around the job itself. Recently, with our Airflow integration, we wanted to capture the SQL so that we can display it on the Marquez frontend, so we added this job context field, which is just key-value pairs that allow you to store additional information about the job itself. When we first started, I think the trickiest part for me was to really understand how we were going to provide this extensive metadata model that allows us to version datasets. It was always theoretical, but once we got it running in production, our first integration with Airflow allowed us to really expand and implement that versioning logic, which, looking back now, was a far bigger task than I thought it would be. Now it's just fairly simple versioning functions, depending on the dataset itself. Also, we did expand on ownership of metadata with namespaces: a namespace allows you to group metadata by context. Initially we tracked it at the job level, but then we moved that up one level, where we now tie ownership to datasets and jobs. So really, there were just so many additions and modifications that we've made in the past year since our first whiteboard session and the first data model that we had for Marquez.
41:59  Julien Le Dem
Yeah, I think it's really important to get those entities and their relationships right, because from that it's really easy to add more metadata around each entity; but evolving the entities themselves and the relations between them is a bit harder, especially once you're in production. So having this notion of jobs, job versions, runs, datasets, dataset versions, and inputs and outputs, and really having the right modeling of what the world looks like, enables a lot of this.
42:28  Willy Lulciuc
Yeah, and one last thing: when we thought about the metadata repository, we didn't really want to store schemas, we didn't want to become a schema registry that stored all the dataset fields. But what we ended up seeing was the need for that. So Marquez now is able to version the fields of a dataset and tie those to a dataset version. When we capture metadata for a dataset, we also capture its fields, so we have the name, the type, and also the description itself. Which is, I think, a direction I didn't think we would take, but it's really paying off, and we're seeing some really cool usage based off of that.
Tobias Macey
And in terms of the description, I know that one of the most valuable aspects of having a metadata repository and a data catalog is being able to capture the context of the datasets, so that you can understand their intended purpose, some of the information that went into the decisions about how a dataset was produced, and how its schema was formed. I'm curious what level of additional annotation is possible beyond just a free-form description field, and some of the interesting ways that you've seen that leveraged.
43:39  Julien Le Dem
So we have some tagging features, and they can be leveraged to implement privacy or security aspects, or to encode SLAs: is my data experimental, is my data production? Those are the kinds of aspects people can use it for. Another aspect is adding data quality metrics to the dataset, and we've been experimenting with Great Expectations to do this. Usually it's used in two ways. One is when you're producing the data: you have some declarative properties enforced on your dataset, and you fail the run if they don't hold, because the code may run and not report any errors, but the result is still not correct, and you don't want to let anybody see that dataset. So that can be used as a circuit breaker to not start the downstream jobs and not publish the dataset. The other way people use it is that the consumers may have different opinions of what the data quality should be for them to run their job, so they can also use it as a pre-validation check, enforcing certain data quality metrics before consuming data in a job, preventing bad data from percolating through the system, because that can be expensive or can impact production. Especially if you're doing machine learning or a recommendation engine or things like that: if you have bad data going in, then you have bad recommendations coming out, and that has a real impact on the production systems. So those are some of the ways people are using it. There are always two aspects: either you have more general tagging, a flexible type of metadata added to an existing entity, or, if it's something that benefits from being included in the core model, then it can become an actual attribute or an entity in the model.
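As a sketch of that producer-side circuit breaker, here is what the check might look like with the pandas-based Great Expectations API of this era; the file path, column names, and thresholds are examples.

```python
import great_expectations as ge

# Declarative properties enforced on the dataset before publishing it.
df = ge.read_csv("/tmp/room_bookings.csv")
df.expect_column_values_to_not_be_null("booking_id")
df.expect_column_values_to_be_between("duration_minutes", 1, 24 * 60)

results = df.validate()
if not results["success"]:
    # Circuit breaker: don't publish the dataset or trigger downstream jobs.
    raise RuntimeError(f"Data quality checks failed: {results}")
```

A consumer can run the same kind of check as a pre-validation step, with its own, possibly stricter, expectations.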
45:48  Willy Lulciuc
Yeah, and one way we plan on using descriptions is for our search results. If someone's searching for a dataset, and the owners happen to have provided a description for the dataset, we want to reward the owners of those datasets by moving those datasets up in the search results, because we do surface the descriptions. Like I said, we want to reward our end users for putting in the extra effort to annotate their datasets.
46:12  Tobias Macey
And we've talked a couple of times about the health of a dataset, and you mentioned, Julien, the idea of using something like Great Expectations for being able to populate some of these data quality metrics. I'm wondering what some of the other useful signals are as to the overall health of a dataset, and then also things like the last updated field for indicating when something might be stale, or when you might want some additional information about why it's not up to date, or why it's in a particular state as far as its health or quality.
46:46  Julien Le Dem
So data freshness is often a property of the data that you see, but things like data freshness are really more an attribute of the pipeline producing the data. People look at data freshness when they see their dataset: when was the last time this dataset was updated? Another thing you can look into is: is it taking longer and longer to produce this dataset? Does it retry, does the system fail and retry a couple of times before working? Those are all attributes of the jobs upstream of the data, and that's part of the importance of understanding that graph. A lot of those data transformations are not linear. Most people start with a small dataset size, and as they're being successful, their input size grows and grows, and the job that consumes that data and does something with it may take longer and longer: a join is not a linear-time operation, so the bigger your dataset, the time it takes is not proportional to the input. Those are the kinds of things where you will have to maintain your pipeline as you go: something that was working early on in the life of your product may not work later, just because the processing time doesn't scale linearly with the size of your input. So that's one basic one: data freshness, and understanding why it takes time to do something. Also, as you get more users or more data sources, the shape of the data may change, the distribution of values, and that can also impact processing or data quality. Great Expectations is one way to get more information about the shape of your input. Another one is looking at how long it takes to process the data. If you have failures, it's important to correlate them with how the code is changing, because you may have changed an algorithm or added some function that breaks something else. As your organization grows and more and more people are involved in modifying the pipelines, the more you have different conflicting changes that may have an impact on the overall system. So several of those are interesting attributes of the data: data freshness, data quality. And sometimes it's important to also look at the business metrics that derive from it, not just the data properties themselves. If you run a recommendation engine based on that data, just having Great Expectations metrics on how the distribution of a column is evolving may not be sufficient; you may want to track metrics downstream from that, like how it affects user engagement in some way, and connect that all the way back to how the input data is changing.
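Two of the signals Julien lists, staleness and a job that keeps taking longer, are easy to compute from run-level metadata of the kind Marquez collects. A small illustrative sketch, with made-up run durations:

```python
from datetime import datetime, timedelta
from typing import List

def freshness(last_updated: datetime) -> timedelta:
    """How stale is the dataset right now?"""
    return datetime.utcnow() - last_updated

def duration_trend(durations_secs: List[float], window: int = 7) -> float:
    """Recent average runtime over older average: values creeping above 1.0
    mean the job is taking longer and longer."""
    recent = durations_secs[-window:]
    older = durations_secs[:-window] or recent
    return (sum(recent) / len(recent)) / (sum(older) / len(older))

durations = [620, 640, 655, 700, 760, 830, 910, 1010, 1130, 1270, 1430, 1610]
if duration_trend(durations) > 1.25:
    print("warn: runtime growing; joins may not scale linearly with input size")
```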
49:53  Tobias Macey
And what are some of the interesting or unexpected or challenging aspects of building and maintaining the Marquez project that you have learned in the process of working on it?
50:05  Julien Le Dem
Yeah, there's been some growth, you know. We mentioned before how we evolved the model, how we got to this precise and good model of those entities, and started the integrations. Once you have this good model, then you can start having more integrations on top of it, right, because once the model is more stable, it's easier to build more integrations, whether it's schedulers or processing frameworks like Spark and Flink and Kafka and all those things. So that's one challenge. The other challenge: one thing we did early on is make sure we talked to other companies to validate the use cases and validate the model, and so start building that community. The second aspect of talking to other companies is whether they want to use the open source project, and then the next level is, do they want to contribute to the project? And so making sure that we are all on an equal footing, building that community, right? So we started with having this design doc in the open, validating the use cases, validating the model, working with people at other companies, having them try it out, working together, and making sure we do all the development in the open, so that everyone feels we are all on an equal footing building that project. So I think that's part of the challenge, right? How do we make sure that with this project, which is going to become more valuable the more people use it, we all have a feeling of ownership of it, and it's really a community-driven project?
51:54  Tobias Macey
And so Marquez definitely looks like it provides a lot of value and utility for being able to manage the health and visibility of different datasets across an organization, but what are the cases where it's the wrong choice, and you'd be better served with a different solution?
52:09  Julien Le Dem
So one thing we keep mentioning in this model, right, is that there's this strong notion of jobs and datasets. Marquez relies on this notion that you have things that depend on each other through datasets. It's this asynchronous type of communication, where you produce a dataset, whether it's a streaming or batch dataset, and someone else consumes that dataset; that's how we model dependencies. That works well for any kind of batch or stream processing type of job; this whole data ecosystem kind of works like that, and that's the model, and so we capture this information. If you're in an environment where every request looks different, and depending on the request you may be sending an event to a lot of different things or talking to different types of services, then that's not necessarily the best model for it. If you look at things like OpenTracing, or projects like Jaeger and Zipkin, and other similar projects that look at how a request flows through a system, requests may not look the same, and you may have a lot of dependencies between the microservices; then Marquez is not necessarily the best model. We'll definitely look in the future at how we connect those two worlds, because there's a lot of interest in understanding the lineage of the data, not just once it enters Kafka or whatever data collection system you have, but also understanding upstream where the data is coming from. But it's still a different model. So I think in that case, you know, Marquez is not necessarily the best system to understand how your microservices depend on each other. It is a kind of related world, but our model is really about this more asynchronous communication between systems through datasets.
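A tiny sketch of that job/dataset model, with hypothetical job and dataset names: lineage is a bipartite graph where jobs read and write datasets, and upstream dependencies fall out of a simple graph walk.

```python
# Sketch of the job/dataset lineage model Julien describes: a bipartite
# graph where jobs consume and produce datasets. Names are hypothetical.

jobs = {
    "ingest_bookings": {"reads": [], "writes": ["raw.bookings"]},
    "clean_bookings": {"reads": ["raw.bookings"], "writes": ["clean.bookings"]},
    "occupancy_report": {"reads": ["clean.bookings"], "writes": ["reports.occupancy"]},
}

def upstream(dataset):
    """All datasets and jobs the given dataset transitively depends on."""
    deps = set()
    for job, io in jobs.items():
        if dataset in io["writes"]:
            deps.add(job)
            for upstream_ds in io["reads"]:
                deps.add(upstream_ds)
                deps |= upstream(upstream_ds)
    return deps

print(upstream("reports.occupancy"))
# {'occupancy_report', 'clean.bookings', 'clean_bookings',
#  'raw.bookings', 'ingest_bookings'}
```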
54:09  Willy Lulciuc
Yeah. So what I found most challenging is, I think, controlling the story around Marquez, because internally, when we went to different teams, they had different assumptions about what Marquez was, and also about the type of metadata Marquez was storing. Depending on who you talked to, it would be metadata around services, or it was metadata that was very general, where you could store whatever you wanted in the repository. But the key thing that I always had to drive is that Marquez is relevant and also most useful within the context of data processing. So that was probably the most difficult part: educating our end users on why this is important, what it unlocks, and what they can actually do with the metadata that's stored in Marquez.
54:56  Tobias Macey
And looking to the future of the project, what are some of the plans you have, both from a technical and an organizational and community aspect, as you continue to evolve and grow it?
55:07  Julien Le Dem
So from a technical standpoint, you know, now that the internal model is stable, it's having more integrations, like I mentioned: Luigi and other schedulers, all the things people are using for processing data and understanding the ETL. That's the part of the project that can really scale in parallel, right: different users can contribute different integrations in parallel, and that scales very well in an open source project. For example, when doing Parquet, once a core model and format representation existed, having integrations with a lot of different things, whether it's Avro, Thrift, Protobuf, Spark, Hive, all of those things, it was really easy to work on in parallel. I think we're at that step with Marquez, and that's really the next step: how we enable all those integrations, so that it becomes more valuable. And the other next step, which to me is a natural next step for a project, is to possibly move to a foundation, right? If you want to really show that this project is community driven, not owned by any particular entity and not controlled by any particular entity, and that everybody is on an equal footing to help evolve the mission of the project and make it successful, that's really a good testament to showing that: look, it's owned by an open source foundation. That's how you can help drive community involvement and more contributors, because they know that they're going to be on an equal footing with everybody else in the community. So that's also, to me, a next step we're thinking about.
56:49  Willy Lulciuc
Yeah, and for me, I think the next step is building on top of the metadata that we've collected so far, because that unlocks a really cool feature that we've been discussing: data triggers. Since Marquez is aware of when a job modifies a dataset, imagine if Marquez also wrote that change log to a queue somewhere, which a backend system would then listen on, and trigger a job based off the dataset being modified. The other thing you can think about is having some sort of health or quality check before the job is triggered, enabling you to say: before I actually kick off this job, are all of the partitions that are required for this job to run actually present? We could do those types of health checks at that point. So for me, there are so many more things that we can do with just the metadata that we've collected so far. And yeah, I'm very excited about the future of the project.

Tobias Macey
Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. And I'll start with you, Willy.

Willy Lulciuc
For me, it would have to be the tooling around ensuring the quality of the dataset that's part of the input of your job, and also the output of your job. I think we've seen over the years amazing tooling around code, and visibility into code: you have logging for your application to understand the runtime, and you have metrics for your system to understand its performance and also the load on your system. But there's very little that we see in the open source around datasets themselves, and I think that's where Marquez really fits in, and the problem it's trying to solve. And as Julien mentioned, Great Expectations is one of those really exciting open source projects that allows you to define the shape of your data, as well as the expectations that you'd like to see met before you actually process that dataset.
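As an illustration of those two ideas together, here is a rough sketch of a pre-run gate: a data trigger fires only if simple expectations on the input, such as partition presence and row count, hold. This shows the general idea only; it is not the Great Expectations API or any actual Marquez feature, and all names are hypothetical.

```python
# Hypothetical sketch of a data trigger with a pre-run quality gate.
# None of these names are real Marquez or Great Expectations APIs.

def expect_partitions_present(available, required):
    missing = sorted(set(required) - set(available))
    return {"ok": not missing, "detail": f"missing partitions: {missing}"}

def expect_row_count_between(observed, low, high):
    return {"ok": low <= observed <= high,
            "detail": f"row count {observed} not in [{low}, {high}]"}

def on_dataset_modified(event):
    """Imagined handler for a 'dataset modified' change-log event."""
    checks = [
        expect_partitions_present(
            available=event["partitions"],
            required=["2020-01-01", "2020-01-02", "2020-01-03"],
        ),
        expect_row_count_between(event["row_count"], 900_000, 1_100_000),
    ]
    failures = [c["detail"] for c in checks if not c["ok"]]
    if failures:
        print("Blocking downstream job:", failures)
    else:
        print("All expectations met; triggering downstream job.")

# A change-log event, as it might arrive from a queue Marquez writes to.
on_dataset_modified({
    "dataset": "clean.bookings",
    "partitions": ["2020-01-01", "2020-01-02"],
    "row_count": 950_000,
})
# Blocking downstream job: ["missing partitions: ['2020-01-03']"]
```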
58:49  Tobias Macey
And Julien, how about yourself?
58:50  Julien Le Dem
So, related to what was just said, I think data operations in general is kind of the missing piece, because coming from the services world, there's a very mature practice for how you test, how you deploy, how you monitor your application, and how your on-call rotation works. In the data world, I think there's not that much, either in tooling or even in defined best practices. So part of building Marquez is really about: how do you take ownership of your jobs? How do you understand what you're depending on, who owns the dataset you depend on and the job that produces it, and who depends on the dataset you're responsible for? As companies grow, and you have more and more teams that depend on each other through sharing datasets, how do we build this really good culture of data ownership and depending on each other, and how are we all on call for it? Especially in a world where machine learning is becoming more prominent, problems in data affect production more and more. It used to be that when your service is down, you most likely impact something right now, whereas when a batch process doesn't work, maybe you impact something in a few hours or the next day, and maybe it's less urgent. It's becoming more and more urgent and important to have good production practices around data processing. So I think that's one of the gaps, and that's where Marquez can help. It also connects with all those other aspects of governance and discovery, but also how you take ownership of datasets and jobs, and how they depend on each other.
1:00:42  Tobias Macey
Well, thank you both very much for taking the time today to join me and discuss your work on Marquez. It's a pretty interesting project and one that I look forward to taking advantage of in my environment. So thank you for your efforts on that front, and I hope you enjoy the rest of your day.
1:00:55  Julian Le Dem
Thank you, Tobias. You too.
1:00:56  Willy Lulciuc
Yeah, thanks. I always enjoy talking about metadata, so this was a great discussion.
1:01:06  Tobias Macey
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.