Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.


episode 82: Data Lineage For Your Pipelines [transcript]


Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to to learn more and take advantage of our partner discounts when you register.
  • Go to to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at
  • Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Pachyderm is and how it got started?
    • What is new in the last two years since I talked to Dan Whitenack in episode 1?
    • How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
  • A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
  • Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
    • How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
  • There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
  • Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
    • What is the interface for exposing and exploring that provenance data?
  • What are some of the advanced capabilities of Pachyderm that you would like to call out?
  • With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
  • What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
  • What do you have planned for the future of Pachyderm?
Contact Info
  • @jdoliner on Twitter
  • LinkedIn
  • jdoliner on GitHub
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Pachyderm
  • RethinkDB
  • AirBnB
  • Data Provenance
  • Kubeflow
  • Stateful Sets
  • EtcD
  • Airflow
  • Kafka
  • GitHub
  • GitLab
  • Docker
  • Kubernetes
  • CI == Continuous Integration
  • CD == Continuous Delivery
  • Ceph
    • Podcast Interview
  • Object Storage
  • MiniKube
  • FUSE == File System In User Space

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-05-27  49m
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to data engineering slash Linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to data engineering slash segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to data engineering slash conferences to learn more and to take advantage of our partner discounts when you register, and go to data engineering to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and co-workers. Your host is Tobias Macey, and today I'm interviewing Joe Doliner about Pachyderm, the platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance. So Joe, could you start by introducing yourself?
Joe Doliner: Yeah, sure. My name is Joe Doliner. It's great to be here talking to you today, Tobias. I am the founder and CEO of Pachyderm. I started life as a software engineer, and the first company I ever worked at was RethinkDB. That's basically the only other company I've worked at, besides a little while at Airbnb in between.
Tobias Macey: And so do you remember how you first got involved in the area of data management?
Joe Doliner: Yeah, absolutely. I mean, I have always been interested in data infrastructure tools. RethinkDB was an open source database, and I knew coming out of college that I wanted to work on these types of data management, data analysis, and data manipulation tools. So I joined that company right out of college and got to cut my teeth doing open source software development and data infrastructure and things like that, and I absolutely fell in love with it. Then after I left RethinkDB, I got really interested in big data. RethinkDB is more of a transactional database; you use it as the backend of your website. I wanted to learn what the world of data science and data analysis looked like, so I started hacking on Pachyderm in my spare time. It actually started because I wanted to use the Hadoop platform to analyze some chess games. I'm a big chess fan. And the system was just really, really kludgy, and it was all based on Java, which I didn't like that much. So I started hacking on what an alternative to this might be. Along the way, I spent some time working at Airbnb, so I got a chance to see what their Hadoop infrastructure looked like and what the challenges were there. Doing this concurrently with hacking on my own stuff, it eventually turned into the platform that became Pachyderm. And then we managed to get funding as a company, and the company sort of took off from there.
Tobias Macey: And so I actually had Dan Whitenack on to talk about Pachyderm all the way back in episode one, about two years ago, but I'm wondering if you can talk a bit about what has happened in those two years, both in terms of the platform itself and the company, and just the overall environment of big data and data analytics that you're fitting your platform into?
Joe Doliner: Yeah, absolutely. The core mission of the company hasn't really changed much. When I was working at Airbnb, I saw a lot of gaps in the data infrastructure that existed in that day and age. The biggest one I saw was an absence of the ability to track any sort of provenance or lineage of the data. The way that this really came up for us at Airbnb was that we had this massive pipeline of data analysis tasks that had been written by a bunch of different data scientists. It was really, really challenging to keep all of these green at the same time, because everybody's modifying them, they're all working independently, someone makes a change that's incompatible with the ones downstream, and then the whole thing just cascades red all the way down. So we would have important tasks, like our fraud models, that would just start coming out blank when something went wrong. When that happened, I'd be going in to debug it and trying to figure out where along the way this broke, and I didn't have any way to ask the system, "give me the full lineage of this data, because it looks wrong," or something like that. So that hasn't really changed. What has changed is the rest of the platform maturing around us. When you first talked to Dan, we were probably about six months into using Kubernetes, and that was because Kubernetes 1.0 had been released probably five months ago. So we were trying to figure out what we could do on this platform and what sort of stuff it could provide. Now that's a lot more clear, and there have been a lot of features that have been figured out in Kubernetes that we've been able to just pass along to our users. We've also figured out a lot about how to integrate with the various machine learning packages that exist.
So Kubeflow didn't exist at the time when you talked to Dan, or when Kubernetes first came out. But it does now, and it gives you a very, very good way to deploy a machine learning pipeline on Kubernetes, which by extension gives you a good way to deploy these machine learning tasks in Pachyderm. This is actually very complementary with Pachyderm, because Pachyderm basically takes the data right up to the point where it gets into the machine learning models. Any type of sophisticated machine learning model is going to have a lot of steps in it that are cleaning the data, getting it into the right format, and joining it with the right data that you need to train on, and then the actual training process happens inside of Kubeflow. Then that comes back out into Pachyderm, and we start doing the inference steps and checking how good this machine learning model is and stuff like that. And that all happens within Pachyderm as well.
Tobias Macey: And I know that there are a number of other features of Kubernetes itself that have arrived in those past two years, including things like StatefulSets. So I'm wondering what are some of the other primitives of the platform that have come along that have simplified or obviated certain parts of the Pachyderm code base itself?
Joe Doliner: StatefulSets are definitely one of them, because we're not a stateless service. Pachyderm, in fact, is all about storing state, because it's for storing large amounts of data. The bulk of the data is actually stored in object storage, and so that worked before StatefulSets. But we also rely on etcd as the metadata and consensus system for Pachyderm, and so having a StatefulSet set up to manage etcd is really nice and manages things really well. Another thing that's been really, really big for our customers in particular is GPU support. In Kubernetes, and this has been true for a while, you can submit a resource request to say that this pod needs this much memory and this much CPU, and you can also have it ask for GPUs. What that'll do is tell the scheduler that this needs to be scheduled on a machine that has a GPU, and that it needs to be given a GPU and have that available to it during processing. This is really nice when you want to run these high-powered machine learning tasks that train a lot faster on a GPU. As for other things in Kubernetes that we've used, we've been relying a lot on the Ingress feature for some of the cloud stuff that we're building. We're now in the process of rolling out our cloud offering for Pachyderm, and the fact that Kubernetes can do a lot of sophisticated Ingress things with load balancers, and that you can build authentication right into those, has been really, really useful for us.
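To make the GPU scheduling point concrete, here is a minimal sketch of a Kubernetes Pod manifest built as a Python dictionary. It assumes the cluster runs the NVIDIA device plugin, which is what exposes the `nvidia.com/gpu` resource name; the pod name, image, and CPU and memory figures are placeholders invented for illustration, and the helper `gpu_pod_manifest` is hypothetical.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that asks the scheduler for GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,  # placeholder image name
                    "resources": {
                        # Ordinary CPU and memory requests (placeholder values).
                        "requests": {"cpu": "2", "memory": "8Gi"},
                        # Extended resources such as GPUs are declared under
                        # limits; the scheduler then only places the pod on a
                        # node that can supply that many GPUs.
                        "limits": {"nvidia.com/gpu": gpus},
                    },
                }
            ],
        },
    }


manifest = gpu_pod_manifest("train-classifier", "example.com/trainer:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with the usual Kubernetes tooling; nothing here is specific to Pachyderm.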
Tobias Macey: So, as you mentioned, Kubeflow has come along, and that's, as you said, complementary to the capabilities of Pachyderm. But I'm wondering if you can just briefly talk about what the main pieces of Pachyderm itself are. I know that there's the Pachyderm file system for supporting versioning, and there's the Pachyderm pipeline system. And I'm wondering if you can talk a bit more about any additional complementary aspects of the overall big data ecosystem, things like Airflow or Kafka or various other big data pieces that fit together nicely with Pachyderm, or that Pachyderm sort of supplants, in terms of the overall workflow of somebody who's building an analytics pipeline on the Pachyderm platform?
Joe Doliner: Yeah, absolutely. So at a very high level, the two pieces that you just touched on, the Pachyderm file system and the Pachyderm pipeline system, are basically all of Pachyderm. Everything that we have, we think of as in one of those two camps. The file system, like you mentioned, is responsible for providing version control for your big data. For those of you who haven't heard the first episode, where Dan talked about it, its semantics are very similar to Git: you've got commits, you've got repos, you've got branches, but it can store massive amounts of data. And it's storing it in cloud storage, so it's storing it in S3 or GCS or something like that. The Pachyderm file system is also the thing that's responsible for enforcing the provenance constraints. It's got basically this constraint solver built into it, where you say, here's a repo that contains images, and here's a repo that contains tags on those images, and then here's a branch that is associated with those two branches, meaning that it contains computations that have been done using those images and those tags on those images. The Pachyderm pipeline system uses this API to then implement a machine learning pipeline that takes these tags and these images and trains a classifier based on those. But in theory, something else can use that, and other people do implement their own things on top of this and basically use the provenance system without using our containerized execution system. You can insert SQL queries in there, you can insert all sorts of things in there. The pipeline system is what's responsible for the scheduling of these tasks. So it uses Kubernetes to say, I want this branch to be materialized that contains machine learning models trained on the images that come in here and the tags that come in here.
And the pipeline system knows, okay, when a new commit comes in, I need to spin up these pods, I need them to have GPUs, I need them to have these containers in there, such as TensorFlow. Then it runs all the code, it slices up all the data, and it makes sure the data gets into the pod and then the data gets out of the pod. Ultimately, you get your results. And because of the provenance system, your results will always be linked to the inputs that created them. There's no way to short-circuit this. It's not like a system where, when you check in your results, you also need to check in a manifest of where they came from. It's basically hard enforced by the system. To give you an idea of some of the new things and how this plays into other data systems, we recently released this feature called spouts. These are sort of like a pipeline, in that they schedule a pod on Kubernetes. What's different about them is that, whereas pipelines normally take inputs, process that data, and produce outputs, these just stay up all the time and produce outputs. So it's like a spout of data coming into your system. This is really, really useful for subscribing to a Kafka topic, for example. It allows you to have a very convenient shim between Pachyderm and any other system that you can subscribe to, because it's a container, so you can put whatever code and whatever libraries you want in there. You can very easily have something that subscribes to a feed on Twitter and gets new tweets coming in, and those will just show up in your Pachyderm file system. Then downstream of that, you can have all sorts of sophisticated pipelines that are processing those tweets, that are training models on those tweets, stuff like that.
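The idea that results are always linked to the inputs that created them can be sketched with a toy, in-memory model. This is not Pachyderm's API or its storage format; `Commit`, `Store`, and `lineage` are hypothetical names, and the repos and file contents are made up. The only point is that an output commit records the IDs of its input commits at creation time, so lineage can be answered by walking those links.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    repo: str
    files: tuple            # sorted (path, content_hash) pairs
    provenance: tuple = ()  # IDs of the commits this one was computed from

    @property
    def id(self) -> str:
        # The ID covers the contents and the provenance, so the link to the
        # inputs cannot be changed without changing the commit itself.
        payload = json.dumps([self.repo, self.files, self.provenance])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class Store:
    def __init__(self):
        self.commits = {}

    def commit(self, repo, files, inputs=()):
        entries = tuple(sorted(
            (path, hashlib.sha256(data).hexdigest())
            for path, data in files.items()
        ))
        c = Commit(repo, entries, tuple(i.id for i in inputs))
        self.commits[c.id] = c
        return c

    def lineage(self, commit):
        """Return every upstream commit, nearest inputs first."""
        seen, queue = [], list(commit.provenance)
        while queue:
            cid = queue.pop(0)
            if cid not in seen:
                seen.append(cid)
                queue.extend(self.commits[cid].provenance)
        return [self.commits[cid] for cid in seen]


store = Store()
images = store.commit("images", {"cat.png": b"\x89PNG..."})
tags = store.commit("tags", {"cat.json": b'{"label": "cat"}'})
model = store.commit("model", {"weights.bin": b"\x00\x01"}, inputs=[images, tags])
print([c.repo for c in store.lineage(model)])  # ['images', 'tags']
```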
Tobias Macey: Yeah, and I think that that's definitely one great differentiating factor from the Hadoop platform that you're working to sort of replace, where it's entirely batch oriented and there are streaming capabilities that have been bolted onto it. Having it built into Pachyderm as a first-class feature, I think, is definitely useful, given that there is, particularly in the past couple of years, a lot more of a push toward doing real-time and streaming analytics.
Joe Doliner: Yeah, absolutely. This was one of the earliest features that we conceived of, because we have very sophisticated streaming capabilities. But it's not like when you're making a Pachyderm pipeline, you choose, okay, this is going to be a batch pipeline and this is going to be a streaming pipeline. There's really no difference between the two. And the reason that we can do that is the underlying version control system. Because we can always say, all right, this data has this hash, it's part of this commit, it hasn't changed since the last commit, we processed it then, we got a successful result, and here's the result, again identified by a hash, so we know that it corresponds to the same code, the same data, and everything, we just get to reuse that result. So really, the reason that it's a streaming system is that we've got this pretty sophisticated computation deduplication system in the background that, whenever it goes to compute something, tries to figure out if it's already computed it, and if it has, it just uses that result. This is often a bit of a magic moment for people when they first start using Pachyderm. They put in a bunch of data, they churn through it, and it takes a little while, because it's an expensive computation. Then they add a little bit more data, and it happens super quickly. We actually get people coming into our user channel asking, "Why did this happen so quickly? I think something's broken, it didn't process it." And it's like, no, the system just figured out that it didn't need to reprocess all of that data. So you got a result really quickly, because there actually wasn't much to do.
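The deduplication described here can be sketched as a cache keyed by a hash of the code version and the input data. This is a toy illustration of the principle under stated assumptions, not Pachyderm's implementation: `IncrementalRunner`, `code_version`, and the word-count function are hypothetical, and real systems persist the cache rather than keeping it in memory.

```python
import hashlib


class IncrementalRunner:
    """Run fn over files, skipping any (code, data) pair already processed."""

    def __init__(self, code_version: str, fn):
        self.code_version = code_version
        self.fn = fn
        self.cache = {}       # hash of (code version, data) -> result
        self.executions = 0   # how many times fn actually ran

    def _key(self, data: bytes) -> str:
        h = hashlib.sha256()
        h.update(self.code_version.encode())
        h.update(b"\0")  # separator between the code version and the data
        h.update(data)
        return h.hexdigest()

    def run(self, files: dict) -> dict:
        results = {}
        for path, data in files.items():
            key = self._key(data)
            if key not in self.cache:
                # Only unseen data (or data under new code) is computed.
                self.cache[key] = self.fn(data)
                self.executions += 1
            results[path] = self.cache[key]
        return results


runner = IncrementalRunner("v1", lambda data: len(data.split()))
runner.run({"a.txt": b"one two", "b.txt": b"three four five"})
first = runner.executions
runner.run({"a.txt": b"one two", "b.txt": b"three four five", "c.txt": b"six"})
print(first, runner.executions)  # 2 3
```

The second run touches three files but executes the function only once, which is the "why did this happen so quickly" effect in miniature. Changing `code_version` would change every key and force a full recomputation.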
Tobias Macey: Yeah. And when I was reading through the documentation, I was definitely impressed by the deduplication and data hashing capabilities that you have in the file system, and how that supports the incrementality of computation, so that, as you said, you don't have to do a complete rebuild of an entire batch job. You can just work on the data that's new since the last time you ran something.
Joe Doliner: That was one of the things that I was most excited about having before I even really started working on Pachyderm, because I spent so much time waiting for things to compute. I think probably anybody who's tried to do a decent-sized data project has experienced this, where you write out all of your code, you run it on all of your data, and then you find that there are one or two files that have some slightly different format that crash the whole thing. So then you fix your code and try to get it to run, and you can't get it to run on just the stuff that it failed on. So you have to sit there and wait for two hours to see if it works on these two files, and if it doesn't, you have to do that again. This was even worse at Airbnb, because we had so many things depending on each other, and we had so much data there, that basically the granularity that we had was running stuff once a day, because the pipelines would run every single night. So if things were broken, then we'd basically write some new code, commit it, and then come in the next morning and see whether it worked. And if it didn't, then we'd do the same thing the next night.
Tobias Macey: Yeah, that's definitely a quick way to build a lot of frustration and burnout on a data team.
Joe Doliner: Yeah, and that's really the biggest reason that I wanted to do this company and this open source project: I felt like data teams in general were in a state where they could be a lot more productive if the tools looked a lot better. It reminded me a lot, and still does to a certain extent, of what making websites looked like before the LAMP stack. People had all these CGI scripts, and there were all these things that you could sort of cobble together, but there wasn't just this well-known, good platform that you could get out of the box and build a website in a weekend in your garage, or something like that. Then once that platform existed, and people started to congeal around it and the tooling started to explode, you got this explosion of websites, and people were able to make all of this cool stuff. I feel like that still hasn't quite happened yet for data science and data engineering, but we're getting a lot closer to it.
Tobias Macey: And that's definitely something that I'd like to talk through in the context of Pachyderm: how the collaboration between data scientists and data engineers, and the breakdown of responsibilities and workflow, happens within data teams, both when it's just one data scientist doing everything and when you're working at a medium to large organization where you actually have that separation of roles, and just the overall process of going from conception to delivery of a data project.
Joe Doliner: Yeah, absolutely. So one of the first and most important things to say about this, because it's often a misconception that people have that throws them off a lot at the beginning, is that Pachyderm is not trying to be a replacement for Git or GitHub or any of these other version control and collaboration tools. They're version controlling different things. So when people are successfully collaborating on Pachyderm, normally what this looks like is you have your code in GitHub or GitLab or, you know, somewhere in version control. And you have a repo. I like to have it all in one repo, but you can have it across multiple repos. You have a repo that has your analysis code that can be compiled into Docker containers, and that also has your pipeline manifests that explain how to deploy this onto a Pachyderm cluster. From there, you set up a CI pipeline that basically redeploys these pipelines when commits come in, so that you can merge into master and have a CI/CD process on top of this. Then from that, where you start to leverage the Pachyderm features is the fact that when you want to have a branch that people are working on, that's a sort of experimental thing, you can have your CI/CD process deploy that into separate branches and separate pipelines in Pachyderm that can still share all the underlying data. So you don't need to make a copy of the data. It's version controlled and deduped. But you can have these two pipelines running concurrently, and you can see, okay, this one's succeeding, whereas this one's failing, so we want to move the one that's succeeding, and this one is performing this much better based on these metrics pipelines that we've put on at the end.
And you can basically have a collaborative process around this because the tools enable it. It's a very open-ended tool, similar to Git. People have a million different branching strategies in Git. People use monorepos, people use small micro repos for their projects. And Pachyderm isn't particularly more prescriptive than Git in that regard, so we see people using this in a bunch of different ways. But the core underlying concept is that you can collaborate because the system is tracking your versions for you. So you always know which way is up, because you can always just ask the system, what's the history of this data? What's the lineage? Lineage means, take me back to how this data was produced, whereas history is, take me back to what it looked like yesterday, a year ago, et cetera. And you can do things like bisects. You can say, this looks bad now, it looked good a week ago, where in between did it change?
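The bisect idea mentioned at the end ("it looked good a week ago, where in between did it change?") is a binary search over an ordered history. The sketch below assumes a hypothetical list of commit names and a hypothetical `is_good` check supplied by the user; it is a generic illustration of the technique rather than a Pachyderm command.

```python
def first_bad_commit(history, is_good):
    """Find the first bad commit in an oldest-to-newest history.

    Assumes history[0] is known good, history[-1] is known bad, and that
    once the data goes bad it stays bad.
    """
    lo, hi = 0, len(history) - 1  # invariant: history[lo] good, history[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(history[mid]):
            lo = mid
        else:
            hi = mid
    return history[hi]


# Hypothetical history of eight commits where the data went bad at commit-5.
history = [f"commit-{i}" for i in range(8)]
checked = []


def is_good(commit):
    checked.append(commit)  # record which commits had to be inspected
    return int(commit.split("-")[1]) < 5


print(first_bad_commit(history, is_good))  # commit-5
```

Only three of the eight commits are inspected here, which is what makes bisecting practical when each check means rerunning an expensive validation.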
Tobias Macey: And one of the challenges inherent in Pachyderm is just understanding some of the primitives of things like Docker and Kubernetes. So I'm wondering what you've found to be some of the common challenges or points of confusion or stumbling blocks for people who are coming into this project and trying to get up and running with it, because even just trying to define a Dockerfile can oftentimes be a nightmare in and of itself.
Joe Doliner: Yeah, so that's definitely one of them: just understanding this idea that Docker is like a machine that you're setting up every single time, but it's not really a VM. And sometimes the details of your machine poke up into it, because it's the same Linux kernel and everything like that. That's definitely one of the challenges. I think that's one that people normally get past, at least these days. It used to be a lot more of a challenge maybe three years ago, but I think that both the learning materials about Docker and Dockerfiles and just the communal knowledge of that have really started to take hold. So at this point, if you're working at a decent-sized company, even if you don't know Docker, somebody there does and will be happy to sit you down and explain, here's how you make a Dockerfile, here's how you build things. I don't think the same can really be said for Kubernetes yet. In some ways, I think that makes sense, because Kubernetes is newer, and it's also a more specific tool and a more complicated tool. So definitely the biggest stumbling block for people getting started with Pachyderm is just getting Kubernetes set up. We help people with that all the time, as much as we can, but we're actually not super Kubernetes experts either. We understand how to use it, and we understand how to deploy it in our system and stuff like that. But for people who want to deploy it on-prem, or people who want to deploy it in sort of weird settings, we don't always know what to tell them about how to get Kubernetes to work. I think that those are the biggest two. The other one that is kind of interesting is getting the underlying storage set up. To run Pachyderm, you need access to an object store, and you need some sort of a persistent volume for etcd to run on.
And on AWS or GCP or Azure, this is all pretty well known. And we have a deploy assistant that will basically just spit out a manifest that will set up all of these things for you on Kubernetes. But the variety of object stores that people want to run against seems to be growing, in our experience. And so there are all of these sort of slightly off-the-beaten-path ones like Ceph and SwiftStack and things like that. And each one of those is a little bit of a new adventure to get the system set up on. And then it's also a bit of a new adventure for us, because while they all ostensibly support the same S3 API, there are little subtle differences in how they support that S3 API that occasionally trip our system up. And so we've been doing a decent amount of work recently on just trying to cover all of these different subtle differences between them, and get it to work on all of these object stores.
Tobias Macey: And another thing that can often be challenging when working with cloud-oriented workflows is trying to figure out what the local dev story looks like. So I'm curious what the general approaches are, or at least what your general approach is, for trying to do local experimentation and iteration on some code, or maybe trying to pull in some subset of the data in the Pachyderm file system, for getting things ready to go before you ship it off to production.
Joe Doliner: Yeah, absolutely. I mean, I'll say sort of upfront that this is one of the parts of Pachyderm that I'm least satisfied with how it is right now. There's, I think, a lot of work to be done on it. And I think that there's a decent amount of work in just Docker land in general to make this really good. The sort of anti-pattern that you get into that really sucks is that your development loop looks like: write some code, build a Docker container, push that Docker container to Docker Hub, redeploy a pipeline that points to that container, which then pulls down the container and runs it. And by then that's possibly taken 10 minutes or so. And then you get some results back on what you need to change. It's like, oh, this Python code doesn't run, you're referencing a variable that doesn't exist. And then you try it again. And you can't just run it on your local machine, because you don't have the data accessible to you. What I do when I'm developing pipelines on Pachyderm, and that works pretty well, is I do everything entirely on the same Docker host. So I have Minikube running, and that's just running on Docker on my local machine. And then when I build my image, it just builds on my local Docker host. And then when I run it, the image is right there. So I don't need to push it anywhere, I don't need to pull it anywhere, because it's right there. And that leads to a pretty quick development loop. The other thing that you can do that can be pretty effective is that Pachyderm supports a FUSE mount for your file system. So you can just do pachctl mount, and a directory will show up that has all of the data that's available within your distributed file system, within PFS. And it's kind of cool, because you ls this directory and you're like, oh, shoot, here's a file that's terabytes in size.
And of course, this is only working because it's not actually on your file system. And then you can run your code against this FUSE mount, and you can run with actual data and see how things are going to work. The challenge with this is that, one, it doesn't create the Kubernetes environment around it. So if you want to have a secret available to you in Kubernetes such that you can access some outside service, then you need to sort of mock that up. And sometimes the time spent mocking it is not really canceling out the time that you're saving by not just pushing this into the Kubernetes cluster. The other thing is that Pachyderm gives you this pretty nice way to describe how data gets split up, which is just using glob patterns. If you're familiar with ls on a command line, when you do ls *, that star is a glob character for glob patterns. And this in Pachyderm is how you define that you can process all of these things in parallel, and it parallelizes it. But when you just mount data in, it's not respecting that in any way. So we have some work to do in terms of the local development story for Pachyderm, for sure. Right now it's good enough that people can get things done. And where things really get nice is when you have some code that you want to be running in production, and you want to be able to rely on this just running every single night. Then Pachyderm is great at just running every single day, keeping it going, and letting you know when there's an error.
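The glob-based splitting described in this answer can be sketched roughly as follows. This is a minimal illustration, not Pachyderm's implementation: the helper name `split_into_datums`, the example paths, and the rule that each distinct match of the pattern becomes one independent unit of work are all assumptions made for the example.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def split_into_datums(paths, glob):
    """Group file paths into independent units of work (hypothetical helper).

    Each distinct path prefix that matches `glob`, component by component,
    becomes one unit that could be processed in parallel with the others.
    """
    glob_parts = [part for part in glob.split("/") if part]
    datums = defaultdict(list)
    for path in paths:
        parts = [part for part in path.split("/") if part]
        if len(parts) < len(glob_parts):
            continue  # path is shallower than the pattern, so it cannot match
        prefix = parts[: len(glob_parts)]
        if all(fnmatchcase(p, g) for p, g in zip(prefix, glob_parts)):
            datums["/" + "/".join(prefix)].append(path)
    return dict(datums)


# Hypothetical repository contents, for illustration only.
paths = [
    "/images/a.png",
    "/images/b.png",
    "/logs/2019/jan.txt",
    "/logs/2019/feb.txt",
]

whole_repo = split_into_datums(paths, "/")      # one unit holding everything
per_top_level = split_into_datums(paths, "/*")  # one unit per top-level entry
per_second_level = split_into_datums(paths, "/*/*")

for name, files in per_top_level.items():
    print(name, files)
```

In this sketch a pattern of `/` yields a single unit containing every file, while `/*` yields one unit per top-level entry. A plain mount of the data does not reproduce that partitioning, which is the gap mentioned above.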
Tobias Macey: Going back to the idea of data provenance and data lineage, you've mentioned that some of the way that it's tracked is through these versioning capabilities of the file system. But I'm wondering if you can just dig deeper into the underlying way that it's represented, as far as tracking it from source to delivery, and how that actually is exposed when you're trying to trace back from the end result all the way to where the data came from and what's happened to it along the way.
Joe Doliner: Yeah, absolutely. So the level that we track provenance at is the commit level. And the sort of first problem that you have to solve if you want to track provenance is that you want to store a reference to some data that you know isn't going to change, right? Because if I tell you this machine learning model was created using all of the images in this image directory, and then I go and add 10 new images to that image directory, well, then that doesn't tell you anything anymore, right? Because you don't know what was actually used to create the model, you just know where that data happened to be stored at the time when it was used. So commits allow us to have this immutable snapshot of what data looked like at a certain point in time. From there, we link these commits together. So if you've got pipelines in Pachyderm, then the input to those pipelines is data commits, and the output from those pipelines is also data commits. And the relationship between these commits is the provenance relationship. And so any commit in Pachyderm basically has this metadata attached to it that is just all of the commits that it is provenant on. And you can inspect these commits using the command line, using our API, using the web interface, and it'll just show you a list of these commits. And then of course you can track those commits up and look at what's in those commits. And so the structure of this is a pretty standard directed acyclic graph structure from computer science. Now, something that's sort of a cool aspect of the provenance system is that we actually track provenance at another level, which is the branch level. And this doesn't quite mean the same thing as commit provenance. Commit provenance is this sort of immutable snapshot that tells you, here's where this data came from. The provenance on branches basically describes how your data is flowing at the time.
So if a branch is provenant on another branch, then that means that every time you get a commit to the upstream branch, you also get a commit to the downstream branch. And that downstream commit is the result of processing the upstream commit, which means, of course, these commits are going to be linked by provenance as well.
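The commit-level provenance graph described in this answer can be sketched in a few lines. This is a toy model under stated assumptions, not Pachyderm's data structures or API: the `Commit` class, the `lineage` and `propagate` helpers, and the repository names and IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot plus the commits it was directly derived from."""
    repo: str
    id: str
    provenance: tuple = ()  # direct upstream commits (hypothetical field)


def lineage(commit):
    """Walk the directed acyclic graph upward and collect every ancestor."""
    seen = set()
    stack = list(commit.provenance)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(current.provenance)
    return {(c.repo, c.id) for c in seen}


def propagate(upstream_commit, downstream_repo, new_id):
    """Branch-level provenance in miniature: a new upstream commit produces
    a downstream commit that is linked back to it."""
    return Commit(downstream_repo, new_id, (upstream_commit,))


# Hypothetical pipeline: images -> edges -> model.
images = Commit("images", "c1")
edges = propagate(images, "edges", "c2")
model = propagate(edges, "model", "c3")

print(sorted(lineage(model)))
```

Because each commit is immutable and stores its direct upstream commits, tracing a result back to its sources is a plain graph traversal, and adding new images later creates a new commit rather than changing what `model` points to.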
Tobias Macey: What are some of the other advanced capabilities of Pachyderm that you think are worth calling out that are often overlooked or underutilized?
Joe Doliner: I wouldn't say it's necessarily underutilized, but it's definitely not something that people immediately associate with Pachyderm, and it gets a lot of use, which is our sort of cron functionality. And that's the ability to have a pipeline that isn't triggered by putting data in the top and getting data out the bottom, but rather is triggered just on a cadence. And so people use this a lot of times as a way to do something every hour, do something every night. They use it to scrape things, they use it to push things, and stuff like that. I think that's one of those features that's not actually super sexy, it's just super useful. Let's see. I think the fact that you can expose a lot of the various underlying Kubernetes things is something that hasn't been fully explored. People are sort of finding new things to do with that every single day. So in Pachyderm you can attach to pipelines sort of random modifications to your pods. And so this can be useful for assigning affinities, that can be useful for declaring resources that you need. But there are always these new things being added to Kubernetes that are really, really useful, and those sort of just naturally propagate up into Pachyderm.
Tobias Macey: And for the file system and for interacting with other source systems, does Pachyderm support things like the S3 Select API, or being able to run pushdowns on the different data sources, for trying to optimize for speed and latency and reducing the amount of data that actually needs to be transferred over the wire?
Joe Doliner: So it does, we can sort of select individual pieces of it, if that's what you're talking about. I actually don't know what the Select API does, specifically.
Tobias Macey: My understanding is that S3 recently added an API where, for certain file types, you can actually run a select query, so that rather than just pulling down a blob, it can actually index into the data itself and understand what's contained within it, so that you don't have to return the entire object.
Joe Doliner: We'll probably have some trouble leveraging this, because Pachyderm is designed to work on a bunch of different object stores, and so we're pretty reluctant to implement anything that's only going to work on S3. One thing that this did remind me of, though, is a cool new feature that we just added. It hasn't gotten anywhere near enough love, because it was only very recently released: we now support an S3 API on top of PFS. And so if you have applications that are used to writing data into S3 as their data lake, then you can just swap in Pachyderm, and it speaks the S3 API. And you can put things in there, and those will turn into files in PFS that are committed and stuff like that. And underneath the hood, this is all still going into S3. So it's going to have much the same storage characteristics that you are used to in terms of costs and everything like that. But in addition you're going to get this version control and the ability to run pipelines on top of it.
Tobias Macey: That's definitely really cool, being able to just transparently put Pachyderm in there so that the end user doesn't even have to be aware of it. But at the same time, they're getting some of that added benefit of provenance and deduplication that Pachyderm supports.
Joe Doliner: Yeah, this is also how we support data tools that are used to reading stuff out of S3. So for example, this is how we support Spark. Spark can be told, read this data out of S3, perform this Spark operation on it, and then write it back into this other place in S3. And now, because we speak the S3 API, that can just be Pachyderm under the hood. And you now have provenance on your Spark operations.
Tobias Macey: And so in terms of the provenance, I know that because you're versioning the containers that are executing as part of the pipeline, that is an added piece of information that goes into it, as far as: this is the data that was there when we started it, this is the code that actually executed, and then this was the output. But for external systems, do you have any means of tracking the actual operations that were performed, to enrich the metadata associated with the provenance?
Joe Doliner: Yeah, so those can basically use the same system we use. We track the information about all of the code that ran, the Docker container and everything like that, but we actually just do that by piggybacking on PFS's provenance system, because we just add that as a commit. So every job has what we call a spec commit that specifies how the job is supposed to be run. And that includes the code and the Docker container and everything like that. And so outside systems are basically just expected to take whatever they can serialize this information as and just put it in a commit. And then that's, in essence, considered as an input into the pipeline. In terms of the provenance tracking in the storage system, it's really not any different than any other input. It's just that this one happens to define the code that's running in the computation.
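The idea that an external system serializes a description of its work and records it as one more input can be sketched as below. This is an illustration only: the `spec_commit_id` helper, the field names in the spec, and the use of a content hash as the identifier are assumptions made for the example, not a description of how Pachyderm assigns commit IDs.

```python
import hashlib
import json


def spec_commit_id(spec):
    """Serialize a job description deterministically and derive an identifier.

    Hypothetical helper: sorting keys makes the serialization stable, so the
    same description always maps to the same identifier.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical description of an operation run by an outside system.
spec_a = {"image": "example/transform:1.0", "cmd": ["python", "job.py"]}
spec_b = {"cmd": ["python", "job.py"], "image": "example/transform:1.0"}
spec_c = {"image": "example/transform:1.1", "cmd": ["python", "job.py"]}

id_a = spec_commit_id(spec_a)
id_b = spec_commit_id(spec_b)
id_c = spec_commit_id(spec_c)

# The spec is recorded alongside the data inputs, like any other input.
job_inputs = [("images", "c1"), ("spec", id_a)]
print(job_inputs)
```

Here the same description in a different key order maps to the same identifier, while a changed container image produces a different one, so the recorded inputs distinguish runs that used different code.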
Tobias Macey: So earlier, you were saying how, because Pachyderm is so flexible, the ways that people are using it are sort of up to everyone's imagination. And so I'm curious what you have seen as far as the most interesting or innovative or unexpected ways that people have been leveraging the platform.
Joe Doliner: Man, let's see. I mean, there are things that are interesting because the end results are interesting. So I think a lot of the image processing and machine learning things that I've been seeing trained on Pachyderm are the most interesting to me. They're not really cute little hacks in the system, or interesting abuses of the system. Some of the really interesting things that people do, in terms of things that I never thought anyone would do with the system, are sort of calling out to other Pachyderm APIs from within the pipeline. So you can have a pipeline that, as part of its operation, creates another pipeline, or does something like that. And this is something that we don't officially recommend that people do, because we haven't really thought about it enough, and we think there might be some weird things. But we've seen people do some really cool things with it. So we don't police this in any way or anything like that. That's sort of the great thing about open source software: we're not going to stop you from doing stuff, it's yours, you can do whatever you want with it. But it's not something that we had officially thought about as a use for it.
Tobias Macey: We'll let you aim your foot gun at whichever foot you want.
Joe Doliner: Right, exactly. That's a very important principle to us: we don't want to give you foot guns in disguise. We don't want to trick people into using foot guns. But also, if you can't shoot your foot off with a system, you also can't do anything clever with it. And this is true of all of the sort of Unix ecosystem and things like that. You can shoot your foot off with it, but you can also find really cool, useful things to do in those abuses. And so we feel like we have to be open to people doing that with Pachyderm, because a lot of our best features and our best understanding did originally come from people abusing the system.
Tobias Macey: And so a few months ago, you announced that you had raised a series A round of funding. And I know that with most venture capital that usually comes with some strings attached where they're hoping for some measure of hyper growth. And so I'm curious how you're approaching that stage of growing and scaling the Pachyderm platform and business.
Joe Doliner: Yeah, absolutely. And this always, I think, has an extra wrinkle to it when you're talking about an open source company. Because there have been some, I think, notable cases where an open source company has raised money and stopped really subscribing to the open source roots that got them where they were, and it's gone pretty badly for the community. We feel like we are very aligned with our investors in terms of what Pachyderm needs to do long term to be successful. And so we're not as much focused on, okay, we need to have this amount of revenue by this day, or not this day, but this quarter or this year, or we need to have this number of users. We're much more focused on what it takes to build a long-term sustainable open source project, and a company that is also long-term sustainable around that. And so we are much less focused on any particular revenue goal in the short term and much more focused on basically making Pachyderm into the platform that we've always believed it could be, and making it into something that's a ubiquitous tool for the underlying data infrastructure, particularly on top of containers. But we feel as if containers are kind of going to be the underlying cloud infrastructure for everything, and so the data infrastructure that goes on top of them is really going to be the de facto infrastructure for everybody. Of course, investors invest because they ultimately want to see a return, and so we need to make money off of Pachyderm. And that comes from support contracts, our enterprise product, and our cloud offering, which we're currently rolling out and which we think is going to over time become basically the vast majority of our revenue.
And so far, we don't feel as if any of these things are at odds with each other or misaligned. It's just a little bit tricky to get all the puzzle pieces to fit together, to make sure that we're staying true to the open source community and everybody who's used and contributed to this product up until this point, and also keeping the company around it going. Because the reality is the open source project probably isn't going to survive without the company contributing to it.
Tobias Macey: And so in terms of your overall experience of building and maintaining and scaling the Pachyderm project and business, what have you found to be some of the most challenging or useful or unexpected lessons that you've learned?
Joe Doliner: Definitely the most useful lesson I think I've learned is just to really listen to your users and see how they're using the product and try to go from there. I came into this with a whole bunch of ideas of what I thought a cool data infrastructure system would look like, and what I thought was going to be important to people. And I wouldn't say that I was wrong about everything, but I was surprised how much I didn't know. It wasn't even so much that the things I knew were wrong, just that there were these massive things that I hadn't even thought about. I mean, provenance is kind of a great example, actually. We initially implemented provenance as a sort of internal thing, where we were like, okay, we need to do this to track and keep things consistent and everything like that, and be able to see this stuff. And then it started to get more and more important for people, and people wanted it more and more. And then there started to be things like the GDPR that actually legislated provenance into the system, or at least legislated that companies had to be able to give people an explanation for machine-learning-made decisions and things like that. And so all of these things would have been easily missed if we hadn't really been listening and going back every single day to see, okay, how are people using this? How are people failing to use this? Things like that. The other thing, I think, that I've learned and been rewarded for is taking risks on new open source projects. Docker was pretty new when we started using it, and Kubernetes was brand new.
When we first started using it, there were a decent amount of internal discussions about, do we want to use a platform this new? And even at the beginning, there were a lot of discussions of, why are you guys building this system, rather than just building a Dockerizing thing on top of Hadoop, or a provenance tracking thing for Hadoop? And it took a lot of conviction to just say, no, we're going to build something new. We're going to take a stab at doing this our way and see what happens. And ultimately, I feel like we've been very rewarded for that. But it took a lot to be confident in doing that.
Tobias Macey: And what are some of the limitations or edge cases of Pachyderm? When is it the wrong choice?
Joe Doliner: So it's definitely the wrong choice when what you're doing is a very well-established data pattern that there are very good tools for. I think the best example of this is SQL. We have a lot of people ask, imagine I want to do Redshift-style data warehouse queries against Pachyderm, what's the best way to do that? And right now, the best answer to that is to just use Redshift, because it's really good at that, or any of the various options. There's BigQuery, there's Hive, there's Presto, and things like that. You can sort of start to integrate those things into Pachyderm. People will build Pachyderm pipelines that basically just orchestrate Redshift pipelines or BigQuery pipelines or things like that. But SQL is not something where we're able to beat the things that just do SQL, because there's a lot there, and it's not the most interesting challenge to us right now. Pachyderm really tends to do well in kind of the "everything else" data case, when people are thinking, hmm, I've got these genetics files, and while there are some pretty good toolkits for analyzing these on a single machine or something like that, there isn't really the distributed genetics pipeline tool or anything like that. And so for those, because Pachyderm is a super generic system, and you can just package those tools up into Docker containers and run them, it's very, very nice. It gives you some structure for these tools that otherwise you'd just be firing off with scripts, ad hoc, on random EC2 boxes, things like that.
Tobias Macey: And looking forward, what do you have planned for the future of Pachyderm?
Joe Doliner: So the biggest change in terms of what the company offers is the rolling out of our cloud offering, which is called Pachub. And if you think of everything in open source Pachyderm as Git, in that it enables collaboration on data science, things like that, then Pachub is kind of like GitHub for data science. And so it's basically an online site where you can go and you have your account, and that contains your data repositories and your pipelines that process those repositories. And you can fork other people's pipelines, and you can pull in other people's repositories, and things like that. It's a way for people to actually collaborate on live, running big data pipelines. That's the thing that we're most excited about in terms of what's different. There's also, of course, tons and tons of work to be done on the core open source project. So there are a lot of upgrades to the storage layer that are going to make it a lot more sophisticated and a lot faster, which I'm very, very excited about. And there are a lot of new pipeline features that are coming out. I mean, spouts was one of the first ones of those. We're also implementing more sophisticated join support, so that you can join two data sets together, and the ability to have more sophisticated pipelines that do loops and conditional branching and things like that.
Tobias Macey: And so for anybody who wants to follow along with you and the work that you're doing at Pachyderm, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Joe Doliner: Yeah, I mean, the biggest gap that I see is the one that I'm trying to fill, because that's why I wanted to do this company, and that's what I felt the opportunity was. But I would basically describe that as the absence of a really good set of tools that are prescriptive in how you're supposed to do these things. There are ways to do all of the things that Pachyderm allows you to do. You can write some form of version control on top of object storage, you can use Git repos in theory, and stuff like that. But there's nothing that really ties it all together and gets out of your way and lets you focus on the actual data science that you're really good at. And again, I'll go back to the analogy of the LAMP stack. It used to be that to build a website, you needed somebody who was an expert on actually implementing databases, because none of them worked that well for you. You needed somebody who understood how to run all these servers and all this stuff. And then once you get this stack that people can congeal around, it's just a very well-known, well-trodden path. The documentation starts to get really good, because there are so many people using it. Then we can stop thinking about that stuff and do all of the interesting stuff that that allows, like build Facebooks and eBays and things like that. And so we still feel that that hasn't really happened with data science, and that the way to get that to happen is to focus on the infrastructural layer that's needed to tie everything together, and do it in a generic enough way that people can use all their different tools on top of it. So I think that the LAMP stack worked really well because you could do all sorts of things that you wanted to do.
And the P part, the PHP part, became very generic, and people started swapping Python in there, and people started swapping Perl in there, everything like that. We have that same level of flexibility with our Docker-container-centric workloads, but we provide the same underlying storage and orchestration primitives that we think are basically what people need to get stuff done.
Tobias Macey: Well, I appreciate you taking the time today to join me and discuss the work that you're doing on Pachyderm and how it has grown and evolved in the past couple of years. I definitely think that it's a great project. It's one that I've been keeping track of for a long time now, and I hope to be able to use it for my own purposes soon. So thank you for all that, and I hope you enjoy the rest of your day.
Joe Doliner: Thank you Tobias. It was great to be here.