Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.


episode 85: Maintaining Your Data Lake At Scale With Spark [transcript]


Building and maintaining a data lake is a choose-your-own-adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at Databricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to to learn more and take advantage of our partner discounts when you register.
  • Go to to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at
  • Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Delta Lake is and the motivation for creating it?
  • What are some of the common antipatterns in data lake implementations and how does Delta Lake address them?
    • What are the benefits of a data lake over a data warehouse?
      • How has that equation changed in recent years with the availability of modern cloud data warehouses?
  • How is Delta Lake implemented and how has the design evolved since you first began working on it?
    • What assumptions did you have going into the project and how have they been challenged as it has gained users?
  • One of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?
    • In your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)
  • Can you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)
    • Are there limits in terms of the volume of data that can be managed within a single transaction?
  • How does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?
    • The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?
  • What have been the most difficult/complex/challenging aspects of building Delta Lake?
  • How is the data versioning in Delta Lake implemented?
    • By keeping a copy of all iterations of a data set there is the opportunity for a great deal of additional cost. What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?
  • What are the reasons for standardizing on Parquet as the storage format?
    • What are some of the cases where that has led to greater complications?
  • In addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?
  • When is Delta Lake the wrong choice?
    • What problems did you consciously decide not to address?
  • What is in store for the future of Delta Lake?
Contact Info
  • LinkedIn
  • @michaelarmbrust on Twitter
  • marmbrus on GitHub
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
  • Delta Lake
  • Databricks
  • Spark SQL
  • Microsoft SQL Server
  • Databricks Delta
  • Spark Summit
  • Apache Spark
  • Enterprise Data Curation Episode
  • Data Lake
  • Data Warehouse
  • SnowflakeDB
  • BigQuery
  • Parquet
    • Data Serialization Episode
  • Hive Metastore
  • Great Expectations
    • Podcast.__init__ Interview
  • Optimistic Concurrency/Optimistic Locking
  • Presto
  • Starburst Labs
    • Podcast Interview
  • Apache NiFi
    • Podcast Interview
  • Tensorflow
  • Tableau
  • Change Data Capture
  • Apache Pulsar
    • Podcast Interview
  • Pravega
    • Podcast Interview
  • Multi-Version Concurrency Control
  • MLFlow
  • Avro
  • ORC

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-06-17  50m
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you'll get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and one opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units. Go to dataengineeringpodcast.com/linode (that's L-I-N-O-D-E) today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page, and Data Engineering Podcast listeners get two months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off, or you can get the early bird pricing until August 30th for $200 off your ticket. Use the code BNLLC to get an additional 10% off any pass when you register, and go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register for this and other events. And you can go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Your host is Tobias Macey, and today I'm interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads. So Michael, could you start by introducing yourself?
Michael Armbrust: Yeah. Hi, my name is Michael Armbrust. I'm an engineer at Databricks, and I'm also the original creator of Spark SQL and Structured Streaming. And now I'm the tech lead for the Delta Lake project.
Tobias Macey: And do you remember how you first got introduced to the area of data management?
Michael Armbrust: You know, that's a good question, but I think the moment when the decision was made was back when I started my first internship at Microsoft. I remember, you know, I was a college undergrad, excited to have this internship, and I was given a choice between SQL Server and the C# runtime, and I had no idea which to pick, so I just kind of picked randomly. And ever since then, data management has followed me. When I went to UC Berkeley, I tried to do security, I tried to do operating systems, but I just couldn't resist the siren song of the database.
Tobias Macey: So a flip of the coin just dictated the rest of your career. That's funny. And so as you mentioned, you've been involved in architecting and overseeing the work on Databricks Delta, which has now become Delta Lake. So can you start by giving an overview of what the Delta Lake project is and the motivation for creating it?
Michael Armbrust: Yeah, so Delta Lake is a transactional storage layer designed to integrate deeply with Apache Spark, and also to take advantage of the cloud. And of course, it also works on prem. What it does is it brings full ACID semantics, which means that when you perform large scale Spark jobs, we're guaranteeing that what happens happens either completely or not at all. We also guarantee, if there are multiple people working at the same time, that they'll be isolated from each other and always see a consistent view of what's happening. And we do that while still maintaining the scalability and elasticity properties of these underlying systems. And as we've expanded the project, it's actually moved a lot into additional features for incrementally improving data quality. We first started it in collaboration with this team at Apple. It started as a conversation at a Spark Summit, this big conference that we have for the open source Apache Spark project, where I was just hanging around and this engineer came up to me. He runs the infosec team at Apple, and basically his problem was, he has network monitors all over Apple's corporate network that ingest basically every single TCP connection every day, every HTTP connection, everything that goes on in the network. We're talking trillions of records per day and petabytes of data. And he wanted to build a system with Apache Spark that could archive this, that they could use for detection and response. And my response was, whoa, it's going to tip over, but I think we can solve it. So we kind of collaborated and built it, and then it became a much more general tool after that.
Tobias Macey: That's funny that yet again, an accidental happenstance helped dictate your sort of future work, and something that has ultimately ended up being something that I'm sure is providing a lot of value to people outside of both Databricks and Apple.
Michael Armbrust: Yeah, it was really serendipitous. You know, I think it was at this customer advisory board, which normally kind of sounds like a boring thing, but now that is my favorite meeting of Spark Summit, because you get to talk to so many interesting people about their problems. And finding those problems, I think, is what really allows you to build cool software.
Tobias Macey: Yeah, it's easy to sort of go off into the vacuum of thinking that you know what the best solution is for any given problem, or understanding even what the problems are that somebody might run into. But ultimately, all of that is worthless unless somebody actually uses it. And so being able to get that firsthand feedback of "I've been trying to use this and these are the pain points I'm running into" helps you understand where you can provide engineering that will actually end up being useful to somebody, versus just wasting time on something that nobody ultimately cares about. It's definitely great to be able to have that feedback loop built into the community.
Michael Armbrust: Yeah, definitely.
Tobias Macey: And so Delta Lake is targeted at addressing some of the issues in data lake implementations and some of the modern trends in terms of how data warehouses and data lakes are used, and just trying to provide some level of sanity to handling some of these workloads that have large volumes of data, as you mentioned. And I'm wondering, before we get too much into the Delta Lake implementation itself, what you have seen as some of the common anti-patterns, edge cases, and difficulties in the data lake space that you're working to address with Delta Lake.
Michael Armbrust: Yeah. So I think the biggest anti-pattern that I've seen in the data lake space is people believing that all you need to do is collect the data and dump it into a data lake, and it'll be ready for consumption by end users, whether that's machine learning, or reporting, or business intelligence, or whatever. I think the ability to collect anything without thinking about it is really powerful in a data lake, but you have to plan to do this kind of whole cleaning step to figure out what the data is, what's wrong with it, what's missing, what needs to be augmented. And basically, what I've seen is that when you get to that step, this kind of cleansing, refining process, I have seen so many one-off solutions that try to do this in a correct way. But it turns out it's very hard to get transactional semantics and exactly-once semantics right, and so trying to do that for each application, I think, is just setting yourself up for failure.
Tobias Macey: Yeah, I had a great conversation a while ago about the overall concept of data curation, and how data lakes provide an easy way of gathering data, but that ultimately it's not useful until you've had somebody test out what they can actually do with it, and then use that to inform ways to clean and augment and enrich the data and then ultimately land it in a data warehouse, potentially. And I'm wondering what your thoughts are on the benefits and trade-offs of data lakes versus data warehouses, and whether data warehouses in general are a necessary technology and architectural paradigm in the current landscape of data capabilities.
Michael Armbrust: Yes, I think the biggest benefit of a data lake over a data warehouse is the cost, the scale, and the effort that it takes to ingest new data. So, you know, data lakes are much cheaper, and you can store huge amounts of data without doing anything. But really, I think the key thing here is the effort that it takes to collect data. By bringing the cost to ingest a new data source down to almost zero, you allow people to capture everything, and you can then later figure out what's actually useful. With a traditional data warehouse, you kind of always have to start by creating the table and defining its schema. And while that's useful, and you need to do that eventually, you don't always want that to be the first step. I think it's actually great to start with these raw data sources, and then later figure out how to clean them and make them useful for insight. Because sometimes you don't realize the value of a piece of data until years later, when you can actually join it with some other piece of data and actually get insight from it. So that's why I think a data lake is better than a data warehouse. You know, today data warehouses have some advantages, concurrency and performance, but I think in a couple of years we'll see if that is still the case.
Tobias Macey: And I know the recent additions to the data warehouse market that are leveraging cloud native technologies, so things like Snowflake or BigQuery, have tried to address some of the trade-offs of data lake versus data warehouse by letting you land raw data and then do some of the transformations within the data warehouse itself. And I'm wondering what your thoughts are in terms of how that changes the equation, as far as what technology stack to use and when.
Michael Armbrust: Yeah, so I think the biggest benefit of the cloud is it makes it possible to do all of these things without any upfront cost. It's basically the elasticity, and the fact that somebody else is handling that elasticity for you. So you not only don't have to build out a data center to get started, you also don't have to be experienced enough to build out a data center. You can kind of pay as you go, get started collecting data and cleaning it using these tools on top of it, and then, as you grow, the system will kind of seamlessly scale with you. And so I think really that kind of fundamental change in the barrier to entry is why the cloud and a lot of these more modern technologies are really exciting.
Tobias Macey: And so for Delta Lake itself, I know that you said that the original idea came about with this conversation with the Apple engineer trying to capture all of their packet traces and analyze them, and I'm wondering how Delta Lake itself has been implemented, and how the overall design has evolved since you first began working on it and getting other customers involved in testing it and using it for their own various workloads.
Michael Armbrust: Yeah, so Delta Lake started as just a scalable transaction log. So it was a collection of Parquet files out in your storage system, with this record of metadata alongside that allowed us to build atomicity, but also scalability of the entire metadata plane. Like, the kind of cute trick that Delta does is the transaction log itself is actually written in a way such that it can also be processed by Spark. So when you have metadata that is thousands to millions of files, that's actually becoming a data problem in itself. And so by using Spark here, we were able to very quickly build a system that had the same scalability properties of Spark, but gave us these nice ACID semantics that people were used to having. So that was the beginning, and that was kind of the core thing, and we built that first. But the way it's evolved is, we realized that that's only one step. Once you have this powerful underlying ACID primitive, you can do so many cool things on top that just weren't possible before. So you can build the whole suite of DML operations that people expect from a traditional database: update, delete, merge into. You can build streaming. And now, as we've continued to evolve, we've started to even get into the ideas of expressing quality constraints on the data as well.
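Michael's description of the log can be sketched in a few lines of Python. This is a toy model for illustration only: the action names and structures here are invented, not Delta Lake's actual log format.

```python
# Toy model of a Delta-style transaction log: the current table state is
# derived by replaying an ordered list of committed transactions, each a
# set of actions. Action names ("add", "remove", "set_schema") are
# illustrative, not the real protocol's.

def replay(log):
    """Fold a list of committed transactions into the table's state."""
    files, schema = set(), None
    for transaction in log:          # each transaction commits atomically
        for action, payload in transaction:
            if action == "add":
                files.add(payload)
            elif action == "remove":
                files.discard(payload)
            elif action == "set_schema":
                schema = payload
    return {"files": sorted(files), "schema": schema}

log = [
    [("set_schema", ["id", "vin"]), ("add", "part-0000.parquet")],
    [("add", "part-0001.parquet")],
    [("remove", "part-0000.parquet"), ("add", "part-0002.parquet")],
]
state = replay(log)
# state["files"] == ["part-0001.parquet", "part-0002.parquet"]
```

The point of the fold is that the log, not a directory listing, is the source of truth; readers derive the table state by replaying committed transactions in order, which is also what makes the metadata itself amenable to processing with Spark at scale.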
Michael Armbrust: So it really, you know, started with just this idea about scalable transactions, and it became an entire system for managing and improving the quality of the data in your data lake.
Tobias Macey: And as far as the ideas that you had going into it, and the assumptions about what the needs were and the ways that Delta Lake would ultimately end up operating and being used, I'm wondering what your thoughts were at the outset, and how those assumptions and ideas have been challenged and updated as it has been used and expanded in terms of the scope and capabilities.
Michael Armbrust: Yeah. So, thinking back, when we were first designing Delta, we looked a lot at the systems that people were using to solve these problems. One of them in particular is the Hive metastore, where people store all of the metadata today about the data that's in their data lake. And I think one of the initial assumptions I had was that if you built a system that managed its own metadata in a much more scalable way, you could actually just completely dump the metastore in one fell swoop. But I think what that was missing was the fact that data discovery is actually its own challenge that needs solutions as well. So as the project evolved, we actually ended up building full support for storing pointers to your data in the metastore so that people could discover it, and comment on it, and other stuff there. And I actually think that's a place where, as the project continues to grow, we're going to see a lot more investment. It's not just about schema enforcement and expectations and transactions. It's also about making the data available to people in a place where they can find it and understand it as well.
Tobias Macey: As you mentioned, one of the features that you ended up adding more recently is the idea of enforcing these quality constraints on the data coming into the storage layer. And I know that that is something that is often either overlooked or undervalued when dealing with quote unquote big data or working with data lakes. And so I'm wondering how the definitions of these constraints are established and validated, and just the overall workflow of building these constraints, building the validation tests, and what happens when a record comes in that doesn't fit the expectations, and how you can address those out of band so that you don't just completely throw away the data and lose it, or stop all processing and block everything else that's trying to come through.
Michael Armbrust: Yeah, as you said, exactly the word that I like to use when I think about data quality within a data lake is expectations. And that's actually the name of a feature that we've been working on for a while, and that we'll be open sourcing this quarter. An expectation is a way for a user to tell us about their domain's definition of quality. So you can say things like, for this table, I expect this column to be a VIN for a car, so it needs to start with a letter and have this many numbers and it must be present, or whatever it is for your particular domain. Expectations are similar to the concept of invariants in a traditional database, but I think something that rigid just doesn't work in the big data space. An invariant would just reject anything that doesn't fit into the predicate that the user has given, and that works, and we support that mode as well. But I think in a big data system, if every time you see something unexpected you stop processing and make a user intervene, you're never going to get anything done. And so what expectations bring to the table is the ability to tune the severity of what happens when an expectation is violated. So you can fail-stop, like abort any transaction that violates this expectation, if something is very important, or if this is a table where it's getting towards the end of the quality journey and you don't want to allow any bad data into it; accounting is going to read this table directly, for example. But you can also tune them down, so you can say things like, I want to just alert when the number of failed expectations goes over some threshold, or I only want to fail the transaction if it's above some threshold. And I think moving forward we'll have even more powerful functionality here, where you can also quarantine the data.
So you can basically say, I don't want to let data that does not meet my expectations go any further in my pipeline, but I don't want to lose it either. So I'd like to redirect it to some other table where a human can come in on their own schedule, figure out what's wrong with it, and remediate the situation. And expectations are deeply built into the system, so we can actually give you insight into: this was the record that came in, here's how you processed it, here's the expectation that it violated. And so it basically brings debuggability to your data pipelines when it comes to quality.
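The tunable severities Michael describes (abort the transaction, or let good records through while quarantining the bad ones) can be sketched as follows. The API here is hypothetical, written to illustrate the idea rather than the actual expectations feature.

```python
# Sketch of a tunable expectation check. The function, its modes, and the
# VIN predicate are all illustrative, not Delta Lake's real API.

def check_batch(records, predicate, mode="fail", threshold=0):
    """Apply one expectation to a batch with configurable severity.

    mode="fail":       abort the whole batch if violations exceed threshold
    mode="quarantine": pass good records on, divert bad ones to a side table
    """
    good = [r for r in records if predicate(r)]
    bad = [r for r in records if not predicate(r)]
    if mode == "fail":
        if len(bad) > threshold:
            raise ValueError(f"expectation violated by {len(bad)} records")
        return good, []
    return good, bad  # quarantine: bad records kept, but out of the pipeline

# Expectation from the example above: a VIN column must be present
# and (simplifying) 17 characters long.
is_valid_vin = lambda r: r.get("vin") is not None and len(r["vin"]) == 17
records = [{"vin": "1HGCM82633A004352"}, {"vin": None}]

good, quarantined = check_batch(records, is_valid_vin, mode="quarantine")
# one record passes, one is diverted for later remediation
```

In quarantine mode the diverted records would land in a side table, preserving the data while keeping the main table clean, so a human can remediate on their own schedule as Michael describes.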
Tobias Macey: And another thing that factors into the overall idea of data quality, particularly in the long term, is how you manage schema evolution, especially with these large volumes of data and the potential for different varieties and levels of consistency depending on the source. And so I'd like to get your perspective on how you approach schema evolution in a data lake context, whether you periodically go through and reprocess all of the old records and either update column definitions or remove them, or just elide the fact that they exist within the metadata representation, and just your overall thoughts on maintaining consistency of a schema over potentially terabytes or petabytes of data.
Michael Armbrust: Yeah, so I'm not sure there's one right answer here, and Delta actually supports a variety of different ways to think about schema evolution. We automatically support safe schema evolution, where you can add columns with data types that don't conflict with anything that's already there. You can also rewrite all of the data, or we also support kind of reprocessing. But to me, when I think about this problem and how I manage my own internal pipelines that I run for Databricks' managed platform, I like to think back to some rules that I learned when I was at Google about how they evolved schemas, because they really think about these things as contracts between people. And so I generally have a rule that once a column has existed and other people have consumed this data, it's often actually a bad idea to rename that column, or to reuse that name for something that has different semantics. If you find something wrong with it, hey, this column is wrong in this weird way, it actually wasn't calculating what we want, a better option often is to deprecate that column, create a new one with a clear name, and then switch people over to that. Because if you change stuff out from under people who are basically the end consumers of your data, that can have other unexpected consequences.
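The "safe" evolution rule Michael describes (additive changes only, never retyping an existing column) can be sketched like this. The helper and its dict-based schema representation are illustrative, not Delta Lake's actual merge logic.

```python
# Sketch of safe, additive-only schema evolution: brand-new columns are
# allowed, but an existing column may not silently change type. The
# schema-as-dict representation here is a simplification for illustration.

def merge_schema(current, incoming):
    """Return the evolved schema, or raise if the change is unsafe."""
    merged = dict(current)
    for column, dtype in incoming.items():
        if column in merged and merged[column] != dtype:
            raise TypeError(f"type change on column {column!r} is not allowed")
        merged[column] = dtype  # adding a brand-new column is safe
    return merged

current = {"id": "long", "vin": "string"}

# Adding a new column succeeds...
evolved = merge_schema(current, {"id": "long", "ingest_ts": "timestamp"})

# ...while retyping an existing column would raise TypeError, pushing you
# toward the deprecate-and-add-a-new-column approach described above.
```

Renames are implicitly covered too: a renamed column looks like one removal plus one addition, and since nothing is ever removed here, the old contract with downstream consumers stays intact.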
Tobias Macey: And another problem too is that if you have ETL jobs that are populating the data lake, or that are working on the raw data that gets landed into the data lake, managing version updates when they're trying to process historical records is a complicated issue. And as you said, there isn't one universal solution, but it's definitely something that bears a lot of consideration, and merits a good amount of upfront planning as you're first implementing some of these systems, so that you don't end up boxing yourself into a suboptimal situation where you don't have any way to work your way out of it.
Michael Armbrust: Definitely, definitely. That's why, out of the box, we support what I consider to be the safe way of doing it, where you certainly can reprocess, and you certainly can add new additional information, but you generally don't want to go and change semantics. I find that that can confuse people, and you don't even know who is relying on those semantics.
Tobias Macey: And so the primary purpose of Delta Lake, as you said going into it, was the idea of transactionality and managing that across these large volumes of data, potentially with multiple users. And so I'd like to get an understanding of how you ended up implementing that, particularly when you have issues or conflicts in terms of data ownership, where Delta Lake is responsible for processing the data and adding transactionality on top of it, but you might have some other system that's either writing new data or consuming data that Delta Lake is working with, or potentially even manipulating it in some way.
Michael Armbrust: Yeah, so let me start by describing transactionality in Delta. Basically, as I said before, Delta is a whole bunch of Parquet files stored in your storage system of choice, plus a transaction log. And the transaction log has files in it that are each atomic units, transactions, and each transaction has a set of actions that are taken against the table. So an action could be: I'm going to change the schema of this table, or I'm going to add this file to this table, or I'm going to remove this file from this table. And we also have some other kinds of things for upgradability and for idempotency of transactions. But really, you can think of it as this relatively simple list of things that you're going to do to the table, and when you get to the end of those actions, you now have the current state of the table. So people modify a Delta Lake table by adding a new transaction to it. When we have multiple readers and writers, we basically only need one property from the underlying storage system, and that's mutual exclusion. So if two people try to create the same file for the same version of the table (say I'm at version zero of the table, and two people try to create version one), the thing that we need from the underlying storage system is that it allows one of them to succeed and tells the other one, no, I'm sorry, this file already exists. You cannot have two writers that both think they succeeded. And when there's a conflict like this, we basically use standard optimistic concurrency. When you begin your transaction, you start by recording the version that your transaction began at. And then when you go to commit, you see what versions exist, and you check to see how those things that happened since you started, and since you're about to commit, have changed the table. In many cases, they haven't changed the table in any interesting way.
So you can just ignore them and go ahead and commit anyway. But in some cases, where you actually conflicted with them, you need to retry your operation. So let me give an example here: a very common use case is two streams writing into the same table. Well, they both read the schema of the table, and then they write files in, and that's it. But since they only read the schema of the table, and since the schema isn't changing, it doesn't matter what order they committed in — you'll get the same result either way. And so Delta will just automatically retry until it succeeds. So that's kind of the underlying transactional core. Now, the question is, how do you use this with other systems? And I think answer number one here is, this is why we're so excited about open sourcing Delta Lake, because we can actually put the protocol spec out there so anybody can implement it. We've been in talks with the Presto people over at Starburst, and some of the committers on other Apache projects, about what it would take to build Delta connectors for these other open source projects. We're particularly excited about that. But even before that happens, I think this is where Delta Lake's integration with Apache Spark is especially powerful. I like to think of Spark as the skinny waist of the big data ecosystem. It has connectors to read every file format; it can read from Kafka, Azure Event Hubs, HDFS — wherever your data is stored, Spark can read from it and write to it. And so a pretty typical pattern that I see people set up is they use Spark on the periphery to read from all of these sources, they capture the data in some raw tables within their Delta Lake, and then they move it through progressively higher data quality tiers, and eventually out to some other system, whether that's machine learning using TensorFlow or reporting with Tableau.
But really what they've used Delta and Spark in the middle for is taking this raw data and getting it ready for consumption by downstream consumers.
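The mutual-exclusion commit protocol Michael describes can be sketched in a few lines of plain Python. This is purely illustrative — Delta Lake's actual log entries, file naming, and commit service differ — but it shows the one property needed from storage: exactly one writer may create the file for a given table version.

```python
import json
import os


def try_commit(log_dir, version, actions):
    """Attempt to write version `version` of a toy transaction log.

    Relies on O_EXCL: the filesystem lets exactly one writer create a
    given file, so two writers can never both believe they committed
    the same version. On a lost race, the caller should re-read the
    log, check for real conflicts, and retry (optimistic concurrency).
    Illustrative sketch only, not Delta Lake's implementation.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else committed this version first
    with os.fdopen(fd, "w") as f:
        for action in actions:  # e.g. {"add": ...} / {"remove": ...}
            f.write(json.dumps(action) + "\n")
    return True
```

In the two-streams example above, a writer that loses the race would see that the winning commit only added files (no schema change), conclude there is no real conflict, and retry its own commit at the next version number.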
Tobias Macey: And so in terms of the transactionality itself, I'm wondering if there is any limit, or any sorts of edge cases that come up when you're dealing with particularly large volumes of data that you're trying to encompass within the boundaries of a single transaction — such as if you have maybe a petabyte of data and you don't have any way to capture and process all of it in RAM to then write it all out atomically. I'm just wondering how you approach some of those edge cases or extremes within the idea of transactionality on these large volumes of data.
Michael Armbrust: Yeah, that's a great question. So in terms of the scalability question about Spark, or about Delta, I think there are some fundamental limitations in the number of tasks that Spark can schedule in a single stage. But even so, we've successfully scaled a single Spark job to over a petabyte, and that was like five years ago. So in general, you can process tons and tons of data. Delta Lake, of course, also keeps information about statistics in memory, but there's nothing fundamental preventing all of that from being pipelined. But really, I think the real answer to this question is: if you have a petabyte of data, and you want to process the whole thing and make sure that you process that entire petabyte, streaming can be a really powerful tool here. And I often have to explain this to people, because when people hear streaming, they usually think, "Oh, that's a more complicated system that I need to set up in order to get real-time semantics." But another really useful thing that streaming does is it can take a petabyte job, break it into much smaller jobs, and then execute each one of those, while guaranteeing that when you get to the end of the stream you have exactly-once semantics and you've processed that entire petabyte. So you can kind of shift your thinking: streaming is actually also a tool for easing the process of dealing with such a massive amount of data.

Tobias Macey: And it's worth digging deeper into this idea of streaming in the context of these data lakes, where, particularly for processing large batch jobs of historical data, the original approach — when Hadoop was fairly new and people were still trying to figure out how to handle these types of systems — was the lambda architecture. I've noticed that in recent years I haven't really heard that terminology come up very often. And I know that with Delta Lake you have unified the APIs for processing these large batch jobs versus streaming jobs into a single interface. So I'm curious how that manifests in terms of the ways that you approach designing and implementing these processing jobs, and also your thoughts on the necessity of the lambda architecture in general, given the current capabilities of our data platforms.
Michael Armbrust: Yeah, so why did people start with the lambda architecture originally? I think it was basically because there wasn't a way to do streaming and batch on a single set of data while maintaining exactly-once semantics. At the time, people believed that streaming was just too costly, and so the only way to do streaming was to do this approximate processing, and then later you had to do this more expensive, slower computation to get the correct answer. And so what Delta really brings to the table with its transactionality is it allows you to do both streaming and batch with exactly-once semantics on the same set of data and get the correct answer. So now there's no reason to have all of the complexity of maintaining two separate systems. And I think this is really important, because each of these different paradigms still has utility. Like I was saying before, streaming is not about low latency; it's about incrementalization. It's about splitting the job up. But it's also about not having a schedule — not having to worry about dependencies amongst jobs, making sure that they run at the right time, and worrying about what happens if this job finishes late. It's about figuring out what data is new since you last processed. And it's also about failure management: what happens when your job crashes in the middle and you need to restart it? Streaming takes care of all of those things automatically. It allows you to focus only on the data flow problem — on your business logic — and not the systems problems of how to actually run this thing efficiently when data is continually arriving. So that's why I think streaming is very powerful.
But in spite of all of this, batch jobs still happen. You may have retention requirements, where the business mandates that all data over two years old is deleted. You might have GDPR requirements, where some user has served you with a DSR and you have 30 days to eliminate them from your entire data lake. Or you might have change data capture coming from an existing operational store — so you have, say, your point-of-sale data that is coming in every day and is super important to your business, but you want to merge it into an analytical store where you can run the kind of long-term longitudinal analysis that you would never want to run on your actual operational store for performance reasons. And so bringing these together into a single interface, not requiring you to manage two completely separate pipelines, I think drastically simplifies the work of data engineers.
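The change-data-capture pattern Michael mentions — taking a batch of updates, inserts, and deletes and merging them into an analytical table as one unit — can be modeled in miniature with plain Python. This is a toy sketch of the idea behind a MERGE, not Delta Lake's actual operator: the "table" here is just a dict keyed by primary key, and the `(op, key, row)` record shape is hypothetical.

```python
def apply_cdc_batch(table, changes):
    """Apply a batch of change-data-capture records as a single unit.

    `table` maps primary key -> row. Each change is a tuple
    (op, key, row) where op is 'upsert' or 'delete'. We copy on
    write so the old snapshot stays readable while the new version
    is built -- loosely mirroring the MVCC behavior discussed later.
    """
    new_table = dict(table)
    for op, key, row in changes:
        if op == "upsert":
            new_table[key] = row
        elif op == "delete":
            new_table.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return new_table
```

The point of batching is visible even in the toy: one call applies many changes in a single "transaction," instead of paying transaction overhead per record.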
Tobias Macey: Going a little bit further afield here, I'm wondering what your thoughts are on systems such as Pulsar, which has the ability to natively tier the storage — or sort of age out the data that is maintained in this streaming and queuing architecture — or things like Pravega, which provides a stream-native storage layer for workloads such as Flink and, I believe, Spark as well. What are your thoughts on the capabilities and limitations of those approaches, versus what you're doing with managing Parquet files in these cloud storage architectures and maintaining a unified interface within Spark for being able to work across stream and batch jobs?
Michael Armbrust: Yes, I have to be honest, I'm not super familiar with those two systems, but I can talk about what we do inside of Delta with respect to this. And basically, this comes down to the same trick that we were playing before, which is leveraging the underlying storage systems and letting them do what they do best while providing nice guarantees on top. So Delta Lake is fully compatible with S3's Glacier storage, and Azure has a cheaper tier of storage as well. And I think being able to leverage those systems is very important, because you want to keep a lot of data, but you only want to pay for what you actually need. And it turns out most data you don't need fast access to.
Tobias Macey: And on that idea of managing cost and access to data, particularly in historical contexts, I'm interested in the ways that you approach data versioning within Delta Lake, because I know that's something you have also built in, and it's particularly useful when you're doing these transactional workloads or working with multiple systems that might be interacting with the data. So I'd like to understand how you approach minimizing the differences between the different versions of the data sets, how those different versions manifest, and how the Parquet file format helps in any of those efforts.
Michael Armbrust: Yeah, so data versioning in Delta Lake is implemented using MVCC, or multi-version concurrency control, which is just a fancy way of saying that when we change something in the table, we make a new copy. There are a couple of reasons for that. One, it's simple and it works — it's a great way to do versioning. But it also works really well with eventually consistent storage systems like S3: if you never modify anything, then there is no such thing as eventual consistency, which is great. It's very easy to reason about the correctness. So what we end up doing is, every time we change something, we make a new copy, and then the transaction log is the authoritative source for which files are part of any version of the table. You can basically look at the transaction log and play it up to the end to see what the table currently is, but you can also stop playing it somewhere in the middle to see what the table looked like at that moment. And this is a really powerful feature, because it gives you snapshot isolation. People who have started a job on an older version of the table will just continue to see that version of the table. There are no inconsistencies, no failures because data is being deleted out from underneath you — you just continue to see a consistent snapshot of the table for the duration of your job. But it also allows you to do cool things like time travel. So we've been working a lot with the MLflow guys — MLflow is another open source project that basically allows you to manage the entire life cycle of machine learning, from training and tuning parameters to figuring out what really is the best way to build your model. And when you combine MLflow's tracking with Delta Lake's versioning, you have a really powerful tool for experiment reproducibility. So when you tweak parameters in your model, you can actually go and see: was it the data that changed? Or was it the changes in my tuning parameters that changed the efficacy of my model? And that's something I think really just wasn't possible before.
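The "play the log up to a point" idea behind snapshot isolation and time travel is easy to demonstrate with a toy log. This is an illustrative sketch, not Delta Lake's actual log format: here the log is just a list of action lists, each action a hypothetical `('add', filename)` or `('remove', filename)` pair.

```python
def snapshot(log, version=None):
    """Replay a toy transaction log and return the set of files that
    make up the table at `version` (inclusive); replay the whole log
    when version is None.

    Replaying a prefix of the log is what gives you time travel: an
    old reader keeps using the file set of the version it started at,
    untouched by later commits.
    """
    files = set()
    entries = log if version is None else log[: version + 1]
    for actions in entries:
        for op, filename in actions:
            if op == "add":
                files.add(filename)
            elif op == "remove":
                files.discard(filename)
    return files
```

Because files are never modified in place, every historical file set remains valid for as long as the underlying files are retained.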
Tobias Macey: Another thing that I'm curious about with the decision to standardize on Parquet is what the primary benefits are that it provides as opposed to ORC or Avro — I mean, it's pretty obvious what it provides compared to JSON — but what the decision-making process looked like when you were determining which technologies to standardize on, and any cases where Parquet itself has actually led to a greater deal of complexity that you've had to work around.
Michael Armbrust: Yeah, so really the reason here is that Parquet is never a bad choice. You know, as you pointed out, it's better than JSON. And in general, when I build systems, I believe in building a system with as few knobs as possible. I think what that does is it guides users, especially ones who are less experienced, into making good choices. One of my favorite stories here: we had a customer with a massive table — they were storing all this data in SequenceFiles, kind of an older Hive format — and they loaded the data into Delta. They made no other changes to it, they ran their job, and it went from taking hours to taking 15 minutes. And they were like, "Oh my God, Delta is the greatest thing ever." And I was like, "Well, really, I just forced you to use Parquet." Delta does have performance features, but they didn't hit any of them that time. So I think just by making a good choice, you can build a system that is much easier for people to use. The reason we picked Parquet specifically is just because it has good integration with Spark. Spark has a vectorized Parquet reader that reads it directly into vectors that go directly into whole-stage codegen — there's just a really fast path into the Spark execution engine. Now, to be clear, the actual underlying transaction protocol that Delta Lake uses is agnostic to the format, and it actually already has support for any file format that Spark supports. So you could use it with JSON or CSV or ORC or whatever. And in fact, we have other systems inside of Databricks that use the transaction log that way. But in general, when I'm exposing it to users, I want to simplify things; I want to guide them to making good choices. As we continue to expand the open source project, though, I think it's very likely that there will be large companies who already have tons of data in ORC, where it might make sense to actually open up this knob for advanced users.

Tobias Macey: And diving into the open source aspect of Delta Lake, I'm wondering how that has impacted your overall product roadmap and your approach to development now that it is public, as opposed to being an internal product that you were managing directly with customers. Now that you have a broader community of people who are using it, and potentially contributing to it, are there any changes that you've made to the way that the team is structured or the way that your development cycles are designed? And what are your overall thoughts on the benefits and trade-offs of providing this as an open source project?
Michael Armbrust: Yeah, so this is all very young — we've only been doing this for a little over a month now — but I can definitely tell you what we've seen so far. Really, what it comes down to is that Databricks has a long history of success with open source. We started with Apache Spark; Spark SQL was originally created here and donated into Apache. So it's kind of in our DNA to do this, and it's been really exciting to see the uptake. I think the fundamental difference here is that now people can start with Delta even when they're not running in the cloud, or not running on Databricks. There are a lot of users who have HDFS clusters on-prem who just never had the ability to get this transactionality and the other features. In terms of changes, it's just more open now, which is great — we can get feedback while we're in the coding process, and we've already had a bunch of contributions to the open source repository. What I'm really excited about, coming up in this next quarter, is that as we release some of these APIs for doing declarative management of data and data quality, we'll really get to co-develop them with customers and the greater open source community, rather than having it be something where we build something in secret, dump it over the wall, and then see if people like it. It can be more of a collaboration, and I think that's pretty good. In the short term, it means that my team — we're called the stream team inside of Databricks — is almost entirely in the process of open sourcing more and more parts of Delta Lake. I believe pretty strongly that APIs need to be open in order for a system to be successful.
And so I want to take all of the APIs that are currently proprietary and push them into open source. This is update, delete, and merge; this is vacuum; this is history. All of this needs to be pushed into open source. That's basically our whole roadmap for this quarter: to untangle those from our internal platform and get them out there so they work with just stock Apache Spark.
Tobias Macey: And so going back to the Delta Lake project itself, I think the last major piece that we haven't talked about yet is the capability of creating indexes across the data that you're storing in your data lake. I'm wondering how that manifests, and what issues you have encountered as far as scalability and maintaining consistency of those indexes as the volumes and varieties of data that you're working with grow.
Michael Armbrust: Yeah, so I think traditional secondary indexes would be very difficult to keep in sync — all the problems you just talked about would come up. So we have a couple of tricks that we play that give you similar performance to indexing, but operate a little bit differently. We support standard partitioning of data, where basically you say partition by date, or partition by region — some relatively low cardinality grouping of the data — and as data is inserted into Delta, we automatically segment it into these different partitions. So that's a standard trick that Spark and Hive and these other systems have supported. What Delta changes here is, first of all, we make the management of those partitions totally scalable by taking the metadata and turning it into a data problem. So you can have a table that has millions of partitions, and Delta can still handle that rather than tipping over. The other trick we play is with some of these DML operations. For example, take a delete: let's say I partition by date, and I want to delete all data that is older than a year. Well, the magic of having that partitioning there is that this can be a metadata-only operation — we can perform it simply by changing the transaction log, without doing any actual processing.
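A metadata-only delete can be sketched in a few lines of plain Python. This is purely illustrative — Delta's real metadata layout differs — with a hypothetical `log_state` mapping each data file to its partition value, so dropping a partition means emitting remove actions against the log rather than touching any data files.

```python
def delete_partition(log_state, predicate):
    """Metadata-only delete: drop every file whose partition value
    matches `predicate` by recording remove actions, without
    scanning or rewriting any data files.

    `log_state` maps filename -> partition value (e.g. a date
    string). Returns the list of files that would get 'remove'
    actions in the next log commit. Illustrative sketch only.
    """
    removes = [f for f, part in log_state.items() if predicate(part)]
    for f in removes:
        del log_state[f]
    return removes
```

The work is proportional to the number of affected files in the metadata, not the bytes of data behind them — which is why a partition-aligned delete on a petabyte table can be nearly instant.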
Tobias Macey: So that's pretty powerful. And just overall, in your experience of building Databricks Delta — now Delta Lake — going through the process of open sourcing it, and evolving the system as a whole, what have you found to be some of the most challenging or complex aspects of it? And what are some of the lessons that you've learned in the process that were particularly unexpected or valuable?
Michael Armbrust: Yeah, so there have been a couple of challenges. One is actually education. You were asking about Parquet and what complications it led to — one problem is Parquet is almost too open, in that as soon as people started using Delta Lake, they would also go and start trying to read those Parquet files directly without understanding the transaction log. So a lot of people would say, "Oh man, it's duplicating my data," not understanding that the MVCC aspect was actually by design. So we've spent a bunch of time adding guard rails around the system to make it possible to use it without needing to understand everything that's going on under the covers. The other aspect is just how many different workloads exist in the world, across all different dimensions. Delta is now used by thousands of customers; there are over 100,000 Delta tables in existence; last month we processed over an exabyte of data. But what that means is people have stressed every different dimension of this system, and I have seen jobs crash in ways that I never expected. We've gone in and fixed each little individual scaling bottleneck so that it runs without you needing to set a lot of tuning knobs. Going back to the point we talked about before, I think that's really important to building a usable system: minimizing the amount of tuning that you need to do in order to use it successfully. And so that's been both very interesting and very challenging to make possible.
Tobias Macey: And going back to the duplication of data that your customers have been seeing, and issues in terms of cost control, I'm wondering what some strategies are that you can employ — whether it's something like compaction, or going back and pruning old versions of data — and any cases where that might be a bad idea, or cases where it would be necessary, either in terms of space savings or cost savings, particularly when using cloud resources.
Michael Armbrust: Yeah, so Delta supports a command called vacuum, which basically looks at a table, compares what's on the file system with what's in the transaction log, and does a left anti join to find everything that, by definition, shouldn't be there anymore and remove it. And this is totally tunable. When you delete a file from Delta, we create a tombstone that tells us it was deleted, and by tuning that retention window, you can decide how long you want to keep stale snapshots for. At one far end of the spectrum, if you vacuum retaining zero minutes, as we call it, that would basically just turn it into a normal Parquet table: it would remove everything that is not part of the current snapshot. And so now it is just a normal Parquet table, and you could read it with other things, and there's no extra storage being consumed other than the transaction log, which is pretty small.
Michael Armbrust: At the other end, we have customers who, for compliance reasons, want to never delete anything, and they just don't run vacuum. They're willing to pay the cost of keeping everything forever, because it means if they get audited, they can go back and say, "Here's exactly what the data looked like on that day."

Tobias Macey: And so Delta Lake is definitely a very interesting system — it appears to be very well designed and to provide a lot of great value in terms of useful defaults and interesting capabilities that will potentially obviate the need for data warehouses by adding transactionality to these data lake systems. But when is it the wrong choice?
Michael Armbrust: So Delta Lake is not an OLTP database — you should not try to use it for lots of small transactions. If you have a lot of tiny updates and you apply each of them individually, it will be horribly inefficient. Really, what we're looking for is scale and throughput; that's what we've actually optimized for. So if you can take all of those updates, feed them into Kafka, and apply them in batches, that's a great pattern. But if you try to call update every time one arrives, you're certainly going to back up the system. Similarly, I would not use this as a key-value store. It's not good at doing individual point lookups. But it is great at doing massive table scans, and it's even good at locating relatively small needles in a haystack without the cost of actually having a key-value store holding all that information.
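The "feed updates through Kafka and apply them in batches" advice boils down to grouping individual records into larger units before committing each one as a transaction. A minimal, framework-free sketch of that grouping (the batch size and record shape are arbitrary here):

```python
def batch_updates(updates, batch_size):
    """Group a stream of individual update records into batches so
    each batch can be applied as one transaction, instead of paying
    commit overhead per record. Illustrative sketch of the pattern.
    """
    batch = []
    for update in updates:
        batch.append(update)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing partial batch
        yield batch
```

In practice a consumer would also flush on a time interval, not just on size, so slow periods don't delay data indefinitely.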
Tobias Macey: And as you have been building and evolving Delta and Delta Lake, I'm wondering what are some of the problems and opportunities that you could have addressed but consciously decided not to work on?
Michael Armbrust: Yeah, so Delta — at least the current implementation; this is nothing to say about what the format would actually allow — doesn't implement its own processing. And actually, in some early versions we tried, and we learned that was a bad idea. So I'll tell you a story. In the first version, when we built the merge operator — this is where you take a set of changes, updates, inserts, and deletes, and apply them as one batch into the table, for things like change data capture — we actually wrote our own implementation of join that we thought would be slightly more efficient, and kind of hand-coded it. And when we benchmarked it, it turned out Spark's joins with whole-stage codegen are much better and much more scalable: they spill to disk, they do all this stuff. So Delta, in general, is actually just a bunch of cleverly written Spark SQL queries that then leverage Spark's processing. We try not to get into that business. When we need to fix Spark, we actually go to Apache Spark and fix it instead.
Tobias Macey: And so looking forward for the project, I know you said that in the near term you have some work to extricate some of the code from the internal repositories and bring it into the open source repo, as far as some of these DML capabilities. But looking to the medium and long term, what do you have planned for the future of Delta Lake?
Michael Armbrust: Yeah, so what gets me really excited is the idea of having a full declarative spec for defining the graph of data pipelines that you need to run. What I've seen is, as soon as someone starts with Delta, starts with streaming, and starts with the other capabilities, they almost immediately go from one Delta table and one stream to thousands of Delta tables and thousands of streams — we actually have customers who have that many. And then the orchestration and the management become very difficult. So we're going to start this quarter — and I think we'll continue this for the next year or so — open sourcing APIs that allow you to declaratively specify the layout, the quality constraints, the metadata, and the human-readable descriptions of the data at rest within your Delta Lake, but also the flows of how data moves between these different systems. The goal here is to really ease the burden of the people who have to manage these production data pipelines. I like to think about the whole life cycle here. People usually just think you build a production pipeline and you're done, but really there are a bunch of different steps. You start by building it locally: you write some code in your IDE — you write some Python, you write some Scala. You want to test it so that you know it's actually correct and that it stays correct. Then you need to deploy it. But that's actually only the beginning: then someone comes to you with a new requirement, or you find a bug, so you need to be able to start that cycle again and upgrade it. You need to monitor it so that you know it's working and that you're meeting your SLA, and you need to tune it so that you're minimizing your costs.
And so our idea here is, with Delta, to give you the language for describing this whole system — to make it easy to do the testing and the deployment, and really make this whole system more declarative.
Tobias Macey: And are there any other aspects of Delta Lake or Spark, or the work that you're doing at Databricks, or just the overall landscape of data lakes and data warehouses, that we didn't cover yet that you'd like to discuss before we close out the show?
Michael Armbrust: No, I actually think I'm tapped out.
Tobias Macey: All right, well, for anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Michael Armbrust: Yeah, to me, the biggest gap is cobbling together all of the different solutions that you need to build something that is correct. I think Spark started this journey by unifying a bunch of APIs into a simple system that you could use, but that didn't handle storage, and it didn't handle metadata. And so with Delta Lake, I think we're trying to bridge that gap. There's still a lot of work to be done — like I said, it's still a lot of work to actually manage a full production pipeline — but that's the gap that I see, and it's something we're trying to address.
Tobias Macey: All right. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Delta Lake. It's a very interesting system, and I was excited when I first saw the announcement. So I appreciate you taking the time to discuss your experiences of working with it. I definitely plan to dig a bit deeper into it and keep track of it as it continues to progress. So thank you for all of that, and I hope you enjoy the rest of your day.
Michael Armbrust: Yeah. Thanks for having me.