00:12 Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows. So try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host is Tobias Macey, and today I'm interviewing Stavros Papadopoulos about TileDB, the universal storage engine. So Stavros, can you start by introducing yourself? Absolutely, Tobias, thank you very much for having me.
I'm Stavros Papadopoulos. I'm the co-founder of TileDB. I'm a computer scientist, and I'm excited to talk about TileDB and everything we have done. Do you remember how you first got involved in the area of data management?
02:03 Stavros Papadopoulos
I did my PhD in databases, so that's how I started. I have always been in the databases space. I was at the time focusing mostly on multi-dimensional data structures, data privacy, and cryptography. But then in 2014, I joined Intel Labs and MIT, where I worked on a big data initiative alongside some database gurus at MIT, as well as some high performance computing ninjas at Intel Labs. And this is where everything started.

02:26 Tobias Macey
And so you have recently started building the TileDB project. Can you give a bit of an overview about what it is and some of the problems that you're trying to solve with it?

02:36 Stavros Papadopoulos
I'm going to start by explaining in a nutshell what it is. So it is a novel engine, a novel kind of database, which allows you to store any kind of data, not just tables like traditional databases. So it can be genomic variants, it can be geospatial imaging, it can be data frames as well. It can be tables, but it is more than that. It has its own universal storage format to be able to do this, and then it allows you to manage this data. So you can define access policies, you can share the data with anybody in the world, you can log everything. And, of course, you can access this data with any language or tool, so it goes beyond the traditional SQL that you find in databases. It consists of two components, and that will drive the conversation later. It has an open source component, which we call TileDB Embedded, and this is the storage engine that is based on this universal format that uses multi-dimensional arrays, and we're going to discuss this a little bit later. This contains all of the language APIs as well as the tool integrations, plus everything that has to do with cloud-optimized storage, as well as data versioning. And there is a commercial offering, which we call TileDB Cloud. This is a SaaS platform which allows you to share your TileDB data with anybody on the planet, and allows you to define arbitrary user-defined functions with dependencies and dispatch them to a cloud service. And the most important thing about this cloud service is that it is all serverless. We do that at extreme scale. It is built from the ground up to be serverless.
04:03 Tobias Macey
And you mentioned that you've been working on and with databases for a number of years now. I'm curious what you are drawing inspiration from as far as some of the systems that you've worked with that you're using to inform your designs for TileDB, and some of your motivation for building a new database engine that is drastically different from most of the ones that I've had experience with, anyway.
04:26 Stavros Papadopoulos
So TileDB, to me, has a long history. It started at the end of 2014, the beginning of 2015, when I was working at MIT and Intel. At the time, I was just looking for a research project to work on under this big umbrella of big data, this initiative we were working on at the time, and I was a C++ programmer. So I had different types of influences, right? The MIT people, who were building traditional commercial database systems, and then Intel Labs, who were building high performance computing software, and a lot of it was around linear algebra, which is at the core of machine learning, deep learning, and all advanced analytics. So I was looking for a way to combine these two areas. And from a research perspective, what I wanted to do was mostly sparse linear algebra, which essentially means linear algebra with matrices that have a lot of zeros or empty cells, right? And these are more peculiar from a performance perspective, and they need careful handling. And also, I was very much influenced by geospatial data from my time during my PhD years. So frankly, I was looking for a way to store sparse matrices so that I can do very fast sparse linear algebra, and at the same time capture some of the geospatial use cases; again, everything completely research oriented. So I had a couple of requirements as I was building this engine for sparse arrays. The first requirement, of course, was that it had to handle sparsity, and ideally dense arrays as well, so that it is a unified engine. A dense array has values everywhere, so the number of zeros is not as big as in sparse arrays. The second requirement was that whatever we were building, it had to work very, very well on the cloud, because we saw a big shift to the cloud. So the storage engine should work on AWS S3, Google Cloud Storage, Azure Blob Storage, or any other object stores in the cloud. Another requirement was that it had to be an embedded library.

So it had to be built from scratch by definition, because it goes down to storage, and I couldn't use any other component from established databases. So I wanted to build it from scratch in C++ in an embedded way, so that you don't have to set up a server to use it. And the fourth requirement, at least for me, was that it should be built in C++: first for speed, second because I was good at C++, but finally because I had the longer vision that this library should interoperate with other languages as well, so having a C++ library may make this a little bit easier. Now, I have to mention that at the time there were such storage engines, like HDF5, for example, a very popular dense array engine. But that was architected around dense arrays, so I couldn't use it for my sparse problems. And second, it was not built for the cloud, because it's been around for decades and the cloud gained popularity only recently. So it was not architected to work very well on S3, for example. So that's how it started; that's what motivated the storage engine. So I built it in a way that handles both dense and sparse arrays in a unified way. Because if I architect it to handle sparse arrays, maybe there are tons of similarities in handling dense arrays. So let's identify what is different, and then let's spell these out and handle both in a very, very efficient way. And at the same time, I was very fortunate that Intel was working with a prominent genomics institute, and they presented me with a very important, difficult problem around storing genomic variants. So, huge data in essentially a sparse format; genomics data is very, very sparse. So the solution that I presented was very relevant. We created the proof of concept, it went very well, and it got adopted. So we said, okay, this storage engine probably is very meaningful for more use cases than I had originally thought, so let's give it a chance and start building it up. And this is what became TileDB Embedded.

That's the open source system that I created at the time. And of course it evolved, and we can discuss later about how, but that's been the motivation behind the TileDB Embedded storage engine, which is the only system that handles both dense and sparse multi-dimensional arrays in a unified way.
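To make the dense-versus-sparse distinction concrete, here is a minimal NumPy sketch of the coordinate-format idea behind sparse linear algebra. This is an illustration only, not TileDB code: only the non-empty cells are stored, and a matrix-vector product touches only those cells.

```python
import numpy as np

# A 4x4 matrix that is mostly zeros, kept in COO (coordinate) form:
# only the non-empty cells are stored, as (row, col, value) triples.
rows = np.array([0, 1, 3])
cols = np.array([2, 0, 3])
vals = np.array([5.0, 2.0, 7.0])

x = np.array([1.0, 2.0, 3.0, 4.0])

# Sparse matrix-vector product: touch only the stored cells.
y = np.zeros(4)
np.add.at(y, rows, vals * x[cols])

# Same result as the dense product, but without materializing the zeros.
dense = np.zeros((4, 4))
dense[rows, cols] = vals
assert np.allclose(y, dense @ x)
```

For a matrix with billions of cells and a fraction of a percent non-empty, the savings in both storage and compute are the whole point of handling sparsity natively.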
08:39 Tobias Macey
And what was the motivation behind TileDB Cloud?
08:42 Stavros Papadopoulos
At the time, we were also discussing with a lot of scientists. Again, of course, I had the databases perspective from MIT, but I was talking to other groups and other scientists from geosciences, from genomics, and other scientific domains, and I observed a couple of similarities. The first thing that I observed is that every single domain has its own crazy data format. It is a file format which is very domain specific, and it's crazy in the sense that it has a lot of jargon, although, at the end of the day, it's just data. And I'm going to explain and clarify a little bit what I mean by that. And a big similarity there was that regardless of what format you choose for a specific domain, custom made for your application, no matter how good it is, and you can make it very, very good, all hell breaks loose when you have updates, data versioning, and access control. Right? A single file works great, but not so much if you start updating this file or you're adding more files; you end up analyzing thousands of files. And that was the same in genomics as well as geospatial, the exact same thing. Another thing that I observed was that every domain preferred different languages and tools. For example, one group in bioinformatics really liked R, another group liked Python. In geospatial, for example, you find somebody who likes Java as well. So, a lot of different preferences in terms of what languages you want to use in order to access your data. And that goes back to the original decision that we build everything in C++, so that we can build APIs for every language. Regardless of the domain, the scientists always wanted to share their data, of course with access policies and everything, and code for reproducibility, right? So just sharing files was not going to cut it.

So eventually, the biggest observation of all was that the data management principles, the data management features that we had in databases, I couldn't find them in domains like genomics and geospatial, and later we found that that was true for other domains as well. So data management was a problem, and it was not the science behind those domains that was creating all the problems. So we kind of lucked out in that. The other observation was that all data, regardless of the vertical, can be efficiently modeled as a dense or a sparse multi-dimensional array. For example, an image is a dense 2D array, genomics is a sparse 2D array, LiDAR point clouds are sparse 3D arrays. Even key-values can be considered as a sparse one-dimensional vector, where the keys are string values in a string domain. So even that, I can prove to you, essentially boils down to sparse arrays. So, a lot of common things across the verticals. And we already had TileDB Embedded, which addressed the issue of storing everything as multi-dimensional arrays, and addressed the issue of interoperability, that, hey, everybody can access the data from their favorite tool in their favorite language. What we needed was to try to scale the other data management features, like access control at the global scale, that did not exist; try to do everything serverless, because that alleviates the pain of setting up clusters and addresses certain issues with scalability; and also create user-defined functions with arbitrary dependencies as task graphs and deploy them in the cloud. And that effectively gave rise to TileDB Cloud, which is this SaaS platform we built for the cloud, which handles data management, so access control and logging at planet scale, as well as serverless compute in the form of task graphs.
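The claim that very different data shapes all reduce to dense or sparse arrays can be sketched in a few lines of NumPy. This is an illustration of the modeling idea only, not the TileDB API:

```python
import numpy as np

# Dense 2D array: an image. Cell coordinates are implicit in the layout.
image = np.zeros((4, 6), dtype=np.uint8)

# Sparse 3D array: a LiDAR point cloud. Only occupied cells are stored,
# as explicit coordinates per dimension plus an attribute (intensity).
xs = np.array([0.5, 1.2])
ys = np.array([3.1, 0.4])
zs = np.array([9.0, 8.5])
intensity = np.array([200, 180])

# Key-value store: a sparse 1D vector on a string domain.
# Keep the keys sorted; a lookup is then a binary search.
keys = np.array(["apple", "pear", "zebra"])   # sorted string dimension
values = np.array([1, 2, 3])

def get(key):
    i = np.searchsorted(keys, key)
    if i < len(keys) and keys[i] == key:
        return int(values[i])
    return None

assert get("pear") == 2 and get("plum") is None
```

The unifying trick in each case is the same: a set of dimensions that define a coordinate space, and attributes attached to the cells of that space, stored either implicitly (dense) or explicitly (sparse).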
12:31 Tobias Macey
And an interesting thing to note, too, is that, as you said, all of these different specific domains have their own custom file formats that they've been using for years, which means that a lot of these people who are working and researching in these domains, or who are building applications, probably have piles of data lying around in those formats. I'm curious what you have seen as far as the approach to being able to translate that information from those legacy formats into TileDB, or from TileDB into those legacy formats, to be able to fit with their existing tooling.
13:07 Stavros Papadopoulos
This is where we spend the majority of our time, admittedly, right? Because, again, we were a storage-first company. So we spent most of our time understanding each vertical and each file format. Then, of course, we had to bring some brilliant people onto our team who had this knowledge, or we were working very closely with customers, which of course provided us with this knowledge. So essentially, what we had to do was understand the data format and try to map it into a multi-dimensional dense or sparse array, depending on the access patterns, right? So it took a little bit of back and forth in order to understand what the best modeling is, but at the end of the day, it was an array. Then we created ingestors that read from those legacy formats into the TileDB format, and then everything fit in place. The reason is that once you get your data into the TileDB format, then you inherit everything we have built, regardless of your vertical. For example, if you're in genomics and you store the data as arrays, you get our integrations with Dask, Spark, MariaDB, PrestoDB, the six APIs we have; you get the whole ecosystem. And our whole mantra in the company is that we are going to integrate with pretty much everything that exists out there. So once you put the data into TileDB, you get this versatility, this flexibility to process your data with anything you like, including your own tools. For example, for the geospatial verticals, we did integrate with popular geospatial libraries like PDAL and GDAL. And of course we're happy to do the same in genomics; for example, it is in our plans to integrate with a popular library called Hail.
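As a toy illustration of what such an ingestor does conceptually — read a legacy format, mark a dimension column, and store the data sorted on it — here is a hypothetical CSV example using only the standard library. The column names are made up; this is not the actual TileDB ingestion API.

```python
import csv
import io

# A toy "ingestor": read CSV rows, mark one column as the dimension,
# and keep the data sorted on it (the array's global order).
raw = io.StringIO("time,price\n3,101.0\n1,100.0\n2,100.5\n")
reader = csv.DictReader(raw)
records = [(int(r["time"]), float(r["price"])) for r in reader]
records.sort(key=lambda r: r[0])   # 'time' is the dimension

# Columnar result: one buffer per column, ready to slice on 'time'.
times = [t for t, _ in records]
prices = [p for _, p in records]
assert times == [1, 2, 3] and prices == [100.0, 100.5, 101.0]
```

Real ingestors (GDAL translation, the genomics ingestor, pandas for CSV) do the same job at scale, plus type mapping, chunking, and compression.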
14:43 Tobias Macey
So because of the fact that you have this universal data format that can model all of these different problem domains, and you're focused on being able to store the information efficiently and have these versatile interfaces for all the different computation layers, I'm curious what you have seen as far as the challenges of being able to design the APIs to make it easy to actually use all of these different computation layers on top of TileDB. Because you mentioned things like Spark, and Presto, and MariaDB, so you're working in Turing-complete languages, and you're also working in SQL. I'm curious what some of the challenges are as far as being able to make the access patterns intuitive and efficient for all those different use cases.
15:30 Stavros Papadopoulos
Yes, this is a great question. Again, we kind of lucked out in that respect in the past years. Let's start with the databases, and then we're going to explain about everything else, all the other computation tools. The databases recently shifted to a framework where they support pluggable storage. Before, they were monolithic; they handled all the layers in the stack, right from parsing the query down to storing the data on the back end. And most recently, they just unbundled the storage, right? So they created their own APIs that allow you to plug your own storage engine, your own storage format, in there. So that made it very easy for us to just go into MariaDB, for example, or PrestoDB, or Spark — Spark has data connectors by definition — and just plug it in. It was a lot of work to do it, because we had to understand how every single tool does it. So it's a time issue rather than a complexity issue, because those guys did a good job of exposing clean APIs for that. And then, fortunately for the databases, we have a one-to-one mapping between a data frame and an array. And this is done by pretty much selecting a subset of your columns to become your dimensions, and those are your fast, indexable columns. These are the columns that TileDB will allow you to slice very fast on. So for databases, we lucked out, because they were already doing it, and we just plugged TileDB into them. For Spark it was easy because it has data connectors; Dask does the same thing, it has data connectors, it doesn't bind its storage to a particular library. So that was easy to do. And it's pretty much the same story for the rest of the tools, like GDAL and PDAL. But we needed to have people that have done it before in order to do it very, very efficiently, both in terms of time as well as performance. And again, we have people on our team that are specialized in doing exactly that. So it was not that much of a challenge from an engineering perspective.

It was just a time investment, which we happily made, because that completes the vision of being a universal data engine, and we will continue doing that.
17:48 Tobias Macey
And particularly for things like a SQL interface that's used to working with a two-dimensional array, I'm curious how you represent an n-dimensional array. Is it just a series of different tables where you slice on different axes and then join across them, and then TileDB handles translating that into the multi-dimensional array on the back end? Or was there some other level of abstraction that you needed to add to be able to make it easier for people to be able to process and analyze these multi-dimensional structures?
18:20 Stavros Papadopoulos
Yeah, so let's clarify this a bit. We directly handle vanilla SQL, right, SQL on tables, without specific adaptations for matrices. At least we haven't done that just yet; we may do it in the future. But as of today, you can use, for example, MariaDB with TileDB plugged in, and you can run any SQL query as you would do it on MariaDB alone, any ANSI SQL query, and it's going to work. The only thing that you substitute is in the FROM clause: you put an array URI, a TileDB array URI, which could be local, on S3, on Google Cloud Storage, or pretty much anywhere. The whole query is just going to work, so there is nothing to be done by the user in order for the SQL to work. The only thing that the user should know, from a performance perspective, is which of the columns we marked as dimensions in the TileDB world. Because if you have a predicate in the WHERE clause that does a range query or an equality query on those particular columns, you're going to get a very fast query time. That's the only thing that the user should know: those columns are special, they are indexed. Essentially, TileDB acts like a clustered index on those particular columns, so you're going to get a lot of performance from that. And similarly, if you're the one who constructs the table, even from SQL, we have added configuration options that allow you to say, okay, this particular column is a dimension. So in the CREATE TABLE statement you can mark which of the columns are dimensions, and you should think of those as a clustered index. That's the best way to think about it. And everything works like in the SQL world.
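The "dimensions as a clustered index" behavior can be sketched in plain Python. The table, column names, and the SQL in the comments are hypothetical; this shows the indexing idea, not TileDB's implementation:

```python
from bisect import bisect_left, bisect_right

# Hypothetical table of (stock, time, price) rows. 'stock' and 'time'
# are the columns marked as dimensions, so the data is kept sorted on
# them -- the same effect as a clustered index on those columns.
rows = sorted([
    ("AAPL", 3, 101.0), ("AAPL", 1, 100.0), ("MSFT", 2, 201.0),
    ("AAPL", 2, 100.5), ("MSFT", 1, 200.0),
])

def slice_rows(stock, t_lo, t_hi):
    """Range predicate on the dimension columns: two binary searches,
    no full scan -- roughly what a query like
    SELECT * FROM some_array WHERE stock='AAPL' AND time BETWEEN 1 AND 2
    can exploit when those columns are dimensions."""
    lo = bisect_left(rows, (stock, t_lo, float("-inf")))
    hi = bisect_right(rows, (stock, t_hi, float("inf")))
    return rows[lo:hi]

assert slice_rows("AAPL", 1, 2) == [("AAPL", 1, 100.0), ("AAPL", 2, 100.5)]
```

A predicate on a non-dimension column (here, price) would still work, but would require scanning the candidate rows rather than two binary searches.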
20:11 Tobias Macey
So can you dig a bit more into the actual on-disk format of the multi-dimensional arrays and how they're stored by TileDB for being able to then query and analyze them? And just some of the ways that users of TileDB need to think about data modeling that might be different from the ways that they're used to using either relational structures, or graph databases, or some of the custom file formats that they might be coming from.
20:39 Stavros Papadopoulos
So we're going to make a categorization, because every category has its peculiarities. So let's take, for example, the dense array case; let's take an image, okay? Suppose you want to store an image, each pixel, and be able to slice it multi-dimensionally: for example, put a range on one axis, put a range on the other axis, and get the slice. We call this a slice, a multi-dimensional slice, and arrays are pretty good at giving you the slice very, very fast. That's why you use arrays, right? So if you want to, alternatively, store this in a traditional database, the very first thing that you should do is create one record per pixel. So instead of storing just the value of the pixel, right, the RGB or whatever it is, you have to explicitly store the coordinates of that pixel, for example (1,1), (1,2), (1,3), (1,4), in separate columns. There's going to be one column for the one dimension and another for the other, and then perhaps three columns for RGB, right? So that when you issue a SQL query, a standard SQL engine is going to understand, okay, the first predicate is on the first column, the second predicate is on the second, and I can even create a clustered index, and there you go, everything works very, very fast, right? The problem is that you're introducing those two extra columns, whereas dense arrays do not store the coordinates of the pixels explicitly. And that's a very important difference versus the sparse case. So, going back to your question, for dense arrays we don't store the pixel coordinates; we just impose a one-dimensional order on those two-dimensional or n-dimensional values. And there are ways to do that; we give you a lot of flexibility to impose this order by chunking into tiles, hence the name TileDB. So essentially we impose an order, then, based on some explicit tile capacity, we chunk the contiguous values, and this chunk is called a tile in TileDB.

And then these values are serialized in a file, one per attribute. So it is a columnar format, like Parquet, for example, right? All the values of R are going to be stored in one file, all the values of G are going to be stored in another, and B in another, but not the coordinates. That's a very important distinction versus sparse arrays as well as traditional tables, right? Because for tables, if you don't store the indices, how are you going to slice on them? Tables do not have any semantics for serializing a two-dimensional space into a one-dimensional curve. There are no such semantics in the database, but there are in a dense array storage engine like TileDB or HDF5; that's exactly what these storage engines do very, very well. Okay, so this is the on-disk format: we serialize the multi-dimensional objects into a single-dimensional order. So essentially we sort them in a particular order, we chunk, we compress each chunk individually, we put them in one file per column, per attribute, and then we store them in a subdirectory in an array directory, which is timestamped and is called a fragment in TileDB. And this fragment is immutable; after it is stored, it will never be changed. And this is a very important architectural decision we took for data versioning, as well as for working very, very well on cloud object stores when there are updates. So that's the dense case. The sparse case is almost identical, with the difference that now, since we don't know exactly which cell is empty and which cell has a value, and because we don't materialize the empty values, or the zero values for two-dimensional matrices, for example, we need to explicitly store the coordinates of the non-empty cells. And imagine that, again, there is a one-dimensional order imposed on the multi-dimensional space with some specific configurations. And again we do tiling, again we put the coordinates along each dimension in a separate file, and then the attributes in separate files as well.

And then we put everything into a subdirectory in the array directory. Specifically for the sparse case, we employ multi-dimensional indexes, like R-trees, for fast pruning and fast slicing. That's what we use as the in-memory structures when opening an array, to be able to slice fast and find the non-empty cells. And this pretty much summarizes what the TileDB format is.
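A rough sketch of this layout idea in NumPy (illustrative only; real TileDB fragments also carry metadata, compression, and configurable tile and cell orders):

```python
import numpy as np

# Dense case: a 4x4 array with one attribute, chunked into 2x2 tiles.
# Cell coordinates are never written; only the values, serialized in
# tile order (tile by tile, row-major inside each tile).
a = np.arange(16, dtype=np.int32).reshape(4, 4)

tiles = []
for i in range(0, 4, 2):          # tile rows
    for j in range(0, 4, 2):      # tile columns
        tiles.append(a[i:i+2, j:j+2].ravel())
dense_file = np.concatenate(tiles)  # what a per-attribute data file holds

# Sparse case: empty cells are not materialized, so the coordinates of
# the non-empty cells go into one buffer per dimension, values in another.
d0 = np.array([0, 2])          # row coordinates of non-empty cells
d1 = np.array([3, 1])          # column coordinates of non-empty cells
sparse_vals = np.array([7, 9])  # the attribute values for those cells

assert dense_file[:4].tolist() == [0, 1, 4, 5]   # the first 2x2 tile
```

Slicing then only needs to touch the tiles that intersect the query range, which is why the tile extent matters so much for how many bytes a read fetches.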
25:23 Tobias Macey
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial, and if you start a trial and install Datadog's agent, they'll send you a free t-shirt. And for people who are trying to determine how they want to structure the data that they're storing, what are some of the data modeling considerations that they should be thinking about, or fundamental concepts that they need to understand, to be able to employ TileDB to the best effect?
26:17 Stavros Papadopoulos
Yeah, this is very, very similar to fine-tuning a database: what kind of indexes are you going to use, on which columns, what kind of page size are you going to use, and all those configuration parameters. It is equally difficult; let me start by saying that it is equally difficult. Of course, we have guidelines about performance, for example how the tile extent affects performance, how the order affects performance, and so on and so forth. For most cases, it could be straightforward. For example, for dense images it could be straightforward, because dense images are, for example, two-dimensional, and it's fairly natural to think in terms of arrays. You know which dimension corresponds to the width and which corresponds to the height. Then you do some reasonable chunking, such that each tile is, for example, 10 kilobytes, or 100 kilobytes, or one megabyte, because this affects how much data you're fetching from the cloud, or from any back end, when you're slicing. So that could be a little bit easier. It becomes a little bit more complex for sparse arrays, even for database tables, because, first of all, you need to select a subset of your columns to be your dimensions. So you need to look at the workloads that you have and say, okay, I slice on stock and time for this asset trading data set, for example, right? So I'd better make those the dimensions, because TileDB is going to give me this performance boost whenever I have a predicate on either of those two dimensions. And then, of course, there's going to be some trial and error. And for other use cases, like genomics, we do them directly. For example, for a specific genomic variant use case, we did a lot of benchmarks, we got the access patterns from customers and users, and we said, okay, this should be a dimension, this should be a dimension, that should be the order, that should be the tiling, and we found all the other configuration parameters.

And the customization we built specifically for genomics hides those; of course, it exposes the configurations for the user to set, but we pre-configured 90% of everything that you need to do, so that you can start using it immediately. It is a difficult problem, though, and that's why we're around. We're always happy to help with the users' use cases; they contact us frequently, and we're extremely interested to dive in and optimize for them.
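The tile-sizing guideline above can be turned into a small back-of-the-envelope helper. This is an illustrative heuristic, not a TileDB utility; real tuning is workload-driven:

```python
import numpy as np

def square_tile_extent(target_bytes, dtype, ndim=2):
    """Pick a per-dimension tile extent so that each (hyper)cubic tile
    of an ndim-dimensional dense array weighs roughly `target_bytes`.
    Illustrative heuristic only."""
    cells = target_bytes / np.dtype(dtype).itemsize
    return max(1, int(round(cells ** (1.0 / ndim))))

# ~100 KB tiles of float64: 12,500 cells per tile -> 112x112 tiles.
assert square_tile_extent(100_000, np.float64) == 112
```

The same arithmetic works in reverse: a proposed tile extent times the itemsize tells you roughly how many bytes each slice request will pull from object storage.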
28:40 Tobias Macey
And for somebody who is going to start using TileDB, both for the embedded and for the cloud use case, what does the overall workflow look like? And what are some of the benefits that you're seeing of unbundling the storage layer from the computation, for being able to interface with that storage engine from multiple different libraries and runtimes?
29:06 Stavros Papadopoulos
I would like to separate those two questions, if I may. So on the first one regarding what is the workflow, so here it is for talebi embedded, it's very easy to install any of the integrations into any of the API's you'd like. So that's the first thing you do. The second thing that you need to do is depending on your use case, you need to use a particular ingest or to ingest the data from the format that you have it in into talebi. And this is what we're here to help with. We have created most of the investors for example, from all geospatial formats through our integration with jido. We do a translation to tal dB. So you just use the GL command and your data. You can store any any geospatial format into talebi. For genomics, we build our own and for CSV files we build we rely on on pandas, CSV ingest, for example. And the list of ingest stores grows So you need to ingest your data from whatever format you have it into the target format. And again, you need to do through some ingest or, but from that point onwards, you can use either any of the API's we expose directly from Todd before direct access. And this is the fastest you can interface with your data with or you can just use SQL so you don't change your workloads whatsoever. Or you use poodle in jido in geospatial and again, if you don't change your workloads at all, or you use spark in the same way that you would use spark with parquet, you can use spark with tile dB. And the same is true for dask. So we're trying to ensure as little friction as possible when it comes to using the data directly. And this is true for talebi embedded. For talebi. Cloud it is even easier, you can just sign up, sign in and go we host Jupiter notebooks there with with a single click, you can just spin up a Jupiter notebook and then we have a The dependencies, everything is installed. Of course, in the future, we're gonna allow you to install any anything you like. But it's a Jupiter lab notebook. 
And we have tons of examples there with example notebooks for multiple use cases, and you can start writing code immediately you can start ingesting your data, or you can start working directly on public data that we have ingested for everybody on top of the cloud. And we will keep on adding data sets there, we will keep on adding notebooks there. So once again, the best way to learn target is to go check out those notebooks even download them if you'd like to work on them locally. But without installing anything, you just sign up sign in and go. Now, that was the first question. The second question is about unplugging storage from the processing tools. And this is exactly what is going to help me clarify a little bit division of family so the benefit for database is like Maria dB, for example, presto dB, of unbundling even spark even does cry even computation framework so it spans beyond the databases. The benefit of unbundling storage, is that you can effectively separate storage from compute and allow you to scale storage and compute separately. This is one of the biggest benefits that I personally see. Right. For example, in the past, you had to pay licenses for enterprise grade databases based on the amount of data you store in the database, right. And that's not truly reasonable when it comes to genomics, where you have petabytes of data, because the licenses are going to become extraordinarily expensive, then it depends on where you store the data. And if you don't store the data in a cloud object store, then of course, you need to pay for that storage and it's extremely expensive. And finally, you end up not using all the data at the same time. 24 seven, of course, you do Analysis frequently, but not scanning the whole terabyte, for example, 24 seven. So why would you pay for the whole petabyte or for compute for the whole petabyte? 24? Seven. So there are economic benefits from separating storage from compute. 
And now the question is, after you do that, what do you do? You need to store the data somewhere. So there has to be some kind of data format which can live on an object store like AWS S3, or Google Cloud Storage, or Azure Blob Storage. And then whenever I want, I can spin up a database server or I can spin up a serverless function and I can access this data. So the first benefit is economical. The second one has to do with interoperability. If you store the data in a format which is understood by multiple tools, you can do a SQL operation on the same data, but at the same time you can spin up, perhaps, a Python user-defined function or an R user-defined function to do some statistical analysis on the data, which is something that a database, or at least the database you're using, could not do. So the second one has to do with flexibility in functionality. But the last thing I want to mention is that if you just unplug storage from a database, it solves one or two of your problems, which is savings as well as interoperability and flexibility, but you start introducing new problems, data management problems. Okay, I stored my data in those files on S3. How do I impose access control on those? How do I impose access control in a way that when I use SQL, these access policies are respected, and at the same time, when I don't use the SQL engine and I do something entirely different, say through my Java API or through Spark or through Dask, I still get those access policies respected? And if those access policies are not file-based, AWS S3 is not going to help you. If you have array semantics, what if you want to define an access policy on a slice of your data? So what we did was exactly the opposite of what the databases did. A database unplugs the storage engine; we unplugged the compute. So we kept the storage, we kept the updates, we kept the versioning.
We kept the access control, we kept the logging; the only thing that we unplugged was the processing. Because we want you to be able to process the same data with a powerful SQL engine, and there are a lot out there, but also leverage the power of Spark, also leverage the power of Dask, also do something with a geospatial tool, or even write your own computational engine, without worrying about the data management hassles. So that's what we actually did to address this problem.
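The ingest-then-access workflow described above can be pictured with a small toy sketch: a registry of per-format ingestors that all converge on one common columnar representation, standing in for the universal array format. The function names and the in-memory "array" here are illustrative assumptions, not TileDB's actual API.

```python
import csv
import io

# Toy stand-in for a universal format: a dict of named columns.
INGESTORS = {}

def ingestor(fmt):
    """Register an ingest function for one source format."""
    def register(fn):
        INGESTORS[fmt] = fn
        return fn
    return register

@ingestor("csv")
def ingest_csv(text):
    # Parse CSV text into the common columnar representation.
    rows = list(csv.DictReader(io.StringIO(text)))
    return {name: [row[name] for row in rows] for name in rows[0]}

@ingestor("records")
def ingest_records(records):
    # Row-oriented dicts to columns.
    return {name: [rec[name] for rec in records] for name in records[0]}

def ingest(fmt, payload):
    """Convert data from whatever format it's in into the common format."""
    return INGESTORS[fmt](payload)

# Once ingested, every downstream tool sees the same columnar data.
cols = ingest("csv", "city,pop\nAthens,3154000\nBoston,692600\n")
```

The point of the sketch is the shape of the workflow: one ingestor per source format, one target representation, and every downstream API reading that representation.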
35:59 Tobias Macey
Yeah, that's definitely the thing that stands out to me most about TileDB: as you said, you still have a lot of the benefits that you get from a vertically integrated database as far as access control and versioning, without having to go and reimplement that all on your own, as you would if you were just using JSON files on S3, or Parquet files, where, as you said, you can manage access on the file level but not on the per-column level unless you have some other layer that everything has to go through. And so I'm curious if you can dig more into how TileDB itself is architected to be able to handle all of those additional benefits on top of just the raw bits-and-bytes storage.
36:40 Stavros Papadopoulos
Yes, this is exactly where TileDB Cloud comes into the picture. So let's clarify again what you can do with each of the offerings. With TileDB Embedded, you have a way to store any kind of data in a universal format as multi-dimensional arrays. The data versioning is built into this format, so still in an embedded way, and effectively serverless, you can take advantage of the versioning: you don't have to spin up a server to have serializable writes; when we have concurrency, that's already handled, that's built into the format. That's how we architected TileDB Embedded. So at least one of the data management aspects, which is handling updates and handling data versioning, is built into the format, and you get it, of course, for free, and you get it in the format so that you don't have to reinvent it for every single higher-level application that you're using TileDB with. So that's what you get from TileDB Embedded: the efficient storage into multi-dimensional arrays, efficient slicing, compression and all that nice stuff, the optimizations for the cloud, the parallelism, the integrations with all the tools that I mentioned, and of course the data versioning, the updates, and all that. You get that in an embedded way; you don't need to spin up anything, and it is not tied to any particular subset of the ecosystem, it's for the entire ecosystem. Now, if you want to do access control, especially at the scale that we're discussing, which is planet scale, you should be able to share any portion of your data set with anybody, anywhere, with as many people as you like, even beyond your organization. This is exactly what TileDB Cloud was built to do, because that cannot be done in a completely decentralized way. There must be somebody who keeps a database with all the customers and all the access policies in order to be able to enforce them.
And that's exactly what TileDB Cloud does: it enforces the access policies while keeping the rest of the code identical. You have a SQL query, it's going to work the same whether you're using TileDB Cloud or TileDB Embedded, but if you're using TileDB Cloud, then we know how to enforce any access policies that come along with that particular array. So that's how we built a universal layer and an access control layer. And that comes along also with logging: we log everything that is happening on your arrays, or on somebody else's arrays. And the reason why this is universal is that all the access policies are defined on this universal storage format. If we did not have a universal storage format, if we were an engine that supported Parquet, and ORC, and Zarr, and HDF5, we would not be able to seamlessly define access policies in this single way and be able to scale access control to planet scale.
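The contrast with file-level S3 policies can be made concrete with a toy model: a policy grants a user a rectangular slice of a 2-D array, and every read request is checked against that rectangle, regardless of which tool issued it. This is a sketch of the idea only, not TileDB Cloud's actual enforcement logic; the user names and policy shape are invented for illustration.

```python
# Toy model: each policy grants a user a (row, col) rectangle of a 2-D array.
policies = {
    "alice": {"rows": (0, 99), "cols": (0, 9)},  # a slice, not a whole file
}

def allowed(user, row_range, col_range):
    """Check a requested slice against the user's granted rectangle."""
    p = policies.get(user)
    if p is None:
        return False
    (r0, r1), (c0, c1) = p["rows"], p["cols"]
    # The request is allowed only if it lies entirely inside the grant.
    return (r0 <= row_range[0] and row_range[1] <= r1
            and c0 <= col_range[0] and col_range[1] <= c1)

# The same check runs whether the request arrives via SQL, Spark, or a
# language API, because the policy is defined on the array, not on a file.
inside = allowed("alice", (10, 20), (0, 5))
outside = allowed("alice", (10, 200), (0, 5))
```

Because the policy is expressed in array coordinates rather than object names, one enforcement point covers every access path.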
39:41 Tobias Macey
And in terms of the evolution of the project, I'm curious what have been some of the ways that it has changed since you first began working on it, and some of the assumptions that you had early on in the project that have had to be reconsidered as you started getting more people using TileDB across more problem domains and technology stacks.
40:02 Stavros Papadopoulos
Yeah, the original TileDB was just a research project, right? There was a crazy dude writing some code and trying to convince people that this has a lot of value in all those domains. The original design has remained more or less the same, and we lucked out in that respect. I'll give you an example: the original decision to work with immutable batches of writes, the fragment files, was an important architectural decision, because it allowed us first to do updates on sparse data, which are very, very difficult, because otherwise you would have to reorganize the whole data set if you're just inserting data in random places. But most importantly, this object immutability is exactly what you want if you're working on an object store like S3 or Google Cloud Storage or Azure Blob Storage, because all those objects are immutable: you cannot change just four bytes in a file, you have to rewrite the whole file. And that allowed us, of course, to become super optimized on the cloud. So that decision remained. A lot of stuff in the core code got completely refactored, obviously, but not from an architectural point of view when it comes to the format. It was mostly the core, how optimized we made it; we made the protocol to S3 much less chatty, which allowed us to avoid certain latencies. So it was mostly around optimizations. But one of the biggest architectural decisions, or format decisions, that we made, which indeed was important and happened after we created the company, and actually appeared only recently, a couple of months ago with TileDB 2.0, was the feature that allows each of your dimensions in a sparse array to have a different data type. I mean, in a traditional array definition, dimensions can only have integer values, right? It doesn't make sense to have a float coordinate. Of course, we had supported float coordinates since the get-go.
But we wanted to make it so each of the dimensions can have a different data type if the user wants, because that was the only way we could capture data frames. For data frames, ideally, the user can choose any subset of the columns, with any data types, and say: this is my clustered index; make sure that, despite the fact that those columns have different data types, the slicing is very, very fast on those dimensions. And that required a lot of refactoring, and that's what TileDB 2.0 introduced. So that was an important technical refactoring that we did, and of course it starts to pay off massively, because now we can handle generically any kind of data frame, with duplicates and everything, stuff that a traditional array would just not be able to handle.
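The heterogeneous-dimension idea can be pictured with a toy sparse "array" whose cells are keyed by mixed-type coordinates: a string dimension and an integer dimension together act as the clustered index, and a slice filters on both ranges at once. Purely illustrative: real TileDB indexes these coordinates for fast slicing rather than scanning every cell, and the data values are invented.

```python
# Toy sparse array: cells keyed by (str, int) coordinates,
# one coordinate per dimension, each dimension with its own data type.
cells = {
    ("AAPL", 2019): 55.0,
    ("AAPL", 2020): 91.0,
    ("MSFT", 2020): 167.0,
}

def slice_cells(sym_range, year_range):
    """Return cells whose coordinates fall inside both dimension ranges."""
    (s0, s1), (y0, y1) = sym_range, year_range
    return {coord: val for coord, val in cells.items()
            if s0 <= coord[0] <= s1 and y0 <= coord[1] <= y1}

# Slice on a string range and an integer range simultaneously.
result = slice_cells(("AAPL", "AAPL"), (2019, 2020))
```

The key property being sketched is that each dimension keeps its own type, yet a single multi-range slice addresses them together, which is what lets an array model a data frame's clustered index.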
42:58 Tobias Macey
And then another element of the data format that we've mentioned a few times in passing is data versioning, which is particularly critical for things like machine learning workloads, where you're doing a lot of different experimentation and generating different output data sets, and you need to be able to backtrack or figure out what version of code ran with a particular set of data. So I'm wondering if you can dig a bit more into some of the versioning aspects of the file format and how it's implemented, and some of the challenges that you're overcoming as far as being able to manage lifecycle policies to handle things like cost optimization or garbage collection of old versions of data.
43:36 Stavros Papadopoulos
This is one of the most powerful features in TileDB and the big differentiator from other formats as well. And again, this is built into the format. I don't know of any other embedded storage engine that can do that. I mean, you can kind of do that with Parquet files, but you need to use something like Delta Lake on top in order to pull it off. It's not that the Parquet format allows you to do versioning; you need to have it on top, with a different piece of software, in order to do it. TileDB builds it into the format, right? That's exactly how it is architected. But I would like to clarify a little bit what we mean by data versioning, so that people don't think that we have built some kind of Git for data. That is not exactly what TileDB is, although if there is enough interest we may be able to build something like that; we do have the foundation for it. What we mean by versioning is that when you perform a write, even in parallel, it doesn't matter when you perform it, this particular write is a batch write. We usually tell users not to write one record or one value at a time; just batch your values and then perform one write, because TileDB parallelizes everything and is very, very fast when it comes to batch writes. And each batch write creates a timestamped subdirectory within the array directory, and all the files that pertain to that batch write are inside that subdirectory. So when you do multiple writes, and since we timestamp every write, we give you the ability to time travel, to travel back in time and open the array in a state before some of the updates happened. For example, I do an update today, one tomorrow, one the day after that, and then I feel that something is not right, and I want to see what happened yesterday and what happened the day before.
So we give you the ability to open the array at a particular timestamp, and then get all the contents of the array, so you can issue any query, the same query if you want to, but see the state of the array as it was before the writes that happened after the timestamp you provided. And we've architected it in such a way that we provide excellent isolation: every fragment does not interfere with any other fragment. A fragment is that subdirectory, actually; it's this batch write. So every fragment does not interfere with any other fragment, there is no locking, no central locking; no locking is needed, because the fragment name is unique across all the fragments: it carries a timestamp and a UUID, which is random. So serializability is guaranteed by default. So this is what we call data versioning. This is different from saying, okay, I'm going to go back to a particular version and then I'm going to fork it, which is something that you would do with Git. This is, again, doable, but we'd like to see more use cases in order to build it.
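The time-travel mechanics just described can be sketched as a toy model: each batch write becomes an immutable, timestamped fragment, and "opening the array at a timestamp" means merging only the fragments written at or before that timestamp, with later fragments overriding earlier ones for the same cell. The data structures here are illustrative, not TileDB's on-disk layout.

```python
# Each batch write appends one immutable, timestamped fragment.
fragments = []  # list of (timestamp, {cell: value}) pairs

def batch_write(timestamp, cells):
    fragments.append((timestamp, dict(cells)))

def open_at(timestamp):
    """Merge all fragments written at or before `timestamp`, newest last,
    so later writes override earlier ones for the same cell."""
    view = {}
    for ts, cells in sorted(fragments, key=lambda frag: frag[0]):
        if ts <= timestamp:
            view.update(cells)
    return view

batch_write(1, {"a": 10, "b": 20})  # today's write
batch_write(2, {"b": 99})           # tomorrow's update to cell "b"

today = open_at(1)
tomorrow = open_at(2)
```

Because fragments are never modified in place, every past state remains reconstructible just by ignoring the fragments that came after it.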
46:32 Tobias Macey
And I'm curious how this differs from things like Datomic, as far as being able to handle the versioning of data across time and doing things like event sourcing, so that you don't ever actually delete anything: you just mutate a record and keep the previous version, so that you can say, these are all the different changes that happened to a particular attribute. One of the canonical examples being: you have a user who has an address, and they move to a new location. The fact that they used to live at a particular point never ceases to be a fact; they just have a new fact as to their current location, so that you can go back through time and see what the value was at a particular point. So yeah, I'm just wondering if you can give a bit of comparison as to how the versioning in TileDB compares to something like Datomic for being able to handle the way that data is represented and versioned.
47:36 Stavros Papadopoulos
Yeah, data versioning in TileDB is more similar to what Delta Lake provides with Parquet files. And of course we don't have the same ACID guarantees that Delta Lake provides; this is a large topic, which we will discuss in future tutorials. But what we do provide is write serializability, without any kind of locking, everything serverless. With Delta Lake you need to have a Spark cluster or a PrestoDB cluster in order for this to work; we don't need any cluster. And it's mostly batched writes, which can be done in parallel, and then you can open the array at any instant in time, ignoring all the updates that happened afterwards. We do not have any transactional semantics at the moment; that's not something we've optimized for up until now. And also, at least in the TileDB Embedded format, we don't keep logs of, you know, who accessed which attribute and when; that is not functionality you're going to get there. You do get very detailed logs of pretty much everything you have done in TileDB Cloud, but we don't consider that part of the data versioning feature that we have. At least not today.
49:00 Tobias Macey
And then the other thing that I'm curious about is how you handle concurrency in access to the data and being able to resolve conflicts, particularly because of the fact that different batched writes will produce different versions of data. And so if you have two writers who read the data at the same time and then create different batch writes, how do you resolve those different updates?
49:26 Stavros Papadopoulos
Now, we've architected TileDB in a way that can handle multiple such writers, and multiple interleaved readers as well, in the following manner. As I mentioned, every batch write creates a fragment, which is a subdirectory of the array directory, which does not interfere with anything else. And it will never collide, because the name is guaranteed to be different, since we have a random token in it; with only a negligible probability can you end up with a conflict there. So multiple writers can write at the same time and there are going to be no conflicts, no corruption whatsoever, even if one of the writes fails. If the write completes, we introduce another object, a special OK file, which says: okay, this subdirectory is good to go. And then we respect all the eventual consistency issues that, for example, S3 introduces; TileDB is architected to work with S3's eventual consistency, and therefore we inherit that model when it comes to consistency. The reads do not conflict with the writes, because a read will never read a partially written fragment, and that's because of this OK file: if the reader, upon opening the array, doesn't see this OK object, it's going to completely ignore any partially written fragment. So this allows us to perform concurrent writes and reads without having a centralized service to manage any kind of conflict.
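The lock-free scheme just described can be modeled in a few lines: a fragment name combines a timestamp with a random UUID so concurrent writers never collide, and readers only consider fragments whose OK marker exists, so a partially written fragment stays invisible. This is a toy in-memory sketch; the real format writes these as directories and objects on storage.

```python
import uuid

committed = set()   # fragment names whose OK marker has been written
fragments = {}      # fragment name -> cell data

def start_write(timestamp, cells):
    """Write fragment data under a collision-free name; no OK marker yet."""
    name = f"__{timestamp}_{uuid.uuid4().hex}"
    fragments[name] = dict(cells)
    return name

def commit(name):
    """Writing the OK marker is what makes a fragment visible to readers."""
    committed.add(name)

def read_visible():
    """Readers ignore any fragment without an OK marker (partial writes)."""
    return {n: fragments[n] for n in fragments if n in committed}

a = start_write(1, {"x": 1})
b = start_write(1, {"y": 2})   # same timestamp, different UUID: no collision
commit(a)                      # imagine writer b crashed before committing

visible = read_visible()       # only fragment a is visible
```

No coordination service is needed: uniqueness comes from the random token in the name, and atomic visibility comes from the presence of the single OK object.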
51:03 Tobias Macey
And then the other interesting element of this is the fact that the TileDB Embedded project is open source and publicly available for free, and then you're also building a company around that, and the cloud service on top of it. So I'm curious how you're managing governance and ongoing sustainability of the open source aspects of the project, and the tensions of trying to build a profitable business on top of that.
51:31 Stavros Papadopoulos
TileDB Embedded is entirely open source, and we will maintain it as such. We do manage it as a team, we govern it, and we welcome contributions from anybody; we're very happy to see contributions to it. We're very responsive, as you can see in forums and GitHub issues, and we will abide by that. TileDB Embedded and the integrations and the APIs are all going to be open source. The good news for us is that TileDB Cloud is completely orthogonal: it uses TileDB Embedded. All the servers that we spin up, and the serverless computations we do, rely on TileDB Embedded. We use the array format to define the access policies, the logs, and everything else, but all the TileDB Cloud functionality is completely orthogonal to what we do in TileDB Embedded. And that allows us to have a very clean separation of the two, and this has not created problems for us so far.
52:33 Tobias Macey
And as far as people who are using TileDB to build applications on top of it, what have you found to be some of the most interesting or unexpected or innovative ways that it's being used?
52:45 Stavros Papadopoulos
We have seen very diverse applications for TileDB Embedded, and most recently on TileDB Cloud as well. What I want to note mostly is the ones that I find admirable, because TileDB was used first in an important domain like genomics, right? Some very high-profile organizations trusted us to do that when we were just four people in the company, and much earlier, when I was a single person in the lab at MIT, trying to just create a very quick proof of concept. So I find this admirable, because those are important use cases, and data management is a huge bottleneck for them. I mean, can you believe it? Data management is the bottleneck to the actual science. You cannot do analysis at scale, especially in genomics, where it is important to do it at scale, and you cannot do it because you are blocked by data management: you're blocked by all those legacy formats, by inefficient formats, and by inefficient data management in general. So this is what surprised me the most. Not the fact that TileDB handled those cases, that was not what surprised me, but that certain people in certain high-profile organizations trusted us to build this and improve it, so that we solve a very important problem in a very important domain.
54:07 Tobias Macey
In terms of your own experience of building and growing the project and the business around TileDB, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
54:19 Stavros Papadopoulos
The most challenging part of building this piece of software is not the near 1 million lines of code that we built; that's the kind of easy part. The most difficult part is to start it from scratch and build a brilliant team around it. The most difficult part is to inspire some brilliant people to come and invest their time and put their passion in to build this colossal vision. I mean, we are not delusional here, and this is something that I really want to stress strongly: we're not delusional. This is a tall order, this is a very bold vision, but this is what excites us the most. The most challenging part was to convince the engineers to come and work with me, the people doing the marketing, the investors, of course, the consultants, even my managers at Intel and my colleagues at MIT, to even start this project. So that was the most challenging part. And we're in good shape. I mean, we've been doing this for three years, we feel very confident about the team, very confident about the software. There is a long road ahead, but as long as we're excited and enthusiastic, I think the end result is going to reward everybody.
55:38 Tobias Macey
TileDB is an ambitious project, and you have a potentially huge scope of work to be done in terms of the core capabilities of the storage format, the cloud platform that you're building around it, and the different integrations for all of the runtimes and compute interfaces. I'm curious what are some of the features or capabilities that you're consciously deciding not to implement, or that you're deferring to other people to build out as part of the surrounding ecosystem.
56:11 Stavros Papadopoulos
Great question. All the computational parts. We explicitly state on the website as well that we go for pluggable compute. But let me elaborate a little bit. The first thing that we don't want to do is create another language to access the data. We believe that would be catastrophic. People like to access data in so many different ways: they want to access data directly through language APIs, they want to access the data through already popular tools. So it would not be wise to just create our own thing and try to convince people to completely change the way they work every day. That's the first thing that I left out since day one, and that means no query parser and none of the technology that comes along with defining a new language. So definitely not a new language. The second thing is we're not building a new SQL engine. There are so many wonderful SQL engines out there; our strategy is to partner with all those brilliant people that are building those engines. We can alleviate a lot of the storage problems that they're probably not interested in solving if they really want to work, for example, on query optimization, so we let those folks work on query optimization. We're not interested in building a SQL engine from scratch. What we are interested in doing, in that respect, is pushing down some of the compute primitives that a SQL engine could use. First, because if you push it down as close to the data as possible, it's probably going to be faster, because you're going to avoid certain copies of the data. We're doing a good job internally in the core to do everything multithreaded, vectorized, and so on and so forth, so we are equipped with the knowledge and skills to do this efficiently. But also, most importantly, certain computation primitives you find in SQL engines are exactly the same.
In other engines as well, a filter is a filter; even in PDAL or GDAL or pandas, it's the same. So why don't we push it down, do it very, very efficiently there, and then have all the other pieces of software that are plugged on top utilize it? The same goes for group-bys and merges. But we're not going to rebuild the whole SQL engine, because a SQL engine is not just a couple of primitives put together; there is a lot of intelligence, a lot of sophistication there, and we are not there yet. We don't want to do that. As I said, the original motivation was linear algebra, and we're very much interested in building all those distributed algorithms on top of the cloud. We have focused so far mostly on the infrastructure: how do we create a serverless infrastructure to be able to dispatch any kind of user-defined function or task graph, so that eventually other people as well as ourselves, other users too, can build distributed algorithms, with linear algebra algorithms being part of those, on this infrastructure. So, again, we kind of delegated building those distributed algorithms to anybody who is equipped and capable and willing to build them.
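The filter-pushdown argument can be contrasted in a toy sketch: the engine-side version materializes every row and then filters, while the pushed-down version runs the predicate inside the scan itself, so non-matching rows are never copied out of the storage layer. The function names are invented for illustration; real engines would add vectorization and multithreading on top.

```python
# Engine-side filtering: materialize all rows first, then filter.
def filter_after(rows, pred):
    materialized = list(rows)          # full copy of the data
    return [r for r in materialized if pred(r)]

# Pushed-down filtering: the predicate runs inside the scan itself,
# so non-matching rows never leave the storage layer.
def scan_filtered(rows, pred):
    for r in rows:
        if pred(r):
            yield r

data = [{"v": i} for i in range(10)]
pushed = list(scan_filtered(iter(data), lambda r: r["v"] < 3))
```

Both paths return the same answer; the difference is where the work happens, which is exactly why one pushed-down filter primitive can serve SQL engines, Spark, Dask, and geospatial tools alike.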
59:16 Tobias Macey
TileDB is definitely an interesting project and has a broad range of applications, but what are the cases when it's the wrong choice?
59:24 Stavros Papadopoulos
Yeah, that's another great question. So TileDB is not a transactional database; don't use TileDB as a transactional database. Theoretically, you can do transactions through MariaDB and its integration with TileDB, but that's not our thing, it's MariaDB's. The credit, if you do transactions, goes to MariaDB, not to us; we act as a data connector for that. If you want, for example, some ACID guarantees that you must have in order to be transactional, through direct access from Python, you're not going to get those today. You can get some guarantees that can take you a long way for certain applications, but if you are a core transactional application, that's not something you would use TileDB for, for sure. Another thing that you would not use TileDB for, or at least wouldn't switch to TileDB for, is if you're using a data warehouse and you're happy: if you're doing only SQL, if you don't care about interoperability, if you don't care that much about cloud storage and separating storage from compute, then probably you should stick with the data warehousing solution that you have, because those are not competitors to us. You would use TileDB even for data frames if you want to separate storage from compute, if you want to do user-defined functions in other languages, in any language, actually, because that's what we're trying to do, but not if you're sticking only to SQL. If you want SQL plus more, then TileDB is a great solution for that. And finally, we have not tested, we have not optimized for streaming scenarios, again only because we didn't have use cases that demanded it, so you cannot consider us a core streaming solution. So transactional and streaming use cases are what I wouldn't consider TileDB for.
1:01:11 Tobias Macey
You've mentioned a few different things that you have planned for the future roadmap of TileDB. Are there any other aspects of the work that you're doing that you have planned for the upcoming releases that you want to discuss, or any other aspects of TileDB and multi-dimensional storage that we didn't discuss that you'd like to cover before we close out the show?
1:01:33 Stavros Papadopoulos
Yes, absolutely. So the TileDB Embedded engine will always evolve. There are so many issues, even publicly on GitHub, that we're working on hectically to get done, always on performance, always on added features. So TileDB Embedded will always evolve, and we will always have several people on TileDB Embedded full time. But the biggest bet that we have, and the biggest investment of our time, is going to go into TileDB Cloud. So TileDB Cloud, again, allows you to share data and code with anybody, and it allows you to do everything serverlessly, and that's exactly what we want to focus our efforts on. Because once you solve the storage issues, which we believe we did to a great extent, especially for the use cases that we work on, the next step is: how do we alleviate all the engineering hassles? Because, again, data scientists want to do scalable analysis, they want to get to insights very quickly, which can lead to scientific discoveries. That's what a scientist wants to do, right? They don't want to spin up clusters, they don't want to monitor clusters, they don't want to debug clusters. So TileDB Cloud has this goal of alleviating all this burden from all the scientists that want to work with data at scale, very, very easily. So the plans for the future: double down on TileDB Cloud. Tons of cool stuff is coming up, so stay tuned, and you're going to see it in releases very, very soon.
1:03:03 Tobias Macey
All right. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
1:03:19 Stavros Papadopoulos
Yeah, there is a lot of brilliance and sophistication in data management today; that's not the problem that we saw whatsoever. The biggest problem that we saw was that every data management solution out there, especially the very sophisticated ones, was architected around a single data type, for example tables, and a single query engine, for example SQL. If you use tables and SQL, there are tons of great solutions out there, but that was problematic, as I mentioned before, for other verticals. So that was the biggest gap: there has not existed so far a system that can work on any data seamlessly, in a unified way, build all the data management features like access control and logging and updates and data versioning on a universal storage format that can capture all the data, and then interoperate with all the languages and all the tools out there, to give you the flexibility to operate on the same data without converting from one format to another. That system has never existed, and this is why we built TileDB as the universal data engine.
1:04:32 Tobias Macey
Well, thank you very much for taking the time today to join me and discuss the work that you're doing with TileDB. As I said, it's definitely a very interesting project and very forward-looking, and I'm interested to see where it goes in the future and some of the ways that the ecosystem grows around it. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day.
1:04:50 Stavros Papadopoulos
Thank you very much. It's been a pleasure.
1:04:57 Tobias Macey
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.