00:10 Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data Conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and to take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Karthik Ranganathan about YugabyteDB, the open source, high performance distributed SQL database for global, internet-scale apps. So Karthik, can you start by introducing yourself?
01:49 Karthik Ranganathan
Absolutely. Thanks for having me on here, Tobias. So first off, hi folks. I'm Karthik, one of the co-founders and the CTO of Yugabyte, which is the company behind YugabyteDB, the open source distributed SQL database. I've been an engineer forever, for many, many years, longer than I care to remember, I guess. Right before starting Yugabyte I was at Nutanix, working on distributed storage for about three years. Before that I was at Facebook for about six years, working on distributed databases. The first database I worked on at Facebook was Apache Cassandra. Obviously, when I started working on it, it wasn't called Apache Cassandra and it wasn't open source, so it was the very early days. Subsequently I started working on Apache HBase. Both of my co-founders and I are all HBase committers as well, and as a team we have this unique experience of working on databases and running them in production, and so on and so forth. Before Facebook, I was at Microsoft working on the wireless stack and so on. So yeah, and before that, it probably gets boring.
02:52 Tobias Macey
Yeah, it's interesting how many people I've spoken with who have gone on to create their own database companies have a background working with some of these different open source databases that have become sort of venerable, and part of the general canon of the overall database market.
03:07 Karthik Ranganathan
Yeah, that's right. I think it's an interesting phase that we're in, for sure. Especially over the last 10 years, there's been an explosion of data and an explosion of digital applications. Every person building a database and running it definitely gets insight, especially in the context of larger companies, which are often ahead in time of the common enterprise. So there's a lot of learning to be had from that, and at Yugabyte we bring a lot of that together. So that's our unique path, but I'm sure, like you said, everybody has their own reason, and there's definitely a need.
Tobias Macey
And before you began working on all these different databases, I'm curious how you first got involved in the area of data management.
Karthik Ranganathan
Well, to be perfectly honest, it was by accident, but let me try to go for a slightly more sophisticated answer. I was originally working on data in the context of networking, so it was distributed data, but not exactly data being stored. And I remember joining Facebook in 2007. An interesting anecdote: when I was joining Facebook, there were 30 million users on the site, give or take. And I remember thinking at that time, there's 30 million users already, how much more is this thing going to grow? Maybe it'll double in size. But there are interesting data challenges, so let me go work on the data stuff at Facebook. As it turned out, it became 2 billion users plus, and that's a different story. But at that type of scale, when you go from 30 million users to two or three billion users, there is an enormous amount of pressure at every layer of the infrastructure. I was starting out building inbox search, so the way I got involved in databases was through inbox search. The problem that was presented to me and a couple of other folks, we were working as a team, was: you've got all this inbox data in Facebook, which was not the Facebook messages of today.
It was the older incarnation of Facebook messages, and people needed to be able to search it. Now, the funny thing about searching your messages is that you do it very rarely, so it's read very rarely. But it gets a lot of writes, because every word in your message, for each of the recipients, is like a reverse index entry. So it is extremely write heavy and very rarely read. Nobody wanted to spend a lot of money on it. Nobody definitely wanted to sit around babysitting this thing, because of its scale. And it had to be multi-data-center ready. So taking all this together, before I knew it, I guess I was involved in data and databases. And what we ended up building then finally ended up getting open sourced as Apache Cassandra.
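To illustrate why inbox search is so write heavy, here is a minimal Python sketch of a reverse index keyed by recipient and word. It is only a toy model of the idea described above, not the design that was built at Facebook; the function names and the sample message are made up for illustration.

```python
from collections import defaultdict


def index_message(index, message_id, recipients, text):
    """Add one reverse-index entry per (recipient, word) pair.

    Returns the number of index writes the single message caused.
    """
    writes = 0
    words = set(text.lower().split())
    for recipient in recipients:
        for word in words:
            index[(recipient, word)].add(message_id)
            writes += 1
    return writes


def search(index, user, word):
    """Return the ids of the user's messages containing the word (one read)."""
    return sorted(index.get((user, word.lower()), set()))


# Hypothetical data: one four-word message sent to two recipients.
index = defaultdict(set)
writes = index_message(index, 1, ["alice", "bob"], "lunch at noon tomorrow")
hits = search(index, "alice", "noon")
```

In this toy model a single four-word message to two recipients produces eight index writes, while a search is a single keyed read, which is the write-heavy, rarely-read shape described above.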
05:42 Tobias Macey
That's pretty funny, though. I didn't realize that particular background of the Cassandra project. I knew it came out of Facebook, but I didn't know that it was for that specific sort of write-heavy, rarely-read use case.
05:54 Karthik Ranganathan
Yeah, actually, there's a lot more that had to do with the actual use case. We actually decided to constrain the problem and relax consistency. Because, once again, if people couldn't find their messages, we figured they'd complain, and we'd go after their original message, reindex it, write it, and we're done. So eventual consistency in Cassandra was born from that aspect. The other funny thing was, I don't know how much of a bragging right it is, but I had the opportunity to name the project, because everybody else was so busy building it. I figured Cassandra was a good name because Delphi was taken, and the next most famous oracle was Cassandra. So we ended up picking that name, and who knew at that time it would actually get so popular.
06:36 Tobias Macey
So it's funny getting to get some context about the history of these projects that so many people have used and have become so widely adopted.
06:44 Karthik Ranganathan
Yeah, absolutely. And back then you don't realize it, right? When we were putting this thing together, open source wasn't that popular, and open source databases definitely not. There was nothing called NoSQL back then. So there were a lot of interesting twists and turns that the world went through, and it's been pretty rapid. NoSQL is now such a staple thing, but back then it wasn't even a term.
07:08 Tobias Macey
And the other funny part of all this is that SQL has come back, because we have figured out ways to actually make it scale beyond what was originally thought to be the natural limits of that particular approach. And so that brings us to the work that you're doing with Yugabyte. I'm wondering if you could talk a bit about what the platform is as a product, and some of the origin story of why you thought it was necessary and how it came to market.
07:37 Karthik Ranganathan
Absolutely, absolutely. So after working on Cassandra, like I told you, I went on to work on HBase, where I ended up building some of the core features. Having been a part of both communities, people would consistently ask one question back then, and it was a question that I personally, and all the people around me, would soon come to dread, because back then it was not a feature. It was: "Could you just give me that secondary index, please? That's all I need." It was very difficult to explain to people that yes, that's all you need, but making it consistent and correct, and always working in every failure scenario, is a lot harder than just adding a secondary index. So that was one of the core learnings. Is it worth rebuilding the entire database for just a secondary index? Probably not. But fast forward a few years, and what we've seen is that the world is developing a fundamentally new breed of applications, and they have some fundamental properties. They need transactions, multi-node transactions. They need low latency, so the ability to serve data with low-millisecond latency. You can see infrastructure such as 5G, the edge, and so on become more and more popular, and all of these will impose low latency access. The other big paradigm is massive scale. People want to be able to add nodes and handle a lot of reads and writes. In fact, I would say that what the top five or ten big tech companies were doing in terms of operations per day, we're seeing regular startups being able to hit that level. We have a bunch of startups all doing more than a billion operations per day on YugabyteDB itself, which used to be considered a big thing. But people are able to approach that level of scale because cloud has made it accessible.
The other thing that cloud has fundamentally changed is geographic distribution of data. Whether it be for getting low latency access to users, or for GDPR, or whatever other purpose, people are increasingly thinking about geographically distributed apps. It could also be things like Uber and Lyft having to do surge pricing within a one-mile geofence. There are a number of different scenarios where this starts coming up. So if you put all this together, you need transactions, low latency, scale, and geographic distribution. Now, you could offer any API, but the other change we've seen happen in the market, especially with the proliferation of so many NoSQL databases, is that people realize they can get some apps done quickly with NoSQL, but when they really want to build other features, NoSQL is often limiting in what it offers. So the pendulum has swung the other way, to people expecting: "Hey, why don't you just give me all of SQL, and I will decide which features I want and which ones I don't, which ones are performant, and therefore I use most of the time, and which ones are not performant, but I will use rarely." So SQL is definitely still dominant, and is on a resurgence if anything. Put the two together, and you see that what you really need is a transactional database that is SQL, and that can be low latency, scalable, and geographically distributed. This is the underpinning of Yugabyte. This is essentially why we are building Yugabyte.
10:43 Tobias Macey
And to your point about a number of different types of applications requiring some of this geo distribution out of the box and from the start, it has led to a number of other databases hitting the market with that being one of their main selling points, most notably things like FaunaDB and CockroachDB. I know that a lot of the inspiration behind that is some of the work that came out of the Google Spanner paper. I'm wondering if you have any other thoughts on what it is about the current state of the market and the industry, and the actual application and development needs of these different users, that is driving this trend of more database products coming out with geo distribution and highly scalable transactional capabilities as their main selling point.
11:34 Karthik Ranganathan
Yeah, absolutely. First off, the way we view it is that databases like CockroachDB, FaunaDB, what have you, that are embracing at the core geographic distribution, transactions, scalability, and HA actually validate the market. I don't think it would be too much fun if you're in a market of one player, and that player is ourselves, and we're the leader and the first and the last and everything. So the existence of more such projects and companies actually validates both the need to build such applications and the financial side of things, the possibility to monetize. So it is exciting for us, for sure. Now, as far as what is happening in the market as a whole, think about a cloud native application that starts out small. There is a variety of different patterns in which users approach building. A lot of companies want to start small, because they're startups, but they want to have the insurance that they can grow when they need to. But even when they are small, they don't want to deal with failures by hand, because failures are always going to happen at the most awkward time: 3am, never 3pm. You can take it from us. Being at Facebook, we always got paged at two or 3am. It was never two or 3pm, which would have been a lot more convenient. But anyway, so first is HA. Then, as the data set gets bigger, they want to be able to add nodes, so the second is scalability. And then, as the number of different types of access patterns keeps proliferating, invariably other things like geographic distribution and more complex ways to access data start coming up. So this is the normal path for an app, a user, and a company in the cloud.
And if you look at what databases are around to satisfy this, if you take the older but well established databases like Oracle and SQL Server, and even Postgres and MySQL, these are not built to handle that paradigm, that user journey. So what is the database that people end up picking? Let's forget all the new entrants you mentioned. It's invariably going to be Postgres, because that is the fastest growing database. The second fastest growing database is MongoDB. MongoDB has a huge commercial company behind it; Postgres does not. It is completely a child of open source, of people just developing it, adopting it, and using it, and it is wildly popular because it is very extensible, very powerful, and feature rich. Now, the reason we are in this game, what we're trying to do by building a new open source project, is that Postgres is so popular, yet it doesn't satisfy the cloud user journey. We want to build a database that can offer every single feature that Postgres has to offer, while satisfying this cloud journey, which is: start small, make it easy to run in production by giving you HA, so you address your issues at 3pm, not at 3am, and then grow when you need to and geographically distribute. So this is our journey. Obviously, each of these other projects you mentioned would have their own thesis, their own reason for existence, and their own way of going about their journey. And that's probably perfectly valid for them, because the market is huge and everybody has their own mark to make. But our vision is that there's MySQL and Postgres, which are the open source RDBMSs, and we want to create a third one which is as fundamental, as open, as powerful, and ready for the cloud.
14:52 Tobias Macey
And continuing on your point about Postgres being the growing market leader in terms of open source, I've seen reports go either way about whether it's MySQL or Postgres that's in the lead. But Postgres definitely has a significant portion of mindshare, regardless of which of those is the front runner. I'm curious what you've seen as far as the challenges of maintaining compatibility with that as an API and as an interface. And I'm curious to what degree you're supporting that Postgres interface, whether it's just at the SQL layer and some of the specifics of how they implement their extensions to that query language, or if it goes deeper, in terms of being able to support some of the different plugins that are available in the broad Postgres ecosystem.
15:39 Karthik Ranganathan
18:35 Tobias Macey
Also, yeah, there have been a number of companies that have gone through that journey. The one that comes to mind most readily is PipelineDB, where they started as a fork and then ended up refactoring to be a plugin to Postgres, for being able to do in-memory aggregates of streaming data in Postgres. And then, in terms of the challenge that you're trying to overcome and the use case, the project that comes most notably to mind as trying to target that same area is Citus, which was recently acquired by Microsoft. So I'm curious what you would call out as being some of the notable differences between Yugabyte and Citus, and some of the specific features of Yugabyte that would tip the balance in terms of somebody choosing your project for something they're trying to build.
19:23 Karthik Ranganathan
Yeah, completely. So with Citus specifically, Citus is an extension to Postgres, and we are not. And it's not because we didn't want to be an extension; it's because we cannot be an extension. So the first question to ask, I guess, is what are we doing that prohibits us from being an extension, which will give you a clue into the unique features we support. The first thing we couldn't do as an extension was the Postgres system catalog, which is the set of tables that track all the other tables in Postgres. This is the repository of all the databases, schemas, tables, users, and so on. In Citus's case that is still left to Postgres, which is on a single node, whereas in Yugabyte even that is distributed. We wanted a shared-nothing system with no single point of failure, so if anything fails, the data is still replicated, and it automatically fails over and recovers. So that forms a fundamental difference. The second difference is that Citus reuses the storage layer of Postgres, and replication between shards is essentially Postgres replication, which I believe is asynchronous. So if you lost a node and the data on that node, you will lose a little bit of data, and that violates ACID compliance. Even though it's just a little bit of data, I'm telling you from practical experience, we've run big systems at places like Facebook, it's always a difficult, emotional call to make. For example, you lost something, or there's a network partition, and you have to do the failover, and you don't know if it's bad enough to do the failover, or whether if you just wait a little longer it's going to come back, and you don't have to go through this hassle of failing over and explaining to people what happened, and why, and what the impact is, even when you don't know. So it is complicated. That's something Yugabyte is built for. And obviously, it's easier said than done, because it's very fundamental to the database. You have to touch replication, you have to touch storage, all the way down.
The other big difference is that Citus uses each of the Postgres databases to store a shard of data. You could have the keys of your entire app distributed across the shards, and you may perform a transaction across these shards. That goes through a coordinator node, which coordinates the transaction because it is a distributed transaction across shards. The problem with that is the coordinator node becomes a choke point, a bottleneck in terms of your scalability, and that often limits the number of transactions you can get from the entire system. It also puts some restrictions on what type of indexes you can create, what type of unique indexes, and so on, so on what features you can exploit across shards. With Yugabyte, the direction we've taken is that the entire cluster, comprising whatever number of nodes you have, is accessed as one logical unit. That means you can just run as many transactions as you need, and if you want more transactions, you just add more nodes and it'll scale. If you have failures, they're seamlessly handled. And similarly, all your unique indexes are enforced across nodes. So these are fundamental differences. To summarize the whole thing: if you want HA, so absolutely no failure or downtime and no manual touchpoint, then Yugabyte is better. If you want scalability, just add nodes and get more scale, especially when doing transactions, unique constraints, and so on and so forth, then Yugabyte is better.
And obviously we're a newer technology. Just to play the other side of the balance too, Citus reuses Postgres at the storage layer, so there's something to be said for the maturity. But it behooves us, as the Yugabyte project, to grow up and show the world that we are as mature. That's why we've been focusing on things like Jepsen testing and correctness testing, getting into more and more use cases, and earning our stripes, so to speak.
23:16 Tobias Macey
Yeah. And one of the questions that I had was whether Yugabyte requires any special considerations in terms of data modeling, given the nature of its distribution and the availability of geo replication. I know that in some of the implementations of these databases that support horizontal scaling and cross-region replication, there is a caveat as to how you think about modeling your data, where small tables you would want to actually have located on each of the nodes, whereas the data that you're sharding will actually span across nodes. And so it affects the way that you approach joining across tables and the way that you handle your query patterns. I'm wondering if you can talk to that, and maybe some of the caveats that come up in Yugabyte as far as how you handle some of this geographical scale.
24:04 Karthik Ranganathan
Absolutely. I think this is a very, very astute point, and it's one of the hardest points that we strive to educate users about in general. I'll just give some general notions, because it's hard to go into all of the details. Fundamentally, if you start at the beginning, a write in Yugabyte has to be persisted on multiple nodes. Let's say your replication factor is three, so you can survive one fault. Any write will attempt to write to three nodes, and will wait for two nodes to commit the data before acknowledging success to the user. Whereas in a traditional RDBMS, you write to a node, and the node simply writes the data to its disk and acknowledges the user. So we've already introduced minimally one network hop while handling the user's write, whereas an RDBMS simply writes to disk. Now, disk is much, much faster than the network. The latency of a network hop is on the order of a millisecond if the machines are close to each other, and could be much more, as much as 70 to 100 milliseconds, if they are, say, across the east and west coasts, or even on different continents. So the first thing is to really understand the placement of data, and to go in with the realization that your latency of writes will go up. This may or may not have a bearing on throughput, because that depends on how fat your network pipe is, but your latency is definitely going to go up. So it starts there.
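As a rough illustration of the write path just described, the Python sketch below models when a write can be acknowledged with a replication factor of three: once a majority of replicas, two of three, have committed. It is a simplified model under stated assumptions, not YugabyteDB's implementation. The function name is hypothetical, each latency is assumed to be the time until that replica's commit is known to the leader, and the leader's own local commit is approximated as zero network latency.

```python
def quorum_ack_latency_ms(replica_latencies_ms, replication_factor=3):
    """Latency at which a write can be acknowledged to the user.

    The write is acknowledged once a majority of replicas have committed,
    so the result is the latency of the slowest replica inside the
    fastest majority.
    """
    if len(replica_latencies_ms) != replication_factor:
        raise ValueError("expected one latency per replica")
    quorum = replication_factor // 2 + 1
    return sorted(replica_latencies_ms)[quorum - 1]


# Illustrative numbers only, taken from the ranges mentioned above.
same_region = quorum_ack_latency_ms([0.0, 1.0, 1.2])
cross_country = quorum_ack_latency_ms([0.0, 70.0, 100.0])
```

With these assumed numbers, the same-region write is acknowledged after about 1 ms and the cross-country write after about 70 ms, since in both cases one remote replica must confirm before the user gets a response.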
Now, on the read side, there's a number of building blocks that Yugabyte gives you in order to make things efficient, such as moving all the leaders that can serve data into a single data center, so that you can satisfy the read completely local to that data center. But you may have failures where the entire set of leaders fails over to a different data center, at which point you may have increased read latency. So the second thing to realize is that in an RDBMS, because you build replication and failover yourself, very tightly, and you redirect the app, you may be able to control latencies better, but it's a lot more involved. In a distributed database, you have to think carefully about where the failover will go and what the latency will be upon failure. So that's point number two. Now, point number three, on your point about reads, writes, table sizes, and so on: we are working on a feature called colocated tables, where you can place all of the data of your small tables into a single shard, or tablet, and let a few of the very large tables expand out, split, and live across multiple nodes. In this type of setup, if you did a join that only read data from the small tables, it will typically be pretty fast. But if you join data from one of the small tables to a large table, or across two large tables, so at least one large table is involved, you could be moving a lot of data across the network. So the next consideration is to think about what your joins are. The last point that I wanted to put out is that there's scalability and there's scalability. There are workloads that need, say, a couple of terabytes of data and need to scale their queries. And then there are workloads that need tens or hundreds of terabytes, and that need to have insanely low latency reads.
Now, it is very important to reason through, for any of these workloads, what fraction of data you will actually have to read, transport across the network, or process in order to satisfy the query. Because at some point this spectrum starts going over into the OLAP side, where a single query is just doing a lot of work, and you'll have to move data back and forth, and you can seamlessly slip over to that side. So it is important to distinguish between these two sides and keep the workload in the OLTP bucket. On a small database, like an RDBMS, it's okay to read most of the data, because you are inherently bound by how much it can scale. It can only scale up to a certain point. A distributed database, by contrast, gives you the promise of being able to add a lot of nodes, and this also brings with it the unintentional danger of reading a lot of data. And so your queries actually become less scalable with time, as you accumulate more data.
28:09 Tobias Macey
Yeah, and your point about there being different ways of defining scale is something that I'm interested in digging a bit more into, because people will throw out that term as a catch-all when they may mean very different things by it. And that can lead to breakdowns in expectations as to what people are going to be getting when they buy something that, quote unquote, scales. It might be that the system can scale vertically and take advantage of more CPU cores on a single box. Or it might scale horizontally, in terms of being able to handle more read or write throughput because you're splitting that across multiple network connections. Or it might be that you're able to scale horizontally for storage. And so I'm wondering what the primary focus is in terms of the scalability of YugabyteDB along these various axes.
29:00 Karthik Ranganathan
Yeah, so the simplest one to explain is if you need fault tolerance. I know this is not scalability, but it still requires you to spread your data across nodes. Whether you have a little bit of data or a lot of data, the notion is that the failure of a node should not impact the correctness of your data, and your application should continue to function as it is. So that's the bottom end of the spectrum. That's where it starts. Now, from there, you can take the small workload and geographically distribute it across multiple regions, and you want to be able to run in that mode. So you have a notion of scalability when you compare an RDBMS with this setup. You could call it scale; arguably it may not be scale. Now, from there, let's move forward. You have more and more queries coming in which are only read queries. In the RDBMS world, you would have used read replicas to scale this out. However, your read replicas are obviously going to serve stale data. A distributed database like Yugabyte implicitly shards your data. The shards live on various nodes, and each node that a shard leader lives on can serve its own reads. So you get consistent reads from a larger number of cores, which is something you could not have achieved with your RDBMS. So that's the first scale vector. The second scale vector is when you get a lot of writes. Let's assume that even though you have a lot of writes, the disk you have on a single node is more than capable of storing that data, because most of these writes are updates. Let's just assume that for a second. So you're not bottlenecked on the amount of data you store; you're bottlenecked on the CPU, on how many cores you have that can possibly handle this deluge of updates coming in.
So in this case, you want to split your data, again automatically sharded across multiple nodes, and have each node handle a portion of the updates. So again, Yugabyte is a great candidate for this case. Now let's take the third case, where you actually have a lot of inserts coming in and your data is growing in volume. You could put in bigger disks, but at some point the use case is going to run out of the number of cores the database needs in order to handle that data set size. At this point, you want to take the data and put it on other nodes, and you need to leverage more aggregate cores in order to sustain that data set size. At this point, also, Yugabyte is a great option for serving the data. Now, rewinding: what are the cases when there is no perceived scale, and when Yugabyte is arguably not a great fit? Let's take cases where you don't have too many updates, your data set size is not expected to grow very big at any point in time, and the number of queries you're handling is not expected to grow too large. You have a fixed data set size, and while the data set may be big, the working set fits in memory. So you have maybe even a terabyte of data, but queries are always coming for a small subset of it. And you don't care about 100% ACID; you're okay with an asynchronous replica that gets promoted. If you're okay with this type of setup, then a more mature technology probably fits the bill at this point in time. Obviously, as one of the believers in, and one of the builders of, the YugabyteDB project, I'd like to say Yugabyte solves everything. It'll probably be true at some point, but there are other technologies that solve that today.
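The idea of sharding data so that each node serves a portion of the reads and updates can be sketched in a few lines of Python. This is a generic hash-sharding illustration, not YugabyteDB's actual partitioning scheme; the hash function, the tablet count of 12, the round-robin placement, and the node names are all assumptions chosen for the example.

```python
import hashlib
from collections import Counter


def tablet_for_key(key, num_tablets):
    """Map a row key to a tablet (shard) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_tablets


def node_for_tablet(tablet, nodes):
    """Place tablets on nodes round-robin (a deliberately simple policy)."""
    return nodes[tablet % len(nodes)]


# Hypothetical three-node cluster with 12 tablets and 3000 updated keys.
nodes = ["node-a", "node-b", "node-c"]
load = Counter(
    node_for_tablet(tablet_for_key(f"user-{i}", 12), nodes)
    for i in range(3000)
)
```

Each key always lands on the same tablet, and the updates end up spread across all three nodes, so adding nodes (and moving tablets onto them) adds cores that can absorb the load.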
32:28 Tobias Macey
And going back to the storage layer as well, one of the other interesting points is that while we focused most of this conversation on the Postgres compatibility, you also have another query interface that is at least based upon the Cassandra Query Language and supports a different way of modeling the data. So I'm wondering if you can talk about some of the ways that you've actually implemented the storage layer itself, and the way that you're able to handle these two different methods of storing, representing, and querying the data, and some of the challenges that arise in terms of having this split in the types of access.
33:06 Karthik Ranganathan
Absolutely, yes. So we are a multi-API database. Our query layer is pluggable, which means we can continue to add more and more access patterns in the future to help users build a richer variety of apps. So that was really the vision even from day one. We picked Cassandra specifically because the Cassandra language also uses a very SQL-like dialect: it also has tables, it has columns, it has insert and select queries, and so on and so forth. So we used that as a building block, and it has a rich ecosystem. It is good for a certain type of use case, ones with massive scale, massive amounts of data, reads and writes, and ultra-low latency, which clearly complement the SQL, very relational use case. The thing that we changed from Apache Cassandra is that, unlike Apache Cassandra, YCQL, the Yugabyte Cloud Query Language, is completely ACID compliant, right? So we think of YCQL as a semi-relational use case, and we thought about some of the dangers of scale-out at massive scale, where if you issued a bad query or a poor query, you could really ruin not only your own life but everybody's life in the cluster, because all the nodes are performing a lot of work, and it would take a while for the whole thing to settle down. And that could cause unintended consequences. It's okay at a couple of terabytes; it's really bad at 10 or a hundred terabytes. So the YCQL API restricts you from doing any of those queries by not even supporting them. YCQL only supports the subset of queries in SQL that hit a finite number of nodes, unrelated to the total number of nodes in the cluster. So there are no scatter-gather type operations that do joins across all tables. And so it's really built for scale and performance, right? So that's on the YCQL side.
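The restriction described above, where only queries that hit a bounded number of nodes are accepted, can be sketched roughly as follows. This is an illustrative model, not the YCQL planner: the `UnsupportedQuery` exception, the `nodes_touched` function, and the rule that a query must constrain every partition-key column are assumptions made for the example.

```python
class UnsupportedQuery(Exception):
    """Raised when a query would fan out across the whole cluster."""


def nodes_touched(partition_key, where_columns):
    """Return how many nodes a lookup touches, or reject the query.

    partition_key: column names that decide which shard a row lives on.
    where_columns: column names constrained by equality in the query.
    """
    missing = [col for col in partition_key if col not in where_columns]
    if missing:
        # Without the full partition key, every shard would have to be
        # scanned, so the cost would grow with the size of the cluster.
        raise UnsupportedQuery(
            "query does not constrain partition key column(s): %s"
            % ", ".join(missing)
        )
    # With the full partition key, the row lives on exactly one shard.
    return 1


if __name__ == "__main__":
    print(nodes_touched(["user_id"], {"user_id", "event_time"}))
    try:
        nodes_touched(["user_id"], {"event_time"})
    except UnsupportedQuery as error:
        print("rejected:", error)
```

The point of the design is that the cost of an accepted query stays independent of the cluster size, which is what keeps one poorly written query from loading every node.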
Now, where do we see these two fit in? If you look at workloads that are 10 to 100 terabytes or more, right, that need very low latency access directly as the serving tier, or that have use cases such as automatic data expiry, which you implement with the feature called time to live, YCQL perfectly fits the bill. It also supports compound data types, such as lists and maps and sets and so on, inside a single column. On the other end, YSQL, the Postgres-compatible API, does foreign keys, constraints, triggers, like the whole nine yards, right, on the completely relational side. Now, you asked about how we designed the layer below, right? It was actually an interesting challenge for us. It is document-oriented at the layer below. And what we figured was that a document database is actually the most amenable to supporting a wide array of access patterns, as long as we can keep enhancing the storage layer. By the way, it's called DocDB, so I'll just use that term from now on. So what we realized in the DocDB layer is that there's a number of access patterns that we have to optimize, and we have to leverage these access patterns in the corresponding query layers above, right? The advantage of a common DocDB layer below each is that the advantages of one start flowing into the other. For example, on the YCQL API, we have the ability to store a lot of data per node. One of our users actually tried loading 20 terabytes of compressed data per node, and then at that density level tried to do, you know, hundreds of thousands of operations per second, and tried to kill a node, add a node, expand the cluster, and so on, right? All of that seamlessly flows into the YSQL side, right? And the YSQL side has, for example, features such as secondary indexes and constraints, which we added to the YCQL side.
So developers coming in with Cassandra knowledge and wanting to build those types of apps can actually use secondary indexes, unique constraints, transactions, a document data type, the JSONB data type, and so on. And the YSQL folks, the Postgres folks wanting to do scale, can actually leverage a Cassandra-like scale. So it really marries the two at the layer below. Now, another unique advantage that's often overlooked is the fact that we internally distinguish between a single-row-key access pattern and a distributed access pattern. What this means to the end user is this: if you went to Google Cloud, you would put your most critical transactional workloads on Google Spanner. But Google Spanner uses atomic clocks, is very expensive, and has a lot of limitations. So you wouldn't put use cases which have a ton of data in Spanner; you'd probably move them to something like Bigtable, right? YugabyteDB brings both into the same database as just two different table types. So that's really another huge advantage that the end user gets. Now, as far as the challenges, I think that's actually an interesting question. I think the challenge is always twofold, right? The first part is that the addition of so many features into something that's core at the lower layer should not destabilize whatever exists, right? Especially in something as fundamental as a database, it's almost like a breach of trust if we build a feature that breaks something else and loses data, right? So that means that the onus on testing is incredibly high. We have a super massive, elaborate pipeline to test our product for every single feature matrix. And we in fact go the distance of having a CI/CD pipeline, which we're very proud of, that bids for spot instances the minute somebody uploads a diff, a code diff.
So for code reviews, the minute they upload their changes, we automatically bid for spot instances and run Spark-based parallel tests, like thousands and thousands of tests in parallel, and sometimes, before the review is done or the reviewer even gets to it, the results of running this wide array of tests are out. We had to invest in thread sanitizer and address sanitizer, we had to invest in Clang, in Mac and Linux and all sorts of different environments to build in, Kubernetes, Docker, and so on and so forth. We do Jepsen-based testing, we do deterministic and non-deterministic failure testing. So it's a very, very elaborate pipeline. So that's a big onus. I mean, some of us actually enjoy working on that stuff, believe it or not.
So that works out as a team. So that's one part. The second part is that people often ask us what we're going to do for compatibility with, for example, Apache Cassandra or with Postgres, right? The way we think about it is slightly different. We will do the compatibility slowly; that's not a concern for us. What is more important is enabling users to build the type of applications they want to in the here and now, instead of chasing versions. So we're not going after lift and shift of an application. We're going after lift and shift of an application developer. So a user that's familiar with Apache Cassandra but really wants secondary indexes, who says, I just wish I had JSON, I just wish I could do a couple of transactions here: those are the guys we're going after, because we're really enabling a new paradigm of apps to get built on a database that is not a new paradigm to them, right? Similarly, the Postgres folks say, all of this is great, but I just wish I had the scale, or I had the HA. So those are the things that we're going after. So yeah, I don't know if that gives a fair idea.
40:14 Tobias Macey
No, that's definitely useful. And to your point about the testing and the CI/CD pipeline, it definitely sounds quite impressive, and it also sounds quite expensive.
40:25 Karthik Ranganathan
It is. That's why we had to do a specific project to do spot instance bidding. And you're right, the expenses add up very quickly, and we have to keep prioritizing how to keep the price down. So every time somebody puts up a diff, there is actually something that goes and finds out what the bidding rate is, and then it uses that bidding rate to spin up instances in the cloud by bidding at that price. So in the end the price is often far lower than what you would pay at the 24/7 rate. We do this parallelized testing and then shut down the instances automatically. And the other part is that this entire infrastructure is cloud neutral, so we can run it on any cloud we want, depending on where the costs go. So yes, it is expensive, but we have invested a lot to keep the price down.
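The bidding step described above (look up the current rate, bid at that price, run the tests, shut the instances down) can be sketched as simple arithmetic. This is not the project's actual tooling: the prices, the safety margin, and the function names are hypothetical, and no real cloud API is called.

```python
def choose_bid(current_spot_price, on_demand_price, margin=0.10):
    """Bid slightly above the current spot price, capped at on-demand.

    The 10% margin is an assumed safety buffer, not a value from the talk.
    """
    bid = current_spot_price * (1 + margin)
    return min(bid, on_demand_price)


def run_cost(price_per_hour, instances, hours):
    """Total cost of one parallel test run at a given hourly price."""
    return price_per_hour * instances * hours


if __name__ == "__main__":
    # Hypothetical prices in dollars per instance-hour.
    spot, on_demand = 0.30, 1.00
    bid = choose_bid(spot, on_demand)
    print("bid per hour:", round(bid, 2))
    print("spot run cost:", round(run_cost(bid, 100, 0.5), 2))
    print("on-demand run cost:", round(run_cost(on_demand, 100, 0.5), 2))
```

Capping the bid at the on-demand price reflects the reasoning in the answer: bidding only makes sense while it stays cheaper than paying the regular rate.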
41:13 Tobias Macey
And one of the other core elements to discuss in any database project is the issues that come about from the operational characteristics of it. And from the conversation so far, it definitely sounds like you're putting a strong focus on that aspect of the project. But I'm wondering if you can just talk through a bit of the considerations that somebody who's interested in using and deploying YugabyteDB should be thinking about, and some of the steps that are involved in actually getting it deployed in a production capacity, maybe going from a small-scale proof of concept on a single node and then scaling that out to multiple instances or multiple data centers.
41:53 Karthik Ranganathan
Yeah, absolutely. So we support most of the popular ways of deploying. We've taken special care to make YugabyteDB have no external dependencies, so it runs on bare metal, VMs, containers, Kubernetes, the works. As far as where you can deploy, obviously you can deploy it in any managed Kubernetes, whether you're managing it yourself or a cloud provider is managing it, and you can deploy it on all the public clouds. And we have a number of integrations with, for example, CloudFormation and Terraform, and all of the various ways. So that's just the raw act of deploying the database, right? Now, as for how you want to deploy it, our most commonly deployed multi-node paradigm is a multi-zone deployment in a single region. That's by far the most common. We're increasingly starting to see a lot of multi-region and hybrid deployments, hybrid meaning across clouds, or on premises and on a public cloud. So that's as far as the range of deployments goes, right? So then the next piece is, we have a whole platform, and even in the multi-data-center deployments there are options. You could deploy with three data centers where, on a failure, you don't have to touch the thing: you can survive an entire data center failure, zone failure, or region failure. Or you could do two-data-center deployments with asynchronous replication, one way or bidirectional, right, which is multi-master. So we've seen all of this. And we have a third thing, which is a read replica. We actually have a user that's deployed a YugabyteDB cluster eleven-way replicated across various geographies of the world so that they can get ultra-low latency access with a reasonably high number of operations per second, right? So we're seeing all of this come to fruition on the deployment aspect. Now, as far as rolling it out and running it in production, and so on.
One of the core value props we give is high availability, which means that if a node dies, you don't have to worry; at some point you come in and replace the node, right? So that makes things much easier already. We also support a variety of other things that you would expect, like, you know, encryption at rest, encryption on the wire, the security side of things, authentication, role-based access, all of that, to ensure that the data is secure, right? So you can do secure deployments from day one rather than doing it as an afterthought. And then further, we also support observability by exporting metrics through Prometheus. We have Prometheus-ready metrics that can be scraped, and then you can set your own alerts and so on and so forth on top of that, and then monitor and observe what's going on with the database and get alerted, right? Now, finally, we're a database that is built for zero downtime. So there's node replacement, and there's a number of our users that do AMI rehydration, right, where they want to replace the entire node with a different OS with patches, for example. And similarly, rolling software upgrades, so that you are able to go one node after the other and upgrade your software while the database is running and the app gets no impact, right? There are also things like ALTER TABLE and altering the schema, where you want to add a column, drop a column, or do some other stuff. All of those are online as well, where it just rolls through internally, one node after the other. So we talked about a lot of these operational aspects that it supports. And all of this is in the open source, whatever I talked about, right? But to take this one notch higher, now comes the commercial aspect. The product that we have that's commercial is called the Yugabyte Platform.
And it is software that instantly converts your cloud account, or your set of on-premises machines or something, into a DBaaS. So effectively, it strings everything we talked about into software, where you can just, turnkey, say: hey, I want to deploy it on these machines, or you go figure out the machines and go spin them up yourself; you go figure out the security groups on AWS and make sure that it's restricted the right way and the right nodes have access to each other; I want a multi-region deployment with x nodes in this region and y nodes in that region. And the thing is going to get the whole thing done for you. With a click of a button you get, for example, encryption at rest with integration into a key management service, you get alerting, you get the ability to do software upgrades in a rolling fashion. So all of that is automated for you. It's completely turnkey for folks that want to run it. And this has been very, very popular with some of our paying users that have graduated into wanting to manage this solution at scale or for their business-critical applications. And this is the Yugabyte Platform. So it's all the stuff we talked about, but bundled in a turnkey fashion and with an easy-to-use, you know, REST API, UI, and so on and so forth.
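The claim earlier in this answer, that a three-data-center deployment survives the loss of an entire data center, follows from majority-quorum arithmetic. The sketch below assumes majority-based replication, which the conversation does not spell out, and the zone names and function names are hypothetical.

```python
def majority(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """How many replicas can be lost while a majority remains."""
    return (replication_factor - 1) // 2


def survives_zone_loss(replicas_per_zone, failed_zone):
    """Check whether a majority of replicas remains after one zone fails."""
    total = sum(replicas_per_zone.values())
    remaining = total - replicas_per_zone.get(failed_zone, 0)
    return remaining >= majority(total)


if __name__ == "__main__":
    three_dc = {"dc1": 1, "dc2": 1, "dc3": 1}
    two_dc = {"dc1": 2, "dc2": 1}
    print(survives_zone_loss(three_dc, "dc1"))
    print(survives_zone_loss(two_dc, "dc1"))
```

Under this assumption, one replica in each of three data centers keeps a majority after any single failure, while packing two of the three replicas into one data center does not. That is consistent with the answer presenting two-data-center deployments as an asynchronous replication setup instead.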
46:28 Tobias Macey
And one of the other operational aspects of running one of these types of platforms is the consideration of backups, where some people might view high availability and the fact that your data is replicated across multiple zones as good enough, but it doesn't actually solve the problem where you introduce an error and you need to be able to restore from a certain point in time. I'm curious how you approach that, particularly given the fact that you're able to scale to these large numbers of nodes and large volumes of data.
46:56 Karthik Ranganathan
Yeah, no, absolutely. I think you raise a good point. I should have covered it in the first place, but thanks for raising that. Backups are absolutely essential for the application corruption that you mentioned, but also from the perspective that, at the end of the day, we're a newer database, and people want the peace of mind that you have the data backed up, and you can bring it into a different cluster, or you can export it to a different system. Because we're scalable, we do a distributed backup. So the way this works is, and I'm going to explain it in very simple terms, there's a lot of files on a lot of nodes, and we pretty much keep around a copy of each file without disturbing it, as a backup cut. It's called a snapshot, an in-cluster snapshot. And then we take all of these frozen sets of files across a variety of different nodes and copy them into a target. And when you need to restore, you can just get these files back, appropriately split with the appropriate replicas on the different nodes, in order to recreate the cluster. And you don't even need the number of nodes in the source cluster to be the same as in the destination. You can back up to, say, S3 and then restore to a GCP cluster. You can do all of these kinds of things. With the platform edition, which is the commercial side we talked about, you can even do nightly backups. So you can just say, I want a backup at some frequency, you can set a cron-like schedule, and the thing is going to keep backing up for you. And you can do a one-click restore, which says: hey, go to this S3 bucket, which holds a backup, and just restore it for me into this cluster. And it will just do the whole thing for you.
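The snapshot-then-copy flow described above, including restoring onto a different number of nodes, can be modeled with a small in-memory sketch. It is a simplification under stated assumptions: clusters are plain dictionaries, file names are hypothetical, replicas are ignored, and files are placed round-robin on restore, none of which is claimed to match the real implementation.

```python
import copy


def take_snapshot(cluster):
    """Freeze a copy of every file on every node.

    cluster maps node name -> {file name -> contents}.
    """
    snapshot = {}
    for node_files in cluster.values():
        for file_name, contents in node_files.items():
            # Deep copy so later writes to the cluster leave the cut intact.
            snapshot[file_name] = copy.deepcopy(contents)
    return snapshot


def restore(snapshot, target_nodes):
    """Spread the snapshot's files over the target nodes round-robin."""
    restored = {node: {} for node in target_nodes}
    for index, file_name in enumerate(sorted(snapshot)):
        node = target_nodes[index % len(target_nodes)]
        restored[node][file_name] = copy.deepcopy(snapshot[file_name])
    return restored


if __name__ == "__main__":
    source = {
        "n1": {"shard-1": ["a"], "shard-2": ["b"]},
        "n2": {"shard-3": ["c"]},
        "n3": {"shard-4": ["d"]},
    }
    cut = take_snapshot(source)
    # The target has two nodes although the source had three.
    print(restore(cut, ["t1", "t2"]))
```

Because the snapshot is keyed by file and not by node, the restore step is free to lay the files out on however many nodes the destination cluster has.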
48:22 Tobias Macey
And for a lot of the open source infrastructure components that are backed by a business, one of the common patterns for managing the business model is withholding some of these different types of critical enterprise features, such as backups and change data capture, as a means of driving revenue to the commercial offering. And you mentioned before that everything that we've been discussing so far, aside from that hosted platform, is available in the open source release. It looks like that's been the case since version 1.3. I'm wondering if you can talk through your reasoning and motivation for including all of what might be considered advanced features in that open source project.
49:03 Karthik Ranganathan
Yeah, it's a great question. So yeah, it's version 1.3; I think it was towards the first half of last year. The reason for doing so, primarily, is that our ambition as a project is to become as fundamental as a MySQL or a Postgres, right? We want to become another very fundamental piece of infrastructure for the internet, for all apps being built in the cloud. So we want to become the default database for the cloud, right? And as developers of databases and as users of databases, we ourselves have personally felt this pain a lot, where a couple of features that you really need are held back, and it might be a weekend project, but you can't choose that database anymore. And if it's a really critical project, you will probably end up paying for support anyway, because you want the peace of mind. So what we decided was that it's better to have long-term greed, not short-term greed. We do want to become big, we do want to become popular, but not at the expense of developers really understanding and using the project. That's the first point. The second point is that when we communicated this to our community of users, they were pretty quick to point out: hey, Postgres and MySQL don't really hold back on features like backups or security or encryption or what have you, and yet you say you want to become like them, but you're not really doing the same thing. So at that point, we decided, yeah, this makes sense. And when we looked at our paying customers, our enterprise customers, they were mostly paying for the convenience. That's because in the cloud, everybody is busy building so many apps without knowing which ones will succeed, and those that succeed just take off like a rocket and hit massive scale, so the manageability of the whole thing, and a way to take care of all of these deployments without having to have people babysit each one of them, is a much bigger value to them than the actual enterprise features that were held back.
And secondly, with the world warming up to the cloud, our thesis is this: if you look at very popular database companies and database products, like Amazon Aurora or MongoDB, what you find is that they are all open source at the core. Amazon Aurora is really a managed service built on top of Postgres and MySQL, which are fully open. MongoDB reached where it did with Atlas, the managed service on top of an open source core database, which is the Mongo database, right? I mean, obviously, Mongo has gone the other way and shut the doors to the community; I'm guessing they think they don't need that anymore. But to us, it's a long game. To us, how deeply you can get embedded with people trying to build apps on you is actually the most important and rewarding thing. And we feel that once we get there, there'll be a lot of opportunity to monetize, right?
51:43 Tobias Macey
And in focusing your efforts on the long game, it seems to go against some of the accrued tribal knowledge of how best to run an open source company, at least as far as what's been put forth as best practice within maybe the past five to 10 years. I'm wondering what you've seen as some of the feedback, either from your community or from some of the other companies that you've interacted with, as far as how that decision has played out. And I guess I'm most curious about cases where you've had people trying to convince you that you made the wrong move and that you should go back to withholding some of these features as a means of trying to drive revenue.
52:21 Karthik Ranganathan
Funnily, we've actually had the opposite. We've had our enterprise customers, and paying ones at that, tell us that this is a great move, because they would end up paying us anyway for the convenience of the platform and for support, because we are going after mission-critical workloads. So I guess in some sense it is true that you would need to withhold stuff if you are not running in the most mission-critical path, like if you are an add-on or a plus-one type of infrastructure. But with us being a core type of infrastructure, people want the transparency, love the transparency, and in fact use us more because of the transparency, if anything, right? So the community users, our enterprises, they're all completely behind this thing. In fact, ever since we made the change, our community has grown like crazy, I'd say almost 10x in less than a year, right? And we've seen enterprises also get pulled in with strong interest and come and tell us: hey, this is the right move, we want an open database, right? And that's because, specifically in the area of databases, a lot of people are wary of, you know, players like Oracle, where it's very closed: they don't know what's going on, they don't know whether it will work in the cloud, or what the exact value is, or how the feature works, and so on and so forth. But for something that will live for a longer time, that has community backing, that has wider testing by virtue of the wider community using it, with the transparency and the fact that it's mission critical, they will come and pay. So I think that's the feedback we've gotten across the board. I think before we took the decision, there was a lot of second-guessing and people trying to convince us, but having seen what happened since we took the decision, not too many people have said anything to us at all.
54:01 Tobias Macey
And in terms of your plans going forward, I'm wondering what you have in store for the future of both the technical and business aspects of Yugabyte.
54:10 Karthik Ranganathan
Absolutely. So let's do the business aspect first, because at our stage we're more focused on the technical side, so the business side is relatively quicker. On the business side, we just announced the beta of our cloud, Yugabyte Cloud, because a lot of the smaller companies, the small and medium-sized businesses or the fast-growing companies, really don't want to deal with even the platform on their side. They're like: you just take care of the whole thing, you deploy the database and give us an endpoint, and we'll use it; you scale, you manage, you upgrade, you do everything, right? So we're seeing that as one of the drivers currently; that's a big thing that's happening. On the second side, there's a number of feature asks, and a number of asks about going into a number of different clouds and different ways of deployment, and making all of those easier. So some of that work is also in progress, right? So overall that's the set on the business side. On the technical side, our vision is to become the default database for the cloud, right? For any cloud application you're building, pick a database that is ready for the cloud. So we're seeing a natural affinity from a number of related projects, and we want to position ourselves as one of the best databases for these projects. For example, there are ORMs that so far have worked against a single node, but we have the opportunity to change the way these ORMs and even JDBC work fundamentally, so that they are aware of a cluster of nodes, with topology awareness for multi-data-center deployments, and so on and so forth. So we're working with projects like Spring and Go and so on and so forth in order to bring this to fruition. Then there's the GraphQL community. There's a lot of interest from the GraphQL community because it's a modern paradigm for building applications.
And GraphQL itself is high performance and is stateless and scalable, but you need a database that's scalable and geo-distributed as well. So there's a lot of resonance there. The PostgreSQL community is also pretty interested, because it is everything Postgres, but it's great if you need HA, the cloud, scalability, and so on. We're also a great fit for Kubernetes, because we are a multi-cloud and hybrid-deployment-ready database where we don't have any external dependencies, and we have built features so that we work natively in Kubernetes. So Kubernetes taking off on the promise of multi-cloud and hybrid cloud is also pulling us along with it, which is great. And finally, when you're building modern microservices, there's a lot of messaging systems, like Kafka, for communicating between microservices, and a huge ask from people is: can you give us a change data stream? We've talked about Debezium and other things. So can you give us a stream of the data that has changed in the database, so that we can communicate between microservices and know what changed, right, subscribe to these changes? So that's another area. These are just a few areas. There's a number of other areas where there's interest, but you can look to us making good inroads, good features, good integrations into each of these ecosystems. So we'd like to have a simple message around why YugabyteDB is a great database for these ecosystems.
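The change data stream ask described above, where services subscribe to a feed of what changed in the database, can be illustrated with a tiny in-memory model. Nothing here is a YugabyteDB, Kafka, or Debezium API: the `ChangeStream` and `Table` classes and the event fields are hypothetical and exist only to show the subscribe-and-notify shape of the idea.

```python
class ChangeStream:
    """Fans out change events to every subscribed callback."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        for callback in self._subscribers:
            callback(event)


class Table:
    """A toy table that emits one event per committed change."""

    def __init__(self, name, stream):
        self.name = name
        self.rows = {}
        self._stream = stream

    def upsert(self, key, value):
        op = "update" if key in self.rows else "insert"
        self.rows[key] = value
        self._stream.publish(
            {"table": self.name, "op": op, "key": key, "value": value}
        )

    def delete(self, key):
        if key in self.rows:
            del self.rows[key]
            self._stream.publish(
                {"table": self.name, "op": "delete", "key": key, "value": None}
            )


if __name__ == "__main__":
    stream = ChangeStream()
    stream.subscribe(print)  # a downstream microservice would go here
    orders = Table("orders", stream)
    orders.upsert("o-1", {"status": "new"})
    orders.upsert("o-1", {"status": "paid"})
    orders.delete("o-1")
```

In a real deployment the subscriber would typically be a messaging system that relays the events to other microservices; the sketch only shows that each committed change produces exactly one event that consumers can react to.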
57:08 Tobias Macey
And are there any other aspects of YugabyteDB, or your position in the overall landscape of data management, or any of the other aspects of your business or your work on the platform, that we didn't discuss yet that you'd like to cover before we close out the show?
57:22 Karthik Ranganathan
I think we did a great job.
57:25 Tobias Macey
Yeah, I'm sure that there are a number of different sub-elements that we could probably spend a whole other episode talking about in great detail, but I think we've done a good job of the overview. So for anybody who does want to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
57:50 Karthik Ranganathan
I think the one big gap that at least I see, and it's not directly in data management but it is related, is networking in Kubernetes. That comes to mind, by the way, specifically when you're trying to run stateful workloads in Kubernetes. First off, there's a lot of discussion on whether you should run stateful workloads in Kubernetes or not. But the answer is really irrelevant, because everybody's doing it anyway. So the answer is yes. Now, given that the answer is yes, and an increasing number of people are doing it, the Kubernetes ecosystem is scrambling to mature how to run stateful workloads inside Kubernetes. However, Kubernetes is really strong at multi-cloud, but the networking prevents multi-cloud deployments. So one consistent ask that we get at Yugabyte, and we've seen this ask come up a number of times, is: how do you stitch multiple Kubernetes clusters, running possibly in completely different regions or even clouds, together using YugabyteDB, right? And we've actually had a bunch of these deployments where you have three Kubernetes clusters, and YugabyteDB spans all three of them and keeps one copy of the data, one replica of the data, in each of these clusters. Now the most annoying thing about this by far is that each one is like a craftsman solution. We've got to figure out which cloud, what is the source cloud, what are the different clouds, how does the networking work, how do you route. So that's one part that I think is a gap that could keep getting better over time. So that's one piece. The second piece is that there is a lot of interest in serverless as a technology. So it remains to be seen how serverless and databases, open source databases specifically, will end up playing out.
There's a number of serverless open source technologies, and there's a number of databases like YugabyteDB that are created to work well inside containers. How the two work together, and whether it can go down to zero cost given the slow start problem, or whether it will only be for scaling as the workload increases, I think those are things that remain to be seen. So that's another open problem that I would see. Yeah, I can't think of much else. I think these are two big areas. And I mean, these are areas we're thinking about too.
59:58 Tobias Macey
Yeah, those are definitely two pretty substantial problems to try and solve. So I think that's plenty. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with YugabyteDB. It's definitely a very interesting platform. And the more that I've learned in our conversation today, the more I want to look into it further. So thank you for all your efforts on that front. And I hope you enjoy the rest of your day.
1:00:18 Karthik Ranganathan
Thank you for having me on. Really enjoyed it. Great questions, great discussion, and have a good day yourself.
1:00:30 Tobias Macey
Thank you for listening. Don't forget to check out our other show, Podcast.__init__, at pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers.