Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.


episode 59: Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt [transcript]


Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and etcd, and what is in store for the future.

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to today to get a $20 credit and launch a new server in under a minute.
  • Go to to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at
  • Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Zookeeper is and how the project got started?
    • What are the main motivations for using a centralized coordination service for distributed systems?
  • What are the distributed systems primitives that are built into Zookeeper?
    • What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
    • What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
  • Can you discuss how Zookeeper is architected and how that design has evolved over time?
    • What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
  • What are the scaling factors for Zookeeper?
    • What are the edge cases that users should be aware of?
    • Where does it fall on the axes of the CAP theorem?
  • What are the main failure modes for Zookeeper?
    • How much of the recovery logic is left up to the end user of the Zookeeper cluster?
  • Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
  • In recent years we have seen projects such as etcd, which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
    • What are some of the cases where Zookeeper is the wrong choice?
  • How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
  • If you were to start the project over today, what would you do differently?
    • Would you still use Java?
  • What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
  • What do you have planned for the future of Zookeeper?
Contact Info
  • @phunt on Twitter
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Zookeeper
  • Cloudera
  • Google Chubby
  • Sourceforge
  • HBase
  • High Availability
  • Fallacies of distributed computing
  • Falsehoods programmers believe about networking
  • Consul
  • etcd
  • Apache Curator
  • Raft Consensus Algorithm
  • Zookeeper Atomic Broadcast
  • SSD Write Cliff
  • Apache Kafka
  • Apache Flink
    • Podcast Episode
  • HDFS
  • Kubernetes
  • Netty
  • Protocol Buffers
  • Avro
  • Rust

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2018-12-03  54m
00:12  Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you'll get everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I'm interviewing Patrick Hunt about Apache ZooKeeper and how it is used as a building block for distributed systems. So Patrick, could you start by introducing yourself?
00:58  Patrick Hunt
Hi Tobias. Yes, my name is Patrick Hunt. I work for Cloudera. I am currently the tech lead for data science and data engineering, although I've worked on a number of different projects, including creating the original configuration management and monitoring systems for Hadoop that we sell as part of our product. I've worked for Cloudera for about eight and a half years. Prior to that I was at Yahoo for about five, and the last two years there I was the tech lead for the ZooKeeper project.
01:24  Tobias Macey
And do you remember how you first got involved in the area of data management?
01:28  Patrick Hunt
I'm an engineer at heart. I've worked on a bunch of different projects. I like to work on interesting things. I'm sort of a generalist. I've worked in the networking space, I've worked in search. I got involved with ZooKeeper at Yahoo because we were looking at building out our internal cloud infrastructure, and we needed some distributed coordination capabilities. ZooKeeper was, at the time, part of Yahoo's research efforts. Ben Reed and Flavio Junqueira had done some efforts around creating algorithms and protocols for distributed systems coordination, and we were looking at productionizing that as part of our operations. And that's how I originally got involved with ZooKeeper, and I would say, you know, data systems in general.
02:19  Tobias Macey
And can you give a bit of an overview of what zookeeper is, and some of the history of the project and how it got started?
02:25  Patrick Hunt
Sure. ZooKeeper's been around for quite a while. I think it's over 12 years now, since the project was originally conceptualized around, like, the 2006 timeframe. At that point in time Google had published... Mike Burrows, who's amazing, had published the Google Chubby paper, which is a distributed lock system that Google had been using internally. I believe Ben and Flavio had looked at that and said, hey, that would be something great for Yahoo to have, and it would solve a number of problems that we were seeing in production, especially around operations. So one of the nice things about ZooKeeper is it addresses issues not only for developers, but also from an operational perspective. And that was one of the key areas initially that we looked at addressing, because it was kind of a tough issue. So like you said, 12 years ago ZooKeeper started in Yahoo Research. When we started productionizing it, that's when I got involved. And initially, around 2007, it was open sourced on SourceForge. At that point in time I was working with the Hadoop team at Yahoo as part of the work that I was doing, and we saw that Hadoop was open sourcing their project in Apache, and we decided to move the project to Apache. So around June of 2008 is when we moved ZooKeeper to Apache, and at that point in time it got a lot of uptake, especially from companies. I think the initial users were folks like LinkedIn and Twitter. One of the bigger projects to adopt it publicly was HBase, to solve some of their HA problems. And it's actually pretty interesting, sort of a why of why would you use ZooKeeper, right? We saw this a lot in the early days, where people would say, well, I could just have NFS and I can store some information on there and, you know, share it between a number of different systems, or I could put some data into MySQL or some database and share data. Why do I need ZooKeeper?
I think one of the key things was we provide a service that's reliable and available, and that allows you as a developer of some other distributed system to focus on your domain problem, right? So if you're building a key value store and you need it to be highly available, HA, let's say, it'd be great if you could use the service to get that done rather than implementing it yourself, which can be pretty challenging. So a lot of these companies came at it both from a development perspective, like HBase is a good example, where they want to fix their single point of failure problem with their HA, but also from an operations perspective. At Yahoo, what we were seeing is that a lot of the teams who were building distributed systems were doing so in a variety of different ways, as I described. It might be NFS, it might be MySQL, etc. You can imagine the kind of difficulty that makes for the operations team, who then has to learn a bunch of different ad hoc ways to manage these distributed systems. ZooKeeper centralized that and provided some best practices for how to do that as a developer, which would then flow through to the operations staff.
05:51  Tobias Macey
And all of those other options that you mentioned, NFS or MySQL, are just their own single points of failure that introduce their own particular headaches for how things might fail. So I imagine having a more robust mechanism, and one that is more widely used and widely understood, will help with the overall recovery effort, so that you don't have to try and remember all the different quirks of any given system. Because even though they might all centralize on MySQL, for instance, they might do it in any of 15 different ways.
06:24  Patrick Hunt
Absolutely. And in the early days, when we were first, you know, promoting ZooKeeper, that was quite a bit of the feedback that we would get: hey, you know, it used to be so easy, right? I would just kind of stuff something in NFS, and that worked perfectly fine. And generally it works fine, you know, until it doesn't work fine, you know, Saturday at one o'clock in the morning, and you end up with some issues. And that is another piece of feedback that we often get with ZooKeeper: wow, this is a lot more complicated, right? Like, I have to worry about all these issues that I never worried about before, like the network may fail, and, you know, what do I do about it? Having a service such as ZooKeeper that facilitates the distributed coordination, and is a central point that you can then share between different services, addresses a lot of those problems. I think you may have heard of the fallacies of distributed computing. Have you ever heard of that?
07:24  Tobias Macey
And also the falsehoods that programmers believe about networking?
07:29  Patrick Hunt
Absolutely. So Bill Joy and Peter Deutsch and those guys, I think, came up with the initial three or four. It was added to by Gosling and a few other folks, but, you know, things like: people assume that network latency is zero, they assume that bandwidth is infinite, they assume that the network is reliable and secure. Those are some of the main fallacies of distributed computing. And ZooKeeper puts that right in your face. It requires you to consider those issues. One thing I also usually have to mention to people is that ZooKeeper doesn't really solve all these problems for you, right? It provides you a framework to address them, but it's not, you know, magic pixie dust that we sprinkle on your system and everything is addressed. You still have to worry about all these issues, but ZooKeeper provides a framework for you to do so.
08:19  Tobias Macey
And so can you dig a bit into some of the different primitives that ZooKeeper provides as part of that framework, and the ways that it encourages developers and operators to think about and plan for some of these different failure cases?
08:37  Patrick Hunt
Absolutely. ZooKeeper was one of the early distributed coordination frameworks. There are a few that have come out relatively recently, such as Consul and etcd, which are great. Initially, as I mentioned before, there was Google Chubby, which is not an open source project, but was available within Google and kind of conceptualized through some of the papers that they published. But that was more of a distributed lock service; it was primarily involved with how do I take distributed locks. What ZooKeeper wanted to do is try to provide some primitives that would allow you to implement different use cases. We call those recipes in ZooKeeper. So there are various recipes. There are recipes for configuration, so distributed configuration management, that's probably one of the simplest ones; discovery, service discovery; we also provide distributed locks, leader election, distributed queues. We also support notifications as well, and that's usually used as part of these recipes in order to implement higher level primitives. So out of the box, ZooKeeper kind of looks like a file system. There's a hierarchy of what we call znodes. Znodes can have children, and they can also store data, which is a little bit different from a traditional file system's directories and files. And there are various capabilities, such as ephemeral nodes and sequential nodes, as well as notifications. Optimistic locking is also supported. These allow you to then take these various primitives and build the use cases, the things like distributed locks and leader election that I mentioned previously. Out of the box ZooKeeper provides these primitives; there are other tools such as Curator that provide a more high level abstraction.
So if you're looking to just get started and want to do leader election, you probably want to pull in another Apache project called Curator, which is separate, and which provides this higher level of abstraction using Java object oriented primitives. Let's see. So ZooKeeper itself is a distributed system. It's made up of a number of servers that we call an ensemble, and the way that the system works is we elect a leader in that ensemble and we form a quorum. So there will be a leader with some number of followers that are made up by the initial list of servers. So you might have three servers, you might have five servers. You typically want an odd number, because it is a majority leadership election that happens to decide within ZooKeeper itself. So this is often a confusing area: ZooKeeper itself is electing a leader, even though within your application you might want to do leadership election, and that's different. ZooKeeper itself is electing a leader amongst the servers. And as long as the majority of that initial ensemble of servers is talking to each other and has formed a quorum, the service is up and available. That means, for example, you can have three servers, and if one of the servers fails, the service is still live. You could have five servers, and if two of the servers fail, the service is still available. So the clients... we provide a client library that you embed into your application, and as long as the clients can talk to the quorum, the servers that are able to communicate with each other, you get the services that I described previously. You could implement leader election, and the leadership election would be valid and current as long as your client can talk to the quorum, to one of the servers that's making up the majority. Does that kind of make sense?
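The majority rule Patrick describes is simple arithmetic, and it is worth seeing why three servers tolerate one failure and five tolerate two. The sketch below is a hand-rolled illustration of that arithmetic, not actual ZooKeeper code:

```java
// Illustration of ZooKeeper's majority-quorum rule (not ZooKeeper source code).
public class QuorumMath {
    // Smallest number of servers that constitutes a majority of n.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // How many server failures the ensemble survives while staying available.
    static int toleratedFailures(int n) {
        return n - quorumSize(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 4, 5}) {
            System.out.printf("ensemble=%d quorum=%d tolerates=%d failure(s)%n",
                    n, quorumSize(n), toleratedFailures(n));
        }
    }
}
```

Note that a four-server ensemble needs three members for a majority, so it tolerates only one failure, the same as three servers; that is why odd ensemble sizes are the convention.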
12:12  Tobias Macey
Yeah, absolutely. And you mentioned that most of the higher order capabilities, such as being able to easily do leader election without implementing the underlying logic on top of it, or being able to do optimistic locking, are delegated to other client frameworks. So it sounds like most of those aren't built into ZooKeeper itself, and the project has opted to stick with just having these primitives and building blocks to make it more of a framework type approach, leaving a lot of the specifics of how those primitives are used up to the application and system developers.
12:51  Patrick Hunt
Sort of, yes. We do provide, as I mentioned previously, this concept of recipes, so we essentially define how the various primitives can be put together to implement the use cases, such as leader election, distributed queues, etc. And certainly many projects have just done that themselves on top of the ZooKeeper primitives. I personally think, from seeing the number of systems that use these things, it's better to use one of these third party higher level frameworks on top of ZooKeeper, such as Curator, just because if all you want is vanilla leadership election, it's a lot easier and safer to use the Curator provided capabilities than to write the code to do so yourself, because it allows you to just really focus down on your domain specific problem. Because again, you're probably a developer who's building, let's say, a key value store. You really want to focus on building the key value store, and the less you have to focus on how exactly distributed locking works, the better. Now, there might be cases, and there certainly are cases, where you have some specialized use case that could benefit from implementing things on top of the primitives directly, and certainly projects do that as well. And ZooKeeper is not the be all end all of distributed coordination. Raft is very popular, because sometimes people want to implement the distributed coordination themselves rather than offloading it to a project such as ZooKeeper, because you can squeeze out potentially more performance, or you can optimize it for particular use cases. When you have a generic system, a general system such as ZooKeeper, you're obviously going to have some trade offs. And it's kind of the 80/20 rule, although I probably would argue in this case it's more like the 99/1 rule, where for most people ZooKeeper is probably the best choice, especially initially, when you're originally building your system.
And then over time, you might get to a level of sophistication where, you know, in order to optimize certain cases, you would either use the ZooKeeper primitives directly or potentially even implement distributed coordination yourself.
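For a sense of what a recipe looks like underneath a library like Curator: the classic ZooKeeper leader election recipe has each candidate create an ephemeral sequential znode under an election parent, and the candidate owning the lowest sequence number is the leader; when its session expires, its znode vanishes and the next-lowest takes over. The following is a self-contained simulation of just that selection logic (no ZooKeeper server or client library involved, the class and method names are invented for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

// Simulates the selection step of the classic ZooKeeper leader-election
// recipe: each candidate creates an ephemeral sequential znode, and the
// candidate holding the lowest sequence number is the leader.
public class LeaderElectionSim {
    // sequence number -> candidate id; TreeMap keeps keys in sorted order.
    private final TreeMap<Integer, String> znodes = new TreeMap<>();
    private int nextSeq = 0;

    // Analogous to create("/election/node-", ..., EPHEMERAL_SEQUENTIAL):
    // returns the sequence number assigned to this candidate's znode.
    int join(String candidate) {
        int seq = nextSeq++;
        znodes.put(seq, candidate);
        return seq;
    }

    // Session expiry deletes the candidate's ephemeral znode.
    void sessionExpired(int seq) {
        znodes.remove(seq);
    }

    // The candidate owning the lowest remaining sequence number leads.
    String leader() {
        Map.Entry<Integer, String> first = znodes.firstEntry();
        return first == null ? null : first.getValue();
    }
}
```

In the real recipe each non-leader watches only the znode immediately below its own, rather than the election parent, so that a leader change wakes one candidate instead of the whole herd.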
15:08  Tobias Macey
Beyond things like leader election or locking, what are some of the other types of system level features that somebody who's building an application or a distributed system on top of ZooKeeper will typically need to implement themselves, that are still too special-case to have just a general solution for?
15:28  Patrick Hunt
I would say things like distributed locking, discovery, distributed configuration (so essentially a key value store), leadership election, and distributed queues cover most of the cases. Where people often get into trouble is where they try to use ZooKeeper for things it wasn't originally intended for. So, it looks like a file system, so I'll use it as a file system. I often get the question, "I have one gigabyte files and I want to store them in ZooKeeper, will that work? I need to share these files between various systems." Now, of course it'll work if you have one one-gigabyte file, but if you start scaling that out as your application grows, you can certainly run into problems. Also, ZooKeeper itself is not horizontally scalable. So as I mentioned previously, you have a number of servers that form the ensemble, three or five. You typically don't want one, by the way, because that's not very reliable; there's no failover support there. But having 23 is not the best idea either, because all the operations go through the leader, and having 23 servers to coordinate with adds a lot of overhead. So you typically want three or five in order to facilitate that failover characteristic, but it's not going to give you higher performance. So that's one of the things certainly to consider with ZooKeeper: there is a limit to the number of operations that you can perform per second, the throughput of the system. You know, that is limited by the system, the networking, etc., the architecture of the system. And there's no way to say, you know, add more nodes and make it more performant. It actually decreases the performance as you add more nodes.
So that's something to consider: when you're building an application, you do the distributed coordination parts in ZooKeeper, but overloading it with all kinds of other things, like, hey, it'd be nice if I could use ZooKeeper as a traditional key value store or as a database, those are not the best choices. I don't know if that's answering your question directly.
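As a concrete guardrail against the "one gigabyte files" misuse Patrick mentions, ZooKeeper enforces a znode payload limit through the `jute.maxbuffer` Java system property, which defaults to roughly 1 MB. It is a JVM property rather than a `zoo.cfg` setting, and raising it is generally discouraged; a hedged sketch of how it is typically passed to the server (the property name is real, the value and env-var usage here are illustrative):

```shell
# jute.maxbuffer is a JVM system property, not a zoo.cfg key; its default is
# about 1 MB. Raising it (illustrative 4 MB value below) is discouraged, since
# large znodes inflate quorum replication and snapshot costs.
export SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"
```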
17:37  Tobias Macey
Yeah, no, that's a good answer. It's definitely useful to have that additional context. And one of the things that that leads me to question: you mentioned that adding new nodes to the system doesn't increase the throughput, but from an operational perspective, does it support being able to dynamically remove and add nodes into the cluster for recovery purposes?
18:01  Patrick Hunt
Until recently, the configuration of ZooKeeper itself, surprisingly enough, was fairly static. So there are configuration files that define the ensemble, and the only real way to make changes to that would be through changing the configuration files and doing a rolling restart of the servers. The most recent versions that we're working on right now, introduced, I believe, in 3.5, which is about to come out of beta to the stable release train relatively soon, added dynamic configuration support to ZooKeeper itself. So that is the first version that actually supports dynamic changes to ZooKeeper, aside from just doing a rolling restart of the service, which is the way things have been managed previously. And that's actually a very popular feature now. A number of users are already using that capability, even though it's not out of beta, just because it's seen as super valuable to be able to modify the ensemble size dynamically. But we do support it through rolling restart. Generally, you don't make too many changes to ZooKeeper. Once you've defined the ensemble and rolled that out into production, let's say, generally, in my experience, that doesn't change very often. Typically it changes when there's a failure. So if one of the machines has a hardware failure, or you need to make changes within your networking environment, within your production environment, you might want to make some changes to the ZooKeeper ensemble. But once it's created and up and running, it's pretty hands off. One of the precepts, one of the principles that we built into ZooKeeper from the very early days, was that it should be pretty hands off.
So once you set up the ensemble, it forms a quorum itself, it elects a leader. If one of the machines fails or goes offline or gets partitioned, say there's a network partition where one of the servers of the ensemble is partitioned off from the main cluster, from the quorum, for some period of time, then when that partition heals and the server rejoins the quorum, it will automatically update itself to the latest state of the service. So that's without any operator involvement at all.
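The static configuration Patrick describes lives in ZooKeeper's `zoo.cfg`. A minimal three-server ensemble looks roughly like this (the hostnames are placeholders; by convention 2888 is the quorum port and 3888 the leader-election port):

```properties
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# server.<id>=<host>:<quorum-port>:<leader-election-port>
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Before 3.5, changing the `server.N` lines meant editing this file on every node and rolling-restarting the ensemble; 3.5's dynamic reconfiguration adds a `reconfig` command so membership can change without restarts.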
20:25  Tobias Macey
And can you dig into how ZooKeeper itself is architected, and some of the evolution of the internal design over the years that it's been built and maintained?
20:38  Patrick Hunt
Like I said, ZooKeeper has been around for quite a while, for about 12 years or so, and it's a Java application. I think Java 1.5 was the primary Java version at the time. So if you look at the code, there are definitely, you know, definitely some legacy aspects of the fact that ZooKeeper was written at a time before many of the capabilities or niceties that have been added to Java over the years were available. So certainly we use synchronized locks in many cases, rather than some of the higher level primitives that are available today. Once you have a system that works and is used in production by many folks and has been tested, you know, both directly and tested by time, it makes it hard to make changes to the core of the application. So one of the things you'll definitely see is that, you know, the code was written in a previous era, if you will. The basic architecture of ZooKeeper is a message passing system. The clients pass messages to the servers, and the servers pass messages to each other. If you are familiar with SEDA, which stands for staged event driven architecture (I think Eric Brewer was one of the folks who originally helped to define that), that's basically the underlying architecture of ZooKeeper. Another aspect of our initial work and our initial principles was to try to build a system that favored availability and reliability over all else. So ZooKeeper is your distributed coordinator; it is your distributed single point of failure, as I often like to say. So we definitely take very seriously the fact that, you know, ZooKeeper is the underlying engine for your system. If something needs to fail over, it needs to be highly reliable, highly available, otherwise your system won't work, and obviously people won't use ZooKeeper if that's the case.
So we definitely favored availability and reliability over all else. Many of the underlying implementation specific details, such as the use of SEDA, were chosen because we wanted to make the code as simple as possible, because we knew that we would make mistakes, and we wanted to limit the number of mistakes that we made. So the main event pipeline, for example, for the leader and for the followers, that processes these change requests and read and write requests from the clients, has traditionally been single threaded as a result. Because we knew that we could improve performance, but at the same time we would reduce reliability, because the likelihood that we would introduce bugs would be higher. So we actually limited the amount of, for example, threading that we did internally, to improve availability and reliability at the cost of, you know, the higher performance we could have had. Now, recently a number of folks have been doing work to improve performance and throughput, and those changes are starting to go in now. So there are some multi threaded pipeline pieces in the processing of the client requests. Those are relatively recent changes, and they've been, you know, battle tested. The nice thing about having large companies that run ZooKeeper in production is that it allows us to really get some battle hardening before we ship it out to the rest of the folks.
And the Apache process itself facilitates that as well, right, since it's open source.
24:29  Tobias Macey
So in terms of the design of the system, as far as where it falls on the axes of the CAP theorem, it seems that you're favoring availability and partition tolerance at the expense of some amount of consistency. Because in some cases, if two different parts of the system get a different response, it's not necessarily the end of the world, as long as everything can keep running.
24:52  Patrick Hunt
Yeah, surprisingly, though, ZooKeeper is actually more of a CP system on the CAP side. So if you're a client and you make a request to ZooKeeper, and you get a success back from ZooKeeper, you're guaranteed that it's consistent. Again, the service itself is using a leader election with a majority rule. So if a majority of the ensemble, so two out of three if you're running three servers in ZooKeeper, are able to talk with each other and coordinate, the service is up and available. If you fall below the majority, then the service will stop and won't be available. So the client, as I said, is guaranteed that if it gets a success, it's consistent. One of the trade offs that we make, however, is that we don't guarantee that the clients see a consistent view of the world. As an individual client, you're always guaranteed that the world moves forward, you never move back in time, and the information that you're seeing is consistent, at least with respect to the server that you're talking with. So I didn't mention this before, but the clients are talking to a particular server in the ensemble, in the quorum. And the servers that make up the ensemble will only broadcast themselves as available, will only accept connections from the clients, if they're a member of the quorum. So if they are able to form part of that majority that I described, they'll allow clients to talk to them and they'll service requests. However, you might have client A that is talking to one of the servers that's currently in the quorum, one of the majority servers, and that would be up to date. However, you might also have a client B that's talking to a follower of the ensemble, of the quorum, that has been temporarily partitioned off from the service. And there are timeouts, so there are session timeouts for the clients.
So when you establish a session with the service as a client, you specify a session timeout. I'm diverging here a bit, but if you're doing leader election, you're basically saying, I'm willing to tolerate a certain amount of time being out of date. Of course, it's a distributed system, so you're always going to be out of date, right, fractionally, but how much can you tolerate being out of date? So if you want to be the leader, and you can tolerate being out of date for 10 seconds, you would establish a session, initially, of 10 seconds, and what that guarantees you is that you're always up to date with the main service within 10 seconds. And that's where that CAP aspect kind of comes into play. So if you're a client that's talking to a follower that's been partitioned from the service, you might not know about that for, let's say, 10 seconds. Again, your client will always see a consistent view of the world, in that things are always moving forward. For example, say the server that you're connected to actually fails. You would have to disconnect and reconnect to another server, and you're always guaranteed, again, that things will move forward in time. But you're not guaranteed that two clients will see a consistent view of the world, which can be a little bit challenging for folks who are building on top of ZooKeeper. But it kind of goes back to, you know, the fallacies of distributed computing.
You're always guaranteed that you're going to be behind; it's just a question of how far behind you're going to be.
28:27  Tobias Macey
And in terms of building and maintaining the ZooKeeper project itself, particularly as a centralized component of distributed systems, what have you found to be some of the most complex or difficult aspects of building and maintaining the project and educating users on how best to take advantage of it?
28:48  Patrick Hunt
I think in the early days, a big issue was just educating people and making them aware of the capabilities of the system, what it could do and what it couldn't do. We ourselves were working out, you know, what the best use cases were for ZooKeeper and when a different system would be a better choice. We were building community as part of the Apache aspect of building ZooKeeper. Initially, we were the only ones writing the code, and we were using it internally at Yahoo. Then once we moved to Apache, a big part of it was just bringing on additional like-minded folks as part of the community who could work with us to continue ZooKeeper development and keep it focused on distributed coordination, versus potentially adding other features that weren't towards that main goal. I think over time, as the system has matured, there's a lot of information out there about ZooKeeper, and we're starting to see a few additional systems over the last few years that provide similar kinds of capabilities, so more people are educated. So I would say education was one of the biggest things, especially initially. Today, I think we face a lot of issues just from having a system that's been around for many years, with a lot of people using it and relying on it to build their systems on. So things like maintaining backward compatibility are a huge problem today. When we find a bug or a security issue and want to make changes, that can really be difficult, because again, ZooKeeper was initially created, let's say, 12 years ago. Things like Netty didn't exist at the time, and Java itself didn't support SSL for network communication at the time, so we ended up building a lot of this ourselves. Protocol buffers didn't exist, etc.
So we ended up building our own version of such things. If I was building a system today, I'd be taking advantage of all these things, and it would make it a lot easier to do schema migration on my transport-level marshaling and unmarshaling of data, etc. We didn't have all those things, so we ended up building them in. And I think over time, as you add layers of code, as you add complexity, as you try to fix bugs that you didn't expect, the code just gets more and more complex. I'd say that's one of the bigger challenges we face today, especially maintaining backward compatibility, because that is a value that people place on ZooKeeper as well: the fact that our versions are backward compatible, that people can upgrade to a new version with fixes and improvements and not need to change their code much, if at all.
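The schema-migration problem Patrick mentions is exactly what tag-based wire formats like Protocol Buffers solve. Here is a minimal sketch of the idea (a hypothetical format for illustration, not ZooKeeper's actual jute-based protocol): each field is prefixed with a numeric tag, so an old decoder can skip tags it doesn't recognize, and new fields can be added without breaking existing readers.

```python
def encode(fields: dict[int, bytes]) -> bytes:
    """Encode fields as a sequence of (tag, length, value) records.
    Assumes tags and value lengths fit in one byte each."""
    out = bytearray()
    for tag, value in fields.items():
        out.append(tag)
        out.append(len(value))
        out.extend(value)
    return bytes(out)

def decode(data: bytes, known_tags: set[int]) -> dict[int, bytes]:
    """Decode, silently skipping tags this (older) reader doesn't know."""
    fields, i = {}, 0
    while i < len(data):
        tag, length = data[i], data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if tag in known_tags:
            fields[tag] = value
        i += 2 + length
    return fields

# A "v2" writer adds field 3; a "v1" reader that only knows tags 1 and 2
# still decodes the message instead of choking on the wire format.
msg = encode({1: b"node-a", 2: b"hello", 3: b"new-field"})
print(decode(msg, known_tags={1, 2}))  # -> {1: b'node-a', 2: b'hello'}
```

A fixed-layout format, by contrast, forces every reader and writer to agree on the exact byte layout, which is the position ZooKeeper's bespoke protocol ended up in.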
31:36  Tobias Macey
And you mentioned that scaling for ZooKeeper is mostly about improving the reliability of the cluster more so than the throughput, but that there have been some recent changes to improve overall performance of the system by introducing some threading capabilities. So I'm wondering, what are some of the main scaling factors for ZooKeeper and the systems that rely on it, and some of the edge cases that users should be aware of in terms of utilizing ZooKeeper and ensuring that it is operationally reliable?
32:15  Patrick Hunt
Sure, a lot to unpack there. Let me see if I can go through some of them. So one aspect I didn't talk about too much before: under the covers, ZooKeeper is using a write-ahead log and a snapshot mechanism to store persistent state. ZooKeeper is effectively an in-memory database, so all of the information needs to be in memory. That's certainly one thing to consider: the amount of information you put into ZooKeeper is really limited by the heap size that you're allocating for the JVM that's running the servers. ZooKeeper backs that information up to the file system using traditional WAL and snapshot operations. One of the primary limitations as a result is how frequently and how quickly we can fsync the write-ahead log. As clients make changes, they're doing writes into the system, and one of the primary throughput limitations of ZooKeeper is that we need to write the changes to the write-ahead log and fsync it before we can respond to the client that the write operation it's trying to perform was successful. So the client will propose a change to the ZooKeeper service, that gets forwarded to the leader, and as part of the atomic broadcast protocol, Zab, the leader will communicate with the followers. As soon as a majority of the servers have fsynced the change to their disks as part of the write-ahead log, that change will be communicated back as successful to the client. So that's one of the primary limitations of ZooKeeper in terms of throughput. What we often see in operations is that people try to co-locate as many services as possible to reduce costs and reduce their overhead in terms of the number of systems they need to manage.
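The commit rule in that write path can be sketched as a small simulation (an illustration of the protocol's shape, not Zab itself): a proposed write is acknowledged to the client only once a majority of servers report that they have fsynced it to their write-ahead logs.

```python
def commit_write(ensemble_size: int, fsync_acks: list[bool]) -> bool:
    """A write commits only when a strict majority of servers have
    durably logged (fsynced) it; slow or failed disks delay the ack."""
    acked = sum(fsync_acks)
    return acked > ensemble_size // 2

# 5-server ensemble: 3 fsync acks is a majority, so the client gets
# a success even though two servers are lagging.
print(commit_write(5, [True, True, True, False, False]))  # -> True
# Only 2 of 5 acked (e.g. a co-located service is hogging the disk):
print(commit_write(5, [True, True, False, False, False])) # -> False
```

This is also why fsync latency, rather than CPU, tends to be the write-throughput bottleneck: every committed write waits on a majority of disks, so a co-located service hammering the same disk directly slows ZooKeeper down.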
So if you end up co-locating ZooKeeper with another service that heavily utilizes the underlying disk, that can actually slow ZooKeeper down quite a bit. We actually recommend that if you're using a mechanical drive, you dedicate a spindle to it. SSDs help in that regard, although SSDs have their own failure modes that need to be considered; the write cliff in particular has been an issue in the past, and that's an issue that's typically dependent on the firmware of the drive, which makes it particularly difficult to track down. But from an operations perspective, you want to size the ZooKeeper service based on the number of failures that you want to be able to support. So if you have three servers, as I mentioned, one of the servers can die and the service will still be available; if you have five, two can die. Five is kind of nice for production serving because, while it doesn't improve your throughput, it does mean that if you have to take one of the servers out for maintenance, you can still suffer an unexpected failure and the service will still be live. Let's see. I'm sure I've probably missed a couple of your points, so redirect me if you want, but those are some of the primary things to consider from an operations perspective and in terms of sizing ZooKeeper.
35:44  Tobias Macey
Yeah, that's all valuable information. And given the in-memory aspects in particular, it helps with instance type selection for somebody who might be running in a cloud environment to know that they should optimize on the memory side more so than the disk side, for instance. So for people who are doing that deployment, it's good to have that knowledge. And you were talking about co-locating services, as in ZooKeeper running alongside some other server process, but what about the other direction: using ZooKeeper as a multi-tenant service? I'm wondering what the capacity is as far as being able to, for instance, use ZooKeeper as the coordination layer for maybe a Kafka cluster and a Flink cluster using the same underlying ZooKeeper deployment?
36:35  Patrick Hunt
Yeah, that's a great question. For most distributed coordination efforts, you're not going to be maxing out ZooKeeper. You're not going to have just three machines trying to manage a distributed lock; you might have thousands of machines managing a distributed lock, but the number of operations you're performing as part of that is relatively small. And, not something I mentioned before, but in terms of multi-tenancy and scalability, ZooKeeper is optimized for read-heavy workloads. The thought is that you're going to be doing a lot more reads than writes. So let's say you're doing leader election: that requires a small set of writes, and then all the followers need to basically read the znode, in this particular case read one piece of data from ZooKeeper, and you're going to have potentially large numbers of systems doing that, and ZooKeeper handles that particularly well. ZooKeeper is a hierarchical namespace; as I mentioned, it looks very much like a file system, and that does facilitate having, you know, /apps/kafka, /apps/solr, /apps/flink, etc. And we have a very primitive chroot kind of facility: when you have an application that's connecting to ZooKeeper, it can actually root itself in one of these sub-namespaces. So you could have /apps/kafka, and Kafka could root itself at /apps/kafka, so that subtree would just look like / to it. So there are some multi-tenancy capabilities built into ZooKeeper. And if you're using ZooKeeper properly, there's a lot of unused capacity, maybe that's the word I'm looking for; again, it's not like you're doing large amounts of operations on ZooKeeper all the time.
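The chroot facility he describes is just a path prefix applied on the client side. A minimal sketch of the mapping (a hypothetical helper, not the real client API, though the real Java client does accept a chroot suffix on its connect string):

```python
def chroot_path(chroot: str, client_path: str) -> str:
    """Map a client-visible path into its tenant's subtree, the way a
    chrooted ZooKeeper connection prefixes every request path."""
    if client_path == "/":
        return chroot
    return chroot + client_path

# A Kafka client chrooted at /apps/kafka sees "/" as its root:
print(chroot_path("/apps/kafka", "/"))             # -> /apps/kafka
print(chroot_path("/apps/kafka", "/brokers/ids"))  # -> /apps/kafka/brokers/ids
```

In the real Java client this is expressed in the connect string, e.g. `host1:2181,host2:2181/apps/kafka`: everything after the host list becomes the chroot for that session.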
But there's quite a bit of unused capacity, typically, in a ZooKeeper system if you're using it properly, because you're just not doing that many writes. If you're doing leader election, it's kind of bursty: when something fails, you might have a burst of activity, but for the most part it's quiescent. And in terms of operations in a multi-tenant environment, there definitely are some things we could do, and that's one of the areas that has had recent development effort. Facebook uses ZooKeeper extremely heavily for their internal configuration management and internal coordination, and they've contributed, and are in the process of contributing, a lot of changes, because they use it at extremely high scale in a multi-tenant environment. So I would say the fact that we have the namespace facilitates the multi-tenancy aspect pretty well. In my own day-to-day here at Cloudera, we're running Hadoop systems, and there are quite a few systems in the Hadoop ecosystem that rely on ZooKeeper nowadays for their distributed coordination: HDFS, HBase, etc. Those systems are all typically sharing a single ZooKeeper, at least in a Cloudera deployment, for example, and that works perfectly fine. There's quite a bit of overhead, or unused capacity, let's say, in a ZooKeeper environment, because you need three or five servers to facilitate the ZooKeeper service, so there's just a lot of extra capacity based on the fact that you need multiple machines.
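The leader-election pattern he keeps referring to is typically built from ephemeral sequential znodes: every candidate creates a numbered node, the lowest number is the leader, and each candidate watches only the node just ahead of it so a failure doesn't stampede the whole herd. A toy in-memory sketch of that recipe (hypothetical names; a real implementation would use a ZooKeeper client, ephemeral nodes, and watches):

```python
class ElectionSketch:
    """In-memory model of the ephemeral-sequential election recipe."""

    def __init__(self):
        self.next_seq = 0
        self.members: dict[str, int] = {}  # candidate -> sequence number

    def join(self, name: str) -> int:
        """Like creating an ephemeral sequential znode under /election."""
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.members[name] = seq
        return seq

    def leave(self, name: str) -> None:
        """Like the ephemeral node vanishing when a session expires."""
        del self.members[name]

    def leader(self) -> str:
        """Lowest surviving sequence number wins."""
        return min(self.members, key=self.members.get)

e = ElectionSketch()
for node in ("a", "b", "c"):
    e.join(node)
print(e.leader())  # -> a  (lowest sequence number)
e.leave("a")       # leader's session expires...
print(e.leader())  # -> b  (next in line takes over)
```

Note the read/write shape this produces, which matches the read-heavy optimization above: a handful of writes when membership changes, and lots of reads by everyone following the leader.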
40:08  Tobias Macey
And as you mentioned before, in recent years there have been some other projects created and released for doing this kind of centralized coordination of distributed systems, such as etcd or Consul. I'm wondering how ZooKeeper compares with some of those other projects, and what some of the cases are where ZooKeeper might be the wrong choice and you would prefer to use something else?
40:34  Patrick Hunt
Yeah, that's a good question. Just as ZooKeeper itself was inspired by systems such as Chubby back when we were conceptualizing what we should build, obviously optimized for our particular use cases, I think the same thing has happened in these other spaces. So Consul, which is a great product, and etcd, which facilitates a lot in the Go space, especially with Kubernetes; quite a few of the capabilities of Kubernetes rely directly on etcd. I think etcd in particular was influenced not only by ZooKeeper but also by Raft. John Ousterhout and some of his folks did a bunch of work, more on the education side, I would say: they took some of the concepts that had been floating around and put them together again slightly differently, and that gives you Raft, which is a great toolset for learning distributed coordination, especially if you're in that minority of folks who need to implement distributed coordination yourself. Raft was really conceived, at least in my opinion, more from an educational perspective. Paxos is great, but it's really hard to implement properly. ZooKeeper is good because it provides you this service, but what if I need to build my own distributed consensus system? How should I go about doing that? What's the best way to do it? How would I be successful? Raft documented really well how to do that, and then other systems such as etcd were able to take those concepts and build around them. So if you look at etcd, and I'm less familiar with Consul but I believe it's also the case there, they looked at some of the best things out of ZooKeeper and some of the other systems and implemented their own system based on that, again for their specific use cases.
Also, ZooKeeper was created, again, 12 years ago, and things like JSON and some of these other de facto best practices for building systems, you know, Swagger APIs, etc., didn't exist at the time. So ZooKeeper provides a Java client interface, it also provides a C interface, and many people have built their own language-specific implementations of ZooKeeper clients on top of those. The same dynamic applies here: if you're in the Go space, if you're building a Kubernetes app, of course you're going to use etcd. Hopefully you're not going to have to run two distributed coordination systems; I personally would hate to have to do that. So I think if you're in an environment where you have a distributed coordination system already, you probably would want to use that one.
43:16  Tobias Macey
And given the length of time that ZooKeeper has been around, and the massive changes and developments that have occurred in that time span, I'm wondering how you have seen the needs of distributed systems engineers, and the systems themselves, change since you first began working with ZooKeeper.
43:36  Patrick Hunt
How has it changed over time? I think people are just much more educated now than they were when we originally started. As I mentioned, there was Paxos and some of these other things, but building large distributed systems was a really sort of niche area, and it's become more and more mainstream. You know, if you go to Amazon and you want to build an application, of course you're going to want to build something that has HA capabilities and is likely going to need to scale to the point where you're going to have to run multiple machines. And you need to consider the operational aspect as well, because you have multiple teams that need to work together. How have things changed over time? I think people have just become much more sophisticated, and the demands have increased in terms of the kinds of capabilities that you need; the tolerances have just decreased, right? If it was an inch before, now it's a fraction of an inch. And I think people rely on things like Consul and etcd and ZooKeeper more and more. I have been a little bit surprised that there haven't been more distributed coordination systems available; it's only relatively recently that etcd and Consul have really come out and become available. I guess the thing that I've been most surprised by is that there aren't more systems like this.
45:02  Tobias Macey
As you were saying, the codebase itself has accrued a number of peculiarities because of the age of the original implementation. So if you were to start everything over today, greenfield, knowing what you know now, I'm curious what you would do differently, and whether you might choose a different language runtime to build it in, or if you've been happy with Java as the implementation target.
45:28  Patrick Hunt
I think we've been super happy with Java. It's a known quantity. Again, if you're building something, you're going to spend, you know, months, a year, two years building the system, and you're going to spend many, many more years operating it. Once you build a system that other systems rely on, that thing is going to exist for a long period of time, and you mentioned a number of systems, Kafka, Flink, HBase, etc., that rely on ZooKeeper; that's not an easy thing to change from their perspective either. One of the big changes today is that there are a lot more facilities available for building a distributed system, such as Netty and other capabilities for transport-level communication, protocol buffers, etc., that support schema migration, which is one of the areas we kind of suffer from in ZooKeeper today. Because our implementation is kind of bespoke, we didn't build in certain niceties like schema migration, which can make it difficult for us to change our on-the-wire protocol, which we occasionally need to do if we're adding more capabilities or trying to fix a bug. So certainly I would be leveraging a bunch of tools that exist today in the open source ecosystem that didn't exist at the time. I think Java was a great basis for it. I might be tempted to use Rust today, but again, getting back to the operational aspect, I'd definitely want to choose something that the people operating the system felt comfortable with, both for making a system that's reliable and available, and for building something that they can operate, which would include things like monitoring and debugging.

Tobias Macey

We've listed some of the well-known projects that have built on top of ZooKeeper, but I'm wondering if there are any particularly interesting or unexpected ways that you've seen ZooKeeper used?

Patrick Hunt
I've seen a number of sort of anti-patterns through the years, certainly. One of the things I'm kind of consistently surprised by is the number of projects that continue to adopt ZooKeeper as their distributed coordination mechanism. You know, it's easy when you're making the sausage to see the problems under the covers, to see the bugs and the kinds of issues that we're facing. But it's pretty encouraging to see that there are many new systems picking up ZooKeeper as their distributed coordination mechanism, even though there are other systems available today that are potentially great options as well. I can't think of anything off the top of my head that is particularly surprising. At the end of the day, there's a very common set of problems that people who are building systems need to solve, such as leader election, service discovery, locks, distributed queues, and systems are picking ZooKeeper up and using it for those. Those are the most common things that I see. I can't say that I see crazy things; nothing, at least, comes to mind.

48:35  Tobias Macey

And going forward, what are some of the plans that you have for the future of ZooKeeper, either in terms of improvements, new features, or capabilities, or anything in terms of the community or educational aspects of the project?
48:50  Patrick Hunt
ZooKeeper is a community-based project and has been at Apache for quite a long time, which is great, because people come and go, new requirements come and go, and companies come and go, but at the end of the day you always have that Apache community that you can go back to. We've been working really hard over the past few years to come out with an updated release that includes things such as dynamic configuration support for ZooKeeper itself, so changing the ensemble dynamically. I'd say that's one of the biggest things we've been working on, really shipping that. There's been quite a bit of work on security, and we continue to work on security, just hardening the whole system, both in terms of availability and reliability, but also in terms of security; getting so many eyes on it is great from that perspective as well. It's a pretty mature project, you know, being around for 12 years. One of the big things we've been seeing recently, like I said, is that Facebook has been contributing a number of changes. They've been doing a lot of work on observability, because they use ZooKeeper so heavily throughout their infrastructure that monitoring and management of these clusters becomes even more important for them. So they've been contributing a lot back to ZooKeeper around observability, which I think is really great, making it even easier to manage. You mentioned multi-tenancy before; I think that's one of the areas we could also do some more work on, and as we get more sophisticated in our use cases, multi-tenant use cases in particular are probably one of the areas we'll continue to focus on more. But it is a pretty mature project. We've added some capabilities recently, like TTLs, which I think is a capability that we identified out of etcd.
Or maybe it was Consul, but we ourselves are seeing some of the changes that are going into these other systems and saying, hey, wouldn't it be great if you could set a time-to-live on ZooKeeper data, which is not a capability that we had until relatively recently?
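A TTL node is simply a znode that the server deletes once it has gone unmodified for longer than its time-to-live (in ZooKeeper itself, I believe this arrived in 3.5.3 via CreateMode.PERSISTENT_WITH_TTL, with the rule that TTL nodes with children are not deleted). A toy sketch of the expiry rule, not the server implementation:

```python
def expired(last_modified: float, ttl: float, now: float,
            has_children: bool) -> bool:
    """A TTL node is eligible for deletion once it has gone unmodified
    for longer than its time-to-live; nodes with children are kept."""
    return not has_children and (now - last_modified) > ttl

# Node written at t=0 with a 30-second TTL:
print(expired(last_modified=0.0, ttl=30.0, now=10.0, has_children=False))  # -> False
print(expired(last_modified=0.0, ttl=30.0, now=45.0, has_children=False))  # -> True
```

This covers the classic "stale registration" case, such as a service-discovery entry that should disappear on its own if its owner stops refreshing it, without requiring an active session the way ephemeral nodes do.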
50:55  Tobias Macey
So we're learning things as we go along as well, which is great. And are there any other aspects of the ZooKeeper project, or distributed systems primitives, or centralized coordination for these types of systems, that we didn't cover yet which you think we should discuss before we close out the show?
51:15  Patrick Hunt
Other aspects? You know, I think we touched on it, but the operational side of things is a huge, huge issue. It's easy to provide these capabilities and to convince people to use these systems, but putting them in production, making sure that they're actually really reliable and really available, proving them out for the particular user's use case, is a lot of work and a lot of effort. And it's one of the main things that separates a successful project from an unsuccessful one. So I would say that's one of the main areas.
51:49  Tobias Macey
Right. And for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
52:07  Patrick Hunt
I think one of the main things that I see again goes back to the operational aspect. One of the areas of concern that I have is around observability. Whether you're building on top of ZooKeeper, or you're trying to build a data pipeline, or trying to build a data science model that you want to get into production, there are going to be unexpected hiccups. ZooKeeper still has bugs, surprisingly, after all these years, and occasionally you'll run into one of them. How do you identify that issue that's going to impact your system? How do you track it? How do you debug it? How do you make sure that you get that notification and resolve the issue as quickly as possible? Whether you're using ZooKeeper, or writing a data pipeline where the data may change unexpectedly and suddenly your ETL processing starts to fail, or you've deployed a data science model into production and you're suddenly getting unexpected results for some reason: how do you identify that before it impacts your business in a negative way? I see a lot of systems and a lot of work being done, and it's great work, but there are so many components these days that have to be pulled together to build a system. How do we integrate all of them? How do we make sure that we can run that in production and quickly identify problems and resolve them? That's one of the areas I spend a lot of time on today, helping people not only with ZooKeeper but in other aspects of the work that I do, and I think that's one of the areas that we need to continue to focus on as a community.
53:51  Tobias Macey
Well, thank you very much for taking the time today to join me and discuss the work that you're doing with ZooKeeper. It's a system that I've been aware of for a long time, and I've heard a lot of people talking about it, so it's great to be able to get a deeper dive and a better understanding of the overall scope and history of the project. So thank you for that, and I hope you enjoy the rest of your day.
54:12  Patrick Hunt
Thanks for having me on and chatting about ZooKeeper. I really appreciate it. Thank you.