Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

subscribe
share






episode 78: Unpacking Fauna: A Global Scale Cloud Native Database [transcript]


Summary

One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first class support for temporality that simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple to manage data layer that will scale with your business.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what FaunaDB is and how it got started?
  • What are some of the main use cases that FaunaDB is targeting?
    • How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?
  • Can you describe the architecture of FaunaDB and how it has evolved?
  • The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?
    • What are some of the edge cases that users should be aware of?
    • How are conflicts managed in Fauna?
  • What is the underlying storage layer?
    • How is the query layer designed to allow for different query patterns and model representations?
  • How does data modeling in Fauna compare to that of relational or document databases?
    • Can you describe the query format?
    • What are some of the common difficulties or points of confusion around interacting with data in Fauna?
  • What are some application design patterns that are enabled by using Fauna as the storage layer?
  • Given the ability to replicate globally, how do you mitigate latency when interacting with the database?
  • What are some of the most interesting or unexpected ways that you have seen Fauna used?
  • When is it the wrong choice?
  • What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?
  • What do you have in store for the future of Fauna?
Contact Info
  • @evan on Twitter
  • LinkedIn
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Fauna
  • Ruby on Rails
  • CNET
  • GitHub
  • Twitter
  • NoSQL
  • Cassandra
  • InnoDB
  • Redis
  • Memcached
  • Timeseries
  • Spanner Paper
  • DynamoDB Paper
  • Percolator
  • ACID
  • Calvin Protocol
  • Daniel Abadi
  • LINQ
  • LSM Tree (Log-structured Merge-tree)
  • Scala
  • Change Data Capture
  • GraphQL
    • Podcast.init Interview About Graphene
  • Fauna Query Language (FQL)
  • CQL == Cassandra Query Language
  • Object-Relational Databases
  • LDAP == Lightweight Directory Access Protocol
  • Auth0
  • OLAP == Online Analytical Processing
  • Jepsen distributed systems safety research

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


share








 2019-04-22  53m
 
 
00:13
Tobias Macey: Hello and welcome to the data engineering podcast the show about modern data management. When you're ready to build your next pipeline or want to test out the project to hear about on the show, you'll need somewhere to deploy it. So check out our friends at Lynn ODE with 200 gigabit private networking, scalable shared block storage and 40 gigabit public network you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that coverage to with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads. They just announced dedicated CPU instances go to data engineering podcast.com slash Linux, that's Li n o d today to get a $20 credit and launch a new server and under a minute.
00:55
Alexia Locascio is an open source distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos. Alexia unlocks the value of your data and allows for modern computation intensive workloads to become truly elastic with the cloud. With Alexia companies like Barclays, JD calm Tencent and two sigma can manage data efficiently accelerate business analytics and ease the adoption of any cloud go to data engineering podcast.com slash Alexey Oh, that's a Ll you x i o today to learn more and to thank them for their support.
01:33
And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers time, it lets your business users decide what day Do they want where go to data engineering podcast.com slash segment i o today to sign up for their startup plan and get $25,000 in segment credits and $1 million in free software for marketing and analytics companies like AWS, Google and intercom. On top of that, you'll get access to the analytics Academy for the educational resources you need to become an expert in data analytics for measuring product market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data and everything else you need to know about modern data management. For even more opportunities to meet listen and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media Day diversity into the Open Data Science Conference. Go to data engineering podcast.com slash conferences to learn more and to take advantage of our partner discounts when you register and go to data engineering podcast.com to subscribe to the show, sign up for the mailing list, read the show notes and get in touch. And please help other people find the show by
03:00
Leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey. And today I'm interviewing Evan Weaver, about Fanta dB, a modern operational data platform built for your cloud. So Evan, could you start by introducing yourself?
03:13
Evan Weaver: Great to talk to you, Tobias. I'm Evan Weaver. I'm CEO and one of the co founders of fauna. And do you remember how you first got involved in the area of data management? I do remember. So in grad school, I did a bunch of work. In Bioinformatics, specifically, I worked on G North logs and chickens as well as a plankton simulator. And for the for the gene project, I ended up using rails, because we needed a web interface to throw some data up. And that was sort of my first experience of with web programming. It was my first time using a real database. I I got super excited about rails because of like the blog demo screen casting, and then I spent a week trying to install Postgres. And from that point forward, I was basically doomed to spend the rest of my life paginating things and working on the data side of platforms. After after that project, I went and worked at CNET networks and did real sites there specifically, I did shout out calm and urban baby calm. Urban baby was a threaded real time web chat for moms. So if you take away the for moms, it kind of sounds like Twitter. Around the same time, my team at CNET left to found get up I left to go to Twitter as employee number 15.
04:31
Tobias Macey: And
04:32
And so in the time that you spend at Twitter, you ended up dealing with a lot of different issues related to databases and storage and consistency. And after that, you went ahead and co founded fauna and released the Fauna DB product. So can you start by giving a bit of an explanation about what Fauna DB is and your motivation for starting it?
04:55
Evan Weaver: Yeah, so and Twitter, I ended up running what we called the back end infrastructure team, we built all the distributed storage for the core business objects. So that was tweets, timelines, user social graph, image storage, the cash, price and other storage that I forget. We also worked on performance. And Twitter was one of the last great consumer internet companies that was built pre cloud. Like we were using co located hardware. There wasn't any cloud native software, we did do almost everything ourselves. And that's probably why there are a lot of great infrastructure startups that spun out of Twitter. And essentially, when we were trying to trying to scale Twitter up, we got involved in the NO SEQUEL movement, in particular, Cassandra. We hosted the first meetup. I wrote the first tutorial, we fixed the build, because people don't remember now, but it actually didn't compile when Facebook open sourced it. And we were hoping basically that Cassandra would develop into a global data platform that would be multi purpose, you know, reusable, flexible, productive, and then just didn't happen. And to scale Twitter, we ended up building point solutions where we take like, you know, db or Redis, or mem cache or some other local storage engine and put us Charlene facade in front of it. They manage replication and querying and transaction, ality and that kind of thing. But because Twitter was under such extreme time constraints, we just never had the chance to build that truly reusable platform that we wanted to build. So that's basically Fanta. I spent four years at twitter. When I left a couple people ended up coming with me. And we spent about three years in consultancy mode exploring the data space, working on a bunch of other projects, trying to understand you know, Twitter needed a social graph, but there's probably not a market for a social graph dB. Like, what do people really need as a general comprehensive data platform and then, in 2016, we felt like we had a prototype down we had an initial customer, we went and raised our seed round from CRV. And then we raised our series A in, in 2017, from point 72, and GB. And so you've been building this platform for a while and looking at the technical documentation, it seems to be quite the feat of engineering. So I'm wondering at the outset, what are some of the main use cases that Fanta DB is targeting that you found people were asking for and your exploration of what the data space was looking for and what it was lacking? What we found was really missing was that general purpose, you know, safe and reliable, cloud native database, so to speak, like we found a lot of people who said, You know, I looked at NO SEQUEL, and I like the scale. I looked at sequel and I like the flexibility of modeling. I can't get something that does both. That's the same experience we had at Twitter and partly why we ended up building all these systems more or less from scratch. We decided, you know, if we don't get this done, it's not going to happen like information science doesn't say that you can't build a transactional global high performance operational data store but the you know, in practice there's so much path dependence and software development thatat the time everyone who had tried to do that and basically got diverted into some niche like time series or click tracking behavioral data that kind of stuff like the NO SEQUEL vendors had given up on transaction ality data correctness safety and we're promoting a worse is better story. And the seagull community and falling back to well, vertical scale is all you need. Global is impossible .Never use anything new. So we found that there's a segment of the market though that was just refusing to believe that this was as good as it was going to get. And that's our market.
09:06
Tobias Macey: And so in recent years, there have been a couple of other projects that came out to enable scaling of transactional workloads across data centers, and potentially globally most notable of which being cockroach dB, which I know is based on the spanner paper out of Google. So I'm wondering if you can just do a bit of comparison as to how fun a compares to cockroach or any of the other products that are available in the market that are offering these global scale transactional databases. Yeah, it's a there aren't many entrants in the market because of the tremendous, you know, r&d burden.
09:45
Evan Weaver: I mean, five years ago, people were saying that these systems were literally impossible. So it's kind of in the cold fusion territory. And the main thing that changed that was really Google Spanner. And the spanner paper came out similar the Dynamo paper and the Dynamo paper said you can have total availability. If you treat your data this or that way. The Spanner favorite said you can have total transaction ality if you relax your availability requirements to this minimal degree which in practice is effectively totally available. But Spanner had followed on percolator and essentially there were two models at the time for doing global transactional multi partition consensus like really acid. And percolator was the first it also came out of Google. What percolator does is essentially scale up the primary replica model to data center scale, where instead of having a single machine that uses locks to coordinate all transactions, they essentially have a timestamp Oracle, it's called it which is more or less a lock server that can be individually scaled up and every node in the data center has to talk to that guy to do any useful work. And that gives you data center scale reads and writes up until like the limits of that machine. And it gives you It gives you global scale out for scale read, but it doesn't give you a global right. And Spanner came out and said, you know, we can use atomic clocks to synchronize the right path to replace the timestamp Oracle. And then everyone realized that there actually were mechanisms to deliver global transaction ality. But the problem with the spinner model obviously is the clocks and systems like TechCrunch, for example, attempted to import Google's clock synchronization strategy into a public cloud environment where you don't actually have atomic Lee synchronized clocks. And part of the reason that Spanner can pull this off is because the entire software stack is controlled and and the network latency is known the service latencies now and like the implementation of every part of the transactional write path is very tightly latency controlled. No garbage collection stalls, no VM pauses, no nothing. Because if you drift out of that clock tolerance, you'll violate correctness and you won't have any way to recover it. You might have corrupt transactions, and you have nothing to do you have no, there's no, there's no way to roll back. There's no way to even identify what got corrupted during the window, because you don't know if the clocks have drifted out of synchronization until after it's happened. And for that reason, you know, systems like cognitive bias towards only doing only doing reads and writes on the tablet leader, so kind of some of the global story isn't quite there. And we were looking at this at the time and we're like, well, we're building we're building for the wind, like our customers want this to be global. We want it to be global. We're not satisfied with the limitations of this cloud based architecture. And we took a look at the academic literature and we found in particular, there was one alternative at the time. And that was a, there was a protocol called Calvin, that came out of Daniel bodies lab at Yale.
13:09
Tobias Macey: And so you've been building the Fanta database on top of this Calvin protocol. And I know that you've also taken in some of the aspects of the raft consensus algorithm. And so I'm wondering if you can talk a bit about how Fauna itself is architected to be able to achieve this global scale and transactional consistency, and just some of the overall consensus protocol and consensus management that you use to ensure this global availability of the data as well.
13:43
Evan Weaver: Yeah, so what Calvin does is invert the synchronization model. Instead of using clocks to figure out when transactions occurred on the data replicas. It sends the transactions themselves to a log which then essentially defines the order of time. These transactions in the shared log are then a synchronously replicated out to the individual replica nodes very similar to a traditional NoSQL system. And that gives you a ton of advantages. So at the front end, sort of in the right path, you have a raft cluster, which is shard and partition entirely available spans nodes, that's accepting these deterministic Lee submitted transaction effects or intermediate representations, what have you that thing has no single point of failure, its global tamale data center, any node can commit to it within the same medium latency, regardless of how complex the transaction is. Then on the read side, you can have as many data centers as you please tailing off this log in lockstep applying the transaction effects locally, to their their local copy of the data means that on the read side, you get scale out experience, which doesn't require any coordination. So we can do a snapshot reads from any data center with single single millisecond latency. Whereas on the right side, you know, the, the latency for a commit takes about one majority round trip throughout the log nodes wherever they're configured to be in the data center. So you know, 100 to 200 milliseconds in a typical multi continent cluster. That's basically the best you can do in terms of balancing of like, maximizing availability without ever giving up the benefits of transaction ality.
15:35
Tobias Macey: And as far as consensus and consistency, I'm wondering what are some of the edge cases that could lead to data conflicts and how Fatah manages resolving or alerting on those conflicts
15:50
Evan Weaver: in fun. So fun offers a functional expression oriented relational language, it's very similar to link in the way you can post your transactions, you're reading relational patterns in, in, you know, maps and forwards and flat maps and that kind of thing. These transactions then get completely processed atomic Lee, you know, with acid in the database itself. So it's just like working with any other sequel system, except it's not sequel in that, if you think you might conflict. And you want to take a read intent on some other value, you just write it into the transaction. It's not like something like Cassandra, for example, which can express reads and writes within the same transaction. So in that sense, you don't have to do anything except describe what you know, with the, like the business model, so to speak of the transaction or the logic is supposed to be, we allow you to push down stored procedures, which we call functions, we allow you to build unique indexes consume indexes, create views, transformed data of which is transactional, Lee available. And in particular, because Calvin has a logical global, instead of dropping down to the individual leader, like raft leaders for partitions, phone and DB offers strict serialized ability for external consistency, just like Google Spanner, which is the highest possible consistency level. So there are no anomalies in foreigners transactional consensus resolution, there's no index phantoms, there's no you know, reversal of real time, there's no real SQL right skew, you just don't have to worry about it.
17:38
Tobias Macey: And as far as the underlying storage layer, and the data modeling that Fanta support, so I'm wondering if you can talk through how that's implemented, and specifically for the multi model capacity, how the query layer is designed to be able to allow for those different query patterns on the same underlying data.
17:58
Evan Weaver: Yeah, and the the motto is something we're investing a lot in now, we should talk about that in a minute. For sure. The underlying storage engine is an LS entry. It's derived from Cassandra's original level implementation in Java phone is written in Scala and Java primarily, it's not really very special. What's special is the temporality that foreigner layers on top. Because as part of the consistency model, as part of the SQL functional query language, we offer total access to the history of your data within the configured retention period. So you can run any query at a point in time, you can create a change feed for any query between two points in time, you can get, you know, Change Data Capture from indexes and tables and that kind of thing. And for data that has to be retained forever, you can configure it to do so for data which is derived and you only want to retain the latest version you can also configure to do so, man, it gives us a ton of power both in the language that's exposed to the end user and to and and for Calvin, which relies on that history, to to make, read and write and Jackie more efficient.
19:14
Tobias Macey: And so I'm wondering, given the ability to interact with these different views of the same underlying data, how an application developer would approach data modeling, particularly in relation to a sequel oriented or document oriented data store.
19:33
Evan Weaver: So fun, fun, it is a document oriented data store, we call it a relational NO SEQUEL platform, which means you insert documents, and you build relations in the form of indexes on top of them. But what one thing we've discovered, as we've gone to market and working with our customer base is that people want the operational power of the platform, but they also want easy integration with the languages they're currently using. So we've just announced our platform plan, as well as lunch graph que el as one of the first available languages on top of native. And our goal is to give people a completely transparent and native experience with these familiar languages, which will give them access to the underlying power of the platform. So if you want to go crazy and basically stay in power user mode, you can use SQL, which gives you transparent and direct access to all the semantics and and functional and operational capabilities of the underlying platform, including que es and security and temporality and all that kind of thing. But if you're just trying to build an app, what you get now is a series of basically best of breed standard languages for that modeling paradigm. So for crying, we now have graph SQL. And for key value, we have CQ L, which is Cassandra is native language. We're also working on SQL for relational for relational modeling, which will launch later this year. And then we'd like to also do a couple more data domains. In particular graph, which you can currently model directly in SQL. But we don't have a standard interface for what we found is that people are super excited about this strategy, because they want that shared platform, especially because fun allowed to access the same data from different API's. But they just don't want to deal with the learning curve up front, which is understandable, because SQL is pretty unique. Even though it's similar to link, you know, you have to get a specific driver, you have to understand fun as native semantics, which are very powerful, but also, you know, not necessarily intuitive or familiar out of the box. So I would say, you know, depending on what kind of app and data you're trying to model, grab one of those API's now and go nuts. And as soon as you need more power, you can always get it by dropping down to what's effectively at that point, an intermediate representation, just like a compiler compiles, you know, a higher level language to a byte code or something, you know, internal, a partial representation that is more explicit gives you more control, we're now doing the same thing with Nick kind of moves us from the database that does one thing to the data platform, you kind of get, you know, effectively, all of you know, AWS or Google clouds, operational data systems in a box.
22:41
Tobias Macey: And in terms of people who are first getting started on working with Fatah, in interacting with the SQL syntax, or starting to work with some of these higher level interfaces. I'm wondering what are some of the common points of confusion or surprise or edge cases that they run up against?
23:00
Evan Weaver: One of the things we did early on, was borrow a lot of terminology from the object relational movement in the 90s, you know, Twitter engineering and us, you know, have a reputation for kind of doing our own thing, hell or high water. And even though object relational databases basically died, we still felt that those paradigms, we felt like those, those patterns were more or less optimal for modern development practices. But the jargon that we imported from them is a little weird. So like, instead of tables or collections, you have classes, and instead of documents or rows of instances, another thing that's a little strange, I think, which we need to fix and use more conventional language, you know, indexes and find our equivalent of us, you can transform data, you can cover multiple terms, you can rank values, you can even write one index, that index has multiple collections. And these these are kind of, you know, similar to a functional programming language or something like F sharp, you know, camel Haskell, Scala, like fun is written in. These are super powerful, but also super abstract concepts. And I think it's, it's been a little difficult for a lot of our users to wrap their heads around a paradigm which is so compostable, but also not necessarily familiar with the practices that they've they've encountered before. Then at the same time, their features which are totally new to the database landscape like us management, like we run phone, a Service Cloud as a single global phonic cluster, we use the built in tenancy and queue management to provision new accounts within it. Because the database hierarchy is recursive, like a file system. So you can have a database that has other databases that have databases within each of those can have a priority, that priority is instantaneously scheduled at a sub query level by the query scalar. On each node, you can do things like consolidate a lot of different applications and different access patterns into the same physical cluster. Those features are very foreign to dB as two people building back end applications because they've never encountered them before. And by temporality is similar, there are very few production systems with capable, you know, by temporality implementations, most people's experience of change, data capture is very low level, like in Postgres, you have to grab some, you have to grab some third party plugin thing that tries to sniff the bin log and look at the binary format. And if you fall behind, you can't catch up because they've been long as God having that kind of stuff, highly available, transparent. In the high level, you know, programming language of the data system is just a surprise. And it's been, you know, we're doing a lot of work on docs on tutorials, as well as the new API's. But I think a lot of people encounter that in initial learning curve with the foreign concepts from the object relational paradigms and elsewhere and have trouble seeing through it to the underlying operational power. And as far as
26:15
Tobias Macey: the types of use cases, that fauna is built for in the types of application design patterns that enables, I'm wondering if there are any sort of unique architectures that it lends itself well to that would be impractical with a single purpose database, whether it's a relational database, or NoSQL, document store or something like that,
26:38
Evan Weaver: yeah, there there are a lot. And you need to kind of adjust the way you view the database, like the traditional view of a database, even if it's a distributed system is a, you know, a single workload kind of Britt all operationally heavy system that you just don't want to touch. And fun, it just isn't like that. It's it's entirely so managed, whether you're operating at yourself, or using our service cloud, you can scale it in and out, up and down. Everything happens online, everything happens without data corruption without service interruption with us management built in automatically. And you can adopt kind of a platform approach similar to the way you know, in a large enterprise, like some of our customers, they'll have an internal compute platform that uses Kubernetes, or what have you, D CEOs are some of the older you know, orchestration and cluster management paradigms. But databases are still special snowflakes. And you have to you have to kind of bring your bring your thinking forward and think you know, what, if the database didn't have to be treated differently, then my stateless capabilities, what if, you know, I could provision a database for every developer for every staging environment for every build? What if I could, you know, run analytics workloads against the production data database by giving them a low priority, read on the key, that kind of thing. So really adopting that cloud native mentality, for the data tier, especially the operational data tier, it's just not something people are accustomed to doing. So we can do a lot of education there a lot of demo and a lot of communication to show that. No, it really is safe, it really does work. And at the same time, having global transparent access to all your data with low latency also leads you into different series of design choices for your applications. Because if you have, you know, an app, which only lives and in US East, you know, in AWS or what have you like to say it's been in the original data center for a decade, and there's weeds growing up and rain is dripping on the roof, and all that kind of thing. Like, you don't really see the benefit of global scale out unless you start refactoring your app to also manage, you know, data center, like level fail over. So if you're building a Greenfield app, and you build it totally stateless, for example, you're using server lyst framework, then even experience, which is much more like running a CDN. But it magically has access to transactional correct data under the hood instead of just cashing. But it's a little difficult to kind of enter that world, from a legacy mindset or from a legacy app.
29:18
Tobias Macey: Yeah. And I was curious about what you were saying as far as being able to run analytical workloads on Fanta, because I know that it's primarily built for these transactional use cases. And then also to your point about being able to spin up different instances for pre production environments, or for developers to be able to experiment with, I'm curious if there is the ability to leverage either indexes, or if there's any sort of fast copy mechanism for being able to populate those pre production databases with either the entirety or some subset of the data that's stored in the production and transactional data store,
29:58
Evan Weaver: you're currently there's not, that's something we've been asked about a lot and, and want to get on the roadmap. Most most, you know, testing data sets are relatively small. So copying it with the high level layer isn't a big deal. But that forking branching model has been a request that we've got, and then we'd like to enable long term, one thing you can absolutely do now is, you know, for read queries, you can you can give a different version of the app a read on the key and test it against the production data and a completely safe way. That's something which is not really practical to do in a traditional already BMS, are now there's no sequel system where all you have is administrative access. And there's similar things you can do at the, at the user level with with our back system, you can create a security model, which lets untrusted clients access public data that lets users bootstrap themselves and you know, own their own little sphere of the data world directly from, you know, a single page running app or a mobile app or some other embedded device, like an IoT device without any intermediate, you know, proxy or security layer on top of the database itself.
31:12
Tobias Macey: Yeah, and that was another thing that I was impressed by is the level of granularity that you're able to offer in terms of the access control. So I'm wondering if you can talk a bit more about the security model of Fanta, and how user management and just overall cluster security factors and what the administrative interface is for being able to manage all of that?
31:37
Evan Weaver: Yeah, that's a good question. The operational management like true apps like adding a note and removing a node, adding a data center, happens through an admin tool, which you run locally on any machine in the cluster. But everything above that level is self hosted and part of the logical API. So for example, schema records aren't themselves any different. Then data records like an index is an instance within a database or a document. A database is itself a document within another database, all running up to the root of the tree. And at each level of the tree, you also can provision access keys, you can define exactly what they access with, you know, with with final programming, essentially, you can write lambda switcher embedded in the key and control you know, exactly what they're allowed to do and what priority and that their model because it's self hosted is pushed all the way down to the individual documents. So you can you can have a document which services and identity, you can have a scheme either by username password, which allows you to let you know untrusted devices create a new record without any intermediation, but by setting whatever their their password or secret is supposed to be. Or you can build a stateless service that delegates that identity, something else like L dash, or zero or whatever existing identity provider Facebook, for example, if it's a mobile app, you already have, then issues back access keys that have the appropriate scope, the appropriate our back, lambda is installed, the let you really push the security that you normally model in the app all the way down into the database. And that's super beneficial for two reasons. First, it's faster, because the database can process all this locally, you're not streaming back data that the user is not allowed to see. And second, you have a much stronger guarantee of fundamental security in your system, because you're not especially in a microservices environment. And this applies kinda transaction ality to, if the database doesn't handle these concerns in their totality, the more you move to a server list or a microservices environment, the more individual code basis you have, trying to agree on these access patterns, which are very, you know, very nuanced. Your typical security Hall comes from, you know, a bunch of well meaning implementations, which is somehow interact in an unexpected way. So if you can push that down into the shared data tier, you know, integrate through your data, just like it's 1982, and you've Oracle or something, you said, the database is the bus, everything talks to the database, everything is is you know, if you want stored procedures built in our back, that kind of thing, to make sure that what we're doing is correct, safe, properly managed, then you get a tremendous amount of flexibility, the application tier, because you just don't have to worry about that level of concern anymore.
34:42
Tobias Macey: And to your point about the user controls being just another record in the database, and being able to manage stored procedures. I'm wondering what the data types are natively infinite dB, and what capacity there is for being able to create custom or high are order data types, and what level support there is for being able to push some measure of application logic into the database in the form of stored procedures or custom function definitions?
35:10
Evan Weaver: Yeah, that's a good question. And this, there's a really interesting implementation detail and fun around this that we don't really talk about, you know, and your typical, like, take my sequel, when you created my sequel, database or cluster, you have to set the coalition What the hell is a coalition like it's the it's the order of ranking of individual data elements based on what language you speak, what other types of data you expect to query, what kind of indexes you want to you want to expose. And it's very difficult, especially in like a multi tenant environment to say this is what collation should be, it's also very confusing. And what we didn't findis every data type has a an or no position in what's essentially an infinite circle of all possible data elements. You can sort for example, floats and strings. Together, and you'll get an order which makes sense, you can sort and we have to be careful about you know, Sybil account based stuff because of the way branches but you can, for all intents and purposes, you know, sort most languages in a way, that make sense to the end user. And this, this means, you know, fun, it can be static Lee typed internally, and process efficiently, all the native data types, floats, strings, byte, longs, arrays, maps, what have you, every data element has a rank, a predictable rank rank that lets you order it sorted partition at what have you. But at the same time, you can do whatever you please on top of that, by taking advantage of that underlying order. So you can create essentially a struct which has multiple data elements in it. If you want your data to be ranked a different way, you can create an index which transforms it before it gets ranked, and also includes the original value. And this is a break from the object relational paradigm because the object relational paradigm is basically like, you compile like a native data type, you install it into the database, you have to define all this stuff about how the database will interact with it and sorted and rank it. And you usually can't, you know, create a column, for example, which includes that data type, but also elements of a different data type, you end up falling back to kinda like the the var char my sequel model where you're like, who knows what's in here, it's just a bunch of bytes. We learned, you know, through our own experience, and working with customers and that kind of thing that people don't really want to extend their database, they want to model their application on top of a shared set of primitives. So that's what we offer, we offer these native data types, but the language itself is, you know, typed, but dynamically typed, even though everything's stored static Lee internally, and we offer stored procedures, which we call functions, which like you push down land does into the database written in Africa, well, which can compose transform data, augment the language, but at a high level, like you're not compiling a Java jar to install, you're just writing a query, once you like the query, you can give it a name and put it into the database. Obviously, that function is itself a file document. So its version, you can see how it's changed by going through the temporal history. You can you know, transactional independent it when you're doing other operations if you need to, and you really get a much more comfortable kind of a double space experience where the database is a Compute Engine over data, which it makes transparently available to all the query processes, as opposed to, you know, having to think about it in a more
38:44
Tobias Macey: more legacy mindset. I really like the temporal aspect of fun of being able to automatically maintain versions of records because as you mentioned, it's not something that's generally built in as a first class concern into database engines, you either have to do, like you said, capture the right ahead logs in a Postgres database for Change Data Capture, or if you want to be able to keep that information readily available, you have to implement some sort of history table, which has its own edge cases that you have to work around, depending on how the data model changes over time. So the fact that it's built in as a first class concern, and that it's something that is accessible without having to go through all of these additional backflips and, you know, additional tooling, it's definitely a very valuable addition. And I'm curious, what are some of the other interesting features that are often overlooked or misunderstood? And any particularly interesting or unexpected ways that you've seen Fanta used?
39:46
Evan Weaver: temporality is definitely one like we originally, you know, we came from Twitter, right? So we thought every app would have feeds in it, which by and large was correct. But it turns out, you know, people are also struggling with much more fundamental data issues, like, is there data correct now available, but one thing we've seen is kind of an augmentation pattern where you have a traditional database, and you want to keep history, but you don't want to deal with logs, basically. So you can add a trigger into the application layer into the database itself, which replicates the individual changes into Fanta, then that can let you expose this changes, not just you know, locally, but globally, as feed as Change Data Capture for data services and other data centers. It can let you, for example, span cloud, if you have your app already built in a single data center in a single cloud, but you want to start getting data, like let's let me be more concrete. Say you have a legacy application, which is built in us to wine, it uses Postgres or something like that, but you're not going to migrate it to Google Cloud, at least not out of the gate. But you want to take advantage of some of the unique services and Google Cloud like the machine learning capabilities, for example, what you can do is add a trigger either an application like or the database itself to write changes to find a useful on a server lists are operating yourself, span that data into Google and then have it locally accessible, either reading from that, you know, reading from phone, a local update, an analytic system that has storage or directly querying the local phonic cluster from, you know, as a sparkly kind of service with the phone and driver that kind of inverts the database model and takes advantage of temporality to create a bus, but you don't just have change events, you actually get to query all of your data. So you can start moving that, you know, some of the pattern to trauma search by that legacy database into foreigner for things like change feeds, and change, data capture, and that kind of thing. You can also use the security model to to lock down the canonical data database more tightly, and then rely on phone and security model to expose that data to mobile apps, for example. So you can take a data base, which was built for a trusted deployment environment, on the web servers you you own manage, and essentially use phone as kind of a data CDN and get all that relational power that model empower that query empower to make that data globally publicly available to new views into the same underlying product.
42:22
Tobias Macey: And what are some of the cases when Fanta is the wrong choice and you would advocate for somebody to choose a different tool or a different platform,
42:30
Evan Weaver: one thing we encounter. Indeed, to be clear, fun is becoming the wrong choice less over time, but one thing we encounter with some frequency that will remain the case indefinitely is a time series use cases. A lot of people confuse time series and temporality because they both involve time. finest temporality is really about stories, storing history and change history of money critical, like user generated content stuff that's super important and time series is really about, it's about data that individually doesn't matter. And only is interesting in aggregate. So the the operational characteristics people are looking for are wildly different, you know, they want analytics, roll up aggregation features that Fauna doesn't currently have, they want it to be. They want it to be cheap, to the point that they give up, you know, correctness, transaction, ality even availability, or to store as much data as fast as humanly possible because the individual rose, you just don't really matter. It's all about the aggregates. So that's something we've encountered with some frequency. And you know, if you really want to do time series, you should grab something like influx or potentially Cassandra, get a truly eventually consistent NoSQL time series database, which is optimized for those patterns. And right now for overlap, for example, phone and doesn't have native analytics capability. So you can query it from something like spark with a phone and driver. But if you really want to do you know, kind of an age tap kind of scenario, the best thing to do is use finest temporality to capture data into a relational database, or a cloud analytics service, for example. And the temporality is nice for that, because you can do it in a restored about transactional correct way. And usually, these systems don't have to be, you know, globally distributed. So if you have a global phonic cluster, you want to do h tap, just spin up Postgres in one data center, you know, write a connector, which will, which will sink the data you care about into Postgres and soft, real time. And you'll get that capability. As we bring our own sequel capability online, these needs will diminish. But right now, those are kind of the two main areas where it doesn't really make sense. To be clear, we are building a general purpose data platform. So we're not we're not opposed to eventually implementing pretty much everything you wouldn't you would want at these different data domains. But right now the focus is crud, relational document, graph, key value.
45:14
Tobias Macey: And in terms of your experience of building and growing the technical and business aspects of Fatah, I'm wondering what you have found to be some of the most interesting or unexpected or challenging lessons that you've encountered in the process,
45:29
Evan Weaver: I think it would be safe to say when we we set out to build this project, we didn't realize we end up solving one of the hardest problems in applied computer science, in particular, global highly available acid transactions. There was there was no industry version of Calvin before Fonda, there was only the prototype that was written for purposes of the paper. So that said that the challenge of doing that certainly exceeded our expectations. I mean, most databases are working with a single candidate Slayer, if they had any consensus whatsoever. And some of the patterns for raft are relatively laid down at this point, although many people batch their their raft implementations. But to add an additional novel consensus protocol on top of that, because we've extended Calvin in quite a few ways, in particular, for performance and flexibility reasons was a tremendous challenge for us. And we were gratified to finish our Jepsen analysis with Kyle Kingsbury recently, which kind of validated the entire architecture of what we set out to build in the context of the academic literature and the history that came before and particularly Google Spanner and percolator and that kind of thing. So in the technical side, that has, by far kind of exceeded, I think the level of difficulty we initially assumed that never stopped us before, it never stopped us at twitter. So we still got it done. But I think that was a surprise. And then I think, you know, there's the usual stuff, which I think is common to technical co founders where, you know, I was a director at Twitter, managing a team of about 25 people, but managing a larger company growing it, growing it from scratch, having an executive team and line managers and that kind of thing. It's been a learning experience for me, because the the people are, are just as hard as the the technical aspects of the business.
47:18
Tobias Macey: And looking forward, what do you have in store for the future of Fanta, both from the business and technical side,
47:26
Evan Weaver: the biggest thing for us right now is these new API's, like we've seen, you know, a lot of our a lot of our cloud users already begin to implement graphical adapters for photo. So we're super excited to release the, you know, the first party native graph to interface and serve that market need more directly, we also have a lot of work to do on the sequel implementation, for example, and then, you know, some of the future interfaces will release beyond that the same time, you know, we're always improving performance, we're always improving that default consistency level, as you get pushing down latency even further, there's been a bunch of operational improvements lately, which we're excited about, which will dramatically improve certain workloads, we're making it easier to use in different operational environments like Kubernetes, and different clouds and that kind of thing. And we also have locality control on the roadmap for this year. And we've started work on the ability to define on a record by record basis, where your data lives in a single phonic cluster, that makes kind of the Shared Services paradigm, even more powerful, because you can lay out for around the world you can have in 25 data centers, and you can have every individual application or logical database that's accessing that cluster, decide on a row by row basis where it wants that data to live. And that's good for compliance. It's good for management of you know, red application costs, it's good for offering a shared service internally or in our own cloud. And really, you know, pushing you to that edge CDN kind of data experience.
49:10
Tobias Macey: And are there any other aspects of Fanta dB, or the photo company that we didn't discuss yet that you'd like to cover? Before we close out the show?
49:19
Evan Weaver: mean, we're hiring?
49:22
Yes, you you know, if you like to work on consensus algorithms, distributed systems, if you know, you like worrying about what can go wrong, rather
49:29
than working, go write
49:30
a database company is a good place to be in high school, I did some hobby stuff with electrical engineering, and I could just never, you know, quite get it together. Because it was so unpredictable, others analog stuff happening, and I ended up going into software. For that reason, I found it to be much more predictable environment. But then, of course, I do myself to having the same category of problems by working on distributed systems exclusively, which are again, especially in the cloud, incredibly unpredictable. Push the analog environments, you know, latency varies nodes, shut down data disk corrupt. Like, if you're excited about solving those things, please talk to us.
50:10
Tobias Macey: And for anybody who does want to get in touch with you about that or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I just like to get your perspective on what you see as being the biggest gap and the tooling or technology that's available for data management today. I think,
50:27
Evan Weaver: to me, the biggest gap is really that the server lists edge experience, like we're pushing the granularity of application building down to literally nothing like you had kind of a series of incremental paradigm shifts from physical servers to co located or leased servers to virtualized servers to containers, they're still all little servers. It's like, if you like every thought you had, you had to mentally conceptualize it as being in a book and in doesn't make sense from a productivity perspective to think about software, especially distributed software this way, like, who cares, like how many functions can run within one container, I don't, I just want to know if I have the aggregate capacity to execute the workloads my users are generating. And it requires a complete inversion of that abstraction, which we finally have now, for the most part, with server list frameworks and the compute side, we've had it for a long time with CDN and the caching side. But data, especially operational data is always the last thing to move, because it's the riskiest, so you can now get, you know, some server lists analytics capability with things like snowflake, but your canonical, operational user generated mission critical, you know, data, which is the existential underpinning of the business still lives and essentially, you know, a mainframe and what we're trying to push with thought, and what the entire industry needs to push is, you know, bringing this paradigm to its logical conclusion, which is you shouldn't have to care. Or even though as an application developer, how your data tier is operating, it should be completely orthogonal. And at the same time, as an operator, you shouldn't have to care what your applications are doing, like the the model of a DDA who has to like go in and like tune queries and make sure everything is safe to execute and fail over nodes to huts, barriers and stuff is an 80s model, we need to move past that to an arm's length utility computing service model where something's behaving badly. You know, in Florida, for example, if the application is consuming too many resources, lower its priority, you don't have to know what it's doing as an operator. And if you want global resources, as a developer, just provision a new database, you don't have even have to think about where this data centers are located. That's the experience we're closer to a server less and we're already there with CDN, but data is just harder, because the the, the quality bar is so astronomically high, because you know, I mean, the NO SEQUEL man was notorious for for essentially killing businesses, like dig comes to mind with their experience with Cassandra and people are smarter now. And they demand that their database vendors really do the work. But until until the vendors do like we're doing it for, we're still going to be stuck in that mainframe mindset.
53:17
Tobias Macey: Well, thank you very much for taking the time today to join me and discuss the work that you're doing at Fanta. It's definitely a very interesting project and seems to be quite the feat of engineering. And there's a lot of great technical resources that you've put up for people to be able to understand what it is that you're doing and working on. So I appreciate all of the work that you're doing on that front end. I hope you enjoy the rest of your day.
53:39
Evan Weaver: Thanks. It's great to talk to you. Thanks for having me on the show.