Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com






episode 129: Building Real Time Applications On Streaming Data With Eventador [transcript]


Summary

Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that you collect, without worrying about how to make it reliable. In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. This was an interesting inside look at building a business on top of open source stream processing frameworks and how to reduce the burden on end users.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Your host is Tobias Macey and today I’m interviewing Kenny Gorman about the Eventador streaming SQL platform
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what the Eventador platform is and the story behind it?
    • How has your experience at ObjectRocket influenced your approach to streaming SQL?
    • How do the capabilities and developer experience of Eventador compare to other streaming SQL engines such as ksqlDB, Pulsar SQL, or Materialize?
  • What are the main use cases that you are seeing people use for streaming SQL?
    • How does it fit into an application architecture?
    • What are some of the design changes in the different layers that are necessary to take advantage of the real time capabilities?
  • Can you describe how the Eventador platform is architected?
    • How has the system design evolved since you first began working on it?
    • How has the overall landscape of streaming systems changed since you first began working on Eventador?
    • If you were to start over today what would you do differently?
  • What are some of the most interesting and challenging operational aspects of running your platform?
  • What are some of the ways that you have modified or augmented the SQL dialect that you support?
    • What is the tipping point for when SQL is insufficient for a given task and a user might want to leverage Flink?
  • What is the workflow for developing and deploying different SQL jobs?
    • How do you handle versioning of the queries and integration with the software development lifecycle?
  • What are some data modeling considerations that users should be aware of?
    • What are some of the sharp edges or design pitfalls that users should be aware of?
  • What are some of the most interesting, innovative, or unexpected ways that you have seen your customers use your platform?
  • What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of building and scaling Eventador?
  • What do you have planned for the future of the platform?
Contact Info
  • LinkedIn
  • Blog
  • @kennygorman on Twitter
  • kgorman on Twitter
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
  • Eventador
  • Oracle DB
  • Paypal
  • EBay
  • Semaphore
  • MongoDB
  • ObjectRocket
  • RackSpace
  • RethinkDB
  • Apache Kafka
  • Pulsar
  • PostgreSQL Write-Ahead Log (WAL)
  • ksqlDB
    • Podcast Episode
  • Pulsar SQL
  • Materialize
    • Podcast Episode
  • PipelineDB
    • Podcast Episode
  • Apache Flink
    • Podcast Episode
  • Timely Dataflow
  • FinTech == Financial Technology
  • Anomaly Detection
  • Network Security
  • Materialized View
  • Kubernetes
  • Confluent Schema Registry
    • Podcast Episode
  • ANSI SQL
  • Apache Calcite
  • PostgreSQL
  • User Defined Functions
  • Change Data Capture
    • Podcast Episode
  • AWS Kinesis
  • Uber AthenaX
  • Netflix Keystone
  • Ververica
  • Rockset
    • Podcast Episode
  • Backpressure
  • Keen.io

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA









 2020-04-20  50m
 
 
00:12  Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Kenny Gorman about the Eventador streaming SQL platform. So Kenny, can you start by introducing yourself?
00:57  Kenny Gorman
Hi, my name is Kenny Gorman, co-founder of Eventador.
01:01  Tobias Macey
And do you remember how you first got involved in the area of data management?
01:05  Kenny Gorman
Sure. So my background is actually as an Oracle DBA way back when. I worked with my co-founder many, many years ago now at PayPal and eBay, and that's actually how we met. He was at eBay and I was at PayPal, and we had some significant database problems back then scaling those stacks. We were both relatively early on, and we had problems in the Oracle database realm that others just hadn't seen before. And we worked on some of those hard problems around storage, Veritas, Oracle, and Sun systems, and we both started working together and really enjoying each other's company and fixing hard problems. And yeah, that's really where my love of data systems and databases and streaming systems started.
01:46  Tobias Macey
Yeah, the early days of the web were definitely an interesting time for actually finding out the limitations of the systems that were available at the time.
01:53  Kenny Gorman
Yeah, we had this one thing, it's just kind of anecdotal, but it's kind of funny. You're supposed to be able to add a column to a table in a relational database in real time, like there's no real locking behavior to that; you're not locking up a table or causing some sort of semaphore. But ultimately under the covers you are, and our database was so busy that we couldn't even do these things in real time on production systems, because there was just no time to get latches in the system. And I remember looking at the Oracle guy next to us who actually worked for Oracle, and he's like, yeah, we haven't seen that before. So, okay, well, that's great. We're in uncharted territory here. Good times.
02:30  Tobias Macey
Yeah, that's definitely great when the people who build the system say, I have no idea. Right. And so starting there, I can see how the interest in data has continued. And I know that you have also been involved in the ObjectRocket platform, and most recently with Eventador. So I'm wondering if you can just start by describing a bit about what the Eventador platform is, some of the story behind it, and how you came about founding it?
02:54  Kenny Gorman
Sure, sure. Yeah, I'll just tell you what Eventador is real quick, and then I can go back and talk about how we got here. Essentially, Eventador is a platform that allows you to build applications on streams of data. The design was that we wanted to allow developers and data scientists and backend engineers to be able to just query streams of data, just like it was a database. You know, Kafka and streaming technologies and the distributed log have really taken off in the last few years, but ultimately it's very hard to make sense of that data, to wrap your head around the new paradigm of streaming data, and then to build applications that are really cool on top of those data streams. And that's really what the Eventador platform is designed to solve, all in one. The idea came about — so, just a little bit of history. My background again: my co-founder Eric and myself were working on MongoDB problems around 2010. MongoDB back then was very early — I think I was one of the first few customers who really used it in production — and I fell in love with it. It was awesome. But it had some quirky corner cases where it just didn't work fast. It had a heavy-duty locking design for the database engine, and that made it slow under certain circumstances. And so we figured out that a lot of that had to do with IO and started building around SSDs. Sure enough, it's kind of a brute-force mechanism, but SSDs made Mongo very fast. And so using some of the projects that came out of Facebook, where we could do hybrid SSD and filesystem layers, and then ultimately just pure SSD, we figured out that Mongo could be really, really fast. And we started to build a product around making MongoDB enterprise grade, is what we called it. That's when we founded ObjectRocket, and we ultimately sold that to Rackspace and grew it into a relatively big business. But along the way, a lot of our customers — and at this time we had billions and billions of documents under management; in Mongo it's a document rather than a row, but ultimately billions and billions of rows under management — customers were asking us questions like, hey, how do I get sub-millisecond response time on this 2 billion document collection? And the answer was, you're never going to get that. I mean, theoretically, disks have to move to make this happen, buffers have to be read, index retrieval plans have to be followed — there's just a ton of work under the covers for a database engine to return data in that kind of mechanism. And it just wasn't really feasible. And so we started asking, what are you doing? Why do you need this type of performance from your Mongo cluster? What's the thing that's driving this? And it was these really new-school use cases where, for instance — and I won't mention the customer — imagine if you wanted to look around a room and find a date, and you wanted it to be very centric to the people that are in that room at that time.
Maybe you're at a different place having dinner, or you're at a bar or something, and you wanted to get a sense of the customers that could be available for dating in the room. And we thought, well, okay, that's going to be a very, very real-time kind of situation. People are coming and going — how are they tracking this data? And more and more customers started to have these use cases in similar ways. And we thought, okay, there's a pattern here. And this is super interesting in terms of database engineering: how is the world going to handle these kinds of requests and use cases? And if applications are really being designed with this kind of sensitivity towards performance in real time, then there have got to be data systems behind them. There's got to be a backbone to make this work. And that's kind of when we fell in love with Kafka.
06:26  Tobias Macey
And I know that there was also the heyday of RethinkDB, which was trying to promise some of that same capability of push-based data delivery, where as soon as a record was entered into the database you could subscribe to it and then get a push notification based on that. And I'm wondering what the differences are in terms of the capabilities of how RethinkDB was approaching things versus what you're able to build out with something based on Kafka and some of these stream-native data platforms.
06:55  Kenny Gorman
Right. That's a very good question. I think ultimately we can say at this point that there are different technology stacks that we've employed for various uses over the years. And in my mind — and I think it's a relatively popular opinion — the data sphere is not shrinking, and it's not one technology or another technology. It's really that as data engineering professionals we have more choices going forward, and now it's up to us to pick the right platform of choice for the particular use case. And I think, in terms of real-time things, distributed log architectures, whether it's Kafka or Pulsar or something else, are fantastically cool and fantastically high performance, because it's an append-only structure. We did this in databases way back when: when we had to make the database super fast and handle a lot of inserts, we just wouldn't do any updates, we'd make an insert-only table. That's a very old-school and crafty way of doing a similar kind of thing. In fact, it should be obvious that all the most popular databases have a distributed log built into them for their recovery structures — Oracle has a redo log, Postgres has the write-ahead log. Ultimately, these are append-only designs, and they're that way for a reason: they have to be high performance. And they're typically designed to replay a stream of events. If you just pop that out of the database, so to speak, and create a new infrastructure piece based on just that, you kind of end up with something like Kafka — at least that's how I think about it. And so that's really the right architecture. It's not really a database, but it kind of is. And I think that the hybrid of this pub/sub style of interface with a durable and distributed backend is really the sweet spot for a lot of these architectures.
08:39  Tobias Macey
And as you mentioned, there are a few different approaches to these streaming and event-driven capabilities. And in recent months there have been a lot more projects coming out that focus on being able to provide SQL interfaces on these streams, particularly in things like Kafka or Pulsar. And I'm wondering how the capabilities of what you've built into Eventador, and the overall developer experience, compare to some of those other systems, such as ksqlDB, or the SQL layer built into Pulsar, or more recently the Materialize project.
09:10  Kenny Gorman
Right. So, first of all, it's super exciting that we've seen so much energy in this space. When we first started Eventador, our initial prototype involved SQL — in fact, it involved PipelineDB, which we actually used, and that's now part of Confluent. Those guys are super smart. PipelineDB was super cool; it was built on Postgres. We used it because it made it obvious that we needed a SQL layer, and it made it obvious that you needed to be able to materialize results. The only problem was it just didn't scale like something like Flink. So we set off building our service without that and built it from the ground up to what we have today. If you're looking at ksqlDB and where that's at, it's obviously very Kafka centric: it's built into Kafka, uses Kafka for scalability and command and control and coordination, and that's fine, that's great — it's a great tool. I think what we view ourselves as is more of a platform: soup to nuts, start to finish, top to bottom, ingest-to-application kind of capabilities for the enterprise. And I think that's where our heads are at: hey look, you're going to need schema detection tools, you're going to need procedural logic and on-the-fly creation of user-defined functions and things like this — enterprise features for when someone's really serious about SQL in production, jobs based on SQL. That's where we fit in real nicely. The Materialize guys are super smart, great product. Obviously it's built with different languages and such, on the Timely Dataflow platform — very cool, and it'll be great to see where the energy goes. There's a lot of cool innovation happening in our space. A lot of really smart people are working on it. I think we're humbled and excited to be part of it. So yeah, just excited to see how things go from here.
10:53  Tobias Macey
And in terms of the main use cases that you're seeing people use streaming SQL for, you mentioned at the outset the sort of location-based and very real-time nature of the application that one of your first customers was looking to build. I'm wondering what are some of the other ways that you've been seeing the platform used, how it fits into the overall application architecture, and how people are reconsidering the ways that they build and deploy the types of applications they're building on these event streams?
11:22  Kenny Gorman
Yeah, that's a good question. So the major industries or verticals, whatever you want to call them, that are sort of early adopters in this whole realm — we've seen FinTech be a big part of that. Obviously, processing financial transactions in real time is important. Doing things like fraud detection and anomaly detection is a big part of this for fraud pipelines and other patterns. We've seen everything from cryptocurrency to traditional banking and all over the place, including things like micro-loans on checkout and all those different subsets of the fintechs. So that's really exciting, and I think there's a ton of area to innovate there for cool products and services from various vendors — you see Capital One leading the charge there. Customers are growing to expect more real-time applications, and they want to be delighted when they use their iOS app or log onto a web page, and that data is refreshed and interacted with in real time. And I think that's really where streaming data comes in: FinTech and banking have been so bad for so long that any kind of real-time product can be such a game changer and a competitive advantage for a company. So that's one big one. We've seen network security a lot, in terms of finding anomalies and attack vectors for intrusion detection — obviously just the raw data of packets flying over networks is an interesting streaming data problem in and of itself — so we see some traction there. IoT is a huge one, obviously. That's a very broad bucket of use cases, but everything from the mobility space to automotive to aerospace are the three big ones there, where folks are trying to make sense of streaming data in general. IoT is interesting because data is being generated at a massive rate — sometimes an event is produced more than once a second — and aggregating those events in some sort of real-time ETL pipeline is very interesting and required in many cases. And in a lot of cases, how they materialize that data out to apps is a problem space. It's very common today to take your data and put it into Kafka, because that's easy — it's easy to just publish some data to Kafka, cool, no problem — but it's also very common to then pull it out of Kafka and stick it into a database, so that you can materialize it and read it with apps. And what's interesting is that that kind of ruins the whole point of a streaming architecture; it's going backwards to some degree. The reason it's so common is because, like I said, it's hard to materialize those results. So one of the things that we built into our platform is materialized views, and that allows customers to skip that entire database layer and read right from the streams. And that's important in IoT especially, because there's just so much data that the aggregation and storage of it is a big problem space.
And then lastly, real-time manufacturing. That is one that I'm personally excited about, and I think it's going to grow over time. Building better products and understanding yields and how your supply chains and supply lines are working is a real-time problem, because there's a dollar value assigned to not doing it in real time. So it's costly to perform poorly in that area. It's also opportunity cost that you could be leaving on the table in terms of being more competitive in your market space and things like that. So I think that's a super interesting area, and I think it's going to continue to grow. The industry giants out there are leading the charge and doing very interesting things, and the rest of the world has to catch up, and I'm excited to see that happen.
15:02  Tobias Macey
One of the interesting things about running these aggregates on the event data as it's being transmitted into things like Kafka, or whatever your streaming engine of choice happens to be, is that in a lot of ways it makes it easier to actually build meaningful analyses on top of it, because you already have the context co-located with the data as it's streaming in, along with the timeliness of it. Whereas if, as you said, you're then replicating it back out into a database and fully normalizing it, it can become much more challenging to run those same types of analyses, because of the complicated logic that you have to incorporate to try to recapture some of that context and time sensitivity. And I'm wondering what your thoughts are on the value of actually storing the raw events after they have propagated through the system, and whether you would recommend that it's more useful to store the aggregated data in more long-term storage, because the raw events at that point are less valuable in isolation?
16:03  Kenny Gorman
Actually, that's a really good question. You know, I think we see it both ways, frankly. Companies are starting to double down on streaming architectures, right? It's starting to become normal to take every click that's ever happened on your shopping site or whatever and jam that into Kafka. That's just normal operating business now. And the question is, what do I do with that data and how do I make value out of it? And it's not a one-size-fits-all kind of thing in these architectures. For instance, maybe I want to have a stream processor running on that, maybe I'm going to run that in SQL, and I'm looking for certain events that are happening and then I'm going to send an alarm — maybe I'm going to send an email out to a marketing team. Another area is real-time experiments, right? So I'm running an experiment on my website — oh, it's performing really badly, and I want to know that right now because I've got to make the change — so I'll run an alert on that. That's fine; you can build that on the Eventador platform and you're off to the races. But you may also want to store that detailed data somewhere else, right? Maybe you're dumping that raw stream to S3, and maybe you'll run some sort of batch process on it later. That's another pattern we see that's very common. Or maybe you're going to put it into a data warehouse, and maybe that data warehouse is expensive, which they often are. In that case you would maybe pre-aggregate the data ahead of time and just put fixed aggregates into the data warehouse — maybe you do it by week, by month, by day — and that way you're saving a ton of space versus keeping the raw data in something that has traditionally been very expensive, as data warehouses have been. All that stuff is really enabled by platforms like ours, because you can take that raw data coming in from Kafka, you can build SQL processors on it, you can route data to different places, you can pre-aggregate it, you can dump it to S3. So all those things are on the table, and I think the best enterprises are leveraging tools in this way and thinking about it like, hey, get it into Kafka, and now let's think in a very serious way about how do I get the data out, where do I send it, and who are my consumers downstream.
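To make that pre-aggregation pattern concrete, here is a minimal sketch of a Flink SQL job, written the way Flink jobs are commonly expressed in Java, that rolls raw click events up into daily counts before they reach long-term storage. The topic names, fields, connector settings, and S3 path are illustrative assumptions, not Eventador's actual configuration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DailyRollupJob {
    public static void main(String[] args) {
        // Streaming table environment; exact setup varies by Flink version.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Raw click events landing in Kafka (hypothetical topic and fields).
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  user_id STRING," +
            "  page STRING," +
            "  event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'clicks'," +
            "  'properties.bootstrap.servers' = 'broker:9092'," +
            "  'format' = 'json')");

        // A cheap long-term sink for the rollups; the raw stream could be routed to
        // a filesystem/S3 sink in the same way if the full detail is worth keeping.
        tEnv.executeSql(
            "CREATE TABLE daily_page_counts (" +
            "  window_start TIMESTAMP(3)," +
            "  page STRING," +
            "  views BIGINT" +
            ") WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 's3://analytics/daily_page_counts'," +
            "  'format' = 'parquet')");

        // Only fixed daily aggregates ever reach storage, instead of every raw event.
        tEnv.executeSql(
            "INSERT INTO daily_page_counts" +
            " SELECT TUMBLE_START(event_time, INTERVAL '1' DAY) AS window_start," +
            "        page," +
            "        COUNT(*) AS views" +
            " FROM clicks" +
            " GROUP BY page, TUMBLE(event_time, INTERVAL '1' DAY)");
    }
}
```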
17:58  Tobias Macey
and in terms of how you've actually built out Eventador, I know that you have mentioned your affinity for Kafka. I'm wondering if you can just discuss the overall architecture of the platform and some of the ways that it has evolved since you first began working on it.
18:11  Kenny Gorman
Sure, sure. So there are some basic design tenets and various areas that we've tried to solve. It's a service-based design, it's built on Kubernetes, and some of our services use open source technologies like Flink, which we've mentioned, and Kafka. But there are some higher-level components that were really important as part of our design. First of all, schemas are hard. If you think your enterprise has one schema and it's perfect, you're probably lying to yourself. Most data is very messy. Most companies are working through trying to standardize and ensure that their schemas are good, but it is tough, and in a lot of cases it's the Wild West out there. So dealing with nested data structures, dealing with foreign data feeds, dealing with different departments — that's a big piece of the challenge, and that's why we created input transforms and have the deep nesting capabilities in our SQL engine. We wanted to be able to address those feeds, we wanted to allow customers to mutate those schemas on the fly if they needed to, we integrate with things like schema registry if needed, and we try to get that piece of the puzzle really nailed down. The other piece of it is a good SQL engine. We leverage Apache Flink for the core SQL engine, but that's not really the whole picture. Because if you use SQL in Flink today, what you're doing is actually writing Java or Scala code, and you're putting SQL into that Java or Scala code as a string, and it's evaluated at runtime. And as we know, SQL is an iterative, declarative language — you want to be able to iterate on your data structures and on your data with SQL, you want to play around with it, you want to understand maybe what groupings are important or what time frames are important, and then ultimately you decide, okay, that's a good piece of SQL, I'm going to send that off and let it be a job and run continuously. So our SQL engine is pretty cool in the sense that it allows you to iterate on streams of data. It understands the schemas, you can describe them. It uses ANSI-standard SQL based on Calcite, and that works nicely with Flink. And if you make a mistake in your grammar, like a syntax error, it will tell you right away, so you don't have to wait for that job to run and redeploy and go through CI/CD and all that kind of stuff. It really allows you to explore the data, play with it, and treat it like a regular SQL terminal. And then as you hit execute, it creates a job and sends it off to the server. It's highly scalable, has robust capabilities around fault tolerance, and things that, frankly, Flink brings to the table, and you're off to the races. And then lastly is this notion of materialized views. We have an engine that manages materialized views for us — that's something that we wrote. It manages the update and maintenance of materialized views and the indexing strategies, and really lets developers address that data with any key. It's not just lookup by key, it's not just a simple key-value store; it's an actual robust RESTful endpoint. And so customers can build queries, they can send in parameters, they can assign predicates, and then build queries and build applications off of it.
So those are kind of the big pieces — from input and ingest, to processing, and then ultimately to materialization — that we've added on top of the puzzle, using things like Kubernetes and Flink.
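As a rough illustration of the friction being described, this is what the Flink SQL workflow looks like when the query lives as a string inside a Java program: the statement is only parsed when the job actually runs, so a typo means another build-and-deploy cycle. The table definition, topic, and fields here are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EmbeddedSqlJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a Kafka topic as a table (hypothetical topic and schema).
        tEnv.executeSql(
            "CREATE TABLE sensor_readings (" +
            "  device_id STRING," +
            "  reading DOUBLE," +
            "  event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'sensor-readings'," +
            "  'properties.bootstrap.servers' = 'broker:9092'," +
            "  'format' = 'json')");

        // The query is just a string; a syntax error in it only surfaces at runtime,
        // which is the iteration loop an interactive SQL console avoids.
        tEnv.executeSql(
            "SELECT device_id," +
            "       TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS minute_start," +
            "       AVG(reading) AS avg_reading" +
            " FROM sensor_readings" +
            " GROUP BY device_id, TUMBLE(event_time, INTERVAL '1' MINUTE)")
            .print();
    }
}
```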
21:24  Tobias Macey
And given the fact that you are materializing the views, I'm wondering what you're using as the durable storage for that transient data. Is it something that you're writing back to a Kafka topic? Or do you have some sort of caching layer for being able to pull the current values from the materialized view based on the aggregates as they're updated?
21:43  Kenny Gorman
Good question. So today we're based on the Postgres engine, and that gives us a ton of flexibility. First of all, it's a very good SQL database — a very well-known and relatively bug-free database engine — and then our management layer on top of it, which treats it like a materialized view rather than a database, is a big part of that as well. So our software, mixed with some of the Postgres open source goodness, allows us to ultimately present those materialized views to the customer and keep them updated, age them out, and do all the things you'd expect from a view that's being updated in real time.
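For a sense of what keeping a Postgres-backed materialized view current can involve, here is a minimal sketch of the kind of keyed upsert a maintenance layer might issue as new aggregates arrive from the stream. The table, connection details, and values are hypothetical, and this is not Eventador's actual implementation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MaterializedViewUpsert {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection; assumes a page_view_counts table with a unique key on page.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/views", "app", "secret")) {

            // Keep one current row per key, overwritten as fresher aggregates arrive.
            PreparedStatement upsert = conn.prepareStatement(
                "INSERT INTO page_view_counts (page, views, updated_at)" +
                " VALUES (?, ?, now())" +
                " ON CONFLICT (page) DO UPDATE" +
                " SET views = EXCLUDED.views, updated_at = now()");

            upsert.setString(1, "/pricing");
            upsert.setLong(2, 1842L);
            upsert.executeUpdate();
        }
    }
}
```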
22:18  Tobias Macey
Another element of the schema support, and of being able to capture the appropriate context of the data as it's flowing in, is the idea of enriching the records as they're being inserted into Kafka. And I'm wondering what type of support you have for being able to either join across different topics or materialized views, or add a default set of data that gets injected as part of that input transform.
22:43  Kenny Gorman
So first of all, those things are possible on our platform. We're a little bit different in that we treat every input, whether it's Kafka or anything else, as a virtual table, and you can join virtual tables. So if you have a Kafka topic in the EU region and one in the US and you need to join those in a piece of SQL, that's fine — no problem, super easy. If you need to join something like Redis — maybe you've got, and I'm making this up, a common use case like a voter application, and maybe you want to take geo coordinates and look them up by county so you can populate a map or something like that — you might just have that geo-to-county lookup in a Redis database. So we allow you to actually enrich in real time from Redis, and that's pretty cool and super easy to use, as well as just using user-defined functions. With user-defined functions, if you have a static data structure you need to use, or some sort of lookup table, you can go ahead and use that in real time as well. So we see all of the above. Sometimes people start off with a more static dataset and actually put it in the UDF, and then they say, oh, it'd be great if I could update this on the back end via some sort of database technology, and Redis is a great fit for that. So we see that as well. And then, obviously, sources and sinks, which are the inputs and outputs of any kind of stream processing system, are important — S3 is one of them. I think, going forward, we'll have more support for what I'll call legacy databases, but today, if you want to do change data capture, or you just want to use Kafka or any of those kinds of things, all of those things are possible and work quite well.
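For illustration, here is a minimal sketch of joining two Kafka topics registered as virtual tables in Flink SQL, roughly the pattern being described; the topic names, fields, and broker address are made up.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamJoinJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Two Kafka topics exposed as tables (hypothetical schemas).
        tEnv.executeSql(
            "CREATE TABLE orders_eu (order_id STRING, customer_id STRING, amount DOUBLE) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'orders-eu'," +
            " 'properties.bootstrap.servers' = 'broker:9092', 'format' = 'json')");
        tEnv.executeSql(
            "CREATE TABLE customers (customer_id STRING, country STRING) WITH (" +
            " 'connector' = 'kafka', 'topic' = 'customers'," +
            " 'properties.bootstrap.servers' = 'broker:9092', 'format' = 'json')");

        // A regular stream-to-stream join; note that it keeps state for both sides,
        // so production jobs often bound it with time windows or a lookup source instead.
        tEnv.executeSql(
            "SELECT o.order_id, o.amount, c.country" +
            " FROM orders_eu AS o" +
            " JOIN customers AS c ON o.customer_id = c.customer_id")
            .print();
    }
}
```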
24:13  Tobias Macey
And then at the time that you started Eventador, my guess is that Kafka was still the preeminent streaming platform for being able to handle this durable, append-only storage. But the landscape has evolved fairly significantly in the past even just a year or two, and I'm wondering what types of changes have occurred in the overall ecosystem since you first began working on Eventador, and if you were to start over today, how would the current state of the industry influence your design choices and what would you do differently?
24:45  Kenny Gorman
Yes, there are other options out there today. I think that Kafka is sort of the de facto standard, and each cloud has their own little version of that as well — things like Kinesis — and I think those are kind of in second place at this point. You know, if I look back, knowing what I know now — I told you we created this initial prototype where we had this end vision of materializing results from streams with a SQL processor in the middle — I would have actually tried harder just to get to the end result. I mean, early on, and I want to be humble about saying this, I think we maybe didn't necessarily predict where the market was going, but we felt that it was going to go this way; it just made sense. And here we are now, and materializing views on streams is a real thing — there are multiple people doing it, it's a well-known design paradigm. We didn't know that when we first started Eventador; we just knew that this would be a cool way to build things on streams. And I think if I had known where we're at now, in terms of the macro picture of the market, I would have just tried harder to get here faster. Instead, we went through phases where we thought, hey, it's going to be a managed Kafka world or a managed Flink world, and our software and our platform is going to be a smaller part of it, and then ultimately figured out that the reverse was true. Kafka is very prevalent, and Apache Flink is an amazing piece of software written by very smart folks with a great list of maintainers, and it continues to grow in popularity, with a focus on correct results and good state management. So it is a great piece of software to work with and work on, and we're bringing what we can to the table. And I think the sum of the parts really equals a really nice and powerful platform. Like I said, if we had known that before, I would have just tried to jump right to the end result, but you never know when you're building something, I suppose.
26:37  Tobias Macey
Another element of building on top of all these different types of systems is the operational complexity that it brings along with it, because distributed systems are hard in any case, but trying to layer multiple ones of them together and then provide your own SLAs on top of it as a reliable foundation for other people's applications adds to that. I'm wondering what you have found to be some of the most interesting or challenging operational aspects of building and running your platform on top of these systems.
27:04  Kenny Gorman
Yeah, and that's a really good point to make. If you want to build this kind of real-time architecture in your company, in many cases you'll have the talent to do it — engineering talent is prevalent these days — and ultimately these applications are business critical. You're building something the business really needs to compete in its market space, and data is king, right? So the thought is, well, I'll just go build this stuff: I'll use Kafka and I'll use Flink and I'll build it myself. And companies have — you've seen Uber build AthenaX, you've seen Keystone by Netflix, and there are others — but those companies have put massive resources and massive time and energy into building those platforms. And ultimately, it just lets the development customer self-serve and build apps, and that's great. I think that if you're not Uber, or not one of these gigantic companies, then you probably do need to think about vendors in this space — us or someone else — to help glue together a lot of these pieces. Because, like I said earlier, stream processing is a different mental state than database engineering, to some degree. The same thing even if you're coming from the Hadoop distributed batch data landscape: these are very similar, but also very different mind shifts in how you deal with streaming data. How you think about late data, how you think about schemas in a schemaless world — how do you mix those two things? Obviously SQL requires strongly typed data and a fixed schema, but it's also pretty common for people to just jam whatever JSON into Kafka. How do you reconcile those two? How do you get a strongly typed schema out of a JSON blob that people are throwing into Kafka? And then how do you support all that, all day, all night, the whole time, at planet scale? I think in those cases you do need a good partner. You do need someone who's very good with support and understands the underlying technology stack to help you keep these things running. Like I said earlier, a lot of these are production, business-critical applications. And once you start to compete in your market space with real-time apps — maybe you're running machine learning models, or maybe you're just building really cool dashboards that customers are attracted to, so they're buying more of your product — once you start getting those things up and running, you can't just back away from them. They're production jobs, and they're very important to your business. So again, getting a good partner that understands that stack and the nuances within it is a pretty key and core thing. And then there's understanding the state of the art: I think the end state that's great for companies is that they can self-serve data — they can look at a stream of data, inspect it, pick the pieces they want, build their own filters and aggregations, and then self-serve it to whatever application they're building. That's the holy grail, I think.
And that's why AthenaX and the other platforms were built — so that the data engineering folks, and typically there are only a few of them even in big companies, can keep up with the demand from the business in terms of building real-time apps. And I think that's a really nice goal for us from a design aspect for Eventador. And companies should be looking and thinking in that way as well, because there are only going to be more applications being built on streaming data going forward, and how are you going to keep up? So that's why support and the right platform, I think, matter a ton.
30:28  Tobias Macey
You mentioned, too, that you're building on top of Kubernetes for being able to manage the actual infrastructure level. And I'm curious what you have found to be some of the useful elements of that platform, and some of the challenges for being able to manage these stateful workloads on it.
30:45  Kenny Gorman
Right, right. So if you've used Kubernetes, you'll know that it's a love and a hate thing. Look, I'm not a Kubernetes expert — the team is — but what I experience where I'm at is a lot of, hey, Kubernetes makes X really easy and Kubernetes makes Y really hard. Systems are more opaque; from a support standpoint it is harder to figure out where something went wrong and rectify it. At the same time, when we need to scale a cluster, it makes it very easy, and if we have faults or problems, rescheduling jobs and delivering a seamless experience to the customer is easier. So it's a love-hate thing. I think Kubernetes is still relatively early, but it's very powerful and makes a ton of sense for what we're doing. Going forward, I think it'll get better, and we'll, frankly, develop even richer processes for dealing with support and how we quickly identify broken things and fix them.
31:44  Tobias Macey
So again, going back to the SQL support, you mentioned that you support ANSI SQL, using Calcite as the engine for handling the parsing of that. I'm wondering what are some of the ways that you have found it necessary to extend or augment the SQL dialect for being able to support streaming workloads, and some of the different ways that user-defined functions are incorporated into these workloads.
32:12  Kenny Gorman
Sure, yeah, that's a good question. So, if folks don't know, Apache Flink has a SQL API, and that's ultimately what we leverage, based on the things I said before. The dialect that it understands is Calcite, which is for the most part ANSI SQL, and Calcite is very broad. Flink doesn't support 100% of Calcite, because some of that stuff is more batch related and doesn't really apply, but if you go to the Calcite page and you're wondering whether a piece of SQL works or not, then Calcite is the right place to look that up. I'll say that Flink has been on a development journey, and with the acquisition of Ververica by Alibaba, and the Blink planner that Alibaba has brought to the table, I think even more and richer SQL capabilities are coming forward, and that's great. And it already does stuff that's above and beyond the dialect of just Calcite, such as MATCH_RECOGNIZE — and MATCH_RECOGNIZE is doing complex event processing in SQL in a single statement. It's really amazing; we use it a lot here. So that's kind of an add-on from that standpoint. And then, like you mentioned, UDFs. The way we do UDFs is a little different, in that we harken back to our MongoDB days, where JavaScript was the procedural language for the engine, and we have that here too. The JavaScript gets compiled down to Java, runs in the JVM, and is sent out to the machines. And you can actually define that in real time. So we have an editor: you can build a snippet of JavaScript, and it can be something simple, like converting numbers or looking up a particular value in a hash, or it can be, I don't know, farming out to a REST endpoint to enrich your data — obviously that'd be slower, but we have all that kind of capability. So case logic and procedural logic, and really anything under the sun that JavaScript supports, you can go ahead and build a UDF with, and then you're off to the races. It's not one of those things where you have to recompile the server or build new jars or any of that stuff. You just define it and use it in your SQL, just like you did in your favorite database, whether it was MySQL or Postgres — it works the same way: you create a function and you're able to use it right away. We also use that JavaScript engine for our input transforms. Input transforms are cool because if your data is coming in and maybe you have arrays without keys, or maybe you've got messy data that needs to be normalized, or maybe some of the data is just bogus — you just want to drop data with a certain element and not process it, and only accept data with a certain schema or something like that — input transforms were created for that, and that's also in JavaScript. So if you're using Eventador on a day-to-day basis, you're writing SQL statements in Calcite SQL, you're maintaining that with JavaScript via input transforms and UDFs, and then you're creating RESTful endpoints that you're pulling into your app. We deploy an endpoint, you can use that in your application, and you get JSON data returned right from that endpoint.
So those are the different pieces of where they fit in, and how a UDF kind of sits in the middle.
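As an example of the kind of complex event processing MATCH_RECOGNIZE enables in a single statement, here is a minimal sketch in Flink SQL that flags three or more declined payments followed by an approval on the same card; the table, topic, and field names are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DeclinePatternJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Payment events from Kafka with an event-time attribute for pattern ordering.
        tEnv.executeSql(
            "CREATE TABLE payments (" +
            "  card_id STRING, status STRING, event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'payments'," +
            "  'properties.bootstrap.servers' = 'broker:9092', 'format' = 'json')");

        // One SQL statement: partition per card, order by time, and match
        // a run of declines followed by an approval.
        tEnv.executeSql(
            "SELECT *" +
            " FROM payments" +
            " MATCH_RECOGNIZE (" +
            "   PARTITION BY card_id" +
            "   ORDER BY event_time" +
            "   MEASURES" +
            "     FIRST(DECLINED.event_time) AS first_decline," +
            "     LAST(APPROVED.event_time) AS approved_at" +
            "   ONE ROW PER MATCH" +
            "   AFTER MATCH SKIP PAST LAST ROW" +
            "   PATTERN (DECLINED{3,} APPROVED)" +
            "   DEFINE" +
            "     DECLINED AS DECLINED.status = 'DECLINED'," +
            "     APPROVED AS APPROVED.status = 'APPROVED'" +
            " )")
            .print();
    }
}
```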
35:20  Tobias Macey
As far as the workflow of somebody building on top of Eventador, how does it integrate into the overall development cycle and the software development lifecycle of being able to version and deploy the different queries or user-defined functions that are powering people's applications, and making sure that any schema transformations are deployed at the same time as the applications that are going to rely on those different forms of the data?
35:47  Kenny Gorman
Eventador has a notion of projects, where you can write Java or SQL — you can just write Flink jobs and deploy them. It doesn't necessarily have to be in SQL, although we find most customers enjoy using SQL, because it's much easier and quicker to get the job done in most cases. In that case we use projects, and we have a projects interface that's integrated with GitHub, so you can plug that right into your CI/CD pipeline and integrate with the rest of your organization in that way — teams and all of that. Now, that's just for Java; SQL is coming, and that's something we're working on right now. So the next thing you'll see coming out from us is that you'll be able to check in the various components of your SQL job — whether it be the SQL, the UDFs, the input transforms, all the things that make your job work properly, the configuration for the materialized view, all that stuff — and then that lives in your source repo, and when we launch a job, you can just launch from the repo. That makes productionizing, having a source repository, versioning, and CI/CD real easy. We do that today for Java, and SQL is next.
36:46  Tobias Macey
What have you found to be the tipping point where SQL isn't sufficient for being able to complete a particular task, and somebody would want to drop down to writing the jobs specifically for Flink, or approach it in a more procedural fashion?
37:00  Kenny Gorman
We get asked that a lot, and there are a number of dimensions to the answer that I think are interesting. The first one is that, look, if someone's a Flink shop and maybe they've got a couple of Flink jobs running, and now the business is relying on them and they're nervous about supporting them and how to scale them, they can port those jobs over to Eventador, no problem — it'll run those right off the bat, and bam, you're off to the races. So from a transition and migration standpoint, it works really nicely to just keep using that job going forward. That's one thing. The other thing is, it just depends on who's on the team, what the mix of engineers is, and the time frames they have to complete a project. We had one customer — it's kind of funny — he was a Java guy, a total dyed-in-the-wool Java guy, and he said, yeah, I would like to write this job in Java, but I'm going to write it in SQL. And we said, well, why are you going to write it in SQL if you like Java so much? And he said, because I'm the only one who knows Java here, and if I write it in Java, then I'm going to be on the hook all night long for support and on-call and all that stuff, and I'd rather write it in SQL so the nine other people who work in my department can maintain it and change the job and mutate it over time if it needs to. And I thought that was a very mature decision — he's thinking about the team and how they can support it going forward. And look, he's looking out for himself too, a little bit; he didn't want to be woken up in the middle of the night, which I thought was funny. So I think we see a mix of both. If there are some libraries that you need to pull into a job, and maybe you're used to working where IntelliJ is your home, then we support that too, and that's no big deal. But I think it's like a 90/10 thing: 90% of almost all workloads can be expressed in SQL relatively robustly; 10% of the time you might need a specific library, or you have a specific workflow or code pattern that you're used to, and you'd rather do it in Java or Scala. So I think that's where the breakdown is — maybe it's 80/20, depending on the organization — but for the most part, most things are expressed in SQL, just because it's simpler in most cases. So I think that's where we see it heading.
38:57  Tobias Macey
And as developers are working on building out these streaming workloads, building applications on top of them, and trying to schematize the input data, what are some of the sharp edges or design pitfalls or data modeling considerations that they should be aware of?
39:15  Kenny Gorman
Right, that is a huge topic in and of itself. Schema management and data quality is honestly a whole topic on its own. Some of the things that we've seen trip people up, though, are when people are pulling in data from multiple foreign sources — sometimes that's a partner, sometimes that's maybe a foreign REST API where they're pulling in billing data or something. Other times it might just be departmental differences: department A and department B have different data structures for essentially the same thing. So that can be very challenging, and there are different philosophies out there. A lot of companies say, hey, we want you to adhere to this rigid standard on the schema, here's a document on how to do it, and here's a central schema repository — that kind of approach. We've seen that work and be successful. Other approaches are where people say, well, let's not burden them with that, otherwise projects will stall waiting for schema approval or whatever — let's just let them put the data into Kafka in the format that feels native to them, and then we'll build the tools that allow that data to be used effectively in the rest of the enterprise. And that's kind of where we fall; I think we're more powerful in that case. There's another company out there called Rockset — they automatically detect and handle schemas, and we do that too. They're more batch oriented, although they're kind of seeing the writing on the wall, I suppose, same as me, and going more towards streaming, so that's great. So we see that "let the data be in its native format and then build the tools to handle it in a more robust way" kind of design more often than not, and again, that's the one that we work better with. And that was the philosophy of MongoDB way back when, right — schemaless design. Ultimately everything has a schema, and the world doesn't exist without schemas, but let's move it to the part where we absolutely need it, and let that data be persisted and saved in Kafka and then used in an effective way with things like input transforms or UDFs, or just even being able to use nested structures in SQL — that's another way to deal with it. So I guess those are my thoughts there. I know that's not exactly answering what you're asking, but I think the real struggle is where do we fall on the level of militancy about deciding that there's a fixed schema, or do we let a loosely defined schema be the norm?
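To show what addressing nested structures directly in SQL can look like, here is a minimal sketch where a nested JSON event is mapped onto ROW types so consumers can pick fields out with dot notation while the producer's native shape stays in Kafka; the schema and topic are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class NestedJsonQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Nested JSON mapped onto ROW types; malformed records are skipped rather than failing the job.
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  id STRING," +
            "  device ROW<id STRING, firmware STRING>," +
            "  payload ROW<temperature DOUBLE, humidity DOUBLE>" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'events'," +
            "  'properties.bootstrap.servers' = 'broker:9092'," +
            "  'format' = 'json'," +
            "  'json.ignore-parse-errors' = 'true')");

        // Nested fields are addressed with dot notation in the query.
        tEnv.executeSql(
            "SELECT id, e.device.id AS device_id, e.payload.temperature" +
            " FROM events AS e" +
            " WHERE e.payload.temperature > 30.0")
            .print();
    }
}
```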
41:31  Tobias Macey
Yeah, the universal answer to any technical question: it depends.
41:35  Kenny Gorman
Yeah. Right. I was only one, it depends, maybe, maybe a lot more.
41:43  Tobias Macey
And what are some of the most interesting or innovative or unexpected ways that you've seen your customers using your platform and building on top of these boundless event streams?
41:51  Kenny Gorman
We had this thing early on that I thought was super cool. Like I told you, we created input transforms and UDFs in JavaScript, and we saw folks starting to use them. And I remember we were on a support call, and they were talking to each other about our platform in our vernacular, like it was normal — teaching each other our platform. And I thought, man, this is the greatest thing ever. I love it when we come up with an idea to solve their problem, they use it to solve that problem, and then they're teaching each other to solve it with the toolset. This one customer I was thinking of really doubled down on the input transforms — hundreds of lines of JavaScript code to process incoming data, just because of the nature of that incoming data. I think their use case is quite unique — not everybody has the same problem — but the fact that they were able to just use that out of the gate, and kind of got it and went with it, I was pretty proud and excited to see that happen, and frankly a little surprised that it worked so well for them right out of the gate. We've continued to modify it and make it better, and I think it's even more robust now. So that was really quite cool.
42:56  Tobias Macey
And in terms of your experience of building and scaling the Eventador platform and just working in the data management space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
43:07  Kenny Gorman
I think streaming data is harder than we thought it would be, overall. Like anything, it's kind of an iceberg, in the sense that you start putting data into Kafka and you're like, okay, I get it. Maybe you create a consumer group and you've got a microservice pulling data out, and you're like, okay, this is cool. But that's just the aha moment with streaming data. To then go from that to "my company runs and my production applications run on streaming data" is a whole different thing entirely. It was sort of like in the MongoDB days: it's super easy to put data into a Mongo database, but now I need to shard that database, and now I need to index those nested structures or whatever, and it starts to get complicated quickly. Ultimately, when you're thinking about things like out-of-order data and back pressure, streaming data is a mindset shift from the database technologies of old. It's extremely powerful in your applications, but it does require you to think differently about how it all works. That's one of the things we have really worked hard on: to make it feel and work like a database, to isolate folks from as much of the detail of how the sausage is made in streaming systems as possible. And that's all about using SQL and materialized views and those types of things. So from a high level, that's where I think it was harder than we thought. I also think, and I said this at the start to some degree, I'm pleasantly surprised and excited to see more folks in this space. Most of the time you'd say it would be bad to have a lot of competitors, but if I take it from my personal career standpoint, and from a craft standpoint, the thing we're building, the thing we care so much about and build every day and think about at night, I'm really excited to see a lot of other folks in this space. I mentioned Materialize earlier, the Confluent folks with ksqlDB, Rockset which I mentioned, and there are a few others that are great companies building really cool things. It's nice to see this ecosystem evolving and customers being successful with these stacks, whether it's us or someone else. I like to see that the community is getting it and growing. And it's really satisfying to see someone build something cool and say, that was kind of easy, and you're like, yeah, well, okay, I'm glad it was easy for you, we figured out some of the hard parts of that for you, and I'm glad we did. So that's kind of where I suppose I land on that.
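For readers who have not hit the out-of-order problem yet, here is a minimal, self-contained sketch of one common way to deal with it: buffer events and only release those older than a watermark (the maximum event time seen so far minus an allowed lateness). This is a generic illustration, not Eventador's or Flink's internal mechanism, and the class name and the five-second lateness are assumptions.

```typescript
// A minimal watermark buffer: late events are held back and re-emitted in event-time
// order once the watermark (max seen event time - allowed lateness) passes them.

interface Event { key: string; value: number; eventTimeMs: number }

class WatermarkBuffer {
  private buffer: Event[] = [];
  private maxEventTime = 0;

  constructor(private allowedLatenessMs: number) {}

  // Add an event; return any buffered events now below the watermark, in event-time order.
  push(e: Event): Event[] {
    this.buffer.push(e);
    this.maxEventTime = Math.max(this.maxEventTime, e.eventTimeMs);
    const watermark = this.maxEventTime - this.allowedLatenessMs;

    const ready = this.buffer
      .filter(ev => ev.eventTimeMs <= watermark)
      .sort((a, b) => a.eventTimeMs - b.eventTimeMs);
    this.buffer = this.buffer.filter(ev => ev.eventTimeMs > watermark);
    return ready;
  }
}

// Usage: the second event arrives out of order and is only emitted, correctly ordered,
// once the third event advances the watermark past it.
const buf = new WatermarkBuffer(5_000);
for (const e of [
  { key: "a", value: 1, eventTimeMs: 10_000 },
  { key: "b", value: 2, eventTimeMs: 8_000 },   // late event
  { key: "c", value: 3, eventTimeMs: 20_000 },  // advances the watermark past a and b
]) {
  console.log(buf.push(e));
}
```

The allowed lateness is the knob: larger values tolerate more disorder but delay results and hold more state, which is exactly the kind of detail a SQL-and-materialized-views interface tries to hide from the application developer.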
45:38  Tobias Macey
And what do you have planned for the future of the Eventador platform?
45:42  Kenny Gorman
Sure. So, you know, we want to continue to make application developers, data scientists, and data engineers successful with streaming data. The theme I keep coming back to is that this stuff is hard; there's a lot to it. Ultimately you just want to build that killer application, and your knowledge shouldn't have to extend all the way into the depths of all these different components. Maybe your company already has Kafka, and you just want to make sense of that data and drop in and start building. Those are the folks we want to continue to make successful with our product so they can build great things. And I think the world, especially now, has more data problems than ever. Data is now, and will continue to be, the most important thing a company has beyond its people. You have to make sense of that data and use it competitively, or you're going to die, and using streaming data to do that is the killer app. I think that's super exciting. So our path is to build more APIs, do things like automatic fault detection and automatic scaling, and use the clouds more effectively in terms of cost controls and scalability. We have feature sets in all those areas that we plan to keep building out. I think our REST API is already very robust in terms of its capabilities around secondary keys and operators and things like that, but we want to continue to build on that too. We know that flexibility in the API and usability for developers is going to be key, so that's one area we're going to continue to invest in and build on.
47:10  Tobias Macey
Are there any other aspects of the work that you're doing at Eventador, or the overall space of streaming SQL and event streams and building applications on this real-time data, that we didn't discuss that you'd like to cover before we close out the show?
47:23  Kenny Gorman
No, I think that's a pretty good overview of streaming stacks these days. I think the next year and a half or two years is going to be really exciting for this field. Ultimately, customers are going to understand that they should be building self-service platforms, because their customers, the developers and data scientists, need to be able to have access to this data. I think data science in general is going to grow to understand that streaming data can be a big part of it. Today, if you asked data scientists what data looks like, I don't think a huge segment of them across the board would say, oh, it's streaming. I think that's going to change, and I think people will understand that data in motion has more value than data at rest, across the board, and we'll see some more exciting innovations coming from those teams of people. So I think that's exciting. But yeah, that's it, I think.
48:15  Tobias Macey
For anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
48:29  Kenny Gorman
Sure.
48:30
You know, I think, ultimately, users of data in an enterprise don't know what data they have. They don't know how to use that data effectively in their apps or in their projects, whatever they're building. And I think data discovery and understanding of schemas, as we talked about a little bit, and having central repositories of this data, easily available, not just in the streaming sense but across the board, has been a big challenge for companies. Like I said, it's not just databases anymore; it's databases and batch and queuing and all sorts of different data systems that make up a robust information architecture these days. And I think that just means more sprawl and more confusion. That's a big area that is still kind of uncharted and left unfixed, so to speak, or unaddressed to some degree.
49:18  Tobias Macey
Well, thank you very much for taking the time today to join me and share your experience with building out the Eventador platform and working in the space of streaming data. It's definitely a very interesting and challenging domain, so it's great to see people like you out there trying to tackle it. I appreciate all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks, Tobias. You too.
49:44
Thank you for listening! Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.