Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com






episode 105: Automating Your Production Dataflows On Spark [transcript]


Summary

As data engineers, the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production-grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation for understanding all of the incidental engineering that is necessary to make your data reliable.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
  • Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elastic Search cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elastic Search cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what the Ascend platform is?
    • What was your inspiration for creating it and what keeps you motivated?
  • What was your criteria for determining the best execution substrate for the Ascend platform?
    • Can you describe any limitations that are imposed by your selection of Spark as the processing engine?
    • If you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?
  • Can you describe the technical implementation of Ascend?
    • How has the system design evolved since you first began working on it?
    • What are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?
  • How does the programming interface for Ascend differ from that of a vanilla Spark deployment?
    • What are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?
  • How do you enforce the lack of side effects in the transforms that comprise the dataflow?
  • Can you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?
  • What are some of the most challenging aspects of building and launching Ascend that you have dealt with?
    • What are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?
  • What are some of the capabilities that you are most proud of and which have gained the greatest adoption?
  • What are some of the sharp edges that remain in the platform?
    • When is Ascend the wrong choice?
  • What do you have planned for the future of Ascend?
Contact Info
  • LinkedIn
  • @seanknapp on Twitter
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
  • Ascend
  • Kubernetes
  • BigQuery
  • Apache Spark
  • Apache Beam
  • Go Language
  • SHA Hashes
  • PySpark
  • Delta Lake
    • Podcast Episode
  • DAG == Directed Acyclic Graph
  • PrestoDB
  • MinIO
    • Podcast Episode
  • Parquet
  • Snappy Compression
  • Tensorflow
  • Kafka
  • Druid

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA









 2019-11-04  48m
 
 
00:10  Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. And having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it's storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don't forget to thank them for supporting the show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the Data Orchestration Summit and Data Council in New York City. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service. So Sean, can you start by introducing yourself?
03:00  Sean Knapp
Hi, thanks for having me, Tobias. I'm Sean Knapp, the founder and CEO of Ascend.io.
03:05  Tobias Macey
And do you remember how you first got involved in the area of data management?
03:09  Sean Knapp
I do. It was actually at the very start of my career, back in 2004. I had just graduated college as a computer science major, and ended up starting at Google on the front end engineering team. And part of our big mandate was to do a lot of experimentation with the various usability factors on web search, really trying to get better engagement with our users. And a lot of my academic background was around cognitive science and human computer interaction. And what I found very quickly, in experimenting with all these usability factors on our users, was that I actually ended up spending far more time writing data pipelines to analyze the usage of our users. As you know, when you're at Google scale, simply trying to figure out what your users did yesterday required pretty remarkably sized infrastructure and sophistication to inform what you should do next and the efficacy of what you had done before. And so very quickly, I went from being a front end engineer to specializing deeply in data engineering and data science.
04:21  Tobias Macey
And how did you find that difference in terms of the tool sets, going from being a front end engineer to dealing with a lot of back end data processing?
04:31  Sean Knapp
It was incredibly different, in that front end is a very visual experience and you have what I think are fairly mature tools and technologies. When we went into a lot of the data processing domain, what I found was that the raw capabilities, the ability to store and process data at incredibly large volumes, were quite mature. The rest of the tooling and the ecosystem around that wasn't as advanced. And that was sort of my first experience in data management and pipelines in general: it's very easy to write a pipeline to do a thing, but writing many pipelines that depend on each other and interconnect and do more sophisticated and advanced things together really became a lot of the challenge that I observed early on in my career.
05:25  Tobias Macey
Yeah, it's definitely interesting how a lot of these tools have developed from, as you said, being very technically capable, but they leave a lot of sharp edges that you can harm yourself on if you don't know exactly what you're doing and don't have a lot of background in the space. And it's interesting to see how the industry has been progressing to add a bit more polish and safety features into these systems to make them easier to approach.
05:51  Sean Knapp
Yep, I wholeheartedly agree. I think it's something that we see in the natural evolution of many technology domains. And I think this is really the big encumbrance we find today. The thing that makes life as a data engineer more challenging, to put it simply, is just trying to maintain these powerful yet brittle and finicky technologies; that is really becoming one of the larger pain points of modern day data engineering.
06:22  Tobias Macey
So can you start a bit by explaining what the Ascend platform is that you've been building?
06:27  Sean Knapp
Yeah, I'd be really happy to. So you know, about four years ago, we sat down and took a look at the data ecosystem. We had been building complex analytics and machine learning and data systems for, at that point, the last 11 or 12 years of my career. And one of the things that I really wanted to try and tackle was, how do we make our lives as data engineers easier, more effective, and simpler? We felt like we were constantly reinventing the wheel, constantly being paged at three in the morning; there must be something better. And so the core thesis behind Ascend was, what if we could create a new technology, something that's not a storage or a processing system, but instead a control system, similar to what we've seen in other industries and other domains? For example, infrastructure now has Kubernetes, a declarative model for infrastructure with an intelligent control plane that orchestrates the underlying infrastructure. Could we do something like that for data pipelines? And so we spent a good bit of time really architecting what a declarative model would look like for data pipelines, and whether we could architect an orchestration system that provides this control plane, pairing declarative configuration in code with an intelligent control plane to automate a lot of the operation, performance tuning, and scaling of data pipelines that, to date, we had had to simply manually code.
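To make the declarative idea concrete, here is a minimal sketch of what a declarative dataflow definition might look like. The class and component names are hypothetical stand-ins for illustration, not Ascend's actual SDK: the author describes the datasets and transforms that should exist, and a control plane is responsible for reconciling the physical state against that description.

```python
# Hypothetical declarative dataflow spec -- illustrative only, not Ascend's real API.
# The engineer declares *what* should exist; a control plane decides *what work to run*.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    inputs: List[str] = field(default_factory=list)  # upstream component names
    sql: str = ""                                     # declarative transform logic


# Desired end state: a small DAG of connectors and transforms.
dataflow = [
    Component("raw_events"),  # read connector pointed at a bucket or topic
    Component(
        "daily_sessions",
        inputs=["raw_events"],
        sql="SELECT user_id, COUNT(*) AS events FROM raw_events GROUP BY user_id",
    ),
    Component("warehouse_out", inputs=["daily_sessions"]),  # write connector
]

# A control plane would continuously diff this blueprint against what already
# exists in storage and schedule only the missing or stale pieces of work.
for component in dataflow:
    print(component.name, "<-", component.inputs)
```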
08:12  Tobias Macey
And can you talk a bit more about some of the background and origin story and your inspiration for creating it in the first place, and some of the aspects of the problem space you're working in that keep you motivated?
08:25  Sean Knapp
Absolutely. So one of the things that was really motivating is, you know, a few years before I started Ascend, I spent a lot of time with our engineering teams building these pipelines that, simply put, we felt we could describe with a handful of SQL statements, yet we found we were writing a ton of code and dealing with incredible pains of maintaining those systems, due to the intricacies of the data and challenges like late-arriving data and duplicate data and failures in the underlying systems. And so we took a step back and said, well, there's an innumerable set of heuristics and problems that we encounter here. And when we think about it from an academic perspective, we have great processing technology and we have great storage technology, but we don't have a pipeline engine. You know, when I use a database or a data warehouse, there's a query planner and a query optimizer that run and optimize within that database engine. We don't get the same thing with pipelines today; we're in essence re-implementing database query planner logic at every single stage of every single pipeline. And this seemed like a really interesting and incredibly hard challenge to solve. But if we could solve it, we could truly introduce a new wave of data engineering, and get ourselves out of this data plumbing business and far more into data architecture. And so it was really exciting to think through what we could actually enable across the ecosystem if we got people out of the muck.
10:07  Tobias Macey
And my understanding is that the foundational layer for the platform that you've built is using Spark. So I'm wondering if you can talk a bit about your criteria for selecting the execution substrate for the platform, and some of the features of Spark that lend themselves well to the project that you're trying to build?
10:28  Sean Knapp
Yeah, that's a great question. Interestingly enough, when we originally started Ascend, we started with BigQuery as the execution engine, and we started with a simple SQL dialect, because it was entirely declarative in nature, and we knew BigQuery was a very scalable engine. And obviously, as a result, we operated purely in a Google Cloud environment. The idea behind this was, in those early stages, to really just prove out the concept of this declarative pipeline engine and orchestration layer. And as we proved that out and found that we could really solve some pretty powerful and compelling challenges, we then started to change our focus to how do we make this extensible to multiple compute engines, and how do we make this accessible to multiple cloud environments? And that's where we started to do a lot of our research. We looked into Beam, we looked into Spark, we looked into a handful of other technologies, and also interviewed a lot of our friends and customers across industries. We found that Spark was simply the most popular, it was the one we had the most expertise and understanding of in house, and it was one where we felt we could really provide a multi-cloud, multi-platform advantage. And for developers, one of the things we also found was there was a strong desire to still build for Spark, just simply not have to deal with the brittleness and the finickiness of it, per se. And so we actually found some really cool capabilities in the approach of not only can we manage a Spark infrastructure for you, but we can remove a lot of the scaffolding around using Spark and really focus a lot more of the engineering time on data frame transformations and logic, as opposed to parameterization and tuning and tweaking. So we found we could get not just this multi-cloud and multi-platform benefit, but really could expose Spark to our users in a way that was really compelling.
12:37  Tobias Macey
And I'm curious, what are some of the sharp edges and limitations of Spark that you have run into in the process of building on top of it? And if you were to rewrite Spark from scratch today to fit your particular set of requirements, are there any aspects of it that you would change?
12:56  Sean Knapp
Yeah, I'd say, you know, there's a bunch of tweaks and nuances we've done over the course of it. For example, as we became HIPAA compliant, we had to do quirky things with how you manage both encryption and compression of data as it's being stored and in transit, things that just weren't quite properly supported there. But those are always just the small sharp edges you find. The really big, interesting one that I would love to see Spark even more fully embrace, and they're doing a lot more with this in 3.0, is native Kubernetes support. We're really big fans of running Spark on K8s. We actually have been running K8s as an underlying infrastructure since January of 2016, and all of our Spark usage today runs on elastic Kubernetes infrastructure. So really continuing to invest in that tight connection between Spark and Kubernetes, for us, has been an area of extreme interest.
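For context on what running Spark natively on Kubernetes looks like, here is a minimal, hedged sketch of a PySpark session configured against a Kubernetes master in client mode. The API server endpoint, namespace, and container image are placeholder values, and this reflects Spark's generic Kubernetes support rather than anything Ascend-specific.

```python
# Minimal sketch of Spark's native Kubernetes scheduler support (client mode).
# The k8s:// endpoint, namespace, and image below are placeholders, and the
# driver is assumed to be running inside the cluster so executors can reach it.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dataflow-stage-example")
    .master("k8s://https://kubernetes.default.svc:443")           # API server endpoint
    .config("spark.kubernetes.namespace", "data-plane")           # where executor pods land
    .config("spark.kubernetes.container.image", "example/spark-py:3.0.0")
    .config("spark.executor.instances", "4")                      # static sizing for the sketch
    .getOrCreate()
)

# A trivial DataFrame transformation, standing in for a real pipeline stage.
df = spark.range(1_000_000).withColumnRenamed("id", "event_id")
print(df.count())

spark.stop()
```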
14:04  Tobias Macey
So can you describe a bit more of the technical implementation of Ascend?
14:10  Sean Knapp
Yeah, the technology itself really works at a couple of different layers. The infrastructure there we've designed to run on all three clouds, that being Amazon, Azure, and Google. And as a unified infrastructure layer, we run two Kubernetes clusters. One is for what we call our control plane. That is all of our microservices that operate at the metadata layer. It's about 15, maybe 20 microservices now, a combination of Node, Golang, and Scala services that mostly talk gRPC to each other and build a pretty cohesive model of what's going on in the system, and I can dive more into that. And then the data plane is the other Kubernetes infrastructure. That's elastically scaled on spot and preemptible instances, and it runs both Spark on Kubernetes for a lot of our Spark infrastructure and also runs workers, essentially auto-scaling Go-based workers that we use for a lot of processing that sits outside of Spark, where the shape and model of the work required fits better into a custom set of work that's run directly on Kubernetes, as opposed to in Spark. But both of those run inside of this elastic compute infrastructure.
15:34  Tobias Macey
And what are some of the ways that the implementation details and the overall system architecture have evolved since you first began working on it, and some of the assumptions that you had going into the project that have been challenged or updated as you started to get deeper into the problem?
15:52  Sean Knapp
Yeah, I'd say one of the key things, as part of this notion of declarative pipelines, is the core engine that operates on everything, this control plane. The control plane's responsibility is to take this blueprint, which is the output of the data architecture, and continually compare it against the data plane, essentially what has already been calculated and what exists today. And that control plane has to answer the questions: does what I have already existing in the data plane reflect the blueprint at the logic plane? If it doesn't, what needs to be regenerated or updated or deleted? And how do I dynamically do this? So when we first architected the system and that control plane, the idea behind it was, well, let's inspect the entire graph. You apply a bunch of compiler theory and you do things like SHA hashing not just the data but all the transforms that were performed, recursively all the way down to the data SHAs, to rapidly determine, have we done the appropriate work, or do we need to do new work? But what we found is that even this becomes harder at scale. And so we started to invest a lot of time and energy in answering the question of, what happens if I'm tracking not just 10 or even 100,000 transforms that may have millions of partitions of data, but hundreds of millions or billions of partitions of data, not just records, but actual individual files? How can I, within a second or two, rapidly determine whether or not my blueprint actually matches what exists in the underlying storage layer? And so this was one of the huge areas of investment, where we spent well over a year with a big chunk of the team building the next generation of our control plane to be able to do that: essentially inspect the massive blueprint of data and the underlying physical state of data, and within a matter of a second or two tell you what new work has to be done. And that was probably one of the biggest undertakings that we've had to go through as a company.
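As a rough illustration of the recursive fingerprinting idea described here, the sketch below hashes each transform's logic together with the fingerprints of its inputs, so a change anywhere upstream changes the downstream fingerprint and marks that node's work as stale. The structure and names are assumptions for illustration, not Ascend's actual metadata model.

```python
# Illustrative sketch of content-addressed pipeline state, assuming a simple
# DAG of transforms; not Ascend's actual control plane implementation.
import hashlib
from typing import Dict, List

# Each node: the code/SQL of the transform plus the names of its inputs.
dag: Dict[str, dict] = {
    "raw_events": {"inputs": [], "logic": "read s3://example-bucket/events/"},
    "sessions":   {"inputs": ["raw_events"], "logic": "GROUP BY user_id"},
    "report":     {"inputs": ["sessions"], "logic": "SELECT ... ORDER BY events"},
}


def fingerprint(node: str, dag: Dict[str, dict], cache: Dict[str, str]) -> str:
    """SHA of a node's logic combined with the SHAs of everything upstream."""
    if node in cache:
        return cache[node]
    upstream = [fingerprint(parent, dag, cache) for parent in dag[node]["inputs"]]
    digest = hashlib.sha256(
        (dag[node]["logic"] + "|" + "|".join(upstream)).encode()
    ).hexdigest()
    cache[node] = digest
    return digest


# Compare the blueprint's fingerprints with what was last materialized;
# only nodes whose fingerprints changed need to be recomputed.
previously_materialized = {"raw_events": fingerprint("raw_events", dag, {})}
current: Dict[str, str] = {}
stale: List[str] = [
    node for node in dag
    if fingerprint(node, dag, current) != previously_materialized.get(node)
]
print("needs recompute:", stale)
```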
18:14  Tobias Macey
And in terms of the interface that an end user of the Ascend platform would be interacting with, how does that compare to the Spark API? And what are some of the additional features and functionality that you've layered on, or the changes in terms of the programming model that's available?
18:34  Sean Knapp
Yeah, the interfaces themselves, from a tactical perspective, are similar in the sense that there are command lines, there are SDKs, there are APIs. We also offer a really rich UI experience, where you can navigate the entire lineage and dependencies of all the various data sets, see where they came from, get the profile data on them, and so on. I'd say the mental model, however, is slightly different. When we think of how our dataflows work, they are separated from the execution layer in the sense that there is no job to be run, right? So when you send something to Spark, you're saying run this job. When you send something to Ascend, you're saying make it so. The fundamental difference in approach is, you can send declarative instructions to Spark, but it's contained within a task, or really a job in Spark's vernacular. The idea behind Ascend is, what if you could take not just one job but your entire world of jobs and make those all declarative? So, you know, I would say that if Spark were to have a perpetual query planner and optimizer that looked across all the jobs it had ever done before and understood the storage layer as well as the compute layer, and the dependencies between those, that's kind of how Ascend looks at it. And so we've spent very little time trying to optimize how Spark itself runs, but we've spent a lot of time trying to optimize what gets sent to Spark. So those paradigms are, I think, best summarized as the difference between imperative programming models and declarative programming models.
20:22  Tobias Macey
And my understanding is that the way somebody would actually create a desired end state with the Ascend platform is by writing a series of transforms, which are guaranteed to be side effect free. And so I'm wondering if you can talk a bit more about that mechanism, some of the ways that you ensure that there are no side effects in the transforms, and some of the challenges in terms of the conceptual model that engineers go through coming from Spark and working with Ascend?
20:56  Sean Knapp
Yeah, I'd be happy to. So, trying to have transforms be side effect free fits into a couple of categories here. One thing that we did, and this is one of the benefits of starting with SQL, is we can parse that SQL and we can analyze and understand the output schema and the inferred partitioning mechanisms, and really optimize that in harmony with the downstream SQL transforms that are working on that data. And so part of the ability to avoid side effects is, if you start first with a language like SQL, it makes it much easier for us to always know where a piece of data came from, why it got there, and whether or not the calculation in that partition is still valid, or whether something upstream changed and it needs to be recalculated. That was because we could parse SQL really effectively and understand what that dependency chain is. So that's really where it started. As we opened up more PySpark and data frame transformation capabilities, it really became the balance of exposing a lot more of the raw horsepower and capabilities, while asking users to inform the control plane with enough hints, things like: is this a full reduction, is it a partial reduction, is it a straight map operation? We can infer a fair bit from those code snippets, but we at the same time do need the developer to give a sort of architectural assistance to properly optimize the system. Then, at the underlying layers, we put a lot of work into our storage layer to do things like optimization around deduplication. So for example, before we ever send any task or job to Spark, we first look and say, well, what's the transformation being done, what sets of data is it being done on, and have we ever done this for any data pipeline anywhere else in our ecosystem? Is it possible to optimize it? Is it already actually sitting in S3 or GCS or Azure Blob storage? And if it is, we don't even have to send that work; we can simply leverage the same piece of data. And so that optimization, paired with much more of a functional execution model where we never overwrite an individual piece of data, but instead introduced a safety layer of atomic commits at the metadata layer, allowed us to ensure that no data was ever propagated unless it passed the integrity checks and was properly committed and became a new update of that model. To recap, it was this combination of much more declarative models, understanding the developer's intent to inform that control plane, with the safeguards of functional programming models tied to safety checks like atomic commits and integrity checks, to guarantee that nothing ever flows through that shouldn't actually be there.
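To illustrate the "hints plus deduplication" idea in this answer, here is a small hedged sketch: a PySpark-style transform is registered with a declared operation type (map versus full reduction), and a fingerprint of the transform plus its inputs is checked against previously materialized results before any work would be sent to Spark. The decorator, hint names, and cache are hypothetical, illustrative stand-ins rather than Ascend's real interface.

```python
# Hypothetical illustration of transform hints plus work deduplication.
# The @transform decorator and hint values are invented for this sketch.
import hashlib
import inspect

materialized = {}  # fingerprint -> storage location of a previously computed result


def transform(inputs, hint):
    """Register a transform with a partitioning hint ('map', 'partial_reduce', 'full_reduce')."""
    def wrap(fn):
        fn._inputs, fn._hint = inputs, hint
        return fn
    return wrap


@transform(inputs=["raw_events"], hint="full_reduce")
def sessions(df):
    # A side-effect-free DataFrame transformation: group events per user.
    return df.groupBy("user_id").count()


def plan(fn, input_fingerprints):
    """Decide whether the transform actually needs to run on Spark."""
    source = inspect.getsource(fn)  # the transform's logic is part of its identity
    digest = hashlib.sha256((source + "|".join(input_fingerprints)).encode()).hexdigest()
    if digest in materialized:
        return ("reuse", materialized[digest])  # result already exists; skip Spark entirely
    # Otherwise this stage would be submitted to Spark; the hint (fn._hint) tells the
    # planner whether it can pipeline per partition ('map') or must shuffle ('full_reduce').
    materialized[digest] = f"s3://example-lake/{fn.__name__}/{digest[:12]}/"
    return ("submit", materialized[digest])


print(plan(sessions, ["abc123"]))  # first call: ('submit', ...)
print(plan(sessions, ["abc123"]))  # same inputs and logic: ('reuse', ...)
```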
24:16  Tobias Macey
And my understanding, too, is that in addition to the execution layer, Ascend has a structured data lake capability as well. I'm wondering if you can do a bit of compare and contrast between what you're building there and some of the capabilities that are available through Delta Lake?
24:36  Sean Knapp
Yeah, happy to. So what we did with Ascend's structured data lake is, all the data that can be managed and exposed by the structured data lake is really the outputs of the various pipelines and dataflows. And what we can do as part of this is leverage all the underlying metadata that we collect, where we know how to do things like dynamically partition the data based off of the transforms being performed and what we already understand from profiling the data. We can also do things like guarantee atomic commits and atomic reads of the data, again based off of inserting this abstraction layer between where you're accessing the block level, if you will, or the partition level of the data within a blob store, and the metadata layer, to ensure that level of consistency. And so our approach was really oriented towards: can we ensure that you have a well formed, well structured data lake that is directly reflective of the pipelines that operate on it and exposes more of that metadata? Delta Lake, which is a really cool technology, solves a slightly different problem, which at its core is how do you insert additional data so that your data lake behaves a little bit closer to how a data warehouse would, where you can get snapshotting and compression of data as you're incrementally adding to those data sets. It has a handful of other really cool capabilities as well. But I'd say the two would work in concert with one another, while solving different problems.
26:23  Tobias Macey
And for a data engineer who is onboarding onto the platform and looking to complete a particular project, do you have any concrete examples of what the overall workflow looks like and how that fits into some of the broader ecosystem of data platforms?
26:40  Sean Knapp
Yeah, we do. One of the things that we've opened up, which is pretty cool for a data engineer that wants to try Ascend, is that we now see examples of this on a weekly basis where folks literally go to our website and get straight access to the product in a trial environment, without ever having to put in a credit card or talk to anybody. It's really instantaneous to start building on the product. And then for that experience, and how they integrate into their existing systems, we've worked pretty hard to make it seamless. This is actually one of our metrics and one of the goals we're trying to optimize: as you drop into a trial environment, or if you're already a customer, you simply create one of three core constructs. It's either a read connector, a transform, or a write connector. Most users obviously start first with the read connector. You describe where your data is sitting; it could be sitting inside of S3, or Redshift, or a Kafka queue. And as you describe that, you literally point Ascend towards it, and Ascend will start actually listing it out and analyzing the data. If it's not Snappy-compressed Parquet files, we're going to convert it and store it internally for you, so it's optimized for Spark automatically. And then you can just start building against it. We have a capability called queryable dataflows, where at any stage of any data pipeline we expose something like a warehouse, where you can do dynamic ad hoc queries and then rapidly convert those into full stages of pipelines called transforms, and vice versa. And so the iteration process becomes really quick: you stand up a few stages, and then on the tail end, the way that it works is we make it really easy to either write your data back out to Redshift, S3, GCS, ABS, etc., or actually access the data if you're trying to put it into something like Tableau or Power BI or others. We have SDKs and APIs, as well as a URL structure for embedding those straight into other systems, to read out the records or the raw byte streams from any dataflow. So we've added a lot of these capabilities to try and make it as fast as possible to create that end-to-end hello world experience, if you will.
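As a concrete picture of the read connector, transform, write connector workflow described above, here is a short hypothetical sketch in plain Python; the class names and fields are invented for illustration and do not reflect Ascend's actual SDK.

```python
# Hypothetical sketch of the three core constructs described above.
# None of these classes exist in a real SDK; they are illustrative only.
from dataclasses import dataclass


@dataclass
class ReadConnector:
    uri: str  # e.g. an S3 prefix, a Redshift table, or a Kafka topic


@dataclass
class Transform:
    name: str
    input: object
    sql: str  # declarative logic; the platform infers schema and partitioning


@dataclass
class WriteConnector:
    input: object
    target: str  # e.g. a warehouse table or an object store prefix


# 1. Point at where the data already lives; the platform profiles it and
#    converts it to Snappy-compressed Parquet internally.
raw = ReadConnector(uri="s3://example-bucket/clickstream/")

# 2. Declare side-effect-free transforms over the upstream data set.
sessions = Transform(
    name="daily_sessions",
    input=raw,
    sql="SELECT user_id, COUNT(*) AS events FROM raw GROUP BY user_id",
)

# 3. Declare where the results should land; the control plane keeps it current.
out = WriteConnector(input=sessions, target="redshift://analytics.daily_sessions")

print(out)
```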
29:02  Tobias Macey
And can you also talk a bit more about the pipeline orchestration system that you've built into Ascend, and some of the overall benefits that provides to data engineers as well?
29:11  Sean Knapp
Yeah, I'd be happy to. At the pipeline orchestration level, the really key benefit goes back to what we talked about, which is that shift from an imperative system to a declarative system. The benefit is that we automate and offload a lot of the painful pieces: things like, what happens when you have late-arriving data, do you need to shift how you're partitioning that data, or do you have to retract previous calculations and update them and propagate them through as a result? Things like that end up being handled in a declarative model, as the system itself simply analyzes and detects what data has started to move through, what data has already moved through, and knows the lineage independently of those. And so what we found is it's much faster to architect and design these pipelines and get them deployed out at scale. The other big benefit that we've seen is the deduplication factor that we get with a lot of the underlying infrastructure and storage, as well as the pipelines, allows us to do things like rapid branching of pipelines that doesn't require reprocessing. We're able to take an existing pipeline that may be running in production, and you can simply branch that, like you would branch code, and only modify bits and pieces of it. And the only reprocessing that's done, even as you're developing on it, is really just the delta, the changes to that data set and that pipeline, versus having to reprocess everything. Then I'd say the third piece, which is also related, is removing a lot of the scheduling burden. You used to have to say, I want to monitor these data sets at these times, with these timers and triggers, and check for these parameters. Even as you're gluing together different DAGs of pipelines, there was a huge scheduling burden to actually pair all of these together. What we figured out is we can actually remove a lot of that and simply make it, hence the name, dataflow-esque, based off of the data models, as opposed to a scheduling and trigger-based model.
31:21  Tobias Macey
For people who have an existing investment in the overall Spark ecosystem and existing Spark jobs, is there a simple migration capability that you have available, as far as being able to translate their existing jobs to the new programming model or run them directly on Ascend while they work on remapping them to the paradigms that you are supporting? Or is it something that they would have to do piecemeal, where they just do direct migrations from the existing infrastructure and then do the translation to deploy onto Ascend?
31:54  Sean Knapp
Yeah, that's a really good question. We've definitely worked with a bunch of customers to really accelerate that migration path. One of the really cool things that we found is the ability to both simplify the code as it migrates over, and then also optimize a lot of the pipelines themselves. You know, in one of the best examples, actually a case study on our website, we were able, as part of the migration process, to cut out 90% of the code required for a particular use case, simply because so much of it was the scaffolding around spinning up Spark and managing the parameterization and tuning, versus the actual architecture of the data and the system itself. And so what we find with a lot of our customers, as they start to build on Ascend, is the ability to either integrate into their existing pipelines or rapidly migrate them over in a much simpler, declarative model on Ascend itself. The other piece that I think is super important to highlight is we're really big believers that your tech stack isn't going to have just any one piece of technology, right? You're going to run some stuff with us, and you're going to run some of your own Spark stuff, probably, or Presto, or some other Hive or Hadoop cluster. And as a result, this is one of the reasons why we launched this notion of the structured data lake: let's actually give you the ability to point your own Spark jobs, your own Hadoop jobs, or even notebooks, directly at the underlying optimized internal storage layer of Ascend, so that we can really plug in just like any other piece of your infrastructure. It's just an S3-compatible interface at the end of the day that can hook into the rest of your tech stack.
33:45  Tobias Macey
And my understanding is that for providing that S3 interface, you're actually using the MinIO product for the gateway interfaces on top of the non-S3 storage systems. So I'm wondering if you can talk a bit more about that, and some of the other technical specifics of the structured data lake, as far as the file formats or schema enforcement that you have in place?
34:08  Sean Knapp
Yeah. So the MinIO gateway has been super cool, and we're really big fans of the technology. What we essentially did was, every environment that we stand up has that gateway running. The classic MinIO approach is to map an S3-compatible interface to Azure Blob store, Google Cloud Storage, or a Hadoop file system, or a plethora of others, but really preserve the same file path structure, more or less. What we did that was really fun was we grabbed that and created a different handler inside of MinIO that, rather than mapping directly back to any particular blob store, first talks to our control plane. The idea behind this is, because we do have this advanced deduplication of data and jobs and tasks, not too dissimilar from how a network file system does block-level deduplication, you have to actually construct a virtual file system based off of the metadata itself. And so as you're listing out the data services and the dataflows and all the underlying data sets that you have access to, that MinIO gateway is actually talking to our control plane and saying, hey, what is the actual file path structure and system in place, and even listing the underlying partitions of a particular data set itself, maybe digging into all sorts of different directories and structures that are shared across a lot of the upper component-level model, but optimized for dedup. And so that communication model required a bunch of optimizations really tuned for performance and so on, but it gives you, in essence, a really consistent and elegant S3-compatible data lake experience. The interesting way that we expose that data today is that we give you the straight Snappy-compressed Parquet files that we actually pull into Spark and process and move around on our own, but we also give you the ability to stream out those records through an HTTP or HTTPS interface, in CSV or JSON or other formats, if you want to dynamically convert those to feed an application. So we've taken both the sort of low-level access of just raw Parquet files at the S3 interface, as well as a more application-level API to get the JSON or CSV or other formats as well.
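Since the structured data lake is exposed through an S3-compatible interface, a client can in principle read the Parquet outputs with standard S3 tooling. Below is a minimal sketch using boto3 and pyarrow against a placeholder endpoint, bucket, and credentials; the actual endpoint layout and object paths in an Ascend environment are assumptions here, not documented facts.

```python
# Minimal sketch of reading a dataflow's Parquet output through an
# S3-compatible endpoint. Endpoint, bucket, prefix, and credentials are
# placeholders; substitute the values from your own environment.
import io

import boto3
import pyarrow.parquet as pq

s3 = boto3.client(
    "s3",
    endpoint_url="https://lake.example-environment.com",  # S3-compatible gateway
    aws_access_key_id="EXAMPLE_KEY",
    aws_secret_access_key="EXAMPLE_SECRET",
)

# List the partition files that back one stage of a dataflow.
listing = s3.list_objects_v2(Bucket="example-dataflow", Prefix="daily_sessions/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Pull one Snappy-compressed Parquet partition and read it locally.
body = s3.get_object(Bucket="example-dataflow", Key="daily_sessions/part-00000.snappy.parquet")
table = pq.read_table(io.BytesIO(body["Body"].read()))
print(table.schema)
```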
36:45  Tobias Macey
And what have been some of the most interesting or challenging aspects of building and launching Ascend that you have dealt with?
36:53  Sean Knapp
You know, I think there have been a lot of really interesting and challenging aspects. One is obviously that there's a pretty broad range of use cases that people have for data pipelines. You get everything from people who say, I have one truly massive file and data set that I have to crunch through on this particular cadence, and then you see other folks who are literally generating hundreds of millions of files across the data set they're trying to push through Spark. And so the fun part as engineers is to actually take this plethora of different use cases and access models and boil it down into a couple of reusable patterns that we believe can fit multiple use cases at the same time, and to refactor the problem scope, if you will, into just a handful of really powerful capabilities that give people a sort of new layer and a new platform to build on top of.
37:54  Tobias Macey
And what are some of the most interesting or unexpected lessons that you've had to learn in the process, or edge cases that you've encountered while building Ascend, both from the technical and the business perspective?
38:07  Sean Knapp
Yeah, I'd say, from the edge cases on the technical perspective, we placed a lot of bets really early on in the Kubernetes domain, and that's paid off tremendously well. One of the things that we've had to work hard on is the varying levels of Kubernetes support across the cloud providers. We will really relish the day when we can just run on all of the hosted Kubernetes offerings that are super tuned and optimized, on spot or preemptible instances, as we'd love to get away from having to manage a lot of that. We're not quite there yet, and so having to build up a lot more expertise around managing the infrastructure side is, for example, one of those areas. On the non-technical side, or I guess the pseudo-technical side, there's been a really interesting investment on our side in how we build a really high-output, really highly skilled team on the product and engineering side, where we've created a different model than we've seen with a lot of other startups. We've gone not with the classic agile and Scrum methodology, where it tends to be team- and codebase-centric, but have shifted really successfully to a model that is much more agile based off of the projects themselves. We find that teams end up coming together and dispersing very quickly based off of the projects they're working on, really oriented towards some goal and some outcome of a new feature or a new capability that they're building. And one of the things that we found coming out of this is that the output, and what we're able to accomplish, is really remarkable by comparison. The teams find that they're in fewer meetings, they're moving faster, they're launching more features, and probably the coolest part is everybody gets to own big capabilities that they get to drive all the way through. And so we're finding, really across the board, that especially for a fast moving startup such as ours, that model of product management and software engineering has been far more fruitful and exciting for us.
40:27  Tobias Macey
And then in terms of the overall capabilities of the system, or the business success that you've achieved so far, what are some of the elements that you're most proud of? And in terms of feature sets or capabilities, are there any that have gained the greatest level of adoption?
40:43  Sean Knapp
Yeah, not too surprisingly, the last two capabilities we've announced have been really, really popular. The first is queryable dataflows. We were actually fairly quiet when we launched that; we launched it to see if folks would notice, and not too many folks really discovered it inside of the product. But once we actually started to socialize it with our users, we've been shocked at how much they're using this queryable dataflow approach. It just makes it so much faster to build pipelines, so we've been really happy to see the metrics off of that. And then the structured data lake has been really awesome. Honestly, it makes it so much easier to connect into so many other parts of the ecosystem, and we're super stoked about that. And then I'd say the last part that's been really fun to watch is that we just started to open up a lot more tutorials and how-tos on the product, a bunch natively integrated into the product and a bunch on our dev portal, and we're seeing people really gravitate towards using those as a way of testing out the platform and self-serving on it. That's been really cool to watch.
41:56  Tobias Macey
And in terms of the current state of the system, what are some of the sharp edges that remain? And when is Ascend the wrong choice for a given project?
42:08  Sean Knapp
Yeah, that's a great and super fair question. The easiest one that we see a lot of the time is, if your data volume is either moving or evolving slowly enough, or, frankly, could fit inside of a data warehouse, I'd honestly recommend keeping your data there. Building pipelines is harder and more challenging; we're working super hard to make it way easier to build pipelines, but we see a lot of folks whose data can and should fit inside of a warehouse, and you should do it. Then I'd say there comes a point in time where it really makes sense to build pipelines, when you're doing more advanced or sophisticated things with your data or the volumes are getting larger. And, you know, we oftentimes find folks who have massive data sets that definitely don't belong inside of their data warehouse, and things are either crumbling or they're spending insane amounts of money on it. That's probably usually six to twelve months later than they should have started moving to pipelines, but it becomes a really good time to change gears and really look at whether it's Ascend or another technology, and get it out of the warehouse.
43:15  Tobias Macey
And then looking forward, what do you have planned for the near to medium term future of Ascend, and are there any trends that you see in the industry that give you inspiration for the long term future of the platform?
43:28  Sean Knapp
Yeah, there are a couple of trends that we see that I think are going to be really impactful. One is obviously the move to multi-cloud. We see a lot of interest across the ecosystem in being able to run not just in one cloud, but across all of them, in ways that treat each cloud region or zone really as a micro data center with localized compute and storage, with intelligent movement of data based off of dependencies. And so we see that as a really interesting domain. The other area that we see as really interesting is, you know, we use Spark heavily, but there's a ton of other technologies out there that open up a whole world of other use cases that are really interesting for us, everything from TensorFlow, to Kafka, to Druid. All of these open up different use cases and different capabilities, as do many other technologies that are out there. And this is part of why we architected and designed Ascend this way, really as a control plane. We have the ability to orchestrate across different storage layers and across different processing layers, and to analyze and monitor access patterns of how users are tapping into data, and as a result we can do really optimal things that you ordinarily couldn't do if you're just pulling data off of disk. So doing things like intelligently building profiles off of how you're acquiring your data, and perhaps moving it into Druid, or moving it into a more memory-backed storage layer, are things that we can really start to do, extending that underlying compute infrastructure out to the appropriate tools and technologies dynamically. Those are both areas that we're really excited about.
45:26  Tobias Macey
And are there any other aspects of the work that you're doing at Ascend, or other aspects of the idea of declarative data pipelines, or anything along those lines that we didn't discuss yet that you'd like to cover before we close out the show?
45:41  Sean Knapp
I'd say the thing that we're really observing, as we start to look into 2020 and see a lot of the big efforts and initiatives that we're all putting into data engineering, is the shift from more imperative-based systems to declarative systems, which we've seen across a bunch of other technology landscapes. And what we're finding is, for most teams and most companies, as we get more and more use cases, more and more data, and more and more people contributing to those data pipelines and that data network inside of our various companies, we just get an exponential increase in complexity. And what has actually alleviated the pain, and the late night or early morning pages that wake us up when we really don't want to be working, is battling that complexity. What we've seen is that the shift to these declarative-based models allows us to write cleaner, simpler, more compact code and more elegant systems as a result. And I think we're going to see that become increasingly important as, frankly, data engineering gets more and more popular.
46:57  Tobias Macey
Well, for anybody who wants to follow along with the work you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
47:12  Sean Knapp
I'd say the biggest gap, at the end of the day, is being able to answer the question, or really a couple of questions: what data do I have, how was it generated, and where did it come from? If we can't answer those questions, it's really hard to build systems that are more automated; it all then falls back on us as engineers. We have to be able to build technology that can answer those questions.
47:38  Tobias Macey
Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Ascend. It definitely looks like a very interesting platform that's solving a lot of the painful pieces of data management in the current landscape, so thank you for all of your efforts on that front, and I hope you enjoy the rest of your day.
47:56  Sean Knapp
Thanks so much, you too.
48:04  Tobias Macey
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.