Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

episode 84: Managing The Machine Learning Lifecycle [transcript]


Summary

Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Hydrosphere is and share its origin story?
  • In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
    • How does it differ from deployment and maintenance of a regular software application?
  • Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
  • For someone who is using Hydrosphere in their production workflow, what would that look like?
    • What is the difference in interaction with Hydrosphere for different roles within a data team?
  • What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
    • Which metrics do you track for testing and verifying the health of the data?
  • What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
  • How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
    • How has that influenced the design and direction of Hydrosphere, both as a project and a business?
    • How has the design of Hydrosphere evolved since you first began working on it?
  • What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
  • What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
  • What do you have in store for the future of Hydrosphere?
Contact Info
  • LinkedIn
  • spushkarev on GitHub
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Hydrosphere
    • GitHub
  • Data Engineering Podcast at ODSC
  • KD Nuggets
    • Big Data Science: Expectation vs. Reality
  • The Open Data Science Conference
  • Scala
  • InfluxDB
  • RocksDB
  • Docker
  • Kubernetes
  • Akka
  • Python Pickle
  • Protocol Buffers
  • Kubeflow
  • MLFlow
  • TensorFlow Extended
  • Kubeflow Pipelines
  • Argo
  • Airflow
    • Podcast.__init__ Interview
  • Envoy
  • Istio
  • DVC
    • Podcast.__init__ Interview
  • Generative Adversarial Networks

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-06-10  1h2m
 
 
00:11
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and one opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances, where you get to take advantage of their blazing fast compute units. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page, and Data Engineering Podcast listeners get two months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced, and super early bird registration is available until July 26th for up to $300 off, or you can get the early bird pricing until August 30th for $200 off your ticket. Use the code BNLLC to get an additional 10% off any pass when you register, and go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register for this and other events. And you can go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and to help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Your host is Tobias Macey, and today I'm interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for data science and machine learning management automation. So Stepan, could you start by introducing yourself?
02:59
Stepan Pushkarev: Hey, Tobias. Thanks for the intro. I'm Stepan Pushkarev, CTO of Hydrosphere, a machine learning management platform. My personal background is in data engineering and backend engineering, and I've spent the last couple of years working closely with machine learning engineers and delivering this stuff to production and to life. So that's my background.
03:22
Tobias Macey: And do you remember how you first got involved in the area of data management?
03:26
Stepan Pushkarev: I don't remember exactly. It was probably the earlier versions of Spark, back around 2016, or maybe even 2014, I don't remember exactly. It came out of the older software engineering world, especially designing and building distributed systems. The major things we were working on were databases and data integrations, and it was a smooth transition from classical software development to so-called big data software development. It's not a buzzword anymore, but in the early Hadoop and Spark days it was very cool to deal with.
04:17
Tobias Macey: And so for anybody who hasn't listened to it yet, you and I talked a little bit about your work at Hydrosphere, I think it was two years ago now, at the Open Data Science Conference, and I'll add a link to our conversation there in the show notes. But I'm wondering if you can just start by giving an overview of what Hydrosphere is, some of the origin story of how it got started, and your motivation for getting involved with it.
04:42
Stepan Pushkarev: Yeah, sure. So back in 2016 I wrote an interesting blog post on KDnuggets. The topic was data science: expectations versus reality, and it was kind of a manifesto for what we'd been working on. It was very high level, just some notes saying, hey guys, the community keeps talking about all the benefits of machine learning and analytics, but what is the reality? There were a few takeaways from that blog post. The first one was tools evolution: existing tools in machine learning are great, but not stable and user friendly, and we will certainly see a rise of new tools that will augment the machine learning engineer and the data engineer in the future. The second takeaway was education and cross skills: when data scientists write code, they need to think not just about abstractions, but consider the practical issues of what is possible and what is reasonable. And the third takeaway was to improve the process, so DevOps might be the solution. In terms of the machine learning lifecycle and machine learning workflow, it was a very long-winded definition, and I was trying to play with this, maybe coin some name and define the value proposition, and get some feedback from the community. It may seem very obvious to people nowadays, because the community has grown and the major players have evangelized a lot of cool and good stuff, but three years ago it was kind of new, and we were looking for a community, for traction, and for feedback. The reason I was sharing those notes and thoughts is that we had been working on some cool projects on a consulting basis. Our parent company, which I was a part of, is a machine learning consultancy and solutions provider, and we had to deliver this stuff to production for our clients and make it work 24/7. Since we were contractually obligated to deliver to production, we had to invent all of this for ourselves and automate our own process to decrease our overhead and automate our routine work. So we started building some DevOps-like tools, not just continuous delivery and continuous integration, but smaller things that would help us move faster, close the gap between training and retraining, and make things more user friendly for our internal users, for data scientists and machine learning engineers. And from that very broad idea of, hey, let's do something cool and automate our own process, we decided it might make sense to find a new niche, open source this project, start a website and a GitHub repo, and start talking to people and getting some traction. So that was the story. We evolved, and after that ODSC in Boston, where we met the first time, we got some good traction; it was kind of a test of the waters, our very first public conference that we participated in, and we got some good feedback. 
And since that time, this space has even been categorized by Gartner. That means it's already here: enterprises, big and small companies, should take a look and consider model management in their day-to-day operations.
09:36
Tobias Macey: In your experience of building and maintaining Hydrosphere and working on consulting for people who are getting involved in machine learning, I'm wondering what you have found to be some of the most challenging or complicated aspects of managing those model deployments and managing machine learning in a production environment, and if you can provide a bit of comparison and contrast to how that relates to the more traditional software lifecycle of deploying and managing production environments for software applications.
10:09
Stepan Pushkarev: Yeah, sure. I would rate the first and probably most challenging part as dealing with exceptions, edge cases, and basically the blank areas of the model. Most data science teams are not there yet. There is something else I need to mention that exists in the software engineering world as well as in the machine learning world: dealing with state, with the model state. For instance, if your model stores its state, it will be really challenging to scale it horizontally, and so on, so that's another issue; as long as you do not control this, you cannot make it really scalable. Dealing with API integration is also an inheritance from the software engineering world, but slightly different in machine learning. Even things like an array-based API versus named parameters seem small, but they make integrations really hard when a data scientist exposes an API of 200 features as just an array. You have to remember the exact position of a particular field to pass the parameters in, and getting the same kind of array back as an output is kind of useless for a software engineer who expects a nice JSON API with named features and named parameters. So it's partly cultural, but usually it creates a lot of chaos in the integration pipeline. But, as I mentioned, probably the first and major challenge at this moment is iterating with the model and dealing with the new concepts and new exceptions that you face in production. I'd like to mention an example we've been working on during the last week, a nice demo for a manufacturing conference, done on a consulting basis, where we provide safety control based on computer vision for manufacturing sites and construction sites. We check that you are wearing a hard hat and gloves, and so on, and we basically let you in or deny access. While working on this, there are a lot of open source and public datasets to train with, plus ready, off-the-shelf machine learning models, so within a couple of days you can hack together a pretty decent prototype that can recognize a person in a hard hat and a person without a hard hat; it can even authenticate that person, and so on. But obviously, when we started testing this model, it was really easy to fool it. For instance, if you just take off your hard hat and place it on your shoulder, the model still recognizes it as safe access, and there are many other tricks you can pull to fool the model. So when we started iterating, we kept adding these edge cases, and even when you test within a group of five people, it's really challenging to track all those edge cases, aggregate them into a new retraining batch, and basically automate this process so you can iterate quickly. Obviously, deploying on Hydrosphere, our own tool, made our process much easier; even with this small group we were able to make the model reliable and stable relatively quickly. 
And obviously, when we deploy the model to production, there will be many more edge cases that we will automatically gather and incorporate into our training, retraining, and testing pipeline, so we will not miss them in the future.
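To make the idea of aggregating production edge cases into a retraining batch concrete, here is a minimal Python sketch. The class and function names, and the rule of flagging low-confidence predictions, are hypothetical illustrations for this conversation, not anything from Hydrosphere's API.

```python
# Hypothetical sketch: collect low-confidence production requests into a retraining batch.
# None of these names come from Hydrosphere; they only illustrate the workflow described above.
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class Prediction:
    request: dict          # raw model input (e.g. image metadata or features)
    label: str             # predicted class, e.g. "hard_hat" / "no_hard_hat"
    confidence: float      # model confidence for the predicted class

@dataclass
class RetrainingBatch:
    samples: List[dict] = field(default_factory=list)

    def add(self, prediction: Prediction) -> None:
        # Keep the raw input so it can be re-labeled and added to the training set.
        self.samples.append({"input": prediction.request, "predicted": prediction.label})

    def export(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(self.samples, f)

def collect_edge_cases(stream, threshold: float = 0.6) -> RetrainingBatch:
    """Flag predictions the model is unsure about as candidate edge cases."""
    batch = RetrainingBatch()
    for prediction in stream:
        if prediction.confidence < threshold:
            batch.add(prediction)
    return batch
```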
14:59
Tobias Macey: so for anybody who's not familiar with Hydrosphere in particular, can you just outline the overall design and architecture of the platform and talk through the different components and how they fit together?
15:11
Stepan Pushkarev: Sure, sure. And by the way, we're all open source; you can check it out on GitHub and dig in a little. There is nice documentation — it's user documentation, but you can dig into the source code as well. We are primarily Scala guys, Scala and JVM guys, so it's Scala and Akka microservices, plus a couple of databases for different purposes: Postgres for relational data, InfluxDB where we store metrics, and RocksDB where we store the inputs and outputs for each model in a raw format, which is the same store the Kafka Streams folks use on the backend. Then Docker, obviously, and Kubernetes for deployment and orchestration, a couple of Python microservices, a CLI, and a Python runtime for machine learning models. We also use streams extensively. We switched from Kafka Streams to Akka Streams, just because we need dynamic handling of our models, which can be added and removed at runtime, and so on, and Kafka topics are not designed to deal with that type of environment and that type of use case. We do still use topics and consumers, but in general those are the main technologies for us.
17:01
Tobias Macey: And as far as the different pieces of Hydrosphere, I know that you've got the serving layer, and then you've got the metrics layer for being able to determine the overall health of the model in production. So I'm curious if you can just talk through what the different concerns are for those pieces, how they tie together, what the overall workflow would look like for somebody actually building, deploying, and managing their model in production, and how it simplifies the overall process versus what a homegrown solution might look like.
17:37
Stepan Pushkarev: Yeah, everything starts with model cataloging and model deployment. Once the model has been built, we basically hook into the training output, into the binaries produced by the training pipeline. It might be a pickle file, it might be a protobuf file for TensorFlow, or other formats like Keras or PyTorch. We hook into that pipeline, upload the binaries to our service, extract all the metadata out of those binary formats, and generate additional metadata. By hooking into the training pipeline we capture the training hyperparameters, the characteristics of the training data, its distribution, its underlying statistics, and a whole bunch of other metadata that might be useful for downstream applications, and for tracing the request and the root cause of any prediction. So we catalog the models, we build Docker containers, and we store the Docker images in a Docker registry. For us, machine learning models are immutable Docker images that cannot be modified, and this is how we do versioning: our versions are not just an index or a number in a database, they are physically packaged machine learning models with all the dependencies and all the metadata. The next step is serving. By serving we mean the deployment of a Docker container into a runtime environment. You can deploy the model with different types of runtimes, for instance different versions of TensorFlow or other libraries, or just tweak some runtime parameters to improve latency. That's a typical microservices architecture: we launch and orchestrate these Docker containers using the Kubernetes API or the Amazon ECS API.
It's a well integrated and well orchestrated platform, but we do not reinvent the wheel here; it's just classical, well implemented microservices architecture. The second major part of our platform we call Sonar. Sonar is also a couple of microservices that shadow, or mirror, the traffic from the main prediction pipeline and do all the magic with analysis of inputs and outputs, analysis of the model health, and monitoring the models. Basically, we monitor almost everything related to the machine learning model, and it happens kind of automatically: we generate a whole bunch of statistics and metrics, and we have user-defined metrics that you can create, where you can monitor a particular output — for instance the confidence of a machine learning model — and attach an alert for when the confidence of your predictions falls below a particular threshold. The Sonar part is all based on streaming: we calculate all the distributions, histograms, and data profiles on the fly. This may not be explicitly required by the use cases — the model may not degrade that quickly, so you don't strictly need real-time analysis here — but this design decision was made because streaming is much more convenient from the operational standpoint. You fail fast: if you cannot process data, you fail fast, you can easily recover, and you can easily replay what you have not processed. It's the concept of fast data versus big data: if you do not process data when it arrives, you will likely never process it later. And there is a lot going on in that Sonar part. There are some sophisticated machine learning models, like generative adversarial networks and some variations of autoencoders, that are designed to profile and monitor your production traffic for anomalies and edge cases, and basically provide insights for your further iterations of the machine learning lifecycle, so we can subsample data and retrain machine learning models easily.
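As a rough illustration of the kind of user-defined health metric Sonar is described as supporting — alerting when prediction confidence drops below a threshold — here is a small Python sketch. The names and the rolling-window approach are assumptions made for the example, not Hydrosphere's actual API.

```python
# Hypothetical sketch of a streaming, user-defined health metric: alert when the
# rolling average prediction confidence drops below a threshold. Names are illustrative only.
from collections import deque

class ConfidenceMonitor:
    def __init__(self, window_size: int = 100, threshold: float = 0.7):
        self.window = deque(maxlen=window_size)   # last N observed confidences
        self.threshold = threshold

    def observe(self, confidence: float) -> None:
        self.window.append(confidence)

    def healthy(self) -> bool:
        if not self.window:
            return True
        return sum(self.window) / len(self.window) >= self.threshold

# Usage: feed each shadowed prediction into the monitor and raise an alert on degradation.
monitor = ConfidenceMonitor(window_size=200, threshold=0.8)
for confidence in [0.95, 0.91, 0.42, 0.38]:       # stand-in for a stream of model outputs
    monitor.observe(confidence)
    if not monitor.healthy():
        print("ALERT: average prediction confidence below threshold")
```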
23:51
Tobias Macey: And I like what you're saying about fast data versus big data when it comes to managing the metrics for the models, because I imagine that, particularly depending on the problem domain that you're working in, some of those alerts might be very time sensitive in terms of knowing when you need to adjust the parameters for the model, or maybe roll it back or revert it, or rerun the training process to ensure that you can correct for some of that model drift. And so I'm wondering if you can dig a bit more into how you collect some of those metrics, and the types of information that you're looking at for signals to let you know when you need to take some sort of corrective action for the models that you're managing, and that are currently being served and providing information back to end users.
24:45
Stepan Pushkarev: Yep. Of course, there are different use cases. In some of them you may not even have data drift at all — for instance, classical collaborative filtering for recommendation systems, where training happens in batch mode right before the prediction job, so that's not the case there. There are cases where there is drift, but it's a slow drift: your users' behavior changes over time, you recalculate some metrics overnight, and that's also okay. But if you take user experience into consideration, when you deploy the model you need to watch the model health over the next ten or fifteen minutes, just as a sanity check: okay, it's working, it's working well. So we basically need real-time metrics. As I mentioned, we analyze it in Akka Streams, and we trace the whole route of the prediction request through the models — there might be more than one model in the pipeline — and we tag each request with the necessary metadata to be stored. We store all the requests and all the predictions for auditability and discoverability purposes. For instance, if you have a high-level metric saying, hey, some data distribution has changed, as a machine learning engineer you will need to dig deeper, down to a particular request, or sample of requests, that caused this data drift. In a very simple toy demo that we run, if you train your machine learning model on the classical MNIST dataset and all of a sudden you start sending letters instead of digits, you need to get an alert, click on this alert, drill down into the detail, see that something is going wrong, and see a particular request that caused the alert — or the alert might be caused not by a single request but by a thousand requests — and you need a responsive user experience, so it needs to be done in near real time. And then, once you have that particular prediction, you need the chance to trace it back to the machine learning model version, to the prediction and training pipeline, and to the original dataset that the model was trained with, again for auditability and for the ability to make a decision about retraining, reconfiguration, or maybe hyperparameter tuning. So basically each request carries full information about the dataset the model was trained with, the hyperparameters it was trained with, the deployment configuration, and all the metadata that might be relevant for this particular prediction. That's kind of the main value proposition for the end user that we're proud of.
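To give a sense of what storing lineage with every prediction could look like, here is a hedged Python sketch of such a record; the field names and structure are illustrative assumptions, not Hydrosphere's schema.

```python
# Hypothetical sketch: attach model and training lineage to every stored prediction so it
# can be traced back for audit or retraining decisions. Field names are illustrative only.
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class PredictionTrace:
    request_id: str
    model_name: str
    model_version: int          # points at an immutable, packaged model image
    training_dataset: str       # e.g. a query or snapshot id identifying the training data
    hyperparameters: dict
    inputs: dict
    outputs: dict
    timestamp: float

def trace_prediction(model_meta: dict, inputs: dict, outputs: dict) -> dict:
    record = PredictionTrace(
        request_id=str(uuid.uuid4()),
        model_name=model_meta["name"],
        model_version=model_meta["version"],
        training_dataset=model_meta["training_dataset"],
        hyperparameters=model_meta["hyperparameters"],
        inputs=inputs,
        outputs=outputs,
        timestamp=time.time(),
    )
    return asdict(record)   # ready to be written to a raw store (e.g. RocksDB or S3)
```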
28:57
Tobias Macey: And you highlighted a couple of things there that can contribute to overall model drift or model degradation, in terms of changes in the usage patterns of end users or changes to some of the input data. But I'm wondering if you can talk through, in general, some of the factors that contribute to that sort of model drift, any type of contextual information that your monitoring and alerting can feed back to the data scientists or data engineers or machine learning engineers to understand what alterations to make to the training process to correct for it, and some of the overall contextual knowledge that's necessary to engineer in resistance, or just tighten the feedback loop, for keeping those models in proper working order and ensuring that they're doing what they're intended to do.
29:58
Stepan Pushkarev: The first and most frequent case is just not enough training data, or a poorly architected retraining pipeline. Usually data scientists are provided with a dataset that they play with, and that's it: they iterate in their laboratory environment, and then there is training/serving data skew. On the one hand it's one of the simplest reasons — it's not degradation, it's just an organizational and architectural reason for having a machine learning model deployed that was trained on a slightly different dataset than what you see in production. It's actually an interesting case, because it demonstrates that machine learning models and the machine learning environment in production need to be designed to be tolerant of this from the very beginning. Another typical reason for inconsistency between training and serving data is the enterprise-wide environment, when you deploy a particular model and you have more than one consumer of it — more than one department or more than one team using it. Some of the teams may misinterpret the model API and start using the model for a different use case. That's also not concept drift; it's more organizational, a use case drift. As the owner of the service, you need to be notified that your end user is trying to classify something else, or starting to extract information from a very different kind of text, from a domain very different from the one the model was built for, for instance. And then there are use cases with genuinely fast-changing environments — in IoT, the deployment of new types of sensors or new types of locations — where this gets really big. Actually, most computer vision and text applications require this type of iterative approach for discovering new and newer concepts. It's not drift, it's expected behavior: discovering new concepts is just a part of your system. That's probably it from my side.
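One common way to surface the training/serving skew described here is to compare the distribution of a feature in production against its training distribution, for instance with the population stability index. The sketch below illustrates that general technique; it is an assumption for the example, not a description of Hydrosphere's internals.

```python
# Hypothetical sketch: detect training/serving skew for a single numeric feature using the
# population stability index (PSI). This illustrates the general idea, not Hydrosphere's method.
import numpy as np

def psi(train_values: np.ndarray, serve_values: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the training distribution so both samples share the same grid.
    edges = np.histogram_bin_edges(train_values, bins=bins)
    train_counts, _ = np.histogram(train_values, bins=edges)
    serve_counts, _ = np.histogram(serve_values, bins=edges)
    eps = 1e-6  # avoid division by zero / log of zero for empty bins
    train_pct = train_counts / max(train_counts.sum(), 1) + eps
    serve_pct = serve_counts / max(serve_counts.sum(), 1) + eps
    return float(np.sum((serve_pct - train_pct) * np.log(serve_pct / train_pct)))

rng = np.random.default_rng(0)
training = rng.normal(0.0, 1.0, 10_000)     # distribution the model was trained on
production = rng.normal(0.5, 1.2, 10_000)   # shifted distribution seen in production
score = psi(training, production)
print(f"PSI = {score:.3f}")  # rule of thumb: values above ~0.2 are often treated as significant drift
```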
33:33
Tobias Macey: And another thing that I was interested in when you were discussing the metrics collection is what you were saying about having it be useful for revisiting the overall path and trace of the data flow through the model — understanding what the inputs and outputs were so you can retroactively evaluate what the decision-making process was, for cases such as GDPR, where you need to be able to say why a particular decision was reached in a machine learning context. So I'm wondering how the Sonar component fits into that type of regulatory environment, or some of the other interesting insights that can be surfaced by going through that information.
34:27
Stepan Pushkarev: Yeah, this is kind of a must-have feature for all enterprises: you need to store all the predictions, and you need to be able to trace back to the origins, to the exact dataset that was used to train the model. I'm not sure about GDPR — I'm not super familiar with that particular requirement of GDPR — but it does seem to be a must-have requirement for any enterprise, and not even just enterprises: it's a requirement for any type of organization that is trying to deploy machine learning models to production. One thing is just to store requests and responses and save them somewhere on S3. Another thing is to make it really useful for users — for business users and for machine learning engineers — either for discovery purposes or for audit purposes. Basically, you need to store it in a queryable format, so it can be visualized and you can gather statistics out of it. And if you think about this further, it becomes your ground-truth feature store, with all the discoverability for features and all that nice stuff, and it eventually becomes your main data repository within the enterprise. It kind of flips the paradigm where the training pipeline is separated from the rest of production — where it's just a query to your database, and data scientists and machine learning engineers play somewhere on the side with the data, train the models, and push them to production. When you start blurring the line between that offline environment for machine learning experimentation and the online environment, you see that the online, always up-to-date, always fresh feature store, built right from the production traffic and well discoverable and well indexed, is kind of the main asset you have in the enterprise. It's an interesting observation that I see in more and more organizations, so we'll see where it goes.
37:34
Tobias Macey: And as far as the level of sophistication of different organizations and the available tooling for managing machine learning projects in production, I'm wondering if you can give some background about where things were when you first started work on Hydrosphere, how the overall ecosystem has evolved since then, and how that has impacted your overall approach to the design and development of Hydrosphere.
38:07
Stepan Pushkarev: So yeah, originally, as I mentioned, I had been writing blog posts, trying to coin names and explain to different types of people that this is something that is going to be, probably not the next big thing, but definitely a niche. Then last year Google published a nice paper about the missing gap in the machine learning ecosystem, and other great tools like Kubeflow, TFX (TensorFlow Extended), MLflow, and many others emerged. This really helps to explain to people the why — especially the why aspect of the things we're building — so it definitely accelerates the education and the community growth. You mentioned Kubeflow and MLflow; those tools have similar names but slightly different purposes, and in my opinion they really add value to the ecosystem. MLflow is, for me, the more mature project — driven by Databricks and now Microsoft — and it has more value-add contribution for the end user. For those who are not familiar, MLflow is a tool for experiment tracking; that's one major part of the tool. You can track all your runs, all the experiments, all the metrics, all the outputs and inputs, your hyperparameters and other things, in a collaborative way, so you can share it with your manager and your team. Also, very interestingly, you can use it for auditability of your training pipelines, so you can prove at any time what data your models have been trained with, and so on. It's Python based, it's easy to start and easy to play with, and it has a nice, very precise use case. Kubeflow is a bit wider. The project started as a way to deploy TensorFlow on Kubernetes; obviously it's driving users into the Kubernetes ecosystem and trying to own that territory of machine learning on Kubernetes. I think the main value-add product they have released recently is Kubeflow Pipelines. It's basically an abstraction, a Python DSL, on top of Argo pipelines, which is a Kubernetes-native GitOps tool. Those are the buzzwords, but it's all the new stuff that the community is learning now. In a very simple way, on a high level, you can think of it as a new Jenkins — a Kubernetes-native Jenkins — for defining continuous delivery pipelines and deployment pipelines. I won't go into detail; of course, some DevOps people, infrastructure people, and containers folks will disagree with me, but
there is something to discuss there. Kubeflow Pipelines, that Python DSL on top of Argo, is still not as mature as, for instance, Jenkins or Airflow, and it still doesn't add much particular value in the machine learning space. It's supposed to have machine learning specifics, but it has little of that at this moment. Right now you can visualize your steps' outputs in a way that is well received in the data science community — for instance, you can visualize precision/recall, you can visualize a confusion matrix for a particular training step and have it as an output, and you can have TensorBoard attached to the logs of your training pipeline — so there is nice integration with those tools. But still, experienced users who have worked with Airflow-type tools or Jenkins-type CI/CD tools will find it a very, very young project. If you look at their roadmap, they will be adding really cool features that are machine learning specific, so stay tuned, but at this moment I would say it's more of an experimentation phase: you will write more code, and you still write a lot of boilerplate to define Kubeflow pipelines, pass the parameters in, and parse the outputs of each pipeline step. Obviously, in the Kubeflow and Kubernetes world each step is a separate Docker container, and defining these steps carries a usability overhead right now. But overall I think it forces users to do things the right way, and by the time the Kubeflow community adds a more user-friendly DSL, more user-friendly abstractions, and more features, it will become more useful. And yep, we are really looking at this evolving ecosystem and building integrations with Kubeflow and MLflow, and also with some proprietary cloud services — Azure and AWS services like SageMaker, et cetera. So this ecosystem is of course evolving; we're part of it and trying to stay as close to this community as possible.
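For readers unfamiliar with the experiment tracking Stepan describes, a minimal MLflow run looks roughly like the following; the experiment name, parameters, and metric values are invented for the example.

```python
# Minimal MLflow experiment-tracking sketch. The experiment name, parameters, and
# metric values here are made up; only the mlflow calls themselves are real API.
import mlflow

mlflow.set_experiment("hard-hat-detector")      # hypothetical experiment name

with mlflow.start_run():
    # Record the hyperparameters this run was trained with.
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 20)

    # ... training happens here ...

    # Record the resulting metrics so runs can be compared and audited later.
    mlflow.log_metric("precision", 0.91)
    mlflow.log_metric("recall", 0.87)

    # Attach the serialized model (or any file) as an artifact of the run.
    # mlflow.log_artifact("model.pkl")
```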
45:34
Tobias Macey: And as the availability of different tooling and the overall level of understanding and sophistication of practitioners in the field has grown over the past few years, I'm wondering how that has influenced the overall design and architecture of Hydrosphere, some of the types of tooling or platforms that are often used in conjunction with Hydrosphere, and how that might break down across the different roles on a data team, as far as data engineer versus data scientist or machine learning engineer.
46:06
Stepan Pushkarev: Yes, so our own architecture has not changed; on a high level it has not changed from the very beginning. As I mentioned, we replaced Kafka Streams with Akka Streams. We had been using Envoy and Istio heavily for traffic management and traffic routing between models and other microservices, and we recently switched to our own implementation, with our own additions to that process, just because Envoy is not designed to handle machine learning-specific routing. For instance, you need A/B testing — we call it response-based A/B testing — where you route 100% of the traffic to model A and 100% of the traffic to model B, and then have a kind of controller that decides which response, A or B, to return back to the user. So it's response-based routing, plus some other interesting features that are really required in machine learning, and we switched to our own implementation of that. We even started contributing to Envoy, but then we saw that it is too far from what we need, and we do not have any influence in that community, so it was easier for us to just build the features ourselves. In terms of other tools and ecosystems, things are of course evolving. The cloud providers are part of the innovation: Azure is doing a great job on simplifying the deployment and serving of machine learning models, and they also have some monitoring features that might be comparable with what we offer. To some extent SageMaker is a bit behind on the monitoring side; to the best of my knowledge it's not on their roadmap at this moment, but the cloud providers are innovators. In the open source ecosystem, tools for dataset versioning and lineage have become quite popular, and there are interesting discussions in the community. We had a discussion about integrating with DVC and similar dataset versioning tools, but then we decided that if you architect your data platform in the right way, so that you have immutable datasets, then versioning a dataset can be done just by having immutability in your data: a query against an immutable dataset should always return the same result over time, so that query is effectively a version of your dataset. It can be done without any additional tooling, and we use this approach extensively in our own projects and implementations. But it's also an interesting topic to discuss how to version your datasets. The takeaway is: yes, there are Git-like tools for versioning datasets, but first of all, you can do it without a special tool,
just by
proper architecture and design of your datasets and databases,
and
by having your training datasets immutable by design. And if you still need it, there are great tools that you can utilize, which capture all that interesting metadata and store it alongside
the dataset.
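To make the "a query against immutable data is itself a version" idea concrete, here is a small sketch under assumed names (an append-only events table with an ingestion timestamp). It shows one way to apply the principle, not Hydrosphere's or DVC's implementation.

```python
# Hypothetical sketch: version a training dataset as a deterministic query over an
# append-only (immutable) table. The table and column names are made up for illustration.
import hashlib
import sqlite3

def training_set_query(as_of: str) -> str:
    # Rows are never updated or deleted, only appended with an ingestion timestamp,
    # so filtering on a fixed cutoff always returns exactly the same rows.
    return (
        "SELECT features, label FROM events "
        f"WHERE ingested_at <= '{as_of}' "
        "ORDER BY event_id"
    )

def dataset_version(as_of: str) -> str:
    # The query text (cutoff included) uniquely identifies the dataset, so its hash
    # can be stored as the training-dataset reference in the model's metadata.
    return hashlib.sha256(training_set_query(as_of).encode()).hexdigest()[:12]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, ingested_at TEXT, features TEXT, label TEXT)")
conn.execute("INSERT INTO events VALUES (1, '2019-05-01', '{...}', 'hard_hat')")
rows = conn.execute(training_set_query(as_of="2019-06-01")).fetchall()
print(dataset_version(as_of="2019-06-01"), len(rows))
```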
50:57
Tobias Macey: And when you first began working on Hydrosphere and got further into the problem domain, I'm wondering if you could talk through what your original assumptions were going into the project, and how they have been challenged and updated in the process of building and growing the platform and getting more involved in the machine learning ecosystem.
51:18
Stepan Pushkarev: Yeah, of course. As I mentioned, from the very beginning our design and our main value proposition were based on our own experience. It wasn't purely academic research; it was real-life integration work. But the major thing for us is feedback from users. For instance, if we talk to customers and they do not select our platform for their business for a particular reason — we're missing some features or missing some integrations — that basically drives the development, the product, and the marketing efforts. What we've learned is that auditability and observability of machine learning predictions, and auditability of machine learning pipelines, including the training pipeline and the serving pipeline, are kind of a must-have. That is a feature we hadn't thought about from the very beginning, but we added it. Also, the technology keeps evolving — explainability of machine learning models, for instance; there are new tools and new research, and when we started, generative adversarial networks weren't really a thing in this space. We're basically picking up the latest and greatest from the research and applying whatever fits our main goal and main value proposition. For instance, we have monitoring models that watch for data drift, and they are black-box neural networks, so obviously it's challenging to explain why a particular request has been highlighted as an anomaly or an edge case. We added an explainability feature so that we explain our own predictions, our own judgments, and can of course explain it to the end user: hey, you have concept drift, and this is the reason — this particular feature has drifted in that direction. So that's also cool, and it really helps with our main mission and vision, which is to provide an instant feedback loop for data scientists and machine learning engineers so they can iterate quickly, not just in research, but in production.
54:36
Tobias Macey: And in terms of the future of Hydrosphere, and your experience in the space so far, I'm wondering what you have planned for the platform, and anything that you're keeping a close eye on in terms of ongoing developments within the community and the overall ecosystem.
54:58
Stepan Pushkarev: Yeah, so our main goal for the next quarter is integrations, integrations, integrations. We are a very feature-rich platform now, with a lot of product features. What we're missing currently is integrations: with the cloud providers, with the OpenShift ecosystem, with Kubeflow, and more. That includes not just the development of those integrations; it also includes documentation, tutorials, and blog posts for them. So that's our main go-to-market strategy at this moment.
55:49
Tobias Macey: And are there any other aspects of the Hydrosphere platform, or the overall concerns of managing machine learning projects in production and tracking and maintaining their overall health, that we didn't discuss yet that you'd like to cover before we close out the show?
56:06
Stepan Pushkarev: I think we have covered pretty much everything that I had in mind. We could talk about other topics, but I'm not sure they would be relevant for this show. I have a lot of thoughts on streaming, streaming versus batch, and all that ecosystem and Kafka stuff that's not really relevant to machine learning management, but I would be happy to elaborate on that as well.
56:37
Tobias Macey: Okay. Yeah, I mean, that sounds like it could probably be a whole other episode on its own, so we'll have to have you back on to dig deep into that whole problem space and your thoughts there. So for this topic, I suppose we'll call it a win. And for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
57:11
Stepan Pushkarev: Well, it's a pretty wide question. I'd say everything that is related to production. We have pretty decent tooling for research, but even there, when you train your machine learning model and it's being trained overnight, or for a couple of hours, you want something for monitoring training performance — a TensorBoard-like interface, but with more precise and more metrics-rich information, which might even make decisions based on your training performance and maybe stop or adjust the training. That's a big topic, and there are cool projects and cool vendors also looking in that direction. So all that meta-learning and automated machine learning with meta-learning is kind of cool stuff. Even when we train our machine learning models, we usually try to do it in an automatic way; for instance, when you deploy your main model, we train some complementary models that monitor your main model. And it's actually very challenging to do it in a fully autonomous way, where you just supply the data, the features are engineered automatically, and you basically choose the best model — where you really put your training pipeline on autopilot. There are leaders in this space, but what I would like to see more of is tooling that augments the training pipeline and makes it not fully autonomous, but keeps the data scientist and machine learning engineer in the loop by providing really good dashboards and metrics, with more details and more information on which to make a decision, and that also supports other frameworks — PyTorch, Keras, and others. So that's one direction. Another direction is managing the enterprise workflow for accepting and approving models to go to production. We all like this agile style of deployment — deploying stuff to production, managed by engineers or machine learning engineers — but of course the enterprise workflow is much more complicated. You may have user testing, acceptance testing, and basically verification by a management team or by business users: hey, this model is ready for production. That's what I see most enterprises really requiring. It could be done with classical software engineering tools, CI/CD tools like Jenkins and so on, but obviously machine learning has its own specifics and needs its own metrics to be approved by business users. And you see, this already goes up to the business-user level, not just the machine learning engineer level: in some enterprises every machine learning model will need to be approved by business users, and also has to be compliant with all the regulations, and so on.
1:01:54
Tobias Macey: All right. Well, thank you very much for taking the time today to share your experience working on Hydrosphere and the overall state of affairs for managing machine learning in production. It's definitely a very complex topic, and one that is continuing to evolve as more people get to a point where they're leveraging these capabilities in their products. So thank you for your time and perspective, and for your work on Hydrosphere, and I hope you enjoy the rest of your day.
1:02:25
Stepan Pushkarev: Yeah, thank you for the thoughtful questions. Have a nice day. Take care. Bye.