Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

Managing The Machine Learning Lifecycle

Summary

Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.
Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by explaining what Hydrosphere is and share its origin story?
In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
- How does it differ from deployment and maintenance of a regular software application?
Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
For someone who is using Hydrosphere in their production workflow, what would that look like?
- What is the difference in interaction with Hydrosphere for different roles within a data team?
What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
- Which metrics do you track for testing and verifying the health of the data?
What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
- How has that influenced the design and direction of Hydrosphere, both as a project and a business?
- How has the design of Hydrosphere evolved since you first began working on it?
What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
What do you have in store for the future of Hydrosphere?

Contact Info

LinkedIn
spushkarev on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Hydrosphere
- GitHub
Data Engineering Podcast at ODSC
KD Nuggets
- Big Data Science: Expectation vs. Reality
The Open Data Science Conference
Scala
InfluxDB
RocksDB
Docker
Kubernetes
Akka
Python Pickle
Protocol Buffers
Kubeflow
MLFlow
TensorFlow Extended
Kubeflow Pipelines
Argo
Airflow
- Podcast.__init__ Interview
Envoy
Istio
DVC
- Podcast.__init__ Interview
Generative Adversarial Networks

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

June 10, 2019 1h2m
