Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

subscribe
share





episode 112: Building The Materialize Engine For Interactive Streaming Analytics In SQL [transcript]


Summary

Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
    • What was your motivation for creating it?
  • What use cases does Materialize enable?
    • What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
    • How does it fit into the broader ecosystem of data tools and platforms?
  • What are some of the use cases that Materialize is uniquely able to support?
  • How is Materialize architected and how has the design evolved since you first began working on it?
  • Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
    • What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
  • In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
    • A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
  • Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
    • What are the considerations and constraints on maintaining the full history of the source data within Materialize?
  • For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
  • What is the workflow for defining and maintaining a set of views?
    • What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
    • For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
  • The Materialize product is currently pre-release. What are the remaining steps before launching it?
    • What do you have planned for the future of the product and company?
Contact Info
  • frankmcsherry on GitHub
  • @frankmcsherry on Twitter
  • Blog
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
  • Materialize
  • Timely Dataflow
  • Dryad: Distributed Data-Parallel Programs from SequentialBuilding Blocks
  • [Naiad](Programs from SequentialBuilding Blocks): A Timely Dataflow System
  • Differential Privacy
  • PageRank
  • Data Council Presentation on Materialize
  • Change Data Capture
  • Debezium
  • Apache Spark
    • Podcast Episode
  • Flink
    • Podcast Episode
  • Go language
  • Rust
  • Haskell
  • Rust Borrow Checker
  • GDB (GNU Debugger)
  • Avro
  • Apache Calcite
  • ANSI SQL 92
  • Correlated Subqueries
  • OOM (Out Of Memory) Killer
  • Log-Structured Merge Tree

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


share







 2019-12-23  48m