Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

episode 81: Build Your Data Analytics Like An Engineer With DBT [transcript]


Summary

In recent years, the traditional approach to building data warehouses has shifted from transforming records before loading, to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle-tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices, it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what DBT is and your motivation for creating it?
  • Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?
  • Can you talk through the workflow for someone using DBT?
  • One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?
  • The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?
    • Are these packages driven by Fishtown Analytics or the dbt community?
  • What are the limitations of modeling everything as a SELECT statement?
  • Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?
    • What are your thoughts on higher level approaches to SQL that compile down to the specific statements?
  • Can you explain how DBT is implemented and how the design has evolved since you first began working on it?
  • What are some of the features of DBT that are often overlooked which you find particularly useful?
  • What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?
  • What are the additional features that the commercial version of DBT provides?
  • What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?
  • When is it the wrong choice?
  • What do you have planned for the future of DBT?
Contact Info
  • Email
  • @drebanin on Twitter
  • drebanin on GitHub
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • DBT
  • Fishtown Analytics
  • 8Tracks Internet Radio
  • Redshift
  • Magento
  • Stitch Data
  • Fivetran
  • Airflow
  • Business Intelligence
  • Jinja template language
  • BigQuery
  • Snowflake
  • Version Control
  • Git
  • Continuous Integration
  • Test Driven Development
  • Snowplow Analytics
    • Podcast Episode
  • dbt-utils
  • We Can Do Better Than SQL blog post from EdgeDB
  • EdgeDB
  • Looker LookML
    • Podcast Interview
  • Presto DB
    • Podcast Interview
  • Spark SQL
  • Hive
  • Azure SQL Data Warehouse
  • Data Warehouse
  • Data Lake
  • Data Council Conference
  • Slowly Changing Dimensions
  • dbt Archival
  • Mode Analytics
  • Periscope BI
  • dbt docs
  • dbt repository

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-05-20  56m
 
 
00:14
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features, Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers' time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that, you'll get access to the Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And please help other people find the show by leaving a review on iTunes and telling your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Drew Banin about dbt, the data build tool, a toolkit for building analytics the way that developers build applications. So Drew, could you start by introducing yourself? Sure, thanks Tobias. I'm Drew Banin. I'm one of the co-founders at Fishtown Analytics and a maintainer of dbt, the data build tool. And do you remember how you first got involved in the area of data management?
02:47
Drew Banin: Yes. So in college, I did an internship with an internet radio startup based in San Francisco called 8tracks. I started there as a software engineer, but I worked on some internal analytics functions at the company, and so these were things like building an in-house event tracking system and building out some internal reporting. As I kind of built these things out, I understood that I needed to transform these events in Redshift, the data warehouse we were using, and I thought a lot about how to expose these tables for analysis in our internal analytics tool. And so that's kind of where I got my start. I eventually came back to Philadelphia to finish up school, and I started working at RJMetrics, which was a business intelligence company in Philadelphia. That's where I met my two co-founders, Tristan and Connor. While I was there, RJMetrics went through this transition in which the core product, a cloud BI tool, was sold to Magento, and out of that spun another company called Stitch Data. So when we saw this Stitch Data thing happening (it was called the RJMetrics Pipeline at the time), we understood that the advent of databases like Redshift really changed the game for analytics. And so we went off to figure out how to put Redshift to work with all these big data sets that you could pipe in pretty easily.
04:06
Tobias Macey: And so out of all of this, you ended up starting the dbt project. So can you give a bit of an explanation about what it is and your motivation for creating it?
04:16
Drew Banin: Sure. So in the earliest days, I wrote some of the first lines of code for dbt. The big problem was that we wanted to build these logical abstractions in Redshift, and so we wanted to create views that, say, calculated orders. The problem with Redshift was that you couldn't drop a view and recreate it atomically without cascading the drop to all the downstream views. And so we started building this tool that would help you create views and understand the dependencies between different views, such that you could rebuild this whole, like, pipeline of views on Redshift from the command line. Eventually that evolved, and our thinking about exactly what this thing was evolved, and we realized that what we were doing was creating a software engineering workflow for analytics. And so we knew we wanted to version control this code, we wanted to test it, we thought a lot about dev and prod environments, for example. And so all of these things together formed dbt.
05:10
Tobias Macey: And so the primary use case seems to be oriented around data warehouses and doing these in-database transformations and building these views. So I'm wondering if you can just talk a bit about where it fits in the overall landscape of data tools, and the overall life cycle of the data in your analytics pipeline?
05:33
Drew Banin: Sure. So these days, it's pretty easy to get data into a data warehouse. There are off-the-shelf tools like Stitch or Fivetran, or you might roll your own ETL with Airflow. Once that data is there, you want to transform it. I mentioned views; that's how dbt worked in the early days, but since then we've built out dbt's capabilities to build tables and what we call incremental models, which are, like, incrementally refreshed tables, for instance. All of these things happen inside of your warehouse: dbt will connect to Redshift, or Snowflake, or BigQuery, etc., and run sort of create table as or create view as or insert statements to transform this data based on a select statement that you've written to codify a data model. And so dbt fundamentally works by materializing these data models in your warehouse as tables and views. Once you've created these things, there's a whole suite of things you can do with them, like testing and documenting, etc., but probably the thing you want to do is connect your BI tool or your Jupyter notebooks to them, so you can query a clean data set that you've prepared with dbt.
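To make that concrete, here is a minimal sketch of what a dbt model file might look like; the model name, columns, and raw.orders table are hypothetical, and the compiled DDL shown in the comment is only approximately what dbt emits:

    -- models/orders.sql : a dbt model is just a SELECT statement
    select
        id as order_id,
        user_id,
        amount,
        created_at
    from raw.orders

    -- Behind the scenes, dbt wraps the SELECT in warehouse DDL, roughly:
    -- create table analytics.orders as (select ... from raw.orders);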
06:36
Tobias Macey: And one of the questions, particularly out of the idea of cascading the delete and recreate of views in Redshift, but also for being able to create these different tables and views: one, I'm just curious about some of the performance implications of having to do these cascading operations and then rebuild the entire view structure. And then also, in terms of creating tables and potentially manipulating and transforming the data in the warehouse, just being able to have confidence that you're not accidentally removing data or mutating it irreversibly, and just the overall safety of the operations that you're building, and any way to gain enough confidence to not have a panic attack every time you run it by accident.
07:25
Drew Banin: Yeah, it's our belief that all things analytics should be, you know, uneventful. So we think hard about how to make this not a scary proposition.
07:32
So let me tell you a little bit about how dbt fundamentally works. dbt is a compiler and a runner. And so when we talk about compilation, there is a Jinja context (Jinja, the template language) present when you want to execute a model. And so these models are mostly SELECT statements that do some sort of transformation, but you can also use a Jinja function that we've added to the context called ref, for reference. And so by referencing other models, you can build up this dbt DAG of all of your transformations. And so on the topic of performance, or rebuilding specific models, you can do really powerful model selection to say I want to rebuild my orders table because the logic changed or because the data was updated, as well as any model downstream of it. And there are some pretty clever model selectors we provide around tagging and source data sets and parents and children and things like that, that help here. On the topic of performance in general, dbt will leverage database-specific performance tuning functionality. So on Redshift this is sort keys and dist keys, on BigQuery it looks like partitioning and clustering, and on Snowflake it mostly just works the way you would hope. There is support for clustering coming, but we find that you don't need to use it, except for really very large tables. So you can make use of these performance tuning levers in order to make your tables well tuned for queries. So if you have a really big, say, sessions table that you've rolled up from events with dbt, and you want to make sure that your BI tool is snappy when you query sessions, you can throw a sort key on the session start timestamp, for instance, with a config change in dbt. The other question that you asked was about how you make this uneventful and how you get confidence that you haven't broken anything, right? So in dbt, there's sort of a notion of different environments. And so every analyst that uses dbt has their own scratch schema; by convention, we do dbt_ and then your first or last name. So for me, it's dbt_drew. The production schema is usually called analytics or prod or something like that. And so what you can do is, you can run dbt in your local development environment, and it will transform your source data, which is never mutated in this process, but it will transform it and place it into your desired destination schema. So for me, I'm transforming from, say, you know, public.orders, wherever your source data lives, hopefully, you know, Shopify orders or something like that, and placing it into dbt_drew.orders. And in that process of transformation, I'm only affecting data that I specifically want to mutate, which is my sandbox data set. I'm not touching any production analytics schema data that would, say, be queried by your BI tool or external queries. And so what's great about dbt is it works really very well with version control tools like git, and so you can version control this whole repository, you can make your changes locally and run them against your sandbox schema, and then only when you're ready to deploy this thing to production do you make a pull request, have your colleague review your code, merge to master, and then run against the production schema.
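As a rough illustration of ref() and the kind of performance-tuning config described here, a hedged sketch; the sessions and events model names, columns, and config values are hypothetical, and exact config keys vary by adapter and dbt version:

    -- models/sessions.sql
    -- the sort config below becomes a Redshift sort key so time-filtered BI queries stay snappy
    {{ config(materialized='table', sort='session_start_ts') }}

    select
        session_id,
        user_id,
        min(event_ts) as session_start_ts,
        count(*)      as event_count
    from {{ ref('events') }}   -- ref() records the DAG edge to the events model
    group by 1, 2

Rebuilding a model and everything downstream of it then looks something like dbt run --models orders+, though the selector syntax depends on the dbt version.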
10:46
Tobias Macey: And so one option there is to use a common paradigm in software teams of only letting the continuous integration system actually execute against production, so that you never have a human potentially introducing their own error or using the wrong command line flag as they're executing the run, so that you can be sure that everybody has reviewed the code before it actually gets run against production?
11:13
Drew Banin: Yeah, certainly, right. dbt in particular makes this continuous integration question really tractable. So what you can do is, because dbt has such first-class knowledge of environments, you can actually make a CI environment in which your whole dbt project is built from scratch. Because every dbt model is a select statement, you can run them all and they create your desired destination tables and views. But what that means is, when you're running it locally in development, it's dbt_drew, but in your CI environment, maybe you open a pull request and you can automatically run your full dbt DAG into a scratch schema called, say, pr_135 or whatever. Then you can run your tests against that schema, and then if you really want, you could plug in your BI tool for some key reports, swap out analytics for, you know, pr_135, make sure the reports all build, and get user acceptance that the PR, you know, solves the problem that you were trying to solve, or does a good job of representing the data that you want to analyze. And so that's, like, a totally natural and great way to use dbt that we certainly practice at Fishtown Analytics.
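A hedged sketch of how those environments might be wired up in profiles.yml; the profile name, host, credentials, and schema names are all hypothetical:

    # profiles.yml : one profile, multiple targets mapped to different schemas
    my_project:
      target: dev
      outputs:
        dev:
          type: redshift
          host: example.redshift.amazonaws.com   # hypothetical host
          port: 5439
          user: drew
          password: "..."          # typically pulled from an environment variable
          dbname: warehouse
          schema: dbt_drew         # each analyst gets their own scratch schema
          threads: 4
        prod:
          type: redshift
          host: example.redshift.amazonaws.com
          port: 5439
          user: dbt_prod
          password: "..."
          dbname: warehouse
          schema: analytics        # the production schema queried by the BI tool
          threads: 4
    # a CI target would look the same, pointed at a throwaway schema such as dbt_pr_135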
12:22
Tobias Macey: And when I was looking through the documentation, I noticed that there's first-class support for adding tests around the different models and operations that you're creating with dbt. I'm curious if you can talk through how those are implemented, and just the overall workflow of building a new analytics report from scratch using dbt.
12:42
Drew Banin: Sure, so dbt has basically two types of tests built in. One of them we call schema tests, and these are sort of templated assertions about your data. On databases like, say, Postgres, you can set column constraints, and you can say this column should always be unique, or this is my primary key, or it's not null. But most of the analytical warehouses out there don't support this full suite of database-level checks on your columns, and if they do support them, it's actually problematic insofar as, like, on Redshift, if you say that something is unique but it's not unique, it won't be enforced and your query results could be incorrect. So we generally don't recommend that people use database-level constraints, and instead we recommend something called, like, batch data testing, the big idea being that you actually transform the data and then you assert that the things you think about the data are true. And so dbt ships with four built-in tests: there's uniqueness, not-null-ness, referential integrity, and enumeration of accepted values. And so out of the box, you can test that these things hold against any column in any model that you create. You can additionally create your own schema tests with macros, which is a super interesting topic, and what you can basically do is write the template for this. So all the schema tests work the same way: you write a select statement that returns zero rows if the constraint holds, and more than zero if the constraint does not hold. So basically the not-null test says count the number of rows that are null, and if it returns any number other than zero, you know that your test didn't hold. And so you can write a similar type of test for, you know, whatever constraint you can imagine, like, you know, that the sum of two values equals a third value, for instance. So those are schema tests, and once you define them, they're very easy to apply, sort of as metadata, to your model specifications. And they're a really handy way to quickly assert that your assumptions hold, both when you're actively developing models, so you can do test-driven development with dbt, and also it's super useful for, you know, the case of regressions, when you go and modify other seemingly unrelated code down the line. We've all been in the position where we write a bad join that fans out some orders table and inflates revenue, and the schema tests for uniqueness and foreign key constraints, for instance, like, really help find these things before they make their way to production. I think the other question you asked about was the process of actually building a report on top of the dbt models, sorry,
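For reference, the four built-in schema tests are declared as YAML alongside the models; a rough sketch, with hypothetical model and column names:

    # models/schema.yml
    version: 2
    models:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: user_id
            tests:
              - relationships:          # referential integrity
                  to: ref('users')
                  field: user_id
          - name: status
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'returned']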
15:03
Tobias Macey: Yeah, just talking through the overall workflow of getting started with dbt, and going from having a problem all the way through to solving it with dbt and getting it into production.
15:15
Drew Banin: So people certainly built reporting out before dbt existed, and they do a good job of it without dbt. But the thing that dbt really helps you solve is building this corpus of data, this, like, information architecture in your database. And so having a really good definition of exactly what an order is, or what a user is, for instance, that's a really challenging problem when you have multiple different apps with varying levels of authentication, for instance, visitors, users, etc. And so you can build out all these data models with dbt. Usually the starting point is, like, you have some problem, and you just don't want to go and encode the transformation in your actual analysis. So here's what we see a whole lot: you are an e-commerce company, and you sell your product in a dozen different countries, and so every time you want to do any analysis, you have to sort of build a union that unions together the orders from a dozen different countries. And so one day you get tired of doing that, and you check out dbt. Basically what you do is you open up a folder, you pip install dbt, you set up your connection information so that dbt knows how to talk to your database, and then you open a model file. These are just .sql files in some git-versioned repository on your hard drive. And so what you do is you start typing the SQL SELECT statement that produces the transformation that you want to materialize in the database. Maybe you start by, you know, select * from us_orders union all select * from ca_orders, and you enumerate your dozen different data sets, then you press save at the end and you pop open your terminal. And so now it's the actual process of invoking dbt to materialize the SQL select statement as an object in your database. The way that you do that is simply by executing dbt run, which is the secret of why dbt is named what it is: it's very short and easy to type compared to something like "the world's greatest data build tool". That'd be too long, I think. Okay. So once you run dbt, it will find all the models that exist in your dbt project. A git repo is identified by a .git directory; dbt projects are identified by a dbt_project.yml file. And so if you have one of these files, you're in a dbt project, and anything in a folder called models, by default, is considered a model. And so maybe you build that thing, you write your tests to assert that your order IDs are unique and not null, and that every order has a user associated with it, say, and then all you have to do is go into your BI tool, and instead of selecting from a dozen different orders tables in a big union, you can replace that with select * from, in my case, dbt_drew.orders. So if I run that code and it checks out, everything looks great, what I can do is, you know, go do the git flow and push my code, merge to master, deploy to production. In my production environment, the target schema is called analytics instead of dbt_drew, and this is all configuration, you know. And so suddenly you can swap out dbt_drew for analytics in your reports, and you're pointing to the production data model for your unioned orders data set.
The wonderful thing here is, as time passes, if you find that you actually wanted to exclude some orders that were test orders (they didn't really count; you want to filter on the IP address for your office or something), you now have a logical abstraction for what an order is. You're not pointing to 12 different source tables in your BI tool, you're instead pointing to the abstraction of an order. And so you go into your model, and at the very bottom you add, you know, where is_internal = false or whatever, you do the same git flow to push to production, and all of a sudden all your reports that point to orders now have an updated, more correct definition of what an order is. And this is something that you certainly could do by clicking into the 50 different reports that touch orders in your BI tool, but certainly it's probably nicer to just do it in one place instead.
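A sketch of that unioned model, with the later single-place filter shown as a comment; the table names and the is_internal flag are hypothetical:

    -- models/orders.sql : one logical definition of an order
    with unioned as (
        select * from raw.us_orders
        union all
        select * from raw.ca_orders
        union all
        select * from raw.gb_orders
        -- ...one branch per country
    )

    select *
    from unioned
    -- excluding internal test orders later is a one-line change here,
    -- not a change to 50 different reports:
    -- where is_internal = false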
19:11
Tobias Macey: And also, if you don't have this versioned repository of queries and reports and the ability to collaborate on it fairly easily, then you end up leaving everyone to write their own SQL, usually ad hoc, and they might have their own assumptions as to what an order is or what a customer is, or how to structure the query to join across different tables. And so everybody's going to have slightly different views of the data or slightly different outputs. And so definitely having the ability to have one location that everybody can look to, and one interface for everybody to collaborate on, makes it much easier and more scalable to have more people working on building and interacting with these analytics reports and these analytics pipelines.
19:57
Drew Banin: Absolutely, I think it's a great point. What we find is that in the process of building up these data models, what you're actually doing is generating knowledge about your organization, and you're saying here's exactly what an order is, or here's exactly how we calculate MRR. And to that end, dbt ships with auto-generated documentation about your project. You can run dbt docs generate to generate this single-page app that has all of your models, with all the tests on the columns and descriptions that you can populate for these different models. And so if you do have some consumer of the data that isn't using dbt, they have a great place they can go and see all the models that exist and all the columns, and your prose about all of it. And so in that way, it's sort of a catalog of all the data that your organization commands, and sort of instructions for how to use it, too.
20:42
Tobias Macey: Yeah, and I think that that is definitely very powerful, because particularly having the documentation be generated as part of the code, as opposed to something that someone does after the fact or alongside the work that they're doing, means that it's much more likely to stay fresh and actually be updated periodically, rather than somebody putting in the time and effort to write some prose once, when they first build the overall reporting pipeline, and then it quickly grows stale and useless over time as new modifications are made. Yeah, that's absolutely right. And another interesting capability that dbt has is the idea of packaging, and being able to package up these different subsets or reports or transformations so that they're reusable across different teams and across different code bases. So can you talk a bit about how those packages are set up and implemented, and also maybe talk a bit about who the sort of primary drivers are for the different packages that are currently available? Sure.
21:44
Drew Banin: So when dbt runs, it will look in a couple different places for what we call resources. An example of a resource is a model, or a test of a model, or things like documentation snippets, etc. And so one of the places it looks is your models directory, which holds the models that you've created, but the other place it looks is a folder called dbt_modules, which is sort of Node-inspired. And so what you can do is just drop whole dbt projects into that dbt_modules folder, and they get picked up as though they're a natural part of your project, and all of these resources become available in the compilation context that dbt provides. There are basically two types of packages that are produced. One is dataset-specific packages, and the other is sort of macro or utility packages. An example of a dataset package is something like Snowplow. We're huge fans of the Snowplow event tracking system at Fishtown Analytics. The big idea is that you can track events from all your different clients, and they flow directly into a big events table in your warehouse. And so this events table is like an immutable event log: it has the full history of, you know, every interaction that you cared to track, in a single table, which is phenomenal. It's a great resource. But the problem is it's difficult to plug a BI tool right into that, either because it's too much data or because the things you really care about are hard to analyze, like how many people had two different events in a single session. And so what we frequently find ourselves doing is rolling up these events into sessions using some code that was actually originally produced by the Snowplow team, called their web data model. And so what we can do is we can make a package of these transformations that go from raw events to page views to sessions, all the way up to users, and we encode these things as dbt models. And if you include this package in your dbt project, when you type dbt run, these models will automatically run. You can also reference them from your own models. So if you want to do marketing attribution on top of the sessionization that was provided by the Snowplow package, you can absolutely do that. The other broad type of package that we make is maybe more focused on macros. And so the Jinja template engine supports something called macros, which are kind of functions that return text, basically, in most cases text; we've actually been hacking them so they return other things, which is pretty wild how we do it. And so what you can do is, if you find yourself writing the same type of code over and over again, you can make a macro that accepts arguments and spits back out, usually, the SQL that you need to encode that particular piece of logic. So here's a really good example that shows the full, like, force of the dbt compilation engine. Actually, let me grab it: somebody contributed a pivot macro that you can point at a table and a specific column, and you can say pivot out the values in this column using this aggregate function. So you say, look at the orders table, look at the... you know, here's a better example: look at the products table, look at the product's color, and then pivot that out to, like, color_red, color_blue, color_green, with a one if that's true or a zero if it's not. And so this is probably something that a lot of people have written manually many times over.
But with macros, the ability to sort of encapsulate logic, plus packages, which are a distribution mechanism, we can write that thing once, and many, many people can benefit from it. So this is one example of a macro that was contributed by a member of the dbt community, but really this dbt-utils package that contains the pivot macro has dozens of macros, many of which were contributed by dbt users. And the really cool thing is a lot of these people aren't engineers by trade, they're analysts. And so for a lot of them, it's their first time contributing to an open source repository, and that's a pretty cool experience, to be the beneficiary of the code that they wrote.
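Using that pivot macro from dbt-utils looks roughly like the sketch below; the products model and color column are hypothetical, and the exact macro signature may differ between dbt-utils versions:

    -- models/products_pivoted.sql
    select
        product_id,
        {{ dbt_utils.pivot(
            'color',
            dbt_utils.get_column_values(ref('products'), 'color')
        ) }}
    from {{ ref('products') }}
    group by 1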
25:34
Tobias Macey: And as far as the packages themselves, they are primarily SQL or Jinja statements. But I'm wondering if you have any support for being able to create additional filters in the Jinja context for people who are building these packages, and if so, if there are ever any additional dependencies that get pulled in as a result?
25:56
Drew Banin: Sure. So currently dbt does not support custom filters at all, though it is something on our roadmap. And I think that you're right, that's a great opportunity for sharing the same type of code that many people can benefit from. But you're right to identify that the Python environment is one of the most difficult aspects of this whole question. In practice, we find that a lot of the analysts that use dbt don't have pristine Python environments, and so the idea of installing things into them, or getting them to install something that depends on third-party packages, is a little bit difficult. And so I think I'd want to have a better handle on how dbt works with its Python environment in general before tackling that kind of question.
26:42
Tobias Macey: And so the way that all of these models are defined is by these SQL SELECT statements. And so I'm wondering what you have found to be any limitations of creating models based on these SQL queries.
26:58
Drew Banin: Sure, so I think this is one where constraints kind of breed creativity. There are a ton of benefits of encoding all of your data transformations purely as SELECT statements: you get idempotence right out of the gate, and so if you run the same model against the same data set, you will 100% of the time, maybe 99% of the time, get the same output. That's definitely not the case when you write transformations that use delete and update and insert kind of haphazardly. And so behind the scenes, dbt takes your SELECT statement and it wraps it in create table as, or insert into ... select, for instance, in order to either create or mutate the objects in your database, without you actually having to write that specific, potentially non-idempotent logic. And so it can be a difficult thing to reframe an insert statement or an update statement that you have as a series of SELECT statements, but in practice I don't think I've found an update statement that couldn't be rewritten as a select. And really the benefit here is that with SELECT statements you're building data models; update statements are, you know, in-place mutations, whereas with a select you produce a thing at the end. And so every time you produce the thing with dbt, you can test it and you can document it. And so maybe it's strictly internal and you never want the world to see it; you can configure it to live in a custom scratch schema that you delete at the end of your dbt run. But just in the middle of the run, you get an entity that, if there's a bug, you can inspect, you can select from, which you can't do if you just run a dozen, you know, updates and inserts and deletes in order. So truly, I think, you know, in the introduction we have in the dbt documentation, it says in big letters, everything is a SELECT. And I think that's one of the core benefits of dbt: it forces you to think about how to define your transformations as pure transformations via selecting data.
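A small illustration of reframing an imperative statement as a pure SELECT; the column names and the stg_orders model are hypothetical:

    -- Instead of mutating rows in place:
    --   update orders set status = 'cancelled' where refunded_at is not null;
    -- the same logic can be expressed as a SELECT that dbt can materialize, test, and document:
    select
        order_id,
        case when refunded_at is not null then 'cancelled' else status end as status,
        amount,
        created_at
    from {{ ref('stg_orders') }}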
28:55
Tobias Macey: And just overall, being able to make SQL code reusable and composable is difficult and sometimes impossible. And I know that you've added in these capabilities of Jinja templating, which helps to alleviate some of those issues. But I'm wondering what you have found to be some of the remaining issues and potential shortcomings of using Jinja for composing these various different SQL fragments and snippets, and how it compares to other approaches that might use a higher-level language that actually compiles down to SQL and is closer to a sort of Turing-complete environment.
29:39
Drew Banin: Sure. So there was a good post that I read the other day called We Can Do Better Than SQL, by, I think, the folks from EdgeDB. And basically they enumerate all of the challenges of SQL, and one of them is non-orthogonality: you can't always use the same query in places that you think you should be able to, or composition is difficult. They talk a lot about the handling of null values. And so the direction that they get to at the end is, like, we need a language other than SQL to query databases. And, I don't know, maybe so. The other people I saw on Twitter thought this is a really good critique of SQL, but I kind of disagree with the conclusion. I think SQL is, you know, the lingua franca for querying databases, and it has been, you know, since the 70s; it will be for another 40 years, I think. And so the question then is, how do we define, in my mind, good practices around using SQL to make it modular and reusable? And so for dbt, that sort of happens in two different ways. One of them is we think less about reusing SQL snippets; that certainly happens a lot, but if you have maybe a fragment of a query, that's less useful than actually building a named object in the database. And so if the thing you really want is to, like, have a persistent definition of orders in your database, and not a query that can generate orders, dbt is certainly something that makes that very doable. There are legitimate cases where you do want to reuse a SQL snippet, though, and one of the places we see come up a lot is around window functions. And so what you can do is write a macro that returns a window function definition for, say, the sum over some partition with some configurable time range, or a dozen different time ranges you care about. You can encode that in a macro, and then you can reuse the pointer to that macro many times over in many different models. And so the thing that becomes important is thinking about the actual interfaces of, say, these macros. So maybe you have macros that return columns, or macros that return the names of tables or subqueries. If you're not cognizant of exactly the contracts between these macros, it can be very hard to reuse these macros and to compose them. And so the thing I've been thinking a lot about is, if SQL were to have types for SQL fragments, what would they be? And I'm not so far down this path; if anyone has more evolved thoughts on this, I would love to hear them. But broadly, I think that there are interfaces like a query interface. And so something that's queryable is the name of a table or the name of a view; equivalently, it could be a select statement, but only if you wrap it in parentheses. And so regardless of what you have, maybe it's the name of a table, maybe it's actually some query wrapped in parentheses, you can always say select * from and then that thing, right? And so if you're cognizant of the interfaces of your macros, and you're conscious of doing the same thing with, you know, a scalar value, say it's either a literal, or the name of a column, or a subquery that returns a single row, you can do all these things and you can be consistent in your interfaces, and then it becomes a lot easier to compose these things on top of each other.
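A hedged sketch of the kind of window-function macro described here, where the macro's "interface" is that it returns a scalar-valued SQL expression; the macro, model, and column names are all hypothetical:

    -- macros/rolling_sum.sql
    {% macro rolling_sum(column, partition_by, order_by, preceding=6) %}
        sum({{ column }}) over (
            partition by {{ partition_by }}
            order by {{ order_by }}
            rows between {{ preceding }} preceding and current row
        )
    {% endmacro %}

    -- usage in a model:
    select
        store_id,
        order_date,
        {{ rolling_sum('revenue', 'store_id', 'order_date') }} as trailing_week_revenue
    from {{ ref('daily_store_revenue') }}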
32:47
Tobias Macey: Yeah, and I definitely like the way that you approach the idea of having these interfaces to these different concrete objects within the database, rather than just trying to use the typical approach of munging together a bunch of different text snippets into something that's supposed to make sense as a whole, which often doesn't, and sometimes just doesn't even function at all. So I like having that perspective, particularly from someone who is working closely with the space and trying to make these SQL objects more composable and more reusable and able to be collaborated on, versus just giving up on SQL in general and saying, no, we're just going to either move to a higher-level language that compiles to SQL, or just, you know, throw your hands up and have, you know, v1, v2, v3, or, like, 15 slightly different variations of the same query to do slightly different things.
33:41
Drew Banin: Yeah, exactly. You know, I mentioned EdgeDB, but the other prior art here is, I think, LookML, which also compiles to SQL. But they tackle this problem of dimensions and measures; they deal, I think, a lot less with the composition of broader SQL constructs. And so I don't think it's a solved problem in this realm today, exactly. Like, my instinct is that it's not going to be a technological solution, it's going to be sort of a convention-based solution to this problem, in which people kind of add constraints to their own workflows in order to make certain other types of things possible.
34:14
Tobias Macey: And so can you explain now how dbt itself is implemented, and how the design and capabilities of the tool have evolved since you first began working on it? Sure.
34:24
Drew Banin: So when we first built dbt, the only warehouse that was in our purview was Redshift, and so it was in fact very Redshift-specific; it kind of looked like a tool that existed to run queries against Redshift in topological sort order, you know. And so over time, our horizons broadened. We started making dbt work with BigQuery and Snowflake, and it started to look a lot less like a tool that ran queries against a database and a lot more like a compiler. And so there's a parsing step, and after parsing, we pluck out all these ref statements that I mentioned earlier to understand the shape of the graph, and then we analyze the graph and make sure there are no cycles anywhere in your dependencies. This is a whole topic, but dbt has a notion of an ephemeral model, which is sort of like a subquery that can be compiled into other models but doesn't actually get materialized in the database, so we recursively link all the ephemeral models that are relevant in the given context. And so really, it looks a lot more like a compiler, if you look at the code these days, than it does, you know, a tool that connects to databases and runs queries. And what we were able to do was take all the database-specific code out of Python, out of the dbt core package even, and package these things up in their own packages, and these are, like, Python PyPI modules. And so there's a dbt-redshift package that has all the logic required to do the things it needs to do on Redshift. We actually do these things with macros: there's a create table macro in the dbt-redshift package that knows how to create tables on Redshift, that's aware of sort keys and dist keys, and similarly for BigQuery and Snowflake. And so what this lets us do, we call it sort of an adapter plugin architecture, is we can now make dbt work with all types of different databases. And so we just produced, maybe at the beginning of 2019, a Presto plugin, and shortly thereafter a Spark SQL plugin. And I know some folks are thinking about Hive and Athena as well, and Azure SQL Data Warehouse. And so the big idea is, like, we can have our core adapters that we use and test and verify, and if folks want to make their own adapter plugins for their own databases, they can, like, feel very free to do that, and it's totally possible with dbt. And that's the biggest architectural difference between dbt of 2016 and dbt of today: this core divergence of, like, dbt the compiler, which knows about an adapter contract, and the dbt plugins, which know all the specifics, and there are many, of how to build things on Redshift and Snowflake and BigQuery, etc.
37:03
Tobias Macey: Yeah, and I think the recent addition of the Presto and Spark capabilities will certainly broaden the reach of dbt as far as the number of people who are able to use it, because while data warehouses have traditionally been the default place for analytics to take place, data lakes are becoming more prevalent, particularly as we start to have more volume and variety of data. And so people would want to be able to have composable and reusable analytics in their data lake before they get it to the point where it's ready for a data warehouse. So I definitely think that that was a smart move to add those capabilities.
37:43
Drew Banin: Yeah, thanks. You know, for us it was, and remains, a learning experience. We've put a lot of time in on Redshift and Snowflake and BigQuery, but really, professionally, this is our first foray into Spark SQL, and we haven't used Presto before in a production environment. And so there's a lot to learn. Obviously the SQL specifics, like what does a create table as statement look like on Presto. A level above that, it's things like are transactions supported, or should we try to use transactions, how do we atomically replace objects in the database. And then maybe a step even beyond that, which is really the level that we're operating at right now, and where we could probably use the most help from the experts in the community, is how do you actually implement these paradigms well on a data lake. And so on Snowflake we can incrementally build a table with a merge statement, or the same for BigQuery, but on Spark SQL it probably looks more like insert overwrite for a partition, and you do this thing on a partition basis. Like, none of the databases that we use, with the exception of BigQuery, have a notion of partitions; there's Spectrum on Redshift, but it's sort of the same issue. And so we're trying to rethink how do we take the dbt paradigm, which has so many known benefits, and how do we adapt it to these other databases that provide different mechanisms for actually creating or modifying relations in the database. And certainly, that's an area where, like, it comes with experience. We've tried a couple different iterations here, and they work in some ways and they don't work in others, and we're very much still learning.
39:14
Tobias Macey: And what have you found to be some of the features of dbt that are often overlooked, but that you think provide significant benefits and that people should pay more attention to?
39:25
Drew Banin: Sure, so the secret sauce of dbt is the DAG at the core of it. And in a release we made maybe a month and a half ago, maybe six weeks ago, we introduced the notion of data sources into dbt. So previously, when you wanted to build your models, if you wanted to select from another model, you selected from ref and then the name of the model, you know, in your Jinja curly brackets. But if you wanted to select from a source data table, you would just say select * from, you know, schema.table. And so the problem was dbt's DAG ended right before that source data, which is obviously such a crucial part of your transformation, like the actual data that's being processed. And so what we did was we added another function, analogous to ref, but it's called source, and it lets you reference the source data table. And then, separately, using pretty much the same syntax you use to specify models, you can specify sources. And so this is like a YAML configuration, and you can say this is my Snowplow source schema, it has a table called event in it, and then in your model you can say select * from, you know, source, snowplow, event. And this does two things. dbt will make it possible to say run all the models that depend on Snowplow data, which is really cool, and you can visualize this in the dbt documentation; it actually, like, builds the edge. The other killer feature that I think is currently under-utilized is that you can specify the loaded-at timestamp for the source data tables, and then you can use the dbt source snapshot-freshness command to understand if your data is recent within acceptable SLAs. So in your config, you say I always expect my Snowplow data to be less than 24 hours out of date; well, you can run the dbt source snapshot-freshness command, and if that's not true, dbt will exit with a non-zero exit code, and then you can fire off alerts in whatever way you see fit. And so this is really wonderful. Like, after deploying this thing, we set it up on our own internal analytics project, and we realized that three of the tables that we were using for reporting were in fact out of date, and we fixed them real quick. But, you know, there's no way of knowing, really, unless you're in there.
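A rough sketch of a source definition with freshness configured, and a model that references it; the schema, table, and column names are hypothetical, and the exact YAML keys and command name may differ slightly across dbt versions:

    # models/sources.yml
    version: 2
    sources:
      - name: snowplow
        schema: snowplow
        loaded_at_field: collector_tstamp
        freshness:
          warn_after:  {count: 12, period: hour}
          error_after: {count: 24, period: hour}   # the 24-hour SLA from the conversation
        tables:
          - name: event

    -- models/snowplow_sessions.sql
    select * from {{ source('snowplow', 'event') }}

    # checked on a schedule with: dbt source snapshot-freshness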
41:32
Tobias Macey: Yeah, definitely being able to have that visibility of the recency of the data is valuable, particularly if you're trying to create reports with any sort of timeliness.
41:43
Drew Banin: Yeah, that's exactly right. In general, it's this whole conversation about, you know, making analytics look like engineering. If you're working on an app, you instrument monitoring and alerting, because you know that some things are going to fail, and you want to understand what happened before your users find out. So this is another example where we take a tried and true best practice from the engineering world, and we say, like, okay, what's the analog in analytics?
42:09
Tobias Macey: So dbt itself is an open source project, and I know that you've also built the dbt Cloud capability. So can you talk a bit about the additional features that it brings in?
42:19
Drew Banin: Sure, so we built dbt Cloud. I guess the thing to say is, Fishtown Analytics, in addition to maintaining dbt, is an analytics consultancy. And so we built the first versions of dbt Cloud about three years ago, in order to deploy dbt for our clients. We were working with people that didn't have any sort of data competency; we needed to set up a warehouse for them, get the ETL set up, do the data modeling in dbt, plug in the BI tool, etc. And we didn't want to, like, spin up Airflow for this, you know, just to run dbt jobs nightly; that'd be kind of crazy. So what we did was we made this thing called, I think, Hosted dbt, that then eventually morphed into dbt Cloud. The big idea is that it is a sort of user interface and job scheduler on top of dbt. And so from this interface you can go in there and set up your connections, connect to your GitHub repository or whatever provider you use, and say I would like this job to run nightly at midnight UTC, or every four hours. You can also, in the same interface, set up your tests to run, maybe after your jobs have run, to alert you with email notifications if anything didn't run the way it was supposed to run. The other really cool thing that you can do is it builds your dbt documentation for you and hosts it. And so if you want to invite a bunch of your colleagues to look at your dbt documentation, you can do that without, you know, setting up some server behind your VPN, for instance. And long term, it is and will continue to be a user interface on top of dbt. And so we have a ton of metadata about your runs, like how long every run takes, or even within a run, how long each of the models took, for instance. And so exposing this data, either as a visualization or, like, via the API, is super valuable, in addition to just even simpler stuff, like viewing the logs for a job that ran yesterday and failed. You know, if you're running dbt on cron or something in production, some analyst is like, oh, my tables are out of date, and they have to SSH into the cron server to tail the logs, or they have to ask you what happened. And so dbt Cloud is an interface into this, you know, production deployment of dbt, such that everyone that interacts with dbt can understand what's happening, view the logs, view the documentation, and soon, like, build their dbt models within the interface.
44:42
Tobias Macey: And so what have you found to be some of the most interesting or unexpected or innovative ways that you've seen dbt used?
44:49
Drew Banin: That's a really good question. So we have a dbt Slack group that's open to the public, and I cannot say enough good things about it. The people in there are, like, encouraging, helpful, and knowledgeable, but also every now and then we see people that are trying to do some pretty interesting things with dbt, so novel applications. You know, one thing that I think is really compelling is when people use dbt to power their in-app analytics. So they're exposing numbers or charts to their clients, and they use dbt to actually, like, transform that data and test that everything's valid. I feel like, generally, internal analysts are more forgiving than, like, external clients, and so that's a lot of trust that people put in dbt, and I really like to see people doing that. On the topic of maybe particular things, it's probably macros. One of the really cool things I saw someone do was, they wanted to use a recursive CTE to walk, like, a tree or a graph; it was, like, a self-referential table with, I don't know, say customer IDs or something that referred back to itself. And so if you're on Redshift, you can't use recursive CTEs. What this person did was they built a macro that they could provide with, like, a depth, and it would join the table to itself some specified number of times, such that it could, like, flatten out the whole tree. That was pretty cool. It's really neat to see people, like, implement almost database-level functionality using Jinja macros. Like, I once saw a tweet that said, if you feel bad about programming, just remember that computers are rocks that we tricked into thinking. And so I was thinking, like, yeah, it's not crazy that it works, you know?
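In the spirit of that last example, a hedged sketch of what such a fixed-depth self-join macro might look like; the names are entirely hypothetical, and this is not the actual community contribution being described:

    -- macros/flatten_hierarchy.sql : emulate a recursive CTE by self-joining N times
    {% macro flatten_hierarchy(relation, id_col, parent_col, max_depth=5) %}
        select
            t0.{{ id_col }},
            coalesce(
                {%- for i in range(max_depth, 0, -1) %}
                t{{ i }}.{{ id_col }},
                {%- endfor %}
                t0.{{ id_col }}
            ) as root_{{ id_col }}
        from {{ relation }} t0
        {% for i in range(1, max_depth + 1) %}
        left join {{ relation }} t{{ i }}
            on t{{ i }}.{{ id_col }} = t{{ i - 1 }}.{{ parent_col }}
        {% endfor %}
    {% endmacro %}

    -- usage as the body of a model:
    -- {{ flatten_hierarchy(ref('customers'), 'customer_id', 'parent_customer_id') }}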
46:28
Tobias Macey: Yeah, people do crazy things with computers, and sometimes you have to ask yourself, one, how do they have that much free time, and two, you know, why are they even bothering? But most of the time the answer is because you can, and sometimes that's good enough.
46:41
Drew Banin: Yeah, that's exactly it. I mean, the meta point here is that a lot of these analysts that use dbt, like, don't have a computer science background necessarily; they came from some other domain, but they're every bit as, like, capable and creative and, you know, good at their jobs as a software engineer. And so by giving them a substrate upon which they can be creative and, like, build their own tooling, that's, like, such a compelling part of the work that we do, to me. I feel like, historically, analysts have been serviced by products that you pay for. And so certainly, you know, I want a lot of people to pay for dbt Cloud, but fundamentally, in the open source version of this thing, you can do everything you need, and I think that's really great. Really, like, the best part of my job is when I get requests from people that have never made a pull request before. They're like, hey, I have this problem, I solved it, I want other people to benefit from it. That's a really cool part of, you know, the world that is, like, materialized, if you will, by dbt.
47:29
Tobias Macey: And in terms of your experience of building and maintaining the dbt project, what have you found to be some of the most interesting or useful or challenging lessons that you've learned?
47:42
Drew Banin: Sure. So I think that I know more about the DDL and undocumented quirks of the modern warehouses out there than most people in the world. Something you'd think should be easy, like "I want to atomically replace this table," is a really challenging problem across this disparate set of databases. It's like, oh, you want to do it this way on Redshift, but BigQuery doesn't have transactions, or Snowflake has CREATE OR REPLACE TABLE, but if you try to do the same thing on Redshift, you can't when the column definitions change, for instance. And so a big challenge for me was understanding that sometimes it seems like a good idea to abstract over a problem, but if you don't have full knowledge of the actual problem space, you're better served by, say, duplicating code than by trying to build some faulty abstraction and shoehorn it in there. That's a lesson that I've learned many times over in many different domains doing this type of work: no abstraction is better than a bad abstraction. Simultaneously, I've learned the right way to do all this stuff, and that's pretty fascinating too. I'm always happy to talk about DDL specifics. Here's an example: you can't rename a view on Snowflake if it selects from a table that doesn't exist anymore. That's a thing that I know that I wish I didn't, you know?
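As a rough illustration of that cross-warehouse difference (a sketch with hypothetical table names, not dbt's internal implementation):

```sql
-- Snowflake: atomic replacement is a single statement.
create or replace table analytics.orders as
select * from staging.orders;

-- Redshift: no CREATE OR REPLACE TABLE, so the usual workaround is to build the
-- new table under a temporary name and swap it in within a transaction.
begin;
create table analytics.orders__new as select * from staging.orders;
alter table analytics.orders rename to orders__old;
alter table analytics.orders__new rename to orders;
drop table analytics.orders__old;
commit;
```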
The other really big thing for me is that I started as a software engineer writing the code for dbt, and increasingly, much to the benefit of the people that use dbt, I don't write much of the code anymore. Instead my colleague Jake, who's phenomenal, writes the majority of the PRs for dbt core. And so I find myself moving into more of a product management role. The way that manifests on an open source project is that there are a lot of people on Slack that have questions or suggestions, or they show up in GitHub with issues and feature requests. And so it's sort of a trial by fire for me: how do you do a good job of managing an open source community? To my benefit, my colleague Claire, who started just over six months ago, has been the community manager behind dbt, and she's doing such a phenomenal job handling the actual community aspect that I can focus on the community development aspect of it. So you get issues from people on GitHub, and it's like, how do I prioritize this thing? How do I help the user maybe see a different way that we could accomplish this that's more in line with the vision that we have behind dbt? And a lot of it, at the end of the day, is just that there are no rules for how to be a good maintainer. I was talking to a prolific maintainer of open source software at the Data Council conference, and I asked him, how do you know what's acceptable and what's not? And he said, Drew, you've just got to be kind to people, you know, don't be rude to anyone and it'll be fine. And so I internalized that. I try to understand where people are coming from all the time, and also to be firm if there's something that they want really badly but that I don't want to incorporate into dbt. I'm happy to point someone to the fork button on GitHub, you know, but overall there's this shared experience of, you know, we're all trying to build this thing together. Being in the position of product manager and maintainer has been so fascinating, and I love it so much. It's certainly one of the areas where I've grown the most since starting to work on dbt three years ago.
50:54
Tobias Macey: And what would you say are the cases where dbt is the wrong choice, and somebody should reach for a different tool?
50:59
Drew Banin: Sure. So this is a great question. And I think maybe less so these days, but certainly a little bit earlier, people would show up in our Slack group and say, hey, I want to run these, like, UPDATE and INSERT statements, how do I do that with dbt? And we'd be like, hey, wait, listen, here's the best practices guide, you'll find that there's no mention of INSERT or UPDATE in it, here's how to do it the dbt way. So we've been pretty concerned with exactly how to do things well with dbt, and we have a document online called something like "Is dbt the right tool for my data transformations?" I would say that, in general, if you find yourself fighting against dbt rather than kind of cruising along with it, then dbt might not actually be the best way to generate and run the SQL that you need. This is things like custom DDL, or if you want very close to real-time processing, since dbt is fundamentally a batch-based process. And if you don't have the data that you need to transform in your warehouse, dbt is probably not the right tool. But if you don't have any of those challenges, if the data is already there, you can work in a batch-based way, and you're happy with SELECT statements, then dbt's a great way to execute and test those queries, for sure.
52:07
Tobias Macey: And are there any other aspects of dbt or data warehouse transformation that we didn't discuss yet that you'd like to cover before we close out the show?
52:16
Drew Banin: Yeah, there's a whole suite of tools that we built out because they came up so many times for us in our consulting work, or because folks in the community mentioned them. One of them is what we call archival. It's the ability to generate type 2 slowly changing dimensions from a source table. This is something you would run sort of out of band from your dbt process. Maybe every hour you want to snapshot a specific table, and you can understand when the name of a product changes, or whatever value might be updated in place; you can capture that change and then look back and see, in given time intervals, what the value was for specific columns. Archival is pretty cool, and we're overhauling it in our next release; I'm really excited about that. Then there's the fact that so much of dbt is deeply customizable from within your own project, and that's a thing that's hard to convey in a conversation like this. You can write your own macros in your project and override internal functionality of dbt. So if you don't like the way that dbt creates tables and you would prefer to do it a different way, you can make your own materialization, for instance, and just put that in your project. You don't have to fork dbt, and dbt will pick up your materialization definition. That's super cool. One example is a community-contributed materialization called insert_by_period. It's sort of like our incremental materialization, but it does it on a daily basis, which sort of approximates those insert-overwrite partitions we were talking about earlier. That's pretty neat. Beyond that, I think dbt is a sort of game changer for how you actually do the analytics work that you do. It's, I think, welcomed by a lot of teams out there that might have a mess in their BI tools and inconsistencies across different reporting interfaces.
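To give a rough sense of what relying on a project- or package-defined materialization looks like from a model's point of view, here is a sketch; the model, column names, and config options are illustrative and may not match the exact insert_by_period interface:

```sql
-- models/page_views_rollup.sql
-- Hypothetical model using a custom materialization defined in the project
-- (or pulled in as a package) instead of dbt's built-in ones. The option
-- names below are illustrative, not the exact insert_by_period configuration.
{{
    config(
        materialized = 'insert_by_period',
        period = 'day',
        timestamp_field = 'event_at'
    )
}}

select
    date_trunc('day', event_at) as event_date,
    count(*) as page_views
from {{ ref('raw_page_views') }}
group by 1
```

The appeal is that swapping in different build behavior is just a config change on the model, with the materialization logic itself living in plain Jinja inside the project or a package.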
53:56
Tobias Macey: So for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
54:13
Drew Banin: Yeah, that's a really good question. I think that the last few years have brought a sea change in how data is ingested into databases. These off-the-shelf ETL tools are great, and the tooling around actually building ETL, if you need to build custom stuff yourself, is great compared to what it was a few years ago. I think the actual warehouse technology has taken quantum leaps in the past two to five years, maybe a little bit longer than that at this point, going from a pre-Redshift world to where we are now, where there's Redshift, but also BigQuery is amazing and Snowflake is phenomenal as a warehouse. Those are all great. I think the last part of it is the BI part of the stack. These individual tools all work really well, the Lookers and the Mode Analytics and the Periscopes of the world; they're all great tools for what they do. But the actual interoperability with the rest of the stack is, I think, not all the way there. The place where I think the most value can be added is in understanding the connection between the queries that are running in your BI tool and, maybe this is a dbt-centric point of view, which dbt models they're querying, for instance, and actually uniting the whole stack from source data through to your warehouse, transformation, and analysis. If you could have that unified picture of lineage for an actual chart and understand its provenance all the way back to source data, that's a really powerful thing that to date has been elusive. But I think we're getting closer and closer to it.
55:51
Tobias Macey: Yeah, I agree that being able to have that complete end-to-end visibility is highly valuable, particularly when you're just looking at a visualization at the end of it and you say, you know, that's great, but I have some additional questions that you can't really answer unless you know where that data is coming from and what's been done to it. So I agree that that's definitely something to continue to work towards. And so, yeah, I just want to thank you very much for taking the time today to join me and discuss the work that you've been doing on dbt. It's definitely an interesting tool, one that I look forward to seeing continue to be developed and maintained, and one that I'll probably introduce to some of my teammates. So thank you for all that, and I hope you enjoy the rest of your day.
56:34
Drew Banin: Great. Thanks so much, Tobias. It was really a pleasure. Thanks for having me on.