00:12 Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise. And when you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Douwe Maan about Meltano, an open source platform for building, running, and orchestrating ELT pipelines. So Douwe, can you start by introducing yourself?
01:24 Douwe Maan
Yes, of course. First of all, thanks for having me, Tobias. So my name is Douwe Maan, like you mentioned, and I work at GitLab. I've been there for a little bit over five years now. I originally joined as a developer, then I became development lead, which turned into an engineering management role at some point. And then about nine months ago, four years into my time at GitLab, I moved over to the Meltano project, which of course is what I'm here to talk about today.
01:48 Tobias Macey
And do you remember how you first got involved in the area of data management?
01:53 Douwe Maan
Yeah, so really, nine months ago when I joined the Meltano team is when I got involved in the area of data management. Like I mentioned, my background is in software engineering. I joined GitLab as a developer five years ago, then moved into development and engineering management, and a year ago or so the Meltano team at GitLab was in need of an engineering manager. At that point, having seen GitLab grow from 10 people to about 1,000, I was starting to feel that itch of wanting to work on a new project again, maybe something smaller. So when the Meltano opportunity came around within GitLab, I didn't hesitate and I grabbed it. So I've only been involved in anything related to the data space for about nine months or so. Before that, of course, I knew about data management, but I've really only started to read up and become an expert so recently.
02:38 Tobias Macey
Given that you're so new to this area, what are some of the aspects of the learning curve that you've been running into as you get ramped up on the project and the use case that it fills, and some of the challenges within the overall ecosystem that you're trying to tackle?
02:53 Douwe Maan
Yeah, great question. So it's interesting in a way, because I joined the team back in September of last year, but it wasn't really until March of this year that I actually started digging into, you know, data engineering, data management, whatever you want to call it, and the tools that are available in this space and the problems that they are meant to solve. When I came on board to the Meltano project, it had already been around for about a year and a half inside GitLab, so we had a pretty clear idea of what the tool was that we were trying to build, and I was really just approaching it as a developer building that tool based on the roadmap that we had laid out ourselves, as well as becoming, you know, up to speed on the data space and what kind of tooling is available here, and then the role that Meltano would fill in the space, how that relates to what we really set out to do two years ago, and where our opportunities lie today. The learning curve has actually been less steep than I thought it would be, because I've had the opportunity to talk with a lot of really great people from the data scene, whom I've met by being introduced through channels like Locally Optimistic on Slack, and also the Singer and dbt ecosystems have a lot of really great people in them who have been able to basically point me in all the right directions. I wouldn't call myself an expert on data management in general, but I am a little bit of an expert now on specifically the open source ELT field, in large part because of the help from these people. And through that, I'm actually pleasantly surprised by the amount of material that is available online and written about, you know, how all this fits together. So the learning curve hasn't been as steep as I expected, actually.
04:24 Tobias Macey
And given that you are still new to it, and you still have sort of the beginner mindset as it pertains specifically to data management and ELT, what are some of the benefits that you see that providing as somebody who is taking on the project lead role of Meltano?
Yeah, I think one thing that makes a difference is that my background is very much in software engineering. That means that, especially coming out of GitLab, which is of course, you know, an integrated platform for the entire DevOps lifecycle, from the get-go I am approaching data engineering and this entire topic with a software development and DevOps mindset, and I kind of come into it expecting all of the benefits that best practices of DevOps, like code review, continuous integration and delivery, and version control in general, provide. Historically, it looks like both data engineering and ELT pipelines have been implemented and realized through more visual tools, Informatica being one that of course is a big name, even though today it's no longer seen as the go-to modern tool. But it means that in trying to build Meltano, making it fit into the DevOps lifecycle, as if it's just another software engineering project with people interacting through version control and contributing, was a given. And that means that in figuring out how to build a data engineering or ELT tool, these things were kind of fundamental from the beginning, instead of something that I had to learn over time. So I think one of the benefits of my being a software engineer by trade who is now getting into data engineering, instead of the opposite, which is of course what you see in a lot of other data engineering tools out there, is that people who are looking for some of these software engineering and DevOps benefits will find that Meltano will probably be closer to what a traditional software engineering project might look like, compared to tools built by people who are only just getting into that and trying to learn it.
So from the get-go, if you start working with Meltano, we will expect you to be comfortable with topics like version control, continuous integration and deployment, code review, and the like, because we think that these are a core part of what makes Meltano, and similar tools, valuable to a team, and where a lot of that extra value actually comes from if you're looking at it from a, you know, collaboration and ultimately efficiency perspective. So in that sense, my being new and really having to hear from experienced data engineers what it is that they would like a tool like Meltano to do, with me then kind of figuring out, okay, how do we fit that into this DevOps approach to data engineering, it's been really valuable that I don't have any preconceived notions of what an ELT pipeline looks like. And that's also why, and you'll see this later on in the call too, one of the main things that I'm actually looking for in contributions to Meltano right now is just in the form of feedback by people experienced with data engineering in general, and open source data engineering around Singer taps and targets in particular, but we'll get to those in a second, I'm sure.
07:12 Tobias Macey
And digging deeper into Meltano itself, can you give a description of what it is and some of the story behind it? And I know that you recently pivoted in terms of the main focus behind it, so if you can give a bit of context there as well.
Absolutely. So, like you mentioned, today Meltano is exclusively an open source platform for building, running, and orchestrating ELT pipelines. In a moment I'll clarify, because these pipelines are specifically ELT pipelines built out of Singer taps and targets for the extraction and loading bits, and then dbt models for any transformation. But originally, when Meltano was founded two years ago within GitLab, this was only a part of what we wanted to realize with Meltano. So two years ago or so, in the summer of 2018, the GitLab data team was scaling up, ramping up, as the whole company was growing, and we realized we needed to do more with the data that we were gathering. So we started to build our data team and put together our data stack. And coming from an open source background, GitLab itself being an open source project originally, and even today being an open core product where a really large amount of our engineering time every day goes into the open source version, which is freely available to all, rather than the proprietary edition that we make money on, we started looking for open source tooling first. Before we checked out some of the more popular proprietary and paid tools, we wanted to see if it was possible to build a full data stack, with everything our data engineers, analytics engineers, as well as analysts and data scientists would need, just out of open source tooling. And what we found is that a lot of open source tooling already exists.
And if you, you know, went through the trouble of actually tying all of these components together, you could build a pretty robust data integration pipeline out of only open source components. But we also realized that the glue in between these different components, so you've got to think of the extractors, loaders, the transformations themselves, but also the orchestration, which manages running this on a schedule, making sure it's running reliably, and that we will be notified when it fails, we recognized that between all of these open source tools that existed, this glue hadn't necessarily been filled in. Or at least there wasn't an open source tool available that you could really just get started with and, 10 minutes later, see your data flow from the data source to the data warehouse, and then also have the opportunity to actually start analyzing it. So what we realized is that there would be value in us building the tooling to glue together these various open source components, and Meltano was founded in GitLab with the idea being, well, we want this ourselves, so let's build it, and let's also build it in the open as open source. Relatively quickly though, we came to the conclusion that the pace of development of Meltano was not able to keep up with the growing needs of the actual GitLab data team that was supposed to use the Meltano project. And this has to do with the fact that, of course, we wanted to extract data from various data sources, SaaS APIs, and data formats, and load it into a data warehouse, but we realized that if we were going to have to write and maintain these data extractors ourselves, this would take a lot of time and a lot of effort that at that point might be better spent actually, you know, getting something out of that data, and we might be better off going with a proprietary tool for the moment. But at GitLab, we still very much believed that this should all be possible with open source tooling.
So the Meltano project stuck around, even though the GitLab data team at the time was no longer actually using it. And since with GitLab we had found quite some success in offering, you know, a full tool for the DevOps lifecycle to an entire engineering team, doing everything from version control, issue tracking, CI/CD, to some amount of tooling around, you know, security checking as well, we saw that there was a place in the open source space for a similar tool for the data lifecycle. Again, the idea being that, composed of various open source components, you could spin up this tool called Meltano and immediately have kind of a starting point for your entire data team, a single source of truth for what their data pipelines and their data strategy look like. Not just from the data integration perspective, like how do you get it from the sources to the warehouse, but also, what do you do with that data next? Whether, of course, you're hooking into analytics or BI software, or working with Jupyter or other data science practices that ultimately all connect with that data warehouse. So that's how Meltano started with this kind of end-to-end vision. We wanted to build a tool that does this all, because we saw value there at GitLab as well. But then over the last two years, it became clear that with this end-to-end vision, while it resonated with other people on data teams out there, we hadn't necessarily been able to actually attract teams that were at that point willing to start evaluating it seriously, or were actually able to start contributing to it to make this vision a reality with us. So two or three months ago, back in March, I really started looking into, okay, we've built something pretty cool now and it does actually work. You can use Meltano for the end-to-end data lifecycle, so it can do everything from data integration to data transformation, and it has some basic point-and-click
analytics functionality built in as well, with a modeling language inspired by LookML that allows you to basically describe the schema in the data warehouse, describe how the various tables relate to each other and can be joined, so that you can then use the Meltano interface to kind of point and click and get some simple dashboards and reports out of that. But we realized something about this future where an entire team would be able to use Meltano, and I don't want to say nothing else, because we recognize that at any of these steps in the lifecycle there are specific teams that might have bigger needs, going beyond what Meltano is able to deliver today or ever, and Meltano is not too opinionated about whether or not you use all of it, or whether you kind of pick and choose and decide to swap certain things out for possibly a proprietary tool or some other open source tool. But we realized that the story of Meltano and a team actually adopting it still depended on that entire team first being convinced of the extra value they would get out of Meltano compared to what they are using today. And if we wanted this team to actually contribute and help us make this a reality, that also means that this team would need to be, you know, fairly technical, or at least comfortable contributing to Python projects. And not just to the Meltano project itself, but also the specific extractors and loaders for all of the various data sources and data warehouses. So we came to the conclusion that with this end-to-end vision, we were not actually able to get the people excited that we needed to get excited to make this a reality with us, because if you want to convince an entire team of the value of this integrated tool, then it basically already needs to have reached a level of quality in each of the various steps that make up that lifecycle.
And ultimately, the people getting the most value out of any data project are of course the people getting the insights at the end, the people doing the analysis or running, you know, notebook projects against the data, for example. And these are not always the same people who are actually capable of contributing to extractors, which are, you know, highly technical Python projects specialized in, for example, pulling data out of Salesforce or Google Analytics or Facebook Ads or what have you. So we reached the conclusion that in order to make this eventual future, in which there is a single end-to-end tool that teams can get started with, built only out of open source components, a reality, you really had to start at the beginning of the journey, which is data integration. And the interesting thing is that we realized that in the open source space, there exist a number of really great open source tools that kind of sit at the end of the data lifecycle, so think of BI analytics tools like Redash, Metabase, or Apache Superset, which you can connect to the data warehouse and then get going from there. And there exists really great open source transformation tooling as well, specifically dbt, with dbt models, which analytics engineers who are capable of SQL are, you know, obviously for great reason, getting really excited about these days to transform their data.
But then for the first step in the pipeline, where you're actually getting your data out of the data source and piping it into a data warehouse, we realized that there were some open source projects that kind of attempt to make a dent here and try to offer something that could really serve as an alternative to some of these proprietary tools, and there are some tools out there. But today, in large part because the hosted proprietary ELT and data integration platforms have such great data source and data warehouse support, both in quantity and quality, a lot of companies out there, even if they do use open source technology in the transformation or the BI analytics stage, or if they were interested in putting together an end-to-end data stack completely based on open source components, including something like Airflow for orchestration, for the actual data integration bit, most companies that don't have the resources to actually build and maintain all of the extractors themselves would end up opting, for very understandable reasons, for one of these hosted proprietary tools, where of course you pay some money up front, or I guess in most cases on a subscription basis or usage basis. And then, of course, you hand over all of the burden of both building and maintaining these extractors, as well as the burden of actually keeping these pipelines running stably in a production environment, you leave that to this other party. So we realized that the place where our end-to-end story would really have to start is that integration bit, because no one is going to switch to a full open source end-to-end data lifecycle tool unless its data integration chops are competitive with what people can find in paid proprietary tools out there.
So a decision was made to, for the time being, focus specifically on turning Meltano into, you know, a really great and truly competitive open source alternative to these proprietary ELT platforms out there, with the kind of greater ideological goal being to make the power of data integration available to all by building this true open source alternative. Because right now, the data integration space has essentially become pay-to-play, where unless you actually have the resources in-house to build and maintain these extractors and all of this tooling for running and orchestrating these pipelines yourself, you are almost forced to go with one of the paid options out there. Which means that a large portion of the companies out there in the world that would benefit from doing something more with their data are currently not actually able to make progress on that goal, or that ideal, until they have figured out the data integration step, which usually now means paying for it. So we realized that there was a great opportunity in the open source data space, not on the analytics or BI side specifically, because like I mentioned, there are a number of tools that already fill that need, and on the data transformation stage the same is the case, but on the data integration side. We felt that that is really where the open source data story kind of falls apart, because for most companies today, the open source tooling available just isn't sufficient and cannot truly compete with the paid options out there. So that's sort of why, back in March, we pivoted very specifically to the ELT side of things, and that's what we've been trying to bring to the attention of the public over the last month or so. A month ago, we officially announced that new direction with a blog post, and that is also what sparked my reaching out to you over Twitter and my being on this podcast today.
So I'd love to talk more, you know, over the course of this interview, about this future direction and what it means, and where we could use contributions from people.
18:36 Tobias Macey
Yeah, that's definitely, as you said, one of the biggest challenges. Once you have the data, then it's generally fairly specific to the organization and the questions that you're asking as to what you do with it. But getting a hold of the data in the first place, as you said, is one of the challenges, because of the fact that there are so many different sources, and the number of sources is generally growing at any given time for any given organization. Also, as those data sources evolve and mature themselves, the specifics of how to integrate with them, or the format of the data that they're producing, is going to evolve, which means ongoing maintenance, because you can't just write the integration once and then not have to touch it again. You have to make sure that it stays up to date with all of the representations and all of the available options for what that source data set is able to provide to you. And as you mentioned, one of the projects that you are using to help bootstrap your work is the Singer project, which already has some library of taps and targets for being able to pull that data out and load it into some destination. But I also know that the overall community around that solution is a bit of a patchwork; there's not really any sort of cohesive aspect to it. And so I'm wondering, in your efforts to build this open source data integration platform, what are you seeing as the primary strengths of Singer as an option, and the benefits of using it as your basis going forward? And what are some of the shortcomings that exist, either in the community or the technological aspects of it, that you're trying to improve or work around in the work that you're doing on Meltano?
Yeah, great question, and that's exactly where I wanted to go with this next. So like you mentioned, the Singer ecosystem. So with Meltano, data integration of course starts with extracting and loading, so you need an extractor and a loader: a tool that manages pulling the data out of a data source, and then another tool that manages pushing the data into a data warehouse or other file format or whatever it might be. And Singer is a specification that describes how to write scripts that can take the role of extractor and loader. Specifically, what the Singer specification does is describe a format for the intermediary, I guess, format that the Singer extractors output and that then serves as input to the Singer targets. So ultimately, the Singer specification is not much more than a description of how taps and targets communicate: what format the extracted data should be in at the intermediary step, so that any arbitrary target can take it as input and convert it into the correct, you know, insert statements or whatever you have, in order to load that data into a data warehouse. So a project like Meltano, a data integration platform, always starts with: how are we going to write these extractors and loaders? So when Meltano was originally started, and the GitLab data team started, you know, looking into building extractors and loaders for the data sources that we ourselves would have to connect with, we first looked around to see what formats, what options, and what libraries of existing taps and targets, or rather extractors and loaders, already existed, and we came across Singer pretty quickly.
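To make that intermediary format concrete, here is a minimal sketch of what a Singer tap boils down to: a script that writes SCHEMA, RECORD, and STATE messages as newline-delimited JSON to stdout, per the Singer specification. The "users" stream and its fields are made up for illustration; a real tap would pull this data from an API.

```python
import json
import sys


def emit(message: dict) -> None:
    # Singer messages are newline-delimited JSON written to stdout,
    # where a target (or a runner) reads them from its stdin.
    sys.stdout.write(json.dumps(message) + "\n")


def run_tap() -> None:
    # SCHEMA describes the stream before any records are sent.
    emit({"type": "SCHEMA", "stream": "users",
          "schema": {"type": "object",
                     "properties": {"id": {"type": "integer"},
                                    "name": {"type": "string"}}},
          "key_properties": ["id"]})
    # RECORD messages carry the actual extracted rows.
    emit({"type": "RECORD", "stream": "users",
          "record": {"id": 1, "name": "Ada"}})
    # STATE records how far extraction got, so the next run can
    # resume incrementally instead of starting from scratch.
    emit({"type": "STATE", "value": {"users": {"max_id": 1}}})


if __name__ == "__main__":
    run_tap()
```

Because the whole contract is just these JSON lines on stdout, any target that speaks the same format can be attached on the other end of a pipe, which is what makes arbitrary tap/target pairings possible.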
An interesting thing about Singer is that it was built by Stitch specifically. Stitch is one of these hosted ELT platforms that we think Meltano will be able to compete with one day. And Stitch founded the Singer specification to allow Stitch users, as well as the data engineering consultancies that serve Stitch users, to build extractors for data sources that Stitch didn't support yet, which could then be kind of plugged into the Stitch system once they had passed some review process. So what you see, if you look at the current directory, or library, of Singer taps and targets, is that they exist for a good number of data sources, but especially those for which Stitch doesn't have native out-of-the-box support just yet. And it's very powerful, because like you mentioned before, new SaaS services are popping up every day, and there are various regions around the world where the most popular SaaS tools might not be the same ones that a US company is likely to use. So it's really great that Stitch allows their users to, you know, build these plugins that allow data to be extracted from sources that weren't previously supported. And because Singer has this kind of existing community around it, of both users who are trying to build these taps and targets for use with Stitch, as well as data engineers and data consultancies that are building them for use by their customers, we saw it as the most promising open source ecosystem of extractors and loaders today. Another advantage is that it's all written in Python, which is of course the de facto language of data engineers today as well. There are some alternative, you know, open source ELT specifications, with their own sets of extractors and loaders, that have been written in languages like Ruby.
But obviously that would have downsides from the perspective of wanting to make it really easy for actual data engineers to get started with maintaining these, which is probably why we picked Singer originally. But since Singer was originally founded as, and still explicitly kind of is, primarily intended to be used with the Stitch platform, the Singer taps and targets by themselves don't get you a data pipeline that you would actually be comfortable running in production. So there are a couple of things that we found are currently lacking in the Singer ecosystem. First of all, a really great story about, once you've found a Singer tap and a Singer target, an extractor and a loader, for the data source and data warehouse that you're trying to use, how do you actually turn this into a data pipeline that you would be comfortable running in production, you know, not having to double-check every day to make sure that it didn't break down? Because at the lowest level, Singer taps and targets are just single executables that take a couple of flags and that use standard input and output to consume and produce data following the Singer specification. So you need some kind of runner tool around that, which can actually take care of piping these together in a reliable way, managing the configuration of both the tap and the target, so think credentials or other configuration options, as well as managing the state of the pipeline, so that when the pipeline is run a second time, it starts off where the first one left off. So in the Singer ecosystem, a couple of different runners currently exist, and you can run these locally and they work just fine.
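As a rough sketch of what such a runner does, the example below pipes a tap's stdout into a target's stdin and persists the last state the target reports, following the Singer convention that a target echoes back the state values it has durably committed. The tap and target here are trivial inline stand-ins so the sketch is self-contained; a real runner would invoke installed executables such as `tap-gitlab` and `target-postgres` instead.

```python
import json
import subprocess
import sys
from pathlib import Path

# Stand-in "tap": emits one RECORD and one STATE message as JSON lines.
TAP_CMD = [sys.executable, "-c", (
    "import json\n"
    "print(json.dumps({'type': 'RECORD', 'stream': 'users',"
    " 'record': {'id': 1}}))\n"
    "print(json.dumps({'type': 'STATE', 'value': {'users': 1}}))\n"
)]
# Stand-in "target": pretends to load records, and echoes back the
# state values it has committed, as a real Singer target would.
TARGET_CMD = [sys.executable, "-c", (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    msg = json.loads(line)\n"
    "    if msg['type'] == 'STATE':\n"
    "        print(json.dumps(msg['value']))\n"
)]


def run_pipeline(state_path: Path) -> dict:
    """Pipe the tap's stdout into the target's stdin, then persist the
    last state the target emitted so the next run can resume there."""
    tap = subprocess.Popen(TAP_CMD, stdout=subprocess.PIPE)
    target = subprocess.run(TARGET_CMD, stdin=tap.stdout,
                            capture_output=True, text=True, check=True)
    tap.wait()
    lines = target.stdout.strip().splitlines()
    state = json.loads(lines[-1]) if lines else {}
    state_path.write_text(json.dumps(state))
    return state
```

On a subsequent run, a real runner would pass the saved state file back to the tap (typically via a `--state` flag) so extraction picks up where it left off; handling that, plus configuration and failure notification, is the glue that the raw executables leave to you.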
But then if you want to deploy this into production, you have to figure out yourself how you're going to orchestrate it. Fortunately, you know, Airflow supports a Bash operator, which allows you to just call out to one of these runners, and every orchestration platform or workflow management system supports bash scripts or commands in a similar way, but it still requires quite a lot of manual setup work. So as I said in the beginning, the idea was that Meltano would provide the glue around the open source components it would consist of, so from day one Meltano kind of started filling in this glue and turned itself into a runner for Singer taps and targets. But once you have a Singer runner that takes care of configuration and entity selection and state management, you still actually want to run this in production. So if you're comfortable deploying Airflow, or deploying, you know, Luigi or Prefect or what have you, then you should already be able to use the taps and targets with one of these existing runners, or with Meltano as a runner. But this still means that the learning curve and the barrier to entry are pretty high if you compare it to someone who can just go to, for example, stitchdata.com, sign up, immediately be presented with a dashboard with all of the logos of the supported data sources, click a connect button, enter some credentials, and then have their pipeline running there, and be confident that they can just kind of forget about it and will be notified if something breaks. And for the most part, you can expect that the platform will kind of fix itself, especially if it turns out that the data source changed in a way that made the extractor incompatible. With a party like Stitch, you can of course assume that they will have the resources to fix that, so that even if your pipeline fails, it will stop failing and work again on the next iteration, the next interval tick.
And currently in the open source ecosystem, that tooling around running them and deploying them and monitoring them, and actually being able to say, hey, I set this up once, and now I'll deploy it and I won't have to worry about it again, that doesn't really exist. So a big barrier to entry there is that if you actually want to use an open source and free data pipeline, you have to figure a lot of this stuff out yourself. And again, Meltano is trying to make that a lot easier by providing this tooling and glue that makes it that simple to set up a Singer-based data pipeline that can have an optional transformation step using a dbt model, then also setting this up with a supported orchestrator like Airflow, and then deploying it using, for example, a Dockerfile, which I'm actually working on right now. So this is one of the barriers to entry to the Singer ecosystem. Another one is the fact that the Singer taps that have been created are, you know, varying in quality and in maintenance and in feature completeness, relative to the proprietary data connectors that you might find at the paid hosted vendors. The reason for this, of course, is that, well, like I said before, for really understandable reasons, most companies, even if they did at some point explore using an open source ELT platform, would understandably usually decide to just go with one of the big vendors anyway, because you don't necessarily want to take on that burden of maintaining the data source extractors and loaders that you need all by yourself. So a lot of these taps, even if they've been written once and they work, are not necessarily used frequently enough, or in production enough, for them to actually be maintained to the level where you can get started with them today and actually get great quality out of them.
And part of the reason for that is also that if you actually use Singer taps with Stitch, you're less inclined to build a Singer tap for a data source that Stitch already supports out of the box. And since Stitch's own extractors and loaders are not actually open source, that means that there are more Singer taps in kind of niche markets and local markets, while all of the popular tools are served by Stitch but not necessarily by Singer taps, because there just hasn't been as much of a motivation to build those. Because again, most people using Singer are probably using it with Stitch, and at that point, why would you build a Singer tap if the Stitch extractor already exists? And the same kind of goes for Singer targets, because if you use Singer taps with Stitch, it's actually still Stitch that is responsible for loading the data into your data warehouse, which means that the Singer targets that exist have all been written by people who do want to run Singer taps outside of the Stitch ecosystem.
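To make the mechanics being discussed here concrete: a Singer tap is just a program that writes SCHEMA, RECORD, and STATE messages as JSON lines to standard out, and a target reads them from standard in. The following is an illustrative toy sketch of that message flow in Python; the stream name and fields are invented, and in reality the tap and target are separate executables connected with a shell pipe rather than two functions sharing a buffer.

```python
import io
import json

def toy_tap(out):
    """A toy 'tap': emits Singer-style SCHEMA, RECORD, and STATE
    messages, one JSON object per line, like a real tap writes to stdout."""
    out.write(json.dumps({"type": "SCHEMA", "stream": "users",
                          "schema": {"properties": {"id": {"type": "integer"}}},
                          "key_properties": ["id"]}) + "\n")
    for i in (1, 2):
        out.write(json.dumps({"type": "RECORD", "stream": "users",
                              "record": {"id": i}}) + "\n")
    out.write(json.dumps({"type": "STATE",
                          "value": {"users": {"last_id": 2}}}) + "\n")

def toy_target(inp):
    """A toy 'target': consumes messages line by line, loads the RECORDs,
    and remembers the last STATE so a runner can persist it."""
    rows, state = [], None
    for line in inp:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            rows.append(msg["record"])
        elif msg["type"] == "STATE":
            state = msg["value"]
    return rows, state

buf = io.StringIO()
toy_tap(buf)
buf.seek(0)
rows, state = toy_target(buf)
```

The point of the design is that anything able to read and write JSON lines can participate, which is exactly why a bash pipe, or Airflow's bash operator, is enough to wire a pipeline together.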
29:17 Douwe Maan
While Stitch itself hasn't been particularly motivated to support those, because of course, at that point, you're kind of competing with the hosted offering that they sell. So they are not inclined to build this tooling around the Singer ecosystem, because in a way, they would be empowering the open source community to not need them as much anymore. So there are a couple of things. Like I mentioned, the quality and the quantity of Singer taps and targets is firstly a barrier to entry for new users. The lack of tooling, or the lack of a great deployment strategy, is a barrier to entry. And then there's also the fact that building Singer taps and targets today is not as easy as it could be, or really as we want it to be or as it should be. Because while there exists a lot of documentation around the Singer specification, and there exist, of course, a number of Singer taps that are all open source, and you can find their repos on GitHub and review their code, there doesn't exist a kind of cohesive set of best practices, or a boilerplate or template, that shows you how to get started with a tap that will be, you know, feature complete and robust and reliable and ready for production. And right now, building your own tap is very much a matter of reviewing five, six, seven different taps, taking the best bits from all of them, and trying to piece it together yourself. So additionally, there is opportunity in providing better tooling around actually building and maintaining and testing taps and targets, which will of course increase people's confidence in their own data pipelines. Because today, as good as Meltano could be as a runner and deployment platform for Singer taps and targets,
ultimately, the quality of your Singer-based ELT pipeline is only going to be as good as the specific taps and targets that you're using. So we see a big opportunity in building more tooling and documentation, and perhaps, you know, utility libraries that go beyond what is currently available, to make it easier for people to set up a Singer tap for a new SaaS API they want to include in their data project. And these are all things that various parties, various of these data consultancies, have been asking for. They're already using Singer taps, and some of them also Singer targets and pipelines. Some of them have open sourced some of the tooling they're running themselves, but none of them have gone so far as building an entire kind of suite of tools to really make this as easy to use, and get started with, and keep using, as the proprietary hosted options out there. And that is exactly the gap that we want to fill. Because by empowering the existing Singer community, the Singer ecosystem can really start living up to its potential. And once it does, in combination with the Meltano platform to actually run and build and deploy taps and targets, you will end up in a place where companies that are currently not doing anything with their data at all will be able to get started. For whatever reason, they might not be able to afford one of the hosted options, or there might be legal reasons, like for example GDPR in Europe, or HIPAA in the US if you're dealing with health information, which preclude them from using one of those tools unless you book, for example, their highest-tier subscription levels, which actually do come with things like HIPAA compliance, or GDPR compliance in case you want your data to be hosted in Europe.
So we want to get to a place where, through building Meltano, we empower the existing Singer community to the extent that the Singer ecosystem grows to a place where, in combination with the Meltano tools, even people who are not currently familiar with Singer, and who are not even comfortable writing or maintaining taps and targets themselves, will be able to come here and find something really easy to deploy and get started with, with a great set of data sources and warehouses supported out of the box, so that they can really get started with Meltano. While it might have taken them, you know, another six months, or another funding round, or another one or two data hires before they would otherwise have been able to get started doing anything with their data. And that's why, very explicitly, it's about empowering people to start doing more with their data, and then turning this tooling into a commodity, so that every company can benefit from doing something with their data to the extent that they currently can't.
33:25 Tobias Macey
Today's episode of the data engineering podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages, and helps promote collaboration between data engineering, operations and the rest of the company. Go to data engineering podcast.com slash data dog today to start your free 14-day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. And I want to dig a bit more into the specifics of the actual Singer specification, and the fact that it uses standard out and standard in as the transfer mechanism. But before we get to that, I want to dig a bit more into the focus that you have currently, in terms of the target audience that you're working with and trying to cater to in the current incarnation of Meltano, as you ramp up to the point of being more generally applicable, and how that particular focus is informing or constraining the architectural and design decisions that you are making as you build Meltano in its current implementation.
Yeah, great question. And like you said, it's very important to stress that the target audience of Meltano today is different than what it might be six months from now, and it looks different from what it might be a couple years from now. But today, we are specifically targeting data engineers who are already familiar with, and comfortable with, or at least exposed to, Singer taps and targets. We are specifically targeting people who are already in the Singer community and part of the ecosystem, because these are the people that we can get the most relevant feedback from at this point. These are people who either are already running Singer tap and target based pipelines in production, based on their own kind of hand-rolled setup, or the people actually building these Singer taps and targets, either for their own use or for their clients in the case of consultancies. These are the people who can, at this point, give us the most feedback to kind of make Meltano the go-to tool for building, running and orchestrating your pipelines built out of Singer taps and targets, because these are the people already doing it without Meltano. And we want to, with all of their feedback, make Meltano a tool that they actually want to use going forward for their own work, and then that will empower more people too, because Meltano will be heavily informed by people who have actually already done this.
So if that describes you, dear listener, then definitely check out Meltano. The second kind of people we're interested in at this point are those who have already found out that they are interested in running open source data integration: data engineers who are comfortable, you know, working with open source projects, and are comfortable working with young, still-maturing projects. In some sense, people who won't necessarily be looking for a massive amount of support, and are excited to build this with us, who have already decided they want open source data pipelines, but might not necessarily be familiar with Singer yet. And these are the people who we want to show that Meltano with Singer is kind of the best possible option right now for them if they want a pipeline like this. Because these are the people who, at the next stage, will start using, first of all, the existing Singer taps and targets, but would then also be comfortable potentially contributing to them if they find bugs that they want to fix, or building new taps and targets for data sources that aren't currently supported. And the people we're attracting right now are mostly smaller companies that, for the reasons I mentioned earlier, are not currently part of the market that is addressed by these paid and proprietary tools, for a myriad of reasons. We are seeing quite a lot of interest from developing countries, for example, where of course local income, and the local prices they can charge for their products, are in a lot of cases far lower than what is common in the US, which automatically means that US tools are often out of reach for these companies. So they're more interested in open source.
And after that, once these data engineers have really gotten the Singer ecosystem to a place where the quality and quantity of data sources starts getting closer and closer to what the proprietary platforms offer today, then little by little the target audience will start growing in the opposite direction, right? Even people who are currently paying will start wondering, why am I paying for this, as this Meltano thing seems pretty robust, and it seems to be able to do all the same stuff I'm currently paying for. And at that point, of course, there will still be companies who are looking for 24/7 support, who are looking for all the various things you get with a proprietary vendor, but I can also see a future in which GitLab or other parties will start offering a hosted version of Meltano, which will then, you know, hopefully still be cheaper and more extensible than the other platforms out there today, because we do kind of build on this community-supported ecosystem of taps and targets. And just to take this question one step further: the ultimate future I see is one in which writing taps and targets, or extractors and loaders, will no longer be a responsibility specifically of data engineering teams or data engineering tools. I can see a future in which Meltano data pipelines, and specifically Singer taps, will be so much of an open source standard that SaaS providers, especially newcomers into the market, will themselves author their own official Singer taps, in the same way you will see data warehouses shipping their own Singer targets, because it allows them to immediately plug into the data projects and data pipelines
of all of these users of Meltano and Singer. Well, right now, a newcomer like this into the market would need to wait for one of the big parties to decide to allocate resources themselves to building it, which will take a while, chances are, after the founding of the product. Or it might be that, you know, a customer will need to pay for them to be able to include that new SaaS API or SaaS tool in their data integration story. What this means is that today, if you are entering a market in which a lot of SaaS tools already exist that are widely supported by data integration platforms, you are at a significant disadvantage as a newcomer, because many of your prospective customers, who you would want to switch away from what they're currently using, will not do so if what they're currently using is supported by their data integration platform, but what you want them to switch to, your new tool or your new data warehouse, is not yet. So in the future, I don't think we'll be depending as much on the individual community members or data engineers to build this integration. I think we'll see it being almost a given that any company that wants to be part of people's data pipelines will themselves have an official tap or an official target, which ultimately helps everyone, including the end user. Because everyone using Meltano, or anyone else using a Singer-compatible data integration platform, will be able to connect with all of these data sources and data warehouses from day one, without ever having to worry about the quality of the individual taps and targets, or what to do when a bug occurs, because you can expect that this will be part of the expected offering of the actual SaaS provider you're using.
So that's kind of where we want to go in terms of target audience, but today we're targeting specifically people already part of the Singer community, because today Meltano is primarily a Singer tap and target running and deployment platform.
41:01 Tobias Macey
And digging into the specifics of how Meltano is implemented in its current incarnation, I'm wondering if you can just describe the overall architecture, and some of the ways that it has evolved from the original direction, where it was trying to be this all-encompassing tool that included the entirety of the lifecycle.
Yeah. So since the beginning, we knew that while we wanted Meltano to be a convention-over-configuration tool, where, you know, most people would just be able to get started without having to tweak too much, we did recognize that not everyone would want to use every part of Meltano. In fact, each of the seven letters in the word Meltano actually stands for something: model, extract, load, transform, analyze, notebook and orchestrate, because that was kind of the wider end-to-end vision we had in mind at the time. But we knew that we were not going to be able to convince everyone to go and use all of Meltano at once. So architecturally, Meltano starts with the concept of plugins, and extractors and loaders and transforms, but also transformers like dbt and orchestrators like Airflow, are all types of plugins. And ultimately, your Meltano project, which is a single source of truth for your data pipelines, has a meltano.yml file which references the various plugins that you've plugged in, which are kind of just dependencies that point at either a specific PyPI package, or a Git repo URL that contains a Python package. So because of this plugin-based approach, the decision of, okay, we're going to focus only on ELT for the time being, and not on these other stages that have to do with analysis and notebooking, et cetera, only really meant that we wouldn't stress those other plugins anymore. Because if you're using Meltano with only extract, load and transform plugins, even in the previous iteration, it would basically already be the exact ELT tool that we have today. So that plugin-based approach means that it was very much pick and choose, and you don't need to use all of it. You can use it as a simple Singer tap and target runner, you can use it as a pipeline runner if you also want to add a dbt transformation to that.
And you can use it as a system to kind of abstract away the orchestration layer, if you're comfortable only using pipelines consisting of E, L and T steps that just need to be run on a schedule. So actually, the original architecture of making this very much pick-and-choose and plugin-based allowed us to pivot relatively easily to focus on only a specific part of that whole story. And someone using Meltano today, if they don't dig deep into the documentation, will never know that it can actually do a couple of other things that we are now, for the moment, explicitly not stressing. But we have also not removed these things from Meltano either, because if we do find a user who is, you know, motivated and inspired enough, like, hey, it would be cool if Meltano also did this other part of the data lifecycle, we want this developer, this contributor, to be excited to start contributing in that direction, and making that part of it more powerful. Because we do see Meltano very much evolving in the direction the community takes us, and it doesn't necessarily need to be exactly what I've had in mind from the beginning. It's very likely that we will be, you know, spending months and months or years and years just focusing on ELT, but just like we saw with GitLab, there is power in allowing people to go beyond the standard functionality it offers today and add some extra features that they want. But it's very much up to the community to see where it goes, and fortunately, the plugin-based architecture allows for that really easily. And just as an example of the power of that: right now, Singer taps and targets, to Meltano, are just extractor and loader plugins that happen to use the Singer runner.
So hypothetically, if another extraction or loading framework comes up that people start asking us to support, or if an alternative to dbt becomes popular, it is doable to add a new transformer plugin type, or a new extractor or loader plugin type, to Meltano, which will allow us to move in that direction. Because again, we want to be the glue between these different tools, more so than lock people into a specific set of tools. And the idea is very much that the Meltano project is your data project, where your data engineer, analytics engineer, analyst, et cetera, work from. And we want to be able to evolve with data teams as they decide to move to different tools over time. What we've seen recently is that we started out with supporting specifically the Airflow orchestrator, which means that if you are using Meltano and you want to start orchestrating, or in this case, you know, running your pipelines on a schedule, it's really easy to add Airflow as the backend orchestrator implementation. But because this is also plugin-based, it's relatively straightforward to add support for another orchestrator like Prefect or Luigi. So that, again, it's up to individual data teams what they prefer, what they already have experience with, or what they want to plug into that they already have deployed. And Meltano makes it really easy to specify the different tools your data stack consists of, and how those are tied together, more so than locking you into any specific combination of tools. And that architectural, you know, pattern is very much what has allowed us to pivot as easily as we did, and it's pretty crucial to the future that we see, with Meltano basically outliving the specific open source tools that are in vogue today that people might gravitate towards.
So I think it's less likely that we'll ever move away from Singer taps and targets, because obviously we are also investing in having that ecosystem grow and empowering the community. But on the front of orchestration, you're already seeing that Airflow is not necessarily losing popularity, but projects like Prefect are being considered by new teams over Airflow. Because, of course, these tools also evolve with the data space, and hopefully Meltano will be able to evolve with the data space as well.
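As a rough illustration of the plugin-based project file described above, a meltano.yml declares extractors, loaders, transformers and orchestrators as named, pip-installable plugins, plus schedules tying a tap and a target together. The plugin names and exact fields below are an approximate sketch for illustration, not copied from any real project:

```yaml
plugins:
  extractors:
  - name: tap-gitlab
    pip_url: tap-gitlab
  loaders:
  - name: target-postgres
    pip_url: target-postgres
  transformers:
  - name: dbt
    pip_url: dbt
  orchestrators:
  - name: airflow
    pip_url: apache-airflow
schedules:
- name: gitlab-to-postgres
  extractor: tap-gitlab
  loader: target-postgres
  transform: run
  interval: '@daily'
```

The pick-and-choose nature falls out of this shape: a project that only lists an extractor and a loader is a plain Singer runner, and adding a transformer or orchestrator entry layers on the optional stages.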
46:49 Tobias Macey
Yeah, I definitely appreciate the pluggable aspect of Meltano, and being able to replace the orchestrator, as you said, with something like Prefect or Dagster, and the fact that the Singer taps and targets are able to be built and iterated on in isolation, without having to worry about how they hook into the overall ecosystem or the specifics of Meltano. And digging more into the Singer specification, I'm wondering what you have found to be some of the challenges that you and your users are facing when going from that easy on-ramp of, I can run this locally on my machine, I can get data out of this service, and then I can pipe it into this other service by just using the pipe operator in bash, to some of the complexities of scaling those and deploying them into production and monitoring their execution and their overall health, and some of the ways that you are looking to address that within Meltano.
Yeah, great. So the Singer specification itself, we have so far not really considered changing. I think a lot of the power of the spec is that it currently serves the needs of data engineers; it does, you know, what you would expect from a specification that enables a data pipeline like this. And of course, from our side, there's a lot of value in explicitly starting out not wanting to change the specification, because right now we are not at all in a position where we could do something as divisive as that, and our power is very much in trying to become the go-to runner for Singer tap and target pipelines. What we have found is that on the side of specific taps and targets, there is a lot that teams can do to actually improve those to be more, you know, ready for scale, et cetera. And a party, a group of people, who have done a really great amount of work there is TransferWise, a UK-based startup, I think, that has recently published PipelineWise, which is their own runner for Singer taps and targets, and which comes with their own forks of a number of taps and targets as well. They are spending a lot of time making these targets for, you know, Snowflake, Postgres and BigQuery really great and feature complete and ready for production. There's still a lot of opportunity there, and like I mentioned, what we want to do with Meltano is at some point also empower people to actually build taps and targets that are just as robust as the ones that TransferWise and some other data teams out there are currently building. On the front of the Singer specification, so far I think it provides enough for us to be able to build this platform off of it. Of course, you know, a lot of Singer taps are already being run in production by Stitch.
And they were, of course, built to serve the needs that Stitch has, for plugging into their existing infrastructure, and having them kind of compete with the, you know, extractors that they natively support, for some definition of native, that I don't think are actually written using the Singer specification. So the Singer specification is fine, but I think where people really, you know, run into trouble when they try to deploy these, it's just because there are a number of moving parts. You have the configuration for Singer taps and targets, which needs to be provided in a config.json file, which can be passed in a flag to the actual executable. But this configuration file will contain a mix of, you know, standard boolean configuration values, but also the credentials used to connect to either the SaaS service or the data warehouse. So if you want to deploy a Singer tap or target into production, you've got to figure out, okay, how am I going to separately manage these sensitive secrets from the settings that are fine to have checked into a Git repo? And that's not currently addressed by the Singer specification or the tooling provided around it. Similarly, a Singer tap, when it runs, outputs an updated internal state dictionary, to kind of say, you know, how far have we progressed with syncing data from this data source. And that state, at the end of the data pipeline, needs to be saved and then passed to the tap on the next invocation, so that it starts off where it left off. But if you want to run Singer taps in production, and you only have a tap and a target, two little, you know, pipeable executables, you yourself have to kind of set up the infrastructure around them to manage the state. And then there's also entity selection.
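The state hand-off just described, where a runner has to persist the tap's final STATE message between invocations, can be sketched like this. The bookmark structure and the file-based storage are hypothetical stand-ins; a real runner like Meltano stores state in its own database keyed by pipeline.

```python
import json
import os
import tempfile

def run_pipeline(state_path, sync):
    """One pipeline invocation: load the state the previous run left
    behind, run the sync starting from it, and persist the new state."""
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    new_state = sync(state)
    with open(state_path, "w") as f:
        json.dump(new_state, f)
    return new_state

def fake_sync(state):
    # Stand-in for a tap run: advance a per-stream bookmark by 10 ids.
    last = state.get("bookmarks", {}).get("users", 0)
    return {"bookmarks": {"users": last + 10}}

state_path = os.path.join(tempfile.mkdtemp(), "state.json")
first = run_pipeline(state_path, fake_sync)   # starts from scratch
second = run_pipeline(state_path, fake_sync)  # resumes where run 1 left off
```

Without a runner providing this loop, every team deploying bare tap and target executables ends up writing some version of it themselves.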
If you have a data source that supports a lot of different entities and properties, a lot of different tables and columns, the way that you tell a Singer tap to only actually sync a subset of that is by providing a catalog file. A catalog file describes the entire schema, and in it you can select specific entities and properties. But generating this catalog file today means that you have to either completely manually generate it, based on the specific entities and properties you know you want, or you can run the tap in discovery mode, which is implemented by a lot of taps, which literally just means running the tap like you normally would, with dash dash discover as a flag. This will result in the tap actually outputting, or generating, a catalog JSON file, which by default selects every single attribute, or a subset of attributes. And then the process of actually getting the tap, when you run it in sync mode, to only extract a subset of these entities and properties means modifying that JSON file generated by discovery mode, to add the selected colon true property to the specific entities and properties you want to extract. Which, again, is literally kind of a manual process right now, which might involve modifying a massive JSON file, which is of course very error-prone. So Meltano also helps with that, by adding some commands that make it easier to specify, in a declarative, rule-based way, which entities and properties you're actually looking for, so that this catalog file can be automatically generated and passed to the tap. For configuration, we do something similar, in which Meltano has different configuration layers: environment variables, first of all, kind of following the principle of environment variables being the go-to way of exposing sensitive or environment-specific configuration to your application.
And then there's another configuration layer, which is the actual config object inside your project's meltano.yml file, which you can use for non-sensitive configuration. And there are one or two more options there, based on your preferences and your setup. On the topic of managing the state of these taps and targets, which is really the state of the pipeline: Meltano manages the pipeline for you. You set up a set of scheduled pipelines, and Meltano knows that for each of the scheduled pipelines it needs to store the state and reuse it on the next invocation. So this is kind of functionality that is relatively basic in a Singer runner, a runner for Singer taps and targets, and a number of these runners exist, like I mentioned before, because Singer taps and targets are used by various data consultancies and teams, in production at their clients or in their own teams' data stacks. But then you've still got to go one step further, which means you actually want to deploy these pipelines onto, you know, your own cloud or wherever you have them, or Kubernetes, or maybe you want to use Helm charts and make it really easy to deploy this. And we really want to get to a place where someone doesn't need to know that running Singer taps and targets means dealing with state and configuration and entity selection. We want to make these, you know, simply configurable plugins in the Meltano platform, and Meltano manages everything else for you, both in managing these various aspects of each Singer tap and target and pipeline, and in actually making it really easy to deploy these into production. So in terms of the runner tooling that manages, like I said, state and config and entity selection, a number of tools already exist, but none of these have gone so far as trying to abstract away these kinds of aspects of Singer taps and targets.
And Meltano tries to abstract that away, by allowing users to interact with extractor and loader plugins just like they would with other Meltano plugins, like the Airflow orchestrator or the dbt transformer. Which means that, as a Meltano user, the way you configure any of these different plugins is identical, whether that's through environment variables, or through the meltano config CLI, or through the config object in your meltano.yml file. So we want to eventually kind of abstract away the Singer specification specifics, because ultimately we think it confuses data teams that just want to get stuff done, more than it actually helps them, to expose those underlying bits of what makes a Singer tap or target a Singer tap or target.
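The catalog manipulation walked through above, discovery output plus selected colon true flags on streams and properties, is the kind of thing a rule-based selection command automates. Here is a toy sketch of that rewrite step; the stream and property names are invented, and real catalogs carry much fuller schemas and metadata entries.

```python
# Hypothetical discovered catalog, shaped like the output of a tap's
# --discover mode: each stream carries a schema plus per-breadcrumb
# metadata entries (an empty breadcrumb refers to the stream itself).
catalog = {
    "streams": [{
        "tap_stream_id": "users",
        "schema": {"properties": {"id": {}, "email": {}, "ssn": {}}},
        "metadata": [
            {"breadcrumb": [], "metadata": {}},
            {"breadcrumb": ["properties", "id"], "metadata": {}},
            {"breadcrumb": ["properties", "email"], "metadata": {}},
            {"breadcrumb": ["properties", "ssn"], "metadata": {}},
        ],
    }],
}

def select(catalog, stream_id, wanted):
    """Mark a stream and a chosen subset of its properties as selected,
    the way a runner rewrites the catalog from declarative rules."""
    for stream in catalog["streams"]:
        if stream["tap_stream_id"] != stream_id:
            continue
        for entry in stream["metadata"]:
            crumb = entry["breadcrumb"]
            if crumb == []:
                entry["metadata"]["selected"] = True   # select the stream
            else:
                entry["metadata"]["selected"] = crumb[1] in wanted

select(catalog, "users", {"id", "email"})  # ssn stays unselected
```

Doing this by hand against a catalog with hundreds of streams is exactly the error-prone JSON editing described above, which is why hiding it behind a declarative command matters.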
55:20 Tobias Macey
And in your experience of taking over this team and working with Meltano, and helping to understand the direction to take, and actually building out the platform, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process?
I mean, the most unexpected lesson for me really was that, you know, when I came into the position back in March, I kind of needed to figure out where to take Meltano from here, like, what is the best path for the next couple of months, or the next six months, or whatever. Because, you know, obviously from GitLab's perspective, it has been an R&D project, and it's been, you know, invested in for two years, but we did expect to see results at some point, at least in user uptake and then an actual increase in contributions. And unfortunately, we hadn't seen much of that over the last few years. While there had been some initial interest, and some people who had been giving us feedback, very few people had actually converted into users and contributors. So I ended up in a position where I kind of needed to figure out where to go from here. And at that point, like I mentioned earlier, back in March, I very much wasn't read up on the state of the data space and the state of ELT tooling, and how all of that fits into the needs of the data engineering community, and specifically the open-source-minded data engineering community. So one thing I was just kind of really fortunate to find is that, almost unknowingly, by building this thing for ourselves, we had built something that, I came to the conclusion through talking to some of these Singer community members, was actually something they were really looking for and waiting for. And these people expressed to us that while they had been able to kind of get out of the Singer ecosystem what they needed up to that point, they did feel that the potential of the community and ecosystem was far from being realized today, and that had a lot to do with lacking tooling and documentation.
So I was really fortunate to find that, because we, as the Meltano team, decided to build something that made it easier to build taps and targets, it turned out over time that we had actually, that whole time, been building the exact thing that the Singer ecosystem had already concluded it needed in order to go further and grow. So I was happy to find that we hadn't gone completely in the wrong direction, building an end-to-end platform no one wanted, or betting on some open source technology that was actually, you know, falling out of favor and out of popularity. I found, within a couple of weeks of starting to think about where to go from here and talking to Singer ecosystem data engineers, that Meltano actually resonates a lot with these people when it's explained in these terms of, let's build a true open source alternative for data integration, more so than when it was described as, let's build an end-to-end platform for the data lifecycle and for data teams. And we still believe, you know, that there is value in this end-to-end story in the future, and I would love to kind of see the community take it there and develop it in that direction, if that is where we decide we can bring value. But I was very happy to see that what we've done so far, even if it didn't pay off immediately, is paying off massively now, because we built something that has really hit a nerve over the last month. And the response that I've gotten from Singer community members and data consultancies that are using Singer taps and targets, or evaluating using them in the data stacks that they offer to their end users, all of them that I've spoken to, which is a good amount of the ones that are in the community, have been really excited to not just use it and try it out and give us feedback, but also actually build it and make it happen with us.
I would never have expected, for example, that within two weeks after kind of announcing the new direction for Meltano and the new focus, a company called Applied Labs would already reach out and say that they are planning to replace PipelineWise, which, like I mentioned, is another Singer pipeline runner, with Meltano in Applied Data, which is the integrated data platform that they offer to their clients. They see the value that will come from not just focusing on building a tool that can run Singer taps and targets, but also a tool that provides a UI around it that can be used to create these pipelines, just like you would a pipeline in a tool like Stitch, where you just have a UI, you pick your connector, you configure it, you hit the start button, and then you check in, you know, on a schedule, to see if the monitoring, the graphs and everything, still looks good. We want to develop Meltano in that same direction, because today it's most appropriate for data engineers who are highly technical, but we want to get it to the place where everyone can start using Meltano really easily and see the value. And they have actually committed to putting one and a half engineers on the Meltano project for the next two months, specifically to focus on building out this data pipeline management user interface, which we already have in a basic form in Meltano right now if you run the meltano ui command, but which I explicitly haven't been focusing on, since I've been focused on the CLI and deployment story over the last month or so.
But it's been really great, and kind of confirming of what we're trying to do here, to see that community members are already starting to contribute not just a couple of hours a week when they feel like it, but that people actually believe in this vision just as much as we do, want to make it happen, and are putting their money where their mouth is, working with us on making, you know, that data pipeline UI a reality. And that will definitely be part of making a Meltano that competes with the hosted options out there a reality, because we know that all of these users that we want to target, especially the less technical ones at smaller startups, are not necessarily going to be comfortable running CLIs locally or managing and deploying their own instance, you know, using a Dockerfile. So I've been really heartened to see that we're not alone, and we seem to really have hit a nerve. And I could never have predicted that a month and a half ago, when I was kind of faced with where to go from here, and all the options seemed equally good and bad. But I'm glad that, with the help of the data engineers I've talked to, we've really been able to land on something that we are all really excited about together, and that we're going to try and make real together, because our intention is for Meltano to not be another tool built by GitLab that tries to get some users and grow a community; we really want to build a tool for the data engineering community, with the data engineering community. And that's playing out exactly as I hoped it would a month ago. Are there any other aspects
1:01:40 Tobias Macey
of your work on Meltano, or the overall space of data integration, or some of the challenges in an end-to-end tool for managing the data lifecycle, that we didn't discuss that you'd like to cover before we close out the show?
No, I think we've covered all of it, and I very explicitly don't want to talk too much about the ultimate vision for Meltano today, because even though it's still kind of in the back of my mind as an eventual future which I could see Meltano developing into, it will really be up to the community, and I want to build something great with the community, and so far it seems we're doing that. So we've covered it all. Thank you so much for giving me, you know, this opportunity to talk about the project and to reach a broader audience. And I hope that people in the audience who hear some things that might be relevant to them will check out Meltano and give us some feedback. And even if today it might be quite far from something you would actually consider deploying into production, if you give us that feedback, know that I will continue to work 40 hours a week to make it a reality. And like I mentioned, people are starting to step up who are going to be investing significant time as well. So even if today Meltano is not quite what you were expecting it to be, check it out a month from now, or six months, and see where we've gone, and then help get us there with your help. Let's build
1:02:46 Tobias Macey
this together. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing or contribute to the project, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
1:03:03 Douwe Maan
I mean, my answer can only really be one thing, which is the lack of a true open source solution even existing in this space. I think you can never call a market or space saturated until there is an open source equivalent that can actually rival the paid offerings out there, especially in a space where most of the end users, or at least a good amount of them, are engineers, themselves actually programmers perfectly capable of coming together and building something like this if we actually combine our forces. So in the data management space, and in the data integration space, I think there's a massive opportunity to kind of disrupt it from the bottom with open source technology, and we can do it together. And otherwise, you know, great data integration tools exist; I'm not claiming that there is nothing out there today if you want to integrate your data. But as long as an open source option, something that starts off being free and truly, you know, open and accessible to everyone out there, is lacking, then to a significant part of the market it is as if there were no tool at all. And that is what I think is the biggest gap in the space today: the lack of open source solutions.
1:04:07 Tobias Macey
Well, thank you very much for taking the time today to join me and discuss the work that you're doing on Meltano. It's definitely a very interesting project, and one that I intend to keep a close eye on and possibly employ for my own data platform uses. So thank you for all the time and effort you've put into that, and the rest of your team as well, and I hope you enjoy the rest of your day.
Thank you. It was nice talking to you. Thank you so much for
1:04:32 Tobias Macey
listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used, and visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.