Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com






episode 83: Evolving An ETL Pipeline For Better Productivity [transcript]


Summary

Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
  • Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
    • What are your primary sources of data and what are the targets that you are loading them into?
  • What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
    • What were your criteria for your replacement technology and how did you gather and evaluate your options?
  • Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
    • What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
    • What were the big wins?
  • What was your evaluation framework for determining whether your re-engineering was successful?
  • Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
    • If you have freed up time for your engineers, how are you allocating that spare capacity?
  • What do you hope to see from DataCoral in the future?
  • What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?
Contact Info
  • Aaron
    • agibralter on GitHub
    • LinkedIn
  • Raghu
    • LinkedIn
    • Medium
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Greenhouse
    • We’re hiring Data Scientists and Software Engineers!
  • Datacoral
  • Airflow
    • Podcast.init Interview
    • Data Engineering Interview about running Airflow in production
  • Periscope Data
  • Mode Analytics
  • Data Warehouse
  • ETL
  • Salesforce
  • Zendesk
  • Jira
  • DataDog
  • Asana
  • GDPR
  • Metabase
    • Podcast Interview

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA









 2019-06-04  1h2m
 
 
00:11
Tobias Macey: Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, speedy SSDs, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and one opening in Mumbai at the end of the year. And for your machine learning workloads, they just announced dedicated CPU instances where you get to take advantage of their blazing fast compute units. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page, and Data Engineering Podcast listeners get two months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order. And you listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit in Chicago. The agendas have been announced and super early bird registration is available until July 26th for up to $300 off, or you can get the early bird pricing until August 30th for $200 off your ticket. Use the code BNLLC to get an additional 10% off any pass when you register, and go to dataengineeringpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register for this and other events. And you can go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipelines to DataCoral. So Aaron, could you start by introducing yourself?
02:58
Aaron Gibralter: Sure. Thank you, Tobias. Thank you for having me. Again, my name is Aaron Gibralter. I'm one of the directors of engineering here at Greenhouse. I work with two different teams: I'm on the product engineering side, running a team building one of our products, and I also work with our data science and data engineering teams. Greenhouse is a talent acquisition suite that helps companies acquire and retain the best talent. And I guess, as is the case for most companies these days that build software, data is incredibly important to us.
03:34
Raghotham Murthy: And Raghu, could you introduce yourself as well? Absolutely. My name is Raghu Murthy. I'm the founder and CEO of DataCoral, a company where we are automating the building of data pipelines, and we have built it in such a way that it's all serverless. We have been working with Greenhouse over the past couple of years, so I'm really excited about this conversation, where we're able to take folks through the journey that we all went through, why we started working together, and how they got started on using their data as well.
04:04
Tobias Macey: And also, for anybody who wants to dig deeper into DataCoral itself and your experience of building and growing that company, I'll refer them back to the other interview that we did with you, and I'll add a link to that in the show notes. And so going back to you again, Aaron, do you remember how you first got involved in the area of data management?
04:22
Aaron Gibralter: Yeah, I think it's a bit of an interesting story for me. As I mentioned before, I work with two different teams, in two different disciplines. Product engineering and data are a bit different, and so the data piece was a little bit more happenstance. I joined Greenhouse four years ago, and not too long into my tenure, my boss, now our CTO, who was the VP of engineering at the time, Mike, asked me to step in and get involved with our data scientists. At the time we had one, Andrew; he came from academia, he had a PhD in astrophysics, and he had gone through one of the data science accelerator boot camps to transition from academia to the business world. So he joined Greenhouse and started building out some of the data pipelines that we used to wrangle the data, and also started to build out our reporting and analytics capabilities, as well as some machine learning stuff. But he was alone working on the team, and Mike asked me to start to work with him to think about different use cases. And so that's how I got involved. It was a need the company had, I had a little bit of spare bandwidth at the time, so it was, you know, just luck in a sense, but it's definitely become a huge interest for me, and I've become quite passionate about both the data science and data engineering side. And Raghu, for anyone who hasn't listened to your prior interview, can you share again how you first got introduced to the area of data management? Yeah, so I've been an engineer working on data infrastructure and distributed systems for a while, starting back in the day, building out the kinds of data pipelines that typically need to get built out around these tools.
06:14
Tobias Macey: And so going back to you, Aaron, you mentioned a bit about what it is that Greenhouse does. I'm curious if you can talk a bit more about some of the ways that you use data within the business, and your overall data infrastructure as it was before you made the move to DataCoral.
06:30
Aaron Gibralter: Sure, it's interesting, I think we've come a long way. As I mentioned, Greenhouse is a talent acquisition suite; we have a number of different products that help companies with their hiring and onboarding process. It's SaaS software, so companies buy our software and it's provided through the web. And as a result, I mean, I don't want to sound like it's unique to us, I think pretty much any software company now generates a ton of data: every user interaction, and the state of the database at any given time, has a lot of meaning. So for us, the use cases for data, which is kind of this big amorphous blanket term, span the gamut from us understanding our customers and how much value they're getting out of our platform, to helping our customers understand their own data even better, and everything in between. The state of our data pipelines before DataCoral was, as I mentioned, that we had one data scientist, who is a bit of a polymath, a generalist; he's always been interested in everything from the infrastructure side to the modeling side. And so he actually dug in and ended up building out our pipeline himself. What our infrastructure team did at the time, this is three or four years ago, is they set up an EC2 instance for him in our VPC, gave him his own RDS instance, and said, do whatever you want, go to town. So he stood up Airflow there and built out a series of ETLs to start to pull data from our production databases, just connecting to our followers in our infrastructure, pulling data out and reshaping it into the data science warehouse, which was his RDS instance, and then connected a BI tool; I think at the time we were using Periscope. He started to build dashboards on top of it so that we could understand our customers' behavior. We had a feature at the time, I think it was called the matrix, and the idea was that it was a matrix of every feature in our product and every customer, with checkboxes for which features each customer was using. That gave both our product team and our customer success team a better understanding of the lay of the land. So that was the first stage in exposing this data. Going back to your question, the use cases now have expanded; the company is much bigger, and we've even gone into building predictive features for our customers. We have a feature in our product called Greenhouse Predicts, and what it does is it looks at the status of a particular job pipeline, how many candidates are in every stage, you know, how many candidates are in application review, initial screen, phone screen, onsite, and so on, and we built a model to predict when we expect a hire to be made. That's a feature that we offer to our customers, and it requires us, to train the model, to pull data out of our data warehouse in a particular shape; we build the model, train it, and then deploy an inference engine to allow our application to make those predictions. Those are the range of use cases that we have. But as I mentioned before, most of it was ad hoc, point-to-point ETLs before DataCoral. I know that was a bit of a long-winded answer to your question.
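To make the homegrown setup Aaron describes a bit more concrete, here is a minimal sketch of the kind of Airflow DAG that pulls a table from a Postgres follower into a warehouse database. The connection strings, table names, columns, and schedule are illustrative assumptions, not Greenhouse's actual configuration, and the import paths target Airflow 2.x.

```python
# Sketch of a homegrown ETL like the one described above: copy recently
# updated rows from a Postgres read replica into a warehouse database.
# DSNs, table names, and the schedule are illustrative assumptions.
from datetime import datetime, timedelta

import psycopg2
from airflow import DAG
from airflow.operators.python import PythonOperator

SOURCE_DSN = "postgresql://readonly@prod-follower:5432/app"          # hypothetical
WAREHOUSE_DSN = "postgresql://etl@data-science-rds:5432/warehouse"   # hypothetical


def copy_applications(**_):
    """Pull rows touched in the last day and upsert them into the warehouse."""
    with psycopg2.connect(SOURCE_DSN) as src, psycopg2.connect(WAREHOUSE_DSN) as dst:
        with src.cursor() as read_cur, dst.cursor() as write_cur:
            read_cur.execute(
                "SELECT id, candidate_id, stage, updated_at "
                "FROM applications WHERE updated_at > now() - interval '1 day'"
            )
            for row in read_cur:
                write_cur.execute(
                    "INSERT INTO analytics.applications (id, candidate_id, stage, updated_at) "
                    "VALUES (%s, %s, %s, %s) "
                    "ON CONFLICT (id) DO UPDATE SET "
                    "candidate_id = EXCLUDED.candidate_id, "
                    "stage = EXCLUDED.stage, "
                    "updated_at = EXCLUDED.updated_at",
                    row,
                )


with DAG(
    dag_id="copy_applications_to_warehouse",
    start_date=datetime(2019, 1, 1),
    schedule_interval=timedelta(hours=1),
    catchup=False,
) as dag:
    PythonOperator(task_id="copy_applications", python_callable=copy_applications)
```

Note that this incremental pattern depends entirely on the `updated_at` column being accurate, which is exactly the weakness discussed later in the conversation.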
10:24
Tobias Macey: No, that was great. And just to give a bit of a flavor, I'm curious about the overall team size and structure that you're working with, in terms of the data engineer versus data scientist breakdown, and the overall number of customers for that infrastructure. Sure. And by customers, do you mean internal users?
10:47
Aaron Gibralter: So when we started, as I mentioned, we literally just had one data scientist, with some support from product engineering and infrastructure, doing everything. At the time when I joined, four years ago, I think our engineering team was probably around 20 to 30 people, and the whole company was around 70 people, and we had one data scientist doing everything. Now the company is 300, or over 300, and our engineering team is somewhere between 70 and 100; you know, it's always hard to keep track with how much we're growing. But we still have a pretty small data science team, and I'll go into this a bit more later on, but I think what DataCoral has provided us is the ability to punch above our weight class in terms of our data science team and what we're able to provide. So we have no data engineers at Greenhouse. We have some teams that help with data engineering, but we have no dedicated data engineers, and we have two data scientists and we're growing right now, so we'll be bringing on another, say, two to three data scientists this year, with no expectation or need to hire any data engineers to support that. That's largely because DataCoral gives us the tools to manage it ourselves without needing to handle the infrastructure. The number of internal stakeholders we have has grown a lot. We predominantly started with customer success as our main stakeholder, but we have moved into working a lot with marketing, finance, sales, support, and engineering and R&D, so understanding our own site performance, our engineering throughput, our product OKRs. There's so much that we can do, and so much more that we could do, and that's why we're growing. As I said, I think we're able to punch above our weight class and do a lot more because we've adopted DataCoral and it provides us a lot of leverage.
12:58
Tobias Macey: And can you talk through a bit more about the types of data sources that you're working with for pulling into your data warehouses and into your analytics?
13:07
Aaron Gibralter: Sure. At first, when we managed our own ETLs, it was mainly just our production database. We run Postgres as our main production database for our application, so it was mainly pulling data out of that about what our customers were doing in the product, like what data artifacts were being left over from user interactions. And I think we may have built out some Salesforce ETLs to pull data from Salesforce, because Salesforce is a source of truth around customer information, like key points of contact and segmentation, what industry the company is in, or what their addresses are, like company location. So those things we were pulling from Salesforce. And now we pull in from a lot of sources into our data warehouse with DataCoral. The main sources: again, the production database is probably the most important, but we also pull in Salesforce data, Zendesk data, and Jira. Salesforce, Zendesk, and the production database are all more customer data, whereas Jira is our internal data, you know, what cards were shipped, what features we're shipping and how quickly. And now we're actually going deeper into that, pulling data from Datadog and Asana, so monitoring and other project management software. And so that's been incredibly helpful to have in our warehouse.
14:43
Tobias Macey: Yeah, being able to correlate general usage patterns with when a specific feature might have been shipped, I can definitely see as being very valuable. Exactly. And so in terms of the data pipeline as it existed leading up to the point where you started looking around for alternatives, I'm curious what the biggest pain points were that you were dealing with, and the ultimate motivation that led you to reevaluate your approach to the ETL processes that you were managing.
15:12
Aaron Gibralter: I think it was a combination of things that led to the reevaluation. The biggest pain points were that, at the time, our team was quite small, it was just one person, so there was a lot of concern around a single point of failure, with all these ETLs being built out in kind of a custom way, with one person building them. You know, we always talk about bus factor in engineering; I like to maybe think about it more like the lotto factor, or something a little less dark. But, you know, if Andrew decided to leave, I think we would have been in a really bad state. He was kind of working in his own world; we even named his part of our VPC after him. And whenever you have something like that, it's a bit of an organizational smell, or a robustness smell. So one of the biggest pain points was that idea of the single point of failure. The other thing is that maintenance was taking up a fair amount of his time. Ideally, a data scientist should be spending most of his or her time in some sort of leveraged capacity around data, as opposed to working on the plumbing. And, you know, I would have to ask him, but my impression was that he spent between 25% and, at the worst times, maybe 50% of his time wrangling data, as opposed to spending it on analysis or predictive pieces. So that was definitely a big pain point, how much time we were investing in it.
17:05
Tobias Macey: And so once you came to the point where you decided that the current state of affairs wasn't really going to be tenable in the long run, I'm curious what your criteria were for determining what a replacement would look like, whether that was a build versus buy decision, or if you were looking at bringing on dedicated data engineers on staff and the types of tools that they might be interested in, and just the overall process of going through that evaluation and then the ultimate decision making.
17:32
Aaron Gibralter: Yeah, I think the way you put it there is a generous way; in some ways it was less intentional, and in many ways we got lucky. Things were working, and in some sense the main costs, the main pain points, were opportunity costs. Because I was new to the field, I didn't really know what good looked like, so I think we were okay in some ways with the status quo. Luckily, we had an advisor through one of our investors who would speak with Andrew on a regular cadence, and he mentioned that we should consider looking at DataCoral and thinking about our data stack. And so it was through that that we started to explore it and realized what the possibilities would be. But, you know, in some ways I wish that we had been more intentional earlier on and said, hey, what does this look like a year from now? I think we were just kind of treading water and doing the best we could with the situation. And so I do think that, in retrospect, if I were to do it again, I'd have a much better perspective on efficiency, and where we're spending our time, and what leverage looks like. But ultimately we got lucky that DataCoral came along. It was also very early in DataCoral's existence, so it was kind of a fortuitous moment; it was just luck that we found each other at the right time, and we could grow together and get to a much better place.
19:13
Tobias Macey: After you got introduced to Raghu and DataCoral, and made the decision that you were going to replace your existing pipeline with what DataCoral was building, can you talk through what your overall experience was of getting onboarded, making the cutover, determining the data quality, ensuring that it matched what you were already getting or that it was improved, and the overall experience of drawing down the previous infrastructure in favor of DataCoral?
19:45
Aaron Gibralter: Yeah, definitely. Just going back a little bit, another big piece in terms of our evaluation criteria: the one thing we knew all along was that we never wanted our data to leave our VPC. As is the case with a lot of companies, and maybe not all, we care a lot about security, and the compliance piece for us is of extreme importance. We have a lot of very sensitive information at Greenhouse, and we would obviously never use our customers' data in a way that we weren't explicit about. So one piece there is that we knew it was either something we would have to build internally, or, well, the idea of a vendor where we would actually have to pipe data into someone else's warehouse or through someone else's pipes was not really an option. DataCoral was built with the idea that the data stays in your own infrastructure from the start, and so when we found that out, one of those barriers to entry was immediately lowered, and it became something that we were interested in testing out. So that was a really important piece in the evaluation process, and it allowed us to test out a use case without moving mountains, because right now, every time we add a sub-processor, it kind of has to go into all the contracts, and all the customers have to be made aware of where their data is going to flow. We do use vendors, we use AWS, so it's not like we're building our own hardware and doing everything on prem, and it's not crazy that we would use a vendor to achieve some value for our customers. But in this particular use case, we knew that we wanted to keep the data flow in our own VPC. So when we talk about the cutover process and using DataCoral, I think it was quite easy for us. We did a security and compliance review to make sure that all the CloudFormation templates that DataCoral was handing us, and all the stuff that they were providing us, was sound. But beyond that, it was easier for us to give it a try because we knew that it was running in our own infrastructure. And so that was the lens that we used: hey, we can stand this up, let's try it out and see how it works, and if it seems like it's worthwhile, if it's easy and provides value to us, then we'll continue, and if not, we can easily spin it down. So that was a pretty easy way for us internally to evaluate it. I talked with Raghu a little bit about this, but because we started using DataCoral very early in DataCoral's existence, our process was definitely not cut and dried; there was a lot of us working together to figure out the solutions to our problems. But I think we got to a really amazing place, and that process now, for a company evaluating DataCoral, would be quite different. DataCoral is quite mature now compared to then, so whatever I say here would be very different for someone trying it out now.
But yeah, that process probably took, I don't know, Raghu, you may have a better sense of it, probably over the course of a year that we adopted DataCoral and were finally able to sunset our existing ETLs. Maybe even longer than that before we sunset our existing ETLs. But I think that if we did it today, it would be much faster.
23:43
Raghotham Murthy: Yeah, so maybe I can add a few things here. We started talking to Greenhouse very, very early on in our existence, and clearly my goal at that point in time was to see whether the kind of architecture that I'd come up with, and that we were working on, actually made sense. The great thing about Greenhouse has been, as you can imagine, that for an early stage company the most valuable thing is time from potential customers, and patience. Aaron and his team were convinced about the architecture, and as he mentioned, the fact that we are running within their VPC lowered the barrier to entry for them to actually try it out. And then, just like what we do even now for customers, it started off as one use case. I think the first use case was to pull data from their production environment and move it into a data warehouse, Redshift in this case, and essentially get Andrew out of having to query a follower of the production database, or having to do anything in an ad hoc manner. Instead, it was better to have him just focus on doing the analysis itself, and he could do that directly on Redshift. So the first step for us was to just replicate one data source, and then slowly but surely we started adding more and more connectors. Aaron mentioned before pulling from Salesforce and Zendesk and Jira, and I think at this point I have lost track of the number of different sources from which they're collecting data. But in terms of the cutover period, the way that typically happens is that if companies already have something that is processing their data, they don't want to just rip and replace everything, because that actually is much harder to do. Instead, the better thing to do is to take one use case and actually make it work. And our whole microservices-based architecture means that we can live alongside whatever companies might have. So as more and more use cases came along where people were able to just use DataCoral directly instead of whatever they had originally, there were net new use cases being worked on directly in DataCoral, and of course the existing stuff can be moved, or you can keep it around and sunset it, or however you want to plan it out. The goal for DataCoral is to make sure that whatever use cases you want to move, those are the ones that we can make super easy. And again, very early on in our engagement with Greenhouse, given that we had essentially an overall idea of the architecture and an initial, even pre-alpha, implementation, we did go through a lot of learnings while working with Greenhouse, even on the whole security architecture and things like that. I mean, we worked pretty closely with Greenhouse to get it to a point where not only was their security team happy with it, but we were able to leverage that work and get a lot out of it, even with our other customers, or even while working with AWS to get to the Advanced Technology Partnership. There's a whole set of questions in the questionnaire, as you can imagine, when you're trying to get to these compliance or partnership levels, around security and whether you're a data processor, and all of those questions essentially were completely irrelevant to us, because all of our software was running within the customer VPC.
27:09
Aaron Gibralter: The security piece, I think, we worked through very closely together. You know, as I said, security is of utmost concern to us, and I think that was the hardest part of getting started with DataCoral, not in a bad way, but we just had to figure out an architecture that made sense such that DataCoral would not have access to any of the underlying data. And we were able to get there. So it was really a great experience working with DataCoral to make sure that that was the case, and I'm happy that it contributed to the overall standard architecture for how DataCoral does these engagements.
27:47
Tobias Macey: And Raghu, I'm also curious about some of the edge cases or sharp points in your infrastructure and architecture that ended up getting ironed out in the process of onboarding Aaron and Greenhouse, and any of the other customers that you were working with in a similar time frame.
28:08
Raghotham Murthy: Yeah, absolutely. I mean, one of the main things that you realize after building a system and then getting a bunch of customers who are the initial users of the platform is that there are, as you said, sharp edges. It's very easy to get into bad states; the amount of error checking, or error propagation, is something you don't pay as much attention to early on, because you're mainly trying to establish the viability of the technology overall. So for the most part, at least initially, we would have to hand-hold customers through setting up DataCoral, setting up these connectors, and then as the data was flowing they would be able to fend for themselves. But then if there were errors and things like that, instead of them knowing about it through a tool or whatever, they might run into problems: hey, the data is not fresh, what happened? So we had a Slack channel where people could just ping us. Along the way, we have clearly made our overall platform a lot more robust. We've gotten to a point where our customers typically don't have to worry about data quality; we catch errors sooner than anybody else can notice them, and we're able to fix them. And again, all of this is happening because we're providing these whole automated data pipelines as a service. We use this notion of a cross-account role that allows us to monitor everything that's happening in the customer installation, while still not having access to any of the data; the data itself is encrypted using keys that our roles don't have access to. So this whole combination of providing a SaaS offering, but within the customer VPC, has allowed us to, in some sense, give ourselves the time to build out the automation while still using operations to make sure that everything is actually working well.
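As a rough illustration of the cross-account monitoring pattern Raghu describes, the sketch below shows a vendor assuming a limited role in a customer AWS account to read operational alarms, while data access stays blocked because that role is never granted permission to the customer's KMS keys. The account ID, role name, and alarm usage are assumptions for illustration, not DataCoral's actual implementation.

```python
# Sketch of cross-account monitoring without data access: assume a limited
# role in the customer account and read CloudWatch alarm state only.
# The role ARN and session name are illustrative assumptions.
import boto3

CUSTOMER_MONITORING_ROLE = "arn:aws:iam::111122223333:role/vendor-monitoring"  # hypothetical

sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=CUSTOMER_MONITORING_ROLE,
    RoleSessionName="pipeline-health-check",
)["Credentials"]

cloudwatch = boto3.client(
    "cloudwatch",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)

# Check for pipeline components currently in alarm, without ever touching
# the data those components process (the role has no KMS key permissions).
alarms = cloudwatch.describe_alarms(StateValue="ALARM")
for alarm in alarms["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateReason"])
```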
29:57
Aaron Gibralter: I don't know if you want me to go into these details, Raghu, but I think there's some interesting stuff if we get into the nuts and bolts. One of the pain points we ran into: the original ETL system that DataCoral was working with, and frankly it was similar to what we had going in our own Airflow ETLs, was this concept of pulling data out by the updated_at column. We have timestamp columns in our Postgres database that presumably record the last time the row was touched. Unfortunately for us, in our application the timestamp columns are not trigger-based in the database, so it's up to the application or the query writer to always set the timestamp at update time. And we found that there were actually cases of bulk updates that we did that would affect a large number of rows but not touch the updated_at timestamp. So we actually had data quality issues, or data consistency issues, between our production database and our warehouse, where certain rows would be different in the production database than what we were displaying in our data warehouse. This led to a number of pain points in our analyses, especially around candidate pipelines, things like application stages. Application stages are a good example: the presence of a candidate at a given stage. Some of those would get updated in bulk and then not be updated in our data warehouse, so our warehouse would say something about the state of the pipeline that was not accurate. And this is something that DataCoral ran into as well, because their original approach was based on polling using the updated_at timestamp. So we worked through this with DataCoral and went through a number of different strategies to try and fix it. One involved actually adding triggers to our production database to make the timestamps automatic, but the more we thought about this, the more our team was worried about implementing something so heavy in our database just for the purpose of our ETLs and our data warehouse. So we ultimately decided not to go the route of implementing these triggers, or custom stored procedures in Postgres, and instead started to think about using logical decoding to stream changes from our database. That's ultimately the path that we went down, and DataCoral did too, and I think we've been extremely happy with the results. But, you know, that's an example of some of the work we did together to try and figure out the best way to get data out efficiently and consistently.
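To make the alternative Aaron describes concrete: instead of polling on an updated_at column that bulk updates can miss, logical decoding streams every committed change out of Postgres, regardless of how the application wrote it. Below is a minimal sketch of a logical-decoding consumer using psycopg2's replication support; the DSN, slot name, and choice of the wal2json output plugin are assumptions, and this is not DataCoral's implementation.

```python
# Minimal sketch of consuming Postgres logical decoding with psycopg2,
# the approach described above as the replacement for updated_at polling.
# The DSN, slot name, and output plugin are illustrative assumptions.
import psycopg2
import psycopg2.extras

DSN = "postgresql://replication_user@prod-follower:5432/app"  # hypothetical

conn = psycopg2.connect(DSN, connection_factory=psycopg2.extras.LogicalReplicationConnection)
cur = conn.cursor()

# Create the replication slot once; Postgres remembers how far we have read.
try:
    cur.create_replication_slot("warehouse_feed", output_plugin="wal2json")
except psycopg2.errors.DuplicateObject:
    pass  # slot already exists

cur.start_replication(slot_name="warehouse_feed", decode=True)


def handle_change(msg):
    # msg.payload is a JSON document describing INSERTs, UPDATEs, and
    # DELETEs, including bulk updates that never touched updated_at.
    print(msg.payload)
    # Acknowledge progress so Postgres can discard WAL up to this point.
    msg.cursor.send_feedback(flush_lsn=msg.data_start)


cur.consume_stream(handle_change)
```

In a real pipeline the handler would apply each change to the warehouse instead of printing it, which is what makes the warehouse consistent with production even when updated_at is stale.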
32:47
Raghotham Murthy: Yeah, absolutely. I think this is a great example where, for us, it was Greenhouse pushing us to get to the next level, and we were building it as we had these conversations with them. A customer being able to work through these kinds of situations, these kinds of problems, with you is actually incredibly valuable for an early stage company. And also, the timing was right: this was around the same time that RDS in Amazon was starting to support logical decoding, so we were able to just leverage that and provide a seamless way of pulling all the changes from these databases and applying the changes in the warehouse.
33:28
Tobias Macey: And so once you made the cutover and started using DataCoral more full time, and were getting ready to sunset your existing infrastructure, what were your evaluation criteria for determining that the data quality was sufficient, that you were able to replicate all of the prior capabilities, and that everybody was able to do the work that they needed to do once you had made the cutover?
33:56
Aaron Gibralter: So yeah, as to your question about the big wins and knowing when things are ready: as I said a little bit before, it was a gradual process for us. Most of what we were building in this test phase with DataCoral was new, new analyses and new reports, so it was really once we were confident that the new reports had happy customers that we knew we could cut over everything. The specific example, the main guinea pig project that we took on, was a set of dashboards that was originally called the QBR deck; now it's an EBR. QBRs are quarterly business reviews, and EBR means executive business review. It's a pretty common concept in SaaS enterprise companies, where an account manager or a customer success manager will sit down with a customer and talk about: hey, how are things going, how can we make sure that you are achieving your goals as a company, and how can you use our product to achieve those goals? So it's a pretty common concept in the world of SaaS software. We had our original EBR deck powered by those original ETLs, and that's where we would often run into the data inconsistencies that I was talking about before. So what we decided to do was build out the new EBR dashboard, which, again, would be the dashboard in our BI tool that a CSM would use to generate the charts for their specific customers, that they would then either print out as a PDF or copy and paste into a PowerPoint presentation to talk through with those customers. We decided to rebuild that using the DataCoral materialized views and the DataCoral data. That was one of the big first use cases, and it went over pretty successfully; we were able to get a team of dozens of customer success managers to use it. And once that was up and running, and they weren't using the old one anymore, that was a signal to us that we could start to use DataCoral for more of these kinds of workflows. The next one, which was pretty big, was automating our financial reporting: taking data from our transactional financial system, with all the line items of what our customers are paying for, and reconciling that with Salesforce, all in SQL, through a series of materialized views that then create a dashboard for our finance team. We built what we call the ARR momentum report, our annual recurring revenue momentum report, which shows how much our customers are paying us, what is changing over time, breaking that down by segment and slicing and dicing it different ways, and then allowing them to download a copy of that data from our BI tool that they can pull into Excel and massage further to understand it in different ways. That was a huge win, and our finance team is super happy with it. There's another use case that also involves the customer success team. As I mentioned before, one of our biggest stakeholders is our customer success team, and the customer success team here at Greenhouse uses a tool called Totango to manage relationships with customers. It's kind of a CRM-type tool specifically for customer success.
And basically, it helps automate keeping tabs on customer health, and automating communication with the customers, like sending out email communications and surveys and other things like that. The way that Totango works is that, in order to understand customer health, you have to give it data about your customers. There are two main ways to do that: they have a JavaScript library you can throw on your page, and you give it some JavaScript instructions and it will instrument your application and track your customers; and Totango also has an API where you can send it events from a server, from a back end. When we were evaluating Totango, and this is me wearing my product engineering hat, as a product engineer I'm always very hesitant to throw third-party dependencies into core product workflows. As much as possible, I like to avoid putting additional JavaScript on a page that can either have errors or cause load time to increase; you're essentially running someone else's software, someone else's code, on your page, and that always makes me nervous. So I knew that including the Totango JavaScript on our page was not something I was a fan of, and it was something that I discouraged our team from doing, and I mentioned that during our discovery phase with Totango. So the obvious next choice is to pipe data in through the API. Basically, Totango has an API and you can say: here's an event, this thing happened, a customer navigated to this page, or a customer made a hire in our software, and that would contribute to the health score. Here again, the naive implementation, I think, would be to actually litter the code base itself, our production code, with instrumentation calls to Totango. You know, in the controller that handles a hire being made in Greenhouse, we could fire off an event to Totango to say, hey, a hire was just made. But that would mean that our code base would start to get littered with this instrumentation that we would then have to maintain over time, and as behavior changed we would have to change it, and to me that seemed like a very bad idea as well. So I immediately suggested that, instead of this being a product engineering problem, we shift it onto the data side, the data science and data engineering side. DataCoral has a Totango publishing slice that allows us to send data to that Totango API. What's nice for us is we don't have to worry about what that API looks like; what shape of data it wants is really all that matters, and it's really the shape that DataCoral expects the data to be in so that it can be sent on to Totango. So what we're able to do is write a series of materialized views to transform the data into the right shape, and DataCoral handles the rest: it will periodically push those events to Totango, and everything is handled asynchronously and doesn't interrupt any of the product. The product engineers don't even worry about this, they don't think about it, and it's really nice for them not to have that on their mind at all. So this is a fantastic use case where we have data and we want to send it to some other system, and we can transform it and send it to that system without getting in the way of any other work.
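As a rough illustration of the publish pattern Aaron describes (shape the events in the warehouse with materialized views, then push them to the customer-success tool on a schedule instead of instrumenting the product code), here is a small sketch. The view name, endpoint, token, and payload fields are assumptions, not Totango's real API, and in Greenhouse's case the DataCoral publishing slice does this work instead of a hand-rolled script.

```python
# Sketch of publishing warehouse-derived events to a customer-success tool,
# rather than instrumenting the product code base. The view name, endpoint,
# and payload fields are illustrative assumptions, not Totango's real API.
import psycopg2
import requests

WAREHOUSE_DSN = "postgresql://etl@warehouse:5439/analytics"        # hypothetical
EVENTS_ENDPOINT = "https://events.example-cs-tool.com/v1/events"   # hypothetical
API_TOKEN = "replace-me"                                            # hypothetical


def publish_recent_hires():
    """Read events already shaped by a materialized view and POST them out."""
    with psycopg2.connect(WAREHOUSE_DSN) as conn, conn.cursor() as cur:
        # analytics_cs.hire_events is assumed to be a materialized view that
        # has already transformed raw application data into one row per event.
        cur.execute(
            "SELECT account_id, event_name, occurred_at "
            "FROM analytics_cs.hire_events "
            "WHERE occurred_at > now() - interval '1 hour'"
        )
        for account_id, event_name, occurred_at in cur:
            requests.post(
                EVENTS_ENDPOINT,
                headers={"Authorization": f"Bearer {API_TOKEN}"},
                json={
                    "account_id": account_id,
                    "event": event_name,
                    "timestamp": occurred_at.isoformat(),
                },
                timeout=10,
            )


if __name__ == "__main__":
    publish_recent_hires()
```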
41:36
Tobias Macey: And so now that you have been using DataCoral for a while, and you're able to get all of your ETL processes done just using their capabilities, without having to have any dedicated data engineers, I'm wondering, one, how you would characterize the overall experience of yourself and the people who are directly working with DataCoral, and also how you're using the time that you freed up from maintaining your prior ETL pipelines.
42:06
Aaron Gibralter: Yeah, I think that's a great question. For the most part, the experience of the people on my team working with DataCoral has been great; we've had a good working relationship, and there have been times where we've had to work through some stuff that was not ready. And, you know, as Raghu mentioned, the experience that we've had is probably a unique one, having grown with DataCoral. But overall, it's really easy for a data scientist on our team to think about the data flows in the form of SQL transformations, essentially materialized views. SQL is really a lingua franca, a common language that's easy for all of us to understand, and having data go from one table to another to another, as opposed to flowing through hard-to-define scripts or transformations that aren't so straightforward, makes the whole system really approachable for both our team and our collaborators. So even the slightly less technical folks embedded within the stakeholder teams, the people that work in CS operations, or marketing operations, or sales operations, we can talk to them about materialized views and collaborate on what those should look like and what shape the data should be in, and it all makes sense to everyone, as opposed to being this black box of data flowing in all sorts of directions. And in terms of what we're doing with that spare time, I think we can basically pour it back into the higher-leverage activities like analyses, or even predictive analytics. On the data engineering side, the engineers within our organization that have helped with data engineering over the years, rather than being dedicated data engineers, are able to work on other parts of our internal tooling that are a bit higher leverage, for example our CI/CD platform, how code gets from a developer's machine to production. We've built out an actual internal PaaS here at Greenhouse where we are able to deploy to ephemeral dev environments, and product engineers who join our team are kind of awestruck by some of the processes we have that allow this really easy testing and staging on the fly. I think we've been able to invest a lot in that because we don't have to work on data engineering. I'm not going to chalk it all up to that, I think there are a lot of other pieces to the puzzle, but in short, as I said before, not having to worry about these ETLs allows us to focus on higher-leverage activities.
45:12
Tobias Macey: And I'm just wondering if you can also quickly talk through what the current workflow looks like for building and maintaining the data flows that you're deploying onto the DataCoral platform, what the interaction pattern looks like, how you're managing and organizing the code that you're deploying for managing those data flows, and how you ensure discoverability or visibility of what the flows are.
45:39
Aaron Gibralter: That's a fantastic question. That's kind of one of the next big things for our team. As we scale, as we hire more data scientists, it's going to be extremely important for us to have that discoverability and a structure that makes sense, because if we don't, I think there's going to be a lot of rework, or stepping on each other's toes. I think this is also an area that DataCoral is working on; it's another piece that I know DataCoral is working on a lot, and I hope that they feel that we're contributing again in terms of feedback. But there's some work to do here in terms of how to standardize these workflows. So, to get into the nuts and bolts: DataCoral provides a CLI tool. You run datacoral, and then a command like datacoral organize, and then you can say matview create, materialized view create, and then you specify a path to a file, a DPL file, a data programming language file. That's essentially a SQL command, with some comments and some annotations at the top of it that say what kind of materialized view it is, what the frequency is with which it should be refreshed, and so on. So that SQL file is the source of the transformation, it's what's going to happen. In a naive world, you basically have people just write some SQL and then run it through the CLI and create these matviews. Obviously, we want to be doing code review: if someone is going to create a new materialized view, we want someone else to approve it, and we want it all under version control. So we have a single git repository called datacoral that contains all of our materialized views in a structure that makes sense to us. We have the different schemas as top-level directories, and you can imagine a schema roughly correlating with a use case. So, say, analytics_cs, for customer success analytics: all the materialized views that power the dashboards that the CS team uses are in that directory and that schema. But what we've had to do is write some makefiles or scripts to make the process a little bit more streamlined. And then there's the gap that we don't have any kind of CI/CD, continuous integration or continuous deployment, of these things, so we still have to run them manually. Even when we open a pull request, we've had to come up with our own process, where we'll open a pull request and say, hey, I'm going to create this materialized view, can someone take a look at it? And once it's approved, then I use the CLI to deploy it. But it's not being enforced, and it's not being automated, and I'd love to get there. I think in some ways there's some work for us to do, and in some ways there's work that DataCoral is doing to make this a bit more streamlined as well.
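A small sketch of what the glue scripts Aaron mentions might look like: find the DPL files that changed on a branch and push each one through the CLI after review. The CLI subcommands and flags here are paraphrased from the conversation and should be treated as hypothetical rather than documented syntax, and the repository layout (one top-level directory per schema) follows the structure he describes.

```python
# Sketch of a deploy helper for the workflow described above: materialized
# view definitions live as .dpl files in a git repo, one directory per
# schema, and each changed file is created through the CLI after review.
# The exact CLI subcommands and flags are hypothetical, paraphrased from
# the conversation, not documented syntax.
import subprocess
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent


def changed_dpl_files(base_branch: str = "origin/master") -> list:
    """List .dpl files that differ from the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base_branch, "--", "*.dpl"],
        cwd=REPO_ROOT, capture_output=True, text=True, check=True,
    )
    return [REPO_ROOT / line for line in out.stdout.splitlines() if line]


def deploy(dpl_file: Path) -> None:
    """Create or update one materialized view through the CLI."""
    # e.g. analytics_cs/ebr_pipeline_summary.dpl -> schema analytics_cs
    schema = dpl_file.relative_to(REPO_ROOT).parts[0]
    subprocess.run(
        ["datacoral", "organize", "matview", "create",
         "--schema", schema, "--file", str(dpl_file)],  # hypothetical flags
        check=True,
    )


if __name__ == "__main__":
    files = changed_dpl_files()
    if not files:
        sys.exit("No changed .dpl files to deploy.")
    for f in files:
        deploy(f)
```

Running something like this from a CI job after a pull request merges is one way to close the "approved but manually deployed" gap described above.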
48:43
Raghotham Murthy: Yeah, so just to give yet another example of how Greenhouse is helping us move forward on this: one of the things that we have added now is a compile step. Earlier, people would just create these materialized views and we would automatically infer the dependencies and then generate the pipelines. But now, with these DPL files all in one repository, when you're trying to update one of those materialized views, you should be able to get a compile step that tells you if there's anything downstream that might get affected, because we know what the data dependencies are. And again, with Greenhouse leading the way in terms of providing the right kinds of use cases, we've been able to get started on the compile step. The idea would then be to provide a CI/CD pipeline where you change one materialized view somewhere in the middle of the dependency graph, and you should be able to not only push it to production, but also run it in a test mode, from that node all the way downstream, so that you know what the difference is going to be after the change has been applied. So these are things that we are actively working on, again with Greenhouse helping lead the way in terms of finding the right kinds of use cases.
50:03
Tobias Macey: And so as you continue to work with DataCoral, and Raghu, as you continue to work with Greenhouse, I'm wondering what you're hoping to see in the future in terms of the platform evolution, or any plans that you have going forward to add new capabilities or capacity to DataCoral.
50:24
Aaron Gibralter: Yeah, I think that piece we were just talking about is one part of it, the operationalization and productionization, basically making this whole process scale and be discoverable. The data warehouse is becoming its own production system, and with any production system you want some sort of staged approach to change management; you don't want to just be doing it live. So what we just talked about, the tools that can help stage a change to the data pipeline and show what it will affect, is a big piece of what I'm looking forward to in the future. And on the other end of the spectrum, while SQL is an incredible way to express these data transformations, I think there are some use cases where things are a little bit more complicated, or you might want to do something a bit more advanced. I think Raghu will probably speak more towards this, but I think building more sophisticated data transformations using the same system will be incredibly valuable.
51:40
Raghotham Murthy: Yeah, to add to what I mentioned, one of the things that we're hearing from Greenhouse and other customers is that they'd like to move beyond SQL, to be able to specify more complicated transformations. But we really like the whole set of abstractions that SQL provides around data dependency specification, as well as the abstraction of just saying what you want to get done, not how. So we have come up with this abstraction called the user-defined table generating function. Again, this is not new; query engines like Hive have had it for a long time. But we have come up with a way where people can plug in their Python code to do much more complicated transformations, even things like inference, batch inference, and so on, and you should be able to plug that into a data flow. The data flow specification itself is done in SQL, because that's how we are able to infer the data dependencies that then generate the data pipelines. This is one of the features that we are super excited about, because it will hopefully allow data scientists to do a lot more than just write SQL.
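For context, the precedent being referenced is Hive's table-generating functions; a minimal HiveQL example using the built-in explode() function is below (the pageviews table and tags column are invented for illustration):

    -- explode() is a built-in table-generating function in Hive: it turns
    -- one row containing an array column into one output row per element.
    SELECT p.page_id, t.tag
    FROM pageviews p
    LATERAL VIEW explode(p.tags) t AS tag;

In the DataCoral feature being described, the generating function itself would be written in Python, while the surrounding data flow stays in SQL so that dependencies can still be inferred.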
52:56
Tobias Macey: And are there any other aspects of the work that you're doing at Greenhouse, or the work that you're doing at DataCoral, or the interaction between the two companies, that we didn't discuss yet which you'd like to cover before we close out the show?
53:07
Raghotham Murthy: Yeah, actually, one of the things that maybe Aaron can talk about is the business requirement. Tobias, just to point out here, this was one of the big use cases that Aaron and I had talked about earlier but didn't get to in terms of the big wins, and it's around GDPR. Around early last year, Greenhouse was trying to figure out how they were going to get GDPR compliant on the analytics warehouse. One of the driving factors for the whole logical decoding approach to pulling data was to be able to deal with hard deletes of the data, and that is something we were able to cleanly provide on the collect side. But once the data was in the analytics database, we wanted to get to a point where we were practically anonymizing data, so that it was very easy for Greenhouse to comply with the right-to-be-forgotten rules. So when this requirement came along from Greenhouse, we worked with them pretty closely to get to a point where even data coming from their APIs, from tools like Salesforce and Zendesk and JIRA, could be anonymized using the same materialized view framework, to allow them to be compliant with the right to be forgotten. Aaron, do you want to add to that?
54:29
Aaron Gibralter: I think it was around May of last year that this was happening, and we wanted to do everything we could to be compliant. One of our big worries was: if we collect all this data in our data warehouse and don't have an easy way to propagate the deletes, that would be a big exposure point for us. So we wanted to make sure we were able to handle that. As Raghu mentioned, we worked closely with them on it, and when we first brought it up this wasn't necessarily at the top of DataCoral's roadmap, but as we spoke about it more it became clear that it would be a big piece for any company that wanted to remain compliant. So we worked together to figure out how to move from the implementation we had to one that would be compliant, and we were able to get there. That was a big win for us: we were able to make our legal team happy by saying that we do comply and do propagate those deletes.
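As a rough sketch of the kind of anonymizing materialized view being described (the table, column names, and hashing approach are invented for illustration, and DataCoral's actual mechanism is not shown here):

    -- Illustrative only: expose a de-identified copy of candidate data so
    -- that downstream analytics never see raw PII.
    CREATE MATERIALIZED VIEW analytics_public.candidates_anonymized AS
    SELECT
        MD5(c.email)   AS candidate_key,   -- pseudonymous key in place of PII
        c.applied_at,
        c.current_stage,
        c.source
    FROM raw.candidates c;
    -- When a hard delete from the source system propagates into
    -- raw.candidates, the next refresh drops the corresponding row here too,
    -- which is what makes right-to-be-forgotten requests tractable.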
55:39
Tobias Macey: And anything else that we should cover before we close out the show?
55:42
Raghotham Murthy: Yeah, I just wanted to thank Aaron and Greenhouse again for the patience that they've had as we have grown. They've been the kind of customer that any startup could dream of, and it's been exactly that kind of experience working with them. I look forward to continuing to work with them going forward.
56:03
Aaron Gibralter: The other thing: I mentioned that we worked with Periscope Data as our BI tool, but that was a number of years ago, and the current BI tool we work with is Mode Analytics. We've been super happy with Mode as the primary window onto our data warehouse. And there's actually an interesting use case here. Metabase is an open source BI tool, and DataCoral provides a Metabase slice. Metabase is a pretty powerful tool where you can write SQL queries against your data warehouse, your Redshift instance, and see the results in the browser, which is obviously what a BI tool does. But what we discovered in rolling out Metabase is that it was a little rough around the edges for something that would span the entire company. So we decided to keep Metabase as an internal tool for our engineering team and our data scientists to prototype queries, but not to roll it out across the entire company. The other part of that is that, right now, the way Metabase is set up it has relatively blanket access to our Redshift database, so anyone who has access to it, which is a small subset of people, has access to a lot of data. That's a good and a bad thing: it's good because it allows us to prototype some of these queries, but we obviously don't want everyone at the company to have access to every piece of data. So what we've been able to do, again using materialized views, is transform subsets of data into specific schemas in Redshift, and those are the schemas that we give Mode access to. Another piece of this is that we just don't transfer PII or PCI into the schemas to which Mode has access. So that's another layer of security and compliance: we're able to use materialized views to sanitize the data for more public consumption, and by public I mean within Greenhouse, of course. That's another interesting use case that has been a pretty big win for us: we can do that and sleep easier at night knowing that not everyone has access to all the data, which I think is probably a worry for a lot of people.
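A minimal sketch of the access pattern being described, using standard Redshift-style permissions; the schema names, the view, and the mode_readonly group are invented for illustration:

    -- Build a PII-free summary in a schema reserved for BI access.
    CREATE SCHEMA IF NOT EXISTS analytics_cs;

    CREATE MATERIALIZED VIEW analytics_cs.interviews_per_week AS
    SELECT department_id, DATE_TRUNC('week', scheduled_at) AS week, COUNT(*) AS interviews
    FROM raw.interviews
    GROUP BY 1, 2;

    -- Grant the BI tool's database group access to the sanitized schema only;
    -- it never receives grants on the raw schemas that hold PII or PCI.
    GRANT USAGE ON SCHEMA analytics_cs TO GROUP mode_readonly;
    GRANT SELECT ON ALL TABLES IN SCHEMA analytics_cs TO GROUP mode_readonly;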
58:46
Raghotham Murthy: Yeah, and again, the fact that Metabase is something that gets deployed inside of your VPC means that it's only accessible over your VPN, which adds to that security.
58:56
Aaron Gibralter: Exactly. But again, not everyone has access to our VPN; the customer success team, for example, isn't logging on to our engineering VPN. So that was another reason why we didn't roll it out across the company: the technical hurdles were too great.
59:16
Tobias Macey: All right. Well, thank you both for that. For anybody who wants to follow up with either of you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. Aaron, starting with you.
59:37
Aaron Gibralter: Good question. I think this is actually a piece that maybe I would have touched on before, but we didn't get into it. The thing we're seeing more and more is our customers demanding real-time, or close to real-time, data. The idea of providing that data through data dumps, once a day or every hour, is giving way to the idea that you could transform and stream data in real time. This is definitely something that people want, but the tooling is still complex: getting data out of a production system, maintaining that, and sending it on in real time. I think we're getting close, but I'd like to see that improve. And that's something, again, that I've talked with Raghu a lot about, and I think DataCoral is thinking about as well: how do we leverage logical decoding to pull all the changes coming from our Postgres database into our data warehouse, and what are efficient ways to transform that in real time? Raghu?
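For context, logical decoding is a standard PostgreSQL feature; a minimal sketch using the built-in test_decoding output plugin is below. The slot name is arbitrary, and a production pipeline like the one described would typically use a richer output plugin and a streaming consumer rather than polling SQL functions:

    -- Requires wal_level = logical in postgresql.conf.
    -- Create a logical replication slot that captures row-level changes.
    SELECT * FROM pg_create_logical_replication_slot('analytics_slot', 'test_decoding');

    -- Inspect the INSERT/UPDATE/DELETE changes accumulated on the slot
    -- without consuming them; pg_logical_slot_get_changes would consume them.
    SELECT * FROM pg_logical_slot_peek_changes('analytics_slot', NULL, NULL);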
1:00:51
Raghotham Murthy: Yeah, from my perspective, the biggest gap is the complexity of the overall toolchain across all the kinds of functionality that's needed for data management. It's still very hard: even though there are quite a lot of options, it's actually pretty hard for any one company to say, okay, now I know exactly what the right end-to-end toolkit needs to be. We are doing a little bit to help standardize tooling for end-to-end data flows, but as more and more companies have lots more data and lots more kinds of use cases, I think the problem only keeps growing: there are more and more options, each one does a small sliver of what you need, and it's up to you to put them all together yourself. That's something we are trying to put a dent in.
1:01:52
Tobias Macey: All right, well, thank you both again for taking the time today to talk through your experiences. It's definitely valuable to get some insight into the ways that different people are running their engineering teams and managing their data platforms. So I appreciate both of you taking the time today, and I hope you enjoy the rest of your day.
1:02:12
Aaron Gibralter: Thank you so much, Tobias.