Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

episode 92: Solving Data Discovery At Lyft [transcript]


Summary

Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has built the Amundsen platform to address the problem of data discovery, and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is, then give this a listen and then try out Amundsen for yourself.

Announcements
  • Welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Finding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode – the advanced analytics platform that Lyft trusts – has compiled 3 reasons to rethink data discovery. Read them at dataengineeringpodcast.com/mode-lyft.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self service data access at Lyft
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by explaining what Amundsen is and the problems that it was designed to address?
    • What was lacking in the existing projects at the time that led you to building a new platform from the ground up?
  • How does Amundsen fit in the larger ecosystem of data tools?
    • How does it compare to what WeWork is building with Marquez?
  • Can you describe the overall architecture of Amundsen and how it has evolved since you began working on it?
    • What were the main assumptions that you had going into this project and how have they been challenged or updated in the process of building and using it?
  • What has been the impact of Amundsen on the workflows of data teams at Lyft?
  • Can you talk through an example workflow for someone using Amundsen?
    • Once a dataset has been located, how does Amundsen simplify the process of accessing that data for analysis or further processing?
  • How does the information in Amundsen get populated and what is the process for keeping it up to date?
  • What was your motivation for releasing it as open source and how much effort was involved in cleaning up the code for the public?
  • What are some of the capabilities that you have intentionally decided not to implement yet?
  • For someone who wants to run their own instance of Amundsen what is involved in getting it deployed and integrated?
  • What have you found to be the most challenging aspects of building, using and maintaining Amundsen?
  • What do you have planned for the future of Amundsen?
Contact Info
  • Tao
    • LinkedIn
    • feng-tao on GitHub
  • Mark
    • LinkedIn
    • Website
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Amundsen
    • Data Council Presentation
    • Strata Presentation
    • Blog Post
  • Lyft
  • Airflow
    • Podcast.__init__ Episode
  • LinkedIn
  • Slack
  • Marquez
  • S3
  • Hive
  • Presto
    • Podcast Episode
  • Spark
  • PostgreSQL
  • Google BigQuery
  • Neo4J
  • Apache Atlas
  • Tableau
  • Superset
  • Alation
  • Cloudera Navigator
  • DynamoDB
  • MongoDB
  • Druid

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-08-05  51m
 
 
00:12  Tobias Macey
Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you get everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too, with worldwide data centers, including new ones in Toronto and Mumbai. For your machine learning workloads, they just announced dedicated CPU instances. So go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of this show. Finding the data that you need is tricky, but Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode, the advanced analytics platform that Lyft trusts, has compiled three reasons to rethink data discovery. You can read them at mode.com/lyft. That's mode.com/L-Y-F-T. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum.
Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts when you register, and go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and co-workers. Your host is Tobias Macey, and today I'm interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self-service data access at Lyft. So Mark, can you start by introducing yourself?
02:33  Mark Grover
For sure. Thank you for having us, Tobias. I'm Mark. I'm a product manager at Lyft. I previously worked at Cloudera as an engineer working on big data systems, having contributed and committed to many open source projects. At Lyft, I focus on data products, including products for data discovery and data trust, which is the core of the conversation today.
02:54  Tobias Macey
And do you remember how you first got involved in the area of data management?
02:58  Mark Grover
Yeah, it was by chance. I've been working in the big data space for a while, and I came to Lyft and was in this general space of data products, which is: what products can you build to make data users at Lyft more effective and productive? During the course of user interviews, and given the speed at which the business was growing, that's how I really got into the space of data discovery and data quality.
03:20  Tobias Macey
And can you introduce yourself as well?
03:23  Tao Feng
Sure. Thanks for having me. My name is Tao. I'm an engineer on the data platform team. I primarily work on the Amundsen project, and also on Apache Airflow, which is a workflow management system. Before my time at Lyft, I worked at LinkedIn on infrastructure performance and data-related projects.
03:42  Tobias Macey
And do you remember how you got involved in the area of data management?
03:45  Tao Feng
Mostly, my time getting involved in data management came when we saw a pain point in data discovery and data management, and we saw a need to build something for users to solve this pain point at Lyft.
03:59  Tobias Macey
And so the Amundsen project has recently been released as an open source project, and you both have presented on it at a couple of different venues. I'm wondering if you can start by explaining a bit about what it is and the problems that it was designed to address at Lyft?
04:15  Mark Grover
Yeah, for sure. Let's start with the problem. What was happening at Lyft a few years ago was that the traffic on the Lyft app was increasing exponentially, and that led to an increase in the amount of data that the organization had to store as well as process. So that was the first problem of scale: the scale in the amount of data. The second problem of scale was that the number of employees, people who were using data day to day to make decisions and drive the business forward, was also increasing exponentially. There was a point in time where Lyft doubled in size every year. Those two kinds of scale combined led to a problem where people who had been around for a long time carried tribal knowledge in their heads about the various data systems and data sources in the company, and newer employees, who were still very smart, effective individuals, didn't have that context to do their jobs well. So that was one problem. Then people started solving this problem through various ways of figuring out answers to questions, and the most hated way was to use Slack. Productivity went down because people were using Slack to ask questions like, "Oh, I just joined, and I want to optimize ETAs for Lyft. Where do I find the source of truth for ETA?" Someone will tell you something, and hopefully a bunch of people will tell you the same answer, but many times different people will tell you different answers. Then you have to figure out the much harder question: is this data trustworthy? And once you have figured out that this data is trustworthy, you have to figure out what the right model for this table is. Is it ETA at what time? Because ETA is shown in your app at multiple different times: you get shown it before you request the ride, after you request a ride, and so on and so forth. So you've got to build this model of the data and figure out: do I join with something else?
What keys do I use to join? So there was this whole problem, first of discovery and trust, and second of understanding, and that really drove the productivity of our data users down. That was the problem we were trying to solve. We wanted an experience equivalent to what Google is on the web, where you see relevant results right at the top and you have a very quick, swift experience for search and discovery, and we wanted to have that same experience at Lyft. That led to a product called Amundsen. If I may, I'll describe at a very high level what that form factor is now, and then we can talk a little more about it later. The form factor is that you go in and there's just a search box, where you can search for any data asset at Lyft. Say you search for ETAs, the example I was sharing earlier. You get a list of different data assets. Now, these assets can be tables, so these are tables or views in various different databases, and soon they will be dashboards as well, so you can see work that's been done in dashboards and other analyses or notebooks in this system. Later on in the roadmap, you will find other kinds of assets, like streaming applications or Kafka topics, and so on and so forth. But focusing on the table experience: you click on the table and you see all the information on this table. So the schema, the name of the table, the descriptions, the columns and the profile of each column (the mins and the maxes and so on and so forth), and the frequent users of this data. Then you see a preview of the data as well as a skeleton query on this data to get you started. The key point we wanted to keep in mind is that we wanted as much as possible to be automated metadata, right? So no one is going in and saying, "This is the right table to use."
And we put a lot of time into what kind of metadata, and what kind of opinions we need to form on top of this metadata, in order to build that model of trust and serve that model of trust right from the very get-go.
08:00  Tobias Macey
One of the things that I'm curious about coming out of that description is some of the internal vetting or quality control that goes into determining how a data set ends up being listed in the Amundsen catalog, and then any sort of ranking information or user feedback for determining what the correct placement is based on any type of query that you might be issuing to find a given data set.
08:28  Mark Grover
Yeah, fantastic question. I think there is a scale here, right? On one side is curation, where a BI engineer or data engineer is curating every single data model that's coming out, and making sure it's high quality and remains high quality all the time. On the other side is a place of complete chaos, where you have so much data, it's growing in a more democratic fashion, and not everyone is aware of what all the places to go to are, and so on and so forth. I would say most companies are somewhere on that spectrum. Hopefully you're not on the chaos side, and I don't think a growing company can be on the curation side. So for the example that you chose, Tobias, about the ranking, what Amundsen does is use two axes for ranking: one is popularity, and the other is relevance. When you type "eta", it matches "eta" against a bunch of different fields within the table (the name of the table, the description of the table, the column names, the column descriptions, and the tags) to surface all the things that match the term. But then it also ranks them based on popularity, popularity being the amount of SQL querying that happens on that table. So tables that get queried more show up higher in the search results than tables that get queried less. We also attach different weights to automated queries, like an ETL job, versus ad hoc queries, like a human querying the table. That's how we started to build this intelligence around what is a good proxy for trust. In this example, we just used popularity, or the amount of SQL queries that get written against the table, as a measure of trust.
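The two ranking axes described in this answer can be illustrated with a small sketch. This is not Amundsen's actual search code: the `TableDoc` type, the weight values, and the choice to sort by popularity first and relevance second are all assumptions made for illustration, since the conversation does not say how the two axes are combined.

```python
from dataclasses import dataclass, field

# Hypothetical weights: a human's ad hoc query counts more than an automated ETL query.
QUERY_WEIGHTS = {"adhoc": 1.0, "etl": 0.1}


@dataclass
class TableDoc:
    """Hypothetical searchable record for one table."""
    name: str
    description: str = ""
    columns: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    query_counts: dict = field(default_factory=dict)  # e.g. {"adhoc": 40, "etl": 300}


def relevance(doc, term):
    """Count how many searchable fields mention the term (case-insensitive)."""
    term = term.lower()
    fields = [doc.name, doc.description, *doc.columns, *doc.tags]
    return sum(1 for text in fields if term in text.lower())


def popularity(doc):
    """Weighted query volume; query kinds without a configured weight count as zero."""
    return sum(QUERY_WEIGHTS.get(kind, 0.0) * n for kind, n in doc.query_counts.items())


def search(docs, term):
    """Return tables matching the term, most popular first, ties broken by relevance."""
    hits = [d for d in docs if relevance(d, term) > 0]
    return sorted(hits, key=lambda d: (popularity(d), relevance(d, term)), reverse=True)
```

With these assumed weights, a table that analysts query by hand outranks one that is hit mostly by scheduled jobs, which matches the idea of using human query volume as a proxy for trust.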
10:09  Tobias Macey
You said that before Amundsen came into production at Lyft, a lot of the way that data discovery happened was just by word of mouth or knowing who to ask. I'm wondering, at the time that you first started building Amundsen, what the landscape looked like as far as products or systems that were available for solving this data cataloguing and discovery problem, and what was lacking in those existing options that led you down the road of building something from scratch?
10:41  Mark Grover
Yeah, for sure. At Lyft, there were a few attempts at solving this problem. There was a wiki, which had a page per table, and somebody kept it up to date, but that was very hard to enforce. The most common way was twofold. For discovery, people would just ask on Slack, and it really depends on how mature your team is, when it was formed, and whether the person who was leading that team or worked on the data for that team is still around. So if you're a new scientist, you go ask other scientists on the team. Sometimes you've got to figure out how this is instrumented, so you would go to the engineering team. There was just a lot of time being wasted in figuring that stuff out. The tools people used were one of three: they used this outdated wiki page, they used Slack to ask these questions, and lastly they would go to the various sources of metadata themselves. So they'd go read the ETL code themselves, or look at the Airflow DAGs and try to make the connection between the tables and the corresponding Airflow DAGs. All that was super painful. What we learned was that people were spending 30 to 35%, or upwards of that, of their entire workflow on this. And these people are hired to write ML models or write SQL queries. They spent all that time actually just finding out: what is the right thing for me to use? What is the right thing for me to join?
11:59  Tobias Macey
In terms of the overall data ecosystem that Amundsen is fitting into, what are some of the source systems and downstream systems that you are integrating with? I'm also curious how it compares to some of the other work that's being done in parallel. Most notably, what comes to my mind is the Marquez project from WeWork.
12:19  Mark Grover
Yeah, so I'll help answer the first part of that question, and Tao has a lot more thoughts on the latter. In terms of the integration, Lyft is a very heavy S3 user, and the S3 data is registered in the Hive metastore. We use Hive and Presto to access this data, so the topmost integration for us was with Hive and Presto, and essentially anything that uses the Hive metastore, which could be Spark and so on. So that was the first integration we built. As a second integration, and we can talk more about this later, we built an integration of people. Essentially, we have a graph of data, with nodes and edges in this graph, and we added people to this graph very recently. So we integrated with our HR system to get that information into this graph as well. In terms of what has happened more recently, after we open sourced it, a lot of community members have started to use it and have contributed various data sources. So now we have an integration with Postgres, which brings in integration with things like Redshift, because it follows the same model. And we also have an integration with BigQuery from Google. Both of these are thanks to the amazing community members that we have.
13:40  Tao Feng
Yeah, I can take that question. To compare Amundsen with WeWork's open source project, Marquez, let me first give a brief introduction to the architecture of Amundsen and then make the comparison. Amundsen is more focused on the data discovery perspective. It has three microservices. One is a front-end service that users go to for data discovery, and it is backed by two other backend services: one is called the metadata service, the other is the search service. The metadata service is very modular and pluggable, so it can talk to any persistence layer. By default we use Neo4j as the persistence layer, and the community contributed an Apache Atlas proxy so that Atlas can be the persistence layer for this metadata. The search service powers the search queries run by the front-end service. To compare with Marquez from WeWork: my understanding is that Marquez is a metadata project focusing on metadata and data governance, and it is more aligned with, or similar to, Apache Atlas. So in this case, if we had a proxy layer to talk to Marquez, it would be possible to use Amundsen with Marquez as the backend engine.
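The pluggable persistence layer described here can be sketched as a proxy interface that the metadata service depends on, with one implementation per backend. The class and method names below (`MetadataProxy`, `get_table`, `put_description`, `InMemoryProxy`, `MetadataService`) are hypothetical and are not taken from the Amundsen code base; the in-memory class only stands in for a real backend such as Neo4j or Apache Atlas.

```python
from abc import ABC, abstractmethod


class MetadataProxy(ABC):
    """Hypothetical interface that the metadata service codes against."""

    @abstractmethod
    def get_table(self, table_key: str) -> dict:
        ...

    @abstractmethod
    def put_description(self, table_key: str, description: str) -> None:
        ...


class InMemoryProxy(MetadataProxy):
    """Stand-in backend; a Neo4j or Atlas proxy would implement the same methods."""

    def __init__(self):
        self._tables = {}

    def get_table(self, table_key):
        # Create an empty record on first access so lookups never fail.
        return self._tables.setdefault(table_key, {"key": table_key, "description": ""})

    def put_description(self, table_key, description):
        self.get_table(table_key)["description"] = description


class MetadataService:
    """Depends only on the proxy interface, so the backend can be swapped."""

    def __init__(self, proxy: MetadataProxy):
        self._proxy = proxy

    def describe(self, table_key):
        return self._proxy.get_table(table_key)["description"]

    def update_description(self, table_key, description):
        self._proxy.put_description(table_key, description)
```

Under this design, supporting another backend such as Marquez would mean writing one more proxy class rather than changing the service itself.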
15:09  Tobias Macey
And you mentioned the overall architecture of how Amundsen is currently implemented. I'm wondering how it has evolved since you began working on it, and what some of the primary assumptions were that you had going into the project that have been challenged or updated in the process of building and using it.
15:25  Tao Feng
To talk about how the Amundsen architecture has evolved: we started this project last year, around April or May. Initially, we had a lot of architecture discussions, and we settled on a three-microservice architecture: the front-end service, the metadata service, and the search service. We decided to use a pull approach to get the metadata. For that we built a generic data ingestion library called Databuilder, in which we implement an interface so that it can talk to different metadata sources, for example BigQuery, Postgres, Redshift, or Hive. We started with Hive because Hive is the most used data system at Lyft. The overall architecture has stayed the same, but the implementation has changed quite a bit. For example, initially we wanted to keep the Amundsen metadata store in sync with the upstream source. What that means is that we pull the metadata from the Hive metastore, for example the table name, column names, and column descriptions, persist it in the Neo4j graph, expose it through the metadata service, and use it in the front end. When a user tried to modify a description, for example, we would not only persist it in our Neo4j graph, we would also persist it back to the Hive metastore. We found this kind of coupling has a lot of limitations. For example, it doesn't work with tables that get fully rebuilt, because if a user does a full rebuild of the table, the modified description will be lost, since the original description is persisted in some GitHub file. The second change we made is that initially we started with a pull model to get metadata, and now we have evolved into a mixed pull and push model. The pull model is great for getting started with metadata. But once we reach scale, let's say a lot of teams and a lot of organizations at Lyft want to push metadata into Amundsen, it is hard to build a different indexing job for everyone.
So we started to leverage a Kafka-based metadata push model for these purposes.
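The difference between the pull and push models can be sketched as follows. This is a simplified illustration, not the Databuilder API: the `Extractor` interface, the record shape, and the function names are hypothetical, and a standard-library `queue.Queue` stands in for a Kafka topic so that the example runs without any external service.

```python
from queue import Queue


class Extractor:
    """Hypothetical pull-side interface: one implementation per metadata source."""

    def extract(self):
        raise NotImplementedError


class StaticExtractor(Extractor):
    """Stand-in for a source-specific extractor (Hive, Postgres, BigQuery, ...)."""

    def __init__(self, records):
        self._records = list(records)

    def extract(self):
        yield from self._records


def pull(extractor, store):
    """Pull model: the catalog runs a job that reads from the source on its own schedule."""
    for record in extractor.extract():
        store[record["table"]] = record


def drain(topic, store):
    """Push model: producers publish records and the catalog consumes what has arrived."""
    while not topic.empty():
        record = topic.get()
        store[record["table"]] = record
```

In the pull model the catalog team owns an extractor for every source, while in the push model each producing team only has to publish records in an agreed shape.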
17:49  Tobias Macey
And for populating the metadata store, you said you're using this Kafka engine. I'm wondering if you can talk through the overall life cycle of maybe a new table being created or a new data set being published, and how that would flow through Amundsen to be discoverable, and then the workflow on the other side of somebody using Amundsen to find it, and any assistance that Amundsen provides in terms of specifying the connection information so that somebody can then just start querying that information.
18:23  Tao Feng
So, the workflow at Lyft is typically that someone, for example an analyst or a data engineer, tries to create a data set. They will start with some prototyping in their personal schema, writing SQL in a BI tool such as Mode, Tableau, or Superset, to get a SQL query running and get the expected data format or data model going. Once they are satisfied with the data model, they will normally create an Airflow DAG (Airflow is the main workflow management system used at Lyft) to populate this table in a daily, cron, or batch job fashion. Once this table has been populated, if it is created under certain schemas, for example the core schema or an input schema, Amundsen at Lyft has an index job called Databuilder, which runs inside an Airflow DAG and pulls this metadata from the Hive metastore on a recurring schedule. For example, once this table has been created, it shows up as a record inside the Hive metastore. When our Databuilder job runs, it will pull this record from the Hive metastore and persist this information into Neo4j. Neo4j is a graph database, so it will create certain graph nodes. For example, for the table name it creates a table node, for each column it creates a column node, it builds certain relationships, and so on and so forth. Once this table has been created in the graph, it will do a search index to make sure this table is searchable from the front end. Once this index job has finished, the table is available for users to consume. When users go to the Amundsen UI and search for a table, it will show up in the search results based on relevancy and popularity.
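The step where a metastore record becomes graph nodes and relationships can be sketched like this. The node labels, key format, and the `COLUMN` relationship name are assumptions for illustration and may not match the model Amundsen actually writes to Neo4j; the function only builds plain Python structures and does not talk to a database.

```python
def table_to_graph(schema, table, columns):
    """Turn one metastore record into graph nodes and edges (hypothetical model)."""
    table_key = f"{schema}.{table}"
    nodes = [{"label": "Table", "key": table_key, "name": table}]
    edges = []
    for column in columns:
        column_key = f"{table_key}/{column}"
        # One node per column, linked back to its table.
        nodes.append({"label": "Column", "key": column_key, "name": column})
        edges.append((table_key, "COLUMN", column_key))
    return nodes, edges


nodes, edges = table_to_graph("core", "rides", ["ride_id", "eta_seconds"])
```

A loader would then write these nodes and edges to the graph store, and a separate step would publish the table to the search index so that it becomes discoverable.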
20:30  Tobias Macey
Now that Amundsen is in general use at Lyft, I'm curious what types of feedback you've received from your teammates and people in other data-oriented teams as far as how it has impacted their overall workflow and productivity versus what the state of affairs was before it became generally available and accessible?
20:50  Mark Grover
Yeah, for sure. We've noticed, and this is through qualitative surveys, that the productivity of the analysis workflow has increased by about 30%, primarily because we've reduced a large chunk of time that used to be spent on data discovery down to almost nothing compared to what it was before. To back it up with some data, and this actually ties into your previous question around assumptions that have been challenged: we built this tool mainly for the data scientists. At a company of Lyft's size at the time, there were about 200 or so people on the science team, and that's who we were targeting: really savvy, heavy data users who would want to use tables and views as a first pass. What we have learned is that Amundsen now has over 700 weekly active users. These are people who, because of us democratizing data, want to use more and more data. That's been a huge assumption challenged for us, in a good way, and we see that adoption number as a good proxy for people loving this tool. We also measure CSAT, both within the tool and through a quarterly survey, and that CSAT has been high, over nine out of 10. People have just been loving this tool, and that really shows in their increased productivity as well.
22:13  Tobias Macey
As I mentioned at the beginning, you recently open sourced the Amundsen project and have made it publicly available for other people to run their own instances. I'm wondering what the motivation was for releasing it publicly, and how much effort was involved in cleaning up the code base to make it accessible to the public without having too many internal assumptions about how it was going to be deployed and the systems that it was going to be interacting with?
22:41  Mark Grover
Yeah, for sure. This actually ties back to how we made the decision to build Amundsen from scratch, and I think that's a good topic to cover as well. We were really looking for an experience where you can discover and trust data really quickly. We looked at vendor products, things that had been built in the past and were available in the market; examples were Alation, Cloudera Navigator, and so on. We looked at closed source tools that companies had built for solving similar problems; in that category was Airbnb's Dataportal, and Facebook had a tool called iData. We looked at open source tools in the same space as well, so that was Apache Atlas, as well as Marquez pretty early on. Actually, Marquez didn't exist back then, they only started much later, but Apache Atlas, as well as LinkedIn's WhereHows. So we looked at all these tools, and we were looking for that experience where you could discover data with very little human curation required, figure out really quickly and nimbly what is trustworthy, and then have a vision for including people in the graph, as well as dashboards, Kafka topics, schemas, and so on. We found that there was much to be desired in that experience in all those options, and that's how we decided to build something. We have also learned and gotten a lot from open source, both in our careers and in our organizations, and we wanted this to be an open source standard for doing discovery. So we knew from the very beginning that we wanted to build Amundsen not just for Lyft, but for everybody else who's solving the same problem, and we wanted to do it in open source. That's why, when Tao talks about the microservice architecture and the repositories, we do all our development in the open. We don't have a fork within Lyft.
Instead, we have an overlay of repositories that we use just for our custom configuration. The important point is that this really reduced the amount of cleanup that was required later on when we wanted to go open source with it.
24:47  Tobias Macey
And do you have any sense of the level of coverage that you have for the overall data that's available within Lyft and what's represented within Amundsen? I'm wondering what the process is for finding any remaining data sets that aren't present, or if there have been any issues as far as the data not being easily represented or accessible to Amundsen for cataloguing and surfacing for other people to discover.
25:17  Mark Grover
Yeah, so broadly speaking, we can divide the data at Lyft into two categories. One is the online stores which power the Lyft apps that you see. These are powered by databases like Dynamo and Mongo and so on. Then there's the offline store, the world that we all hang out in more. These are your analytical systems, and of those we have Hive and Presto, we have an old Redshift cluster, we have Druid, and we have a BigQuery installation. So those are the systems in the offline store. Amundsen has historically focused more on the offline world, so it doesn't cover databases with the NoSQL-style data in the online world. In the offline world, we have built integrations, as we were talking about earlier. We now have integrations for Redshift, for Hive and Presto (the S3 world), and for BigQuery, and we have indexed all that data. We are sometimes selective: if we know some schemas are just temporary schemas or personal schemas, we are very selective about them and we never bring them into Amundsen. But in general, we have a pretty large footprint. I don't know if I have a percentage number for you to quote, but that's the way we think about the coverage of Amundsen. Do you want to add anything?
26:37  Tao Feng
Yeah. So, as Mark mentioned, Lyft has a lot of heterogeneous data sources: BigQuery, Druid, Postgres, Hive, Redshift. Our vision is to build a comprehensive data map for all the data sources and also bring relevancy to the user. For us, we try to index all the relevant tables for users, and also, for compliance purposes, we currently index all the managed schemas within Hive, so that the map can be used not only by users but also for compliance and auditing purposes.
27:16  Tobias Macey
And that brings up another interesting question, as far as how you determine whether or not a given data set should be surfaced to somebody based on compliance or regulatory reasons, or just what the general access control is for that data set. Because if somebody's searching for something, and their role is not going to grant them access to it, but then they see it listed in Amundsen, I'm wondering what the overall process is for being able to integrate and surface that information at the appropriate time?
27:49  Mark Grover
That's a great question, and we have an opinion about that. We feel that discovery of data sets at a high level should be available to all. Accessing metadata at a higher level of granularity around a data set, for example distinct values, or seeing a preview of the data, or accessing the data itself, of course depends on the data set and should be limited to only those who have access. So the approach that Amundsen has taken is that it will gather metadata and allow you to search for whatever you want to search for, regardless of whether that's privileged or not. Then, once you get to a table that's privileged and that you don't have access to, it delegates that access control to another component, Superset for us, and only then are you able to see, based on your access controls, the preview of the data, or be able to query that data. The nice benefit we've seen from this is that, historically, people would want to see if this is the table they should ask for access to, but they can't know that until they request access. So they request access, find that that's not the right table for them, and then they start discovering more tables, getting access, and trying to figure out if this new table is the right table for them to use. So that became a chicken and egg problem. We try to solve that with Amundsen by saying you can discover that there is a table out there which is described this way, but if you want to go further and actually start using this table, that is when you start requesting access.
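The split Mark describes, open discovery with gated preview, can be sketched roughly as follows. This is a minimal illustration, not Amundsen's actual code: the `DiscoveryCatalog` class, the `can_access` callback (standing in for the delegated component), and the sample table are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TableMetadata:
    name: str
    description: str
    sample_rows: List[dict]  # stands in for the data preview


class DiscoveryCatalog:
    """Hypothetical catalog: search is open to all, preview is gated."""

    def __init__(self, can_access: Callable[[str, str], bool]):
        # can_access(user, table_name) stands in for the external component
        # that owns access control; the catalog stores no permissions itself.
        self._can_access = can_access
        self._tables: Dict[str, TableMetadata] = {}

    def add(self, table: TableMetadata) -> None:
        self._tables[table.name] = table

    def search(self, term: str) -> List[dict]:
        # Anyone may discover that a table exists and read its description.
        term = term.lower()
        return [
            {"name": t.name, "description": t.description}
            for t in self._tables.values()
            if term in t.name.lower() or term in t.description.lower()
        ]

    def preview(self, user: str, table_name: str) -> List[dict]:
        # Row-level preview is returned only when the delegate approves.
        if not self._can_access(user, table_name):
            raise PermissionError(f"{user} must request access to {table_name}")
        return self._tables[table_name].sample_rows


# Made-up grants: only alice has access to the table.
grants = {("alice", "rides.trips")}
catalog = DiscoveryCatalog(lambda user, table: (user, table) in grants)
catalog.add(TableMetadata("rides.trips", "One row per completed ride", [{"ride_id": 1}]))
```

Here any user can call `catalog.search("ride")` and learn the table exists, while `catalog.preview("bob", "rides.trips")` raises `PermissionError` until access is granted elsewhere, which mirrors the "discover first, request access later" flow.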
29:24  Tobias Macey
And what have been some of the other types of feature requests, either internally or externally, that you have been most intrigued by, and some of the ones that you have consciously decided not to implement and that are out of scope for Amundsen?
29:39  Mark Grover
Yeah, so the core feature set was around discovery and trust of data, and that has gone really well, both at Lyft and in the open source community. We see a few feature requests in that application, and those are around additions to the graph. As I was saying earlier, previously there were nodes for tables, and we've added nodes for people. That was a very big request within Lyft, because when you join a team, say I join Tao's team, I can look up Tao and see what tables Tao accesses every day, which ones he has bookmarked, and which ones he frequently uses. That information is all, again, populated automatically, and it helps me get up to speed really fast. So that was a pretty popularly requested feature at Lyft. A few features that have been requested in the open source community and are also relevant at Lyft: one is lineage, and you can see various kinds of use cases come through for this. An example would be figuring out if two tables are exactly the same thing, or, if you were going to change one table, which tables this is going to impact downstream. So there are a lot of data engineering use cases showing up. The next one in line is dashboards. You can currently search for tables and people, and people want to see what previous analysis has been done, maybe so they can learn from that work, or maybe they don't even have to do the work because someone else has already done it. That's the next feature, which is in the scoping phase right now and which we want to add. But I want to stress that that's just one application, the application of data discovery and trust. Our vision for this project is that you've actually gathered a whole bunch of metadata, which has become a holy grail for a lot of applications. Tao was referring to another application we are building on Amundsen, which is compliance. So we built the application and gathered a bunch of metadata.
And now we're seeing that we can use this for a lot of other interesting use cases. So we want to build a compliance application on top of it, and then later on quality and ETL-style applications on top of it, so we actually improve the quality of the data and not just help with discovery.
Tobias Macey
For somebody who is interested in getting Amundsen running on their own infrastructure, I'm wondering what the overall process looks like for getting it deployed and integrated into their systems to start being able to gain value from it.
32:02  Tao Feng
Sure. So, first of all, we want Amundsen to be easily accessible for everyone. When we open sourced it, we wrote a lot of documentation about how to install Amundsen, and in our quick start guide we provide a simple script for users to ingest some dummy data, persist it into the Neo4j graph, and show it in the front end to give people a feel for it. Once a user understands that the whole system has three microservices and a data ingestion library, they can quickly change this setup, starting from the sample loader script, to ingest the metadata they want based on their data environment. Let's say they are mostly using Redshift: they could change it to use the Redshift extractor to get the metadata from Redshift and persist it into the Neo4j graph. The quick start uses Docker Compose, which allows you to bootstrap using Docker containers. But you could also easily deploy and install it in other ways, like AWS ECS, or deploy directly on a native EC2 instance; that's also doable.
33:17  Tobias Macey
And what have you found to be some of the most challenging aspects of being able to build and maintain Amundsen and support it as you integrate it into more systems and start feeding more data and workflows through it?
33:31  Tao Feng
So when building Amundsen there were a lot of challenging design discussions. First of all, as Mark mentioned, we designed Amundsen with open source in mind from day one. So we were thinking about how to make every interface generic, so that it works with Lyft and could also work with other systems. For example, take the databuilder library I just talked about. It has four phases: extractor, transformer, loader, and publisher. The extractor extracts the metadata from different sources, and we built a Hive metadata extractor. But the interface allowed the community to later contribute a BigQuery extractor, as well as Postgres and Presto extractors. Secondly, we thought very hard about how to make this project work with Lyft's internal infrastructure as well as externally. So when we started this project, for each of the three microservices we had one repo meant for open source, and another repo which includes all the configs for the open source repo as well as the deployment. That way, once we open sourced, it was easy to split: the Lyft-specific section lives only in the Lyft private repo, while we made the public repo open source.
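The four phases Tao names can be pictured as a small pipeline. The sketch below is a generic illustration and does not use the real databuilder API: the function names, the record shape, and the idea that the loader writes to a staging list before the publisher pushes to the store are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


def extract(source_rows: Iterable[Record]) -> Iterator[Record]:
    # Extractor: pull raw metadata records from one source system.
    for row in source_rows:
        yield dict(row)


def transform(records: Iterable[Record]) -> Iterator[Record]:
    # Transformer: normalize records into the shape the loader expects.
    for rec in records:
        yield {
            "key": f"{rec['schema']}.{rec['table']}".lower(),
            "description": rec.get("description", ""),
        }


def load(records: Iterable[Record], staging: List[Record]) -> None:
    # Loader: write transformed records to a staging area (assumed here).
    staging.extend(records)


def publish(staging: List[Record], store: Dict[str, Record]) -> int:
    # Publisher: push everything staged into the serving store.
    for rec in staging:
        store[rec["key"]] = rec
    return len(staging)


def run_job(source_rows: Iterable[Record], store: Dict[str, Record],
            extractor: Callable[[Iterable[Record]], Iterator[Record]] = extract) -> int:
    # Swapping `extractor` is what lets a new source be supported
    # without touching the other three phases.
    staging: List[Record] = []
    load(transform(extractor(source_rows)), staging)
    return publish(staging, store)


# Made-up rows standing in for what a Hive metadata extractor might return.
hive_rows = [{"schema": "Core", "table": "Rides", "description": "ride facts"}]
store: Dict[str, Record] = {}
published = run_job(hive_rows, store)
```

The point of the generic interface is visible in `run_job`: a contributor only needs to supply a different `extractor` callable for a new source system.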
Another challenge we have seen is, when we designed Amundsen, how to make the search results more relevant. For example, there could be many different tables, like an ETA table or a rides table, and different teams could have different copies. How do we make the search results more relevant to users? We worked very hard on keeping the results relevant. Initially, when we designed the search algorithm, we only searched based on column name, table name, tag, and so on. And we found that when a table hadn't been used a lot, and users typed the full table name to search, it didn't show up in the first few pages of results, because initially we ranked all the results based on their usage. Then we improved the search ranking so that, if the user searches with the actual table name, its ranking is pushed up, and even a table that is seldom used shows up on the first page. That way the user can find it, instead of it sitting on a later page as in the initial versions. Another challenge came when we built Amundsen on Neo4j: because we use the Neo4j graph database as the persistence layer, how do we design a data model that fits data discovery? For example, as Mark mentioned before, we have a table node and a column node, and later, when we added people to Amundsen, a user node. Do we make the column a separate node, or do we put it inside the table node as an attribute? We made a lot of design decisions of this kind. The reason we have separate nodes in this case is that we want to make graph traversal much faster.
For example, once we go to a table node and want to find its columns, it's just one hop to see all the columns. And if we want to know which users are using this table, it's also one hop, based on the graph relationship, to find all the users.
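The "one hop" idea can be illustrated with a tiny in-memory graph. This is only a sketch of the modeling choice, not Neo4j and not Amundsen's actual schema; the node ids and the relationship names `COLUMN` and `READ_BY` are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple


class MetadataGraph:
    """Toy graph where tables, columns and users are all separate nodes."""

    def __init__(self) -> None:
        # (node_id, relationship) -> set of neighbouring node ids
        self._edges: DefaultDict[Tuple[str, str], Set[str]] = defaultdict(set)

    def relate(self, src: str, relationship: str, dst: str) -> None:
        self._edges[(src, relationship)].add(dst)

    def neighbours(self, node: str, relationship: str) -> List[str]:
        # One hop: a single lookup keyed by node and relationship type,
        # with no need to unpack attributes stored inside the table node.
        return sorted(self._edges.get((node, relationship), set()))


graph = MetadataGraph()
graph.relate("table:rides", "COLUMN", "column:rides.ride_id")
graph.relate("table:rides", "COLUMN", "column:rides.fare")
graph.relate("table:rides", "READ_BY", "user:tao")
```

With columns and users as their own nodes, `graph.neighbours("table:rides", "COLUMN")` and `graph.neighbours("table:rides", "READ_BY")` are the same kind of lookup, which is the uniformity Tao describes.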
37:43  Tobias Macey
And what have been some of the most interesting or unexpected lessons that you've learned in the process of building and using Amundsen and seeing others use it, and some of the unexpected ways that users have put it to use?
37:57  Mark Grover
One of the goals of Amundsen was to take this tribal knowledge out of people's minds and put it in a centralized place. We thought that just because we'd removed the friction from adding this information, people would start doing it, and all of a sudden, overnight, we would have all these comments and rich metadata in there. It turns out that was overly optimistic of us; just because there is no friction doesn't mean that people are going to do it. In fact, there's a whole bunch of cultural and social cues that need to be used in order to either get ownership of the data or just encourage people to put the descriptions in there, which is one of the very few pieces that are human curated. So that's been an insight for us. From a product perspective, we are looking at what kind of incentives or social behavior we want to inculcate in the tool, so people are actually putting in descriptions and so on for the tables. The second thing is that its use for compliance and other applications was a surprise to us. We totally started out wanting to solve the data discovery and trust problem. It was only along the way that we figured out, with the heat of CCPA here in California, which is the GDPR equivalent here in California, and having this comprehensive data map, that we could start using this for governance and compliance. That was a really pleasant surprise for us, and it's something that members of the open source community have also been using it for. For example, Square is a member of the Amundsen community and they're using it for compliance. There are about 150 people in the open source community, with companies like ING, Square, and, I'm sure I'm forgetting a whole bunch, Workday. Almost all of the companies are using it for data discovery and trust, while Square, for example, is using it for compliance. And that's
40:00  Tobias Macey
Another thing that I was thinking of, as you were discussing the use cases, is the overall process of user experience design that has gone into how you build and use Amundsen, and what the interface looks like for people to be able to find it approachable. Because you can build a fabulous tool that does everything that you want, but if it's got a terrible interface then nobody's going to actually make use of it. So I'm just curious what the feedback has been on that front, and any modifications or updates that you had to make after you first launched it?
40:36  Mark Grover
Yeah, that's a good question. I think that's actually a really good opportunity to discuss something broader as well, which is that a lot of these data tools fall short in two ways. One, they don't put enough emphasis on that experience, so many of them end up having a clunky UI or a really terrible experience. And two, they are sometimes not opinionated enough in how they want to structure things or represent that experience or what the experience should be. That's something we are lucky about on the Amundsen team at Lyft, that we got help with it from the very beginning. We had designers who worked on it as a full time job, and they did a bunch of user interviews to actually make sure their design was in line with what users were expecting. And then we are also very opinionated about search ranking, for example the idea that we will use querying activity on a table as a proxy for trust, and what weights we choose for an ad hoc query versus an ETL query. That is something I think we don't do enough in the data space, where we need to build opinionated tools so folks can have an experience that's easier to maneuver and gets them more productive. That's one. In terms of answering your more direct question around surprises, I think, while the main design has remained the same, there are a few smaller things that we have changed along the way. For example, when you click on a column name in Amundsen, it shows you a profile of the column. So this is the mean and standard deviation of that column from a recent partition; it shows mins and maxes for integer columns, it shows averages where that applies, and it shows string length. And if the column has fewer than a certain number of distinct values, and you have access, it'll show you the distinct values. However, it didn't show what date that metadata was based on, and that was something that users wanted to see.
So we actually ended up adding a small tagline underneath: this was last calculated on this date. So small changes like that. In terms of big changes that are evolving in the experience, we're finding that currently we have a lot of data around schema, and then a lot of data around the behavior of the table. So we have frequent users and owners, and which Airflow DAG generated this table. We have a link to the lineage metadata, so you can see what the downstream jobs are. We have a link to the preview button and a skeleton query. And we're finding that this data is overloading that page and making it very hard to digest. So we are doing a lot of, quote unquote, scaling of the experience exercises, where we're moving this metadata around so that the most important metadata comes first and the less relevant comes later. Mostly evolutionary stuff for the project.
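The column profile Mark describes, including the "last calculated" date that users asked for, might look something like the sketch below. It is an illustration under assumptions, not Amundsen's implementation: the function name, the `MAX_DISTINCT_SHOWN` threshold, the `can_see_values` flag, and the use of population standard deviation are all choices made for the example, and it assumes a non-empty list of values.

```python
import statistics
from datetime import date
from typing import Dict, List, Union

MAX_DISTINCT_SHOWN = 5  # hypothetical cut-off for listing distinct values


def profile_column(values: List[Union[int, str]], partition_date: date,
                   can_see_values: bool) -> Dict[str, object]:
    # Always record which partition date the statistics were computed from.
    profile: Dict[str, object] = {"last_calculated": partition_date.isoformat()}

    if all(isinstance(v, int) for v in values):
        # Numeric columns: min, max, mean and standard deviation.
        profile["min"] = min(values)
        profile["max"] = max(values)
        profile["mean"] = statistics.mean(values)
        profile["stddev"] = statistics.pstdev(values)
    else:
        # Other columns: average string length.
        profile["avg_length"] = statistics.mean([len(str(v)) for v in values])

    # Distinct values are listed only for low-cardinality columns,
    # and only for users who already have access to the data.
    distinct = sorted(set(values), key=str)
    if can_see_values and len(distinct) < MAX_DISTINCT_SHOWN:
        profile["distinct_values"] = distinct
    return profile
```

Calling `profile_column([2, 4, 4, 6], date(2019, 7, 1), can_see_values=True)` yields a profile whose `last_calculated` field is `"2019-07-01"`, which corresponds to the small tagline added under the statistics.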
43:31  Tobias Macey
Having the sample query, I think, is definitely a good way to make the overall tool accessible as well, because somebody might be able to see the table and understand what the different columns might be, but having that snippet already ready to go, where they can just run it and then start experimenting with it and adding to it, definitely adds a lot more value for somebody who may not necessarily be as comfortable at the outset with a new data set.
44:01  Mark Grover
Yeah, absolutely. And that actually reminds me of one more thing: it's amazing how small little things help so much. In the bottom right corner of Amundsen we have this message box, essentially. It's a bright pink button, of course, in all Lyft colors, and when you click on it you're easily able to send some feedback in the tool. That skeleton query thing that you mentioned was not originally a part of Amundsen; we didn't think about it. It was one of those requests that you get from a user clicking on that button and just saying, hey, it'd be really cool to do this. I think having those channels open, so you have a very low friction way of passing that feedback to the people who built the tool, plays a huge role, and that one is a prime example of that.
44:48  Tobias Macey
It's definitely a great example of the fact that while we're building tools that aren't necessarily meant for external consumption by the people who are paying the business money, it's still customer interaction, where we're building something that is providing value to somebody else. Being able to have those user feedback cycles to improve the overall utility of the tool, and its effectiveness for the people it's intended to serve, is valuable no matter what segment of the business you're in, and whether the tools are internal or external.
45:22  Mark Grover
Yeah. 100%.
45:23  Tobias Macey
So what do you have planned for the future of Amundsen?
45:27  Mark Grover
Yeah, so I classify this in terms of the various applications that are built on top of metadata. The first application is data discovery and trust, and we have lots to do there. We currently have users as well as tables in that graph, and we are working on adding dashboards to that graph. As time goes on, we want to add more streaming features too. The same problem of data discovery that exists in the analytics world of tables and views and dashboards exists in the streaming world too, and it will only get worse with time. So you want to be able to discover trustworthy streams, topics, schemas, and so on, and we want to go into that territory and solve that discovery and trust problem for all data components. But it doesn't stop there; we are building and want to build other applications on top of this. An example of an application that I mentioned earlier was compliance. So how can you use all the metadata we have? Since we already know what all the tables at Lyft are and who's using them, we can tag them as PII or NPI, and be able to figure out if there is anomalous activity happening based on this and alert the right people. So that's an application that we want to build. And then down the road, we have applications around downstream impact: if you're a data engineer wanting to change a column type, or add a new column, or trying to make a backwards compatible change to a table, you can figure out who to notify and who to keep posted on those changes. But overall, if we were to step back, I think what's really missing from this space today is a data portal, one place where you go as a data user, and it is that place where you can get all your information about what changes are happening to tables. So think of a Facebook feed for your data, right? If I commonly use a table and Tao just made a change to it yesterday, I'll get a notification saying this change was made.
And then I also have a Google-like experience in the same place, where if I want to discover a new area of work, I can just type in ETAs and find a trustworthy source for me to use, and learn from what work has been done in the past, and so on.
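The downstream impact use case, figuring out who to notify before changing a table, can be sketched as a breadth-first walk over lineage. This is a hypothetical illustration rather than an Amundsen feature as shipped: the lineage and usage dictionaries, table names, and user names are made up for the example.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_tables(lineage: Dict[str, List[str]], changed: str) -> List[str]:
    # Breadth-first walk over "table -> tables derived from it" edges.
    seen: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen and child != changed:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def people_to_notify(lineage: Dict[str, List[str]],
                     frequent_users: Dict[str, List[str]],
                     changed: str) -> List[str]:
    # Users of the changed table plus users of everything downstream of it.
    affected = [changed] + downstream_tables(lineage, changed)
    return sorted({user for table in affected
                   for user in frequent_users.get(table, [])})


# Made-up lineage and usage data for illustration.
lineage = {
    "raw.rides": ["core.rides"],
    "core.rides": ["report.daily_rides", "ml.eta_features"],
}
frequent_users = {
    "core.rides": ["tao"],
    "report.daily_rides": ["mark"],
    "ml.eta_features": ["tao", "ana"],
}
```

For a change to `raw.rides`, `people_to_notify(lineage, frequent_users, "raw.rides")` returns `["ana", "mark", "tao"]`, combining the lineage metadata with the frequent-user metadata the graph already holds.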
47:40  Tobias Macey
Are there any other aspects of the Amundsen project itself, or the ways that it's being used at Lyft and in the open source community, or the engineering work that has gone into it, that we didn't discuss yet that you'd like to cover before we close out the show?
47:54  Tao Feng
So one thing I'm not sure we mentioned before: initially, the metadata service was only serving the front end service for its metadata requests. Now it has evolved to a place where the metadata service also acts as a standalone service, not only serving the front end service but also serving some of the other services within Lyft for their metadata requests, either GET or PUT. One use case is compliance. The other use case is that some of the other teams want to ingest certain relevant metadata. One example is the feature service in Lyft's machine learning platform: they want to make the feature tables more easily discoverable and grouped into certain categories. So their feature service directly ingests the relevant metadata, like the tags and the name of the team that created the table, by calling our metadata API to put this information into our graph. Later on, when machine learning users search for a feature table, they can already see which team created it and for what purpose.
49:18  Tobias Macey
Well, for anybody who wants to follow along with the work that you're doing, or get in touch, or provide any feedback on the tool that you've built in the form of Amundsen, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get the perspective of each of you on what you see as being the biggest gap in the tooling or technology that's available for data management today. So Mark, if you want to go first and answer that.
49:42  Mark Grover
Sure, yeah, I think the big gap is having a standard for how we discover and trust data. It is crucial to the experience of data users that they are always up to date on what's happening in the organization, and that they have a trustworthy way to figure out if this is the right source for them to rely on for making their decisions. And then expanding on this metadata engine, building applications and making metadata the holy grail, whether it's for discovery or compliance or downstream impact analysis.
50:22  Tao Feng
Yeah, so for me, one big gap I can see is this: once we have all the different heterogeneous metadata indexed in a data management system like Amundsen, how do we make the search relevant across all these heterogeneous systems? For example, initially we started with Hive tables, and we could use the access logs to get the usage for those Hive tables. But later on, other heterogeneous tables, for example in Postgres, don't have usage information. How do we still make sure the search relevancy applies to those Postgres tables, or to Druid tables and BigQuery tables? So this is something we need to think about, to see how to address that.
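To make the ranking discussion concrete, here is a minimal sketch of usage-based ranking with a boost for exact table-name matches, as Tao described earlier, where tables without usage data are simply treated as having zero usage. It is not Amundsen's search implementation and not a proposed answer to the gap Tao raises: the scoring formula, the `EXACT_NAME_BOOST` constant, and the table names are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional

EXACT_NAME_BOOST = 100.0  # hypothetical weight for a full table-name match


def score(query: str, table_name: str, usage_count: Optional[int]) -> float:
    q, name = query.lower(), table_name.lower()
    if q not in name:
        return 0.0  # no textual match at all
    # Sources without access logs report None; treat that as zero usage.
    usage = usage_count or 0
    result = 1.0 + math.log1p(usage)  # dampened usage-based relevance
    if q == name:
        # An exact name match outranks heavily used partial matches.
        result += EXACT_NAME_BOOST
    return result


def rank(query: str, usage_by_table: Dict[str, Optional[int]]) -> List[str]:
    scored = [(score(query, name, usage), name)
              for name, usage in usage_by_table.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for value, name in ordered if value > 0]


# Made-up tables; "rides" comes from a source with no usage information.
tables = {"rides": None, "rides_daily": 5000, "rides_backup": 12, "drivers": 900}
```

With these numbers, `rank("rides", tables)` returns `["rides", "rides_daily", "rides_backup"]`: the exact match surfaces first even without usage data, but partial matches from sources lacking usage logs would still sink, which is the open problem described above.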
51:13  Tobias Macey
Thank you both very much for taking the time today to join me and discuss the work that you've put into the Amundsen project. It's definitely an interesting problem space, and one that is absolutely necessary for the continued success of data platforms and data teams as the overall complexity of our systems continues to grow and evolve. So I appreciate the work that you've both put into that and I hope you enjoy the rest of your day.
51:37  Mark Grover
Thank you very much.
51:38  Tao Feng
Thanks for having us Tobias. Thank you.