Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com






episode 49: A Primer On Enterprise Data Curation with Todd Walter [transcript]


Summary

As your data needs scale across an organization, the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode, Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations, and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.

Preamble
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • How do you define data curation?
    • What are some of the high level concerns that are encapsulated in that effort?
  • How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
  • Can you walk through the stages of an ideal lifecycle for data within the context of an organization's uses for it?
  • What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
  • What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
  • As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?
  • In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
    • What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
  • Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
  • ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
  • What are some of the areas of data architecture and curation that are most often forgotten or ignored?
  • What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?
Contact Info
  • LinkedIn
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • Teradata
  • Data Architecture
  • Data Curation
  • Data Warehouse
  • Chief Data Officer
  • ETL (Extract, Transform, Load)
  • Data Lake
  • Metadata
  • Data Lineage
    • Data Provenance
  • Strata Conference
  • ELT (Extract, Load, Transform)
  • Map-Reduce
  • Hive
  • Pig
  • Spark
  • Data Governance

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA









 2018-09-24  49m
 
 
00:13  Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo today at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science. And if you're attending the Strata Data Conference in New York in September, then come say hi to Metis Machine at booth P16. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat. Your host is Tobias Macey, and today I'm interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence. So Todd, could you start by introducing yourself?
01:40  Todd Walter
Hi, Tobias. I'm Todd Walter. I'm Chief Technologist for Teradata. I've been with Teradata for, as crazy as it sounds, 31 years. I joined the organization when Teradata was still a startup company, and I have walked through all of the cycles and lifetimes of Teradata through its history.
02:06  Tobias Macey
And do you remember how you first got involved in the area of data management?
02:10  Todd Walter
You know, it's funny. I didn't know it when I was a kid, but I've always been a data geek. I've always been fascinated by data. I got involved in a project in high school through my American government teacher, and digitized a bunch of maps and identified all of the publicly owned lands within the city to better understand the tax consequences of publicly owned lands. I've just been hooked by data all of my life. I like to know the details.
02:54  Tobias Macey
And it's definitely a very valuable skill to have, and it's become even more so as data has become the sort of default currency of businesses.
03:03  Todd Walter
It's so crucial these days. There are plenty of people out there saying that "data or die" or "analytics or die" is the new model for companies, and I think that is more true than ever. I don't think it's the only thing; you still have to have a strategy, and you still have to have something compelling to offer to your customers. But if you don't understand in detail how your business operates, how you make money, and how you deal with customers, you just can't operate in the world anymore. You can't meet user and customer expectations.
03:49  Tobias Macey
And it seems that in the time leading up to the early 2000s, when we were starting to digitize our data sources but had a very small variety and volume of data, business analytics was in a sense easy, because you didn't have so many different systems to consider, and so the answers you got were fairly well scoped. Whereas now you have mountains of data to churn through and a massive variety of sources, so it's hard to separate the signal from the noise. And for a while in the mid-2000s there was the onset of the big data mantra of just save everything, and magically good things will happen. So I'm sure that your skills with data curation and data architecture have proved invaluable to yourself and all of your clients. Can you start by just defining what those terms mean, and what their differences are in terms of curation versus architecture?
04:48  Todd Walter
Sure, though I'm going to agree and disagree with you on your point there. Certainly, separating the signal from the noise today has gotten significantly more difficult, partly just because of the enormous volumes of data. However, the number of sources has always been monumental. I've been part of the data warehousing industry for much of my career at Teradata and have worked with a lot of different companies, and when I talk to these big customers, they're talking about thousands, maybe tens of thousands, of source applications, each producing data independently. None of them are architected to talk to each other or interact with each other, and none of them have similar keys or similar representations of the data. Just because it was sort of structured data, just because it was in rows and columns, doesn't make the problem any easier. So data curation has been a problem for the entirety of the data warehouse industry, and the problem just keeps getting bigger and harder as we move into this era of, quote, big data, unquote. So what do I mean by data curation? I mean that the data needs to be managed, cleaned, reassembled, and made into a form that's actually usable by an analyst, or a data scientist, or a modeling system of some form. I like the word curation because it makes me think of the curator of a museum. In a museum, there are a lot of artifacts, many of which are in the basement on dusty shelves, not on display at all. They are kept there because they have some amount of value or some amount of research value, but they're not on display for all of the guests of the museum to see. And then, of course, there are the ones that are on display in the museum.
Those, of course, are nicely arranged, carefully organized, nicely labeled, and easily understood by anyone from children to grandparents who come to the museum. The skeletons are articulated so that you can understand the creature that they came from, and the displays are all very carefully curated to make that user experience very easy and very understandable. And I think that data people these days need to think the same way. They need to think about how they curate the data and figure out which things are the most important to be on display, if you will, to the whole organization, and which things are fine to keep in the basement for a few PhD researchers to look at some day.
08:21  Tobias Macey
I think that's a very valuable metaphor. And also in the case of the skeletons, for instance, with dinosaurs, they're often found with pieces missing. And so the museum curators will actually fill in the gaps of the skeleton to make it easier to understand. And that's a very good parallel to draw with the space of data and business intelligence and representation. And so in the space of data curation, what are some of the high level concerns that are encapsulated in that effort? And is it something that would generally be led by a single person or are there multiple individuals or business roles that are necessary for a proper curation of data within an organization?
09:10  Todd Walter
Oh, it's definitely not one person, unless you define "led by" as the role of the chief data officer. It starts with that kind of a role, the chief data officer, but the curation has to be a combined effort of the business owners and business users, along with the IT people who actually do the curation. And curation in a large organization is a monumental task; it is a never-ending monumental task, because there are always new sources of data, there are mergers and acquisitions, there are always new applications, and there are applications that change. So it is a continuous process that needs to be prioritized and managed by multiple teams of people working closely together. The highest-level concern is time to availability of the data. The traditional data warehouse model was to highly curate all the data and make available only highly curated data, data that was, let's say, perfect. No data is ever perfect, but let's say the data is perfect, and you give it to the users only when it is perfect, for their BI applications and their dashboards and so on. But the trade-off is that it takes a lot of time to do that. An ETL project for a new data source can take months; it can be a costly project that has to get in the budget and get planned for next year, and the elapsed time can be months or even years before the team gets to curate the data and make it available to people. So there's huge pushback. More and more these days, in these days of web time and instant gratification, the users are very impatient to get at their data. As a result, there's a huge amount of pushback saying, well, don't curate the data, just give it to me in its raw form, and I'll play with it as the end user.
And that's okay; it can actually be a very good thing. But it can also introduce all sorts of problems when the person using the data doesn't understand the characteristics of the raw data and the data quality issues that might be in it. Let's say they build some report that computes revenue out of the raw data, but they haven't applied all of the business rules that you would normally apply in a curation process. They come up with a completely different revenue number, which gets reported up through the management chain and gets used. Now the company is making bad decisions, or even getting executives in trouble, because they are reporting incorrect data to the street, or to investors, or whatever. So the whole process has to be well understood and well governed, to make sure that the uncurated data people use is really understood to be uncurated, and that they're not using it for business-critical decisions, while you're expending the right amount of energy on curating the right stuff, to make sure that the business-critical decisions are made on really good data. And that's really hard, because there are many tens of thousands of attributes floating around in thousands of data sets in an organization.
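The divergence Todd warns about can be shown with a toy sketch: a naive revenue number computed straight from raw records differs from the curated one once business rules are applied. All field names, the cancelled-order rule, and the currency rate here are invented purely for illustration.

```python
# Raw order records, straight from the lake. Field names are invented.
raw_orders = [
    {"order_id": 1, "amount": 100.0, "currency": "USD", "status": "complete"},
    {"order_id": 2, "amount": 50.0,  "currency": "USD", "status": "cancelled"},
    {"order_id": 3, "amount": 80.0,  "currency": "EUR", "status": "complete"},
]

# Naive analyst: sum everything as-is, no business rules applied.
naive_revenue = sum(o["amount"] for o in raw_orders)

# Curated: apply the rules a governance process would enforce, e.g.
# drop cancelled orders and normalize currency (made-up rate).
EUR_TO_USD = 1.1
curated_revenue = sum(
    o["amount"] * (EUR_TO_USD if o["currency"] == "EUR" else 1.0)
    for o in raw_orders
    if o["status"] == "complete"
)

print(naive_revenue, curated_revenue)  # two different "revenue" numbers
```

The two numbers disagree not because either computation has a bug, but because only one of them encodes the business rules, which is exactly why ungoverned raw access produces conflicting reports.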
13:42  Tobias Macey
How does the size and maturity of a company affect the ways that they approach architecting and curating the data that they're capturing, and the ways that they build and tier their data systems to enable the curation strategies being architected?
14:08  Todd Walter
The size of the company, among large and medium companies at least, doesn't seem to really be the predictive factor. Really, it's about the data maturity of an organization. And frankly, the pendulum swings back and forth. As I was saying in my previous comments, the data warehouse people really frustrated the users because it took so long to get data ready for the data warehouse; the users got really frustrated waiting for that data and wanted raw data to work with. So that happens, and then the users get frustrated with that data not matching the general ledger, not being able to be joined with customer data, and not having common units, and they come back to the IT organization and want their data curated more. So the pendulum kind of swings back and forth over time. What I try to advocate with the customers that I work with, and what Teradata has always tried to advocate, is that people should have a pipeline kind of mentality about curation. These days we would call it an agile methodology around curation: they start with data that is more in its raw state and let a few exploratory users work with that, and then, as that data gets to the point where it supports production applications, or supports more users, or supports more sharing between departments, it gets more and more curated over time, all managed by a governance process and a prioritization process. There are not very many organizations that have that discipline, however. So what really happens is that organizations start with raw data, especially in the big data space, and they hack at it and create data pipelines, and actually create a fair amount of future issues for themselves in terms of scaling up to more analytic projects and scaling up to production use.
But the people who are really data-centric organizations get this, and they go through this pipeline sort of agile process of continuous curation, continuously making the data better for their users.
17:14  Tobias Macey
And so in this pipeline-oriented workflow, what does the lifecycle of the data look like in terms of the systems it flows through, the operations being performed on it, and the general availability of these different layers of data within the organization?
17:37  Todd Walter
When data first arrives as a new data set that the organization has never seen before, especially when it is one of the forms of big data (it might be text, it might be web logs, it might be IoT data, any of the kinds of data that aren't the traditional rows and columns from traditional applications), it generally needs to land in some form of a data lake, some form of a file system: a system that doesn't require significant rigor in the format of the data and the structure of the data. But that data is really only usable by a small number of heroes in the organization who really understand all the details of the source systems, the data quality, and the challenges of the structures of the data. So then that data needs to be progressively curated, with metadata progressively added to it, so that other people can understand the structure of the more curated form of the data. At the other end of the pipeline, when the data is being used by the enterprise as a whole, let's say it's the common view of customer data, or the core financial data, or that kind of information for the company, that data needs to be very highly curated and put in a form that everybody can use, which is usually, not always but usually, a more traditional relational form. But by the time it gets there, a lot of the volume of the raw data has been consumed or left behind at stages along the curation pipeline. Maybe there are aggregates that have made it there; maybe there are computed scores or values. Something like a customer sentiment score gets computed out of a large volume of text, and the only thing that actually makes it into the highly curated data is that score; the text is left in the data lake for further new analyses.
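The progressive curation Todd describes, where a large volume of raw text is reduced to a single promoted score per customer, could be sketched roughly as follows. Everything here (the word lists, the scoring function, the record layout) is invented for illustration; a real pipeline would use an actual sentiment model and durable storage.

```python
# Toy word lists standing in for a real sentiment model.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_score(text):
    """Fraction of positive minus negative words; 0.0 for empty text."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# Raw layer: messy, high-volume text records stay in the lake.
raw_reviews = [
    {"customer_id": 1, "text": "great product, love it"},
    {"customer_id": 1, "text": "delivery was terrible"},
    {"customer_id": 2, "text": "good value"},
]

# Curated layer: only the derived per-customer score is promoted;
# the raw text itself never leaves the lake.
curated = {}
for rec in raw_reviews:
    cid = rec["customer_id"]
    curated[cid] = curated.get(cid, 0.0) + sentiment_score(rec["text"])

print(curated)
```

The point of the sketch is the shape of the flow, not the scoring: the high-volume raw records are consumed at a pipeline stage, and only a small, well-understood value crosses into the curated layer.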
20:20  Tobias Macey
And in the data warehouse, where you're storing these aggregates, or these condensed analyses of the raw data, would you generally also have some record of the original source of the data and the provenance so that somebody who is interested in doing a different type of analysis or trying to use a different algorithm for generating these aggregates can then go back to those records easily to be able to try and either replicate or discover new reflections of that information?
20:53  Todd Walter
Lineage and provenance are a huge deal these days, absolutely enormous. There are a bunch of new companies springing up, some of them around for a little while, some of them newer, and a lot of energy is being expended in the overall metadata, lineage, and provenance areas. The reasons for that are not only the ones you described, about the analyst needing to know what's going on, where the data came from, how it got there, and what curation rules have been applied. It goes much further than that, because it also goes to being able to prove, to an auditor or in a legal challenge on a privacy or security case, how data got to where it is, and being able to prove that bias wasn't introduced, or that the data was handled, maintained, and secured properly. And it ties in with access controls and everything else. So yeah, that whole lineage space is a huge deal, along with the metadata on the data.
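A minimal sketch of the kind of lineage record being discussed: each derived dataset carries its inputs, the transformation applied, and a content hash that supports later audit. The structure and field names are illustrative, not any particular metadata tool's format.

```python
import datetime
import hashlib
import json

def make_lineage(output_name, inputs, transform_desc, payload):
    """Attach a provenance record to a derived dataset."""
    return {
        "dataset": output_name,
        "inputs": list(inputs),          # upstream dataset names
        "transform": transform_desc,     # which curation rule ran
        "created_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        # A content hash lets an auditor verify the data was not
        # altered after the lineage record was written.
        "content_sha256": hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest(),
    }

raw = [{"cust": 1, "amount": "12.50"}, {"cust": 2, "amount": "8.00"}]
curated = [{"cust": r["cust"], "amount": float(r["amount"])} for r in raw]

lineage = make_lineage(
    "curated.orders",
    inputs=["lake.raw_orders"],
    transform_desc="cast amount from string to float",
    payload=curated,
)
print(lineage["dataset"], lineage["content_sha256"][:12])
```

Walking the `inputs` chains of such records back to the source is what lets an analyst, or an auditor, reconstruct how a number was produced.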
22:24  Tobias Macey
And when you're dealing with data lakes, there has been a lot of discussion about that being sort of the canonical source of data, and maybe the only source of data within an organization, with people trying to espouse schema on read: you don't need to define all of your schema up front, because it slows you down. Whereas with data warehouses you have a very strong structure to the data, and you very much need to define that schema up front, so it's schema on write. And then, in terms of how you actually enforce the schemas, there's the question of doing extract, transform, and then load to put the data into the data warehouse within that structure, or, with the data lakes, doing extract, load, and then transform, where you determine the schema at read time or when you're doing various analyses. So I'm curious what your thoughts are on all of those, as somebody who has a lot of history and context within the industry.
23:27  Todd Walter
Well, that's a big question. There are people in the industry who say that the data lake is a replacement for the data warehouse. I do not believe that; Teradata does not believe that. However, we strongly believe that the data lake has a really important role to play in the overall analytic data platform architecture. There are people who make this an either-or conversation, and we just don't believe that at all. We believe that the data lake and the data warehouse should work together symbiotically to deliver the data, and their separate capabilities, to the organization. The data lake is really good at collecting this really high-volume data and doing big grind-it-up operations over that large volume of data: the big curation steps, the big processing of sensor data, for instance, to put it into a format that analysts can use, and to normalize units and normalize time. These are big, heavy-lifting operations on this data, and those are really great things to do in the data lake, while delivering
25:03
access to the highly curated data of the organization is something that data warehouses do very well. And each of them is bad at doing the other thing. Data warehouses are bad at doing the really heavy lifting on the semi-structured or weakly structured raw data that's coming in, and the data lake technologies are weak at providing SLAs on high-concurrency workloads to support a whole organization. So we really think that the two should work together, and we think there's a natural flow: as the data is more and more curated, it is more and more likely to belong in the data warehouse, to be delivered to a much wider group of people and shared across the organization, rather than to the exploratory users who are doing the initial analysis and initial understanding, and the big heavy-lifting curation processes. We also think that the data lake is a virtual concept, in that a file system is a great place to land data that is weakly structured: the text, the IoT data, the web log data, all of those kinds of forms of data that are the new big data sources of the world. But when the data is coming in a more row-and-column form, landing it and formatting it as unstructured files and then restructuring it back into a form for widespread use in the organization is an extra hop and extra energy, extra resources used that don't really need to be used. As for your comments and questions about ETL, ELT, and all the other variations: at Strata last week, I heard a new one, which I really liked, ELE. It's around the concept of a data hub, where you extract, load, and then extract again to feed out to a very large number of consumers. One of the presenters was bemoaning the fact that all he did, his entire life, was ELE: just land data and then ship it back out again, and nobody actually ever used it in his data platform.
But Teradata has always advocated the use of an ELT kind of model, just because it is the scalable model for the larger data sets. It's easier to do the transform processes with a parallelized, scalable set of operations, rather than trying to push them through a single-threaded server somewhere and process them record by record. That's fine for small data sets, but it doesn't work for large data sets. And the ELT model, of course, has been highly adopted by the data lake folks, where a lot of the curation is done on-platform, leveraging the tools of the data lake environment: anything from MapReduce, Pig, and Hive all the way up to Spark, Python scripts, and everything else. But the goal is the same: push the work into the scalable platform so that you can operate on the very large data sets and do the heavy lifting on them in a reasonable amount of time. Schema on read and schema on write come right back to the conversation from before, about the lifecycle of data in an organization and the lifecycle of data curation. Schema on read is really great when it's a small number of users who are exploring the data, trying to understand the structure and the value in the content, and trying to derive the interesting things or the interesting new insights out of that data. That's a great thing to do, and every organization should provide in their processes a way for people to land the data in a raw or lightly curated way and make it available in a schema-on-read kind of model to that small set of super users who can deal with that data in that form.
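The ELT push-down described above could be sketched like this, using SQLite purely as a stand-in for a scalable engine: the raw data is loaded untouched, and the transform then runs as set-based SQL inside the engine rather than record by record through an external server. Table and column names are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: land the raw data as-is, no cleanup yet.
# Note amounts arrive as strings and region casing is inconsistent.
cur.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")
cur.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [("east", "100.0"), ("EAST", "50.5"), ("west", "200.0")],
)

# Transform: pushed down into the engine as one set-based SQL statement,
# so it scales with the platform instead of streaming rows through an
# external single-threaded process.
cur.execute("""
    CREATE TABLE curated_sales AS
    SELECT lower(region) AS region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY lower(region)
""")

for row in cur.execute("SELECT region, total FROM curated_sales ORDER BY region"):
    print(row)
```

In a real deployment the same pattern applies with a parallel warehouse or a data lake engine in place of SQLite: the point is where the transform executes, not the specific SQL dialect.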
30:12
But when you need to get the data out to 10,000 users in 50 organizations, schema on read no longer makes any sense at all. It is a huge resource utilization, because you're doing it over and over and over again, for every use of the data. It introduces all sorts of opportunities for each person, or each application, to curate the data in a different way and get different answers. It introduces a whole bunch of problems that are all the reasons why we did ETL in the data warehouse world in the first place. And so the more the data is used across the organization, the more production, or data-as-a-product, the data becomes, the more curated it needs to be and the more it needs to be modeled, with the curation work done once and then the curated data used many times by the people downstream.
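The contrast between the two models might be sketched like this; the record layout and the quarantine rule are invented for the example. Schema on read leaves every consumer to interpret the raw bytes, while schema on write validates once at load so downstream users never repeat, or diverge on, that work.

```python
import json

# Raw JSON lines as they would sit in the lake; note the messy record.
raw_lines = [
    '{"user": "a", "amount": "10"}',
    '{"user": "b"}',                      # missing field: raw data is messy
]

# Schema on read: each consumer applies its own interpretation at query
# time, including its own choice of default for the missing field.
def read_with_schema(line):
    rec = json.loads(line)
    return {"user": rec.get("user"), "amount": float(rec.get("amount", 0))}

on_read = [read_with_schema(line) for line in raw_lines]

# Schema on write: validate once at load; conforming rows become typed
# records, non-conforming rows are quarantined for inspection.
typed_rows, quarantined = [], []
for line in raw_lines:
    rec = json.loads(line)
    if "user" in rec and "amount" in rec:
        typed_rows.append((rec["user"], float(rec["amount"])))
    else:
        quarantined.append(line)

print(on_read, typed_rows, quarantined)
```

The schema-on-read path silently invents an `amount` of zero for the bad record, which is exactly the kind of per-consumer divergence that becomes untenable at 10,000 users.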
31:20  Tobias Macey
And going back to your metaphor of the museum with the data lake versus the data warehouse, the data warehouse ends up being the display room where all of the exhibits are put on display for everybody to be able to access and they're easy to consume and understand. And then the data lake is the basement of the museum where all of the raw unprocessed resources are for people to be able to do their research and analyses and prepare them for moving up to the display room.
31:50  Todd Walter
Absolutely. And in the display room, you have lots of metadata. The exhibits are linked together by time, or timelines, or geography, or all of that. They're easily understood, and they're all articulated, like data joined together, so that you can see it all in its relationships, in addition to just the individual data points.
32:23  Tobias Macey
And for organizations or individuals who are first starting to plan out their overall data architecture, and the associated infrastructure, systems, and curation processes, what have you found to be some of the common mistakes that ultimately result in failure, of either a lesser or greater degree?
32:49  Todd Walter
The failures come from swinging the pendulum to one side or the other, rather than thinking about it as a continuum; the failures come from both ends of the spectrum. In particular, the people who believe that everything needs to be perfectly curated before it gets in the hands of the users are blocking whole groups of users, especially data science users, from being able to explore new sets of data and work on them without a huge time lag. They're also wasting a lot of resources on the curation projects themselves when the data doesn't turn out to have the high utility that justifies that level of curation. On the other end, one of my favorite conversations ever was sitting with a CTO at a very large organization. He was very proud of the fact that he had gotten IT completely out of the curation business; instead, all they were doing was gathering all the data in its raw form, dumping it in the data lake, and then giving access to the business organizations and saying, it's all your problem. That is going to result in failure. Well, it is resulting in failure, because the user organizations don't have the skills, they are each curating their data sets independently, and they're all coming up with different answers to the same question. It's back to the worst of the data mart world in the '90s and early 2000s.
34:47  Tobias Macey
And so it sounds like, for somebody who is first beginning a new project, or starting up, or starting with a new organization, their best path forward would likely be to start by landing data in raw form in a data lake. And then, either doing it themselves, or having someone help them, or having an analyst, actually start exploring the data and building reports off of the data lake to determine what's actually useful and what's being used by the broader organization. Then start to encapsulate that and capture it in the form of a data warehouse, building out that data warehouse based on the data sets and reports that are most valuable to the organization, while continuing to land new sources in the data lake as a sort of staging and testing ground.
35:37  Todd Walter
Exactly. And as you start, start from the beginning building out a governance process that works with the users to understand what level of curation is actually required, and do the absolute minimum necessary curation to meet the business requirements. I call it Minimum Viable Curation, stealing the term from the Agile world. The idea is that if you have a good conversation between the people who are doing curation and the people who are using the data, and you have that constructive conversation continuously, then you can spend the right amount of time and dollars on doing the curation and be very selective. You might have a data set with 1000 attributes in it, but the users who are producing the business report say, well, we only care about these five. Then you don't need to curate the other 995; don't waste your time on it. Curate the five that they care about and leave the rest for another day, when another application comes along and those other attributes need to be curated to support another application or a new business use. That governance team is a key thing. You know, nobody gets to start from scratch. It would be nice to start from a blank sheet of paper, greenfield, but nobody gets to start from scratch. If I did get to start from scratch, I would start the governance process very early. It might be very light to start with, but I would be growing it as the datasets grew, as the usage grew, and as more and more people were using more and more of the data more widely in the organization. And I would build in the metadata from day one. You've got to start capturing the metadata; it's very hard to go back and capture the metadata and the lineage after the fact.
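The five-of-a-thousand idea above can be sketched in code. This is a minimal illustration, not anything Todd describes concretely: the records, column names, and cleaning rules are all invented, and the point is only that curation rules exist for the handful of attributes the report uses, while every other attribute passes through raw.

```python
# Minimum Viable Curation, sketched: clean only the attributes the
# business report uses; everything else stays raw. All names are
# hypothetical, for illustration only.

RAW_RECORDS = [
    {"customer_id": " 001", "region": "US-East", "revenue": "100.5", "misc": "x"},
    {"customer_id": "002 ", "region": "eu-west", "revenue": "bad", "misc": "y"},
]

# Only the report's columns get curation rules (the "five", not the 995).
CURATION_RULES = {
    "customer_id": lambda v: v.strip(),
    "region": lambda v: v.lower(),
    # Coerce to a number; unparseable values become None for later review.
    "revenue": lambda v: float(v) if v.replace(".", "", 1).isdigit() else None,
}

def curate_minimum(record: dict) -> dict:
    """Apply rules to curated columns; pass everything else through untouched."""
    return {k: CURATION_RULES[k](v) if k in CURATION_RULES else v
            for k, v in record.items()}

curated = [curate_minimum(r) for r in RAW_RECORDS]
```

When a new application later needs `misc`, a rule for it gets added to the table and the history can be re-curated, which matches the "leave the rest for another day" approach.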
So build in the metadata capture and lineage capture, as automated as possible, right from the very beginning, to make sure you have track and trace on everything.
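One way the "automated from the very beginning" lineage capture could look in code is a decorator that records each pipeline step's inputs and output as a side effect. This is purely a sketch of the idea, not any particular tool's API; the step and dataset names are invented.

```python
# Automated lineage capture, sketched: every decorated pipeline step
# logs its input datasets, output dataset, and timestamp, so lineage
# accumulates as the pipeline runs instead of being reconstructed later.
import functools
import time

LINEAGE_LOG: list[dict] = []

def track_lineage(output_name: str, inputs: list[str]):
    """Decorator that records dataset-level lineage for one step."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            LINEAGE_LOG.append({
                "step": fn.__name__,
                "inputs": inputs,
                "output": output_name,
                "at": time.time(),
            })
            return result
        return wrapper
    return decorator

@track_lineage(output_name="orders_clean", inputs=["orders_raw"])
def clean_orders(rows):
    # Hypothetical transformation: drop rows missing an order id.
    return [r for r in rows if r.get("order_id")]

cleaned = clean_orders([{"order_id": 1}, {"order_id": None}])
```

In a real platform the log would go to a metadata store rather than an in-memory list, but the design point stands: track and trace comes for free with every run.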
38:02  Tobias Macey
And once the architecture has been established and put into production, and people are starting to use it, what are some of the techniques or strategies that you use to allow for continued evolution of those systems, to prevent stagnation and eventual failure of the data platform, the data project, or the entire organization?
38:26  Todd Walter
I think we've touched on a lot of those points already. You have to have the governance, you have to have the metadata, you have to have the continuous conversation with the users, because just dumping data in a pile doesn't do anybody any good. There are lots of published reports these days saying that only a small percentage of data lakes are actually successful and actually delivering business value. Rewind two decades and the same published numbers were written about data warehouses, and the failure modes are the same. The failure modes happen because IT people do build-it-and-they-will-come edifices, and those never work; they never succeed. If they're not tightly linked with the business users from the beginning and don't have a good governance process, then they will never be a living data organism. The data of an organization is a very living thing, and it needs to be maintained and managed and fed, and treated that way.
39:51  Tobias Macey
And for somebody who is interested in learning more about the overall landscape of data architecture and curation, some of the concrete strategies and systems that they can implement to help them in that journey, what are some of the resources that you recommend that you found to be the most useful?
40:10  Todd Walter
Well, there's a lot of stuff out there, but it's really difficult to sort the wheat from the chaff. It's best to look for materials that are published by somebody who is more neutral, rather than a vendor, because too many vendors are just pushing a strategy that is one dimensional. You know, a data lake vendor might be pushing the strategy that everything is schema-on-read and everything's landed raw, and you just provide the data out to the users, while a specialized data mart vendor might be pushing that everything has to be in a star schema, carefully curated, and made available to BI engines. So some of the vendor stuff can be quite one dimensional. I encourage people to find the stuff that's written by the people who are more independent, the independent analyst types and such in the marketplace.
41:29  Tobias Macey
And are there any other aspects of data curation and the associated concerns that we didn't cover yet, which you think we should discuss further before we close out the show?
41:39  Todd Walter
I think there are two things that we didn't talk about, and they're kind of related. One is, if you collect data and nobody uses it, you have wasted resources. You've wasted your time and energy, and you've wasted the physical resources for storing the data and managing the data. I am really irritated by people who tell me they have a successful data lake because they have two petabytes of data in it. I ask them how many users they have, and they say, huh? The data lake is successful because we put two petabytes of data in it. No, it's not. It was a giant waste of resources, and you should be fired instead of getting your bonus. So the whole idea of gathering data should be that there's something you're going to do with it: there's somebody who is going to analyze it, and somebody who's going to take the results of that analysis and execute a business process using them. If it doesn't result in a business-changing decision, then it is all worthless. So doing data science and putting up posters on the walls about a cool data science project is also a waste of time and energy, unless it results in a production business process that is delivering some value to the business at the end of the day. And there's way too much of that going on. The flip side of that is you've got to decide when to delete data, or not keep it in the first place. Again, people are counting petabytes, and that's interesting but not valuable. The data that people need to keep is the data that is actually useful for the business. Now, of course, you have to have a retention policy, and a retention policy says that data must be deleted after seven years in order to meet some compliance requirements, or it needs to be kept for ten years to meet other compliance requirements.
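A retention policy like the seven-year rule just mentioned is mechanically simple to express once the rules are agreed with legal. The sketch below is an assumption about how such a sweep might look; the dataset names and windows are invented, and it models only delete-after windows (a keep-at-least rule would be a separate floor check).

```python
# A sketch of a retention-policy check: each dataset carries a
# delete-after window negotiated with legal, and a sweep flags
# what is past its window. All names and durations are hypothetical.
from datetime import date, timedelta

RETENTION_POLICIES = {
    "finance.transactions": timedelta(days=7 * 365),   # delete after ~7 years
    "ops.sensor_readings": timedelta(days=3 * 365),    # delete after ~3 years
}

def due_for_deletion(dataset: str, created: date, today: date) -> bool:
    """True when the dataset's retention window has fully elapsed."""
    policy = RETENTION_POLICIES.get(dataset)
    return policy is not None and today - created > policy

flagged = due_for_deletion("finance.transactions", date(2010, 1, 1), date(2020, 1, 1))
```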
But this is a place where the data people need to get with all of the legal people to define the right rules. Then you also have to be smart about which raw data sets you keep and which ones you don't. Some people say keep everything, and I have a kind of personal feeling about that; I'm kind of a keep-everything kind of person, just ask my wife. And one of the cool things about the data lake technologies, with their lower cost per terabyte, is that you can keep more of it, and that allows you to go back and curate out more attributes from the history of the raw data. But there's got to be an end to that. All that data doesn't have value forever. At some point you're not even selling the cars anymore that you have the sensor data from; the lifetime of the thing that has the sensor in it has passed, or whatever, and then it's time to delete it. And of course, some of the new rules around privacy are making for some new deletion requirements that are actually very challenging. You have to be able to forget someone. Someone under GDPR in a European country can call up your company and say, you must forget me, and you need to go through all of the data sets everywhere in your organization, find every record that pertains to that customer, and erase them from your data sets. That's really, really hard when there are many copies of the data lying around, and it's on a lot of different platforms, and replicated in a lot of different ways. That's a very, very hard problem. So deleting data is as difficult a problem as gathering it, storing it, and curating it in the first place.
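The GDPR "forget me" problem can be made concrete with a toy sweep. Here the "platforms" are just named in-memory datasets, which is exactly what makes the toy easy and the real problem hard: in practice each dataset lives in a different system needing its own connector, its own audit trail, and its own handling of replicas. All dataset names and records below are invented.

```python
# A toy "forget me" sweep: find and erase every record for one
# customer across every known dataset, reporting what was removed.
# Real systems must also cover copies and replicas the catalog missed.

DATASETS = {
    "warehouse.orders": [
        {"customer_id": 42, "total": 99.0},
        {"customer_id": 7, "total": 10.0},
    ],
    "lake.clickstream": [
        {"customer_id": 42, "page": "/home"},
    ],
    "mart.marketing": [
        {"customer_id": 7, "segment": "loyal"},
    ],
}

def forget_customer(datasets: dict, customer_id: int) -> dict:
    """Erase every record for customer_id; return erased counts per dataset."""
    erased = {}
    for name, rows in datasets.items():
        keep = [r for r in rows if r.get("customer_id") != customer_id]
        erased[name] = len(rows) - len(keep)
        datasets[name] = keep
    return erased

report = forget_customer(DATASETS, 42)
```

The per-dataset report is the easy part; proving that no copy anywhere was missed is the hard part Todd describes.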
46:23  Tobias Macey
So for anybody who wants to follow you and keep up to date with the work that you're up to, I'll have you add your preferred contact information to the show notes. And as a final question, I would be interested to get your perspective on what you view as being the biggest gap in the tooling or technology that's available for data management today.
46:43  Todd Walter
Wow, the biggest gap? There are lots of them, and that's a good thing and a bad thing. It's a bad thing for users and IT organizations, but it's a good thing for innovation and all the creative people firing up startups and making investments in the space. I think a couple of key areas are important. One is the whole data pipeline linked with lineage and metadata. Most people in the data lake world are writing code for that, and that cannot scale over the long run. There are a number of companies in that space, but they're all small and fairly early, and nobody really has a comprehensive end-to-end answer similar to the answer that we had with the ETL tools of the data warehouse era. And we really need that tooling, because we really need to scale the processes, and we can't afford to be writing code for every data set and maintaining code for every data set; that doesn't make sense. And on the usage side, it is crucial to much more tightly link all of the different analytics together, and much more tightly link them to the data store. There are a lot of people doing tools and cool algorithms, but they are completely unlinked from the data stores, and you have to extract the data, reformat the data, get it in the right form, and put it through a tool. And if you need to use three algorithms, you need to use three tools. So this, again, can't scale, because again, people are writing lots and lots of nasty code in order to solve these problems. And this is an area where Teradata is spending a lot of energy right now, and we'll be making some announcements in the near future at our big user conference coming up in October.
49:04  Tobias Macey
All right. Well, thank you very much for taking the time today to join me and discuss your experience and perspective on data curation and data architecture. It's been very useful for me, and I'm sure that the listeners will appreciate it as well. So thank you for that, and I hope you enjoy the rest of your day.
49:23  Todd Walter
Thank you very much, Tobias. It's been great talking with you.