Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

https://www.dataengineeringpodcast.com

episode 74: Building An Enterprise Data Fabric At CluedIn [transcript]


Summary

Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a business focused on building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the task of integrating with third-party platforms, automating entity extraction and master data management, and the work of providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
  • Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric
Interview
  • Introduction

  • How did you get involved in the area of data management?

  • Before we get started, can you share your definition of what a data fabric is?

  • Can you explain what CluedIn is and share the story of how it started?

    • Can you describe your ideal customer?
    • What are some of the primary ways that organizations are using CluedIn?
  • Can you give an overview of the system architecture that you have built and how it has evolved since you first began building it?

  • For a new customer of CluedIn, what is involved in the onboarding process?

  • What are some of the most challenging aspects of data integration?

    • What is your approach to managing the process of cleaning the data that you are ingesting?
      • How much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?
    • How do you preserve and expose data lineage/provenance to your customers?
  • How do you manage changes or breakage in the interfaces that you use for source or destination systems?

  • What are some of the signals that you monitor to ensure the continued healthy operation of your platform?

  • What are some of the most notable customer success stories that you have experienced?

    • Are there any notable failures that you have experienced, and if so, what were the lessons learned?
  • What are some cases where CluedIn is not the right choice?

  • What do you have planned for the future of CluedIn?

Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
  • CluedIn
  • Copenhagen, Denmark
  • A/B Testing
  • Data Fabric
  • Dataiku
  • RapidMiner
  • Azure Machine Learning Studio
  • CRM (Customer Relationship Management)
  • Graph Database
  • Data Lake
  • GraphQL
  • DGraph
    • Podcast Episode
  • RabbitMQ
  • GDPR (General Data Protection Regulation)
  • Master Data Management
    • Podcast Interview
  • OAuth
  • Docker
  • Kubernetes
  • Helm
  • DevOps
  • DataOps
  • DevOps vs DataOps Podcast Interview
  • Kafka

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


 2019-03-25  57m
 
 
00:11  Tobias Macey
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. And if you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances to ensure that you get the performance that you need. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute, and don't forget to thank them for their continued support of the show. Managing and auditing access to all those servers and databases that you're running is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at strongDM. They've built an easy to use platform that lets you leverage your company's single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems. Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host is Tobias Macey, and today I'm interviewing Tim Ward about CluedIn, an integration platform for implementing your company's data fabric. So Tim, could you start by introducing yourself?
02:48  Tim Ward
Yeah, sure. First of all, thanks for having me on, Tobias. My name is Tim Ward. I've been working in the software engineering space for around the last 12 or 13 years, mainly focusing on the kind of enterprise space. I'm based out of Copenhagen, Denmark, and I have a little boy, a little dog that looks like an Ewok, and my wife, and we all live in small, little Copenhagen.
03:11  Tobias Macey
And do you remember how you first got involved in the area of data management?
03:14  Tim Ward
Yeah, so based on the fact that I've been working in software engineering for that long, it was around six years ago that I got my first glimpse into a new world. I'd been focusing mainly in the web space, you know, building big websites for big brands. And I was given a project which was around how do we optimize the content on the website to the right audiences. And as you could probably imagine, this comes with a lot of data as well. In this industry of testing content, you would usually have this process called a split or A/B test, you know, give two variations of a website and see which one is the more popular. And we had come up with this kind of interesting idea of, well, should there really be one winner? Or is it actually that both are relevant to different audiences? So this took me down this rabbit hole of data mining and clustering techniques with machine learning. And what got me into the more data management pieces is that I realized, oh God, I have to blend data from different systems, and it doesn't really blend well because it's not clean, so I need to normalize the data. Well, fast forward to today, and I've been in this industry for the last six years now.
04:31  Tobias Macey
And so before we get digging too deep into clued in, in particular, can you start by sharing your definition of what a Data Fabric is? Because there are a lot of different resources that people might look to that might have conflicting definitions. And that way we can have a sort of consistent view for the purpose of this conversation.
04:48  Tim Ward
Yeah, well, I mean, first of all, I will apologize for introducing yet another data acronym in the data space; I think we don't need any more. We've got data lakes, data warehouses, and data marts, and now we've got data fabric. So for that, I apologize. But you know, if you look up data fabric on the web and you find us, you'll see that we're kind of talking about this idea of a foundation or a fabric across our content. And really, it comes down to, when I first got into this industry, I realized how overwhelming the data space is. I guess just some of the keywords or buzzwords I mentioned before, like data integration and data governance and data warehousing, it was just overwhelming from a technology stack. And, you know, when we started the platform, we saw this common theme of people buying these kind of individual tools, like a data warehouse and business intelligence tools, but really where most of these projects failed was the stitching of all of these different pieces together. So suddenly, if we had bought, you know, a purpose-fit integration platform, and now we had purchased the best-in-breed data preparation tool, we had this huge challenge of how do we make that a coherent and cohesive story. And so when we describe the data fabric, you can think of it more as this kind of plumbing or this stitching, where, you know, we know that at the end of the day you might already have bought a data warehouse, it's probably been around for a very long time, but what you miss is this kind of core stitching and plumbing of an end-to-end story. So when we talk about fabric, it's really about how do I make sure that when I've got data coming in, it's giving value to me at the end. And those end use cases will typically be things like visualizations in charts and BI tools, or it could be figuring out patterns in your data using machine learning techniques and tools like Dataiku, or Azure ML, or RapidMiner. So really we want to make sure the stitching is solved, and you can plug in your parts where you see fit.
07:04  Tobias Macey
And so that brings us to the work that you've been doing with CluedIn to make this whole process of stitching together different data sources easier. So can you give an explanation of what your mission is at CluedIn and some of the story of how it got started?
07:19  Tim Ward
Yeah, sure. I mean, really, we're interested in helping companies become data driven. And this means a lot of things to me as well; there's another data acronym for you. For me, the idea of data driven is that I, as a company, am using data to make very decisive decisions. So not something that I just use as a gut feeling or hypothesis, but something where I can point at a screen and say, hey, here's the data, it's actually what allowed me to form my opinion or give this input. And, you know, we think it's so important to also be able to trace that data back to where did it come from, how was it cleaned over time, how stale is it, what's the quality, what's the accuracy. And I saw all these pieces as missing in the market, and as something that I would like in the core fabric or the core plumbing. So, like any good story, I guess it starts with a necessity. A couple of the engineers, including me, were working at a larger enterprise, and we were at the time a small core engineering group. And you know a lot about what's happening in the business when you're that kind of small team; we knew who our customers were, we knew what was wrong with our platform, we knew what our customers were asking for, that kind of communication was there. And, as most businesses do, they grow, and within a year we had grown from five kind of core engineers to 125. And we lost this connection to why we were building this product, why we were building what we were. And, like any typical engineer would try to solve things, our CTO now, at the time he said, well, why don't I just build this kind of hub, and I can get everybody to plug in all the tools that they use, like the CRM and the marketing tools, and I'll just go there to figure things out, instead of, God forbid, talking to people or taking meetings. So it has a very typical engineering start. But really, what happened from there was this natural progression of, well, if I look up a particular customer, I get 10 different results, but they're all the same customer. So, okay, I have to introduce maybe some deduplication or some merging engine. And then we realized, well, things don't always merge perfectly, so we need a fuzzy merging engine. And, you know, this would probably be represented well in a graph database, instead of just one type of data format. And then you come to this natural step of, well, I can't form a graph properly if the data is not clean and normalized. So this really snowballed into this kind of self-fulfilling prophecy, where it's landed today as this kind of core stitching of the data foundation of a business.
10:14  Tobias Macey
And as far as the types of customers that you work with, can you just describe what the ideal customer is, as far as whether they're a technically oriented organization, or if your aim is to be just sort of plug and play, where somebody just sends you data, and then you automatically do the integration and send it to their destinations, or integrate it with the various data streams that they're working with?
10:36  Tim Ward
Yeah, so I mean, there are lots of good pieces in this question. I mean, the piece about "automatic", right? You could probably imagine, you know, there's no magic in our field, Tobias. There are rules, there are techniques, and there are things that fall through the cracks. So, funnily enough, our ideal customer is large enterprises that have a huge problem, right? Thousands of tools, and God forbid you have to integrate these 1,000 tools into one central hub, and then clean it and kind of invent all these processes yourself. And there are a lot of different things and techniques we can use in our field right now to automate some of the cleaning. So things like standardizing on a date format, things like normalizing phone numbers into ISO formats, you know, trying to do a relatively good job at normalizing addresses and things like this. There's really no science to some of these things, but there are things that fall through the cracks. So really, our ideal customer would be, hey, you've got a mess with the data, it's not blending well together, and therefore when you see that data in your BI tools or your machine learning tools, where you're actually going to get value, you're not seeing that value. And really, our goal is to put in the processes and the pipeline to say, hey, we're facilitating a process for you to improve this quality over time. So it's really large enterprise customers. And, funnily enough, the majority of our customers are in the finance and the insurance industries. A lot of this also comes down to regulation; these industries are put under more strenuous regulations than, for example, technology companies today. And being based in Europe, there are now strong policies around data privacy and what data companies can hold on individuals. All of this plays a role in really focusing in on enterprise businesses.
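To make the rule-based cleaning Tim mentions more concrete, here is a minimal sketch of date and phone normalization. The formats, the default country code, and the function names are illustrative assumptions, not CluedIn's actual implementation; the point is simply that some values normalize mechanically and the rest "fall through the cracks" for manual handling.

```python
import re
from datetime import datetime

def normalize_date(value):
    """Try a few common date layouts and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # falls through the cracks: leave it for manual cleaning

def normalize_phone(value, default_country="+45"):
    """Strip punctuation and force a single +<country><number> shape."""
    digits = re.sub(r"[^\d+]", "", value)
    if not digits.startswith("+"):
        digits = default_country + digits.lstrip("0")
    return digits

print(normalize_date("25/03/2019"))        # -> 2019-03-25
print(normalize_phone("12 34 56 78"))      # -> +4512345678
```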
12:34  Tobias Macey
And on the point that you're making about regulation and compliance requirements that these companies are dealing with, I'm wondering in particular how you manage that at the architectural and system level in terms of managing data privacy and security as it transits your system, and what the deployment environment looks like, whether you're running a hosted system that people send the data through, or if you're co-located with the data to reduce some of the latency effects of having to cross network boundaries, and just some of the overall ways that you have built your company and the ways that the data flows through.
13:10  Tim Ward
Yeah, I mean, you could probably gather that a lot of our customers are really interested in not only hosting this environment on their own premises in their own kind of virtual machine clusters, but a lot of them are also moving into the cloud, and some of them even have this kind of hybrid approach. So with the architectures that we're seeing, first of all, we have to start with a good base, right? If someone's going into the cloud, it's so important that we're adhering to the leading standards for network and at-rest encryption, so things like TLS, things like SSL, things like making sure that we're running all of this behind VPN infrastructure. Now, it's so important, of course, with the customer data as it's flowing through, that if we're encrypting it both on the network and at rest, we also need to make sure that we hold the highest level of encryption, at least the same as or higher than any of the sources. So, for example, if we plugged in something like the Office 365 stack and we were analyzing something like mail or calendar events, we would need to make sure that we're running AES-256 encryption at the data level, and that we're using industry standard protocols for network transfer. So that's the security side of things. And then there's the privacy side of things. One of the ways that we help with this is that within our platform you can manage what's called the consent, you know, what do you have consent for when it comes to personally identifying data. And in fact, one of the ways that we help with this is because we're cataloging the data that comes through this pipe; we're looking at what are the fields coming from Salesforce, what are the fields coming from the ERP system, and this allows you to have a direct mapping of whether you have the consent from this individual to use their email address and their phone number and their first name and last name in specific situations. But governance is also so much bigger than just compliance. There are also the policies around what do we allow to flow through this stack. What if I see something like a credit card number come through this pipe? How should I react to it? And there are two different ways that we support reacting to that. One of those is just to be notified and alerted that, hey, you've got this data and it's high risk. The other is, hey, I need to stop, right, don't let this data go any further to upstream consumers like the data warehouse and machine learning platforms. And we often see that this is something that's not really covered in data governance; the typical data governance is, hey, can we set up some rules around ownership of data, ownership of products. I think one of the things that we're missing in our industry is how we're actually tracing that against the actual data, and I think this is why this is a problem for most companies, and some think it's kind of unsolvable to really meet all these regulations. So we're making sure that we cover these kinds of security blankets of, hey, if you've got data flowing through us, at least we've got you covered on these main pieces and concepts that you'll need.
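The "alert versus block" policy Tim describes for high-risk data can be sketched in a few lines. The detection rule (a rough credit-card pattern plus a Luhn check), the policy modes, and the record shape below are assumptions made for illustration, not CluedIn's governance engine.

```python
import re

# Very rough credit-card pattern; illustrative only.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number):
    """Standard Luhn checksum to cut down on false positives."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(d * 2, 10)) for d in digits[1::2])
    return total % 10 == 0

def apply_policy(record, mode="alert"):
    """Return (record_or_None, alerts); mode='block' stops high-risk records."""
    alerts = []
    for field, value in record.items():
        if isinstance(value, str):
            for match in CARD_RE.findall(value):
                if luhn_ok(match):
                    alerts.append(f"possible card number in '{field}'")
    if alerts and mode == "block":
        return None, alerts   # never reaches upstream consumers
    return record, alerts     # passes through, but the alert still fires

record = {"note": "customer paid with 4111 1111 1111 1111"}
print(apply_policy(record, mode="block"))
```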
16:26  Tobias Macey
And as far as the overall lifecycle of data as it traverses your system, can you just talk through the different components of how you're processing it and the systems that you're running, and some of the evolution that has gone on since you first began working on CluedIn?
16:42  Tim Ward
Yeah, definitely. I mean, you'll probably agree that everything in our world starts with getting the data. And so the integration layer, that's really where it starts, and I often like to say that platforms like CluedIn are useless if you can't put data through them. So the way that we help there is, first, we've got over 200 integrations to popular platforms like Salesforce and SQL Server and Oracle databases, Hadoop, you know, the common kind of big data types of environments as well. And one of the things that we do first off, as the data is flowing through, the absolute first thing is we start to score the raw data. We start to look at the data on about 12 different metrics, and these are classic data quality metrics like accuracy and completeness. And because we're using certain technology, like, one of the databases I'll probably touch on a few times throughout this is the graph database, this allows us to also measure things like connectivity of these records. And the reason why we bring it in raw is that one of the common architectures we see is that a lot of people are introducing this idea of a data lake, i.e. a place to dump all the data from all the different systems, and you get a ubiquitous language, typically via something like SQL, to be able to query this data in a common format. And that's what allows us to say, hey, that's great, if you've got that, instead of the data going directly from the sources to CluedIn, why don't we go through the data lake first, and then CluedIn will integrate with the data lake itself. And so that pivotal step of allowing flexibility is also why we often refer to ourselves as this data fabric. But I think we would both agree, Tobias, that data by itself is really not that valuable. In some cases, especially with privacy, it's sometimes more of a liability. And so the ability to score this data on the different metrics sets it up for a whole bunch of processing and tests that we're going to do on it. And the next natural step in the pipeline is, hey, well, I need to prepare this data to do some analysis. So one of the ways that we've tackled this is we actually use five different database families to store the same data. You can imagine that a relational database is very good at doing certain things. It's what runs a data warehouse, typically; it's very good at aggregating data; it's been around for a very long time, so it's solid and robust. But it's just not good at doing things like modeling data that is more of a graph or network model, which is, to be honest, how most businesses actually are modeled. And so we're saving the same records and persisting them into these different database families. There's a lot of value we get out of this. First of all, we can ask questions of the data that we couldn't just ask of a relational database, or God forbid, shouldn't ask of a relational database. But the other great thing is when we want to get data out of our system, we have some pretty interesting querying techniques. I can do queries that are part graph and part search index and part columnar store, and kind of mash it all together into this Frankenstein type of query. But we can also optimize that query to run against the right databases.
So once we've got it in these databases, of course, we're going to start applying these classic cleaning techniques like normalization of dates and things like phone numbers, and maybe even some simple things like genders. And of course things like normalizing types, like, hey, this is an integer, this is a float, let's normalize this so people don't need to do it upstream themselves. But things fall through the cracks, right? There's no magic to this industry; there are just some things that need a manual application. So for that, that's when we bring CluedIn Clean into the whole pipeline. It's the ability for a data engineer to really solve the normalization and standardization, the things that can't be automated very easily, or at least not with very high statistical confidence. So once we've gone through these cleaning techniques, really the next thing is the governance piece is applied, you know, are you even allowed to send this data upstream? Do you have any personal data coming through, or high risk data, that we should be alerting the system about?
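As a rough illustration of the scoring step Tim describes at the start of this answer, here is a sketch that grades a raw record on two of the classic metrics he names, completeness and validity. The vocabulary, the validators, and the scoring formulas are assumptions for the example; CluedIn's real model uses around a dozen metrics.

```python
from datetime import datetime, timezone

def score_record(record, vocabulary):
    """Score a raw record on a couple of classic data-quality metrics.

    `vocabulary` maps a core field name to a validator callable; both the
    fields and the formulas here are illustrative, not CluedIn's scoring.
    """
    expected = set(vocabulary)
    present = {k for k, v in record.items() if v not in (None, "", [])}
    overlap = present & expected

    completeness = len(overlap) / len(expected)
    validity = sum(1 for f in overlap if vocabulary[f](record[f])) / max(len(overlap), 1)
    return {
        "completeness": round(completeness, 2),
        "validity": round(validity, 2),
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }

vocab = {
    "email": lambda v: "@" in v,
    "phone": lambda v: v.replace("+", "").isdigit(),
    "job_title": lambda v: isinstance(v, str),
}
print(score_record({"email": "tim@example.com", "phone": "not a number"}, vocab))
```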
21:13  Tim Ward
And really, the endpoint of CluedIn is sometimes kind of unfortunate for us, because we stop right where the fun happens. At that point, we just say, hey, here is the data, it's been blended, it's cleaner, it's more accurate, I've traced where it came from, so you, or others, can now go do something with it. And that's why we've brought GraphQL into the situation. Have you had a chance to play with GraphQL, Tobias?
Tobias Macey
I haven't done much work with it, but I have had a number of conversations about it. I know that the DGraph project uses a modified version of GraphQL as their query language, and on my other podcast I had an interview with a gentleman who has been building tools for simplifying the creation of GraphQL APIs in Python and JavaScript. It's definitely an interesting approach to the overall problem of saying I just need this data, I don't care how it gets to me, and then pushing the responsibility further down the stack. So it makes it easier at some layers and more complicated at others, but I think that in terms of overall system design, it simplifies things and adds a fairly useful abstraction layer.
22:23  Tim Ward
And that's the word, it's the abstraction. You know, in engineering, when you dial up simplicity, you sometimes lose functionality, and I like to describe GraphQL more as, well, it's kind of like a schema for a query language. And you can do what you want with it; if you want it just to talk to a relational database, go for it. But in our case, we're in this good position where we say, hey, I've actually got the same data in five different database families. So if I want part of the GraphQL query to run against the search index, because maybe we're doing fuzzy searching, or maybe we're searching in different languages, and those are traditionally harder to achieve in the other types of database families, we can say, okay, why don't you do that part in the search index. And when you get to the part where you're getting the results from GraphQL and you need edges, well, don't get the relational database or the search index to do that, why don't you get the graph to do that piece. So that's why I sometimes describe it as this Frankenstein of languages, but like you said, that complexity is abstracted away from you. And in the end, all you get is the power of SQL, but just the SELECT part and the WHERE part, because by that point the data has already been joined. So this whole complexity around, well, I need data from Salesforce and I need it from Dynamics, but now I need to figure out the inner joins or the outer joins to actually blend that data, that's already been done before this. So that's why we also like to say that it makes this data much more accessible to the business, because we don't have to be domain experts in one system or many systems to get the data that we want out.
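To give a feel for the kind of federated query Tim is describing, here is a hedged sketch of a client issuing one GraphQL query where the fuzzy-search filter would be served by a search index and the nested edges by a graph store. The endpoint, schema, and field names are hypothetical and are not CluedIn's published API.

```python
import requests

# Hypothetical GraphQL query; the schema and field names are illustrative only.
QUERY = """
query ($name: String!) {
  search(query: $name, fuzzy: true) {
    entities {
      name
      jobTitle
      edges(relationship: "worksFor") {
        target { name }
      }
    }
  }
}
"""

def find_person(endpoint, token, name):
    """POST a GraphQL query and return the data payload."""
    resp = requests.post(
        endpoint,
        json={"query": QUERY, "variables": {"name": name}},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]

# find_person("https://cluedin.example.com/graphql", "TOKEN", "Tobias")
```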
24:05  Tobias Macey
One of the things that I was curious about, as you were discussing using these different storage engines and being able to leverage their different strengths and capabilities, is how you ensure that you're able to reconcile the records as they get split out across these different systems and keep track of them individually and in aggregate. And in the event that there's something like a GDPR right-to-be-forgotten request, how you're able to go back through and delete them from all the different systems where they need to be removed from, and just some of the overall record keeping that you use to manage all these different databases and ensure that there is some sort of consistency across them.
24:45  Tim Ward
Yeah, exactly. And, I mean, just by you asking that question, you've probably already done your research and thought maybe this is something where it's more eventual-consistency based. Because the backbone of CluedIn is a message queuing system called RabbitMQ, and all of the databases are essentially saying, hey, I'm just going to consume the queue called "create record". But even though the individual databases could be transactional, in no way do we wrap an entire transaction over five databases; that's just fraught with issues. Of course, those mechanisms exist, but I don't think it would even work in reality: what happens if we have a lot of retry mechanisms, where we try to insert into the first, second, third, fourth database and it fails, do we go back and clean all of this up? There's nothing transactional about that. So one of the ways that we've addressed this is that there is what you would call a write-ahead journal, like a log. All of the messages that are consumed off the queue, coming from these different sources, are actually just written to one log file, append-only, with a super fast write speed. And then what the databases do is basically read off that log. And the great thing about that is that instead of this kind of, hey, insert this, and then insert this next thing, and insert this next thing, these systems can say, hey, go get me 1,000 lines off the log and process that all in one big call to the database. And then, of course, that piece is transactional. So in the end, what you get is that CluedIn is a system that has eventual consistency across the stack. And to answer the question around the GDPR subject request and the right-to-portability part of the question, first of all, let's start with: it's hard, it's a really hard problem to solve. Sometimes people will talk about how one of the things they want to do with their data is the single view of the customer, and I've never seen a system do this very well without having to do a lot of the work yourself. When you think about it, the single view of the customer and this right to portability are similar problems: you want everything connected to an individual, but you really want to just highlight the things that are personally identifying, that's more for the portability, and you've got to have the right to remove that data if necessary. And you can probably start to see that a graph database seems to be a really good structure to host data connected to things, and in this case, it's a person. What we need to make sure we're doing is, as the data comes in, we're cataloging it. So we know, hey, this job title of CEO, it came from the Salesforce contact record. And what this lineage allows us to do is, if we want to change that record in our system, and you can think of it like a master data management type of component, say Tobias has rung up our company and told us actually we've got some of his data wrong, you know, like we've got the wrong job title and we've got the wrong phone number, we can actually change that in the CluedIn platform, and then it will basically say, hey, got it.
And this is what we call our mesh API. It can basically unravel and say, hey, here are the queries that you need to run against the source systems to be able to change those values in the source systems. Now, there is a lot of complexity around that, right? What happens if we want to delete a record, and then when we look back at the database we realize that it needs cascading deletes? It's going to throw us an exception and say, you can't just delete the contact, because he or she has referential integrity to these other tables. So there are all these complexities, and that's why a lot of the time we give this kind of skeleton and framework for our customers to implement. But I guess, Tobias, with this whole idea of unstructured data, and even structured data that's not clean, we often say to our customers, really, if you're wanting these types of privacy solutions, just know that it's more about confidence that you have the right data, rather than something that can just be 100% solved.
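The append-only journal plus batched, per-store replay that Tim walks through can be sketched quite compactly. The file layout, batch size, and SQLite target below are assumptions for illustration; the point is that each store applies batches transactionally at its own pace, which is where the eventual consistency comes from.

```python
import json
import sqlite3

JOURNAL = "journal.log"   # append-only write-ahead journal
BATCH = 1000              # each store pulls big batches off the log

def append(record):
    """Fast path: just append the message from the queue to the journal."""
    with open(JOURNAL, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def replay_into_sqlite(db_path, offset=0):
    """One consumer (here a relational store) reads the log in batches and
    applies each batch in a single transaction. Other stores keep their own
    offsets, so the stack as a whole is eventually consistent."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, body TEXT)")
    with open(JOURNAL, encoding="utf-8") as f:
        lines = f.readlines()[offset:]
    for start in range(0, len(lines), BATCH):
        batch = [json.loads(line) for line in lines[start:start + BATCH]]
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO records VALUES (?, ?)",
                [(r["id"], json.dumps(r)) for r in batch],
            )
    return offset + len(lines)  # the new offset for this store

append({"id": "tobias-1", "source": "salesforce"})
print(replay_into_sqlite("relational.db"))
```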
29:09  Tobias Macey
And particularly in the area of data cleaning, and being able to reconcile records across these different systems. I'm wondering what you found to be some of the most challenging aspects of that integration step and any specific requirements for domain knowledge or manual intervention, across industries or across sort of business verticals, and any steps that you incorporate as part of the onboarding process and ongoing execution of these integration routines to make sure that your customers are able to ensure that these integrations and reconciliations are happening appropriately?
29:46  Tim Ward
Yeah, well, I think it starts with this idea of what we call the core vocabulary. You can think of it like a schema as well. Basically, what we've done is we've said, okay, listen, a person is a person; whether they exist inside Salesforce or Dynamics or HubSpot, it's still a person. They have these generic properties that exist no matter what the source system is. Now, the systems might call them different things; one might call it last name and it might be surname in another, but effectively they're the same thing. So this idea of the core vocabulary is what gives us this pillar of structure in what could be very unstructured. And the complexity with integration really starts with the conceptual complexities. I know you and I, Tobias, could probably sit in a room with a whiteboard, and if we had three systems to connect, we could probably draw some boxes and say, oh, you have to join on this table, and that'll give you the ID to join onto tool three, and things like that. But if you look at the enterprise, as you can imagine, it's not three tools, it's 300, it's 3,000. In some of our customers' cases, it's over 10,000 different systems. That's a lot, and sometimes it's due to things like acquisitions, right, where you suddenly inherit this whole new technology stack, and what do you do about it. So one of the complexities is there. And one of the ways that we help address this is that, well, if the end goal of this blended data was a relational database or a data warehouse, I would have to be really careful about how I model this data. I wouldn't want to model it in a way where querying it would take too long, or where joining between too many tables is just never going to scale at all. But the graph database is really good at flexible modeling. There are things it's not good at, but that's one of the things it's really good at. So what this allows us to do at the integration level is say, hey, let's not only take a system at a time, let's take just an object at a time. And I'm not interested at all in how the product table and the customer table join; I'm not interested in that integrity. I'm more interested in what's a unique reference to this person or customer. Is it their email? Okay, let's flag that. Is it their phone number? Well, not unique, but it's a potential alias for it. And what that allows us to do is, in what is typically an ETL world, where you might use something like SSIS or other ETL tools, we're much more ELT, where, hey, I'm actually going to load in all the data first, and I'm going to reverse engineer how the systems are joined. So you can probably understand that when we plug in one system, we've got this customer in the graph, and he or she is just floating in the graph saying, hey, I don't connect to anything. It might be one month later, when we've plugged in system 15, that this other node says, hey, we've got the same ID, we should merge. So those design principles have allowed us to scale to these larger installations. I think the final point around cleaning is that places like CluedIn, as I described before, as a data hub, are the place to really standardize on how we're going to represent gender, or how we're going to represent something like these categories or these labels.
And I'm not so worried about changing them in the source systems, because they're probably like that in the source systems for a very good reason. But I do want to make sure that any upstream consumers of us are getting some consistency, they're getting standardization, so they don't have to take care of the fact that, yeah, in Salesforce it's called gender and in Dynamics it's called sex, and in our custom built system it's in Danish or another language. I don't want them to think about that at all; I want to standardize on that at the CluedIn level. And with the classic cleaning challenges that we see, there are some really nice edge cases. Something as simple as, hey, let's go get the gender field, let's group the values or cluster them, and let's fix things like, oh, someone spelled "male" wrong, or someone's done it in Danish, so "kvinde" for a woman. Let's clean those types of things. But sometimes we run into situations where it's not so easy. For example, you might have a system that's trying to optimize for performance, and so it stores gender as a zero or a one. Let's just say that zero represents a woman and one represents a man; they might do that in one system, but in another system the business flipped those. So it's not always as easy as just saying, hey, turn all the zeros into female and all the ones into male. There are these business cleaning rules that come in. So those are some of the complexities that we run into.
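A minimal sketch of the two ideas in this answer, mapping source field names onto one core vocabulary and applying per-source business rules for values like gender codings that flip between systems. The systems, field names, and codings below are made up for illustration.

```python
# Map each source system's field names onto one core vocabulary, then apply
# per-source business rules for values that can't be normalized generically.
CORE_VOCAB = {
    ("salesforce", "LastName"): "person.lastName",
    ("dynamics",   "surname"):  "person.lastName",
    ("salesforce", "Gender"):   "person.gender",
    ("dynamics",   "sex"):      "person.gender",
}

GENDER_RULES = {
    # one system stores 0 = female, 1 = male ...
    "salesforce": {"0": "female", "1": "male"},
    # ... another flipped the coding, so the rule has to be per source
    "dynamics":   {"0": "male", "1": "female"},
}

def to_core(source, raw):
    """Translate a raw record from one source into core-vocabulary fields."""
    core = {}
    for field, value in raw.items():
        key = CORE_VOCAB.get((source, field))
        if key == "person.gender":
            value = GENDER_RULES[source].get(str(value), value)
        if key:
            core[key] = value
    return core

print(to_core("salesforce", {"LastName": "Ward", "Gender": "1"}))
print(to_core("dynamics",   {"surname": "Macey", "sex": "1"}))
```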
34:49  Tobias Macey
And as far as being able to manage the data lineage or data provenance, and then expose it to the customer so that, if there are any issues with the cleaning or reconciliation, or if there's just some incorrect data in the record, they're then able to trace it back to the source system and maybe do some fixes for how the data is being transferred to you, or fixes in the source system itself, or in how they're capturing the data originally, I'm just wondering what you're using to manage that and some of the challenges and benefits that you found in the process of building that system?
35:23  Tim Ward
Yeah, I think, first, it's probably a good idea to start with the object structure that we use. You can think of the object structure that we use as kind of like a Git object, right? I would refer to it as a versioned object graph, where you've got all the history, all the permutations, all in this kind of binary. We've leveraged that idea, and we call this model our clue. And the interesting thing behind the clue model is that, hey, just because you have the data doesn't mean it's true, right? We need to be statistically confident, and making it statistically confident is something you get by throwing clues through our pipeline. And so what this allows us to do is then say, okay, well, I generically take in clues, and all of those clues have an origin, and all of those origins also need a little bit more detail; they need what account it was added under as well. We want complete uniqueness of where the data came in. This allows us to, I guess you could say, unravel this giant object, in some cases an absolutely huge object. You can imagine that maybe an Excel sheet, or the version history of a very old Excel sheet, is going to lead to a potentially larger clue object. So we maintain all of that history. And what's interesting is that, I guess, the GDPR example is a good one: if I ever needed to purge the data, what I can actually do is say, hey, go get this record for Tobias, unravel all the history, and then remove the 13th clue that we got from Salesforce, and then reprocess that as if we never saw it in the past. So that's one way that we help with the lineage piece. But the other complexity with lineage is, where's the data going? One of the ways that we help with that is, we have our GraphQL endpoint, and every time that you run a query, it will actually generate a new API token. So it's like, okay, well, we can use that to trace where that particular API token was used. The other way that we expose our data is still through GraphQL, but instead of this classic paging of data, it will stream the data out to you. So it will use this kind of GraphQL streaming technique to say, hey, I've seen a new record, it's Tobias, and you have a GraphQL query that is looking for people called Tobias, hey, I would match that, where should I send you? And that stream can either be a classic stream, maybe something like a Kafka stream or a Spark stream, or it could just be something as simple as, hey, do an HTTP POST over to this endpoint. Those are some of the ways that we're able to trace that, hey, Tobias's data has come in from Salesforce and Dynamics, it's gone through the lineage and tracking in our system, and then we pushed it over to Power BI and to Azure ML. So you have that kind of end-to-end lineage.
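As a rough illustration of the push side Tim describes, here is a sketch of dispatching a newly matched record to a subscriber, carrying its lineage and the per-query token along with it. Only the HTTP POST branch is shown; a Kafka or Spark sink would slot in the same way. The payload shape and subscription fields are assumptions, not CluedIn's wire format.

```python
import json
import urllib.request

def push_match(record, subscription):
    """When a new record matches a subscriber's standing GraphQL query,
    push it to wherever they asked for it. All names are illustrative."""
    payload = json.dumps({
        "record": record,
        "lineage": record.get("sources", []),   # where the data came from
        "query_token": subscription["token"],   # per-query API token for tracing
    }).encode("utf-8")

    if subscription["kind"] == "http":
        req = urllib.request.Request(
            subscription["url"],
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)
    else:
        raise NotImplementedError("Kafka/Spark sinks omitted in this sketch")

# push_match({"name": "Tobias", "sources": ["salesforce", "dynamics"]},
#            {"kind": "http", "url": "https://example.com/hook", "token": "abc"})
```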
38:10  Tobias Macey
And with this problem of managing the destinations and being able to push out to these systems that other people are using for analysis, but also the other side of being able to consume the data, and the number of integrations that you mentioned that you have, I'm wondering how you manage any changes in the APIs or the data formats, or any breakage or failures that might occur because of network outages, system failures, or maintenance by the third parties, and just your overall strategy for ensuring that your system remains operable in the face of all these potential edge cases and failure modes?
38:48  Tim Ward
Yeah, I mean, I will start by saying this: the hardest part of CluedIn is the maintenance of third-party integrations. I think one of the sanity points for us is that, because our end customers are enterprises, they are well aware that this happens, right? If we're going to plug Dynamics into our system, I know, in the world we live in, that people are going to introduce new objects in Dynamics, they're going to introduce new properties. And how do we handle these? Well, in a lot of cases, and I'll take the Dynamics or Salesforce example, you can do absolutely anything you want in those systems. You could branch out and not only use them for leads and contacts, but you could put animals or hot dogs or any object you can think of in there. It was a silly example, I'm aware, but a lot of these more enterprise tools expose discovery endpoints, where you can say, okay, well, instead of just guessing what your endpoints are, why don't you tell me what objects are available in your system. Now, this really only happens in the enterprise types of tools. Let's take something like HubSpot, which is more of a CRM targeted towards small and medium businesses. It happens all the time, and you can probably imagine this as well, that they update their API and the documentation is not reflecting that. So one of the interesting ideas and processes we had in the past was, why don't we just watch the API documentation pages, and when there's a content change, it'll alert our system and alert our team that, hey, something's changed on the page, and we will just use a standard diff tool to tell us what's changed. And this is kind of flawed to start with, because usually the smaller companies are also innovating so fast, they want to get these things out. And what happens in the end is that there are times when this just breaks, it just breaks. A good example would be HubSpot stops giving you back a list of people and now gives you back a dictionary of people, and the serializer in your crawler didn't cater for that. I mean, the good thing is that you get signals from the system immediately that you have this issue. And in a lot of other cases, platforms are getting better and better at versioning, you know, semantic versioning, and making sure that they release versions of their API, so I guarantee that if you're using version two, I'm never going to change that, but if you want to move to version three with the new stuff, yeah, sorry, you're going to have to do some rework to deserialize that and move to the new models. So it does happen. It's a big piece of the work that we do, but with proper alerting, we're pretty quick to act. It's only in these cases where there have been such fundamental changes, and that doesn't happen often. Like I said before, one of the sanity checks for us is that our customers host this infrastructure themselves, and so typically they've got people dedicated to, hey, we've plugged in Oracle databases and Workday and enterprise systems, and those are usually a little bit more mature around their release policies. So I guess what I'm saying is, it happens, and it's a very hard problem to solve.
And I think the way to solve it is that businesses really need to follow more standards, right? We need to move towards much more standardization of, hey, if you want to be a good business and expose your data, you need to adhere to at least these rules of the game. And actually, one of the places that we're seeing this, in an odd kind of space, is in the authentication space. There are some proposals for standards out for the new OAuth. OAuth 2, of course, is one of the industry standards for authentication between systems, and in the new version they're talking about, hey, how can I just have applications talk directly to each other? How can I automatically spin up new applications without having to do them through some type of developer portal? So we are seeing the shift to more standardization, but I believe it's the only way that we will actually get sanity and any control over the complexities of integration.
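The HubSpot-style breakage Tim mentions, a list response silently becoming a dictionary, is the kind of thing a crawler can defend against with a small shape-tolerant parser that fails loudly on anything genuinely new, so alerting fires instead of data silently disappearing. The payload shapes handled here are illustrative assumptions, not HubSpot's actual response formats.

```python
def extract_people(payload):
    """Accept both the old (list) and new (dict) response shapes, and raise
    loudly on anything else so monitoring picks it up immediately."""
    if isinstance(payload, list):
        return payload                        # old API shape: a plain list
    if isinstance(payload, dict):
        if "results" in payload:              # new shape: wrapped in a key
            return payload["results"]
        return list(payload.values())         # new shape: keyed by id
    raise TypeError(f"unexpected payload shape: {type(payload).__name__}")

print(len(extract_people([{"id": 1}, {"id": 2}])))
print(len(extract_people({"1": {"id": 1}, "2": {"id": 2}})))
```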
43:09  Tobias Macey
And to your point about being able to alert on different failure cases, or being able to keep track of the overall health of the system, I'm wondering what types of metrics you're using to keep an eye on that. And in the environments where you are actually deploying the system to customers' infrastructure, I'm wondering how you manage the overall lifecycle and deployment model to ensure that they're able to stay up to date with the system and that they don't have too much drift, where, you know, they may decide five versions from now that they're actually going to upgrade, versus staying current with the way that you've architected the system?
43:47  Tim Ward
Yeah, I think there are a couple of technology choices that we bet on early that have just turned out really well for us in this space. So I guess the first thing is that, coming from an enterprise background, upgrading was one of the big issues: how do you get your customers to upgrade? In many cases, it will cost that customer quite a lot of money to put in the effort to upgrade your product. And so from day one we bet on Docker, we bet on containerization and these orchestration frameworks, and what our deployment to our customers looks like now is, hey, here are some Docker containers in a hub, here are Kubernetes Helm charts, feel free to use the services in the cloud to deploy those, or deploy this into your kind of VMware environment. And when I'm thinking about signals and metrics, I've actually split them into two fields. I think previously it was always around DevOps types of metrics, but I actually think our signals are moving more into the DataOps side of things. So for DevOps, of course, there are lots of metrics and, of course, third-party tools, and a lot of them are actually open source and fantastic and just have native support for Kubernetes. So spin up your Docker containers, spin up your databases, put it into a Kubernetes cluster, orchestrate that with Helm charts as well. Oh, and let's bring in Grafana, because that comes free; let's put in StatsD, because that's got native support for it. So for the DevOps side we would really just say, hey, we're going to stay out of it; we're going to give you industry standard ways for deployment, and then pick your tool of choice. Now, the signals on the DataOps part become interesting, because that's kind of a new field. And when I talk about DataOps, it's more, you know, if you want to be data driven, it's about how do you maintain a consistent flow of data throughout your business. And the thought of something as simple as a password change meaning that data wasn't flowing from Salesforce for 24 hours or longer, I refer to that as a data outage. So some of the signals and metrics we're sending from our system are, you know, you've got an expiry on a token soon, would you like to generate a new one and schedule for me to switch it in and out at a certain time? A lot of the other metrics, of course, I've already touched on, would be things like data quality and data accuracy. And that's, I guess, not as much on the DataOps side, but they're still important things. Because what I see in fields like ours is that what's becoming more important is, how do I have people that are monitoring that I'm sending good quality data upstream? Because everyone's relying on it, and to be honest, if I fix a few things, there are multiple people that benefit from it, right, we're all listening off the same pipe. And whether I'm a BI system or an ML system, they all want clean, good quality, enriched, complete data, and blended data is the other big thing as well. So I think the signals for us are really broken up into the DataOps part and the DevOps part.
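A small sketch of the DataOps-style signal Tim gives as his example, warning before a source credential expires so data keeps flowing instead of silently stopping. The integration records, thresholds, and the plain print output are illustrative assumptions; in practice the warnings would feed whatever alerting stack (Grafana, StatsD, and so on) the customer already runs.

```python
from datetime import datetime, timedelta, timezone

def check_token_expiry(integrations, warn_days=7):
    """Warn before a source credential expires, so a simple password or token
    change doesn't turn into a day-long data outage."""
    now = datetime.now(timezone.utc)
    warnings = []
    for name, expires_at in integrations.items():
        remaining = expires_at - now
        if remaining <= timedelta(days=warn_days):
            warnings.append(
                f"{name}: token expires in {max(remaining.days, 0)} day(s); "
                "schedule a rotation"
            )
    return warnings

tokens = {
    "salesforce": datetime.now(timezone.utc) + timedelta(days=3),
    "workday":    datetime.now(timezone.utc) + timedelta(days=40),
}
for line in check_token_expiry(tokens):
    print(line)   # in production this would go to the alerting stack instead
```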
46:59  Tobias Macey
And what have been some of the most notable customer success stories that you've experienced, or interesting or unexpected ways that people have used the CluedIn platform?
47:08  Tim Ward
I mean, the one that always stands out for me is one of our cases around data privacy. As a lot of businesses were, we were getting to that time in May 2018, and a lot of companies were kind of scrambling: what do we do? We had a larger company that came to us, just over 100,000 employees, and they said, hey, it's March, we have 124 applications, we're over 100,000 employees, we have decades of historical data, and we want to hit May 25th, what can we do about that? And once again, I think it came down to some core technology choices, like containerization, clear separation of concerns, and backing the system with a service bus: hey, to scale this thing, you just introduce more machines, subscribe to the queues you want, and you're processing faster. Of course, the trick with that is, every time you scale, you move the bottleneck somewhere else. So it was an interesting challenge getting the system to basically say, hey, the bottleneck is the sources, we can't pull any faster because they're throttled, or because of the quality of service that we need on the line, because these are production systems. I think that's one of the great stories, just being able to integrate that amount of data within a few months and have them hit that date with high confidence that they could be compliant and fulfill the regulations. That was one. I think, with some of the banks that we've been working with, a lot of people are suffering from this kind of history of siloed data. One of our customers has over 10,000 systems, and we don't integrate into all of them; in fact, it's a smaller subset. But the part I like about it is that we gave them a view that they could actually do this. It's daunting to think, how would you integrate 10,000 systems? And now, with this kind of approach, where you take one system at a time, and even one object at a time, I like the fact that it's giving them this visibility of, we could actually do this.
49:26  Tobias Macey
And what are some cases where a customer was interested in using CluedIn and it either wasn't the right fit, or they tried to get started with it and weren't able to meet the goals that they set out with when they first started engaging with you?
49:41  Tim Ward
Yeah, definitely. When you enter into the land of machine learning, you enter into the land of magic. So we've had some pretty fantastic examples of failures. One of my favorites is our first customer in the food industry. As we were integrating their systems, there were a lot of PDF documents in file storage, which were actually all the menus of the different restaurants that they support; they're an aggregator of restaurant data. Well, due to the way that statistics work, and a lot of these natural language processing techniques are just based on statistics, it detected all these menu items, like spicy chicken as a person and lamb korma as a person. Our system would pop up and say things like, hey, here's the phone number of Spicy Chicken, would you like to give them a call? And it's really hard in those situations to explain that this isn't done off rules, it's not done off business logic, it's done off statistics. We all had a laugh about it, but for me it also showed the reality of machine learning, where it's useful and where it's not. So that's one of my favorite failure stories. I have to tell one more, because I always do this to embarrass our CTO, but I think you'll get a laugh out of it. Our CTO, Martin, kind of looks like your classic nerd: he's got very long hair, death metal shirts, and a beard; he looks a little bit like the pictures and drawings of Jesus. We were using image object recognition to scan through images and say, hey, here are some objects in this image. To test out the system I put in a picture of myself, from my brother's wedding, and it picked up glasses and suit and bow tie, and we thought, wow, this is amazing, how does this work? Then we put Martin's picture through the same engine about 30 seconds later, and with 90% confidence it classified him as a fur coat. You really just have to look up a picture of him, his name is Martin Hyldahl, to get the full effect of the joke. But those are the other kinds of failures we run into.
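To make the "Spicy Chicken as a person" point concrete, here is a tiny named-entity-recognition sketch. spaCy and its small English model are used only as stand-ins for a statistical extractor; they are not what CluedIn actually runs, and the menu text is invented.

# Small illustration of how purely statistical NER can mislabel dish names.
import spacy  # pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

menu_text = "Spicy Chicken and Lamb Korma are available after 5pm, call 555-0134."
doc = nlp(menu_text)

for ent in doc.ents:
    # A statistical model may tag capitalized dish names as PERSON, because
    # the token patterns resemble personal names in its training data.
    print(ent.text, ent.label_)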
52:24  Tobias Macey
Yeah, I think I might have actually seen the example photo when I was looking through your blog to prepare for the show, and I got a good laugh out of it.
52:35  Tim Ward
It's one of those things that never fails to make me laugh as well.
52:39  Tobias Macey
And so are there any other cases where you found that CluedIn is not the right choice and a company or organization is better off using a different system, whether because of the size of the organization and the complexity that they're dealing with, or because of the overall goals that they have for managing their data integration, or any sort of issues with control or visibility into the system?
53:04  Tim Ward
Yeah, I think one of the big things is that a lot of people are throwing around this term real-time data, right? So IoT data and signal data; I would say we play no role in that world. It's like that classic saying in engineering: you've got all these dials, speed, cost, and if you move one, the others move accordingly. For me, IoT data is about getting it streaming in as fast as I can from the plane or the jet or the wind turbine, and putting it into a logging system that shows me real-time sensor data. Cleaning, accuracy, and quality of data don't factor in as much, because if the data is incorrect, maybe that's just a bad reader you've bought, right? Now, that's easier said than done, because people buy different readers and they all send data in different formats. But I would say that's one case where we just stay out of it. If you don't need anything tracked, cleaned, or blended, there's no need to put your data through our pipeline. I think overall it's still nice to be able to set up a foundation for a company where you say, okay, IoT data and signal data stream into something like a Kafka stream, or maybe you're using IoT Hub or Event Hub in Azure, and you just push it directly to the logging application; CluedIn really doesn't play a role in that story.
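As a rough sketch of the "stream sensor data straight to the log/analytics sink" pattern Tim describes, the snippet below pushes a raw reading onto a Kafka topic with kafka-python. The broker address, topic name, and reading payload are all hypothetical; Azure Event Hubs could stand in for the broker, since it exposes a Kafka-compatible endpoint.

# Minimal sketch: raw IoT readings go straight onto a stream, with no
# cleaning, blending, or mastering step in between.
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

reading = {
    "sensor_id": "turbine-42",
    "rpm": 1730,
    "recorded_at": time.time(),
}

producer.send("iot.turbine.telemetry", value=reading)
producer.flush()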
54:44  Tobias Macey
And what are your plans for the future of clued in both from the technical and business perspective,
54:51  Tim Ward
I think from the technology side it's really about continuing with the robustness. From a feature set and functionality perspective, we're really happy with where we are, and it's more about making sure that we can adapt to new things that will come in our industry. It's moving so fast; I didn't know what a data lake was a few years ago, and now it's the kind of thing that everybody wants. In two years there will be new things, and it's really about making sure we're building in the robustness and flexibility to adapt to those new requirements. From a technical side it's also about always keeping up with industry standards and moving to frameworks that are better; for example, we're mostly built in .NET, but we've moved to .NET Core to get the wins from that. From a business side, it's more about moving into other countries where they have the same issues; we've recently added customers in Australia, the UK, and the US, and as you can probably imagine, these problems exist everywhere with businesses. So I think that's the plan from the business side.
56:07  Tobias Macey
And are there any other aspects of CluedIn, or of data integration or data engineering, that we didn't discuss yet which you'd like to cover before we close out the show?
56:16  Tim Ward
I think I've used more data acronyms than anyone could take any more of, so I think we've had a good discussion.
56:25  Tobias Macey
All right. Well, for anybody who wants to get in touch with you or follow along with the work that you and your company are doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
56:42  Tim Ward
Yeah, good question. I mean, I think I'm naturally biased when I say this, but I think it's the stitching, right? I can go out and have a plethora of technology choices, but how are these things going to work together? We can't forget that integration is not just from the data sources into a pipe; it's from the pipe into the governance, and the governance into the lineage, and all those different components. So I'd like to see us, in a way, standardize on what is the plumbing that businesses need. I think that's the biggest piece that's missing from the overall story, and it's why companies are finding it so complex to get value out of their data.
57:24  Tobias Macey
All right. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at CluedIn. It's definitely a very interesting platform and a challenging problem domain, so it's always great to get a view of how different people are solving it. I appreciate that, and I hope you enjoy the rest of your day.
57:39  Tim Ward
Thanks, Tobias. Pleasure.