Data in Construction

Data Lineage and Graph Databases

Episode Summary

Data lineage is the key to trust in our data, and ultimately how a data-driven culture can thrive in today's construction industry. By tracking where data came from, what's been done to it, and where it's going, teams are able to ensure they know exactly what's going into everything from dashboards to analytics to AI models. Data lineage is about relationships, often quite complex ones, and that is where graph databases make the most sense. In this episode, data expert Wouter Trappers outlines the importance of data lineage, and how graph databases can help.

Episode Notes

Follow Wouter here

Check out Xudo.be

Check out the Medium article that started it all: https://medium.com/p/8cbf0497d5a6

From The_Link:

Subscribe on Apple: https://podcasts.apple.com/us/podcast/data-in-construction/id1604092908
Subscribe on Spotify: https://open.spotify.com/show/2AUUpaT0yYueyah826JOOQ?si=a34ca4e3acf24835
Sign up for the Data in Construction Book: http://eepurl.com/hTtFPH
Sign up for Data in Construction skills webinars
Buy The Construction Technology Handbook here: https://www.amazon.com/gp/product/B08PNHBB1M/ref=dbs_a_def_rwt_bibl_vppi_i0

Episode Transcription

Hugh Seaton: Welcome to Data In Construction, I'm Hugh Seaton. Today I'm here with Wouter Trappers, founder and CEO of Xudo. Wouter, welcome to the podcast.

Wouter Trappers: Hi Hugh, great to be here.

Hugh Seaton: I was really intrigued by a Medium post that you did about data lineage and graph databases. I wanted to ask you about that, and I'd love to start with what data lineage even is.

Wouter Trappers: Yeah, that's a great question, but let's first take a step back to describe the context in which data lineage makes sense before jumping right in.

When serving your customers, you get certain inputs, for instance, in your context, construction plans, and you deliver certain outputs, like a building. But to get from those inputs to the outputs, you go through a set of steps. These are your business processes. Now, to deliver these outputs in an efficient way, you use software systems to support them.

These systems can be ERP systems to keep track of orders and stock or financial systems to do your accounting or CRM systems to keep track of customer contacts and sales opportunities. These systems contain what I call small data. It can be large datasets, but they are stored in a very structured way in the underlying databases on the transactional systems.

There are very valuable use cases to be developed with big data as well, but let's put that aside for now. Xudo's motto is: first put your small data in order before starting big data analytics.

Hugh Seaton: I love that, that you're saying get, you know, get the small bits in order before you worry about big structures.

Wouter Trappers: Yeah, that's another discussion. So let's first talk a bit more about the databases supporting the transactional systems.

They are usually not very well suited to getting insights out of them because of the way they are structured. Sometimes they have a reporting module that can be useful to track what's going on in the system. But these modules are confined to the data in this one system, and they don't allow you to get transversal insights over different business functions, or silos, if you will.

And it's the data visualization dashboards that deliver transversal insights, where you can look for trends and follow up on KPIs. Here a lot of value can be generated from your data. Generating value from small data can be done in three forms: more revenue, less cost and peace of mind.

Now, this sounds very easy, but let me assure you, it's not that easy, right? For example, generating more revenue through better insights can be the case when you generate more insight into your customers' behavior, leading to better pricing or doing business with more profitable customers.

So that's more revenue. Less cost can be realized by optimizing business processes and cleaning up inefficiencies. Peace of mind can come through transparency about what's going on in your business: having the feeling of being in control and being compliant with regulations that might exist in your industry.

Now here, you have to be careful.

There is a story about searching where the light is. It goes as follows. A policeman sees a drunk looking for his keys under a street light. He asks, where did you lose your keys? In the park, the drunk answers. But why are you looking here then? Because this is where the light is, the drunk answers. So before starting a project to build an analytics dashboard, ask yourself: what is my business strategy?

Don't just start using the data you can find, or that you can get to. Don't step into the pitfall of looking where the light is. You really have to start from your business strategy. What key performance indicators will help me follow up on whether I'm realizing this strategy, and what data do I need to calculate the KPIs?

So do I have this data and if not, what do I have to do to get this data? And only then you can build dashboards to follow up on the KPIs.

Hugh Seaton: Let me stop there a little bit, because we've talked about this in some other episodes and it's really worth underscoring that it's so easy to start with the data you have, or with what you think you trust already, and not really anchor your process and what you're doing on questions. You're talking about that through strategy; I mean the same thing. What is our business strategy and what do we need to know to execute that? So I love that you're anchoring this, and the idea of, I'm looking here because I have better light, is such a nice analogy for what often happens, right?

It's, let's do a lot of analysis on things that maybe aren't so important, but my gosh, the data is good.

Wouter Trappers: Yeah, indeed. That's exactly what I mean by avoiding searching where the light is, and really starting from the questions you're trying to answer. It can be that you don't have all the data points available yet to answer the questions you need to answer to know if you're realizing your business strategy.

So let's try to come up with an example. It's very easy for people, when they are filling in their systems, to just pick the top option in the dropdown box. You know, it's easy: I don't care what this data is used for, because I want to do my job as quickly as possible. But then, of course, your data is of lower quality.

And then you have to ask yourself, okay, how can we motivate our people to pick the right option from the dropdown box and not just go with the easy one, the top one? So then you have to explain why you are using this data and what it contributes to the business strategy that you're trying to execute.

So it can be the case that in certain processes it is important to indicate what type of customer the service or the product is delivered to. You can have different types of customers, for instance government, small businesses, large businesses, businesses in different sectors. So if you want to have a better customer segmentation, it's important that the people who go through the process and fill out the data in the transactional systems know why they are doing what they are doing.

They know the importance of picking the right customer segment in the dropdown box. And then it can help you to follow up.

Hugh Seaton: That makes a lot of sense. Okay. So how is it that you think about, the steps of an analytical project and the, the kind of the role of a dashboard?

Wouter Trappers: The dashboard is actually used to provide feedback to the business processes. So the dashboard displays the data that is coming from the business processes, and it allows you also to give feedback to the processes where this data was generated and this kind of feedback loop from the analytics dashboard to the transactional processes needs a certain culture to be successful. For instance, a culture of trust of environment and of accountability. So be sure to build a change management project around this, to communicate within the organization, how this will be done, so how the data will be used and why.

Hugh Seaton: It's interesting; one of the things I really liked about the idea of data lineage is that it's how you do some of that, right? By keeping track of where data comes from and what's been done to it, people can start to trust it, because they know where it came from and, again, they know what's been done to it. As you think about how change management projects have worked, things you've seen or heard about, what are some tactics? What are some things people do to make people aware of data lineage, or make it part of a practice to help with that trust? Are there things that you've come across that help out?

Wouter Trappers: Yeah, it's not just confined to the data lineage type of projects, but involve the specialists you have in your company and who are the specialists in change management, and in communication?

Often it is the HR people and the marketing people. So make sure they are involved and make sure you pick their brains on how to talk to people about the change that's coming. So HR can do this from an organizational design standpoint.

Okay. What roles do we have in the organization? What roles might change? How might they change? What is the impact on the people that are performing these roles and marketing might come in from a communications specialist standpoint. How are we going to communicate about these changes?

And some very practical tips: give the project a name, because then people can use the name of the project to refer to it. It can be something pretty basic, like Data Lineage 1.0 or, I don't know, the Lineage Project. I'm just making things up now. So that's why the marketing people have to be involved, to come up with a good name to be able to talk about it.

And also, don't be afraid to really run this as a communication campaign, but internally: put up posters, build a landing page on your intranet, things like that. Very practical and simple things. If you involve the right people in your organization, it will not be that complex, but it will make a big difference in landing the project and in communicating the why of the project.

Hugh Seaton: You know, what's interesting also, though, is that construction can be so decentralized and so non-office-centric, certain parts of it for sure, since the main economic activity happens on a job site. What often happens is that people adapt some of the ideas you've just described, and they'll do things like, well, let's work with project teams and do small data pilots.

And while we're doing that, we'll often say, you can trust this data because here's a little indicator of where the data came from, whether it's a graph or just a kind of source. It's about getting project teams, whether it's a PM or a PX, to believe that this is necessary and a critical part of how they do their jobs.

And part of that being critical is that we trust where the data came from. So this idea of starting with HR and helping to craft the message, either with the marketing team or at least with a marketing sensibility, depending on what sorts of resources they have, I think that's a really important point: giving it a name, giving it the ability to be communicated, as opposed to it being kind of a nebulous thing that we're doing, just another one on a list. I really liked that.

Cool. Yeah. So do you want to kick into how to start building data pipelines?

Wouter Trappers: Yeah. In the story earlier I left out one of the most important steps, because there's one step between the transactional systems and the visualization dashboards. And that's the step where the actual data pipeline is being built.

So the way data is stored in the databases behind transactional applications is usually not meant to support analytics. The data model was designed to process transactions, not to serve as a source for analytics. This means we need to transform the data to another data model before we can link data visualization dashboards to it. And this is done in the data pipeline.

So before you can start building this data pipeline, the data in the databases of the transactional systems needs to be of good quality. It needs to be accessible. And in our current world, where a lot of SaaS applications in the cloud are supporting the business processes, that means that the APIs have to serve the data that you need. Remember the story about searching where the light is?

If you don't have the data, if it's not served from the systems through the API into your data pipeline, you cannot use it. So we have to be aware of this and discuss with the vendors of the SaaS applications: if there's some data missing, how can we get to this data?

And also keep this in mind when you make a tool selection: what kind of API are they offering, and what kind of flexibility do they have around these APIs? The larger the vendor, usually the less flexibility, and the larger the APIs they are offering to begin with.

In general, you could say that the data model behind the transactional application is a relational data model, while a data model that can serve as a basis for data viz dashboards is a dimensional model. I'm probably describing this a bit too simply, because you also have streaming data coming in, and other types of data models for analytics, like data vault and data mesh, that have been developed in recent times. But for now let's keep it simple; for the importance of data lineage, the principles are very much the same.
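To make the step from a transactional model to a dimensional model a little more tangible, here is a minimal Python sketch. It is only an illustration under stated assumptions: the order rows, the field names and the helper functions `to_dimensional` and `revenue_by_segment` are all hypothetical and do not come from the episode or from any particular tool.

```python
# Hypothetical transactional rows, as an order system might store them.
orders = [
    {"order_id": 1, "customer": "Customer A", "segment": "government", "amount": 1200.0},
    {"order_id": 2, "customer": "Customer B", "segment": "small business", "amount": 300.0},
    {"order_id": 3, "customer": "Customer A", "segment": "government", "amount": 800.0},
]


def to_dimensional(rows):
    """Split transactional rows into a customer dimension and an order fact table."""
    dim_customer = {}  # customer name -> dimension row with a surrogate key
    fact_orders = []
    for row in rows:
        name = row["customer"]
        if name not in dim_customer:
            dim_customer[name] = {
                "customer_key": len(dim_customer) + 1,
                "customer": name,
                "segment": row["segment"],
            }
        # The fact row keeps the measure and points at the dimension by key.
        fact_orders.append({
            "order_id": row["order_id"],
            "customer_key": dim_customer[name]["customer_key"],
            "amount": row["amount"],
        })
    return list(dim_customer.values()), fact_orders


def revenue_by_segment(dim_customer, fact_orders):
    """Aggregate the fact table by a dimension attribute, as a dashboard would."""
    segment_of = {d["customer_key"]: d["segment"] for d in dim_customer}
    totals = {}
    for fact in fact_orders:
        segment = segment_of[fact["customer_key"]]
        totals[segment] = totals.get(segment, 0.0) + fact["amount"]
    return totals


dim_customer, fact_orders = to_dimensional(orders)
totals = revenue_by_segment(dim_customer, fact_orders)
print(totals)
```

Each transformation of this kind is one step in the pipeline, and it is exactly these steps that data lineage has to keep track of.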

Hugh Seaton: And, and the reality is most of what we're dealing with is, is things that are locked up in a SaaS application. So I think that's, that's a really good place to stick.

And your idea to start with what is coming in from APIs, from your project management software, which is going to be one of the places where a lot of this happens, I think is a really good one. Some of them have really good APIs and some of them have less good APIs.

Really good point.

Wouter Trappers: Yeah.

Hugh Seaton: So as you think about this, how do people think about, I don't know if instrumenting is the right word, but we've got this pipeline of data. How are you keeping track of where... how are you representing what's going on?

So we recognize that data gets produced one way. It gets stored, often transactional data gets stored another way that's really meant to support transactions. But, but at some point it has to go from that state to somewhere where it is grouped together and tagged and all that, so that you can run analytics against it.

What is the way that people are typically keeping track of those steps? Which really is what we mean by data lineage.

Wouter Trappers: Yeah, indeed. In the article you referred to at the beginning of the podcast, I make a distinction between two types of data lineage. You could argue about the terminology, but let's call the first one the upfront definition of the data lineage.

So before you start building the data pipelines, you need to have an idea of what data is going where: what data do I need, and how do I need to transform it? This will be collected in some kind of document that you use as a development guideline. There you have all the definitions and all the transformation steps. Depending on the software, you either really need this type of documentation or you have more freedom to make it up on the fly.

So that's what I call the upfront definition of the data lineage. Then you also have what I've called data lineage discovery. This is some kind of automated tool that goes into the different transformation steps and extracts the logic that is applied in them, so that you can follow the data from the source to the visualization dashboards. You know, for every step in between, what transformations, groupings or mappings were applied to the source, so that you can follow the data through all the systems and through all the different steps in your ETL or data pipeline.

Hugh Seaton: So I'm seeing that in the case of upfront, it's really maybe even a whiteboarding session: whether or not you're taking what's already been given to you, you're mapping out and designing where it's going to go and how it's going to flow.

In discovery, it sounds a little bit like you're using tooling to understand what's going where. So it's a little bit of a top-down and bottom-up approach. Is that accurate?

Wouter Trappers: Yeah, that's accurate. And the reason you need both is that it's usually very expensive to keep the top-down approach up to date, because you have a lot of people working on the data pipeline. Sometimes, under time pressure, documentation is not completed, or different types of software don't allow for the same type of documentation. So afterwards you need a mechanism to check if the actual data pipeline was implemented as it was designed. And usually this will not be the case, for different reasons.

So that's why you need both: the upfront definition to plan your development, and then data lineage discovery to check how it was actually implemented in reality.
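The check between design and implementation can be pictured as a comparison of two sets of lineage links. The following Python sketch assumes, purely for illustration, that both the upfront definition and the discovered lineage are available as (source, target) pairs; the names in `planned` and `discovered` and the function `compare_lineage` are hypothetical.

```python
# Lineage as designed upfront: (source, target) pairs. All names are made up.
planned = {
    ("erp.orders", "dashboard.order_volume"),
    ("crm.customers", "dashboard.revenue_by_segment"),
}

# Lineage as found by a discovery tool in the implemented pipeline.
discovered = {
    ("erp.orders", "dashboard.order_volume"),
    ("finance.invoices", "dashboard.revenue_by_segment"),
}


def compare_lineage(planned, discovered):
    """Report where the implemented lineage deviates from the design."""
    return {
        "as_designed": sorted(planned & discovered),
        "missing": sorted(planned - discovered),      # designed, never built
        "unplanned": sorted(discovered - planned),    # built, never designed
    }


report = compare_lineage(planned, discovered)
for kind, links in report.items():
    print(kind, links)
```

Anything under "missing" or "unplanned" is a candidate for either updating the documentation or revisiting the pipeline.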

Hugh Seaton: Well, there's actually a great analogy to how construction works, because an architect designs a building and then the contractor executes, and those are not always the same thing, because the reality is the world's a complicated place and things don't work as perfectly as designed.

So the idea of having a policy process that's this high level top-down that says here's how we intend to do this. But then a, I don't want to say more junior, but it's a little bit less heavy to have a team of people who are much more routinely looking at what's really going on and running analysis and using some of the tools you talked about.

That's a really practical way to think about this. You don't need to have the CIO involved every time there's a little bit of work being done on data lineage. You can design a policy, and design how you'd like it to be, at the CIO level or thereabouts, and then have teams that are actually near the data, near the processes, keeping an eye on how it's really happening, because the scope for variance is probably lower when you're actually executing than when you're designing something from the beginning.

Wouter Trappers: Yeah, that's right on. And I love the analogy between software engineering and the actual architecture and building of actual buildings.

Hugh Seaton: Yeah. We don't get to do that very often. So this is nice. So, what's a good example of how this works?

Wouter Trappers: Yeah, let's make it very concrete with an example. The initial analysis, for instance, could be incomplete. Say you're trying to show the amount of OPEX spending, so operational expenditure, and when you're talking to the subject matter experts, they're telling you, yeah, this is wrong.

Something is not right. For example, it can be the case that you used procurement data because you thought you had more details there. But during the testing phase, it turns out that no alignment can be found between the procurement numbers and the OPEX postings in the accounting system. And that can have a reason.

The reason can be that there is a discrepancy between where the items on a purchase request are shown on the one hand, and the postings of the items on the vendor invoices by finance on the other. It's maybe a very detailed example, but long story short, you need to replace the one source system with the other to have the right numbers in the dashboards.

But now before you can start this complex task, you want to be sure what the impact will be of this change. So what metrics use the procurement data instead of finance data?

And that's where data lineage comes in. It provides an overview of what data is used where. In my piece on Medium on data lineage with a graph database, I described two different takes: the upfront definition of your data pipelines, and data lineage discovery from existing data pipelines. Before starting development of a new data pipeline, as we said earlier, you should have a plan for how to approach the integration of the different data sources, and know what transformations they will undergo between the transactional database and the data viz dashboard.

Then this plan is developed by technical teams building the different steps in the data pipeline. However, for multiple reasons, these upfront definitions will not be implemented as prescribed, as in the analogy you made between the planning of the architect and the actual building by the contractors.

So that's why you have to double-check where the differences are between the initial plan and reality. That's what data lineage discovery does. It goes into the technical transformation steps and extracts what's happening to the data between the source and the front ends. And what I didn't mention in the Medium piece is that data lineage works in two directions, upstream and downstream.

Let me clarify what I mean. So upstream means that you're starting from the metric, looking toward the source of the data that is used in the metric. So you're looking upstream.

And then downstream, it means starting from the data source and then looking which metrics use the data from this source. So looking the other way.

So in the case of the example, both upstream and downstream occur. The subject matter experts find an error in a metric, and then they look upstream: what table is at the source of this error? Once this table is identified and the decision is taken to replace it, we need to assess the impact of this change.

And so then we need to look downstream where the data of this table is used and not only in the dashboard where the discrepancy was discovered, but also in other dashboards, where this data may be in use.
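The upstream and downstream lookups described here are, in graph terms, two traversals over the same set of links in opposite directions. Below is a minimal Python sketch of that idea. The table and metric names in `edges` are invented for the OPEX example and the helper names `build_index` and `reachable` are assumptions for illustration; a lineage tool or graph database would offer its own query language for this.

```python
from collections import defaultdict, deque

# Hypothetical lineage links: data flows from the first item to the second.
edges = [
    ("procurement.purchase_requests", "staging.spend"),
    ("staging.spend", "metric.opex_spend"),
    ("staging.spend", "metric.spend_per_vendor"),
    ("finance.vendor_invoices", "metric.invoice_backlog"),
]


def build_index(edges):
    """Index the links in both directions so either traversal is cheap."""
    upstream, downstream = defaultdict(set), defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start, neighbours):
    """Breadth-first search: everything reachable from start via neighbours."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_index(edges)

# Upstream: which sources feed the metric the experts flagged as wrong?
upstream_of = reachable("metric.opex_spend", upstream)

# Downstream: what else is affected if the procurement table is replaced?
impact = reachable("procurement.purchase_requests", downstream)

print(sorted(upstream_of))
print(sorted(impact))
```

In this made-up data, the downstream lookup also surfaces a second metric that uses the procurement table, which is the kind of impact that is easy to miss when only the dashboard with the reported error is checked.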

Hugh Seaton: I love this. And it speaks to what you're finding more and more in construction contractors, certainly larger ones, where they're growing and hiring data capability internally.

One of the questions that comes up a lot is, should we train people from construction in data, or should we take data people from outside of the industry and train them up on construction? Very often the consensus is that up to a certain point of sophistication, you really should take people who are subject matter experts in construction and teach them the data processes that the business needs, but after a certain point, you should hire somebody like, for example, your company, to help with architecture and higher-level work. And that really lines up with the dichotomy that you've used a number of times now, this idea of policy and top-down on one side, and bottom-up, relying on execution, on the other. That isn't quite what you meant here with upstream and downstream, but there's still a nice parallel to it: you're looking at the highest level, where the metrics don't seem to be working, or they're throwing off numbers that don't seem to make sense, so let's trace back. And then the other side is, okay, let's look at the data source and really think about what we're doing to it.

I mean, it just makes a ton of sense, right? You're looking at outcomes as well as inputs, and tracing both of them to make sure it's working in both directions. Yeah, I really liked that. And in other industries, are you finding that this is happening too, that they're taking people who really know the business well and training them to a certain degree?

And then they say, okay, look, it doesn't make sense for us to get more sophisticated. Let's pull in someone like Wouter?

Wouter Trappers: Yeah, it's always a balancing act. You can train the subject matter experts to a certain level, to be familiar with data and the way data can be used. But of course it's not their core competency, and maybe not their core interest. But it's definitely needed. You need those people in the business to double-check if what's being built actually makes sense for following the business strategy, because otherwise you're building things that don't help you follow up on your business strategy. On the other hand, once it becomes too technical, you indeed have to pull in the technical experts, who build the data pipelines according to IT best practices, but who don't necessarily have all the subject matter expertise of the different companies they are doing this for.

So it's always a balancing act. Of course, as a more technical person, I have to have some business acumen to understand the goals of the business, enough to be able to communicate with them and to help them in their thought process and vice versa, the people in the business have to be data savvy enough to explain to me what they are trying to measure and why that's important for them.

Hugh Seaton: Yeah, that makes a lot of sense. I like that you call it a balancing act. Where that line sits, I think, is a little bit of an individual choice for a given contractor, but it's one that people are struggling with. So Wouter, I'd like to shift gears to the other half of your Medium post that I was so excited about, and that is using a graph database in the service of data lineage. So using that as a tool to understand, and it sounds like communicate, what's going on with data lineage. Can you start with what you mean by graph databases?

Wouter Trappers: Yeah. I'm not that technical Hugh, I am a philosopher by training. So, let me try my take on this.

In my understanding, as a not super technical guy, graphs are actually quite simple. There are only two parts. You have documents and you have relationships between documents. Sometimes you also have a direction of a relationship, and that's then the third part. So those are then called the documents.

The relationships are called the edges, and the directions, attributes of the relationships, are vertexes. So that's all you need. And why is this so well suited for data lineage? Because data lineage is actually quite complex, so it's difficult to model it in other types of databases, like relational databases: the way the different documents relate to each other is simply too complex to be represented in a relational database. So that's why a graph database is very well suited for this.
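The structure described here can be written down in a few lines of Python. As a note on wording: in standard graph terminology the points themselves are usually called vertices (or nodes), the relationships between them are called edges, and direction is a property of an edge. The sketch below follows that convention. The vertex names, the relation labels and the `Edge` class are hypothetical and only serve as an illustration; a real graph database stores the same information in its own structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed relationship: it points from source to target."""
    source: str
    target: str
    relation: str  # an attribute describing the relationship


# Vertices: the things being related (made-up lineage objects).
vertices = {"erp.orders", "staging.orders", "dashboard.order_volume"}

# Edges: the relationships between those vertices, each with a direction.
edges = [
    Edge("erp.orders", "staging.orders", "is_loaded_into"),
    Edge("staging.orders", "dashboard.order_volume", "feeds"),
]


def outgoing(vertex, edges):
    """All edges that start at the given vertex."""
    return [edge for edge in edges if edge.source == vertex]


for edge in outgoing("erp.orders", edges):
    print(f"{edge.source} --{edge.relation}--> {edge.target}")
```

Because the relationship is stored as a first-class record rather than derived from matching keys, following it is a direct lookup.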

Hugh Seaton: That makes a lot of sense. And one of the things that we know about the difference between a graph database and a standard SQL relational database is connecting things. The relationship is usually called a join in a relational database, and it itself has its own table and it's outside, it's separate from the data itself. So relationships are represented totally separately, and that makes it really computationally expensive to understand how things are related to each other.

This is one of the reasons I'm so excited about graph databases generally: so much of what goes on in construction is about how one thing is related to another. And I love that you said, well, that's what data lineage is, right? It's how things are related to each other, what processes are being applied to produce it.

So that really was, I thought, a pretty interesting application. How do you actually, to that point, so how do you apply this? How do you... what's the process of using a graph database? At the highest level or as detailed as you want to get. For a data lineage project.

Wouter Trappers: Yeah. Let's start with a very simple example. Let's say you have two documents, let's call them a father and a son. Then there is one edge, and that relationship is "is a parent of". The visual representation of this graph will be two points, the documents; a line between these points, the edge; and an arrow going from the father to the son.

The vertex is: the father is the parent of the son. Now, this sounds very simple, and it is; the concept of a graph database is simple. But once you start filling in your data lineage data, it quickly becomes more difficult, because a father can have more than one son.

He can have two sons, but the son can also be a father of the father again. So it's not always linear. And that's what graph databases are very well suited to: visualizing these complex relations between the different documents. Of course, it's another discussion whether it's good design if you have very complex relationships between your different data sources and steps in the data pipeline.

But as we discussed earlier, we are talking now about data lineage discovery. We want to know what is actually going on. And then it can be that you see, okay, here we have real spaghetti between the different documents, what's going on here? Then we can use this data lineage discovery to go back to the design and see how this was actually designed, because it seems it is implemented differently, and maybe we have to redesign it a bit or make the decision to live with it for now.

That's then up to the management to decide.
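The case where "the son can also be a father of the father again" is what graph theory calls a cycle, and spotting cycles is one simple way to flag spaghetti in discovered lineage. The Python sketch below is one possible approach, a depth-first search, and is not taken from the article or from any specific tool. The function name `find_cycle` and the example links are hypothetical, and the recursion is only suitable for small graphs.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the graph has no cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}
    path = []  # the chain of nodes currently being explored

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in graph[node]:
            if state[nxt] == IN_PROGRESS:
                # We walked back into the chain we are exploring: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state[nxt] == UNVISITED:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state[node] == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None


# The simple case: one father, two sons, no way back.
linear = [("father", "son_1"), ("father", "son_2")]

# The tangled case: the son feeds back into the father.
tangled = [("father", "son"), ("son", "father")]

print(find_cycle(linear))
print(find_cycle(tangled))
```

Whether a cycle found this way is a design flaw or an accepted part of the pipeline is then the kind of decision described above.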

Hugh Seaton: I really like that. And again, this is the fact that the point of a graph database is it makes relationships... people like to say a first class citizen. And I don't think anyone ever knows what they mean when they say that.

But what it means is that relationships are something you can search with. It means more than that, but that's one good outcome of it. So if you think about what you're saying, as we're representing data lineage based on relationships in this graph database, you can also view it in a lot of ways; you can search for different relationships so that you understand how things are going. You can simplify it with a search or look at the whole thing at once. Imagine you're a multi-billion dollar contractor and you've got lots of different SaaS applications. You've got lots of different projects, and sometimes you've got different sites even on the same project.

So you've got all these different sources, and being able to map them in a way that's semantically not that complicated, even though in reality there's still a lot of complexity in there, means that you can then say, all right, I just want to know what my RFI flow is. Or I just want to know where the drawings are going and how they are moving back and forth.

There's something really special about being able to represent and then manage based on the relationships. So I really liked that, that you came to this idea of graph databases and data lineage. Really exciting.
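Searching by relationship, as in "I just want to know what my RFI flow is", can be pictured as filtering the lineage graph on a label attached to each edge. The short Python sketch below assumes such labels exist; the system names, the labels `rfi_flow` and `drawing_flow`, and the function `edges_of_type` are made up for illustration.

```python
# Hypothetical lineage links: (source, target, relationship label).
lineage = [
    ("pm_tool.rfis", "warehouse.rfis", "rfi_flow"),
    ("warehouse.rfis", "dashboard.rfi_turnaround", "rfi_flow"),
    ("pm_tool.drawings", "warehouse.drawings", "drawing_flow"),
    ("warehouse.drawings", "dashboard.drawing_revisions", "drawing_flow"),
]


def edges_of_type(lineage, relation):
    """Keep only the links that carry the requested relationship label."""
    return [(source, target) for source, target, label in lineage if label == relation]


rfi_view = edges_of_type(lineage, "rfi_flow")
for source, target in rfi_view:
    print(f"{source} -> {target}")
```

The full graph stays intact; the filter simply gives a narrower view of it for one question at a time.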

Wouter Trappers: Yeah, thanks.

Hugh Seaton: So where should people learn more about this?

In the show notes, I'm going to include the original Medium article that got me so excited, but where else can people learn more about this and learn more about you?

Wouter Trappers: Your listeners can find me on the website Xudo.be; .be is the extension for Belgium. And on LinkedIn I'm under Wouter Trappers.

And that's where your listeners can find me.

Hugh Seaton: Excellent. And of course I'll have links to those in the show notes. Wouter, thank you for being on this. Thank you for taking us through what data lineage is and how to represent it in graph databases. Really exciting stuff.

Wouter Trappers: Thanks. It's my pleasure.