What is Data Lineage?
Data lineage is the record of the path data takes from its original source to its final consumption or storage. It can also refer to the description of how and where data changes as it moves through its life cycle. Its value is that it provides a track record of everything a piece of data has gone through, which creates accountability for those who interact with the data. Knowing how and where data is stored also becomes important when an organization needs to change this data life cycle.
A typical data lineage diagram documents the location of data, either in a database or data warehouse, and shows how it travels from one place to another. It also shows how that data changes as it moves through the system.
The more granular you make your documentation, the easier it is for people to understand what is happening with your data and why certain processes work the way they do. This helps build trust and accountability in your analytics and makes it easier to identify issues when they arise. The drawback of documenting lineage manually is cost: it takes significant time and effort to create and to keep up to date.
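To make the idea of a lineage diagram concrete, here is a minimal sketch of how one could be recorded in code. It assumes a very simple model: each dataset has a name and a location, and each step links a source dataset to a target dataset with a short description of the change. The class names (`Dataset`, `Step`, `LineageGraph`) and the example tables are hypothetical and chosen only for illustration; real lineage tools capture far more detail.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Dataset:
    """A place where data lives, e.g. a table in a database or warehouse."""
    name: str
    location: str


@dataclass(frozen=True)
class Step:
    """One hop in the lineage: data moves from source to target and may change."""
    source: Dataset
    target: Dataset
    transformation: str


@dataclass
class LineageGraph:
    """An ordered record of every step the data has gone through."""
    steps: List[Step] = field(default_factory=list)

    def record(self, source: Dataset, target: Dataset, transformation: str) -> None:
        self.steps.append(Step(source, target, transformation))

    def describe(self) -> List[str]:
        # One human-readable line per step: where the data was, where it went, what changed.
        return [
            f"{s.source.location}.{s.source.name} -> "
            f"{s.target.location}.{s.target.name} ({s.transformation})"
            for s in self.steps
        ]


# Hypothetical example data, not taken from any real system.
raw = Dataset("orders", "postgres_prod")
clean = Dataset("orders_clean", "warehouse")
report = Dataset("daily_revenue", "warehouse")

graph = LineageGraph()
graph.record(raw, clean, "drop test orders, normalize currency")
graph.record(clean, report, "sum order totals by day")

for line in graph.describe():
    print(line)
```

The more granular the `transformation` descriptions are, the more useful the record becomes, which is the same trade-off between detail and effort described above.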
What makes up Data Lineage?
Data lineage consists of the data's origin, each step it took to get from point A to point B, and where it ended up. It tracks the history of data through a series of steps or transformations from its original source to its final destination, including where the data originated and how it moved between systems.
This information is valuable for understanding how an organization's data is sorted, where it lives, how it is stored, and ultimately how it affects the organization. It empowers everyone who works with the data to make better-informed decisions, and it cuts the time that data stewards and data consumers would otherwise spend explaining a piece of information.
Data lineage is especially important in compliance-driven industries such as healthcare and finance, where organizations must demonstrate compliance by providing information on how they handle customer or patient data. Data governance initiatives often include a requirement to create and store data lineages.
How to use Data Lineage
Data lineage can support many different use cases, including:
Understanding your data – Data lineage lets you trace the path of a specific piece of data as it moves through your data ecosystem and see how and where it was manipulated over time.
Tracing bugs – Using data lineage, you can investigate errors in your system and trace them back to the source, helping you identify and fix bugs faster.
Creating policies – Data lineage helps you discover who has access to certain pieces of information, which in turn helps you create policies around who can access what information, when, and why.
Auditing – You may need to undergo an audit for compliance purposes (e.g., PCI DSS). Data lineage can help you show auditors exactly where every piece of your sensitive information came from.
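Tracing a bug or answering an auditor's question both come down to walking the lineage backwards from a dataset to its sources. The sketch below shows one way that could look, assuming lineage is stored as a plain dictionary that maps each dataset to the datasets it is built from. The dataset names and the `trace_upstream` helper are hypothetical and serve only as an illustration.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}


def trace_upstream(dataset, upstream):
    """Return every dataset that feeds into `dataset`, nearest first (breadth-first)."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


ancestors = trace_upstream("revenue_dashboard", UPSTREAM)
# Original sources are the ancestors that have no upstream of their own.
sources = [name for name in ancestors if not UPSTREAM.get(name)]

print(ancestors)
print(sources)
```

If a number on the dashboard looks wrong, the list of ancestors gives the order in which to check intermediate tables, and the list of sources shows where the data originally came from.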
Why is Data Lineage important?
- Data governance. Data stakeholders and data stewards align on how data is managed, tracked, stored, and used. Because data lineage is a method of accountability within a data warehouse or database, the decisions made in data governance shape what the lineage will look like.
- Migrations. When the data stack needs a migration or other major change, such as integrating a new tool, understanding the location and life cycle of the data in its current form guides the decision-making process.
- Data analysis. Data lineage makes data easier to understand, both within the data team and among people outside it. It also improves confidence in the data's reliability, since users and viewers can see the "receipt" of where and when the data has been interacted with.
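For the migration case in particular, the useful question is usually "what depends on this?" A rough sketch of that impact check follows, assuming lineage is available as a list of (source, target) pairs. The dataset names and the `downstream_of` function are hypothetical examples, not part of any specific tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source, target) means target is built from source.
EDGES = [
    ("orders_raw", "orders_clean"),
    ("orders_clean", "daily_revenue"),
    ("fx_rates", "daily_revenue"),
    ("daily_revenue", "revenue_dashboard"),
    ("orders_clean", "churn_model"),
]


def downstream_of(dataset, edges):
    """Return every dataset that directly or indirectly depends on `dataset`."""
    # Index the edges by source so each dataset's direct dependents are easy to look up.
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    affected = []
    queue = deque(children.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in affected:
            continue  # reached through more than one path
        affected.append(current)
        queue.extend(children.get(current, []))
    return affected


impacted = downstream_of("orders_clean", EDGES)
print(impacted)
```

Before moving or restructuring `orders_clean`, the resulting list shows which tables, models, and dashboards would need to be checked or updated.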