Understanding Column-level Lineage in dbt Explorer

Published: May 22, 2024

What is Column-level Lineage in dbt Explorer?

Column-level Lineage (CLL) is a feature in dbt Explorer that traces how individual columns flow and are transformed from one model to the next across a dbt project. It helps data teams locate where errors enter a pipeline and diagnose issues in data workflows. For instance, CLL can help show whether a failing data test on a column traces back to an untested column upstream.

SELECT column_name, table_name
FROM information_schema.columns
WHERE table_schema = 'your_schema'

This SQL query retrieves the column names and table names in a specific schema from the warehouse's information schema. It does not show lineage by itself, but the resulting inventory of columns is a useful reference when tracing where a column in your dbt project comes from.

  • SELECT column_name, table_name: This part of the query selects the column names and table names from the information schema.
  • WHERE table_schema = 'your_schema': This condition filters the results to only include tables from a specific schema.
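
The idea of tracing a failing column test back to an untested upstream column can be sketched in a few lines of Python. This is a minimal illustration, not how dbt Explorer works internally: the COLUMN_PARENTS mapping, the TESTED_COLUMNS set, and every model and column name are hypothetical stand-ins for lineage and test metadata you would collect from your own project.

```python
from collections import deque

# Hypothetical column-level lineage: each (model, column) pair maps to the
# upstream (model, column) pairs it is derived from.
COLUMN_PARENTS = {
    ("fct_orders", "customer_id"): [("stg_orders", "customer_id")],
    ("stg_orders", "customer_id"): [("raw_orders", "cust_id")],
    ("raw_orders", "cust_id"): [],
}

# Hypothetical set of columns that have at least one data test.
TESTED_COLUMNS = {
    ("fct_orders", "customer_id"),
    ("stg_orders", "customer_id"),
}


def untested_upstream_columns(start, parents, tested):
    """Return upstream columns of `start` that have no test, nearest first."""
    seen = {start}
    queue = deque([start])
    untested = []
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent in seen:
                continue
            seen.add(parent)
            queue.append(parent)
            if parent not in tested:
                untested.append(parent)
    return untested


result = untested_upstream_columns(
    ("fct_orders", "customer_id"), COLUMN_PARENTS, TESTED_COLUMNS
)
print(result)
```

The breadth-first walk visits the closest ancestors first, so the first untested column it reports is usually the most direct candidate to investigate.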

How can CLL help in identifying problematic nodes in data transformation jobs?

CLL can also help identify problematic nodes in data transformation jobs that could cause cascading failures. By providing a clear view of how data is flowing and transforming, it can help pinpoint where issues are occurring and prevent potential downstream failures.

SELECT *
FROM dbt_meta
WHERE node_type = 'model' AND status = 'error'

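This SQL query illustrates how failing models could be listed once run metadata has been loaded into a table. The dbt_meta table and its node_type and status columns are hypothetical; dbt does not create such a table by default, so adapt the names to wherever your project stores run results. The query selects all records where the node type is 'model' and the status is 'error'.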

  • SELECT * FROM dbt_meta: This part of the query selects all records from the dbt_meta table.
  • WHERE node_type = 'model' AND status = 'error': This condition filters the results to only include records where the node type is 'model' and the status is 'error'.
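
To see why a single failing node can cause cascading failures, it helps to list everything downstream of it. The sketch below assumes a hypothetical model-level DAG stored as a dictionary of direct children; the CHILDREN mapping and all model names are invented for illustration and are not read from dbt.

```python
from collections import deque

# Hypothetical model-level DAG: each node maps to its direct children.
CHILDREN = {
    "stg_orders": ["int_orders_enriched"],
    "stg_customers": ["int_orders_enriched", "dim_customers"],
    "int_orders_enriched": ["fct_orders"],
    "dim_customers": [],
    "fct_orders": [],
}


def downstream_nodes(failed, children):
    """Return every node reachable from `failed`, sorted by name."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


affected = downstream_nodes("stg_customers", CHILDREN)
print(affected)
```

Every model in the returned list would be skipped or built on bad data if the starting node failed, which is the blast radius a lineage view makes visible.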

How does CLL simplify debugging data issues?

CLL helps data teams understand how data is used in models, which can simplify debugging data issues. For example, CLL can answer questions like which input columns are used to produce output columns. This insight can greatly simplify the debugging process and save valuable time.

SELECT *
FROM dbt_meta
WHERE node_type = 'model' AND status = 'error'
ORDER BY updated_at DESC

This SQL query illustrates how the most recent model errors could be listed, assuming a hypothetical dbt_meta table (not something dbt creates by default) that also carries an updated_at timestamp. It selects all records where the node type is 'model' and the status is 'error', and orders the results by updated_at in descending order.

  • SELECT * FROM dbt_meta: This part of the query selects all records from the dbt_meta table.
  • WHERE node_type = 'model' AND status = 'error': This condition filters the results to only include records where the node type is 'model' and the status is 'error'.
  • ORDER BY updated_at DESC: This part of the query orders the results by the updated_at timestamp in descending order, so the most recent errors are shown first.
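
The question of which input columns produce a given output column can be sketched as a simple mapping, along with its inverse for the opposite question. The OUTPUT_TO_INPUTS dictionary below is a hypothetical description of one model's columns; the names are illustrative assumptions, not output from dbt Explorer.

```python
from collections import defaultdict

# Hypothetical mapping for one model: output column -> input columns used.
OUTPUT_TO_INPUTS = {
    "order_id": ["stg_orders.order_id"],
    "order_total": ["stg_orders.subtotal", "stg_orders.tax"],
    "customer_name": ["stg_customers.first_name", "stg_customers.last_name"],
}


def invert(mapping):
    """Build input column -> output columns from output -> inputs."""
    inverted = defaultdict(list)
    for output, inputs in mapping.items():
        for input_column in inputs:
            inverted[input_column].append(output)
    return dict(inverted)


INPUT_TO_OUTPUTS = invert(OUTPUT_TO_INPUTS)

# Which inputs feed order_total, and which outputs depend on the tax column?
inputs_for_total = OUTPUT_TO_INPUTS["order_total"]
outputs_from_tax = INPUT_TO_OUTPUTS["stg_orders.tax"]
print(inputs_for_total, outputs_from_tax)
```

Keeping both directions makes it possible to debug from a broken output back to its inputs, or to check what a change to one input column would affect.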

What is the importance of data lineage in analytics engineering?

Data lineage is a holistic overview of how data moves through a system or organization, and is usually represented by a Directed Acyclic Graph (DAG). Analytics engineering practitioners can use their DAG and data lineage to unpack root causes in broken pipelines, audit models for inefficiencies, and promote greater transparency in data work to business users.

SELECT *
FROM dbt_meta
WHERE node_type = 'model'
ORDER BY depth DESC

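This SQL query illustrates how models could be ranked by their position in the DAG, assuming a hypothetical dbt_meta table (not something dbt creates by default) with a depth column that records how many layers of upstream models sit beneath each model. It selects all records where the node type is 'model' and orders the results by depth in descending order.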

  • SELECT * FROM dbt_meta: This part of the query selects all records from the dbt_meta table.
  • WHERE node_type = 'model': This condition filters the results to only include records where the node type is 'model'.
  • ORDER BY depth DESC: This part of the query orders the results by depth in descending order, so the models furthest downstream, with the longest chain of upstream dependencies, are shown first.
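
A depth value like the one used in the query above has to be computed from the DAG itself. As a minimal sketch, assuming a hypothetical PARENTS mapping of each node to its direct parents (all names invented for illustration), depth can be defined as the length of the longest upstream chain:

```python
from functools import lru_cache

# Hypothetical model-level DAG: each node maps to its direct parents.
PARENTS = {
    "raw_orders": [],
    "raw_customers": [],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
    "int_orders_enriched": ["stg_orders", "stg_customers"],
    "fct_orders": ["int_orders_enriched"],
}


@lru_cache(maxsize=None)
def depth(node):
    """Length of the longest chain of parents above `node` (roots are 0)."""
    parents = PARENTS[node]
    if not parents:
        return 0
    return 1 + max(depth(parent) for parent in parents)


# Deepest nodes first; ties are broken alphabetically.
by_depth = sorted(PARENTS, key=lambda node: (-depth(node), node))
for name in by_depth:
    print(name, depth(name))
```

Because the graph is acyclic the recursion always terminates, and caching keeps each node's depth from being recomputed when several children share a parent.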

How to access a project's full lineage graph in dbt Explorer?

To access a project's full lineage graph in dbt Explorer, select Overview in the left sidebar and click the Explore Lineage button on the main page. This will provide a visual representation of how data is flowing and transforming in your dbt project.

This step happens entirely in the dbt Explorer user interface, so no code is required. The two interface elements involved are:

  • Overview: This is the section in the left sidebar of dbt Explorer where you can access the lineage graph.
  • Explore Lineage button: This is the button you need to click on the main page to view your project's full lineage graph.
