Skip to main content

SQL Lineage

Updated over a week ago

Introduction

SQL lineage tracks and visualizes how data flows through SQL queries and how it transforms from its source to its final destination. This level of visibility is essential for data governance, compliance, and auditing because it ensures complete traceability and accuracy across the data lifecycle.

By revealing data dependencies and transformations, SQL lineage helps teams identify data-quality issues, streamline debugging, and optimize data workflows. It also strengthens impact analysis by showing how changes in one part of the pipeline affect downstream assets.


Why this framework matters in your data stack

In today’s data ecosystem, many components of the modern data stack, such as data warehouses, transformation frameworks like dbt, data catalogs, and observability tools, provide some form of SQL lineage. Accurate column-level lineage is essential for identifying issues such as references to non-existent columns. It also gives data engineers a clear understanding of dependencies.

This clarity helps prevent unintended disruptions, including situations where a column is removed even though downstream processes still depend on it.


How Foundational analyzes this framework

Foundational’s SQL lineage engine is composed of three core components:

  1. SQL parser

  2. SQL lineage engine

  3. Full lineage graph builder

You can also experiment with Foundational SQL Lineage and visualize the lineage for any SQL query in the Lineage Explore page.


Foundational’s process to extract schema and lineage

SQL Parser

The SQL parser breaks an SQL query string into an Abstract Syntax Tree (AST). Because SQL dialects differ across systems such as Snowflake and BigQuery, the parser must handle these variations and convert dialect-specific syntax into a uniform AST. This removes the need for dialect-specific logic inside the SQL lineage engine.

Foundational uses SQLGlot, an open-source SQL parser that supports multiple SQL dialects and has an active community.

SQL Lineage Engine

The SQL lineage engine takes the AST generated by the SQL parser and derives lineage relationships from it. This step can be complex. Below are two examples of challenges the lineage engine must address.

Challenge #1: Ignore ephemeral CTEs

In the following example:

CREATE VIEW db.schema.my_view AS (
with temp_table as ( SELECT Col1 FROM db.schema.Foo )
SELECT Col1 FROM temp_table
)

Foundational aims to capture only the meaningful lineage:

db.schema.Foo.Col1 → db.schema.my_view.Col1

The transient CTE should not appear in the lineage, so the temporary link:

temp_table → my_view.Col1

is ignored.

Challenge #2: Resolve ambiguities

Foundational analyzes lineage directly from SQL code, which often means there is no schema information available. This can create ambiguities.


For example:

SELECT a, b FROM table1, table2

With no schema, the query analyzer cannot determine which column comes from which table.

Any of the following combinations could be correct:

  • Table1 → a, b ; Table2 → []

  • Table1 → a ; Table2 → b

  • Table1 → b ; Table2 → a

  • Table1 → [] ; Table2 → a, b

To resolve this, Foundational analyzes the larger query context. For example:

SELECT a,b FROM table1, table2
(
UNION ALL
SELECT c FROM table1 WHERE a == '1' and b == '2'
)

The WHERE clause ties a and b to table1, which allows the lineage engine to conclude that table1 is the source table for both columns.

You can read more in our blog post: Overcoming Challenges in Source-Code Based Data Lineage.

Full lineage graph builder

Once all lineage relationships are extracted from the SQL query, Foundational runs a post-processing phase that assembles the full lineage graph from the individual lineage relations.

This step is necessary because some relations may not contain complete information and require global context to infer the final lineage.

For example:


CREATE VIEW ANALYTICS_DB.AGG.CUSTOMERS AS (
SELECT * FROM ANALYTICS_DB.AGG.STG_CUSTOMERS
)

This produces the lineage relation:

STG_CUSTOMERS.* → CUSTOMERS.*

To build the full lineage graph, Foundational must determine which columns exist in STG_CUSTOMERS. This is done during post-processing, where all relevant information is combined to complete the lineage.

Foundational may infer these columns from other areas in the code, such as queries that reference explicit columns, or from additional metadata sources such as non-code connectors.

Using this information, the system constructs the final lineage graph:


Advantages of Foundational’s approach

Foundational delivers the highest possible accuracy for column-level SQL lineage. The lineage engine is benchmarked against other leading tools and open-source libraries, and it consistently delivers the most precise results.

  • Over 10,000 real queries analyzed daily: More than 10,000 SQL queries from real clients are processed each day, contributing to a dataset of over 30,000 analyzed Pull Requests. These SQL queries often contain 300–800 lines of code and represent a wide range of real-world complexities and use cases. This diverse dataset helps identify and resolve bugs, including those related to new SQL operators or advanced language features.

  • Rigorous accuracy testing: Ongoing accuracy tests compare metadata inferred from SQL lineage with metadata retrieved directly from data warehouses. This practice quickly exposes discrepancies or missing lineage information. Full regression tests also run whenever updates are made to the lineage engine to prevent the introduction of inaccuracies or reduced coverage.

  • Heuristic detection of lineage inaccuracies: Heuristic-based testing is used to detect lineage inaccuracies. These tests look for anomalies such as references to non-existent columns or lineage connections between columns with incompatible data types, for example strings linked directly to integers. While heuristics may occasionally produce false positives, they are essential for identifying nuanced issues that might otherwise go unnoticed.


Set up SQL lineage in Foundational

SQL lineage is available automatically in Foundational.

To connect to your source control, check out the relevant How-to article from the Help Center Connectors and Integrations category.

No additional configuration is required.


Useful links

Did this answer your question?