Skip to main content
SQL Lineage
Updated over 5 months ago

Background

SQL lineage involves the systematic tracking and visualization of the flow of data through SQL queries, detailing how data is transformed from its source to its final destination. This process is essential for data governance, compliance, and auditing since it ensures complete data traceability and accuracy. By elucidating data dependencies and transformations, SQL lineage helps organizations identify data quality issues, streamline debugging, and optimize data workflows. Additionally, it enhances impact analysis by illustrating the downstream effects of changes within the data pipeline.

In today's data ecosystem, tools within the modern data stack—such as data warehouses, transformation frameworks like dbt, data catalogs, and observability tools—often include some form of SQL lineage capabilities. Achieving precise column-level lineage is crucial for identifying potential issues, such as attempting to read from non-existent columns, and for enabling data engineers to fully understand dependencies. This prevents unintended disruptions, like the accidental removal of a column that is still in use elsewhere.

In this article, we will delve into how Foundational constructs its highly accurate column-level lineage and why it stands out as the most precise solution available on the market.

How Foundational Extracts Lineage from SQL

You can easily experiment with Foundational SQL Lineage, and visualize the lineage for any SQL query in our Lineage Explore page.

Foundational's SQL lineage engine is composed of the following core components:

  1. SQL parser

  2. SQL Lineage engine

  3. Full lineage graph builder

SQL Parser

The SQL parser's role is to break down an SQL query string into an Abstract Syntax Tree (AST). Given the diversity of SQL dialects (e.g., Snowflake, BigQuery), the parser must accommodate these differences and convert the dialect-specific syntax into a uniform AST. This approach eliminates the need for dialect-specific handling within the SQL lineage engine. At Foundational, we leverage SQLGlot, an open-source SQL parser that supports multiple SQL dialects and boasts a robust community.

SQL Lineage Engine

The SQL Lineage engine takes the AST generated by the SQL parser and constructs the lineage relationships from it. However, this process can be complex. Here are two examples of challenges the SQL Lineage Engine has to face:

Challenge #1: Ignore ephemeral CTEs

For example, in the code:

CREATE VIEW db.schema.my_view AS (
with temp_table as ( SELECT Col1 FROM db.schema.Foo )
SELECT Col1 FROM temp_table
)

We aim to capture the lineage db.schema.Foo.Col1 → db.schema.my_view.Col1 without including the transient CTE temp_table → my_view.Col1.

Challenge #2: Resolve ambiguities

Since Foundational analyzes lineage directly from code, it frequently needs to resolve ambiguities caused by the absence of schema information in SQL code. For example:

SELECT a, b FROM table1, table2

Analyzing this query while having the schema information is generally easy, but in case of having only the source code, the query analyzer cannot determine which column belongs to which table. For this query, each of the following schemas could be valid:

  • Table1a, b ; Table2 → []

  • Table1a ; Table2b

  • Table1b ; Table2a

  • Table1 → [] ; Table2a, b

How can we then determine which one is it for an actual query? There are multiple different ways Foundation resolves these ambiguities. One of these ways is to look at the larger query, for example it could be that the rest of the query looks like this:

SELECT a,b FROM table1, table2
UNION ALL
(SELECT c FROM table1 WHERE a == '1' and b == '2')

In this case, we can identify the last part of the query that ties a and b to table1 through the WHERE clause, and determine that table1 is the one having these columns.

If you want to read more on this, visit our blog post: Overcoming Challenges in Source-Code Based Data Lineage

Full Lineage Graph Builder

Once all lineage relations are extracted from the SQL query, Foundational runs a post processing phase that constructs the full lineage graph, out of the individual lineage relations. This step is needed, as sometimes lineage relations do not have the full information, and require a global context in order to figure out the actual lineage.

For example, a SQL query may contain this code:

CREATE VIEW ANALYTICS_DB.AGG.CUSTOMERS AS (
SELECT * FROM ANALYTICS_DB.AGG.STG_CUSTOMERS
)

This code translates into the following lineage relation: STG_CUSTOMERS.*CUSTOMERS.*

However, in order to build the final lineage graph, we need to know which columns does the STG_CUSTOMERS table contains. This is done as part of the post processing phase, where Foundational combines all relevant information, in order to build the final lineage graph. For example, Foundational can infer the columns in the STG_CUSTOMERS table either from other places in the code (e.g. a code that reads explicit columns from STG_CUSTOMERS), or from other information (e.g. other non-code connectors).

Using this information, Foundational can construct the final lineage graph:

The Most Accurate SQL Lineage

At Foundational, we put significant effort into ensuring the highest accuracy for column-level SQL lineage. We continually benchmark our lineage engine against other leading tools and open-source libraries, and time and again, Foundational delivers the most precise results.

Analyzing Over 10,000 Real Queries Daily

Every day, Foundational processes over 10,000 SQL queries from real clients, contributing to a dataset of more than 30,000 Pull Requests analyzed. These queries often span 300-800 lines of code, encompassing a wide array of real-world complexities and use cases. This extensive and varied dataset allows us to quickly identify and resolve bugs or issues, including those arising from newly released SQL operators or advanced language features.

Rigorous Accuracy Testing

Foundational consistently conducts accuracy tests by comparing the metadata deduced from SQL lineage queries with the actual metadata retrieved directly from data warehouses. This practice quickly highlights any discrepancies or missing lineage information. Additionally, full regression tests are run whenever we make changes to the Foundational Lineage engine to ensure that updates do not introduce new inaccuracies or reduce lineage coverage.

Heuristic Detection of Lineage Inaccuracies

Foundational employs heuristic-based testing to catch lineage inaccuracies. These tests look for anomalies, such as attempts to access non-existent columns or lineage links between columns of drastically different data types (e.g., a string directly connected to an integer). Although these heuristics may occasionally produce false positives, they are invaluable in detecting and resolving more nuanced issues that might otherwise be overlooked.

Did this answer your question?