Introduction
Apache Spark is an open-source framework for fast data processing. It supports large-scale analytics and machine learning workloads. Spark works with multiple programming languages, including Python (through PySpark), Scala, and SQL (through Spark SQL), so teams can use the skills they already have.
Organizations use Spark to process large datasets stored in data lakes such as AWS S3 and GCS. This approach costs less than moving the same data into warehouses like Snowflake or BigQuery. Today, platforms such as Databricks and AWS EMR run Spark pipelines at scale across many environments.
Why this framework matters in your data stack
Most lineage tools rely on SQL. They extract lineage by parsing queries or by reading query history in warehouses such as BigQuery and Redshift.
Spark works differently. It has no equivalent query history, and Spark code does not translate into lineage as easily as SQL. These differences prevent traditional SQL-based lineage tools from supporting Spark workloads.
How Foundational analyzes this framework
Foundational’s Code Engine extracts data lineage by analyzing Spark code directly. It supports PySpark, Scala Spark, and Spark SQL. The system simulates the code inside a specialized sandbox environment and observes how data moves through each step.
This approach gives teams visibility into code changes that are still in development or part of a pending Pull Request. It helps them understand the impact before they deploy the change.
Simulating a pipeline run
Foundational does not run the full Spark workload. Instead, it simulates the pipeline by analyzing the code and preparing the required environment, such as files, folders, and environment variables. This approach allows the pipeline to exercise its flow without running at production scale.
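As a rough illustration of this idea (not Foundational's actual implementation), a simulator can stage the inputs a pipeline expects, such as placeholder files and environment variables, so the code's control flow can execute without production-scale data. The paths and variable names below are hypothetical:

```python
import os
import tempfile

def prepare_environment(required_files, env_vars):
    """Create empty placeholder files and set env vars the pipeline reads."""
    root = tempfile.mkdtemp(prefix="sandbox_")
    for rel_path in required_files:
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()  # zero-byte stand-in for real data
    os.environ.update(env_vars)
    return root

# Hypothetical inputs a Spark job might expect before it can run
root = prepare_environment(
    ["input_for_spark/events/part-0000.csv"],
    {"PIPELINE_ENV": "sandbox"},
)
```

Because only placeholders exist, the simulated run can exercise every read and write path while touching no real data.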
Sandbox Spark environment
Foundational uses a modified Spark runtime inside its sandbox. This environment tracks each read and write operation and emits the information required to build lineage.
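Conceptually, an instrumented runtime can record every read and write as a lineage event. The sketch below is a hypothetical stand-in for such a runtime, not Foundational's actual code:

```python
# TrackingRuntime is an illustrative stand-in for an instrumented Spark
# runtime: each read and write is logged as a lineage event.
class TrackingRuntime:
    def __init__(self):
        self.events = []

    def read(self, source):
        self.events.append(("read", source))
        return source  # a real runtime would return a DataFrame

    def write(self, data, destination):
        self.events.append(("write", destination))

rt = TrackingRuntime()
df = rt.read("s3://fd-sample-data/input_for_spark/events")
rt.write(df, "orders")
```

After the simulated run, `rt.events` holds an ordered record of every source and destination, which is the raw material for a lineage graph.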
Extracting lineage
Foundational uses the simulated run to build precise lineage graphs for every column, table, and file. It traces data from read locations, such as S3 files, to write destinations and records the relationships between them.
Advantages of Foundational’s approach
Foundational builds the final lineage graph in a post-processing step. This step combines all lineage relations from the pipeline and fills in any missing information. Some relations do not contain full detail, so the engine uses global context to determine the complete lineage.
For example:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Read from S3 and Write to Spark Table") \
    .getOrCreate()

# Define the S3 file path
s3_file_path = "s3://fd-sample-data/input_for_spark/events"

# Read the data from S3
df = spark.read.option("header", "true").csv(s3_file_path)

# Drop the specified columns
columns_to_drop = ["order_id", "method", "amount"]
df_filtered = df.drop(*columns_to_drop)

# Write the resulting DataFrame to a Spark table
df_filtered.write.mode("overwrite").saveAsTable("orders")

# Stop the Spark session
spark.stop()
This code creates the following lineage rule:
“Write all columns from the events S3 file to the orders table, except for order_id, method, and amount.”
To complete the lineage graph, Foundational must know which columns exist in the events S3 file. The engine resolves this during post-processing. It can infer these columns from other parts of the code, such as the code that writes the events file, or from other metadata sources, such as non-code connectors.
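This resolution step can be sketched as follows. The events schema below is a hypothetical example of what post-processing might discover from another code path or metadata connector:

```python
# Illustrative sketch: resolve a rule like "all columns except X" once the
# source schema is known. The schema here is assumed for illustration.
def resolve_lineage(source_columns, dropped_columns):
    """Return the column-level mappings implied by an exclusion rule."""
    return [(col, col) for col in source_columns if col not in dropped_columns]

# Schema inferred during post-processing (hypothetical)
events_columns = ["event_id", "order_id", "method", "amount", "timestamp"]
mappings = resolve_lineage(events_columns, {"order_id", "method", "amount"})
```

Each resulting pair maps a column in the events S3 file to the matching column in the orders table, completing the column-level graph.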
Foundational uses this information to create the final lineage graph.
Why Foundational’s approach is different
Traditional lineage tools rely on query history and SQL parsing. These methods do not work for Spark. Spark has no query history, and Spark code does not map cleanly to lineage the way SQL does.
Some Spark platforms, such as Databricks Unity Catalog, offer lineage features. However, these platforms:
Do not provide full end-to-end coverage.
Exclude upstream systems such as Postgres and downstream tools such as Power BI.
Offer little or no lineage support on platforms like AWS EMR, forcing teams to build manual in-house solutions.
Foundational closes this gap:
It extracts lineage directly from Spark code without requiring custom code changes.
It analyzes code at the git stage, so teams see issues in pending Pull Requests before they reach production.
It removes the need for in-house workarounds and manual lineage processes.
Set up Spark lineage in Foundational
Setup is simple. Connect the repositories that contain your Spark code. Foundational automatically detects Spark files, loads them safely, and extracts lineage from the code. It identifies changes in Pull Requests and evaluates downstream impact before the code reaches production.
To connect your source control, see the relevant how-to article in the Help Center's Connectors and Integrations category.
No additional configuration is required.
Additional information
How Foundational differs from OpenLineage
Foundational supports OpenLineage and can ingest its data. However, there are important differences:
Foundational extracts Spark lineage directly from code. OpenLineage extracts lineage at runtime. Foundational shows lineage for pending changes and open Pull Requests, while OpenLineage only shows lineage for deployed jobs.
OpenLineage requires teams to modify code to emit events and to deploy an event collection server. Foundational Code Engine requires no code changes.
OpenLineage limits visibility for jobs that run infrequently, such as quarterly or yearly pipelines. It focuses on recent or active runs. In contrast, Foundational provides complete lineage regardless of execution frequency.
How Foundational differs from Unity Catalog
Unity Catalog extracts Spark lineage from executed jobs. It does not show lineage for pending Pull Requests.
Foundational extracts lineage directly from source code. It provides earlier visibility during development and supports full end-to-end lineage across Spark, upstream systems, and downstream consumers.