What is Spark?
Apache Spark is a widely used open-source framework designed for fast data processing, ideal for large-scale data analytics and machine learning. Spark supports multiple programming languages, including Python (via PySpark), Scala, and Spark SQL, allowing businesses to leverage their existing skill sets.
Using Spark, organizations can efficiently process vast datasets stored in data lakes like AWS S3 and GCS, often at a significantly lower cost than traditional data warehouses such as Snowflake and BigQuery. This cost advantage makes Spark an attractive choice for managing extensive data volumes. Today, organizations use Databricks, AWS EMR, and other solutions for running Spark processing pipelines on their data lakes.
Data Lineage for Spark
While data warehouses like BigQuery and Redshift rely solely on SQL, for which standard data lineage solutions exist, this isn't the case for Apache Spark. Traditional warehouse-focused lineage solutions use query history as the source and SQL parsing as the method to extract lineage. These methods, however, do not apply to Spark: there is no equivalent of query history, and Spark code cannot be translated into lineage as easily as SQL can.
Although some specific Spark platforms, like Databricks Unity Catalog, offer lineage solutions, they don't provide full end-to-end lineage that includes upstream sources (e.g., Postgres and other operational databases) and downstream BI tools and dashboards (e.g., Power BI). Additionally, businesses running their Spark workloads on other platforms, such as AWS EMR, often lack lineage coverage entirely or are forced to extract lineage manually through in-house solutions.
Foundational closes this gap by leveraging code analysis that extracts lineage directly from Spark code, which automates data lineage for Spark-based environments without the need for custom code development. Leveraging code analysis at the git stage also allows developers to analyze pending code changes and identify potential data issues before a bad code change is deployed to production, where it would impact live data.
Extracting Spark Lineage Directly from Code
The Foundational Code Engine automates data lineage extraction by directly analyzing Spark code. It can ingest different flavors of Spark code—such as PySpark, Scala Spark, and Spark SQL—and simulate its execution in a specialized sandbox environment.
This unique approach enables customers to analyze Spark code changes that are still under development or are part of a pending Pull Request, allowing lineage to be extracted prior to deployment.
Simulating the Pipeline Run
When Foundational simulates a pipeline run, it doesn't execute the full workload; instead, it mimics the run by analyzing the code and setting up the relevant environment: files, folders, environment variables, and so on. This allows the pipeline to exercise its full flow.
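As a minimal sketch of what such environment preparation could look like, the snippet below creates a scratch workspace before the pipeline code is exercised. The directory layout, placeholder file, and environment variable names are hypothetical and only illustrate the idea:
from pathlib import Path
import os
import tempfile

# Hypothetical sketch only: the paths, file names, and environment variables
# below are illustrative, not Foundational's actual implementation.
def prepare_simulated_environment() -> Path:
    workdir = Path(tempfile.mkdtemp(prefix="lineage-sim-"))

    # Create the folders and placeholder input files the pipeline expects to find.
    input_dir = workdir / "input_for_spark" / "events"
    input_dir.mkdir(parents=True)
    (input_dir / "part-00000.csv").write_text("order_id,method,amount,event_type\n")

    # Set the environment variables the pipeline reads at startup.
    os.environ["PIPELINE_ENV"] = "simulation"
    os.environ["DATA_ROOT"] = str(workdir)

    return workdir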
Sandbox Spark Environment
Foundational uses a specialized sandbox environment to simulate the pipeline runs. This sandbox environment contains a modified Spark runtime environment, which allows Foundational to track each read/write operation and emit relevant lineage information.
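Conceptually, the instrumentation can be pictured as wrapping Spark's read and write APIs so that every operation is recorded. The monkey-patching sketch below is purely illustrative of that idea and is not Foundational's actual runtime modification:
from pyspark.sql.readwriter import DataFrameReader, DataFrameWriter

# Illustrative only: record every read and write the simulated pipeline performs.
recorded_reads = []
recorded_writes = []

_original_csv = DataFrameReader.csv
_original_save_as_table = DataFrameWriter.saveAsTable

def _tracked_csv(self, path, *args, **kwargs):
    # Capture the read source before delegating to the real implementation.
    recorded_reads.append({"format": "csv", "path": path})
    return _original_csv(self, path, *args, **kwargs)

def _tracked_save_as_table(self, name, *args, **kwargs):
    # Capture the write sink before delegating to the real implementation.
    recorded_writes.append({"format": "table", "name": name})
    return _original_save_as_table(self, name, *args, **kwargs)

DataFrameReader.csv = _tracked_csv
DataFrameWriter.saveAsTable = _tracked_save_as_table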
Extracting Lineage
With the simulated run in the specialized sandbox environment, Foundational constructs precise lineage graphs for every column, table, and file. It traces paths from read sources (e.g., S3 files) to write sinks and creates a lineage relation between each source and sink.
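The relations produced by such a run can be modeled roughly as small records like the hypothetical structure below; the field names are illustrative, not Foundational's internal schema:
from dataclasses import dataclass, field

@dataclass
class LineageRelation:
    """One observed edge: data read from `source` flows into `sink`."""
    source: str                # e.g. an S3 path that was read
    sink: str                  # e.g. a Spark table that was written
    column_spec: dict = field(default_factory=dict)  # how columns map from source to sink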
Constructing The Lineage Graph
Once all lineage relations are extracted from the pipeline, Foundational runs a post-processing phase that constructs the full lineage graph out of the individual relations. This step is needed because individual relations sometimes lack complete information and require global context to resolve the actual lineage.
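A minimal sketch of such a post-processing step, assuming relations shaped like the hypothetical LineageRelation above, could simply merge all relations into one directed graph (networkx is used here purely for illustration):
import networkx as nx

def build_lineage_graph(relations):
    """Combine individual lineage relations into a single directed graph."""
    graph = nx.DiGraph()
    for rel in relations:
        # Each relation contributes one edge; global resolution (e.g. filling in
        # missing column information) happens over the assembled graph.
        graph.add_edge(rel.source, rel.sink, column_spec=rel.column_spec)
    return graph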
For example, a Spark pipeline may contain this code:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder \
.appName("Read from S3 and Write to Spark Table") \
.getOrCreate()
# Define the S3 file path
s3_file_path = "s3://fd-sample-data/input_for_spark/events"
# Read the data from S3
df = spark.read.option("header", "true").csv(s3_file_path)
# Drop the specified columns
columns_to_drop = ["order_id", "method", "amount"]
df_filtered = df.drop(*columns_to_drop)
# Write the resulting DataFrame to a Spark table
df_filtered.write.mode("overwrite").saveAsTable("orders")
# Stop the Spark session
spark.stop()
This code translates into the following lineage relation:
"All columns from the events S3 file should be written to the orders table, except for the columns order_id, method, and amount."
However, in order to build the final lineage graph, we need to know which columns the events S3 file contains. This is resolved during the post-processing phase, where Foundational combines all relevant information to build the final lineage graph. For example, Foundational can infer the columns in the events S3 file either from other places in the code (e.g., code that creates the events S3 file) or from other sources of information (e.g., non-code connectors).
Using this information, Foundational can construct the final lineage graph.
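For illustration, assuming post-processing has inferred a schema for the events file, resolving the relation from the example above into column-level lineage could look roughly like this (the LineageRelation structure, the inferred schema, and the helper function are all hypothetical):
# Hypothetical representation of the relation extracted from the example pipeline.
relation = LineageRelation(
    source="s3://fd-sample-data/input_for_spark/events",
    sink="orders",
    column_spec={"all_except": ["order_id", "method", "amount"]},
)

# Hypothetical: columns of the events S3 file, inferred from other code or connectors.
events_columns = ["order_id", "method", "amount", "event_type", "event_time", "customer_id"]

def resolve_column_lineage(relation, source_columns):
    """Expand an 'all columns except ...' spec into explicit column-level edges."""
    excluded = set(relation.column_spec.get("all_except", []))
    return [
        (f"{relation.source}#{column}", f"{relation.sink}#{column}")
        for column in source_columns
        if column not in excluded
    ]

column_edges = resolve_column_lineage(relation, events_columns)
# e.g. ("s3://fd-sample-data/input_for_spark/events#event_type", "orders#event_type"), ...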
How to set up Spark Lineage in Foundational?
To get started with Spark lineage in Foundational, visit the Connectors page to set up a code connector to your Spark repositories. Code connectors are configured through the relevant developer platform, such as GitHub. Through the connector setup, you can select the appropriate git repositories that contain the Spark code and configuration files, which will then be processed automatically by the Foundational Code Engine.
Once the git connector is set up, Foundational will process the source code and calculate data lineage for all the connected repositories, including pending pull requests. By default, Foundational processes code history (e.g., closed pull requests) going back up to 30 days.
How is this different from OpenLineage?
Some may wonder how this approach differs from OpenLineage. It's an excellent question! First, Foundational is a big supporter of OpenLineage and can ingest OpenLineage data as well. However, there are a few key differences between OpenLineage and the Foundational Code Engine:
Foundational Code Engine extracts Spark lineage directly from code, whereas OpenLineage extracts lineage at runtime. This allows Foundational to provide lineage visibility for pending changes and open Pull Requests, whereas with OpenLineage you only see lineage for deployed code.
Integrating with OpenLineage requires modifying your code to emit OpenLineage events and setting up a server for event collection (a typical OpenLineage Spark configuration is sketched below). In contrast, Foundational Code Engine setup is straightforward: simply connect your repositories to Foundational for seamless integration.
OpenLineage focuses on runtime lineage, meaning it shows lineage for things that have recently run or are currently deployed. This limits visibility for less frequently executed processes like yearly or quarterly aggregation pipelines. Foundational, however, provides comprehensive lineage insights regardless of execution frequency.
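For comparison, attaching OpenLineage to a Spark job typically means shipping the OpenLineage Spark listener with the job and pointing it at an event collection endpoint, roughly along these lines. Treat this as a sketch: the artifact coordinates, configuration keys, and endpoint shown here vary by OpenLineage version and deployment.
from pyspark.sql import SparkSession

# Rough sketch of a typical OpenLineage Spark setup; the package coordinates and
# configuration keys below depend on the OpenLineage release you are using.
spark = (
    SparkSession.builder
    .appName("openlineage-example")
    # Ship the OpenLineage Spark integration with the job (coordinates vary by version).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.16.0")
    # Register the listener that emits OpenLineage events at runtime.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Point the listener at a server that collects the events.
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://my-openlineage-server:5000")
    .config("spark.openlineage.namespace", "spark_jobs")
    .getOrCreate()
)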
How is this different from Unity Catalog?
Databricks offers a feature called Unity Catalog, which allows customers to view Spark lineage. How does Foundational Code Engine differ from Unity Catalog? First, like OpenLineage, Unity Catalog extracts lineage solely from executed jobs, so it offers no lineage visibility for pending Pull Requests, as discussed in the OpenLineage comparison above.
Moreover, Unity Catalog is available at a higher pricing tier exclusive to Databricks customers, making it irrelevant to non-Databricks clients such as those using AWS EMR.