Extracting Lineage from Scala

Introduction

Scala is the backbone of modern data engineering. Whether you run Spark jobs on Databricks, build streaming pipelines with Apache Pekko or Akka, or define transformation logic using functional data processing libraries, Scala is often the language that moves and shapes data at scale.

Because of Scala's strong static typing and highly expressive functional patterns, determining exactly which columns are read, written, and transformed across a pipeline requires deep static analysis. When data moves from a Scala job through a data lake into a warehouse and on into BI dashboards, those flows become a black box for traditional runtime observability tools.

Foundational connects to your existing repositories and reads your Scala code directly. It builds column-level lineage from the code itself, without requiring a running cluster or any changes to your application.


Why this framework matters in your data stack

Scala pipelines can sit anywhere in your data stack: as batch jobs that read from operational databases and write to data lakes, as Spark transformations that reshape warehouse tables, or as streaming jobs that fan data out to multiple downstream consumers.

Upstream changes can silently break downstream transformations, dashboards, and machine-learning pipelines. For example, a renamed column in a source table or a restructured case class can propagate failures across the entire stack. Data teams need a way to predict downstream impact before they deploy.

Foundational CI checks every pending pull request against end-to-end lineage to ensure code changes do not disrupt downstream transformations, dashboards, ML models, or other consumers.


How Foundational analyzes this framework

Foundational's code engine scans and extracts lineage directly from Scala source code. Data teams gain full visibility into any Scala pipeline, including legacy jobs and custom transformation logic that do not rely on a standard framework.

This shift-left approach lets teams review data flow changes in pending pull requests so that changes in data stack behavior do not lead to data incidents, pipeline disruptions, or data quality issues.

Foundational supports a wide range of Scala data patterns, several of which are combined in the sketch after this list:

  • Scala Spark: DataFrame and Dataset transformations using the Spark Scala API, including select, join, groupBy, and withColumn.

  • Spark SQL in Scala: SQL strings executed via spark.sql(), analyzed for column-level read and write operations.

  • Case classes as schema definitions: Scala case classes used to define typed DataFrame schemas, extracted to determine column names and types.

  • Custom Scala pipelines: Bespoke transformation logic that reads from and writes to data stores such as S3, HDFS, BigQuery, or Snowflake.

  • Other Spark-based frameworks and formats: Pipelines built on Delta Lake, Apache Iceberg, and other Scala-native data processing libraries.
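
A minimal sketch combining several of these patterns, assuming an active SparkSession named spark; every path, table, and column name here is illustrative, not a reference to a real schema:

import org.apache.spark.sql.functions._

// Case class as a typed schema definition
case class Order(order_id: String, customer_id: String, total_amount: Double, status: String)

import spark.implicits._

// Scala Spark: typed Dataset read from a data lake path
val orders = spark.read.parquet("s3://data-lake/orders/").as[Order]

// Spark SQL in Scala: SQL string analyzed for column-level reads
orders.createOrReplaceTempView("orders")
val completed = spark.sql(
  "SELECT order_id, customer_id, total_amount FROM orders WHERE status = 'completed'")

// DataFrame transformation written to a warehouse table
completed
  .withColumn("total_with_tax", col("total_amount") * 1.1)
  .write.mode("overwrite").saveAsTable("analytics.completed_orders")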

Multi-step extraction process

Foundational uses a multi-step process to track data from source to sink.

Translate code to Abstract Syntax Tree (AST)

Foundational translates raw Scala code into an AST, a standardized hierarchical representation of its logic, transformations, and data access calls. For example:

import org.apache.spark.sql.functions.{col, sum}

// Read from the data lake, aggregate per customer, write to a warehouse table
val orders = spark.read.parquet("s3://data-lake/orders/")

val summary = orders
  .filter(col("status") === "completed")
  .groupBy(col("customer_id"))
  .agg(sum(col("total_amount")).as("total_spend"))

summary.write.mode("overwrite").saveAsTable("analytics.customer_summary")

Trace data paths from source to sink

The engine scans the AST to map data movement. It identifies ingestion points such as spark.read calls or JDBC connections and traces data through intermediate transformations until it lands in a persistent sink such as a warehouse table or data lake path.
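
As a sketch of what the engine identifies, the hypothetical pipeline below has one ingestion point (a JDBC read) and one persistent sink (a data lake path); the connection details, table, and column names are assumptions for illustration:

import org.apache.spark.sql.functions.col

// Ingestion point: JDBC read from an operational database
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/prod")
  .option("dbtable", "public.customers")
  .load()

// Intermediate transformation traced by the engine
val active = customers.filter(col("is_active") === true)

// Persistent sink: a data lake path
active.write.mode("append").parquet("s3://data-lake/active-customers/")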

Resolve and remove ephemeral nodes

During path finding, the code engine detects and removes ephemeral nodes such as temporary views or intermediate DataFrames. Bypassing these temporary entities collapses the intermediate steps, producing a clean lineage graph that reflects the true architectural flow. For example: S3 → Spark transformation → Snowflake.
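
In the hypothetical job below, the intermediate DataFrame and the temporary view are ephemeral nodes; the lineage that remains is s3://data-lake/events/ → Spark transformation → analytics.daily_events (all names are illustrative):

import org.apache.spark.sql.functions.col

val events = spark.read.parquet("s3://data-lake/events/")

// Ephemeral nodes: an intermediate DataFrame and a temporary view,
// both collapsed out of the final lineage graph
val filtered = events.filter(col("event_type") === "click")
filtered.createOrReplaceTempView("clicks")
val daily = spark.sql(
  "SELECT event_date, COUNT(*) AS clicks FROM clicks GROUP BY event_date")

// Only the source path and this sink table appear in the lineage
daily.write.mode("overwrite").saveAsTable("analytics.daily_events")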

Construct and link the graph

A single Scala script rarely contains full schema context. For example, spark.read.parquet(...) does not declare column names inline. Foundational merges code analysis results with schema definitions found elsewhere in your repositories or connected data systems. You see the exact columns flowing through each transformation.
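
For example, a case class defined in another file of the repository can supply the column names that the parquet read itself never declares; the file layout and names below are hypothetical:

// schemas/Order.scala: schema context found elsewhere in the repository
case class Order(order_id: String, customer_id: String, total_amount: Double)

// jobs/LoadOrders.scala: the read declares no columns inline;
// merging it with the case class yields order_id, customer_id, total_amount
import spark.implicits._
val orders = spark.read.parquet("s3://data-lake/orders/").as[Order]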


Advantages of Foundational's approach

Foundational provides data teams with:

  • Early visibility: Shows how Scala pipeline changes impact data flows during development. Integration covers GitHub, GitLab, Azure Repos, Bitbucket, and more.

  • Shift-left impact analysis: Detects breaking changes in open pull requests before they reach production, so downstream consumers can prepare in advance.

  • Intelligent noise reduction: Identifies and collapses ephemeral intermediate DataFrames and temporary views, producing a clean architectural map rather than a cluttered graph.

  • Reduced breakages: Prevents dashboards, ML features, transformation pipelines, and reverse ETL syncs from breaking due to upstream Scala schema changes.


Set up Scala lineage in Foundational

Setup is seamless and takes two steps:

  1. Connect the repositories that contain your Scala application code. There is no need to manually annotate code, add instrumentation, or modify your pipelines.

  2. The code engine automatically identifies Scala Spark jobs, SQL strings, and case class schema definitions, extracts lineage, detects changes in pull requests, and evaluates downstream impact.

