Extracting Lineage from Scala

Introduction

Scala is a first-class citizen of the modern data stack. Whether you run Spark jobs on Databricks, build streaming pipelines with Apache Pekko or Akka, or define transformation logic using functional data processing libraries, Scala is often the language that moves and shapes data at scale.

Because Scala combines strong static typing with highly expressive functional patterns, tracing exactly which columns are read, written, and transformed across Scala pipelines requires deep static analysis. When data flows from a Scala job through a data lake into a warehouse and then into BI dashboards, traditional runtime observability tools lose visibility — turning your Scala pipelines into a black box.

Foundational connects to your existing repositories and reads your Scala code directly. It builds column-level lineage from the code itself, without requiring a running cluster or any changes to your application.


Why this framework matters in your data stack

Scala pipelines can sit anywhere in your data stack: as batch jobs that read from operational databases and write to data lakes, as Spark transformations that reshape warehouse tables, or as streaming jobs that fan data out to multiple downstream consumers.

When an upstream change occurs — such as a renamed column in a source table or a restructured case class — it can silently break downstream transformations, dashboards, and machine-learning pipelines. Data teams need a way to predict the downstream impact of these schema changes before they deploy.
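
For instance, a seemingly harmless rename in a shared case class can compile everywhere yet break column references at runtime. A minimal sketch, with hypothetical names:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // Hypothetical schema: before the change the field was named totalAmount,
    //   case class Order(customerId: String, totalAmount: Double)
    // and a Pull Request renames it:
    case class Order(customerId: String, amount: Double)

    val spark = SparkSession.builder()
      .appName("schema-change-sketch")
      .master("local[*]") // local master so the sketch runs standalone
      .getOrCreate()
    import spark.implicits._

    val orders = Seq(Order("c1", 10.0)).toDS()

    // Column names are plain strings, so this downstream select still compiles,
    // but throws org.apache.spark.sql.AnalysisException once the rename ships:
    val spend = orders.select(col("customerId"), col("totalAmount"))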

Foundational CI, powered by comprehensive end-to-end lineage, thoroughly checks every pending Pull Request to ensure it doesn't disrupt downstream transformations, dashboards, ML models, and more.


How Foundational analyzes this framework

Foundational's code engine scans and extracts lineage straight from the Scala source code. Data teams gain full visibility into any Scala pipeline, including legacy jobs and custom transformation logic that does not rely on a standard framework.

This shift-left approach makes it possible to review data flow changes in pending Pull Requests, ensuring that any changes impacting your data stack do not lead to data incidents, pipeline disruptions, or compromises in data quality.

Foundational supports a wide range of Scala data patterns, including (see the sketch after this list):

  • Scala Spark — DataFrame and Dataset transformations using the Spark Scala API, including select, join, groupBy, and withColumn

  • Spark SQL in Scala — SQL strings executed via spark.sql(), analyzed for column-level read and write operations

  • Case classes as schema definitions — Scala case classes used to define typed DataFrame schemas, extracted to determine column names and types

  • Custom Scala pipelines — bespoke transformation logic that reads from and writes to data stores such as S3, HDFS, BigQuery, or Snowflake

  • Other Spark-based frameworks and formats — including pipelines built on Delta Lake, Apache Iceberg, and other Scala-native data processing libraries
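
As a concrete illustration, the sketch below combines several of these patterns: a case class as a schema definition, DataFrame transformations, and a SQL string executed via spark.sql(). All paths, table names, and columns are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // Case class as schema definition: field names and types become columns.
    case class Order(customerId: String, status: String, totalAmount: Double)

    val spark = SparkSession.builder()
      .appName("lineage-sketch")
      .master("local[*]") // local master so the sketch runs standalone
      .getOrCreate()
    import spark.implicits._

    // Scala Spark: typed read plus DataFrame transformations.
    val orders = spark.read.parquet("s3://data-lake/orders/").as[Order]
    val completed = orders
      .filter(col("status") === "completed")
      .withColumn("amountRounded", col("totalAmount").cast("long"))

    // Spark SQL in Scala: the SQL string carries column-level reads.
    completed.createOrReplaceTempView("completed_orders")
    val spend = spark.sql(
      """SELECT customerId, SUM(totalAmount) AS totalSpend
        |FROM completed_orders
        |GROUP BY customerId""".stripMargin)

    // Persistent sink that downstream consumers depend on.
    spend.write.mode("overwrite").saveAsTable("analytics.customer_spend")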

Multi-step extraction process

Foundational uses a multi-step process to extract lineage from Scala code:

  1. Translation to Abstract Syntax Tree (AST): Foundational translates the raw Scala code into an AST, a standardized hierarchy of the logic, transformations, and data access calls. For example:

    val orders = spark.read.parquet("s3://data-lake/orders/")

    val summary = orders
      .filter(col("status") === "completed")
      .groupBy(col("customer_id"))
      .agg(sum(col("total_amount")).as("total_spend"))

    summary.write.mode("overwrite").saveAsTable("analytics.customer_summary")
  2. Path finding (source to sink): The engine scans the AST to map data movement. It identifies ingestion points — such as spark.read calls or JDBC connections — and traces data through intermediate transformations until it lands in a persistent sink such as a warehouse table or data lake path.

  3. Ephemeral node resolution: During path finding, the code engine detects and removes ephemeral nodes such as temporary views or intermediate DataFrames. By bypassing these temporary entities, Foundational collapses the intermediary steps, resulting in a clean, accurate lineage graph that reflects the true architectural flow — for example, S3 → Spark transformation → Snowflake (see the sketch after this list).

  4. Graph construction and linking: Because a single Scala script rarely contains full schema context (for example, spark.read.parquet(...) does not declare column names inline), Foundational merges code analysis results with schema definitions found elsewhere in your repositories or connected data systems. You see the exact columns flowing through each transformation.
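
To make ephemeral node resolution (step 3) concrete, here is a hypothetical pipeline, with illustrative names, whose intermediate entities would be collapsed out of the final graph:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ephemeral-sketch")
      .master("local[*]") // local master so the sketch runs standalone
      .getOrCreate()

    // Persistent source: the ingestion point.
    val raw = spark.read.parquet("s3://data-lake/orders/")

    // Ephemeral node: a temporary view that lives only in this session.
    raw.createOrReplaceTempView("orders_tmp")

    // Ephemeral node: an intermediate DataFrame that is never persisted.
    val completed = spark.sql(
      "SELECT customer_id, total_amount FROM orders_tmp WHERE status = 'completed'")

    // Persistent sink: a warehouse table.
    completed.write.mode("overwrite").saveAsTable("analytics.completed_orders")

    // The reported lineage bypasses orders_tmp and the completed DataFrame:
    //   s3://data-lake/orders/ -> analytics.completed_orders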


Advantages of Foundational's approach

Foundational provides data teams with:

  • Early visibility: Shows how Scala pipeline changes impact data flows during development, seamlessly integrated into your source control — GitHub, GitLab, Azure Repos, Bitbucket, and more.

  • Shift-left impact analysis: Detects breaking changes in open Pull Requests before they reach production, so downstream consumers can prepare in advance.

  • Intelligent noise reduction: Identifies and collapses ephemeral intermediate DataFrames and temporary views. The result is a clean, actionable architectural map rather than a cluttered graph.

  • Reduced breakages: Prevents dashboards, ML features, transformation pipelines, and reverse ETL syncs from breaking due to upstream Scala schema changes.


Set up Scala lineage in Foundational

Setup is seamless:

  1. Connect the repositories that contain your Scala application code. There is no need to manually annotate code, add instrumentation, or modify your pipelines.

  2. From there, the code engine automatically identifies Scala Spark jobs, SQL strings, and case class schema definitions, extracts lineage, detects changes in Pull Requests, and evaluates downstream impact.

