Extracting Lineage from Java

Introduction

Java is the backbone of enterprise software. Whether you map Hibernate entities to a Postgres OLTP database, define repositories with Spring Data JPA, or embed raw SQL in JDBC calls, Java is often the upstream source of truth for the data that powers your analytics stack.

Because Java data access is spread across annotations, ORM conventions, and embedded SQL, determining exactly which columns are read, written, and transformed across services requires deep static analysis. When data moves from a Java-managed operational database through an ETL pipeline into a data warehouse and then into BI dashboards, the Java portion of the flow is a black box to traditional runtime observability tools.

Foundational connects to your existing repositories and reads your Java code directly. It builds column-level lineage from the code itself, without requiring a running database or any changes to your application.


Why Java matters in your data stack

Java applications can sit anywhere in your data stack: as the operational source that feeds ETL tools, as microservices that write to data lakes, or as batch jobs that populate warehouse tables. Java-managed schemas are the starting point for a chain of downstream dependencies.

Upstream changes can silently break downstream transformations, dashboards, and machine-learning pipelines. For example, a renamed column in a Hibernate entity or a removed field in a Spring Data JPA model can propagate failures across the entire stack. Data teams need a way to predict downstream impact before they deploy.

Foundational CI checks every pending pull request against end-to-end lineage to ensure code changes do not disrupt downstream transformations, dashboards, ML models, or other consumers.


How Foundational analyzes Java

Foundational's code engine scans and extracts lineage directly from Java source code. Data teams gain full visibility into any Java data flow, including legacy applications and custom JDBC code that do not rely on a common ORM framework.

This shift-left approach lets teams review data flow changes in pending pull requests so that changes in data stack behavior do not lead to data incidents, pipeline disruptions, or data quality issues.

Foundational supports a wide range of Java data frameworks:

  • Hibernate ORM: Entity classes annotated with @Entity, @Table, and @Column.

  • Spring Data JPA: Repository and entity definitions managed through the Spring ecosystem.

  • MyBatis: XML mapper files and annotated mapper interfaces that define SQL queries and column mappings.

  • JDBC: Raw SQL strings embedded in Java code, analyzed for column-level read and write operations.

  • jOOQ and QueryDSL: Typesafe query builder frameworks that define SQL programmatically in Java.

  • Other JPA providers and custom frameworks: Foundational's code engine handles a broad range of Java data access patterns, including in-house frameworks and less common ORMs.

Multi-step extraction process

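Foundational extracts lineage in four steps: it identifies the relevant files, parses annotations to extract the schema, analyzes SQL and data flow, and then constructs and links the lineage graph.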

Identify relevant files

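Foundational scans accessible repositories to locate Java source files and related resources, such as MyBatis XML mappers, that define or access database schemas. It uses heuristics to detect entity classes, mapper files, and embedded SQL strings across supported frameworks. For example, a Hibernate entity like the following is identified as relevant: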

@Entity
@Table(name = "orders")
public class Order {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "customer_id")
    private Long customerId;

    @Column(name = "total_amount")
    private BigDecimal totalAmount;

    @Column(name = "status")
    private String status;
}
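
As a rough illustration of the kind of heuristic this step describes, the sketch below classifies a file's contents by looking for JPA annotations, MyBatis mapper markers, or SQL-looking string literals. The class name RelevantFileDetector, the Kind enum, and the specific patterns are hypothetical and chosen for illustration; they are not Foundational's actual detection rules.

```java
import java.util.regex.Pattern;

final class RelevantFileDetector {
    enum Kind { ENTITY, MYBATIS_MAPPER, EMBEDDED_SQL, NONE }

    // Hypothetical patterns, for illustration only.
    private static final Pattern ENTITY =
        Pattern.compile("@(Entity|Table|MappedSuperclass)\\b");
    private static final Pattern MAPPER =
        Pattern.compile("<mapper\\s+namespace=|@Mapper\\b");
    private static final Pattern SQL_LITERAL = Pattern.compile(
        "\"\\s*(SELECT|INSERT\\s+INTO|UPDATE|DELETE\\s+FROM)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns the first category whose marker appears in the file contents.
    static Kind classify(String source) {
        if (ENTITY.matcher(source).find()) return Kind.ENTITY;
        if (MAPPER.matcher(source).find()) return Kind.MYBATIS_MAPPER;
        if (SQL_LITERAL.matcher(source).find()) return Kind.EMBEDDED_SQL;
        return Kind.NONE;
    }

    public static void main(String[] args) {
        System.out.println(classify("@Entity\n@Table(name = \"orders\")\npublic class Order {}"));
        System.out.println(classify("String sql = \"INSERT INTO t (a) SELECT a FROM s\";"));
    }
}
```

A text-pattern check like this is cheap enough to run over every file, which is why it works as a first-pass filter before any real parsing; it will produce false positives that a later parsing step would need to discard.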

Parse annotations and extract schema

Foundational parses Java source files using static analysis. It reads class-level and field-level annotations to extract:

  • Table names

  • Column names and types

  • Primary and foreign key relationships (@Id, @JoinColumn, @ManyToOne, @OneToMany)

  • Inheritance structures (@MappedSuperclass)


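When column names are not declared explicitly, Foundational applies the naming strategy the framework is configured with. For example, Spring Boot's default Hibernate configuration converts camelCase field names to snake_case (customerId becomes customer_id), whereas plain Hibernate keeps the field name unchanged unless a custom naming strategy is set.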
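
To make the naming rule concrete, here is a minimal sketch of a camelCase-to-snake_case default of the kind mentioned above. The class and method names are hypothetical, and the rule is deliberately simplified: real naming strategies also deal with acronyms, quoting, and other edge cases.

```java
final class SnakeCaseNaming {
    // Inserts an underscore wherever a lowercase letter is followed by an
    // uppercase letter, then lowercases the whole name.
    static String toColumnName(String fieldName) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fieldName.length(); i++) {
            char c = fieldName.charAt(i);
            if (i > 0 && Character.isUpperCase(c)
                    && Character.isLowerCase(fieldName.charAt(i - 1))) {
                out.append('_');
            }
            out.append(Character.toLowerCase(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toColumnName("customerId"));   // customer_id
        System.out.println(toColumnName("totalAmount"));  // total_amount
    }
}
```

Applied to the Order entity, a field without an explicit @Column name, such as customerId, would be inferred as the column customer_id under this rule.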

Analyze SQL and data flow

For MyBatis, JDBC, and other SQL-based access patterns, Foundational extracts lineage by analyzing SQL operations in mapper files and embedded query strings. It identifies which columns are read (SELECT), written (INSERT, UPDATE), and used for filtering or ordering. For example, consider this JDBC snippet:

String sql = "INSERT INTO reporting.order_summary (customer_id, total_amount) "
           + "SELECT customer_id, SUM(total_amount) "
           + "FROM orders "
           + "GROUP BY customer_id";
statement.execute(sql);

It produces the following column-level lineage, where total_amount is aggregated with SUM before it is written:

orders.customer_id → reporting.order_summary.customer_id

orders.total_amount → reporting.order_summary.total_amount
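
The sketch below shows, under strong simplifying assumptions, how lineage edges like these could be derived from an INSERT ... SELECT statement by pairing target columns with select expressions by position. It uses a regular expression instead of a real SQL parser, handles only a single source table, and does not support commas nested inside expressions. The class name InsertSelectLineage and the "->" edge format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InsertSelectLineage {
    // Matches only the simple shape: INSERT INTO t (c1, c2) SELECT e1, e2 FROM s ...
    private static final Pattern INSERT_SELECT = Pattern.compile(
        "INSERT\\s+INTO\\s+([\\w.]+)\\s*\\(([^)]*)\\)\\s*SELECT\\s+(.+?)\\s+FROM\\s+([\\w.]+)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Picks the innermost identifier of an expression such as SUM(total_amount).
    private static final Pattern COLUMN = Pattern.compile("(\\w+)\\s*\\)*\\s*$");

    static List<String> extract(String sql) {
        List<String> edges = new ArrayList<>();
        Matcher m = INSERT_SELECT.matcher(sql);
        if (!m.find()) {
            return edges; // unsupported statement shape
        }
        String target = m.group(1);
        String[] targetCols = m.group(2).split(",");
        String[] exprs = m.group(3).split(","); // naive split: no nested commas
        String source = m.group(4);
        for (int i = 0; i < Math.min(targetCols.length, exprs.length); i++) {
            Matcher c = COLUMN.matcher(exprs[i].trim());
            if (c.find()) {
                edges.add(source + "." + c.group(1) + " -> "
                    + target + "." + targetCols[i].trim());
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO reporting.order_summary (customer_id, total_amount) "
                   + "SELECT customer_id, SUM(total_amount) "
                   + "FROM orders "
                   + "GROUP BY customer_id";
        extract(sql).forEach(System.out::println);
    }
}
```

A production analyzer would need a full SQL grammar to cover joins, subqueries, aliases, and dialect differences; the positional pairing of target columns and select expressions is the part this sketch is meant to show.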

Construct and link the graph

A single Java service rarely contains full schema context. Foundational merges code analysis results with schema definitions found elsewhere in your repositories or data systems. It resolves cross-module dependencies, handles wildcard references like SELECT * using known schema context, and assembles the complete column-level lineage graph, tracing data from the Java-managed operational database through the warehouse, into transformations, and out to BI tools and downstream consumers.
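
As an illustration of resolving a wildcard with known schema context, the following sketch expands a single-table SELECT * into fully qualified columns using a schema map gathered elsewhere. The class name WildcardResolver, the shape of the schema map, and the single-table restriction are assumptions made for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WildcardResolver {
    private static final Pattern SELECT_STAR = Pattern.compile(
        "SELECT\\s+\\*\\s+FROM\\s+([\\w.]+)", Pattern.CASE_INSENSITIVE);

    // Returns the fully qualified columns read by a single-table SELECT *,
    // or an empty list when the query shape or the table schema is unknown.
    static List<String> expand(String sql, Map<String, List<String>> knownSchemas) {
        Matcher m = SELECT_STAR.matcher(sql);
        if (!m.find()) {
            return List.of();
        }
        String table = m.group(1);
        return knownSchemas.getOrDefault(table, List.of()).stream()
            .map(column -> table + "." + column)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> schemas =
            Map.of("orders", List.of("id", "customer_id", "total_amount", "status"));
        System.out.println(expand("SELECT * FROM orders", schemas));
    }
}
```

The point of the sketch is that the query text alone cannot say which columns a wildcard reads; the answer depends on schema information collected from entity definitions or other sources.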


Advantages of Foundational's approach

Foundational provides data teams with:

  • Early visibility: Shows how Java schema changes impact data flows during development. Integration covers GitHub, GitLab, Azure Repos, Bitbucket, and more.

  • Shift-left impact analysis: Detects breaking changes in open pull requests before they reach production, so downstream consumers can prepare in advance.

  • Broad Java framework coverage: Supports Hibernate, Spring Data JPA, MyBatis, JDBC, jOOQ, QueryDSL, and more, including embedded SQL and legacy code that runtime tools miss entirely.

  • Reduced breakages: Prevents dashboards, ML features, transformation pipelines, and reverse ETL syncs from breaking due to upstream Java schema changes.
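
To show what an impact check over a lineage graph can look like, here is a minimal sketch that walks the graph breadth-first from a changed column and collects everything downstream of it. The adjacency-map representation, the class name ImpactAnalysis, and the node name dashboard.revenue are hypothetical; this is not a description of Foundational's internal implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ImpactAnalysis {
    // Breadth-first walk over edges of the form "column -> direct consumers".
    // The visited set also guards against cycles in the graph.
    static Set<String> downstream(Map<String, List<String>> edges, String changed) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(changed);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : edges.getOrDefault(node, List.of())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "orders.total_amount", List.of("reporting.order_summary.total_amount"),
            "reporting.order_summary.total_amount", List.of("dashboard.revenue"));
        System.out.println(downstream(edges, "orders.total_amount"));
    }
}
```

In a pull-request check, the changed column would come from diffing the schema extracted before and after the change, and a non-empty result would list the consumers to review before merging.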


Set up Java lineage in Foundational

Setup only requires connecting your repositories; the rest is automatic.

  1. Connect the repositories that contain your Java application code. There is no need to manually annotate code, add instrumentation, or modify your pipelines.

  2. The code engine automatically identifies Java data access patterns across supported frameworks, extracts schema and lineage, detects changes in pull requests, and evaluates downstream impact.

