Extracting Lineage from Python

Introduction

Python is the cornerstone of modern data engineering and AI. Whether you use Pandas for data wrangling or orchestrate complex machine learning models with APIs like OpenAI, Python is the "glue" that holds the data stack together.

Because Python is highly dynamic, tracing data through custom scripts is notoriously difficult. When data moves from a warehouse through a Python script to an AI service and into a data lake, traditional runtime observability tools lose visibility, turning your Python pipelines into a black box.

Foundational connects to your existing repositories and reads your Python code directly. It builds lineage from the code itself.


Why this framework matters in your data stack

Python scripts can sit anywhere in your data stack: between upstream sources and your data warehouse, between the warehouse and downstream consumers such as S3 data lakes, or between any two points where custom logic is needed.

When an upstream change occurs, such as a renamed column in a Snowflake table, it can silently break downstream Python scripts or feed faulty data into AI models. Data teams need a way to predict the downstream impact of these schema changes.
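As a minimal sketch of this failure mode (the table and column names here are hypothetical, not from Foundational), consider a downstream script that still expects a column that was renamed upstream:

```python
# Hypothetical scenario: upstream renamed "order_total" to "order_amount",
# but this script still references the old column name.
row = {"order_amount": 19.99, "customer_id": 42}

def compute_revenue(record):
    # The lookup only fails at runtime, when the script runs against new data.
    return record["order_total"]

try:
    compute_revenue(row)
    schema_break = None
except KeyError as exc:
    schema_break = f"missing column: {exc}"

print(schema_break)  # the rename surfaces only as a runtime error
```

Because the error appears only at execution time, catching it earlier requires analyzing the code itself, which is what static lineage makes possible.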

Foundational CI, powered by comprehensive end-to-end lineage, thoroughly checks every pending Pull Request to ensure it doesn't disrupt downstream transformations, dashboards, ML models, and more.


How Foundational analyzes this framework

Foundational's code engine scans and extracts lineage straight from the source code. Data teams gain full visibility into any Python pipeline, including legacy pipelines and custom code that does not rely on common frameworks like Airflow or Spark.

This shift-left approach makes it possible to review data flow changes in pending Pull Requests, ensuring that any changes impacting your data stack do not lead to data incidents, pipeline disruptions, or compromises in data quality.

Example of Python code analyzed by Foundational's code engine

Multi-step extraction process

Foundational uses a multi-step process to track data:

  1. Translation to Abstract Syntax Tree (AST): Foundational translates the raw Python code into an AST, a standardized hierarchy of the logic, variables, and API calls.

  2. Path Finding (Source to Sink): The engine scans the AST to map data movement. It identifies ingestion points and traces data through intermediate transformations until it lands in a persistent sink.

  3. Ephemeral Node Resolution: During path finding, the code engine intelligently detects and removes ephemeral nodes, such as temporary local files. By bypassing these temporary entities, Foundational collapses the intermediary steps, resulting in a clean, accurate lineage graph that reflects the true architectural flow (e.g., Snowflake → OpenAI → S3).

  4. Graph Construction and Linking: Because a single script rarely contains full schema context (e.g., SELECT * FROM ORDERS), Foundational merges standalone code analysis results with schema definitions found elsewhere in your warehouse or repositories. You see the exact columns flowing into your models.
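The first two steps above can be sketched with Python's built-in `ast` module. The source/sink call names matched here (`read_sql`, `to_parquet`, and so on) are illustrative assumptions for the sketch, not Foundational's actual detection rules:

```python
import ast

SOURCE_CALLS = {"read_sql", "read_csv"}   # assumed ingestion points
SINK_CALLS = {"to_parquet", "to_csv"}     # assumed persistent sinks

# A toy pipeline script; it is parsed, never executed.
script = """
import pandas as pd
df = pd.read_sql("SELECT * FROM ORDERS", conn)
df.to_parquet("s3://lake/orders/")
"""

def extract_endpoints(code):
    """Walk the AST and collect method calls that look like sources or sinks."""
    sources, sinks = [], []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            if name in SOURCE_CALLS:
                sources.append(name)
            elif name in SINK_CALLS:
                sinks.append(name)
    return sources, sinks

sources, sinks = extract_endpoints(script)
print(sources, sinks)  # ['read_sql'] ['to_parquet']
```

Note that the script is only parsed, not run: the undefined `conn` variable is irrelevant to static analysis, which is why this approach works without access to a runtime environment.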

Lineage graph showing the Inspector panel and Python code

① Data lineage

② Inspector panel showing upstream / downstream count. When open, each section displays related entities.

③ Python code


Advantages of Foundational's approach

Foundational provides data teams with:

  • Early visibility: Shows how Python code impacts data flows during development, seamlessly integrated into your source control (e.g., GitHub, GitLab, Azure, Bitbucket).

  • Intelligent noise reduction: Identifies and collapses ephemeral endpoints (like temporary .csv or JSON files). The result is a clean, actionable architectural map rather than a cluttered graph.

  • Cross-system tracking: Connects governed data warehouses to external AI endpoints (like OpenAI) and unstructured data lakes.

  • Reduced breakages: Prevents dashboards, ML features, and Python pipelines from breaking due to upstream schema changes.
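The noise-reduction idea can be illustrated with a small sketch (node names and the ephemeral-file heuristic are hypothetical assumptions, not Foundational's implementation): remove temp-file nodes from a lineage edge list and reconnect their neighbors.

```python
EPHEMERAL_SUFFIXES = (".csv", ".json")  # assumed markers of temporary local files

def is_ephemeral(node):
    # Treat local temp files as ephemeral; URIs like s3://... are real sinks.
    return "://" not in node and node.endswith(EPHEMERAL_SUFFIXES)

def collapse(edges):
    """Drop ephemeral nodes and bridge each one's predecessors to its successors."""
    edges = list(edges)
    changed = True
    while changed:
        changed = False
        for node in {n for e in edges for n in e if is_ephemeral(n)}:
            preds = [s for s, d in edges if d == node]
            succs = [d for s, d in edges if s == node]
            edges = [(s, d) for s, d in edges if node not in (s, d)]
            edges += [(p, q) for p in preds for q in succs]
            changed = True
    return edges

raw = [
    ("snowflake.ORDERS", "/tmp/orders.csv"),
    ("/tmp/orders.csv", "api.openai.com"),
    ("api.openai.com", "/tmp/enriched.json"),
    ("/tmp/enriched.json", "s3://lake/orders/"),
]
clean = collapse(raw)
print(sorted(clean))
```

The four raw edges collapse to two (Snowflake → OpenAI → S3), matching the clean architectural flow described above.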


Set up Python lineage in Foundational

Setup is seamless.

  1. Connect the repositories that house your Python scripts. There is no need to manually annotate code or alter your runtime environment.

  2. From there, the code engine automatically identifies Python logic, tracks API connections, and extracts schema and lineage.

