What is SQLAlchemy?

SQLAlchemy is an Object-Relational Mapping (ORM) library for Python, designed to facilitate database operations by allowing developers to interact with databases using Python objects instead of raw SQL queries. It provides a high-level ORM that maps Python classes to database tables, enabling developers to perform CRUD (Create, Read, Update, Delete) operations in an intuitive and Pythonic way. Additionally, SQLAlchemy includes a lower-level Core system that allows for more direct control over SQL statements and database interactions. This flexibility makes SQLAlchemy a powerful tool for building complex applications, as it can handle both simple and advanced database tasks while ensuring efficiency and scalability.

Why Is SQLAlchemy Lineage and Schema Management Crucial?

SQLAlchemy is often a key component within development teams for managing operational and production databases, particularly in OLTP environments. These tables are subsequently ingested into data warehouses, such as BigQuery or Snowflake, using popular ETL tools like Fivetran and AirByte. Once in the warehouse, the data is transformed using tools like dbt or Google Dataform and is then consumed by visualization platforms like Tableau and Looker, or by other downstream users, including data science teams and reverse ETL processes.

As a result, any change made by the development team to the upstream SQLAlchemy model has the potential to impact the entire data stack, potentially disrupting visualization dashboards, machine learning models, and more.

Ideally, teams relying on data from upstream tables managed by SQLAlchemy should be informed whenever changes that might affect them—such as renaming or removing columns—are planned. This proactive approach allows impacted teams to adapt their systems accordingly, ensuring that changes can be implemented without causing data incidents or compromising data quality.

Automated SQLAlchemy Schema Extraction from Code

With the Foundational Code Engine, data lineage and schema extraction are automated by directly examining SQLAlchemy code. This approach empowers customers to review SQLAlchemy schema changes during the development phase or while they are still in a pending Pull Request, making it possible to evaluate the potential downstream effects of these changes prior to deployment.

A Unique Approach Compared to Traditional Methods

Example of SQLAlchemy model from our own code at Foundational

Unlike most other vendors, who typically connect directly to the relevant database—such as a Postgres database hosted on AWS RDS—to retrieve the current schema, Foundational takes a different approach. While the traditional method is simpler to implement, it only allows customers to see schema changes after they have been deployed, missing the opportunity to be notified of changes in advance.

How Does Foundational Extract SQLAlchemy Schema from Code?

To accomplish this, the Foundational Code Engine follows a systematic process:

Identifying Relevant Source Files
Our Code Engine scans accessible repositories, searching for SQLAlchemy definition files using various heuristics, such as detecting Python classes that inherit from sqlalchemy.ext.declarative.declarative_base().
Creating a Sandbox Environment to Load These Classes
Utilizing dynamic mocks in Python, our Code Engine sets up a sandbox environment that enables the safe import of all relevant SQLAlchemy definitions. These mocks ensure that all necessary imports and non-built-in function calls are properly handled.
Loading SQLAlchemy Classes in the Sandbox Environment
In this sandbox, our Code Engine loads all SQLAlchemy classes and uses Python introspection mechanisms to examine each class’s attributes, constructing the schema from this data. This method allows us to extract even the most complex SQLAlchemy schemas, including those containing computations and complex logic.

Why Doesn’t Foundational Use Alembic?

Alembic is a popular tool often paired with SQLAlchemy for managing database schema migrations. It enables users to easily upgrade or downgrade their database schemas across various versions.

However, we chose SQLAlchemy over Alembic for implementing data lineage and data contracts for several reasons:

SQLAlchemy as the Source of Truth
SQLAlchemy, not Alembic, serves as the definitive source for database schemas. Companies might update SQLAlchemy models, thereby altering database schemas, but forget to update Alembic migrations. We prioritize catching every schema change, regardless of whether Alembic is updated.
Potential for Broken Alembic Migrations
Alembic migrations can become outdated or broken over time, particularly when no one expects to downgrade their database to a schema from years ago. Since Alembic relies on sequentially applying migrations, even one broken migration can compromise the entire schema analysis.
Scaling Considerations
SQLAlchemy scales with the database size, while Alembic scales with the number of schema changes. Over time, the number of changes often far exceeds the number of tables. By relying on SQLAlchemy, we ensure scalability aligned with the actual database size, which is more crucial than the sheer number of schema changes.

Setup Instructions

Setting up SQLAlchemy schema and lineage extraction is straightforward: simply connect the repositories that include your SQLAlchemy definitions. There’s nothing more to it!

From there, the Foundational Code Engine will automatically identify the SQLAlchemy definitions within the repository and proceed with the schema extraction process.

Foundational Spark Lineage

Extracting Lineage from Ruby Active Record

Extracting Lineage from Liquibase Migrations

Extracting Lineage from TypeORM (TypeScript ORM)

Extracting Lineage from Entity Framework Core (Built-in C# ORM)