
Data Contracts: How Foundational Simplifies Automation and Enforcement


Overview

Data Contracts empower teams to define and enforce data quality expectations directly within their source code. A data contract is a YAML file (.fd.yml or .fd.yaml) that lives alongside your project’s code and declaratively specifies the rules, ownership, and quality standards for your data assets.
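For orientation, a minimal contract file might look like the sketch below. The names and values are illustrative, and only fields documented in this article are used:

```yaml
# orders.fd.yml - minimal illustrative contract (example values)
contract_id: "orders_v1"
version: "1.0.0"
status: "Active"
description: "Quality expectations for the orders table."

data_owner:
  name: "Jon Smith"
  email: "jon@example.com"
  team: "Data Engineering"
  receive_alerts: true

tables:
  - description: "Master list of orders."
    data_source:
      type: "Snowflake"
      database: "prod_db"
      schema: "public"
      table_name: "orders"
```

The full format, including monitors, constraints, and thresholds, is documented section by section further down.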

Distributed Ownership and Scalability

By enabling every team to define contracts within their own repository, Foundational eliminates the need for a single, central vendor-specific system for data governance. This distributed approach offers several key advantages:

  1. Proximity to Code: Defining contracts near the relevant code makes them easier to maintain and ensures they evolve alongside your data products.

  2. Scalable Quality: Organizations can scale data quality across many repositories and teams without the bottleneck of a centralized platform.

  3. Open Format & No Vendor Lock-in: Foundational uses an open-format YAML, ensuring your data definitions remain portable and preventing reliance on proprietary vendor systems.

Automated Monitoring and Alerts

When Foundational scans repositories configured with data contract files, it discovers and validates these contracts to enforce your defined monitors and policies. This ensures that any changes impacting your data stack—whether in data engineering or production code—do not compromise data quality.

If a contract rule is violated, the configured contacts are immediately notified through the connected channels: email, Slack, or Microsoft Teams.

Capabilities

  • Ownership & accountability: Define a data owner, steward, support contact, and downstream consumers for every dataset.

  • Freshness / timeliness monitoring: Set the maximum acceptable latency for a table and get alerted when data goes stale, tracked by write events, row-count changes, or both.

  • Schema-level field definitions: Document every column's name, type, nullability, description, sensitivity flag, and example values.

  • Field constraints: Enforce regex patterns, allowed-value lists, and min/max value ranges on individual columns.

  • Quality thresholds: Set target completeness, uniqueness, and validity percentages per column.

  • Advanced rate metrics: Monitor zero rate, true rate, null rate, and distinct-values rate with configurable min/max bounds (0–100%).

  • Advanced value metrics: Set bounds on min, max, count, sum, average, and row count for any table.

  • Custom SQL monitors: Run arbitrary SQL queries on a schedule and monitor the result columns for violations.

  • Incremental monitoring: Optionally scan only new or changed rows based on an incrementing column (timestamp, date, or integer).

  • Flexible scheduling: Choose an interval (minimum 1 hour) or a cron expression for monitoring frequency.

  • Automated alerting: Contacts marked with receive_alerts: true are automatically notified via email, Slack, or Microsoft Teams when violations occur.

  • Version control: Contracts are versioned and stored in Git, giving you a full audit trail of every change.

Example use-cases

  • Guarantee SLA freshness for a critical table. Define a timeliness rule that alerts the data engineering team within 1 hour if the orders table hasn't been updated, but only when downstream consumers are actively reading the data (only_if_read: true).

  • Prevent bad data from reaching production. For example, if a corrupted ETL process causes an unexpected spike in NULL values for a critical field like customer_id, Foundational flags the issue immediately, allowing you to address the data integrity problem at the source.

  • Monitor referential integrity across tables. Write a custom SQL monitor that checks for orphaned child records (e.g., order_line_items referencing an order_id that no longer exists in orders) and get alerted when the count rises above zero.

  • Track data completeness over time. Set a completeness threshold of 98% on a customer_email column and monitor the null rate to catch ETL regressions before they impact downstream reports.

  • Establish data ownership. Assign a data owner, data steward, and support contact to every dataset so that when something breaks, the right person is notified on the right channel.

  • Monitor high-value numeric columns. Use value metrics to set bounds on a total_amount column's sum, average, min, and max so that anomalous financial data is surfaced early.
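The first use-case above can be sketched as a timeliness rule. This is an illustrative fragment composed from the fields documented later in this article; names and values are examples:

```yaml
tables:
  - description: "Orders table with a 1-hour freshness SLA."
    data_source:
      type: "Snowflake"
      database: "prod_db"
      schema: "public"
      table_name: "orders"
    timeliness:
      frequency: "hourly"
      max_latency: "1 hour"
      only_if_read: true          # Alert only when stale data is actually read
      timeliness_by_writes: true
      timeliness_alert:
        description: "orders table has not been updated within its SLA"
        severity: "HIGH"
```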


Data Contracts vs. Traditional Observability

While traditional data observability tools focus on monitoring data at rest via a UI, Foundational’s Data Contracts shift this logic into your development workflow.

  • Rule definition: Traditional tools define rules in a non-scalable, vendor-specific UI; Foundational contracts are defined in code (YAML) within the relevant repository.

  • Vendor lock-in: With traditional tools it is high, since rules are trapped in a proprietary platform; with Foundational there is none, because the open-format YAML belongs to your codebase.

  • Management: Traditional tools are centralized, so every team must use one platform; Foundational is decentralized, with teams managing rules in their own Git repos.

  • Transparency: In traditional tools, changes often lack clear audit trails; with Foundational, every change is tracked and reviewed in Git.


Getting Started with Data Contracts

To begin enforcing data standards in Foundational, follow these steps to move from definition to automated enforcement:

Step 1: Define your contract

Outline your data validations, invariant rules, and governance policies in a YAML file.

  • Extension: Use the .fd.yml or .fd.yaml extension.

  • Location: Place the file anywhere in a repository Foundational scans for contracts, ideally near the code it describes.

  • Pro Tip: You can leverage AI tools to generate these YAML definitions at scale based on your existing table schemas.

Step 2: Create and Merge

Create a Pull Request in your Git repository containing the new contract file.

  • Discovery: Foundational automatically scans your repositories and recursively discovers these files.

  • Transparency: Because contracts are code, every change is tracked, reviewed, and versioned in Git, providing a full audit trail.

Step 3: Connect Notifications

To ensure the right people are alerted when a contract is violated, finalize your communication settings.

  • Alert Channels: Connect Slack or Microsoft Teams in your Foundational settings.

  • Receiver Logic: Ensure the data_owner or relevant contacts have receive_alerts: true set in the YAML file.

Data Contract YAML Format

Contract metadata (required)

# Unique ID across all contract files in the repo
contract_id: "order_events_v1"
# Version string (for your own tracking)
version: "1.0.0"
# Active | Inactive | Draft | Deprecated | Retired
status: "Active"
description: "Contract for the order events dataset."

Note: Only contracts with status Active are enforced. Use Draft while iterating, Deprecated or Retired to phase out old contracts, and Inactive to temporarily disable enforcement.

Ownership & contacts (data_owner required)

data_owner:                        # Required
  name: "Jon Smith"
  email: "jon@example.com"
  team: "Data Engineering"
  receive_alerts: true             # Will be notified on violations

data_steward:                      # Optional
  name: "Jane Doe"
  email: "jane@example.com"
  team: "Data Governance"
  receive_alerts: true

support_contact:                   # Optional
  team: "BI Support"
  email: "bi-support@example.com"
  slack_channel: "#data-support"   # Slack channel for alerts
  teams_channel: "Data Alerts"     # Microsoft Teams channel for alerts
  receive_alerts: true

consumers:                         # Optional: downstream teams
  - team: "Sales Analytics"
    email: "sales@example.com"
  - team: "Operations"
    email: "ops@example.com"
    receive_alerts: true

Any contact with receive_alerts: true will be notified through the configured channels (email, Slack, and/or Teams) when a violation occurs.

Business context (optional)

business_domain: "Order Management"
business_description: |
  This dataset contains one record per order event.
  It is used by downstream analytics for dashboards and billing.
tags:
  - "order"
  - "high_priority"
  - "PII"

Table definitions (required)

Each contract must define at least one table:

tables:
  - description: "Master list of orders."
    data_source:
      # Snowflake | BigQuery | Postgres | MySQL | Oracle | MongoDB
      # | GlueCatalog | S3
      type: "Snowflake"
      database: "prod_db"
      schema: "public"
      table_name: "orders"

Timeliness (optional)

Monitor data freshness and get alerted when data is stale:

    timeliness:
      frequency: "hourly"              # hourly | daily | weekly | monthly
      max_latency: "1 hour"            # Supported: "X hour(s)" or "X day(s)"
      only_if_read: true               # Only alert if stale data was actually read
      timeliness_by_writes: true       # Track freshness by write events
      timeliness_by_row_count: false   # Track freshness by row-count changes
      timeliness_alert:
        description: "Orders table is stale"
        severity: "HIGH"               # LOW | MEDIUM | HIGH | CRITICAL

Partitioning (optional)

Document the table's partitioning strategy:

    partitioning:
      strategy: "ByDate"            # ByDate | ByHash
      columns:
        - name: "order_created_at"
          type: "date"
          description: "Partition on order date"
          frequency: "daily"        # hourly | daily | weekly | monthly
          format: "yyyy-MM-dd"

Scheduling (optional)

Control how often monitors run. Defaults to every 1 hour if omitted.

    # Interval-based (minimum 1 hour):
    schedule:
      schedule_type: "interval"
      frequency_hours: 1

    # Or cron-based (must fire on rounded hours, i.e., minute = 0):
    schedule:
      schedule_type: "cron"
      cron_expression: "0 */6 * * *"   # Every 6 hours
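Because cron schedules must fire on rounded hours, it can be handy to sanity-check the minute field before committing a contract. The helper below is a hypothetical convenience function, not part of Foundational, sketched under the assumption of standard five-field cron expressions:

```python
def cron_fires_on_rounded_hours(cron_expression: str) -> bool:
    """Return True if a 5-field cron expression fires only at minute 0."""
    fields = cron_expression.split()
    if len(fields) != 5:
        return False  # Standard cron has 5 fields: minute hour dom month dow
    return fields[0] == "0"

# "0 */6 * * *"  -> fires at minute 0 every 6 hours: acceptable
# "30 */6 * * *" -> fires at minute 30: not a rounded hour
```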

Incremental monitoring (optional)

Monitor only new/changed data for efficiency on large tables:

    incremental_monitoring:
      increment_by_column_name: "updated_at"
      column_type: "timestamp"             # timestamp | date | int | bigint
      column_format: "%Y-%m-%d %H:%M:%S"   # Optional format string
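The column_format shown above uses strftime-style directives. Assuming standard strftime semantics, a quick Python check illustrates how a timestamp maps to that format:

```python
from datetime import datetime

# Render and re-parse a timestamp with the contract's column_format
fmt = "%Y-%m-%d %H:%M:%S"
ts = datetime(2025, 6, 2, 13, 45, 0)
rendered = ts.strftime(fmt)
print(rendered)             # 2025-06-02 13:45:00
parsed = datetime.strptime(rendered, fmt)
assert parsed == ts         # Round-trips losslessly at second precision
```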

Field definitions (optional)

Define column-level schema, constraints, and quality thresholds:

    fields:
      - name: "order_id"
        type: "string"
        nullable: false
        description: "Globally unique order identifier."
        example: "ORD-20250602-12345"
        sensitive: false                     # Mark PII/sensitive columns as true
        constraints:
          pattern: "^ORD-[0-9]{8}-[0-9]+$"   # Regex pattern
        quality_thresholds:
          completeness: 100.0                # % of non-null values (0–100)
          uniqueness: 100.0                  # % of distinct values (0–100)

      - name: "order_status"
        type: "string"
        nullable: false
        constraints:
          allowed_values:                    # Enumerated allowed values
            - "PENDING"
            - "CONFIRMED"
            - "SHIPPED"
            - "DELIVERED"
            - "CANCELLED"
        quality_thresholds:
          validity: 100.0                    # % of valid values (0–100)

      - name: "total_amount"
        type: "decimal(10,2)"
        nullable: false
        constraints:
          value_range:
            min: 0.0
            max: 100000.0
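It's worth sanity-checking a pattern constraint against your example values before committing the contract. For instance, the order_id pattern above can be exercised in Python with standard regex semantics:

```python
import re

pattern = r"^ORD-[0-9]{8}-[0-9]+$"   # order_id pattern from the contract

assert re.match(pattern, "ORD-20250602-12345")      # example value: matches
assert not re.match(pattern, "ORD-2025-12345")      # too few date digits
assert not re.match(pattern, "ord-20250602-12345")  # lowercase prefix rejected
print("pattern checks passed")
```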

Advanced quality metrics (optional)

For finer-grained control, use rate_metrics and value_metrics inside quality_thresholds:

        quality_thresholds:
          rate_metrics:                # All rate values are percentages
            zero_rate:
              max: 5.0                 # At most 5% zeros
            null_rate:
              max: 2.0                 # At most 2% nulls
            true_rate:
              min: 0.0
              max: 100.0
            distinct_values_rate:
              min: 95.0                # At least 95% distinct values

          value_metrics:               # Bounds on aggregate statistics
            min:
              min: 0.0                 # Column minimum must be >= 0
              max: 100.0               # Column minimum must be <= 100
            max:
              min: 0.0
              max: 10000.0
            count:
              min: 1.0
            sum:
              min: 0.0
            avg:
              min: 0.0
              max: 500.0

Note: For each field, use either the simple thresholds (completeness, uniqueness, value_range) or the advanced metrics (rate_metrics, value_metrics), not both.

Custom SQL monitors (optional)

Run arbitrary SQL queries and monitor the output columns:

custom_monitors:
  - name: "referential_integrity_check"
    description: "Verify every line item references a valid order."
    data_source:
      type: "Snowflake"
      database: "prod_db"
    sql_statement: >
      SELECT count(*) AS bad_references
      FROM order_line_items
      LEFT JOIN orders ON order_line_items.order_id = orders.id
      WHERE orders.id IS NULL;
    fields:
      - name: "bad_references"
        type: "integer"
        description: "Count of orphaned line items."
        constraints:
          value_range:
            max: 0.0   # Should always be zero

Custom monitors also support schedule and incremental_monitoring with the same syntax as table definitions.
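For example, the monitor above could carry its own cron schedule. This fragment is illustrative, combining only fields documented in this article:

```yaml
custom_monitors:
  - name: "referential_integrity_check"
    data_source:
      type: "Snowflake"
      database: "prod_db"
    sql_statement: >
      SELECT count(*) AS bad_references
      FROM order_line_items
      LEFT JOIN orders ON order_line_items.order_id = orders.id
      WHERE orders.id IS NULL;
    schedule:
      schedule_type: "cron"
      cron_expression: "0 */6 * * *"   # Check every 6 hours
    fields:
      - name: "bad_references"
        type: "integer"
        constraints:
          value_range:
            max: 0.0
```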


Full example

Below is a complete .fd.yml file demonstrating the major features:

contract_id: "order_events_v1.0"
version: "1.0.0"
status: "Active"
description: >
  Data contract for the Order Events topic. Contains real-time events
  emitted whenever an order is created, updated, or cancelled.

data_owner:
  name: "Jon Smith"
  email: "jon@example.com"
  team: "Data Engineering"
  receive_alerts: true

data_steward:
  name: "Jane Doe"
  email: "jane@example.com"
  team: "Data Governance"
  receive_alerts: true

support_contact:
  team: "BI Support"
  email: "bi-support@example.com"
  slack_channel: "#order-events-support"
  receive_alerts: true

consumers:
  - team: "Sales Analytics"
    email: "sales-analytics@example.com"
  - team: "Operations Dashboard"
    email: "ops-dashboard@example.com"
    receive_alerts: true

business_domain: "Order Management"
business_description: |
  This dataset contains one record per order event:
  - Creation
  - Status updates (e.g., PENDING → SHIPPED → DELIVERED)
  - Cancellation

tags:
  - "order"
  - "event_stream"
  - "high_priority"

tables:
  - description: "Master list of orders and their current status."
    data_source:
      type: "Snowflake"
      database: "prod_analytics_db"
      schema: "public"
      table_name: "orders"

    timeliness:
      frequency: "hourly"
      max_latency: "1 hour"
      only_if_read: true
      timeliness_by_writes: true
      timeliness_by_row_count: false
      timeliness_alert:
        description: "Alert if orders table is stale"
        severity: "HIGH"

    fields:
      - name: "order_id"
        type: "string"
        nullable: false
        description: "Globally unique identifier for the order."
        example: "ORD-20250602-12345"
        constraints:
          pattern: "^ORD-[0-9]{8}-[0-9]+$"
        quality_thresholds:
          completeness: 100.0
          uniqueness: 100.0

      - name: "customer_id"
        type: "string"
        nullable: false
        description: "Unique identifier for the customer."
        sensitive: true
        constraints:
          pattern: "^CUST-[0-9]+$"
        quality_thresholds:
          completeness: 100.0

      - name: "order_status"
        type: "string"
        nullable: false
        description: "Current status of the order."
        constraints:
          allowed_values:
            - "PENDING"
            - "CONFIRMED"
            - "SHIPPED"
            - "DELIVERED"
            - "CANCELLED"
        quality_thresholds:
          validity: 100.0

      - name: "total_amount"
        type: "decimal(10,2)"
        nullable: false
        description: "Total monetary amount for the order (USD)."
        constraints:
          value_range:
            min: 0.0
            max: 100000.0

custom_monitors:
  - name: "referential_integrity_check"
    description: >
      Verify that every line item references a valid order.
    data_source:
      type: "Snowflake"
      database: "prod_analytics_db"
    sql_statement: >
      SELECT count(*) AS bad_references
      FROM order_line_items
      LEFT JOIN orders ON order_line_items.order_id = orders.id
      WHERE orders.id IS NULL;
    fields:
      - name: "bad_references"
        type: "integer"
        description: "Count of orphaned line items; should always be zero."
        constraints:
          value_range:
            max: 0.0

Need help?

For any questions, feedback, or issues with data contracts, reach out to us at support@foundational.io.
