
How to Set Up the Foundational On-Premise Agent in Your AWS Account

Introduction

The Foundational on-premise agent runs inside your AWS account and connects to your source code, data warehouses, and BI tools without exposing them to the public internet. This model is appropriate for organizations with strict network boundaries or data-residency requirements.

How to use this article

Networking teams:
See Part 1 > Network connectivity patterns

Security teams:
See Part 1 > Security


Part 1: Overview and key principles

Data residency

Code scanning and extraction happen inside the on-prem agent that you deploy in your own AWS account. Foundational's cloud receives lineage snapshots from the agent and stitches them into the full end-to-end lineage graph.

Split architecture

The deployment has two independently-secured halves:

  • On-prem agent: Runs in your AWS account. Owns access to your data sources, source control servers (e.g., GitHub), BI tools, and the credentials used to reach them.

  • Foundational's cloud: Receives lineage snapshots from the on-prem agent, stitches them into the full end-to-end lineage graph, persists the result, and serves the UI and API.

The boundary uses authenticated HTTPS. The agent pushes lineage snapshots outbound to the cloud. Your source control server sends signed webhooks for code-change events to the cloud's WAF-fronted endpoint.
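To illustrate what verifying a signed webhook involves, here is a minimal Python sketch. It assumes a GitHub-style scheme, where the sender computes an HMAC-SHA256 over the raw request body and sends it as `sha256=<hex digest>`. The function names and the secret are hypothetical; this article does not describe how Foundational's handler is actually implemented.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Build the header value a sender would attach to the request."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Return True only if the header matches the HMAC-SHA256 of the raw body."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected.encode(), header[len(prefix):].encode())


secret = b"example-shared-secret"  # hypothetical value, never hard-code real secrets
body = b'{"ref": "refs/heads/main"}'
print(verify_webhook_signature(secret, body, sign(secret, body)))      # True
print(verify_webhook_signature(secret, b"tampered", sign(secret, body)))  # False
```

The check must run against the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.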

Customer-controlled security

Although Foundational operates the deployment, all on-prem infrastructure runs in your AWS account. You own the IAM policies, VPC, network rules, KMS keys, and audit trail. Foundational scopes its access via a cross-account IAM role with an external ID and resource/path/tag restrictions (see Security below).

Built on standard AWS services

The agent uses VPC, EKS, S3, IAM, KMS, NAT Gateway, and optionally PrivateLink, Transit Gateway, or Site-to-Site VPN. It fits into existing AWS Organization structures, audit pipelines, and cost-allocation tagging without any special handling.

Foundational Architecture

The on-prem agent runs in a Customer VPC deployed via Terraform into your AWS account. Foundational's cloud runs in a Foundational VPC in a separate, Foundational-managed AWS account. The two VPCs communicate over authenticated HTTPS. All inbound traffic to the Foundational VPC is fronted by AWS WAF.

What runs in each VPC

Customer VPC: your AWS account

Runs in an EKS cluster you control. Every component that touches source code, table data, or data-source credentials lives here.

  • Source Control Agent: Pulls source code from your source control servers (e.g. GitHub, GitLab, Bitbucket, Azure DevOps), whether SaaS or self-hosted.

  • Code Changes Handler: Receives change events from the Foundational cloud webhook handler and dispatches scans to the right Code Scanners.

  • Code Scanners: Extract lineage directly from code (SQL, Spark, Python incl. SQLAlchemy/pandas, Java, dbt, Ruby ActiveRecord, COBOL, and others).

  • Metadata Extractors: Connect to your data warehouses and BI tools (Snowflake, BigQuery, Power BI, Tableau, Looker, Salesforce, etc.) to pull schema and metadata for stitching.

  • Per-System Lineage Snapshot: An S3 bucket under the foundational-onprem-* prefix holding per-scan lineage output before it ships to the Foundational cloud.

Foundational VPC: Foundational-managed AWS account

Receives lineage snapshots from the on-prem agent.

  • AWS WAF + Webhooks Handler: Public ingestion endpoint for source-control webhooks. WAF enforces rate limits and known-bad-pattern blocking; the handler verifies the webhook signature and routes the event to the right tenant.

  • Scan Scheduler: Schedules incremental and full scans and signals the Customer VPC's Code Changes Handler when work is queued.

  • Lineage Graph Builder: Stitches per-system snapshots received from the Customer VPC into a full end-to-end lineage graph.

  • Lineage Storage: S3 bucket holding the lineage graph, plus a metadata store indexing it for fast retrieval.

  • Foundational UI / Webapp Backend / Foundational API: The web application end users interact with. Reads from Lineage Storage; does not reach into the Customer VPC.

  • (Optional) MCP server: Exposes lineage to MCP clients if enabled.

What gets deployed in your AWS account

The Terraform deployment creates the following AWS resources in your account:

  • EKS cluster with managed node groups.

  • VPC with private and public subnets, NAT Gateway, and security groups (default CIDR 10.0.0.0/16, configurable).

  • S3 buckets under the foundational-onprem-* prefix.

  • KMS keys for at-rest encryption.

  • IAM roles and policies under the /foundational-onprem/* path.

  • AWS Secrets Manager entries for per-deployment credentials.

  • Datadog agent (in-cluster).


Security

Access permissions

Foundational's cross-account role in the deployment account grants:

  • Kubernetes (EKS): Create, manage, and deploy EKS clusters.

  • Networking (VPC/EC2): Create and manage VPCs and supporting networking.

  • IAM roles and policies: Create and manage service roles and policies.

  • Storage and encryption: Create and manage S3 buckets under the foundational-onprem-* prefix, and KMS keys for at-rest encryption.

  • Container Registry (ECR): Read-only pull of Docker images from Foundational's ECR.

  • Logging and monitoring: Create and manage CloudWatch log groups.

Security safeguards

The following restrictions apply to all permissions above:

  • Resource prefix: Resource names must begin with foundational-onprem-.

  • Tag conditions: Resources must carry ManagedBy=Foundational and ResourceGroup=foundational-onprem.

  • IAM path: All IAM roles and policies Foundational creates live under /foundational-onprem/*.

  • External ID: Required for cross-account role assumption, preventing confused-deputy attacks.

  • Encryption at rest: Foundational encrypts EKS secrets, S3 buckets, and EBS volumes with customer-account KMS keys.

  • Network isolation: Workloads run in private subnets; outbound traffic routes through a NAT Gateway.

Foundational cannot read or modify any pre-existing resource that does not match these constraints. The role can only create and manage new infrastructure dedicated to this deployment.
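As a mental model of these safeguards, the following Python sketch checks whether a resource would fall inside the role's scope. It is a simplification: the real enforcement is done by IAM policy conditions, not by code like this, and combining every check in one function is an assumption made for illustration. The function name is hypothetical; the prefix, tags, and path come from the list above.

```python
REQUIRED_PREFIX = "foundational-onprem-"
REQUIRED_TAGS = {"ManagedBy": "Foundational", "ResourceGroup": "foundational-onprem"}
REQUIRED_IAM_PATH = "/foundational-onprem/"


def in_foundational_scope(name, tags, iam_path=None):
    """Return True if a resource satisfies the naming, tag, and path rules.

    iam_path is only checked for IAM roles and policies; pass None otherwise.
    """
    if not name.startswith(REQUIRED_PREFIX):
        return False
    if any(tags.get(key) != value for key, value in REQUIRED_TAGS.items()):
        return False
    if iam_path is not None and not iam_path.startswith(REQUIRED_IAM_PATH):
        return False
    return True


tags = {"ManagedBy": "Foundational", "ResourceGroup": "foundational-onprem"}
print(in_foundational_scope("foundational-onprem-acme-snapshots", tags))  # True
print(in_foundational_scope("prod-data-lake", tags))                      # False
```

A pre-existing bucket such as the hypothetical `prod-data-lake` fails the prefix check, which is the property the paragraph above describes.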


Network Connectivity

The Code Scanners and Metadata Extractors in the EKS cluster must reach your source control servers, data warehouses, and BI tools (Snowflake, GitHub Enterprise, Tableau, Postgres, and so on). Choose one connectivity pattern per service; different services in the same deployment can use different patterns. For example, Snowflake might use PrivateLink, Tableau Cloud might use a NAT EIP allowlist, and an internal Postgres might use VPC Peering.

Choose a pattern

| Service type | Example | Recommended pattern |
| --- | --- | --- |
| SaaS, no AWS network presence | Tableau Cloud, GitHub Enterprise Cloud, dbt Cloud, Looker (hosted) | Networking Option #1: Public + NAT EIP allowlist |
| SaaS, PrivateLink-capable on AWS | Snowflake (AWS), Databricks, MongoDB Atlas, Confluent Cloud | Networking Option #2: AWS PrivateLink (consumer side) |
| Customer-hosted in AWS, single VPC, non-overlapping CIDR | Self-hosted Postgres, GitHub Enterprise Server on EC2 | Networking Option #3: VPC Peering |
| Customer-hosted in AWS, multiple VPCs / accounts / regions | Internal services across a hub-and-spoke network | Networking Option #4: AWS Transit Gateway |
| Customer-hosted in your own datacenter | Internal Postgres or Tableau Server on bare metal | Networking Option #5: Site-to-Site VPN |

CIDR planning

The deployed VPC defaults to 10.0.0.0/16 (vpc_cidr in client-infrastructure/variables.tf). VPC Peering and Transit Gateway require non-overlapping CIDRs with every network the deployed VPC must reach.

Confirm the range with your networking team before applying Terraform, and override vpc_cidr if there is any chance of collision. PrivateLink and public-internet egress have no CIDR constraint.
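A quick way to check for collisions before applying Terraform is Python's standard `ipaddress` module. This is a minimal sketch; the function name and the example ranges are hypothetical, and the list of networks to compare against has to come from your networking team.

```python
import ipaddress


def find_overlaps(deployed_cidr, other_cidrs):
    """Return the CIDRs in other_cidrs that overlap the deployed VPC CIDR."""
    deployed = ipaddress.ip_network(deployed_cidr)
    return [
        cidr for cidr in other_cidrs
        if deployed.overlaps(ipaddress.ip_network(cidr))
    ]


# Default vpc_cidr compared against three example networks.
existing = ["10.0.128.0/20", "10.1.0.0/16", "172.16.0.0/12"]
print(find_overlaps("10.0.0.0/16", existing))  # ['10.0.128.0/20']
print(find_overlaps("10.50.0.0/16", existing))  # []
```

An empty result for the candidate range means it is safe to use as `vpc_cidr` with respect to the networks listed.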

Firewall rules

The allowlist source on your data services depends on how each service is reachable:

  • For services reachable over the public internet (SaaS: Tableau Cloud, GitHub Enterprise Cloud, dbt Cloud, etc.), allowlist the deployed VPC's NAT EIP and use the service's public hostname (Networking Option #1).

  • For services reachable privately (in an AWS VPC you own, via PrivateLink, or on-prem), allowlist the deployed VPC CIDR on the service (Networking Options #2–#5).

| Service | Protocol / Port | Source |
| --- | --- | --- |
| Snowflake (interface VPC endpoint) | TCP / 443 | Deployed VPC CIDR |
| GitHub Enterprise Server | TCP / 22, 80, 443 | Deployed VPC CIDR |
| Tableau Server | TCP / 80, 443 | Deployed VPC CIDR |
| PostgreSQL | TCP / 5432 | Deployed VPC CIDR |
| MySQL / MariaDB | TCP / 3306 | Deployed VPC CIDR |
| Looker (private) | TCP / 443, 19999 | Deployed VPC CIDR |

Egress on the deployed VPC is already open (0.0.0.0/0 on the nodes' security group, in client-infrastructure/security-groups.tf); no outbound changes are needed.
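Once the firewall rules are in place, a simple TCP probe run from inside the deployed VPC can confirm that a service port is reachable. The sketch below uses only the Python standard library; the function name is hypothetical, and it only proves that a TCP handshake succeeds, not that authentication or TLS will work.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("postgres.internal.example.com", 5432)` returning False after the allowlist change usually points at a security group, route table, or DNS problem rather than at the database itself.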

Private DNS resolution

Pods must resolve your internal hostnames (postgres.internal.example.com, github.example.com, and so on). The standard pattern is a cross-account Route 53 Private Hosted Zone association: you own a PHZ for your internal domain and associate it with the deployed VPC.

# Your account (PHZ owner): authorize the association
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z1ABCDEFGHIJK \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-deployed-id

# Foundational deployer role (VPC owner): perform the association
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z1ABCDEFGHIJK \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-deployed-id

# Your account: delete the one-shot authorization
aws route53 delete-vpc-association-authorization \
  --hosted-zone-id Z1ABCDEFGHIJK \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-deployed-id

PrivateLink-based services that don't use AWS's built-in private DNS (notably Snowflake) need an additional Private Hosted Zone created in the deployed VPC — see Networking Option #2.
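After the hosted zone is associated, it is worth confirming from a pod that an internal hostname resolves, and that it resolves to private addresses rather than public ones. The following is a minimal sketch using the Python standard library; both function names are hypothetical.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the sorted set of IP addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """Return True if there is at least one address and all are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(address).is_private for address in addresses
    )


print(all_private(["10.0.12.7"]))             # True
print(all_private(["10.0.12.7", "8.8.8.8"]))  # False
```

Inside the cluster, `all_private(resolve("postgres.internal.example.com"))` should return True once DNS is set up correctly. If `resolve` raises `socket.gaierror`, the hosted zone is not visible from the deployed VPC.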

IAM cross-account access

If the AWS resources you want to scan live in a different account from the one running this deployment (for example, your data lake's S3 buckets and Glue catalog are owned by a separate analytics team while the agent runs in a dedicated deployment sub-account), the Code Scanners or Metadata Extractors need a role in that account to call the relevant AWS APIs. Use IAM role chaining via EKS Pod Identity (preferred for new deployments) or IRSA:

  1. Create a role in the target account — for example Client-Resource-Access-Role — with read-only permissions on the required resources.

  2. Add a trust policy allowing the extractor pod role to assume it:

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:aws:iam::<deployed-account-id>:role/foundational-onprem-<client>-extractor"
        },
        "Action": ["sts:AssumeRole", "sts:TagSession"]
      }]
    }

  3. Pods call sts:AssumeRole at runtime to obtain temporary credentials in the target account.

This is independent of the deployer role created by client-bootstrap, which is used only at terraform apply time.
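If you manage several target accounts, generating the trust policy from its two variable parts avoids copy-paste mistakes. This sketch rebuilds the policy shown in step 2; the function name is hypothetical, and the role-name pattern is taken from the example ARN above, so confirm the actual extractor role ARN with Foundational before using it.

```python
import json


def extractor_trust_policy(deployed_account_id, client):
    """Build the trust policy that lets the extractor pod role assume a target role."""
    if not (deployed_account_id.isdigit() and len(deployed_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    role_arn = (
        f"arn:aws:iam::{deployed_account_id}:role/"
        f"foundational-onprem-{client}-extractor"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


# "acme" and the account ID are placeholder values.
print(json.dumps(extractor_trust_policy("123456789012", "acme"), indent=2))
```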


Connectivity patterns

Networking Option #1: Public internet + NAT EIP allowlisting

Use this for SaaS services that don't offer private connectivity, or when the service only supports source-IP allowlisting.

The EKS private subnets egress through a NAT Gateway with a single Elastic IP (single_nat_gateway = true in client-infrastructure/vpc.tf). You allowlist that EIP on each SaaS service; traffic is TLS over the public internet.

Steps:

  1. After applying client-infrastructure, fetch the NAT EIP:

    aws ec2 describe-nat-gateways \
      --filter "Name=tag:Name,Values=foundational-onprem-${CLIENT}-*" \
      --query 'NatGateways[].NatGatewayAddresses[].PublicIp' --output text

  2. Add the EIP to each service's IP allowlist (Snowflake network policy, Tableau Cloud trusted IPs, GitHub Enterprise Cloud IP allow list, dbt Cloud IP restrictions, and so on).

For multi-AZ resilience, set single_nat_gateway = false and provide one EIP per AZ. Allowlist every resulting EIP on each service, since traffic can leave through any of them. NAT costs will be higher.
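When there is more than one NAT Gateway, it can be easier to post-process the full JSON output of `aws ec2 describe-nat-gateways` than to rely on the `--query` expression. The sketch below walks the same `NatGateways[].NatGatewayAddresses[].PublicIp` path in Python; the function name and the sample addresses (taken from documentation-only IP ranges) are hypothetical.

```python
def nat_public_ips(describe_output):
    """Collect unique public IPs from describe-nat-gateways JSON output."""
    ips = []
    for gateway in describe_output.get("NatGateways", []):
        for address in gateway.get("NatGatewayAddresses", []):
            ip = address.get("PublicIp")
            if ip and ip not in ips:
                ips.append(ip)
    return ips


# Trimmed sample with one NAT Gateway per AZ.
sample = {
    "NatGateways": [
        {"NatGatewayAddresses": [{"PublicIp": "203.0.113.10"}]},
        {"NatGatewayAddresses": [{"PublicIp": "198.51.100.7"}]},
    ]
}
print(nat_public_ips(sample))  # ['203.0.113.10', '198.51.100.7']
```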

Networking Option #2: AWS PrivateLink (consumer side)

Use this for SaaS providers that publish a VPC Endpoint Service in the same AWS region (Snowflake on AWS is the common one).

What the provider supplies:

  • The VPC Endpoint Service name, e.g. com.amazonaws.vpce.eu-west-1.vpce-svc-0abcd1234.

  • For Snowflake: the PrivateLink account URL (<account>.<region>.privatelink.snowflakecomputing.com) and the OCSP URL, retrieved via SELECT SYSTEM$GET_PRIVATELINK_CONFIG();.

What is configured in the deployed VPC:

  • An Interface VPC Endpoint in the private subnets, targeting the provider's service name.

  • A security group on the endpoint allowing ingress on TCP 443 from the EKS node security group.

  • A Route 53 Private Hosted Zone associated with the deployed VPC, so pods resolve the PrivateLink DNS name to the interface endpoint. Snowflake does not use AWS's built-in private DNS for PrivateLink, so a manual PHZ is required.

Example Terraform fragment for Snowflake (add to client-infrastructure):

resource "aws_vpc_endpoint" "snowflake" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.vpce.eu-west-1.vpce-svc-0abcd1234" # supplied by the provider
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.snowflake_privatelink.id]
  # Snowflake does not use AWS-managed private DNS; resolution comes from the manual PHZ.
  private_dns_enabled = false
}

Snowflake PrivateLink DNS specifics: create a PHZ in the deployed VPC for privatelink.snowflakecomputing.com and add CNAME records for the hostnames returned by SYSTEM$GET_PRIVATELINK_CONFIG, pointing at the interface VPC endpoint's regional DNS name.

Without this zone, pods cannot resolve the Snowflake PrivateLink URLs.
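The record set for that zone can be derived mechanically from the SYSTEM$GET_PRIVATELINK_CONFIG output. The Python sketch below picks every value ending in the PrivateLink domain and pairs it with the endpoint's DNS name. The function name is hypothetical, and the key names and hostnames in the sample are illustrative placeholders, so check them against the output from your own Snowflake account.

```python
PRIVATELINK_SUFFIX = ".privatelink.snowflakecomputing.com"


def privatelink_cnames(config, endpoint_dns_name):
    """Return (name, type, target) records for each PrivateLink hostname."""
    records = {
        (value, "CNAME", endpoint_dns_name)
        for value in config.values()
        if isinstance(value, str) and value.endswith(PRIVATELINK_SUFFIX)
    }
    return sorted(records)


# Illustrative sample; real output contains account-specific values.
config = {
    "privatelink-account-url": "acme.eu-west-1.privatelink.snowflakecomputing.com",
    "privatelink_ocsp-url": "ocsp.acme.eu-west-1.privatelink.snowflakecomputing.com",
    "privatelink-vpce-id": "com.amazonaws.vpce.eu-west-1.vpce-svc-0abcd1234",
}
endpoint = "vpce-0123456789abcdef0-abcd1234.vpce-svc-0abcd1234.eu-west-1.vpce.amazonaws.com"
for record in privatelink_cnames(config, endpoint):
    print(record)
```

Matching on the domain suffix rather than on specific keys keeps the sketch independent of the exact field names, and it skips the endpoint service name, which is not a hostname to resolve.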

Networking Option #3: VPC Peering

Use this when your services live in a single AWS VPC, you own the account, CIDRs don't overlap, and there is no centralized network hub.

Steps (executed jointly):

  1. From the deployed VPC, via the deployer role, create a peering request targeting your service VPC ID and account ID.

  2. Accept the peering request in your account.

  3. Add routes in both directions:

    • Deployed VPC private and public route tables → service CIDR via the pcx-… peering connection.

    • Your VPC route tables → deployed VPC CIDR via the same pcx-….

  4. If you use Route 53 private hosted zones, enable "DNS resolution from accepter VPC to requester VPC" on the peering connection.

Peering is not transitive; each VPC pair needs its own connection. Use inter-region peering when the data lives in a different region from the deployed VPC.

Networking Option #4: AWS Transit Gateway

Use this when you already run a hub-and-spoke topology with a centralized TGW, or you need to reach services across multiple VPCs, accounts, or regions through a single attachment.

Steps:

  1. Share the TGW with the Foundational sub-account via AWS Resource Access Manager (aws ram create-resource-share).

  2. Foundational accepts the RAM share invitation using the deployer role.

  3. Attach the deployed VPC to the shared TGW (aws_ec2_transit_gateway_vpc_attachment).

  4. Update route tables on both sides:

    • Deployed VPC route tables → service CIDR via the TGW attachment.

    • TGW route tables → deployed VPC CIDR via the new attachment.

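TGW charges an hourly fee per attachment plus a per-GB data-processing fee. In exchange, it avoids an N² peering mesh once more than two or three VPCs are involved, and separate TGW route tables can keep spokes with overlapping CIDRs isolated from one another. The deployed VPC's own CIDR must still not overlap any network it needs to reach (see CIDR planning).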
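The scaling argument can be made concrete with a little arithmetic: a full mesh of n VPCs needs n(n-1)/2 peering connections, while a Transit Gateway needs one attachment per VPC. The sketch below compares the two counts; the function names are hypothetical, and it counts connections only, not cost.

```python
def peering_connections(vpc_count):
    """Connections needed for a full mesh: one per unordered pair of VPCs."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count):
    """Attachments needed with a Transit Gateway: one per VPC."""
    return vpc_count


for n in (2, 3, 4, 10):
    print(n, peering_connections(n), tgw_attachments(n))
# 2 1 2
# 3 3 3
# 4 6 4
# 10 45 10
```

The counts cross over at three VPCs, which matches the rule of thumb above. A full mesh is the worst case: if the deployed VPC only needs to reach each service VPC directly, peering needs one connection per service VPC.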

Networking Option #5: Site-to-Site VPN

Use this when data sources live in your on-prem datacenter (self-hosted Tableau Server, internal Postgres on bare metal, GitHub Enterprise Server on internal hardware).

You provide:

  • Customer VPN device public IP and BGP ASN (or static routes).

  • On-prem CIDRs to be reachable.

In the deployed VPC: a Customer Gateway, a Virtual Private Gateway (attached to the VPC) or TGW VPN attachment, and a Site-to-Site VPN connection with two IPsec tunnels. Routes are propagated from the VGW or TGW into the deployed VPC's route tables.

For high-throughput or low-jitter requirements, Direct Connect is the dedicated-link alternative. Provisioning takes weeks and is rarely justified for a single client, so VPN over the public internet is the default.


Part 2: Onboarding checklist

Share with your networking team

| Item | Applies to |
| --- | --- |
| Foundational sub-account ID and target region | All options |
| Deployed VPC CIDR | All options |
| NAT Gateway EIP(s), once available | Option #1 only |
| Deployed VPC ID and interface-endpoint security group ID | Option #2 only |
| Extractor pod role ARN | Cross-account access only |

Collect for Foundational

| Item | Applies to |
| --- | --- |
| Per data source: hostname, port, auth method, current IP allowlist policy | All options |
| Endpoint Service name | Option #2 |
| VPC ID, account ID, CIDR, route-table IDs of every service VPC | Options #3 and #4 |
| TGW ID and RAM share ARN | Option #4 |
| VPN device public IP, BGP ASN, on-prem CIDRs | Option #5 |
| Private hosted zone IDs to associate | All options |
| Target role ARN for cross-account AWS API access | Cross-account access only |


Part 3: Deployment

Deployment uses two Terraform modules, both shipped together in the ZIP linked under Deployment steps below. They are applied in sequence, from different accounts.

  • client-bootstrap — applied by you, the customer, in the AWS account that will host the agent. It creates a single cross-account IAM role (deployer_role_arn) that Foundational will later assume. Nothing else is provisioned at this step; the role is the handoff.

  • client-infrastructure — applied by Foundational, assuming the role created above. It provisions the rest of the agent: VPC, EKS cluster, S3 buckets, KMS keys, the Datadog agent, and the supporting IAM roles and policies described in Security.

Prerequisites

  1. An AWS account where you can create IAM roles and assign permissions.

  2. AWS CLI configured with credentials for that account.

  3. Terraform >= 1.0.

  4. From Foundational (request via the Support Team):

    • foundational_account_id — Foundational's AWS account ID.

    • external_id — a unique identifier used for secure role assumption (prevents confused-deputy attacks).

Recommended deployment topology

Deploy the agent into a dedicated AWS sub-account within your AWS Organization rather than directly into your main account. Benefits:

  • Isolation and security: Keeps the Foundational deployment out of your production resources.

  • Access control: Restricts Foundational's access to a single account and its explicitly granted cross-account paths.

  • Cost tracking: Separates Foundational's AWS spend from the rest of your bill.

  • Compliance and auditing: Clear resource boundaries simplify reporting.

  • Network segmentation: Dedicated network configuration controls exactly what the agent can reach.

Deployment steps

  1. Download the Terraform module for the on-premise agent (the file is attached to this article).

  2. Apply the client-bootstrap module following its README.md. This creates the cross-account IAM role.

  3. Send the resulting deployer_role_arn (the README shows how to retrieve it) to the Foundational support team. Foundational will then apply the client-infrastructure module against your account using the assumed role and connect to your on-premise deployment.
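Before sending the role ARN, a quick format check can catch copy-paste errors such as a truncated account ID. This is a minimal sketch of such a check in Python; the function name and the example ARN are hypothetical, and the pattern only validates the general shape of an IAM role ARN, not that the role exists.

```python
import re

ROLE_ARN_PATTERN = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/([A-Za-z0-9_+=,.@/-]+)"
)


def parse_role_arn(arn):
    """Split an IAM role ARN into its parts, or raise ValueError if malformed."""
    match = ROLE_ARN_PATTERN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(f"not a valid IAM role ARN: {arn!r}")
    partition, account_id, role = match.groups()
    return {"partition": partition, "account_id": account_id, "role": role}


print(parse_role_arn("arn:aws:iam::123456789012:role/example-deployer"))
# {'partition': 'aws', 'account_id': '123456789012', 'role': 'example-deployer'}
```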


Part 4: FAQ

What traffic crosses the boundary between the Customer VPC and the Foundational VPC?

| Direction | Purpose | Notes |
| --- | --- | --- |
| Source control servers → Foundational VPC | Webhook notifications on code changes | WAF-fronted, signature-verified |
| Foundational VPC → Customer VPC | Scan orchestration (which repo, which commit) | Outbound from Foundational; no public-internet inbound to your VPC |
| Customer VPC → source control servers, warehouses, BI tools | Pull source code, schemas, metadata | Network paths you control; see Network Connectivity |
| Customer VPC → Foundational VPC | Upload lineage snapshots | Outbound HTTPS, authenticated |
| End users → Foundational VPC | Browse lineage via UI and API | Standard web traffic to Foundational |
