Introduction
The Foundational on-premise agent runs inside your AWS account and connects to your source code, data warehouses, and BI tools without exposing them to the public internet. This model is appropriate for organizations with strict network boundaries or data-residency requirements.
How to use this article
Part 1: Overview and key principles covers the architecture, security model, and network connectivity options. Read this before deployment.
Part 2: Onboarding checklist lists everything you need to confirm before you start. Complete this before proceeding to Part 3.
Part 3: Deployment contains the step-by-step deployment instructions.
Part 4: FAQ summarizes the traffic that crosses the boundary between the Customer VPC and the Foundational VPC.
Networking teams:
See Part 1 > Network connectivity patterns
Security teams:
See Part 1 > Security
Part 1: Overview and key principles
Data residency
Code scanning and extraction happen inside the on-prem agent that you deploy in your own AWS account. Foundational's cloud receives lineage snapshots from the agent and stitches them into the full end-to-end lineage graph.
Split architecture
The deployment has two independently-secured halves:
On-prem agent: Runs in your AWS account. Owns access to your data sources, source control servers (e.g., GitHub), BI tools, and the credentials used to reach them.
Foundational's cloud: Receives lineage snapshots from the on-prem agent, stitches them into the full end-to-end lineage graph, persists the result, and serves the UI and API.
The boundary uses authenticated HTTPS. The agent pushes lineage snapshots outbound to the cloud. Your source control server sends signed webhooks for code-change events to the cloud's WAF-fronted endpoint.
Customer-controlled security
Although Foundational operates the deployment, all on-prem infrastructure runs in your AWS account. You own the IAM policies, VPC, network rules, KMS keys, and audit trail. Foundational scopes its access via a cross-account IAM role with an external ID and resource/path/tag restrictions (see Security below).
Built on standard AWS services
The agent uses VPC, EKS, S3, IAM, KMS, NAT Gateway, and optionally PrivateLink, Transit Gateway, or Site-to-Site VPN. It fits into existing AWS Organization structures, audit pipelines, and cost-allocation tagging without any special handling.
Foundational Architecture
The on-prem agent runs in a Customer VPC deployed via Terraform into your AWS account. Foundational's cloud runs in a Foundational VPC in a separate, Foundational-managed AWS account. The two VPCs communicate over authenticated HTTPS. All inbound traffic to the Foundational VPC is fronted by AWS WAF.
What runs in each VPC
Customer VPC: your AWS account
Runs in an EKS cluster you control. Every component that touches source code, table data, or data-source credentials lives here.
Source Control Agent: Pulls source code from your source control servers (e.g., GitHub, GitLab, Bitbucket, Azure DevOps), whether SaaS or self-hosted.
Code Changes Handler: Receives change events from the Foundational cloud webhook handler and dispatches scans to the right Code Scanners.
Code Scanners: Extract lineage directly from code (SQL, Spark, Python including SQLAlchemy and pandas, Java, dbt, Ruby ActiveRecord, COBOL, and others).
Metadata Extractors: Connect to your data warehouses and BI tools (Snowflake, BigQuery, Power BI, Tableau, Looker, Salesforce, etc.) to pull schema and metadata for stitching.
Per-System Lineage Snapshot: An S3 bucket under the foundational-onprem-* prefix holding per-scan lineage output before it ships to the Foundational cloud.
Foundational VPC: Foundational-managed AWS account
Receives lineage snapshots from the on-prem agent.
AWS WAF + Webhooks Handler: Public ingestion endpoint for source-control webhooks. WAF enforces rate limits and known-bad-pattern blocking; the handler verifies the webhook signature and routes the event to the right tenant.
Scan Scheduler: Schedules incremental and full scans and signals the Customer VPC's Code Changes Handler when work is queued.
Lineage Graph Builder: Stitches per-system snapshots received from the Customer VPC into a full end-to-end lineage graph.
Lineage Storage: S3 bucket holding the lineage graph, plus a metadata store indexing it for fast retrieval.
Foundational UI / Webapp Backend / Foundational API: The web application end users interact with. Reads from Lineage Storage; does not reach into the Customer VPC.
(Optional) MCP server: Exposes lineage to MCP clients if enabled.
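To make "verifies the webhook signature" concrete, here is a minimal Python sketch of HMAC-based verification. It assumes a GitHub-style scheme, in which the sender computes an HMAC-SHA256 over the raw request body with a shared secret and sends it as sha256=<hex digest>. Other source control servers use different schemes, and the function names and secret below are hypothetical; this is not Foundational's actual handler.

```python
import hashlib
import hmac


def expected_signature(secret: bytes, body: bytes) -> str:
    """Compute a GitHub-style signature value for a raw webhook body."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def is_valid_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True only if the header matches the HMAC of the body."""
    expected = expected_signature(secret, body).encode("utf-8")
    received = signature_header.encode("utf-8")
    # compare_digest is designed to resist timing analysis.
    return hmac.compare_digest(expected, received)


# Hypothetical values for illustration only.
secret = b"example-shared-secret"
body = b'{"ref": "refs/heads/main"}'
header = expected_signature(secret, body)

print(is_valid_webhook(secret, body, header))         # True
print(is_valid_webhook(secret, body + b" ", header))  # False: body was altered
```

Verification must run over the exact bytes received, before any JSON parsing, because re-serializing the payload can change the digest.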
What gets deployed in your AWS account
Terraform creates the following AWS resources in your account:
EKS cluster with managed node groups.
VPC with private and public subnets, NAT Gateway, and security groups (default CIDR 10.0.0.0/16, configurable).
S3 buckets under the foundational-onprem-* prefix.
KMS keys for at-rest encryption.
IAM roles and policies under the /foundational-onprem/* path.
AWS Secrets Manager entries for per-deployment credentials.
Datadog agent (in-cluster).
Security
Access permissions
Foundational's cross-account role in the deployment account grants:
Kubernetes (EKS): Create, manage, and deploy EKS clusters.
Networking (VPC/EC2): Create and manage VPCs and supporting networking.
IAM roles and policies: Create and manage service roles and policies.
Storage and encryption: Create and manage S3 buckets under the foundational-onprem-* prefix, and KMS keys for at-rest encryption.
Container Registry (ECR): Read-only pull of Docker images from Foundational's ECR.
Logging and monitoring: Create and manage CloudWatch log groups.
Security safeguards
The following restrictions apply to all permissions above:
Resource prefix: Resource names must begin with foundational-onprem-.
Tag conditions: Resources must carry ManagedBy=Foundational and ResourceGroup=foundational-onprem.
IAM path: All IAM roles and policies Foundational creates live under /foundational-onprem/*.
External ID: Required for cross-account role assumption, preventing confused-deputy attacks.
Encryption at rest: Foundational encrypts EKS secrets, S3 buckets, and EBS volumes with customer-account KMS keys.
Network isolation: Workloads run in private subnets; outbound traffic routes through a NAT Gateway.
Foundational cannot read or modify any pre-existing resource that does not match these constraints. The role can only create and manage new infrastructure dedicated to this deployment.
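As an illustration of how the naming, tag, and path restrictions combine, the following Python sketch models them as a single check. It is a simplified model, not the IAM policy itself: actual enforcement happens through conditions on the cross-account role, and the function name and example resources are hypothetical.

```python
NAME_PREFIX = "foundational-onprem-"
REQUIRED_TAGS = {
    "ManagedBy": "Foundational",
    "ResourceGroup": "foundational-onprem",
}
IAM_PATH_PREFIX = "/foundational-onprem/"


def in_scope(name, tags, iam_path=None):
    """Return True if a resource matches the prefix, tag, and IAM path rules.

    iam_path is supplied only for IAM roles and policies.
    """
    if not name.startswith(NAME_PREFIX):
        return False
    # Every required tag must be present with the exact expected value.
    if any(tags.get(key) != value for key, value in REQUIRED_TAGS.items()):
        return False
    if iam_path is not None and not iam_path.startswith(IAM_PATH_PREFIX):
        return False
    return True


# Hypothetical resources for illustration only.
print(in_scope("foundational-onprem-acme-lineage", dict(REQUIRED_TAGS)))  # True
print(in_scope("prod-data-lake", {"ManagedBy": "Terraform"}))             # False
print(in_scope("foundational-onprem-acme-extractor",
               dict(REQUIRED_TAGS), iam_path="/"))                        # False
```

A resource that fails any one of the checks is out of scope, which is why a pre-existing bucket or role in the same account stays untouched.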
Network Connectivity
The Code Scanners and Metadata Extractors in the EKS cluster must reach your source control servers, data warehouses, and BI tools (Snowflake, GitHub Enterprise, Tableau, Postgres, and so on). Choose one connectivity pattern per service; different services in the same deployment can use different patterns. For example, Snowflake might use PrivateLink, Tableau Cloud might use a NAT EIP allowlist, and an internal Postgres might use VPC Peering.
Choose a pattern
Service type | Example | Recommended pattern |
SaaS, no AWS network presence | Tableau Cloud, GitHub Enterprise Cloud, dbt Cloud, Looker (hosted) | Networking Option #1: Public + NAT EIP allowlist |
SaaS, PrivateLink-capable on AWS | Snowflake (AWS), Databricks, MongoDB Atlas, Confluent Cloud | Networking Option #2: AWS PrivateLink (consumer side) |
Customer-hosted in AWS, single VPC, non-overlapping CIDR | Self-hosted Postgres, GitHub Enterprise Server on EC2 | Networking Option #3: VPC Peering |
Customer-hosted in AWS, multiple VPCs / accounts / regions | Internal services across a hub-and-spoke network | Networking Option #4: AWS Transit Gateway |
Customer-hosted in your own datacenter | Internal Postgres or Tableau Server on bare metal | Networking Option #5: Site-to-Site VPN |
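The table above can be read as a small decision function. The Python sketch below encodes it under stated assumptions: the function name, the hosting labels ("saas", "aws", "datacenter"), and the two flags are hypothetical names introduced for illustration, and the result is only the recommended starting point for each service.

```python
def recommend_option(hosting, privatelink_capable=False, multi_vpc=False):
    """Map a service's hosting model to a Networking Option number (1-5).

    hosting: "saas", "aws" (customer-hosted in AWS), or "datacenter".
    privatelink_capable: the SaaS provider publishes a VPC Endpoint Service.
    multi_vpc: the service spans multiple VPCs, accounts, or regions.
    """
    if hosting == "saas":
        return 2 if privatelink_capable else 1
    if hosting == "aws":
        return 4 if multi_vpc else 3
    if hosting == "datacenter":
        return 5
    raise ValueError(f"unknown hosting model: {hosting!r}")


# Different services in one deployment can land on different options.
services = {
    "Tableau Cloud": recommend_option("saas"),
    "Snowflake (AWS)": recommend_option("saas", privatelink_capable=True),
    "Self-hosted Postgres": recommend_option("aws"),
    "Hub-and-spoke services": recommend_option("aws", multi_vpc=True),
    "Tableau Server on bare metal": recommend_option("datacenter"),
}
for name, option in services.items():
    print(f"{name}: Networking Option #{option}")
```

Option #3 additionally requires that the service VPC's CIDR does not overlap the deployed VPC's CIDR, which this sketch does not check.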
CIDR planning
The deployed VPC defaults to 10.0.0.0/16 (vpc_cidr in client-infrastructure/variables.tf). VPC Peering and Transit Gateway require non-overlapping CIDRs with every network the deployed VPC must reach.
Confirm the range with your networking team before applying Terraform, and override vpc_cidr if there is any chance of collision. PrivateLink and public-internet egress have no CIDR constraint.
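A quick way to check for collisions before applying Terraform is Python's standard ipaddress module. The sketch below is a minimal example; the function name and the list of peer CIDRs are hypothetical and should be replaced with the ranges your networking team provides.

```python
import ipaddress


def find_overlaps(deployed_cidr, peer_cidrs):
    """Return the peer CIDRs that overlap the deployed VPC CIDR."""
    deployed = ipaddress.ip_network(deployed_cidr)
    return [
        cidr for cidr in peer_cidrs
        if deployed.overlaps(ipaddress.ip_network(cidr))
    ]


# Hypothetical networks the deployed VPC must reach.
peers = ["10.0.128.0/20", "172.16.0.0/12", "192.168.10.0/24"]

print(find_overlaps("10.0.0.0/16", peers))   # ['10.0.128.0/20']
print(find_overlaps("10.50.0.0/16", peers))  # []
```

If the first call returns anything, override vpc_cidr with a range that produces an empty list, as the second call does.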
Firewall rules
The allowlist source on your data services depends on how each service is reachable:
For services reachable over the public internet (SaaS: Tableau Cloud, GitHub Enterprise Cloud, dbt Cloud, etc.), allowlist the deployed VPC's NAT EIP and use the service's public hostname (Networking Option #1).
For services reachable privately (in an AWS VPC you own, via PrivateLink, or on-prem), allowlist the deployed VPC CIDR on the service (Networking Options #2–#5).
Service | Protocol / Port | Source |
Snowflake (interface VPC endpoint) | TCP / 443 | Deployed VPC CIDR |
GitHub Enterprise Server | TCP / 22, 80, 443 | Deployed VPC CIDR |
Tableau Server | TCP / 80, 443 | Deployed VPC CIDR |
PostgreSQL | TCP / 5432 | Deployed VPC CIDR |
MySQL / MariaDB | TCP / 3306 | Deployed VPC CIDR |
Looker (private) | TCP / 443, 19999 | Deployed VPC CIDR |
Egress on the deployed VPC is already open (0.0.0.0/0 on the nodes' security group, in client-infrastructure/security-groups.tf); no outbound changes are needed.
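The allowlist rule above reduces to one branch: NAT EIPs for Option #1, the deployed VPC CIDR for Options #2 through #5. The Python sketch below expresses that rule; the function name and the example addresses (taken from documentation-reserved ranges) are hypothetical.

```python
import ipaddress


def allowlist_sources(option, vpc_cidr, nat_eips):
    """Return the source ranges to allowlist on a service for a given option."""
    if option == 1:
        # Public path: the service sees traffic from the NAT Gateway EIP(s).
        return [f"{ipaddress.ip_address(ip)}/32" for ip in nat_eips]
    if option in (2, 3, 4, 5):
        # Private path: the service sees addresses from the deployed VPC.
        return [str(ipaddress.ip_network(vpc_cidr))]
    raise ValueError(f"unknown networking option: {option}")


# Hypothetical values for illustration only.
vpc_cidr = "10.0.0.0/16"
nat_eips = ["203.0.113.10"]

print(allowlist_sources(1, vpc_cidr, nat_eips))  # ['203.0.113.10/32']
print(allowlist_sources(3, vpc_cidr, nat_eips))  # ['10.0.0.0/16']
```

With single_nat_gateway = false there is one EIP per AZ, so the Option #1 list grows to one /32 entry per AZ.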
Private DNS resolution
Pods must resolve your internal hostnames (postgres.internal.example.com, github.example.com, and so on). The standard pattern is a cross-account Route 53 Private Hosted Zone association: you own a PHZ for your internal domain and associate it with the deployed VPC.
# Your account (PHZ owner): authorize the association
aws route53 create-vpc-association-authorization \
  --hosted-zone-id Z1ABCDEFGHIJK \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-deployed-id

# Foundational deployer role (VPC owner): perform the association
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z1ABCDEFGHIJK \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-deployed-id

# Your account: delete the one-shot authorization
aws route53 delete-vpc-association-authorization \
  --hosted-zone-id Z1ABCDEFGHIJK \
  --vpc VPCRegion=eu-west-1,VPCId=vpc-deployed-id
PrivateLink-based services that don't use AWS's built-in private DNS (notably Snowflake) need an additional Private Hosted Zone created in the deployed VPC — see Networking Option #2.
IAM cross-account access
If the AWS resources you want to scan live in a different account from the one running this deployment (for example, your data lake's S3 buckets and Glue catalog are owned by a separate analytics team while the agent runs in a dedicated deployment sub-account), the Code Scanners or Metadata Extractors need a role in that account to call the relevant AWS APIs. Use IAM role chaining via EKS Pod Identity (preferred for new deployments) or IRSA:
Create a role in the target account (for example, Client-Resource-Access-Role) with read-only permissions on the required resources.
Add a trust policy allowing the extractor pod role to assume it:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::<deployed-account-id>:role/foundational-onprem-<client>-extractor"
    },
    "Action": ["sts:AssumeRole", "sts:TagSession"]
  }]
}
Pods call sts:AssumeRole at runtime to obtain temporary credentials in the target account.
This is independent of the deployer role created by client-bootstrap, which is used only at terraform apply time.
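If you manage several deployments, the trust policy can be generated rather than edited by hand. The Python sketch below fills in the two placeholders from the JSON template above; the function name, the example account ID, and the client name are hypothetical.

```python
import json


def extractor_trust_policy(deployed_account_id, client):
    """Build the trust policy that lets the extractor pod role assume a target role."""
    if not (deployed_account_id.isdigit() and len(deployed_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    principal = (
        f"arn:aws:iam::{deployed_account_id}"
        f":role/foundational-onprem-{client}-extractor"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": principal},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


# Hypothetical account ID and client name for illustration only.
policy = extractor_trust_policy("111122223333", "acme")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the target role as its trust policy through your usual IAM tooling.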
Connectivity patterns
Networking Option #1: Public internet + NAT EIP allowlisting
Use this for SaaS services that don't offer private connectivity, or when the service only supports source-IP allowlisting.
The EKS private subnets egress through a NAT Gateway with a single Elastic IP (single_nat_gateway = true in client-infrastructure/vpc.tf). You allowlist that EIP on each SaaS service; traffic is TLS over the public internet.
Steps:
After applying client-infrastructure, fetch the NAT EIP:
aws ec2 describe-nat-gateways \
  --filter "Name=tag:Name,Values=foundational-onprem-${CLIENT}-*" \
  --query 'NatGateways[].NatGatewayAddresses[].PublicIp' --output text
Add the EIP to each service's IP allowlist (Snowflake Network Policy, Tableau Cloud trusted IPs, GitHub Enterprise Cloud allow list, dbt Cloud SSO, and so on).
For multi-AZ resilience, set single_nat_gateway = false and provide one EIP per AZ. NAT costs will be higher.
Networking Option #2: AWS PrivateLink (consumer side)
Use this for SaaS providers that publish a VPC Endpoint Service in the same AWS region (Snowflake on AWS is the common one).
What the provider supplies:
The VPC Endpoint Service name, e.g., com.amazonaws.vpce.eu-west-1.vpce-svc-0abcd1234.
For Snowflake: the PrivateLink account URL (<account>.<region>.privatelink.snowflakecomputing.com) and the OCSP URL, retrieved via SELECT SYSTEM$GET_PRIVATELINK_CONFIG();.
What is configured in the deployed VPC:
An Interface VPC Endpoint in the private subnets, targeting the provider's service name.
A security group on the endpoint allowing ingress on TCP 443 from the EKS node security group.
A Route 53 Private Hosted Zone associated with the deployed VPC, so pods resolve the PrivateLink DNS name to the interface endpoint. Snowflake does not use AWS's built-in private DNS for PrivateLink, so a manual PHZ is required.
Example Terraform fragment for Snowflake (add to client-infrastructure):
resource "aws_vpc_endpoint" "snowflake" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.vpce.eu-west-1.vpce-svc-0abcd1234"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  # Security group defined separately; it allows TCP 443 from the EKS node security group.
  security_group_ids  = [aws_security_group.snowflake_privatelink.id]
  # Snowflake does not use AWS's built-in private DNS; the PHZ handles resolution.
  private_dns_enabled = false
}
Snowflake PrivateLink DNS specifics: create a PHZ in the deployed VPC for privatelink.snowflakecomputing.com and add the CNAMEs returned by SYSTEM$GET_PRIVATELINK_CONFIG, pointing at the interface VPC endpoint's regional DNS name.
Without this zone, pods cannot resolve the Snowflake PrivateLink URLs.
Networking Option #3: VPC Peering
Use this when your services live in a single AWS VPC, you own the account, CIDRs don't overlap, and there is no centralized network hub.
Steps (executed jointly):
From the deployed VPC, via the deployer role, create a peering request targeting your service VPC ID and account ID.
Accept the peering request in your account.
Add routes in both directions:
Deployed VPC private and public route tables → service CIDR via the pcx-… peering connection.
Your VPC route tables → deployed VPC CIDR via the same pcx-….
If you use Route 53 private hosted zones, enable "DNS resolution from accepter VPC to requester VPC" on the peering connection.
Peering is not transitive; each VPC pair needs its own connection. Use inter-region peering when the data lives in a different region from the deployed VPC.
Networking Option #4: AWS Transit Gateway
Use this when you already run a hub-and-spoke topology with a centralized TGW, or you need to reach services across multiple VPCs, accounts, or regions through a single attachment.
Steps:
Share the TGW with the Foundational sub-account via AWS Resource Access Manager (aws ram create-resource-share).
Foundational accepts the RAM share invitation in the deployer role.
Attach the deployed VPC to the shared TGW (aws_ec2_transit_gateway_vpc_attachment).
Update route tables on both sides:
Deployed VPC route tables → service CIDR via the TGW attachment.
TGW route tables → deployed VPC CIDR via the new attachment.
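TGW charges hourly and per-GB data-processing fees. In return, separate TGW route tables let you segment traffic between spokes, and a single attachment per VPC avoids an N² peering mesh once more than two or three VPCs are involved. As noted under CIDR planning, the deployed VPC's CIDR must still not overlap any network it needs to reach.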
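The N² growth is easy to quantify. Assuming every VPC must reach every other VPC (a full mesh, which is the worst case), peering needs n(n-1)/2 connections while a TGW needs one attachment per VPC. The short Python sketch below compares the two; the function names are illustrative.

```python
def peering_connections(n_vpcs):
    """Peering connections needed for a full mesh of n VPCs."""
    return n_vpcs * (n_vpcs - 1) // 2


def tgw_attachments(n_vpcs):
    """TGW attachments needed to connect n VPCs through one hub."""
    return n_vpcs


for n in (2, 3, 5, 10):
    print(f"{n} VPCs: {peering_connections(n)} peerings, "
          f"{tgw_attachments(n)} TGW attachments")
# 2 VPCs: 1 peerings, 2 TGW attachments
# 3 VPCs: 3 peerings, 3 TGW attachments
# 5 VPCs: 10 peerings, 5 TGW attachments
# 10 VPCs: 45 peerings, 10 TGW attachments
```

The crossover at three VPCs matches the rule of thumb above: beyond that point, each added VPC costs one more TGW attachment but n-1 more peering connections.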
Networking Option #5: Site-to-Site VPN
Use this when data sources live in your on-prem datacenter (self-hosted Tableau Server, internal Postgres on bare metal, GitHub Enterprise Server on internal hardware).
You provide:
Customer VPN device public IP and BGP ASN (or static routes).
On-prem CIDRs to be reachable.
In the deployed VPC: a Customer Gateway, a Virtual Private Gateway (attached to the VPC) or TGW VPN attachment, and a Site-to-Site VPN connection with two IPsec tunnels. Routes are propagated from the VGW or TGW into the deployed VPC's route tables.
For high-throughput or low-jitter requirements, Direct Connect is the dedicated-link alternative. Provisioning takes weeks and is rarely justified for a single client, so VPN over the public internet is the default.
Part 2: Onboarding checklist
Share with your networking team
Item | Applies to |
Foundational sub-account ID and target region | All options |
Deployed VPC CIDR | All options |
NAT Gateway EIP(s), once available | Option #1 only |
Deployed VPC ID and interface-endpoint security group ID | Option #2 only |
Extractor pod role ARN | Cross-account access only |
Collect for Foundational
Item | Applies to |
Per data source: hostname, port, auth method, current IP allowlist policy | All options |
Endpoint Service name | Option #2 |
VPC ID, account ID, CIDR, route-table IDs of every service VPC | Options #3 and #4 |
TGW ID and RAM share ARN | Option #4 |
VPN device public IP, BGP ASN, on-prem CIDRs | Option #5 |
Private hosted zone IDs to associate | All options |
Target role ARN for cross-account AWS API access | Cross-account access only |
Part 3: Deployment
Deployment uses two Terraform modules, both shipped together in the ZIP linked under Deployment steps below. They are applied in sequence, from different accounts.
client-bootstrap: applied by you, the customer, in the AWS account that will host the agent. It creates a single cross-account IAM role (deployer_role_arn) that Foundational will later assume. Nothing else is provisioned at this step; the role is the handoff.
client-infrastructure: applied by Foundational, assuming the role created above. It provisions the rest of the agent: VPC, EKS cluster, S3 buckets, KMS keys, the Datadog agent, and the supporting IAM roles and policies described in Security.
Prerequisites
An AWS account where you can create IAM roles and assign permissions.
AWS CLI configured with credentials for that account.
Terraform >= 1.0.
From Foundational (request via the Support Team):
foundational_account_id: Foundational's AWS account ID.
external_id: a unique identifier used for secure role assumption (prevents confused-deputy attacks).
Recommended deployment topology
Deploy the agent into a dedicated AWS sub-account within your AWS Organization rather than directly into your main account. Benefits:
Isolation and security: Keeps the Foundational deployment out of your production resources.
Access control: Restricts Foundational's access to a single account and its explicitly granted cross-account paths.
Cost tracking: Separates Foundational's AWS spend from the rest of your bill.
Compliance and auditing: Clear resource boundaries simplify reporting.
Network segmentation: Dedicated network configuration controls exactly what the agent can reach.
Deployment steps
Download the ZIP containing both Terraform modules for the on-premise agent (the file is attached to this article).
Apply the client-bootstrap module following its README.md. This creates the cross-account IAM role.
Send the resulting deployer_role_arn (the README shows how to retrieve it) to the Foundational support team. Foundational will then apply the client-infrastructure module against your account using the assumed role and connect to your on-premise deployment.
Part 4: FAQ
What traffic crosses the boundary between the Customer VPC and the Foundational VPC?
Direction | Purpose | Notes |
Source control servers → Foundational VPC | Webhook notifications on code changes | WAF-fronted, signature-verified |
Foundational VPC → Customer VPC | Scan orchestration (which repo, which commit) | Outbound from Foundational; no public-internet inbound to your VPC |
Customer VPC → source control servers, warehouses, BI tools | Pull source code, schemas, metadata | Network paths you control — see Network Connectivity |
Customer VPC → Foundational VPC | Upload lineage snapshots | Outbound HTTPS, authenticated |
End users → Foundational VPC | Browse lineage via UI and API | Standard web traffic to Foundational |


