Network Architecture

This document details the network topology, virtual network design, peering configuration, and traffic flows.

Network Overview

The platform uses a hub-and-spoke topology with the Control Plane as the hub and tenant Data Planes as spokes:

               ┌────────────────────────────────────┐
               │        Internet / Slack API        │
               └─────────────────┬──────────────────┘
                                 │
                                 ▼
               ┌────────────────────────────────────┐
               │      CloudFront / API Gateway      │
               │  • WAF Protection                  │
               │  • TLS 1.3 Termination             │
               │  • Rate Limiting                   │
               └─────────────────┬──────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────┐
│  CONTROL PLANE virtual network (10.10.0.0/16)                    │
│  Account: Management (123456789012)                              │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Private Subnet 1 (10.10.0.0/20) - us-east-1a               │  │
│  │ • Serverless functions (in virtual network)                │  │
│  │ • DP Dispatcher (initiates connections to DPs)             │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Private Subnet 2 (10.10.16.0/20) - us-east-1b              │  │
│  │ • Serverless functions (in virtual network) - HA           │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Private Subnet 3 (10.10.32.0/20) - us-east-1c              │  │
│  │ • Serverless functions (in virtual network) - HA           │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Public Subnet 1 (10.10.48.0/20) - us-east-1a               │  │
│  │ • NAT Instance (t3.nano - 88% cost savings)                │  │
│  │ • Application Load Balancer (Admin Panel)                  │  │
│  │ • Bastion Host (emergency access)                          │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ Virtual network endpoints                                  │  │
│  │ • NoSQL database (Gateway)                                 │  │
│  │ • Object storage (Gateway)                                 │  │
│  │ • ECR API/DKR (Interface)                                  │  │
│  │ • monitoring service Logs (Interface)                      │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────┬────────────────────────┬────────────────────────┬─────┘
           │ Peering                │ Peering                │ Peering
           │ (DP → CP, CP → DP)     │                        │
           ▼                        ▼                        ▼
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ Tenant DP virtual    │ │ Tenant DP virtual    │ │ Tenant DP virtual    │
│ network              │ │ network              │ │ network              │
│ 10.100.0.0/16        │ │ 10.101.0.0/16        │ │ 10.102.0.0/16        │
│ Account: DP-001      │ │ Account: DP-002      │ │ Account: DP-003      │
├──────────────────────┤ ├──────────────────────┤ ├──────────────────────┤
│ • workflow-engine    │ │ • workflow-engine    │ │ • workflow-engine    │
│   Executor           │ │   Executor           │ │   Executor           │
│ • Go Microservices   │ │ • Go Microservices   │ │ • Go Microservices   │
│ • NAT Instance       │ │ • NAT Instance       │ │ • NAT Instance       │
│ • Private subnets    │ │ • Private subnets    │ │ • Private subnets    │
└──────────────────────┘ └──────────────────────┘ └──────────────────────┘

Control Plane Virtual Network Design

Virtual Network Configuration

[Infrastructure code removed for vendor neutrality]

Configuration includes:
- Network isolation and segmentation
- Route tables and gateways
- Security groups and firewall rules
- Service endpoints for private connectivity

Subnet Design

| Subnet | CIDR | AZ | Purpose | Public |
|---|---|---|---|---|
| Private-1 | 10.10.0.0/20 | us-east-1a | Serverless functions, DP Dispatcher | No |
| Private-2 | 10.10.16.0/20 | us-east-1b | Serverless functions (HA), RDS | No |
| Private-3 | 10.10.32.0/20 | us-east-1c | Serverless functions (HA) | No |
| Public-1 | 10.10.48.0/20 | us-east-1a | NAT Instance, load balancer, Bastion | Yes |
| Public-2 | 10.10.64.0/20 | us-east-1b | NAT Instance (HA), load balancer | Yes |
| Public-3 | 10.10.80.0/20 | us-east-1c | NAT Instance (HA) | Yes |
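The subnet plan above can be sanity-checked mechanically. The following is a minimal sketch, assuming only the CIDRs listed in the table; the function names `validate_subnet_plan` and `unallocated_addresses` are hypothetical helpers, not part of the platform's tooling.

```python
import ipaddress

# Values copied from the Control Plane subnet table.
VPC_CIDR = ipaddress.ip_network("10.10.0.0/16")
SUBNETS = {
    "Private-1": "10.10.0.0/20",
    "Private-2": "10.10.16.0/20",
    "Private-3": "10.10.32.0/20",
    "Public-1": "10.10.48.0/20",
    "Public-2": "10.10.64.0/20",
    "Public-3": "10.10.80.0/20",
}


def validate_subnet_plan(vpc, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    # Every subnet must sit inside the virtual network range.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside {vpc}")
    # No two subnets may share addresses.
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def unallocated_addresses(vpc, subnets):
    """Count addresses in the virtual network not assigned to any subnet."""
    used = sum(ipaddress.ip_network(c).num_addresses for c in subnets.values())
    return vpc.num_addresses - used
```

Each /20 holds 4,096 addresses, so the six subnets use 24,576 of the 65,536 addresses in the /16 and leave room for additional tiers.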

Route Tables

Public Subnet Route Table

[Infrastructure code removed for vendor neutrality]

Private Subnet Route Table

[Infrastructure code removed for vendor neutrality]

Virtual Network Endpoints

Gateway Endpoints (no additional cost):

# Object storage Gateway Endpoint
resource "cloud_vpc_endpoint" "s3" {
  vpc_id            = cloud_vpc.control_plane.id
  service_name      = "cloud-provider.region.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [cloud_route_table.private.id]
}

# NoSQL database Gateway Endpoint
resource "cloud_vpc_endpoint" "dynamodb" {
  vpc_id            = cloud_vpc.control_plane.id
  service_name      = "cloud-provider.region.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [cloud_route_table.private.id]
}

Interface Endpoints (charged per hour + data):

# Secrets Manager Interface Endpoint
resource "cloud_vpc_endpoint" "secrets_manager" {
  vpc_id              = cloud_vpc.control_plane.id
  service_name        = "cloud-provider.region.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [cloud_subnet.private_1.id, cloud_subnet.private_2.id]
  security_group_ids  = [cloud_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

# Monitoring service Logs Interface Endpoint
resource "cloud_vpc_endpoint" "logs" {
  vpc_id              = cloud_vpc.control_plane.id
  service_name        = "cloud-provider.region.logs"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [cloud_subnet.private_1.id, cloud_subnet.private_2.id]
  security_group_ids  = [cloud_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}

Data Plane Virtual Network Design

Virtual Network Configuration (Per Tenant)

# Tenant number determines virtual network CIDR to avoid conflicts
# Example: tenant-1 → 10.100.0.0/16, tenant-2 → 10.101.0.0/16

locals {
  # Trailing digits of the tenant ID, e.g. "tenant-2" → 2
  tenant_number = tonumber(regex("[0-9]+$", var.tenant_id))
  # Second octet starts at 100 for tenant 1
  vpc_cidr      = "10.${100 + local.tenant_number - 1}.0.0/16"
}

resource "cloud_vpc" "tenant" {
  cidr_block           = local.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name     = "dp-${var.tenant_id}"
    TenantID = var.tenant_id
  }
}
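The tenant-to-CIDR mapping in the `locals` block can be mirrored outside Terraform, for example in provisioning scripts or tests. This is a sketch, assuming the same "trailing digits" convention for tenant IDs; the function name `tenant_vpc_cidr` and the explicit range check are additions for illustration and do not appear in the configuration above.

```python
import re

BASE_SECOND_OCTET = 100  # tenant-1 maps to 10.100.0.0/16, as in the locals block


def tenant_vpc_cidr(tenant_id: str) -> str:
    """Derive the tenant /16 from the trailing digits of the tenant ID."""
    match = re.search(r"[0-9]+$", tenant_id)
    if match is None:
        raise ValueError(f"tenant id {tenant_id!r} has no trailing number")
    tenant_number = int(match.group())
    second_octet = BASE_SECOND_OCTET + tenant_number - 1
    # An IPv4 octet cannot exceed 255, so this scheme covers tenants 1..156.
    if tenant_number < 1 or second_octet > 255:
        raise ValueError(f"tenant number {tenant_number} does not fit in 10.x.0.0/16")
    return f"10.{second_octet}.0.0/16"
```

Under this formula the second octet runs out at tenant 156 (10.255.0.0/16), so a larger fleet would need a different allocation scheme.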

Subnet Design (Per Tenant)

| Subnet | CIDR | AZ | Purpose | Public |
|---|---|---|---|---|
| Public-1 | 10.(100+N-1).0.0/20 | us-east-1a | NAT Instance | Yes |
| Public-2 | 10.(100+N-1).16.0/20 | us-east-1b | NAT Instance (HA) | Yes |
| Public-3 | 10.(100+N-1).32.0/20 | us-east-1c | NAT Instance (HA) | Yes |
| Private-1 | 10.(100+N-1).128.0/20 | us-east-1a | workflow-engine, Go services, Metering | No |
| Private-2 | 10.(100+N-1).144.0/20 | us-east-1b | workflow-engine (HA), Go services | No |
| Private-3 | 10.(100+N-1).160.0/20 | us-east-1c | Reserved for GPU instances | No |
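To make the `10.(100+N-1)` notation concrete, the sketch below expands the table for a given tenant number and reports which subnet an address belongs to. It assumes only the third-octet offsets shown in the table; `tenant_subnets` and `subnet_for` are hypothetical helper names.

```python
import ipaddress

# Third-octet offsets taken from the per-tenant subnet table.
SUBNET_OFFSETS = {
    "Public-1": 0,
    "Public-2": 16,
    "Public-3": 32,
    "Private-1": 128,
    "Private-2": 144,
    "Private-3": 160,
}


def tenant_subnets(tenant_number: int) -> dict:
    """Expand the table for tenant N into concrete /20 networks."""
    second_octet = 100 + tenant_number - 1
    return {
        name: ipaddress.ip_network(f"10.{second_octet}.{offset}.0/20")
        for name, offset in SUBNET_OFFSETS.items()
    }


def subnet_for(address: str, subnets: dict):
    """Return the name of the subnet containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in subnets.items():
        if ip in network:
            return name
    return None
```

For tenant 43 (10.142.0.0/16), `subnet_for("10.142.128.10", tenant_subnets(43))` gives Private-1, while an address such as 10.142.1.10 falls in the Public-1 range. A lookup like this can help confirm that a given instance address sits in the intended tier.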

Route Tables (Data Plane)

[Infrastructure code removed for vendor neutrality]

Virtual Network Peering

Control Plane ↔ Data Plane Peering

Bidirectional peering with carefully controlled routes:

# Initiated from Data Plane account
resource "cloud_vpc_peering_connection" "to_cp" {
  vpc_id        = cloud_vpc.tenant.id
  peer_vpc_id   = var.control_plane_vpc_id
  peer_owner_id = var.control_plane_account_id
  peer_region   = var.control_plane_region
  auto_accept   = false # CP must accept

  tags = {
    Name     = "dp-${var.tenant_id}-to-cp"
    TenantID = var.tenant_id
  }
}

# Accepted in Control Plane account
resource "cloud_vpc_peering_connection_accepter" "from_dp" {
  provider                  = aws.control_plane
  vpc_peering_connection_id = cloud_vpc_peering_connection.to_cp.id
  auto_accept               = true

  tags = {
    Name = "cp-from-dp-${var.tenant_id}"
  }
}
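Two properties of this kind of peering matter for the hub-and-spoke design: peered networks generally cannot have overlapping CIDRs, and peering is not transitive, so two tenant Data Planes that each peer with the Control Plane still cannot reach each other directly. The sketch below models both rules. The network names and the `peering_conflicts` and `can_reach` helpers are illustrative assumptions, not platform code.

```python
import ipaddress
from itertools import combinations


def peering_conflicts(cidrs: dict) -> list:
    """Return pairs of networks whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def can_reach(src: str, dst: str, peerings: set) -> bool:
    """Peering is point-to-point: only directly peered networks can talk."""
    return frozenset((src, dst)) in peerings


# Hypothetical fleet: one hub (CP) and two spokes.
NETWORKS = {"CP": "10.10.0.0/16", "DP-001": "10.100.0.0/16", "DP-002": "10.101.0.0/16"}
PEERINGS = {frozenset(("CP", "DP-001")), frozenset(("CP", "DP-002"))}
```

With these values, `can_reach("CP", "DP-001", PEERINGS)` is true in both directions, while `can_reach("DP-001", "DP-002", PEERINGS)` is false, which is the tenant isolation the topology relies on.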

Traffic Flows

CP → DP (Execution Dispatch)

DP Dispatcher serverless function (10.10.1.X)
↓ Virtual network peering
workflow-engine instance (10.142.1.10:5678)
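Whether a packet takes the peering connection, stays local, or leaves through the NAT Instance is decided by longest-prefix matching against the subnet's route table. The sketch below shows that selection rule. The route table contents and target names (`peering-tenant-43`, `nat-instance`) are hypothetical, since the actual routes are not reproduced in this document.

```python
import ipaddress

# Hypothetical Control Plane private route table: CIDR -> target.
CP_PRIVATE_ROUTES = {
    "10.10.0.0/16": "local",
    "10.142.0.0/16": "peering-tenant-43",
    "0.0.0.0/0": "nat-instance",
}


def select_route(destination: str, routes: dict):
    """Pick the target of the most specific route that contains the address."""
    ip = ipaddress.ip_address(destination)
    candidates = []
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if ip in network:
            candidates.append((network.prefixlen, target))
    if not candidates:
        return None
    # The longest prefix (largest prefixlen) wins.
    return max(candidates)[1]
```

With this table, the dispatch to 10.142.1.10 resolves to the peering connection, traffic to 10.10.1.50 stays local, and anything else falls through to the default route.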

Control Plane Route:

[Infrastructure code removed for vendor neutrality]

DP → CP (Outcome Reporting)

workflow-engine instance (10.142.1.10)
↓ Virtual network peering
API Gateway (10.10.1.50:443) → Metering Collector serverless function

Data Plane Route:

[Infrastructure code removed for vendor neutrality]

Security Groups

Control Plane Security Groups

Serverless Functions Security Group

[Infrastructure code removed for vendor neutrality]

Load Balancer Security Group (Admin Panel)

[Infrastructure code removed for vendor neutrality]

Data Plane Security Groups

workflow-engine Instance Security Group

[Infrastructure code removed for vendor neutrality]

Network ACLs

Control Plane NACLs

[Infrastructure code removed for vendor neutrality]

Traffic Flow Examples

Example 1: Slack Command → Workflow Execution

┌──────────┐
│ Slack    │ 1. POST /slack/commands
│ Workspace│─────────────────┐
└──────────┘                 │
                             ▼
                    ┌────────────────────────┐
                    │ CloudFront / WAF       │
                    │ • Rate limit check     │
                    └────────┬───────────────┘
                             │ 2. Forward to API Gateway
                             ▼
                    ┌────────────────────────┐
                    │ API Gateway            │
                    │ • Verify Slack sig     │
                    └────────┬───────────────┘
                             │ 3. Invoke Slack Handler
                             ▼
┌────────────────────────────────────────────────────────────────┐
│  Control Plane virtual network (10.10.0.0/16)                  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Slack Handler serverless function (10.10.1.X)            │  │
│  │ 4. Check rate limits, invoke Execution Orchestrator      │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │ 5. Invoke                                          │
│           ▼                                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Execution Orchestrator serverless function (10.10.1.Y)   │  │
│  │ 6. Planner → Policy → Token Service → DP Dispatcher      │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │ 7. Async invoke                                    │
│           ▼                                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ DP Dispatcher serverless function (10.10.1.Z)            │  │
│  │ 8. POST http://10.142.1.10:5678/webhook/wf_sched         │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │                                                    │
└───────────┼────────────────────────────────────────────────────┘
            │ 9. virtual network Peering
            ▼
┌────────────────────────────────────────────────────────────────┐
│  Tenant Data Plane virtual network (10.142.0.0/16)             │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ workflow-engine Instance (10.142.1.10:5678)              │  │
│  │ 10. Validate JWT, execute workflow                       │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │ 11. POST outcome to CP API                         │
│           ▼                                                    │
└───────────┼────────────────────────────────────────────────────┘
            │ 12. virtual network Peering
            ▼
┌────────────────────────────────────────────────────────────────┐
│  Control Plane virtual network (10.10.0.0/16)                  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ API Gateway (10.10.1.50:443)                             │  │
│  │ 13. POST /outcomes/execution                             │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │ 14. Invoke                                         │
│           ▼                                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Metering Collector serverless function                   │  │
│  │ 15. Update execution status, increment quotas            │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
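The "Verify Slack sig" step in this flow can be sketched using Slack's published request-signing scheme: an HMAC-SHA256 over `v0:<timestamp>:<raw body>` keyed with the app's signing secret, compared against the `X-Slack-Signature` header, with stale timestamps rejected. The function name and the way the header values are passed in are assumptions for illustration. How the platform actually wires this into the gateway is not shown in this document.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 60 * 5  # reject requests older than five minutes


def verify_slack_signature(signing_secret, timestamp, body, signature, now=None):
    """Check an incoming Slack request.

    timestamp: value of the X-Slack-Request-Timestamp header (str)
    body:      raw request body (bytes), exactly as received
    signature: value of the X-Slack-Signature header (str, "v0=<hex>")
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    # Replay protection: refuse requests with a stale or far-future timestamp.
    if abs(now - sent_at) > MAX_SKEW_SECONDS:
        return False
    base_string = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base_string, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The raw body must be used as received. Re-serializing parsed form data before hashing would change the bytes and cause valid requests to fail.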

Example 2: Metering Heartbeat

┌────────────────────────────────────────────────────────────────┐
│  Tenant Data Plane virtual network (10.142.0.0/16)             │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Metering Sidecar (Go binary on workflow-engine instance) │  │
│  │ 1. Collect GPU/CPU metrics every 60s                     │  │
│  │ 2. Sign with Ed25519 private key                         │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │ 3. POST signed heartbeat                           │
│           ▼                                                    │
└───────────┼────────────────────────────────────────────────────┘
            │ 4. virtual network Peering
            ▼
┌────────────────────────────────────────────────────────────────┐
│  Control Plane virtual network (10.10.0.0/16)                  │
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ API Gateway (10.10.1.50:443)                             │  │
│  │ 5. POST /metering/heartbeat                              │  │
│  └────────┬─────────────────────────────────────────────────┘  │
│           │ 6. Invoke                                          │
│           ▼                                                    │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │ Metering Collector serverless function                   │  │
│  │ 7. Verify Ed25519 signature                              │  │
│  │ 8. Check monotonic counter (replay protection)           │  │
│  │ 9. Store in NoSQL database (metering table, 30d TTL)     │  │
│  │ 10. Update tenant heartbeat status                       │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                │
└────────────────────────────────────────────────────────────────┘
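Steps 7 to 10 of the collector can be sketched as follows. Ed25519 verification needs a cryptography library, so this sketch takes the verifier as an injected callable and concentrates on the monotonic-counter replay check and the 30-day expiry. The class name, field names, and in-memory storage are illustrative assumptions rather than the collector's actual implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

TTL_SECONDS = 30 * 24 * 60 * 60  # "30d TTL" from the diagram


@dataclass
class HeartbeatCollector:
    # verify_signature(tenant_id, payload, signature) -> bool
    # In practice this would wrap an Ed25519 verify call; here it is injected.
    verify_signature: Callable[[str, bytes, bytes], bool]
    last_counter: Dict[str, int] = field(default_factory=dict)
    stored: List[dict] = field(default_factory=list)

    def accept(self, tenant_id, counter, payload, signature, now=None) -> bool:
        # Step 7: reject anything not signed by the tenant's key.
        if not self.verify_signature(tenant_id, payload, signature):
            return False
        # Step 8: the counter must strictly increase, so a captured
        # heartbeat cannot be replayed later.
        if counter <= self.last_counter.get(tenant_id, -1):
            return False
        now = time.time() if now is None else now
        # Step 9: store with an expiry timestamp, as a TTL attribute would.
        self.stored.append(
            {"tenant_id": tenant_id, "counter": counter, "expires_at": int(now) + TTL_SECONDS}
        )
        # Step 10: remember the latest counter as the tenant's heartbeat state.
        self.last_counter[tenant_id] = counter
        return True
```

Checking the signature before the counter means an unauthenticated sender cannot advance a tenant's counter and lock out its genuine heartbeats.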

DNS Configuration

Public DNS (Route 53)

# API subdomain
resource "cloud_route53_record" "api" {
  zone_id = var.hosted_zone_id
  name    = "api.openai-platform.com"
  type    = "A"

  alias {
    name                   = aws_apigatewayv2_domain_name.api.domain_name_configuration[0].target_domain_name
    zone_id                = aws_apigatewayv2_domain_name.api.domain_name_configuration[0].hosted_zone_id
    evaluate_target_health = true
  }
}

# Admin panel subdomain
resource "cloud_route53_record" "admin" {
  zone_id = var.hosted_zone_id
  name    = "admin.openai-platform.com"
  type    = "A"

  alias {
    name                   = aws_lb.admin.dns_name
    zone_id                = aws_lb.admin.zone_id
    evaluate_target_health = true
  }
}

Private DNS (virtual network)

# Internal API endpoint for Data Planes
resource "cloud_route53_zone" "internal" {
  name = "openai-platform.internal"

  vpc {
    vpc_id = cloud_vpc.control_plane.id
  }
}

resource "cloud_route53_record" "api_internal" {
  # References the private zone declared above
  zone_id = cloud_route53_zone.internal.zone_id
  name    = "api.openai-platform.internal"
  type    = "A"
  ttl     = 300
  records = [aws_apigatewayv2_vpc_link.main.network_interface_ids[0].private_ip_address]
}

Bandwidth & Data Transfer

Expected Traffic Patterns

| Traffic Type | Volume/Month | Direction | Cost |
|---|---|---|---|
| API Gateway Requests | 10M requests | Internet → CP | Included (1M free) |
| Execution Dispatches | 100k dispatches | CP → DP | ~$0.01 (virtual network peering) |
| Outcome Reports | 100k reports | DP → CP | <$0.01 (virtual network peering) |
| Metering Heartbeats | 4.3M/month (1/min) | DP → CP | ~$0.09 (virtual network peering) |
| OpenAI API Calls | Variable | DP → Internet | ~$0.09/GB out |
| monitoring service Logs | 100 GB | All → monitoring service | ~$0.50/GB ingested |

Virtual Network Peering Data Transfer Costs

Within same region: $0.01/GB in each direction
Cross-region: $0.02/GB in each direction

Example monthly cost (100k executions, same region; the figures below count the $0.01/GB charge on both sides of the connection, i.e. $0.02/GB in total):

  • Execution dispatches: 100k × 5 KB = 0.5 GB → $0.01
  • Outcome reports: 100k × 2 KB = 0.2 GB → <$0.01
  • Heartbeats: 4.3M × 1 KB = 4.3 GB → $0.09

Total virtual network peering cost: ~$0.10/month
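The arithmetic behind these figures can be reproduced with a few lines. The sketch assumes decimal units (1 GB = 1,000,000 KB) and that the $0.01/GB charge applies on both sides of the peering connection, which is the reading under which the per-item amounts above add up. The helper names are hypothetical.

```python
RATE_PER_GB_EACH_SIDE = 0.01  # same-region rate quoted above
KB_PER_GB = 1_000_000         # decimal units, matching 100k x 5 KB = 0.5 GB


def peering_cost(messages: int, kb_per_message: float,
                 rate_each_side: float = RATE_PER_GB_EACH_SIDE) -> float:
    """Monthly cost in dollars, counting the charge on both sides."""
    gigabytes = messages * kb_per_message / KB_PER_GB
    return gigabytes * rate_each_side * 2


def heartbeats_per_month(tenants: int, per_minute: int = 1, days: int = 30) -> int:
    """Number of heartbeats a fleet sends in a month."""
    return tenants * per_minute * 60 * 24 * days


dispatches = peering_cost(100_000, 5)    # 0.5 GB
outcomes = peering_cost(100_000, 2)      # 0.2 GB
heartbeats = peering_cost(4_300_000, 1)  # 4.3 GB
total = dispatches + outcomes + heartbeats
```

One tenant sending a heartbeat every minute produces 43,200 heartbeats in a 30-day month, so the 4.3M figure corresponds to roughly 100 tenants at that rate.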


Network Monitoring

Virtual Network Flow Logs

[Infrastructure code removed for vendor neutrality]

Monitoring Service Metrics

# NAT Gateway bandwidth alarm
# Note: this metric is published only for a managed NAT Gateway; the NAT
# Instances used elsewhere in this design need instance-level network metrics.
resource "cloud_cloudwatch_metric_alarm" "nat_bandwidth" {
  alarm_name          = "nat-gateway-bandwidth-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "BytesOutToDestination"
  namespace           = "cloud provider/NATGateway"
  period              = 300
  statistic           = "Sum"
  threshold           = 1000000000 # 1 GB in 5 min
  alarm_description   = "NAT Gateway bandwidth exceeded 1 GB/5min"

  dimensions = {
    NatGatewayId = aws_nat_gateway.main.id
  }
}

# Virtual network peering connection drops
resource "cloud_cloudwatch_metric_alarm" "peering_drops" {
  alarm_name          = "vpc-peering-packet-drops"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "PacketsDropped"
  namespace           = "cloud provider/virtual network"
  period              = 300
  statistic           = "Sum"
  threshold           = 1000

  dimensions = {
    VpcPeeringConnectionId = cloud_vpc_peering_connection.to_dp.id
  }
}
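The interaction between `period`, `evaluation_periods`, and `threshold` is easy to misread, so here is a simplified model of how such an alarm evaluates. It assumes one summed datapoint per 300-second period and that every one of the most recent `evaluation_periods` datapoints must breach the threshold. Real monitoring services also handle missing data and "M out of N" settings, which this sketch ignores. The function name is hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a GreaterThanThreshold alarm over per-period sums.

    datapoints: list of per-period sums, oldest first.
    """
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # The alarm fires only if every recent period exceeded the threshold.
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


ONE_GB = 1_000_000_000  # threshold used by the bandwidth alarm above
```

With `evaluation_periods = 2` and `period = 300`, a single 5-minute burst above 1 GB does not trigger the bandwidth alarm; the traffic has to stay above the threshold for two consecutive periods, roughly 10 minutes.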

Disaster Recovery

Virtual Network Backup Strategy

  1. Infrastructure-as-code state backup: object storage with versioning and replication to the DR region
  2. Route table exports: daily snapshot via cloud provider Config
  3. Security group exports: automated backup to object storage (JSON format)

Multi-Region Failover

Primary: us-east-1
DR: us-west-2

# Route 53 Health Check
resource "cloud_route53_health_check" "api" {
  fqdn              = "api.openai-platform.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3  # consecutive failures before marking unhealthy
  request_interval  = 30 # seconds between checks

  tags = {
    Name = "api-health-check"
  }
}

# Failover routing policy
resource "cloud_route53_record" "api_failover_primary" {
  zone_id = var.hosted_zone_id
  name    = "api.openai-platform.com"
  type    = "A"

  set_identifier  = "primary"
  # References the health check declared above
  health_check_id = cloud_route53_health_check.api.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_apigatewayv2_domain_name.api_us_east.domain_name_configuration[0].target_domain_name
    zone_id                = aws_apigatewayv2_domain_name.api_us_east.domain_name_configuration[0].hosted_zone_id
    evaluate_target_health = true
  }
}

resource "cloud_route53_record" "api_failover_secondary" {
  zone_id = var.hosted_zone_id
  name    = "api.openai-platform.com"
  type    = "A"

  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_apigatewayv2_domain_name.api_us_west.domain_name_configuration[0].target_domain_name
    zone_id                = aws_apigatewayv2_domain_name.api_us_west.domain_name_configuration[0].hosted_zone_id
    evaluate_target_health = true
  }
}
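The failover behaviour implied by `failure_threshold` and `request_interval` can be modelled as a small state machine. This is a simplified sketch: it treats the health check as a single prober, whereas a managed DNS service typically aggregates many checkers, and it ignores DNS caching on the client side. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    failure_threshold: int = 3   # from the health check above
    request_interval: int = 30   # seconds between probes
    consecutive_failures: int = 0

    def observe(self, success: bool) -> None:
        """Record one probe result; any success resets the failure streak."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold

    def detection_seconds(self) -> int:
        """Approximate time to declare an outage once probes start failing."""
        return self.failure_threshold * self.request_interval


def active_record(primary: HealthCheck) -> str:
    """Failover routing: serve PRIMARY while healthy, else SECONDARY."""
    return "primary" if primary.healthy else "secondary"
```

With the values above, an outage in us-east-1 is detected after about 90 seconds of failed probes, after which lookups resolve to the us-west-2 record.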