AI Platform Architecture

Overview

This platform follows a Control Plane / Data Plane (CP/DP) architecture for multi-tenant AI agent deployment. Separating the two planes provides strong isolation, independent scaling, and precise metering for each tenant.

Core Security Guarantees

| Guarantee | Description |
| --- | --- |
| Never runs client code | CP only orchestrates, plans, and enforces policies |
| Never sees secrets | Client API keys, OAuth tokens remain in DP only |
| Stateless policy enforcement | All decisions based on signed tokens and policies |
| Multi-tenant isolation | Per-tenant data stores, network isolation, and policy configs |

Trust Boundaries

| Boundary | Guarantee |
| --- | --- |
| Control to Data | Signed execution plans only (cryptographically verified) |
| Data to Control | Outcome metadata only (no PII, no prompts, no responses) |
| Secrets | Never cross boundary (remain in DP always) |
| Metering | Control Plane only (usage accounting, quotas, approvals) |
| Execution | Data Plane only (client code, LLM calls, external APIs) |
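
The "signed execution plans only" boundary can be illustrated with a minimal sketch. The platform signs with ES256, which needs a third-party crypto library, so this example substitutes HMAC-SHA256 from the Python standard library purely as a stand-in. The key, the plan fields, and the function names are all hypothetical.

```python
import hashlib
import hmac
import json

# Assumption: a shared demo key. The platform described here uses ES256
# (asymmetric) signatures; HMAC-SHA256 is only a stdlib stand-in.
DEMO_KEY = b"demo-signing-key"


def canonical_bytes(plan: dict) -> bytes:
    """Serialize the plan deterministically so signer and verifier agree."""
    return json.dumps(plan, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_plan(plan: dict, key: bytes = DEMO_KEY) -> str:
    """Control Plane side: produce a hex signature over the plan."""
    return hmac.new(key, canonical_bytes(plan), hashlib.sha256).hexdigest()


def verify_plan(plan: dict, signature: str, key: bytes = DEMO_KEY) -> bool:
    """Data Plane side: accept the plan only if the signature matches."""
    expected = sign_plan(plan, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical plan contents, for illustration only.
plan = {"workflow_id": "lead_intake_agent", "tenant_id": "tenant-1", "steps": ["fetch", "score"]}
signature = sign_plan(plan)
```

With an asymmetric scheme such as ES256, the Data Plane would hold only the public key, so it could verify plans without being able to forge them. The HMAC stand-in does not have that property.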

Terminology

| Term | Definition |
| --- | --- |
| Workflow | A workflow runtime instance running in the Data Plane. The actual execution unit. |
| Agent | A user-facing concept representing an AI automation capability. Implemented as workflow runtime instances. |
| Tenant | A customer organization with dedicated infrastructure and isolated resources. |
| Execution | A single invocation of a workflow, triggered via chat command or API call. |
| Outcome | Execution result metadata (success/failure, duration, tokens used) sent from DP to CP. No PII. |

Workflow Execution Modes

| Mode | Description | User Experience | Documentation |
| --- | --- | --- | --- |
| Static | User uploads pre-built workflow JSON (BYO) | Full workflow runtime features, manual creation | Implementation details vary by deployment |
| Dynamic | AI generates workflows from natural language intent | No workflow runtime knowledge needed, limited to approved tools | Implementation details vary by deployment |

High-Level Architecture Diagram

The Control Plane performs planning, policy enforcement, and metering. The Data Plane performs execution and holds tenant secrets and data.

AI-Powered Planning

The Planner service uses an LLM to generate execution plans. Before each call it gathers historical performance data, approval patterns, and tenant configuration, and injects them into the LLM context so that predictions rest on observed data.

Context Gathering Flow

Example: Outcome Context Data

The Outcome Aggregator returns rich performance metrics:

Aggregated Context Includes:

  • Total executions, success/failure counts, success rate
  • Duration metrics: average, p50, p99
  • Token usage: average and trends
  • Error distribution by type
  • Success rate trends (improving/degrading)
  • Performance stability indicators
  • Last error timestamp and type
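
As a rough illustration of how such an aggregate could be computed from raw outcome records, here is a minimal sketch. The record fields follow the outcome payload described in this document. The function names, the nearest-rank percentile method, and the sample data are assumptions, and trend and stability indicators are left out.

```python
import math
from collections import Counter


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def aggregate_outcomes(outcomes):
    """Summarize a list of outcome records into planner context."""
    total = len(outcomes)
    if total == 0:
        return {"total": 0}
    durations = sorted(o["duration_ms"] for o in outcomes)
    failures = [o for o in outcomes if not o["success"]]
    successes = total - len(failures)
    last = max(failures, key=lambda o: o["timestamp"], default=None)
    return {
        "total": total,
        "successes": successes,
        "failures": len(failures),
        "success_rate": successes / total,
        "duration_avg_ms": sum(durations) / total,
        "duration_p50_ms": percentile(durations, 50),
        "duration_p99_ms": percentile(durations, 99),
        "tokens_avg": sum(o["tokens_used"] for o in outcomes) / total,
        "errors_by_type": dict(Counter(o["error_code"] for o in failures)),
        "last_error_type": last["error_code"] if last else None,
        "last_error_at": last["timestamp"] if last else None,
    }


# Hypothetical sample records (timestamps are plain integers here).
sample = [
    {"success": True, "duration_ms": 100, "tokens_used": 10, "error_code": None, "timestamp": 1},
    {"success": True, "duration_ms": 200, "tokens_used": 20, "error_code": None, "timestamp": 2},
    {"success": False, "duration_ms": 400, "tokens_used": 30, "error_code": "TIMEOUT", "timestamp": 3},
    {"success": True, "duration_ms": 300, "tokens_used": 40, "error_code": None, "timestamp": 4},
]
summary = aggregate_outcomes(sample)
```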

How AI Uses Context

  1. Time prediction: Uses historical duration samples to predict expected runtime and quantify uncertainty.

  2. Auto-approval decision: Combines approval history, recent outcomes, and policy configuration to decide whether human review is required.

  3. Risk assessment: Uses success rate, recent errors, and performance stability to inform the approval posture.

  4. Usage estimation: Produces token and duration estimates to support quota enforcement and operational planning.

Result: Decisions are based on observed patterns and explicit policy constraints, not static heuristics alone.
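
The Planner delegates these judgments to an LLM, so the following is not its implementation. It is only a deterministic sketch of the kind of signals involved: a runtime estimate with a spread, and an approval rule that combines outcomes, approval history, and policy. All names, thresholds, and the AUTO_APPROVE / HUMAN_REVIEW labels are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class PlanContext:
    success_rate: float          # from aggregated outcomes
    recent_errors: int           # errors in a recent window
    prior_approvals: int         # how often humans approved this workflow
    auto_approval_enabled: bool  # tenant policy switch


def predict_duration_ms(samples):
    """Return (expected runtime, uncertainty) from historical durations."""
    if not samples:
        raise ValueError("need at least one duration sample")
    spread = stdev(samples) if len(samples) > 1 else 0.0
    return mean(samples), spread


def approval_decision(ctx, min_success_rate=0.95, max_recent_errors=0, min_prior_approvals=3):
    """Policy always wins; otherwise require healthy history and prior trust."""
    if not ctx.auto_approval_enabled:
        return "HUMAN_REVIEW"
    healthy = ctx.success_rate >= min_success_rate and ctx.recent_errors <= max_recent_errors
    trusted = ctx.prior_approvals >= min_prior_approvals
    return "AUTO_APPROVE" if healthy and trusted else "HUMAN_REVIEW"
```

Checking the policy flag first keeps explicit policy constraints authoritative over anything inferred from history.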

Component Details

Control Plane Components

| Component | Technology | Purpose |
| --- | --- | --- |
| API Gateway | HTTP API | Request routing, throttling |
| Planner | Compute Service + LLM | Generate execution plans, time prediction |
| Policy Engine | Compute Service + Database | ALLOW/DENY/APPROVE decisions |
| Token Service | Authentication Service + ES256 | Issue & validate JWTs (5 min access, 4 h refresh) |
| Metering Collector | Compute Service | Receive heartbeats, detect violations |
| Prompt Library | Compute Service + Database | Versioned prompt management |
| Workflow Manager | Compute Service + Database | Manage workflow enable/disable per tenant |
| Admin Panel | Web Application + CDN | Tenant management, Storage Browser, analytics |
| Database | NoSQL Database | 8+ tables (executions, policies, prompts, etc.) |
| Shared ALB | Application Load Balancer | Host-based routing for all tenants |

Data Plane Components

| Component | Technology | Purpose |
| --- | --- | --- |
| Workflow runtime | Compute Instance (spot) | Workflow execution engine (cost-optimized compute) |
| Metering Sidecar | Go + DCGM | GPU/usage reporting, heartbeat |
| Secret Vault | Secrets Manager | Client API keys, OAuth tokens (never leaves DP) |
| vLLM | GPU Compute Instance | LLM inference (Enterprise/GPU Pro tiers) |

Security Model

Token Types

| Type | TTL | Purpose |
| --- | --- | --- |
| Access Token | 5 min | Execute specific workflow |
| Refresh Token | 4 hours | Renew access tokens |
| Metering Token | 30 min | Report usage metrics |
| Admin Token | 60 min | Tenant management |
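
The TTLs in this table could be encoded as a simple lookup. The sketch below is illustrative only: the type keys and function names are hypothetical, and a real Token Service would read the expiry from the JWT's expiration claim rather than recompute it.

```python
from datetime import datetime, timedelta, timezone

# TTLs taken from the Token Types table; the dictionary keys are hypothetical.
TOKEN_TTL = {
    "access": timedelta(minutes=5),
    "refresh": timedelta(hours=4),
    "metering": timedelta(minutes=30),
    "admin": timedelta(minutes=60),
}


def expires_at(token_type: str, issued_at: datetime) -> datetime:
    """Compute the expiry instant for a token of the given type."""
    return issued_at + TOKEN_TTL[token_type]


def is_expired(token_type: str, issued_at: datetime, now: datetime) -> bool:
    """A token is expired once the current time reaches its expiry instant."""
    return now >= expires_at(token_type, issued_at)


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
```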

Token Claims

Token Properties:

  • Algorithm: ES256 (ECDSA with SHA-256)
  • Standard claims: Issuer, subject, audience, expiration, issued-at, JTI (unique ID)
  • Custom claims: Tenant ID, token type, agent/session identifiers
  • Expiration: 5 minutes to 4 hours depending on token type (access 5 min, metering 30 min, admin 60 min, refresh 4 hours)
  • Storage: Hashed (SHA-256) before database storage
  • Revocation: Instant via database flag
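
The storage and revocation properties can be sketched as follows. An in-memory dictionary stands in for the database, and the class and method names are hypothetical. Only the SHA-256 digest is kept, so the raw token never appears in storage.

```python
import hashlib


class TokenStore:
    """In-memory stand-in for the token table (assumption: one row per digest)."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def save(self, token: str) -> str:
        """Store only the hash of the token, with a revocation flag."""
        digest = self._digest(token)
        self._records[digest] = {"revoked": False}
        return digest

    def revoke(self, token: str) -> None:
        """Flip the flag; the next validation sees it immediately."""
        record = self._records.get(self._digest(token))
        if record is not None:
            record["revoked"] = True

    def is_active(self, token: str) -> bool:
        """Unknown and revoked tokens are both rejected."""
        record = self._records.get(self._digest(token))
        return record is not None and not record["revoked"]
```

Because validation always consults the flag, revocation takes effect on the next request without waiting for the token to expire.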

Token Security

  • Tokens hashed (SHA-256) before storage
  • Instant revocation via database flag
  • 1-hour hard timeout (no extensions)
  • Never stored in plaintext

CIDR Allocation

| Component | CIDR Range |
| --- | --- |
| Control Plane | 10.10.0.0/16 |
| Tenant 1 | 10.100.0.0/16 |
| Tenant 2 | 10.101.0.0/16 |
| Tenant N | 10.(99+N).0.0/16 |

Max tenants: 55 (10.100 - 10.154)
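
The allocation rule can be checked with Python's standard ipaddress module. This sketch assumes tenants are numbered from 1 and that Tenant 1 maps to 10.100.0.0/16, which matches the table rows and the stated 10.100 to 10.154 range. The function and constant names are hypothetical.

```python
import ipaddress

CONTROL_PLANE = ipaddress.ip_network("10.10.0.0/16")
FIRST_TENANT_OCTET = 100  # Tenant 1 uses 10.100.0.0/16
MAX_TENANTS = 55          # last block is 10.154.0.0/16


def tenant_cidr(n: int) -> ipaddress.IPv4Network:
    """Return the /16 block for tenant number n (1-based)."""
    if not 1 <= n <= MAX_TENANTS:
        raise ValueError(f"tenant number must be between 1 and {MAX_TENANTS}")
    return ipaddress.ip_network(f"10.{FIRST_TENANT_OCTET + n - 1}.0.0/16")


# Sanity check: no tenant block may overlap the Control Plane range.
overlaps = [n for n in range(1, MAX_TENANTS + 1) if tenant_cidr(n).overlaps(CONTROL_PLANE)]
```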

Network Peering

  • One-way: DP to CP only (security isolation)
  • Route tables configured in DP to reach CP
  • No routes from CP to DP

Workflow Management

Workflow Registry

All workflows are stored in data-plane/workflows/ and registered in the database:

| Workflow | Description | Required Tier |
| --- | --- | --- |
| marketing_content_agent | Social media content generation | starter |
| lead_intake_agent | Lead qualification and routing | starter |
| appointment_scheduler_agent | Calendar management | professional |
| kpi_report_agent | Business metrics reporting | professional |
| rag_assistant_agent | RAG-based document Q&A | enterprise |
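
A tier gate over this registry might look like the sketch below. It assumes the three tiers shown are strictly ordered and that a higher tier includes everything available to lower tiers. The document does not state this, and it also mentions a GPU Pro tier elsewhere that is omitted here. The function name is hypothetical.

```python
# Assumption: tiers are ordered starter < professional < enterprise.
TIER_RANK = {"starter": 0, "professional": 1, "enterprise": 2}

# Required tiers copied from the Workflow Registry table.
REQUIRED_TIER = {
    "marketing_content_agent": "starter",
    "lead_intake_agent": "starter",
    "appointment_scheduler_agent": "professional",
    "kpi_report_agent": "professional",
    "rag_assistant_agent": "enterprise",
}


def can_enable(workflow: str, tenant_tier: str) -> bool:
    """True if the tenant's tier meets or exceeds the workflow's required tier."""
    required = REQUIRED_TIER[workflow]
    return TIER_RANK[tenant_tier] >= TIER_RANK[required]
```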

Workflow Lifecycle

Metering & Usage

Heartbeat Protocol

  1. The sidecar sends a heartbeat every 60 seconds.
  2. Each heartbeat contains GPU %, memory, CPU, and active workflows.
  3. The Control Plane stores each heartbeat in the database.
  4. A violation is raised after 5 minutes of missed heartbeats.
  5. Auto-suspend is triggered via EventBridge.
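
The timing rule in the protocol above can be sketched as a pure function over timestamps. The 60-second interval and 5-minute threshold come from the steps above. The function names are hypothetical, and the sketch does not model storage or the EventBridge trigger.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=60)
VIOLATION_AFTER = timedelta(minutes=5)


def missed_heartbeats(last_heartbeat: datetime, now: datetime) -> int:
    """Number of whole heartbeat intervals elapsed since the last heartbeat."""
    return max(0, (now - last_heartbeat) // HEARTBEAT_INTERVAL)


def is_violation(last_heartbeat: datetime, now: datetime) -> bool:
    """Flag a violation once 5 minutes pass without a heartbeat."""
    return now - last_heartbeat >= VIOLATION_AFTER


last_seen = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
```

A periodic checker on the Control Plane side could call is_violation for each tenant's most recent heartbeat and emit a suspend event when it returns True.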

Outcome Feedback Loop

Execution outcomes are collected asynchronously from DPs (no PII) and fed back to the Planner:

OUTCOME PAYLOAD (No PII):
- execution_id
- workflow_id, tenant_id
- success: boolean
- duration_ms, tokens_used
- gpu_seconds_used
- embedding_vectors
- error_code
- timestamp

EXCLUDED FIELDS:
- user prompts
- generated content
- PII / user data
- API keys / credentials
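
One way to enforce this split in the Data Plane is an allowlist applied just before sending, so that any field not explicitly permitted is dropped. The sketch below assumes the outcome is a flat dictionary. The function name and the sample record are hypothetical.

```python
# Field names copied from the outcome payload list above.
ALLOWED_FIELDS = frozenset({
    "execution_id",
    "workflow_id",
    "tenant_id",
    "success",
    "duration_ms",
    "tokens_used",
    "gpu_seconds_used",
    "embedding_vectors",
    "error_code",
    "timestamp",
})


def build_outcome_payload(raw: dict) -> dict:
    """Keep only allowlisted keys; everything else never leaves the DP."""
    return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}


# Hypothetical raw record that accidentally carries excluded data.
raw_record = {
    "execution_id": "exec-123",
    "workflow_id": "lead_intake_agent",
    "tenant_id": "tenant-1",
    "success": True,
    "duration_ms": 840,
    "tokens_used": 512,
    "prompt": "text that must stay in the Data Plane",
    "api_key": "value that must stay in the Data Plane",
}
payload = build_outcome_payload(raw_record)
```

An allowlist fails closed: a newly added field stays in the Data Plane until someone deliberately permits it, which a blocklist of excluded fields would not guarantee.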

Technology Stack

Control Plane

  • Compute: Serverless functions
  • API: API Gateway HTTP API
  • Database: NoSQL database (8+ tables with indexes, streams, TTL)
  • Storage: Object storage (per-tenant buckets with lifecycle policies)
  • Auth: Identity provider, JWT tokens (ES256 signing)
  • Scheduling: EventBridge (timeout checker every 5min)
  • Queue: SQS with DLQ (completion callback retries)
  • AI: LLM (planning, time prediction)
  • IaC: Infrastructure as Code

Data Plane

  • Compute: Spot compute instances running the workflow runtime
  • Orchestration: workflow runtime (containerized)
  • Networking: Network peering (DP to CP communication)
  • Load Balancing: Shared ALB with host-based routing
  • Metering: Go sidecar service (heartbeat, tamper detection)
  • Monitoring: Metrics and log aggregation

Admin Panel

  • Framework: Web application framework
  • Storage UI: Object storage browser interface
  • Auth: STS temporary credentials (1-hour)

Deployment Flow

Monitoring & Observability

  • Metrics: Function metrics, API Gateway latency
  • X-Ray: Distributed tracing across CP/DP
  • Logging: Centralized logging
  • EventBridge: Async event processing
  • SNS: Alerts for violations and errors

Future Enhancements

  1. ML Model for Predictions

    • Train custom model on prediction_error data
    • Replace the initial LLM planner to improve efficiency and latency
    • Track confidence calibration
  2. Multi-Region Deployment

    • Active-active across us-east-1, eu-west-1
    • Route53 geo-routing for low latency
    • Cross-region database replication
  3. Public API (Phase 2)

    • REST API for direct workflow invocation
    • API key management
    • Rate limiting per API key
    • Usage metering integration
  4. Advanced Analytics

    • Real-time execution dashboards
    • Resource attribution by workflow/tenant
    • Anomaly detection for unusual patterns
    • Predictive capacity planning