ZeroTwo/ LABSCase studies  ·  02 · Sovereign AITalk to a founder →
CASE STUDY · 02 / 06Infrastructure · 2025

Frontier AI,
inside your
own perimeter.

The first wave of enterprise AI shipped as API calls to someone else's cloud. The second wave doesn't. We deploy and operate open-source models, RAG systems, and agent runtimes inside the customer's VPC, on-prem, or air-gapped — and run them like a private OpenAI.

Engagement
Sovereign deploy + managed ops
Deployment targets
AWS · Azure · GCP · bare metal · air-gap
Models supported
Llama · Mistral · DeepSeek · Qwen · custom
Compliance footprint
GDPR · HIPAA · SOC 2 · PCI · FedRAMP

Nothing leaves the building.

The whole stack runs inside the customer's perimeter. No silent token exfil, no shadow-IT API calls, no third-party logs of prompts.

openai.comanthropic.combedrock.awsazure.openaigemini.api
perimeter

Models live where your data lives.

Open-weights models, gateway, RAG, agents and observability — deployed into your VPC, your colo, your air-gapped rack.

llama-3.1mistraldeepseekqwen-2phi-3+custom
01 · The problem

The default AI stack ships your data offshore.

Most enterprise AI today is a thin client over someone else's GPU. The convenience is real; so are the consequences. Four failure modes show up again and again in regulated and security-conscious orgs.

What "AI-as-an-API" actually costs

  • Customer data, source code, contracts, and patient records routinely transit third-party clouds.
  • Auditors flag every prompt and every retrieval as a data-egress event with no contractual recourse.
  • Models, prices, rate limits, and feature availability change unilaterally — you don't own the dependency.
  • Outages at a single vendor become outages at your product. Failover plans are mostly hopes.
  • The inference stack is a black box — no quantisation, no batching policy, no co-location with your data.
  • Per-token economics get punishing at scale. Most teams discover this only after launch.

What sovereign AI gives back

  • Open-weights models running in your VPC, your colo, or fully air-gapped.
  • Every prompt, retrieval, and trace stays inside your perimeter. Auditable end to end.
  • You pick the model, the version, the quantisation, the routing policy — and pin them.
  • Multi-model fleet with automatic failover. One model degraded, the gateway routes elsewhere.
  • Full inference stack tuning: vLLM, TensorRT, batching, KV cache, speculative decoding.
  • Per-GPU economics. At meaningful scale, 4–10× cheaper than equivalent API spend.
02 · The sovereign stack

Five layers. One perimeter. Zero external dependencies.

The stack is not novel; what's novel is operating it. We deploy and run layers 1–4 as a managed service, and your applications consume them through a stable internal gateway.

// STACK · zt-runtime/v3operated by ZeroTwo · L1–L4External providers · blockedopenai.comanthropic.combedrock · awsazure · openaigemini.api// Customer environment · VPC · on-prem · air-gapL5appsEnterprise applicationscustomer-facing & internal — consume the gateway onlycrmerpcopilotsanalyticsL4intelEnterprise intelligenceRAG · agents · memory · workflows · evals · observabilityragagentsmemoryworkflowsobserveL3modelsModel layeropen-weights + customer fine-tunes — pinned versions, no surprise upgradesllamamistraldeepseekqwen+custom-ftL2runtimeAI runtimegateway · routing · batching · cache · autoscale · failovervLLMtensorrtray servetritonk8s · autoscaleL1infraInfrastructurecustomer cloud · on-prem GPU · bare metal · air-gapped racksaws · vpcazure · gcpgpu clusterbare · air-gap
03 · The platform

Six things we run for you.

The platform is the operational surface. Each capability is a hard problem we've already solved — so your team doesn't have to grow into an internal LLMOps org.

/ 03.ADEPLOY

Sovereign deployment

Open-weight models stood up inside your environment — VPC, on-prem, or fully air-gapped. No outbound traffic, no third-party logging, no surprise dependencies.

air-gapvpc-onlyprivate-link
Targets
cloud · bare
Egress
0 bytes
Setup
2–4 wks
Audit
built-in
/ 03.BFLEET

Multi-model orchestration

Run a fleet — Llama, Mistral, DeepSeek, Qwen, Phi, plus your own fine-tunes — with pinned versions, controlled rollouts, and automatic failover between them.

multi-modelpinnedfailover
Catalog
20+
Quant.
fp8 / 4bit
Rollback
1-click
SLO
99.9%
/ 03.CGATEWAY

Intelligent gateway

One OpenAI-compatible endpoint your applications already know how to call. Behind it: routing by cost, latency, or quality; rate limits; per-team quotas; auth; full audit log.

openai-apismart-routeaudited
Routes
cost · fast · best
p95 add
~3ms
Auth
SSO · key
Tracing
otel
/ 03.DTUNING

Fine-tuning infra

LoRA, full fine-tunes, and preference optimisation pipelines wired into your data. Synthetic data generation for the tail cases. Evals to prove the new weights are actually better.

loradposynth-data
Recipes
opinionated
Eval gate
required
Cycle
days
Lineage
tracked
/ 03.ERAG

Enterprise RAG stack

Private retrieval across PDFs, databases, email archives, codebases, wikis — with reranking, citation, freshness, and per-document access controls inherited from your IdP.

hybrid-searchrerankidp-acl
Vector DB
your choice
Connectors
14
Refresh
streaming
ACL
per-doc
/ 03.FAGENTS + OBS

Agent runtime & observability

A runtime for long-running agents and workflows — tool use, retries, budgets, human-in-the-loop checkpoints — paired with full traces, token accounting, and behavioural evals.

tool-usebudgetedtraced
HITL
built-in
Traces
per-step
Eval
scheduled
Cost cap
per-run
Compliance footprint · in-scope frameworksaudit hooks · per-deployment
GDPRIn scopeData-residency controls; right-to-erasure across vector indexes & logs.
HIPAAIn scopePHI never leaves the perimeter; BAA-compatible deployment patterns.
SOC 2In scopeType II audit-ready logging across gateway, runtime, RAG, and agents.
PCI DSSIn scopeTokenisation gateway; cardholder data isolated from inference path.
FedRAMPIn scopeHardened deployment profile for GovCloud and air-gapped racks.
ISO 27001In scopeInformation-security controls aligned with platform observability.
04 · Operator console

What your platform team sees.

The same view our SRE team and your platform team look at every day. Real-time fleet health, gateway throughput, routing decisions, and what we'd call you about before you notice.

zt-runtime · operator
FleetGatewayRAGAgentsCost
LIVE · last refresh 11s ago
Requests · 24h▲ 12%
5.14M
internal apps · 23 consumers
p95 latency · gateway▼ 18ms
184ms
SLO 250ms · met
GPU utilisation▲ 4pt
67%
fleet avg · 12 nodes
Spend vs equivalent API▼ 78%
$11.4K
week-to-date · saved $40.8K
Inference fleet · live12 pods · 5 models
ModelStateGPURPSp95Pods
llama-3.1-70bhealthy78%4211874
mistral-7b-fin-fthealthy54%912622
deepseek-coder-33bhealthy41%1182401
qwen-2-72bhot92%2182142
bge-large-embedhealthy22%3,412122
phi-3-mini-canarycanary14%42881
Routing policy · live decisionspolicy v14

The gateway picks per request based on consumer SLO and complexity. Cost-optimised by default; the fast and best tiers are opt-in per app or per route.

// gateway.router · last 8 decisions
  • 14:21support-copilot · summarise_ticket → mistral-7b-fin-ft62msCOST
  • 14:21devtools · complete_code → deepseek-coder-33b240msBEST
  • 14:21analytics · sql_to_text → llama-3.1-70b187msBEST
  • 14:21crm-assist · classify_intent → mistral-7b-fin-ft61msFAST
  • 14:20search · embed_query → bge-large-embed11msFAST
  • 14:20fr-helpdesk · respond_ticket → qwen-2-72b214msBEST
  • 14:20support-copilot · summarise_ticket → mistral-7b-fin-ft59msCOST
  • 14:20compliance-agent · audit_check → llama-3.1-70b192msBEST
05 · The bigger shift

From API consumption to owned infrastructure.

The first AI wave was rented. The second is owned. Enterprises that have built anything serious are already migrating the workloads they care about into their own perimeter.

Era 01/ 2023 — 2024

API consumption

Plug into someone else's cloud. Ship the demo fast, sort the rest later.

  • Tokens, latency, and price set by the vendor.
  • Sensitive data exits the perimeter on every call.
  • Outages, deprecations, and rate limits land without notice.
  • Inference stack is opaque — no levers to pull.
  • Procurement and security teams arrive late and angry.
Sovereign
shift
Era 02/ 2025 →

Owned AI infrastructure

The models, the gateway, and the data plane live inside your environment. We operate them.

  • Per-GPU economics; meaningful unit-cost gains at scale.
  • Data never leaves the perimeter. Auditors stop arguing.
  • Pinned model versions; you decide when to upgrade.
  • Full runtime tuning — quant, batching, KV cache, routing.
  • Procurement, security, and platform on the same page.
01 · Egress
0 bytes
Sensitive data leaving the customer perimeter.
02 · Unit cost
−78%
Spend per million tokens vs equivalent API consumption.
03 · p95 latency
−42%
Gateway p95, models co-located with calling apps.
04 · Time to deploy
2–4 wks
From discovery to a production gateway endpoint.
05 · Uptime
99.95%
Trailing-90-day across managed customer fleets.
06 · Where this lands

Industries with no choice but sovereign.

Five categories already commit hard to private deployment — and a sixth where the economics simply favour it.

Banking & financial services

Private financial copilots, model risk management, audit-friendly inference logs.

SOC 2 · PCI · MRM

Healthcare & life sciences

HIPAA-bound retrieval, clinical-note assistants, PHI-isolated agent runtimes.

HIPAA · BAA

Defense & government

Air-gapped sovereign AI for intelligence, ops, and classified workloads.

FedRAMP · IL5

Large enterprises

Internal copilots and knowledge systems wired into existing IdP and ACLs.

SSO · IdP · ACL

SaaS & product companies

Private AI backend powering their product, with sane unit economics at scale.

CONTROL · COST
07 · Engagement shape

How the partnership is priced.

A small set of clean line items, sized to the actual surface we operate. Most customers start at the first two, expand into the rest over the first year.

01

Deployment

Discovery, infra design, hardening, first endpoint live in your environment.

2–4 weeks · fixed
onboarding
02

Managed LLMOps

We run layers 1–4: gateway, fleet, RAG, agents, observability, upgrades.

monthly · scale-tiered
recurring
03

GPU & inference optimisation

Quantisation, batching policy, speculative decoding — measured against your real traffic mix.

retainer or perf-share
project
04

Fine-tuning & eval

Domain adaptation with proper evals — synthetic data generation, LoRA training, DPO, gating.

per-model · gated
project
05

AI observability

Tokens, hallucination rate, agent behaviour, drift detection, cost attribution — all native.

bundled with ops
platform
06

Enterprise support & SLA

24/7 on-call, hardened SLO, named TPM, quarterly performance reviews.

tier · gold / platinum
enterprise
08 · Technologies used

Open source where it earns its keep.

Almost everything we operate is open-weights and open-source. The proprietary bit is the operational layer wrapped around it.

Model layer
Llama 3.xMistral · MixtralDeepSeekQwen 2Phi-3Gemma+ custom fine-tunes
Inference runtime
vLLMTensorRT-LLMRay ServeTritonSGLang
Orchestration
KubernetesKubeRayKarpenterArgoCDTerraform
RAG & data
pgvectorWeaviateQdrantLlamaIndexPostgresS3 / object
Agents & tuning
LangGraphDSPyUnsloth (LoRA)TRL · DPOAxolotl
Observability & governance
OpenTelemetryPrometheus · GrafanaLokiinternal · zt-tracepolicy as code
// LET'S DEPLOY

Bring AI inside the perimeter.

If you're rate-limited by a vendor, blocked by security review, or simply tired of explaining to procurement why prompts contain customer PII — we'll deploy and operate a sovereign stack for you. Cloud, on-prem, or air-gap.