Sovereign AI Infrastructure

01 · The problem

The default AI stack ships your data offshore.

Most enterprise AI today is a thin client over someone else's GPU. The convenience is real; so are the consequences. Four failure modes show up again and again in regulated and security-conscious orgs.

What "AI-as-an-API" actually costs

Customer data, source code, contracts, and patient records routinely transit third-party clouds.
Auditors flag every prompt and every retrieval as a data-egress event with no contractual recourse.
Models, prices, rate limits, and feature availability change unilaterally — you don't own the dependency.
Outages at a single vendor become outages at your product. Failover plans are mostly hopes.
The inference stack is a black box — no quantisation, no batching policy, no co-location with your data.
Per-token economics get punishing at scale. Most teams discover this only after launch.

What sovereign AI gives back

Open-weights models running in your VPC, your colo, or fully air-gapped.
Every prompt, retrieval, and trace stays inside your perimeter. Auditable end to end.
You pick the model, the version, the quantisation, the routing policy — and pin them.
Multi-model fleet with automatic failover. One model degraded, the gateway routes elsewhere.
Full inference stack tuning: vLLM, TensorRT, batching, KV cache, speculative decoding.
Per-GPU economics. At meaningful scale, 4–10× cheaper than equivalent API spend.

02 · The sovereign stack

Five layers. One perimeter. Zero external dependencies.

The stack is not novel; what's novel is operating it. We deploy and run layers 1–4 as a managed service, and your applications consume them through a stable internal gateway.

// STACK · zt-runtime/v3operated by ZeroTwo · L1–L4

03 · The platform

Six things we run for you.

The platform is the operational surface. Each capability is a hard problem we've already solved — so your team doesn't have to grow into an internal LLMOps org.

/ 03.ADEPLOY

Sovereign deployment

Open-weight models stood up inside your environment — VPC, on-prem, or fully air-gapped. No outbound traffic, no third-party logging, no surprise dependencies.

air-gapvpc-onlyprivate-link

Targets: cloud · bare
Egress: 0 bytes
Setup: 2–4 wks
Audit: built-in

/ 03.BFLEET

Multi-model orchestration

Run a fleet — Llama, Mistral, DeepSeek, Qwen, Phi, plus your own fine-tunes — with pinned versions, controlled rollouts, and automatic failover between them.

multi-modelpinnedfailover

Catalog: 20+
Quant.: fp8 / 4bit
Rollback: 1-click
SLO: 99.9%

/ 03.CGATEWAY

Intelligent gateway

One OpenAI-compatible endpoint your applications already know how to call. Behind it: routing by cost, latency, or quality; rate limits; per-team quotas; auth; full audit log.

openai-apismart-routeaudited

Routes: cost · fast · best
p95 add: ~3ms
Auth: SSO · key
Tracing: otel

/ 03.DTUNING

Fine-tuning infra

LoRA, full fine-tunes, and preference optimisation pipelines wired into your data. Synthetic data generation for the tail cases. Evals to prove the new weights are actually better.

loradposynth-data

Recipes: opinionated
Eval gate: required
Cycle: days
Lineage: tracked

/ 03.ERAG

Enterprise RAG stack

Private retrieval across PDFs, databases, email archives, codebases, wikis — with reranking, citation, freshness, and per-document access controls inherited from your IdP.

hybrid-searchrerankidp-acl

Vector DB: your choice
Connectors: 14
Refresh: streaming
ACL: per-doc

/ 03.FAGENTS + OBS

Agent runtime & observability

A runtime for long-running agents and workflows — tool use, retries, budgets, human-in-the-loop checkpoints — paired with full traces, token accounting, and behavioural evals.

tool-usebudgetedtraced

HITL: built-in
Traces: per-step
Eval: scheduled
Cost cap: per-run

04 · Operator console

What your platform team sees.

The same view our SRE team and your platform team look at every day. Real-time fleet health, gateway throughput, routing decisions, and what we'd call you about before you notice.

zt-runtime · operator

FleetGatewayRAGAgentsCost

LIVE · last refresh 11s ago

Requests · 24h▲ 12%

5.14M

internal apps · 23 consumers

p95 latency · gateway▼ 18ms

184ms

SLO 250ms · met

GPU utilisation▲ 4pt

67%

fleet avg · 12 nodes

Spend vs equivalent API▼ 78%

$11.4K

week-to-date · saved $40.8K

Inference fleet · live12 pods · 5 models

Model	State	GPU	RPS	p95	Pods
llama-3.1-70b	healthy	78%	421	187	4
mistral-7b-fin-ft	healthy	54%	912	62	2
deepseek-coder-33b	healthy	41%	118	240	1
qwen-2-72b	hot	92%	218	214	2
bge-large-embed	healthy	22%	3,412	12	2
phi-3-mini-canary	canary	14%	42	88	1

Routing policy · live decisionspolicy v14

The gateway picks per request based on consumer SLO and complexity. Cost-optimised by default; the fast and best tiers are opt-in per app or per route.

// gateway.router · last 8 decisions

14:21support-copilot · summarise_ticket → mistral-7b-fin-ft62msCOST
14:21devtools · complete_code → deepseek-coder-33b240msBEST
14:21analytics · sql_to_text → llama-3.1-70b187msBEST
14:21crm-assist · classify_intent → mistral-7b-fin-ft61msFAST
14:20search · embed_query → bge-large-embed11msFAST
14:20fr-helpdesk · respond_ticket → qwen-2-72b214msBEST
14:20support-copilot · summarise_ticket → mistral-7b-fin-ft59msCOST
14:20compliance-agent · audit_check → llama-3.1-70b192msBEST

05 · The bigger shift

From API consumption to owned infrastructure.

The first AI wave was rented. The second is owned. Enterprises that have built anything serious are already migrating the workloads they care about into their own perimeter.

Era 01/ 2023 — 2024

API consumption

Plug into someone else's cloud. Ship the demo fast, sort the rest later.

Tokens, latency, and price set by the vendor.
Sensitive data exits the perimeter on every call.
Outages, deprecations, and rate limits land without notice.
Inference stack is opaque — no levers to pull.
Procurement and security teams arrive late and angry.

Sovereign
shift

Era 02/ 2025 →

Owned AI infrastructure

The models, the gateway, and the data plane live inside your environment. We operate them.

Per-GPU economics; meaningful unit-cost gains at scale.
Data never leaves the perimeter. Auditors stop arguing.
Pinned model versions; you decide when to upgrade.
Full runtime tuning — quant, batching, KV cache, routing.
Procurement, security, and platform on the same page.

01 · Egress

0 bytes

Sensitive data leaving the customer perimeter.

02 · Unit cost

−78%

Spend per million tokens vs equivalent API consumption.

03 · p95 latency

−42%

Gateway p95, models co-located with calling apps.

04 · Time to deploy

2–4 wks

From discovery to a production gateway endpoint.

05 · Uptime

99.95%

Trailing-90-day across managed customer fleets.

06 · Where this lands

Industries with no choice but sovereign.

Five categories already commit hard to private deployment — and a sixth where the economics simply favour it.

Banking & financial services

Private financial copilots, model risk management, audit-friendly inference logs.

SOC 2 · PCI · MRM

Healthcare & life sciences

HIPAA-bound retrieval, clinical-note assistants, PHI-isolated agent runtimes.

HIPAA · BAA

Defense & government

Air-gapped sovereign AI for intelligence, ops, and classified workloads.

FedRAMP · IL5

Large enterprises

Internal copilots and knowledge systems wired into existing IdP and ACLs.

SSO · IdP · ACL

SaaS & product companies

Private AI backend powering their product, with sane unit economics at scale.

CONTROL · COST

07 · Engagement shape

How the partnership is priced.

A small set of clean line items, sized to the actual surface we operate. Most customers start at the first two, expand into the rest over the first year.

01

Deployment

Discovery, infra design, hardening, first endpoint live in your environment.

2–4 weeks · fixed

onboarding

02

Managed LLMOps

We run layers 1–4: gateway, fleet, RAG, agents, observability, upgrades.

monthly · scale-tiered

recurring

03

GPU & inference optimisation

Quantisation, batching policy, speculative decoding — measured against your real traffic mix.

retainer or perf-share

project

04

Fine-tuning & eval

Domain adaptation with proper evals — synthetic data generation, LoRA training, DPO, gating.

per-model · gated

project

05

AI observability

Tokens, hallucination rate, agent behaviour, drift detection, cost attribution — all native.

bundled with ops

platform

06

Enterprise support & SLA

24/7 on-call, hardened SLO, named TPM, quarterly performance reviews.

tier · gold / platinum

enterprise

08 · Technologies used

Open source where it earns its keep.

Almost everything we operate is open-weights and open-source. The proprietary bit is the operational layer wrapped around it.

Model layer

Llama 3.xMistral · MixtralDeepSeekQwen 2Phi-3Gemma+ custom fine-tunes

Inference runtime

vLLMTensorRT-LLMRay ServeTritonSGLang

Orchestration

KubernetesKubeRayKarpenterArgoCDTerraform

RAG & data

pgvectorWeaviateQdrantLlamaIndexPostgresS3 / object

Agents & tuning

LangGraphDSPyUnsloth (LoRA)TRL · DPOAxolotl

Observability & governance

OpenTelemetryPrometheus · GrafanaLokiinternal · zt-tracepolicy as code

Frontier AI,
inside your
own perimeter.

Nothing leaves the building.

Models live where your data lives.

The default AI stack ships your data offshore.

What "AI-as-an-API" actually costs

What sovereign AI gives back

Five layers. One perimeter. Zero external dependencies.

Six things we run for you.

Sovereign deployment

Multi-model orchestration

Intelligent gateway

Fine-tuning infra

Enterprise RAG stack

Agent runtime & observability

What your platform team sees.

From API consumption to owned infrastructure.

API consumption

Owned AI infrastructure

Industries with no choice but sovereign.

Banking & financial services

Healthcare & life sciences

Defense & government

Large enterprises

SaaS & product companies

How the partnership is priced.

Open source where it earns its keep.

Bring AI inside the perimeter.

Frontier AI,inside yourown perimeter.

Nothing leaves the building.

Models live where your data lives.

The default AI stack ships your data offshore.

What "AI-as-an-API" actually costs

What sovereign AI gives back

Five layers. One perimeter. Zero external dependencies.

Six things we run for you.

Sovereign deployment

Multi-model orchestration

Intelligent gateway

Fine-tuning infra

Enterprise RAG stack

Agent runtime & observability

What your platform team sees.

From API consumption to owned infrastructure.

API consumption

Owned AI infrastructure

Industries with no choice but sovereign.

Banking & financial services

Healthcare & life sciences

Defense & government

Large enterprises

SaaS & product companies

How the partnership is priced.

Open source where it earns its keep.

Bring AI inside the perimeter.

Frontier AI,
inside your
own perimeter.