Sovereign deployment
Open-weight models stood up inside your environment — VPC, on-prem, or fully air-gapped. No outbound traffic, no third-party logging, no surprise dependencies.
The first wave of enterprise AI shipped as API calls to someone else's cloud. The second wave doesn't. We deploy and operate open-source models, RAG systems, and agent runtimes inside the customer's VPC, on-prem, or air-gapped — and run them like a private OpenAI.
The whole stack runs inside the customer's perimeter. No silent token exfil, no shadow-IT API calls, no third-party logs of prompts.
Open-weights models, gateway, RAG, agents and observability — deployed into your VPC, your colo, your air-gapped rack.
Most enterprise AI today is a thin client over someone else's GPU. The convenience is real; so are the consequences. Four failure modes show up again and again in regulated and security-conscious orgs.
The stack is not novel; what's novel is operating it. We deploy and run layers 1–4 as a managed service, and your applications consume them through a stable internal gateway.
The platform is the operational surface. Each capability is a hard problem we've already solved — so your team doesn't have to grow into an internal LLMOps org.
Open-weight models stood up inside your environment — VPC, on-prem, or fully air-gapped. No outbound traffic, no third-party logging, no surprise dependencies.
Run a fleet — Llama, Mistral, DeepSeek, Qwen, Phi, plus your own fine-tunes — with pinned versions, controlled rollouts, and automatic failover between them.
One OpenAI-compatible endpoint your applications already know how to call. Behind it: routing by cost, latency, or quality; rate limits; per-team quotas; auth; full audit log.
LoRA, full fine-tunes, and preference optimisation pipelines wired into your data. Synthetic data generation for the tail cases. Evals to prove the new weights are actually better.
Private retrieval across PDFs, databases, email archives, codebases, wikis — with reranking, citation, freshness, and per-document access controls inherited from your IdP.
A runtime for long-running agents and workflows — tool use, retries, budgets, human-in-the-loop checkpoints — paired with full traces, token accounting, and behavioural evals.
The same view our SRE team and your platform team look at every day. Real-time fleet health, gateway throughput, routing decisions, and what we'd call you about before you notice.
| Model | State | GPU | RPS | p95 | Pods |
|---|---|---|---|---|---|
| llama-3.1-70b | healthy | 78% | 421 | 187 | 4 |
| mistral-7b-fin-ft | healthy | 54% | 912 | 62 | 2 |
| deepseek-coder-33b | healthy | 41% | 118 | 240 | 1 |
| qwen-2-72b | hot | 92% | 218 | 214 | 2 |
| bge-large-embed | healthy | 22% | 3,412 | 12 | 2 |
| phi-3-mini-canary | canary | 14% | 42 | 88 | 1 |
The gateway picks per request based on consumer SLO and complexity. Cost-optimised by default; the fast and best tiers are opt-in per app or per route.
The first AI wave was rented. The second is owned. Enterprises that have built anything serious are already migrating the workloads they care about into their own perimeter.
Plug into someone else's cloud. Ship the demo fast, sort the rest later.
The models, the gateway, and the data plane live inside your environment. We operate them.
Five categories already commit hard to private deployment — and a sixth where the economics simply favour it.
Private financial copilots, model risk management, audit-friendly inference logs.
SOC 2 · PCI · MRMHIPAA-bound retrieval, clinical-note assistants, PHI-isolated agent runtimes.
HIPAA · BAAAir-gapped sovereign AI for intelligence, ops, and classified workloads.
FedRAMP · IL5Internal copilots and knowledge systems wired into existing IdP and ACLs.
SSO · IdP · ACLPrivate AI backend powering their product, with sane unit economics at scale.
CONTROL · COSTA small set of clean line items, sized to the actual surface we operate. Most customers start at the first two, expand into the rest over the first year.
Deployment
Discovery, infra design, hardening, first endpoint live in your environment.
Managed LLMOps
We run layers 1–4: gateway, fleet, RAG, agents, observability, upgrades.
GPU & inference optimisation
Quantisation, batching policy, speculative decoding — measured against your real traffic mix.
Fine-tuning & eval
Domain adaptation with proper evals — synthetic data generation, LoRA training, DPO, gating.
AI observability
Tokens, hallucination rate, agent behaviour, drift detection, cost attribution — all native.
Enterprise support & SLA
24/7 on-call, hardened SLO, named TPM, quarterly performance reviews.
Almost everything we operate is open-weights and open-source. The proprietary bit is the operational layer wrapped around it.
If you're rate-limited by a vendor, blocked by security review, or simply tired of explaining to procurement why prompts contain customer PII — we'll deploy and operate a sovereign stack for you. Cloud, on-prem, or air-gap.