On-Premises AI Buyer's Guide 2026

How to choose hardware, open-weight models and inference stacks for sovereign generative AI that runs entirely inside your perimeter. 2026 buyer's guide.

Zeour Engineering · Dec 22, 2025 · 18 min read · 3,503 words
Topics: on-premises AI, sovereign LLM, private LLM, open-weight LLM, RAG, AI infrastructure, buyer guide
Related solution: DT Consultation

Key takeaways

  • An 8B open-weight model in fp16 needs ~16GB VRAM for weights alone; a 70B needs ~140GB — sizing starts from this maths, not a vendor brochure.
  • One or two NVIDIA H100 80GB nodes (£30k-£45k each) plus a sober RAG layer beat a public-cloud LLM API on residency, audit and total cost beyond the inflection point.
  • The serious open-weight landscape in 2026 is Llama 3.x, Mistral, Mixtral 8x22B, Qwen 2.5, DeepSeek and Gemma — picked per task, not picked once.
  • vLLM, Ollama and TGI are the production inference stacks; each fits a different workload shape and you will likely run two side-by-side.
  • Mode-based assistants (frozen prompts, bounded tools, scoped retrieval — proven in the MediCare clinic management system AI Clinical Assistant) beat one general assistant on safety and audit.
  • A defensible business case clears 5-year net benefit of £1.4m-£4.8m for a mid-market estate — dominated by deflected per-token fees and recovered staff time.
  • Discovery £12k-£35k, small build £80k-£180k, enterprise £300k-£900k, single H100 node £30k-£45k, 8-GPU H200 cluster £500k-£900k. Add 30% for install.

Between the first internal demo of a public LLM API and the second meeting with the data-protection officer, every regulated enterprise hits the same wall: the model is useful, the legal posture is not. This guide is the conversation you would get from a senior engineer who has shipped on-premises AI into regulated production — what to size, what to buy, what to skip, and where the traps hide.

Who this guide is for

  • CIOs and CISOs at regulated enterprises. You have a board mandate to do something with generative AI and a contradictory mandate to not send customer data outside the perimeter. The vendor pitches alternate between "trust our cloud" and "buy a server." Neither is an architecture.
  • Heads of AI / Data Science. You have proven public-cloud LLM APIs work. You now need an open-weight LLM production architecture on your hardware, with RAG against your knowledge base and an evaluation harness you can defend at a regulator meeting.
  • Programme directors at hospital groups, ministries and operators. You are looking at the AI Clinical Assistant pattern from MediCare, or its equivalent in banking or government, and need a buying playbook that does not become a 24-month research project. See the healthcare industry profile.
  • Defence and national-security IT leads. You need air-gapped deployment, signed model bundles, full prompt-and-completion audit, and a hardware sizing plan that survives procurement. No external network calls, ever.

What is on-premises AI in 2026?

On-premises AI in 2026 means hosting an open-weight large language model — and the embedding model, vector store, inference server, safety filters, audit log and RAG pipeline around it — entirely inside infrastructure you control. That can be an on-prem data centre, a colocation cage, a sovereign-cloud tenancy that contractually never egresses, or a customer-owned GPU appliance behind an air gap. The defining property is that prompts, completions, embeddings and retrieved documents never traverse a third-party multi-tenant service.

Technically, you are deploying a serving layer (vLLM, Ollama, TGI or llama.cpp) hosting one or several open-weight models — Llama 3.x in 8B and 70B, Mistral 7B Instruct, Mixtral 8x22B, Qwen 2.5, DeepSeek V2 or V3, and Gemma where its licence fits — fronted by an application layer that orchestrates retrieval-augmented generation, tool calls and evaluation. The application talks to a vector store (pgvector inside Postgres, Qdrant or Milvus), an object store, a relational audit store, and your existing CRM, ITSM, EHR and knowledge bases via internal-only APIs.

What separates a real platform from shelfware is the cluster of things nobody puts on the slide: an evaluation harness on every prompt change, rollback for prompt templates, version-pinned weights with checksums, immutable audit on every prompt and completion, per-mode rate limits, scoped retrieval that cannot accidentally surface records an actor cannot see, and a clear answer to "what happens when the model says something wrong about a customer." Most demos have none of these; production has all of them. Treat the gap as the engineering work — that is what you are buying.

On-premises AI is not anti-cloud; it is anti-leakage. For non-sensitive workloads — marketing drafting, slide generation, demos — a public-cloud LLM API is often right. The buyer's job is to draw the boundary. The hybrid pattern below is what most serious estates run.

The 14-criterion scoring rubric — score every vendor

  1. Sovereign deployment posture. Why: if a vendor needs to send prompts off-site for support, you do not have on-prem. Test: ask for a customer who has run the stack air-gapped for 30 days with zero external connections.
  2. Open-weight model support: multiple, switchable. Why: a single-model bet ages badly. Test: deploy a 7B-13B dense model, a Mixture-of-Experts model and a 70B dense model on the same platform; switch in production.
  3. Inference stack maturity. Why: serving is the bottleneck. Test: demand a vLLM-or-TGI benchmark at your target batch size and sequence length.
  4. VRAM sizing transparency. Why: vendors who refuse to share VRAM math are guessing. Test: ask for the calculation for your concurrent users, tokens per request and target P99 latency.
  5. RAG architecture. Why: retrieval quality determines whether anyone trusts the output. Test: ask to see the chunking strategy, reranker, citation surface and evaluation set.
  6. Mode-based prompt design. Why: one generic assistant is impossible to audit. Test: count frozen, named modes; five or more is a real platform, one is a demo.
  7. Evaluation harness. Why: without eval you cannot ship a prompt change safely. Test: ask to see the harness output for the last three prompt changes in production.
  8. Audit trail granularity. Why: you will be asked who said what, when. Test: pull a row and check it holds prompt, completion, retrieved chunks, model version, prompt version, actor, timestamp, outcome.
  9. Hardware portfolio fluency. Why: H100 is the default, but H200, L40S, A100 80GB and AMD MI300 all have a place. Test: ask which they have shipped into production.
  10. Air-gap readiness. Why: over-the-air updates against a public hub will fail security review. Test: ask for the signed-bundle procedure and model-provenance manifest format.
  11. Bilingual EN + AR full RTL out of the box. Why: retrofit is painful. Test: same mode answering the same question in English and Arabic, including UI direction.
  12. Per-mode rate limits and circuit breakers. Why: an internal user looping a prompt exhausts your GPU budget fast. Test: push synthetic traffic at 10x normal and watch it break.
  13. Fixed-fee phased engagement. Why: time-and-materials is how AI pilots become 18-month research projects. Test: ask for a fixed-fee Discovery price and a deliverable list.
  14. 90-day exit window. Why: you need the right to walk with weights, prompts, evals and audit. Test: read the contract clause; see the exit window definition.
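
To compare vendors side by side, the rubric above can be turned into a simple scorecard. The sketch below is one possible way to do it, not a prescribed method: the 0-5 scale, the equal weighting, the `floor` threshold and the criterion keys are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical short keys for the 14 criteria, in rubric order.
CRITERIA = [
    "sovereign_deployment", "open_weight_support", "inference_stack",
    "vram_transparency", "rag_architecture", "mode_based_prompts",
    "evaluation_harness", "audit_granularity", "hardware_fluency",
    "air_gap_readiness", "bilingual_rtl", "rate_limits",
    "fixed_fee_phases", "exit_window",
]

def score_vendor(scores: dict[str, int], floor: int = 2) -> dict:
    """Average a vendor's 0-5 scores and list criteria scoring below the floor."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    weak = [c for c in CRITERIA if scores[c] < floor]
    average = round(mean(scores[c] for c in CRITERIA), 2)
    return {"average": average, "weak": weak}

# Illustrative vendor: solid everywhere except air-gap readiness.
example = {c: 4 for c in CRITERIA}
example["air_gap_readiness"] = 1
result = score_vendor(example)
```

Listing the weak criteria separately from the average keeps a single failed test (for example air-gap readiness) from being hidden by strong scores elsewhere.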

How do you choose between cloud, on-premises and hybrid?

The honest answer is that most regulated enterprises end up hybrid; the question is where you draw the line. The table below is the version we walk customers through at Discovery.

| Dimension | Public-cloud LLM API | Operator-hosted open-weight LLM | Hybrid (sensitive on-prem, non-sensitive cloud) |
| --- | --- | --- | --- |
| Data residency | Vendor jurisdiction, multi-tenant | Operator perimeter, single tenant | Per-class routing; sensitive on-prem |
| Per-token economics | Cheap at low volume, expensive at scale | Capital cost up front, near-zero per-token after | Tracks workload mix |
| Latency | Network-bound, can spike | Local LAN, predictable | Mixed; sensitive paths are local |
| Audit | Vendor-side, partial | Full prompt + completion + chunk audit | Full on-prem; vendor audit on cloud side |
| Compliance fit | GDPR / HIPAA friction varies | Sovereign deployment out of the box | Possible with strict routing controls |
| Model choice | Vendor menu | Any open-weight model, switchable | Both surfaces |
| Operational burden | Low | Real: GPUs, drivers, observability | Highest: two stacks to run well |

The pattern that survives procurement: anything touching PII, patient records, KYC documents, government correspondence, NDA-bound contracts or regulator-visible operational data goes on-premises. Marketing drafting, slide generation and generic code completion can live on a public-cloud LLM API under clear policy. The mistake is letting an enthusiastic team try the cloud with real customer data — the right architecture makes that technically impossible, not just policy-forbidden.
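
The "technically impossible, not just policy-forbidden" idea can be sketched as a fail-closed router: only data classes on an explicit allow-list may reach the cloud surface, and everything else, including classes nobody has labelled yet, stays on-premises. The class names and the `route` function below are hypothetical; a real estate would derive the data class from its own classification tooling.

```python
from enum import Enum

class Surface(Enum):
    ON_PREM = "on_prem"
    CLOUD = "cloud"

# Hypothetical allow-list: the only data classes permitted to leave the perimeter.
CLOUD_ALLOWED = {"marketing_draft", "slide_generation", "generic_code"}

def route(data_class: str) -> Surface:
    """Fail closed: unknown or sensitive classes never reach the cloud surface."""
    if data_class in CLOUD_ALLOWED:
        return Surface.CLOUD
    return Surface.ON_PREM

# PII, patient records, KYC and anything unlabelled fall through to on-prem.
decisions = {c: route(c) for c in ["kyc", "patient_record", "marketing_draft", "unlabelled"]}
```

An allow-list is used rather than a deny-list so that a new, unclassified workload defaults to the safer surface.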

> Want a fixed-fee Discovery price before the end of the call? Talk to Zeour engineering — 30-minute scoping conversation, no slideware, and a published pricing band by the time we hang up.

How much does on-premises AI cost in 2026?

  • Discovery (fixed-fee, 2-4 weeks): £12k-£35k. Use-case scoping, hardware sizing, RAG architecture, compliance posture, evaluation plan, published Build price by the end.
  • Build small (8-12 weeks): £80k-£180k. Single-mode assistant, one GPU node, basic RAG, audit log, eval harness, two-environment deployment.
  • Build enterprise (12-20 weeks): £300k-£900k. Multi-mode (5-10), multi-tenant if needed, HA inference cluster, deep CRM / ITSM / EHR integrations, full SIEM export, RBAC, scoped retrieval, enterprise development services wraparound.
  • Integrate (3-6 weeks per system): £20k-£60k. Wire into Salesforce, Dynamics, SAP, Oracle, Temenos, ticketing, EHR or your internal IAM. Most estates land 3-6 integrations.
  • Pilot + Go-Live (4 weeks): £20k-£50k. Shadow mode, then a single business unit live, with the eval harness gating each change.
  • Hardware reference points: NVIDIA H100 80GB node £30k-£45k; 4-GPU H100 box £160k-£220k; 8-GPU H200 cluster £500k-£900k. AMD MI300 boxes land in a similar band per GB of HBM. Add 30% for install, networking, power and DC fit.
  • Care Plan: Self-Sufficient through Enterprise (24/7, quarterly model and prompt updates, on-call engineer). Banded against estate size.

These are real bands, not anchor-and-discount numbers. Discovery is what closes the loop between the slide and the SOW.

ROI calculator — build a defensible business case in 7 steps

Step 1: Inventory the deflectable tasks

List every task currently consuming licensed human time an assistant could handle end-to-end or as a co-pilot: email triage, first-line tickets, clinical letter drafting, document summarisation, CRM field population, compliance evidence, internal policy Q&A. Annotate with current weekly volume and average minutes per task.

Step 2: Estimate deflection rate per task

Mature on-prem deployments typically land somewhere between 25% (tasks deflected end-to-end) and 65% (time saved when the assistant works as a co-pilot). Be conservative: assume 30% for net-new modes and 55% for well-trodden patterns like clinical letter drafting or ticket triage.

Step 3: Convert to recovered hours per week

(volume × minutes × deflection) ÷ 60. Sum across tasks. A mid-market estate typically recovers 800-2,400 staff-hours per week across 6-10 modes.
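
The Step 3 formula is easy to sanity-check in a few lines. This is a minimal sketch; the `Task` type and the two example tasks with their volumes are invented for illustration and are not taken from any deployment.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    weekly_volume: int       # tasks per week today
    minutes_per_task: float  # average handling time
    deflection: float        # assumed deflection rate, 0.0-1.0 (Step 2)

def recovered_hours_per_week(tasks: list[Task]) -> float:
    """(volume x minutes x deflection) / 60, summed across tasks."""
    minutes = sum(t.weekly_volume * t.minutes_per_task * t.deflection for t in tasks)
    return minutes / 60

# Hypothetical inventory with two tasks.
tasks = [
    Task("ticket triage", weekly_volume=2000, minutes_per_task=6, deflection=0.55),
    Task("policy Q&A", weekly_volume=500, minutes_per_task=12, deflection=0.30),
]
hours = recovered_hours_per_week(tasks)  # roughly 140 hours per week
```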

Step 4: Apply a blended fully-loaded cost rate

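Use fully-loaded cost (salary + employer cost + overhead), not headline salary. Most estates land between £40 and £85 per hour blended. Multiply the weekly recovered hours from Step 3 by this rate and by the number of working weeks in a year; the result is your annual gross benefit.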

Step 5: Compute counterfactual cloud-API cost avoided

For every workload on operator hardware that would otherwise hit a public-cloud LLM API: token volume per month × projected per-token rate. Past the inflection point — typically 50-200 million tokens per month per mode — operator hardware wins per-token economics by 4×-12×.
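
The counterfactual in Step 5 is a straight multiplication per mode. The sketch below assumes a single blended price per million tokens; the mode names, token volumes and the price are placeholders, so substitute your own projected rate.

```python
def monthly_api_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Token volume per month x projected per-token rate."""
    return tokens_per_month / 1_000_000 * price_per_million

def annual_cost_avoided(tokens_by_mode: dict[str, int], price_per_million: float) -> float:
    """Sum the avoided monthly API spend across modes and annualise it."""
    monthly = sum(monthly_api_cost(t, price_per_million) for t in tokens_by_mode.values())
    return 12 * monthly

# Hypothetical volumes and a placeholder blended rate (currency units per 1M tokens).
tokens_by_mode = {"triage": 120_000_000, "summarise": 80_000_000}
avoided = annual_cost_avoided(tokens_by_mode, price_per_million=8.0)
```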

Step 6: Subtract capital and operating cost

Amortise hardware over 4 years, add power and rack cost, add Care Plan, add operator FTE share. Subtract from the sum of Step 4 and Step 5 benefits.
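
Step 6 can be expressed as one function. This is a sketch under the assumptions stated in the text (straight-line amortisation over 4 years); every figure in the example call is a placeholder rather than a quoted price.

```python
def annual_net_benefit(
    staff_benefit: float,       # Step 4: annual gross benefit from recovered hours
    api_cost_avoided: float,    # Step 5: annual cloud-API spend avoided
    hardware_capex: float,      # purchase price including install
    amortisation_years: int = 4,
    power_and_rack: float = 0.0,
    care_plan: float = 0.0,
    operator_fte: float = 0.0,
) -> float:
    """Benefits from Steps 4 and 5 minus amortised capital and operating cost."""
    annual_cost = hardware_capex / amortisation_years + power_and_rack + care_plan + operator_fte
    return staff_benefit + api_cost_avoided - annual_cost

# Placeholder figures for illustration only.
net = annual_net_benefit(
    staff_benefit=500_000, api_cost_avoided=50_000, hardware_capex=200_000,
    power_and_rack=20_000, care_plan=60_000, operator_fte=70_000,
)
```

One-off build and integration fees are not included here; they would be subtracted once from the multi-year total rather than amortised with the hardware.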

Step 7: Worked example — a mid-market estate over 5 years

A hospital group with 9 sites, an enterprise build (~£520k), one H100 4-GPU box (~£200k including install), 8 modes live, 1,800 staff-hours recovered weekly at £62 blended, plus 320M tokens per month moved off a public-cloud API: 5-year net benefit between £2.6m and £3.8m. A single-site bank with 4 modes lands around £1.4m; a national telco with 14 modes above £4.8m. Plug your own numbers — the model is the same.

Seven failure modes from real deployments

Failure mode 1: Under-sizing GPU memory. Teams estimate VRAM for the weights and forget the KV cache, activation memory and batching headroom. An 8B model at fp16 needs 16GB for weights, but real serving at meaningful context lengths and concurrency can want 30GB or more once the KV cache and batching headroom are counted. Fix: do the maths against your target P99 latency and real batch size, and over-provision by 30%.
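
A rough version of that sizing maths is sketched below. It uses the standard KV-cache estimate (two tensors, K and V, per layer per token) and applies the 30% over-provision on top. The architecture numbers in the example (32 layers, 8 KV heads, head dimension 128) are assumptions for an 8B-class model with grouped-query attention; read the real values from your model's config. Activation memory is not modelled.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_streams: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_tokens * concurrent_streams / 1e9

def serving_vram_gb(weights_gb: float, kv_gb: float, headroom: float = 0.30) -> float:
    """Weights plus KV cache, over-provisioned by the given headroom fraction."""
    return (weights_gb + kv_gb) * (1 + headroom)

# Assumed 8B-class shape, 8k-token context, 16 concurrent streams, fp16 cache.
kv = kv_cache_gb(layers=32, kv_heads=8, head_dim=128,
                 context_tokens=8192, concurrent_streams=16)
total = serving_vram_gb(weights_gb=16.0, kv_gb=kv)
```

Under these assumptions the KV cache is already larger than the weights, which is the point of the failure mode: sizing from weights alone misses more than half of the requirement.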

Failure mode 2: No evaluation harness. A team ships a prompt change because it "looks better" in five examples. Two weeks later a regression appears that nobody can attribute. Fix: every prompt template, model version and retrieval setting goes through a 100+ question scored evaluation set; nothing ships without passing.
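
The gate described in the fix can be sketched as two small functions: one that scores a candidate against the evaluation set, and one that decides whether it may ship. The exact-match scorer, the two-question toy set and both thresholds are stand-ins; a production harness would use 100+ questions and a task-appropriate scorer.

```python
from typing import Callable

def exact_match(answer: str, expected: str) -> float:
    """Toy scorer: 1.0 for a case-insensitive exact match, else 0.0."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0

def run_eval(answer_fn: Callable[[str], str],
             eval_set: list[tuple[str, str]],
             score_fn: Callable[[str, str], float] = exact_match) -> float:
    """Mean score of answer_fn over (question, expected_answer) pairs."""
    scores = [score_fn(answer_fn(q), expected) for q, expected in eval_set]
    return sum(scores) / len(scores)

def may_ship(candidate: float, baseline: float,
             min_score: float = 0.85, max_regression: float = 0.02) -> bool:
    """Block the change if it misses the absolute bar or regresses past tolerance."""
    return candidate >= min_score and candidate >= baseline - max_regression

# Toy evaluation set and two hypothetical prompt variants.
eval_set = [("2+2", "4"), ("capital of France", "Paris")]
baseline_fn = lambda q: {"2+2": "4", "capital of France": "Paris"}[q]
regressed_fn = lambda q: "4"
baseline_score = run_eval(baseline_fn, eval_set)
regressed_score = run_eval(regressed_fn, eval_set)
```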

Failure mode 3: Treating the LLM as deterministic. Caching one prompt's answer for reuse in another context produces real harm in regulated settings. Fix: scope outputs to the actor, time and retrieved chunk set; surface uncertainty; allow human review on high-stakes paths.

Failure mode 4: Leaking sensitive data in logs. Prompt-and-completion audit is essential, but the audit log itself becomes a sensitive store. Teams forget to encrypt it or accidentally ship it to a SaaS observability tool. Fix: treat the audit log as the most sensitive store; encrypt at rest, RBAC at row level, SIEM export over an internal-only channel.

Failure mode 5: No rate limit on internal users. A clever internal user wires the assistant into a loop and exhausts your GPU budget in 40 minutes. Fix: per-mode, per-user and per-tenant rate limits with circuit breakers that fail loud.
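
A minimal sketch of the rate-limit half of that fix follows: a sliding-window limiter keyed by mode and user that raises instead of silently queueing. The class name, the limits and the injectable clock are assumptions for illustration; a circuit breaker and per-tenant keys would sit on top of this.

```python
import time
from collections import defaultdict

class RateLimitExceeded(Exception):
    """Raised loudly so the caller and the alerting pipeline both see it."""

class ModeRateLimiter:
    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock                # injectable for tests
        self.calls = defaultdict(list)    # (mode, user) -> recent call timestamps

    def check(self, mode: str, user: str) -> None:
        """Record a call, or raise if this user has hit the limit for this mode."""
        now = self.clock()
        key = (mode, user)
        recent = [t for t in self.calls[key] if now - t < self.window]
        if len(recent) >= self.max_requests:
            self.calls[key] = recent
            raise RateLimitExceeded(
                f"{user} exceeded {self.max_requests} calls per {self.window}s in mode {mode}"
            )
        recent.append(now)
        self.calls[key] = recent
```

Calling `limiter.check(mode, user)` before each inference request means a looping script hits an exception after `max_requests` calls instead of draining the GPU queue.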

Failure mode 6: One giant general assistant. A single prompt asked to triage, draft, summarise and call tools is impossible to evaluate or debug. Fix: the mode-based pattern, meaning 5 to 10 frozen, named, scoped modes, each with its own prompt, retrieval scope, tool list and eval set. This is the lesson the MediCare clinic management system's 7-mode AI Clinical Assistant codifies.
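
One way to make "frozen, named, scoped" concrete is an immutable record per mode. The sketch below is illustrative only: the field names, the two example modes and their scopes are hypothetical and do not describe the MediCare implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    name: str
    prompt_version: str               # pinned prompt template version
    retrieval_scope: frozenset[str]   # corpora this mode may search
    tools: frozenset[str]             # tools this mode may call
    eval_set: str                     # evaluation set that gates changes

    def allows_tool(self, tool: str) -> bool:
        return tool in self.tools

# Hypothetical catalogue with two modes.
MODES = {
    m.name: m
    for m in [
        Mode("letter_drafting", "v3", frozenset({"clinical_templates"}),
             frozenset(), "letters_eval"),
        Mode("ticket_triage", "v7", frozenset({"itsm_kb"}),
             frozenset({"create_ticket"}), "triage_eval"),
    ]
}
```

Because each mode carries its own tool list and retrieval scope, a request in `letter_drafting` cannot call `create_ticket` or search the ITSM knowledge base, and each mode can be evaluated in isolation.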

Failure mode 7: No rollback for prompt or model changes. Prompts and weights are treated as ad-hoc chat-window edits rather than versioned artefacts. When something breaks Friday at 4pm, nobody can revert. Fix: prompts in git, model versions checksummed and pinned, every change atomically rollback-able from the admin console.
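
The rollback behaviour can be illustrated with a small in-memory registry. This is a stand-in for git plus an admin console, not a description of either: the `PromptRegistry` class and its method names are invented for the example.

```python
import hashlib

class PromptRegistry:
    """Keeps every published prompt version per mode and tracks the active one."""

    def __init__(self):
        self._versions: dict[str, list[tuple[str, str]]] = {}  # mode -> [(digest, text)]
        self._active: dict[str, int] = {}                       # mode -> index of active version

    def publish(self, mode: str, text: str) -> str:
        """Append a new version, make it active, and return its short content digest."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._versions.setdefault(mode, []).append((digest, text))
        self._active[mode] = len(self._versions[mode]) - 1
        return digest

    def rollback(self, mode: str) -> str:
        """Re-activate the previous version and return its digest."""
        if self._active.get(mode, 0) == 0:
            raise RuntimeError(f"no earlier version of {mode!r} to roll back to")
        self._active[mode] -= 1
        return self._versions[mode][self._active[mode]][0]

    def active(self, mode: str) -> str:
        return self._versions[mode][self._active[mode]][1]
```

Old versions are never deleted, so a rollback is a pointer move rather than a re-deploy, which is what makes it fast enough to use during an incident.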

Migration path — moving from your current stack

Phase A: Shadow mode (3-6 weeks). Stand up the on-prem stack alongside whatever you run today. Mirror every production prompt to both surfaces, log both completions, score with the evaluation harness. Users see only the current system. You build confidence in the new one without risk.
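
The mirroring in Phase A can be sketched as a thin wrapper around both stacks. The function and field names are hypothetical; the important property is that the user only ever receives the current system's answer, and that a failure in the candidate stack is logged rather than surfaced.

```python
from typing import Callable

def shadow_call(prompt: str,
                current_fn: Callable[[str], str],
                candidate_fn: Callable[[str], str],
                log: list[dict]) -> str:
    """Send the prompt to both stacks, log both completions, return only the current one."""
    current = current_fn(prompt)
    try:
        candidate, error = candidate_fn(prompt), None
    except Exception as exc:  # candidate failures must never reach the user
        candidate, error = None, repr(exc)
    log.append({"prompt": prompt, "current": current,
                "candidate": candidate, "error": error})
    return current
```

The logged pairs are what the evaluation harness scores offline. In a live system the candidate call would normally run asynchronously so it cannot add latency to the user-facing path.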

Phase B: Cutover by mode (4-8 weeks). Pick the lowest-risk mode — usually document summarisation or internal Q&A — and route real users to the on-prem stack for that mode only. Keep shadow mode on the rest. Watch eval scores, P99 latency, GPU utilisation and human feedback. Move the next mode when the first stabilises.

Phase C: Full pilot cutover (4 weeks). One business unit moves fully to the on-prem stack across all modes. Sensitive workloads were already on-prem; non-sensitive ones can stay on a public-cloud LLM API under clear routing rules if that is the hybrid posture. The pilot unit owns the change-management story for the rest of the estate.

Phase D: Estate rollout (8-20 weeks). Other business units onboard with their own modes, integrations and evaluation sets. Central platform team owns the inference cluster, audit log and model catalogue. Local teams own their modes. The 90-day exit window is unaffected throughout.

Implementation playbook

  1. Discovery (2-4 weeks, fixed-fee). Use-case shortlist, hardware sizing, RAG architecture, compliance mapping (GDPR, PDPL, HIPAA, NIS2, NCA-ECC, ISO 27001 as relevant), eval plan, published Build price.
  2. Build (8-16 weeks). Inference cluster up, models pinned, RAG pipeline wired to your corpus, modes implemented, audit log live, eval harness on every commit.
  3. Integrate (3-5 weeks). Wire into CRM, ITSM, EHR, knowledge base, IAM and SIEM. Internal-only paths, signed tokens, per-mode scopes. Most estates land 3-6 integrations.
  4. Pilot + Go-Live (4 weeks). Shadow → mode-by-mode cutover → first unit live, with the eval harness gating each change.
  5. Operate. Quarterly model refresh, monthly prompt review, weekly eval report. Hardware refresh on 4-year amortisation. Care Plan banded against estate size and SLA.

Frequently asked questions

Do I really need on-premises hardware, or can a private cloud tenancy be enough?

It depends on what "private" means in the contract. A single-tenant, network-isolated, audited tenancy that contractually never trains shared models can satisfy many residency regimes — but not air-gap. For the most sensitive workloads (PHI under HIPAA, KYC documents, defence material) the answer is operator hardware behind the operator's perimeter. For sensitive-but-not-air-gapped workloads, a true sovereign-cloud tenancy can work; read the data-processing addendum and the support-access clauses carefully.

Which open-weight model should I pick for production?

Do not pick once. Usually two or three: a 7B-13B dense model (Llama 3.x 8B, Mistral 7B Instruct, Qwen 2.5 7B) for latency-sensitive interactive paths; a Mixture-of-Experts model (Mixtral 8x22B, DeepSeek V3) for capability-per-token on harder tasks; and possibly a 70B dense model (Llama 3.x 70B) for jobs that genuinely need it. Switch per mode based on the evaluation harness, not enthusiasm.

How much VRAM does a 70B model need in production?

Weights alone in fp16 are about 140GB. Add KV cache (often 20-60GB at meaningful concurrency), activation memory and headroom. Practical answer: two H100 80GB GPUs with tensor parallelism for low-concurrency interactive serving, four for higher throughput. fp8 or int4 quantisation can roughly halve or quarter weight memory, but re-run your eval harness on the quantised model — quality loss varies by task.
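
The arithmetic behind that answer is short enough to check directly. The sketch below assumes 2, 1 and 0.5 bytes per parameter for fp16, fp8 and int4 respectively, and treats KV cache plus overhead as a single number you supply; real quantised formats carry some extra metadata, so treat the results as lower bounds.

```python
import math

# Assumed storage cost per parameter for each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

def min_gpus(params_billion: float, precision: str,
             kv_and_overhead_gb: float, gpu_gb: float = 80) -> int:
    """Smallest GPU count whose combined memory holds weights plus KV cache and overhead."""
    needed = weight_gb(params_billion, precision) + kv_and_overhead_gb
    return math.ceil(needed / gpu_gb)

low_concurrency = min_gpus(70, "fp16", kv_and_overhead_gb=20)   # 160GB across 80GB cards
high_concurrency = min_gpus(70, "fp16", kv_and_overhead_gb=60)  # 200GB across 80GB cards
```

The second case works out to three cards by capacity alone. Tensor-parallel deployments usually use a GPU count that divides the model's attention heads evenly, which is one reason a sizing like this tends to round up to four.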

Should I use vLLM, Ollama or TGI?

vLLM is the default for serious multi-user serving — paged attention, continuous batching, mature throughput. TGI is comparable and fits where the surrounding Hugging Face stack is already in use. Ollama is excellent for developer ergonomics and small edge boxes; it is rarely the right answer for a multi-tenant production cluster. Most estates run vLLM for the main cluster and Ollama on developer machines for parity.

What does air-gapped deployment actually require?

No external network calls during inference, model weights signed and verified at load time, a documented provenance manifest, signed-bundle updates staged internally rather than pulled from a public hub, an internal package mirror for dependencies, and an eval harness running entirely inside the perimeter. Most failures are operational habits (a developer running pip install from an internet-connected jump host) that need to be designed out.
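
The "verified at load time" step can be sketched with the standard library alone. The example below checks file integrity against a manifest of SHA-256 digests; the manifest format (relative path to hex digest) is an assumption. It does not verify a signature on the manifest itself, which would need a cryptography library and a key-management procedure agreed with your security team.

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weight shards do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_bundle(root: str, manifest: dict[str, str]) -> list[str]:
    """Return the manifest entries that are missing or whose digest does not match."""
    failures = []
    for relative_path, expected in manifest.items():
        candidate = Path(root) / relative_path
        if not candidate.is_file() or sha256_file(candidate) != expected:
            failures.append(relative_path)
    return failures
```

A loader would refuse to start the inference server unless `verify_bundle` returns an empty list.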

How do I make RAG retrieval actually trustworthy?

Four ingredients matter and most teams skip three. First, chunking strategy tuned to your corpus (legal text and clinical notes want different chunkers). Second, a reranker — semantic search alone rarely orders results well enough. Third, citation surface — the user sees which chunks fed the answer. Fourth, scoped retrieval — actor permissions constrain the chunk universe before retrieval, never after.
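
The fourth ingredient, filtering before retrieval rather than after, is sketched below together with a simple citation surface. The `Chunk` type, the role names and the pre-computed `score` field are hypothetical; in practice the permission filter would be pushed into the vector-store query itself, and a reranker would replace the plain sort.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset[str]
    score: float  # similarity score from an assumed upstream search step

def scoped_retrieve(chunks: list[Chunk], actor_roles: set[str], top_k: int = 3) -> list[Chunk]:
    """Restrict the chunk universe to what the actor may see, then rank."""
    visible = [c for c in chunks if c.allowed_roles & actor_roles]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]

def citations(chunks: list[Chunk]) -> list[str]:
    """Document ids shown to the user alongside the answer."""
    return [c.doc_id for c in chunks]

# Hypothetical corpus: the highest-scoring chunk is not visible to a nurse.
corpus = [
    Chunk("hr-001", "salary bands", frozenset({"hr"}), 0.97),
    Chunk("clin-014", "dosage guidance", frozenset({"nurse", "doctor"}), 0.91),
    Chunk("clin-020", "discharge checklist", frozenset({"nurse"}), 0.74),
]
nurse_view = citations(scoped_retrieve(corpus, {"nurse"}))
```

Filtering first means a restricted chunk can never occupy a top-k slot, so it cannot leak through the prompt or push a permitted chunk out of the results.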

How does on-premises AI handle Arabic and other right-to-left content?

Llama 3.x, Mistral, Mixtral and Qwen 2.5 all produce useful Arabic with appropriate instruction prompts and, where quality matters, light fine-tuning on domain corpora. Tokenisation efficiency varies — Qwen handles Arabic tokens more compactly than older Llama tokenisers, which affects throughput. The UI layer needs full RTL out of the box, which the bilingual Arabic language model production baseline ships with. Other locales are added per engagement.

How do I evaluate AI vendors without becoming a research lab?

Write an evaluation set of 100-200 questions that represent your real workload, with scored target answers. Give it to every shortlisted vendor and ask them to run it through their stack and return scored outputs plus the configuration used. Good vendors do this in days; the ones who deflect or substitute their own benchmark are telling you something.

How do you handle prompt and model rollback safely?

Prompts live in git with reviewable diffs. Model weights are version-pinned with checksums. Every change ships as an atomic versioned artefact that the admin console reverts in seconds. The eval harness gates promotion from staging; the same harness runs nightly in production to catch silent regressions. Rollback is a one-click operation, not an incident.

What do I get to keep at the end of the engagement?

Under a standard fixed-fee phased engagement, the operator owns the repository, weights on disk, prompts, eval harness, audit log, deploy keys and runbooks. The 90-day exit window means a competent in-house team can take over with full documentation. The wider posture is laid out in the visitor management compliance buyer's guide.

Where Zeour fits

Zeour Ltd designs and ships sovereign on-prem AI as part of digital transformation consultation and enterprise development services, with the AI Clinical Assistant inside the MediCare clinic management system as the reference production deployment for the mode-based pattern. The same architecture underpins engagements across banking, government, oil and gas and telecom. Book a demo, browse our pricing bands, or read the wave-1 guides on queue management, virtual queuing, self-service kiosks and digital signage. Reference deployments live in our case studies; technical vocabulary in the glossary.

---

Last updated: May 17, 2026 — by the Zeour engineering team.

Written by

Zeour Engineering

The same engineers and consultants who ship Zeour’s 12 production solutions. We write about what we actually build and deploy — no vendor-fluff.
