The most common AI conversation we have in 2026 with private-sector enterprise buyers is not "should we use AI" — that conversation closed sometime in 2024 — but rather "should we keep paying per-token to a third-party API forever, or build our own self-hosted, fine-tuned AI infrastructure layer that every department can tune for their own work". By the time a company is spending six figures a year on API tokens, the economics start telling them the same thing the compliance team has been saying for two years: bring it in-house. Here is what an honest self-hosted, fine-tuned AI stack actually looks like inside a private-sector enterprise, and how it lands across departments.
Why self-hosted, not API
Three things tilt the calculus toward self-hosted in 2026. The cost crossover. Once your company is doing more than roughly £8,000–£15,000 a month in API spend across all departments, the run-cost of a dedicated GPU rack with open-weight LLMs is cheaper than the API bill, with the GPU asset amortising to zero over 3–4 years. The data posture. Every prompt your customer-support team sends to a third-party API includes customer data — names, account details, complaint contents. Every finance reconciliation prompt includes ledger data. Every HR onboarding prompt includes employee data. Self-hosted keeps all of it inside the company perimeter, which is what your DPO has been asking for. And the customisation ceiling. You cannot fine-tune a closed API model on your company's data; you can fine-tune an open-weight model. That difference compounds quarter over quarter.
What does not tilt the calculus: the marketing pitch that self-hosted is "more secure" in the abstract. Self-hosted is more secure for the data it touches, full stop. It is not magically more secure for everything else. The security gain is specific.
What "trained" actually means in 2026
Most enterprises hear "fine-tuned AI" and picture months of GPU time on a research cluster. In 2026, the reality is much smaller. The pattern that works for a private-sector company across most departments is LoRA-based fine-tuning on a 7B–14B parameter open-weight model (Llama 3.1, Mistral, Qwen) on the company's domain-specific data — typically a few thousand to a few tens of thousands of curated examples per department. The training run takes hours to a day on a single GPU. The resulting LoRA adapter is small (tens to hundreds of MB), can be hot-swapped at inference time, and means one shared base model can serve five or ten differently-tuned departmental adapters from the same physical GPU.
Layer two: retrieval-augmented generation (RAG) against the company's knowledge base. Customer-support adapter retrieves from the support knowledge base, prior tickets, and product documentation. Finance adapter retrieves from the chart of accounts, prior reconciliation logs, and policy documents. HR adapter retrieves from the employee handbook, benefits documentation, and approved Q&A. The adapter handles the style and structure; RAG handles the freshness and specificity.
The combined LoRA + RAG approach is what is actually shipping in 2026. Pure fine-tuning was never the right answer for fast-moving data; pure RAG never gave departments the voice they wanted; the combination does both.
How it lands across departments
A self-hosted AI infrastructure layer is not one AI assistant for the whole company. It is shared infrastructure with departmental adapters and departmental RAG indexes. The departments that get the most consistent value in private-sector enterprises in 2026:
Customer support. First-line ticket triage, suggested-reply drafting, knowledge-base retrieval, sentiment scoring, escalation suggestion. The adapter is tuned on the company's tone-of-voice and historical resolved tickets; the RAG index covers the live knowledge base and the prior ticket archive. The measurable wins are usually a 30–50% reduction in first-response time and a 15–25% reduction in tickets escalated to L2.
Sales enablement. Prospect research summarisation, proposal-draft generation against the company's templates, RFP-response drafting, call-prep notes from CRM context. The adapter is tuned on the company's winning proposals and product positioning; the RAG index covers the product catalogue, the case-study library, and prior proposals. The measurable wins are usually a 40–60% reduction in time-to-first-draft on proposals and a noticeable improvement in win rate on competitive deals.
Human resources. Onboarding Q&A for new employees, benefits-question routing, policy-document interpretation, performance-review draft assistance, internal-mobility candidate matching. The adapter is tuned on the company's HR policies and tone; the RAG index covers the employee handbook, benefits documentation, and approved Q&A library. The measurable wins are usually a 50–70% reduction in HR ticket volume for routine questions.
Finance. Reconciliation assistance, invoice-coding suggestion, expense-policy interpretation, audit-question answering, FP&A narrative drafting from data. The adapter is tuned on the company's chart of accounts and reporting conventions; the RAG index covers the policy library and historical reconciliation logs. The measurable wins are usually a 25–40% reduction in time spent on routine reconciliation and a meaningful improvement in audit-trail completeness.
Engineering and product. Code-review assistance, internal documentation Q&A, ticket-to-spec drafting, postmortem first-draft generation, on-call runbook retrieval. The adapter is tuned on the company's codebase patterns and engineering tone; the RAG index covers the codebase, the docs site, the postmortem archive, and the runbook library. Engineering teams typically see this as the highest-leverage adapter in the company because it accelerates the team that builds everything else.
Operations. SOP retrieval, incident triage, supplier-document analysis, contract-clause extraction, compliance-question answering. The adapter is tuned on the company's operational patterns; the RAG index covers SOPs, supplier contracts, and the compliance library.
The point is not that any one of these is revolutionary. The point is that one shared infrastructure layer — one GPU rack, one base model, one RAG architecture, one operations team — serves six to ten differently-tuned departments. The unit cost per department drops as you add departments, which is exactly the opposite of the per-token API model.
The infrastructure envelope
A reference 2026 self-hosted AI stack for a 200–800 person private-sector enterprise looks roughly like this. Compute: one or two L40S GPUs (48 GB VRAM each) for inference, plus an additional GPU node for fine-tuning runs that can be the same hardware on a scheduled basis. Inference serving: vLLM as the default for production load, Ollama for individual developer environments. Vector store: pgvector or Qdrant, hosted inside the company perimeter. Embeddings: BGE-M3 or E5 for general use, with sector-specific embeddings (BioBERT for healthcare-adjacent, CodeBERT for engineering) where they earn their place. Adapter store: a simple model-registry pattern (MLflow or a Git LFS repo) that holds the LoRA adapters per department. Audit: append-only logging on every prompt and completion, with role-based access for the audit team.
Total hardware budget for the envelope above: well inside what most 200–800 person enterprises spend on a single year of mid-tier SaaS licensing. The asset amortises over 3–4 years and the running cost is power, cooling, and one engineer's attention.
The change-management envelope
The infrastructure is the easy half. The half that actually determines whether the deployment lands is change management. The pattern that works: a single executive sponsor (usually CTO, COO, or a designated AI lead) who owns the cross-departmental adoption. A small AI platform team (3–5 engineers) who own the shared infrastructure, the base models, the embeddings, and the adapter pipeline. A per-department AI champion (not a new hire — an existing senior person in each department) who owns the adapter, the RAG index, and the prompt patterns for that department. And a quarterly AI governance review where the adapters are audited, the RAG indexes are refreshed, and the prompt patterns are reviewed for drift.
Skip the executive sponsor and adoption stalls in two quarters. Skip the per-department champion and the adapter ages out within six months. Skip the governance review and the company quietly drifts back to using a public API for things it should not be sending to one.
The deployment shape
A typical self-hosted, fine-tuned AI deployment for a 200–800 person private-sector enterprise runs 16–24 weeks from kickoff. Discovery (4 weeks, fixed-fee) covers workshops with the executive sponsor, the AI platform team, and the first two or three departmental champions. The output is an adapter-and-RAG architecture, a hardware spec, a fine-tuning data plan, and a phased rollout plan. Build (8–12 weeks, milestone-fixed) stands up the infrastructure, fine-tunes the first three departmental adapters, and wires up the RAG indexes. Pilot (3–4 weeks) runs the first three departments live with the engineers on-call. Operate (ongoing) handles new departmental adapters as the company adds them — typically one per quarter for the first year, then steady state.
We have shipped variants of this engagement in the UK, the EU, and the GCC in 2026, across healthcare, banking, retail, hospitality, manufacturing, and professional services. The adapters differ. The infrastructure pattern does not.
If your company has crossed the cost-crossover or the compliance-pressure threshold for self-hosted AI and you want a no-pitch scoping call on what the right adapters and the right hardware envelope are for your operation — that is what the first conversation is for.
