Skip to content
Live12+ production solutions40+ clients deployeddirect + partner
Glossary · AI & Models

What is Quantisation?

Compressing LLM weights from 16-bit floats to 8-bit / 4-bit integers — runs the same model on smaller GPUs at a small accuracy cost.

Also known as

model quantizationawqgptqfp8int8 quantisationint4 quantisation
Definition

Quantisation — explained.

Quantisation compresses an LLM's weights from their native 16-bit floating-point representation down to 8-bit integers (INT8), 4-bit integers (INT4), or 8-bit floats (FP8). The result is a model that takes 2-4× less GPU memory and runs 1.5-3× faster, at a small (usually 1-3 percentage points on standard benchmarks) accuracy cost. Three families dominate: GPTQ (post-training quantisation with per-channel calibration), AWQ (Activation-aware Weight Quantisation, often better for instruction-following), and the more recent FP8 formats supported natively on H100 / H200 GPUs. The practical impact: a 70B-parameter model that needs roughly 140GB in FP16 fits comfortably on a single 80GB H100 at INT4. For on-prem AI deployments quantisation is usually mandatory — the difference between needing 4 GPUs vs. needing 1 is the difference between a feasible deployment and a hardware ask the operator won't approve. vLLM, TGI, Ollama, and TensorRT-LLM all support multiple quantisation formats out of the box.

Solutions where quantisation applies

Zeour solutions that operate on this layer.

DT Consultation

digital · transformation · consultation

Zeour Digital Transformation Consultation helps companies digitalise their services and operations through three pillars: process automation (workflow engines, RPA, integration platforms that retire repetitive manual work), self-service technologies (customer + employee portals, kiosks, mobile apps, WhatsApp / SMS / IVR channels), and sovereign on-premises AI (open-weight large language models, vision models, voice models, RAG pipelines, and AI-augmented workflows that run entirely on the operator's own hardware — patient data, customer data, and classified material never leave the perimeter). The service stack is the full path from problem to outcome: consulting (digital-maturity assessment, transformation roadmap, business-case modelling, vendor selection), implementation (the build itself, often delivered in partnership with our Enterprise Development team), AI model deployment (open-weight LLMs, fine-tuning, embedding pipelines, on-prem inference infrastructure, GPU sizing), customisation (tailoring deployed AI and automation to your specific operations — prompts, RAG corpora, workflow templates), and training (role-based curricula for executives, operators, and end users, with operations playbooks, runbooks, and train-the-trainer programmes that make your team self-sufficient). The same team that ships our production AI assistant in MediCare (7-mode OpenAI Responses API, evidence-based prompts, audit-logged interactions) is what you engage.

See the solution

Enterprise Dev

enterprise · development · services

Zeour Enterprise Development — we design, build, and operate corporate-grade software for organizations that take their software seriously. Custom web platforms, mobile apps, kiosk fleets, embedded/hardware-coupled systems, real-time services, AI-augmented workflows, system integrations (CRM / ERP / HRIS / payment gateways / BI / national health systems / lab analyzers / payment terminals / card readers / GPIO barriers), legacy modernization, cloud migration, on-premise deployments, DevOps + CI/CD, security hardening, and 24/7 support. Every other solution on this site — MediCare Clinic Management, Smart Parking, GLARUS Queue Management, Wayfinding, Digital Signage, Visitor Management, Online Appointment, Self-Service Kiosks, Customer Feedback — is something our team designed, built, and operates today. The same team is available for your bespoke engagement.

See the solution
Related terms

Adjacent definitions to read next.

Want to discuss quantisation for your operation?

Talk to a Zeour engineer.

A 30-minute scoping call to walk your operational profile against where quantisation actually sits in your stack, then a fixed-fee Discovery price by the end of the call.