What is Quantisation?
Compressing LLM weights from 16-bit floats to 8-bit or 4-bit integers, so the same model runs on smaller GPUs at a small accuracy cost.
Quantisation — explained.
Quantisation compresses an LLM's weights from their native 16-bit floating-point representation (FP16 or BF16) down to 8-bit integers (INT8), 4-bit integers (INT4), or 8-bit floats (FP8). Weight memory shrinks roughly 2× at 8 bits and 4× at 4 bits, and inference typically runs 1.5-3× faster, at a small accuracy cost (usually 1-3 percentage points on standard benchmarks). Three families dominate: GPTQ (post-training quantisation calibrated layer by layer on a small sample dataset), AWQ (Activation-aware Weight Quantisation, often better for instruction-following), and the more recent FP8 formats supported natively on H100 / H200 GPUs. The practical impact: a 70B-parameter model needs roughly 140GB for its weights in FP16 (70 billion parameters × 2 bytes) but only about 35GB at INT4, so it fits comfortably on a single 80GB H100. For on-prem AI deployments quantisation is usually mandatory: the difference between needing 4 GPUs and needing 1 is the difference between a feasible deployment and a hardware ask the operator won't approve. vLLM, TGI, Ollama, and TensorRT-LLM all support multiple quantisation formats out of the box.
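The memory arithmetic and the basic idea of mapping floats to small integers can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the memory estimate covers weights alone (no KV cache, activations, or per-group scale overhead), and the quantiser is plain round-to-nearest with a single shared scale, which is the simple baseline that GPTQ and AWQ improve on, not an implementation of either.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantise_symmetric(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Round-to-nearest symmetric quantisation with one scale for the whole list.

    Assumption: a single shared scale. Real formats usually keep one scale
    per channel or per small group of weights to limit the error.
    """
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    # Divide by the scale, round to the nearest integer, clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q: list[int], scale: float) -> list[float]:
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]
```

Under these assumptions, `weight_memory_gb(70e9, 16)` gives about 140 and `weight_memory_gb(70e9, 4)` about 35, matching the figures above. Dequantising the output of `quantise_symmetric` recovers each weight to within half a scale step, which is where the small accuracy cost comes from.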
Adjacent definitions to read next.
Open-Weight LLM
AI & Models: A large language model whose trained parameters (weights) are published openly, runnable on the operator's own hardware without API dependency.
vLLM
AI & Models: A high-throughput LLM inference server using paged-attention memory management; the typical production runtime for self-hosted open-weight models.
On-Premises AI
AI & Models: Open-weight large language models running on the operator's own hardware, so no prompt, completion, or embedding ever leaves the perimeter.
Arabic Language Model
AI & Models: An open-weight or fine-tuned LLM that handles Modern Standard Arabic and major dialects with appropriate tokenisation efficiency and right-to-left rendering at the application layer.
Context Window
AI & Models: The maximum amount of text an LLM can process in a single request, measured in tokens; it caps how much document context can be fed for RAG and long-form analysis.
Embeddings
AI & Models: Numerical vector representations of text (or images, or audio) where semantically similar inputs land in similar regions of vector space; the substrate of semantic search and RAG.
Fine-Tuning
AI & Models: Adapting a pre-trained LLM to your domain or task by continuing its training on a small, high-quality dataset, typically via LoRA or full SFT.
Large Language Model
AI & Models: A neural network trained on internet-scale text that produces fluent generative output and powers most of what people call "AI" in 2026, including on-premises sovereign deployments.
Talk to a Zeour engineer.
A 30-minute scoping call to walk your operational profile against where quantisation actually sits in your stack, then a fixed-fee Discovery price by the end of the call.