Quantization Explained: Why the Same LLM Gives Better Results on High-End Hardware
Choosing an LLM means choosing a quantization, not just a model. Here's what you need to know.

If you’ve ever wondered why the same LLM feels sharper on a high-end desktop GPU than on your laptop, the answer usually comes down to quantization. It’s the behind-the-scenes process that makes local AI feasible at all. When you run a large language model on a PC or a high-end system like the NVIDIA DGX Spark, quantization is, by design, trading away some numerical precision in the model's weights so the model fits in memory and runs faster.
Think of it a bit like a JPEG image. JPEGs use lossy compression: The more you compress the image, the blurrier it gets and the more fine detail you lose. But, just as we use JPEG files for our photos because they look great at reasonable quality settings, common quantization techniques can often deliver near-indistinguishable LLM performance with far less resource usage.
This isn't just about local AI, either: Cloud providers are running quantized models, too. You're rarely using a full-precision LLM, even if a company is running it on a supercomputer in a data center. And that's okay.
Quantization is a way of running the full model with reduced precision. This makes it different from other ways to run smaller models on your local hardware, like specialized models trained with fewer parameters and distills designed to imitate larger models. So let's talk about parameter counts and distills first — and explain why quantization is different.
Parameter counts: different models with the same name
Large language model names can be misleading. Take OpenAI's gpt-oss-20b and gpt-oss-120b models, which are open-weight versions of its GPT models that you can run on your own machine.
They share an architecture, but they're very different models. The 20 billion parameter model can run on a GPU with 16 GB of memory. The 120 billion parameter model needs around 80 GB of memory. You'll need a machine more like a DGX Spark, which has 128 GB of memory, to run it.
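Those memory figures follow from simple arithmetic: a model's weight footprint is roughly its parameter count times the bytes used per weight. The helper below is a back-of-the-envelope sketch, not a benchmark — real memory use adds the KV cache, activations, and runtime overhead on top:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough footprint of the weights alone, in GiB (ignores KV cache and overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# gpt-oss-120b stored at 16-bit vs. a 4-bit quantization of the same model
print(round(model_memory_gb(120, 16), 1))  # 223.5 -- far beyond any single GPU
print(round(model_memory_gb(120, 4), 1))   # 55.9 -- fits in a 128 GB DGX Spark
```

The gap between those two numbers is exactly why quantization matters: the same 120 billion parameters can demand wildly different amounts of memory depending on precision.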
Image: Chris Hoffman
Distills: smaller models imitating larger ones
Distills can also muddy the waters. Take the open-source DeepSeek R1 reasoning model, for example. There was a lot of coverage about how you could "run DeepSeek on a laptop" after the model was released in early 2025. And Ollama offers multiple models named DeepSeek R1 you can download and run on your local hardware.
These smaller models — named things like "deepseek-r1:8b" on Ollama — are built on entirely different architectures. They're actually distills. (The full name of deepseek-r1:8b is "DeepSeek-R1-0528-Qwen3-8B.")
To create a distill, a different model is fine-tuned on outputs from a larger model. So, in this example, this 8 billion parameter model doesn't have the same architecture as DeepSeek R1 at all. It's a smaller Qwen open-source model that's been trained on DeepSeek R1's outputs to imitate it.
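In code terms, distillation is just supervised fine-tuning where the training data comes from the bigger model. Here's a toy sketch — the function names are made up for illustration, and a real distill uses an actual LLM training pipeline:

```python
prompts = ["What is quantization?", "Explain distillation."]

def teacher_generate(prompt: str) -> str:
    # Stand-in for the big teacher model (e.g. the 671B-parameter DeepSeek R1)
    return f"Teacher's answer to: {prompt}"

# Step 1: collect the teacher's outputs as a training set.
distill_dataset = [(p, teacher_generate(p)) for p in prompts]

# Step 2: fine-tune the smaller student (e.g. Qwen3-8B) on those pairs.
# fine_tune(student_model, distill_dataset)  # hypothetical training call
print(len(distill_dataset))  # 2 (prompt, response) pairs ready for fine-tuning
```

The student never sees the teacher's weights, only its answers — which is why a distill can sound like the big model without sharing its architecture.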
That may be useful, but it's nothing like running the full 671 billion parameter DeepSeek R1 model. You'd need a lot more memory to run that huge model.
But you'll need less memory than you think. That's because, even if you decide to run a massive full model like DeepSeek R1 — the 671 billion parameter model and not a distill — you won’t be running it at full precision. You'll be running a quantized version. And the same is true for other LLMs.
Image: Chris Hoffman
Quantization: the same model with lower precision
A quantized model is the exact same model — with the same architecture and the same number of parameters — stored and run at a lower numerical precision.
Think of it like the model's weights being stored at a lower resolution. The original model created from training may have weights stored as 16-bit floating point numbers. A quantized model will use fewer bits — 8-bit integers, 4-bit integers, or sometimes even less.
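Here's the core idea in miniature: weight-only symmetric quantization, sketched in plain Python rather than any specific library's implementation. Each weight becomes a small integer plus one shared scale factor, and the rounding error is the "loss" in this lossy compression:

```python
weights = [0.82, -1.47, 0.03, 2.11, -0.56]   # pretend these are FP16 weights

scale = max(abs(w) for w in weights) / 127   # one shared scale for the tensor
q8 = [round(w / scale) for w in weights]     # stored as one byte per weight

dequantized = [q * scale for q in q8]        # reconstructed at inference time
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))

print(q8)                      # [49, -88, 2, 127, -34]
print(max_error <= scale / 2)  # True: error is bounded by half a quantization step
```

A 4-bit scheme works the same way but with integers from -7 to 7, so each quantization step is coarser and the rounding error grows — which is where quality loss starts to creep in.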
Quantized models aren't just smaller, though — they're faster to run. Modern AI hardware is built for low-precision math: the Tensor Cores on NVIDIA's consumer GPUs and in the Blackwell chip inside the DGX Spark are designed to accelerate calculations on 8-bit and 4-bit integers, not high-precision 16-bit floating point math.
Going from a 16-bit model to an 8-bit quantized model dramatically improves inference performance and normally has little effect on output quality. In fact, cloud providers rarely serve full-precision FP16/BF16 models. They generally serve models with FP8 or INT8 weight-only quantization, even on their powerful supercomputers.
However, going to 4-bit or below can decrease output quality. The model's output may lose some nuance and creativity, and reasoning performance may be worse. It depends on the model, the type of quantization, and what you're asking the model for.
Quantization jargon explained: GGUF, Q4_K_M, and more
When you download a model to run on your local hardware, you can choose different quantizations. Ollama doesn't provide any visibility into this — it hands you a model that may or may not be a distill, and who knows how it's quantized — but if you're using something like LM Studio to download a model from Hugging Face, you will see different options when you download a model.
LM Studio presents these as "download options" with different sizes. Under the hood, each is a GGUF file that packages the already-quantized model weights, essential metadata, and everything the software needs to load and run the model correctly.
The "Q" stands for quantization, and the number after it stands for the number of bits. In other words, the "Q8" files are 8-bit quantizations, the "Q4" models are 4-bit quantizations, and so on. If you can only run a 2-bit "Q2" quantization on your hardware, that's an extremely lossy model that often degrades reasoning performance.
There are different approaches to quantization, and the jargon makes it into the name as well. For example, "Q4_K_M" is a 4-bit quantization that uses a newer quantization method than the older "Q4_0." The specifics are arcane technical details, but the "Q4" is what matters most when you're choosing a model.
Image: Chris Hoffman
How to think about quantization
If you're putting together a local AI rig with a consumer GPU, you'll often end up running 4-bit quantized models — Q4_K_M is a solid, balanced choice for many models. You'll almost certainly be better off running a larger model at 4-bit quantization instead of searching for a smaller model with fewer parameters just to fit into memory.
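That advice can be turned into a rough selection rule. The helper below is a hypothetical rule of thumb, not a real tool: it approximates one billion parameters at one gigabyte per byte-per-weight and reserves 20 percent of memory for the KV cache and runtime:

```python
def best_fit_quant(params_billion: float, vram_gb: float, headroom: float = 0.8):
    """Pick the highest-precision quantization whose weights fit in memory."""
    for bits in (8, 6, 5, 4, 3, 2):               # prefer higher precision
        weights_gb = params_billion * bits / 8    # rough: 1B params ~ 1 GB per byte/weight
        if weights_gb <= vram_gb * headroom:
            return f"Q{bits}"
    return None  # even Q2 won't fit; look for a smaller model instead

print(best_fit_quant(8, 16))   # Q8: an 8B model fits comfortably at 8-bit on 16 GB
print(best_fit_quant(70, 16))  # None: a 70B model won't fit on 16 GB at any precision
```

Treat the output as a starting point for experimentation, not a guarantee — context length and the specific runtime change the real numbers.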
How much quantization actually matters depends on the model and what you're using it for. In some cases, a 4-bit model will be nearly as good as an 8-bit model. In others, the 8-bit model will be far ahead.
On a consumer GPU, you might only be able to run an aggressively quantized 2-bit version of a big model. With higher-end hardware like an NVIDIA DGX Spark, you may be able to run an 8-bit version of that same model. The difference in output quality can be night and day, even though it's "the same model."
And remember: Even if you aren't running AI models locally, quantized models are running in data centers. Cloud providers are serving 8-bit or perhaps even 4-bit quantized models. That's one reason why different AI compute providers can give you outputs of varying quality when they're serving you the same open-source model.
If you run the models locally, you know what you're getting. You can choose the right model and the right quantization for your needs. The better your hardware, the less you have to depend on low-bit quantizations.
Chris Hoffman is a veteran tech journalist and the former Editor-in-Chief of How-To Geek. He's been going hands-on with PC hardware and getting his hands dirty with Windows for 15 years, and he has written for publications including PCWorld, PCMag, Computerworld, The New York Times, Fast Company, and Reader's Digest.
