Quantization Explained: Why the Same LLM Gives Better Results on High-End Hardware
Choosing an LLM means choosing a quantization, not just a model. Here's what you need to know.

If you’ve ever wondered why the same LLM feels sharper on a high-end desktop GPU than on your laptop, the answer usually comes down to quantization. It’s the behind-the-scenes process that makes local AI feasible at all. When you run a large language model on a PC or a high-end system like the NVIDIA DGX Spark, quantization is, by design, trading away some numerical precision in the model's weights so the model fits in memory and runs faster.
Think of it a bit like a JPEG image. JPEGs use lossy compression: The more you compress the image, the blurrier it gets and the more fine detail you lose. But, just as we use JPEG files for our photos because they look great at reasonable quality settings, common quantization techniques can often deliver near-indistinguishable LLM performance with far less resource usage.
This isn't just about local AI, either: Cloud providers are running quantized models, too. You're rarely using a full-precision LLM, even if a company is running it on a supercomputer in a data center. And that's okay.
Quantization is a way of running the full model with reduced precision. This makes it different from other ways to run smaller models on your local hardware, like specialized models trained with fewer parameters and distills designed to imitate larger models. So let's talk about parameter counts and distills first — and explain why quantization is different.
Parameter counts: different models with the same name
Large language model names can be misleading. Take OpenAI's gpt-oss-20b and gpt-oss-120b models, which are open-weight versions of its GPT models that you can run on your own machine.
They share an architecture, but they're very different models. The 20 billion parameter model can run on a GPU with 16 GB of memory. The 120 billion parameter model needs around 80 GB of memory. You'll need a machine more like a DGX Spark, which has 128 GB of memory, to run it.
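Those memory figures follow from simple arithmetic: a model's weight footprint is roughly its parameter count times the bytes used per weight. The helper below is a back-of-the-envelope sketch, not a benchmark — real memory use adds the KV cache, activations, and runtime overhead on top:

```python
def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough footprint of the weights alone, in GiB (ignores KV cache and overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

# gpt-oss-120b stored at 16-bit vs. a 4-bit quantization of the same model
print(round(model_memory_gb(120, 16), 1))  # 223.5 -- far beyond any single GPU
print(round(model_memory_gb(120, 4), 1))   # 55.9 -- fits in a 128 GB DGX Spark
```

The gap between those two numbers is exactly why quantization matters: the same 120 billion parameters can demand wildly different amounts of memory depending on precision.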
Image: Chris Hoffman
Distills: smaller models imitating larger ones
Distills can also muddy the waters. Take the open-source DeepSeek R1 reasoning model, for example. There was a lot of coverage about how you could "run DeepSeek on a laptop" after the model was released in early 2025. And Ollama offers multiple models named DeepSeek R1 you can download and run on your local hardware.
These smaller models — named things like "deepseek-r1:8b" on Ollama — are built on entirely different architectures. They're actually distills. (The full name of deepseek-r1:8b is "DeepSeek-R1-0528-Qwen3-8B.")
To create a distill, a different model is fine-tuned on outputs from a larger model. So, in this example, this 8 billion parameter model doesn't have the same architecture as DeepSeek R1 at all. It's a smaller Qwen open-source model that's been trained on DeepSeek R1's outputs to imitate it.
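In code terms, distillation is just supervised fine-tuning where the training data comes from the bigger model. Here's a toy sketch — the function names are made up for illustration, and a real distill uses an actual LLM training pipeline:

```python
prompts = ["What is quantization?", "Explain distillation."]

def teacher_generate(prompt: str) -> str:
    # Stand-in for the big teacher model (e.g. the 671B-parameter DeepSeek R1)
    return f"Teacher's answer to: {prompt}"

# Step 1: collect the teacher's outputs as a training set.
distill_dataset = [(p, teacher_generate(p)) for p in prompts]

# Step 2: fine-tune the smaller student (e.g. Qwen3-8B) on those pairs.
# fine_tune(student_model, distill_dataset)  # hypothetical training call
print(len(distill_dataset))  # 2 (prompt, response) pairs ready for fine-tuning
```

The student never sees the teacher's weights, only its answers — which is why a distill can sound like the big model without sharing its architecture.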
That may be useful, but it's nothing like running the full 671 billion parameter DeepSeek R1 model. You'd need a lot more memory to run that huge model.
But you'll need less memory than you think. That's because, even if you decide to run a massive full model like DeepSeek R1 — the 671 billion parameter model and not a distill — you won’t be running it at full precision. You'll be running a quantized version. And the same is true for other LLMs.
Image: Chris Hoffman
Quantization: the same model with lower precision
A quantized model is the exact same model — with the same architecture and the same number of parameters — stored and run at a lower numerical precision.
Think of it like the model's weights being stored at a lower resolution. The original model created from training may have weights stored as 16-bit floating point numbers. A quantized model will use fewer bits — 8-bit integers, 4-bit integers, or sometimes even less.
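Here's the core idea in miniature: weight-only symmetric quantization, sketched in plain Python rather than any specific library's implementation. Each weight becomes a small integer plus one shared scale factor, and the rounding error is the "loss" in this lossy compression:

```python
weights = [0.82, -1.47, 0.03, 2.11, -0.56]   # pretend these are FP16 weights

scale = max(abs(w) for w in weights) / 127   # one shared scale for the tensor
q8 = [round(w / scale) for w in weights]     # stored as one byte per weight

dequantized = [q * scale for q in q8]        # reconstructed at inference time
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))

print(q8)                      # [49, -88, 2, 127, -34]
print(max_error <= scale / 2)  # True: error is bounded by half a quantization step
```

A 4-bit scheme works the same way but with integers from -7 to 7, so each quantization step is coarser and the rounding error grows — which is where quality loss starts to creep in.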
Quantized models aren't just smaller, though — they're faster to run. Modern AI hardware is built for low-precision math: the Tensor Cores on NVIDIA's consumer GPUs and in the Blackwell chip inside the DGX Spark are designed to accelerate calculations on 8-bit and 4-bit integers, not high-precision 16-bit floating point math.
Going from a 16-bit model to an 8-bit quantized model dramatically improves inference performance and normally has little effect on output quality. In fact, cloud providers rarely serve full-precision FP16/BF16 models. They generally serve models with FP8 or INT8 weight-only quantization, even on their powerful supercomputers.
However, going to 4-bit or below can decrease output quality. The model's output may lose some nuance and creativity, and reasoning performance may be worse. It depends on the model, the type of quantization, and what you're asking the model for.
Quantization jargon explained: GGUF, Q4_K_M, and more
When you download a model to run on your local hardware, you can choose different quantizations. Ollama doesn't provide any visibility into this — it hands you a model that may or may not be a distill, and who knows how it's quantized — but if you're using something like LM Studio to download a model from Hugging Face, you will see different options when you download a model.
LM Studio presents these as "download options" with different sizes. Under the hood, each is a GGUF file that packages the already-quantized model weights, essential metadata, and everything the software needs to load and run the model correctly.
The "Q" stands for quantization, and the number after it stands for the number of bits. In other words, the "Q8" files are 8-bit quantizations, the "Q4" models are 4-bit quantizations, and so on. If you can only run a 2-bit "Q2" quantization on your hardware, that's an extremely lossy model that often degrades reasoning performance.
There are different approaches to quantization, and the jargon makes it into the name as well. For example, "Q4_K_M" is a 4-bit quantization that uses a newer quantization method than the older "Q4_0." The specifics are arcane technical details, but the "Q4" is what matters most when you're choosing a model.
Image: Chris Hoffman
How to think about quantization
If you're putting together a local AI rig with a consumer GPU, you'll often end up running 4-bit quantized models — Q4_K_M is a solid, balanced choice for many models. You'll almost certainly be better off running a larger model at 4-bit quantization instead of searching for a smaller model with fewer parameters just to fit into memory.
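That advice can be turned into a rough selection rule. The helper below is a hypothetical rule of thumb, not a real tool: it approximates one billion parameters at one gigabyte per byte-per-weight and reserves 20 percent of memory for the KV cache and runtime:

```python
def best_fit_quant(params_billion: float, vram_gb: float, headroom: float = 0.8):
    """Pick the highest-precision quantization whose weights fit in memory."""
    for bits in (8, 6, 5, 4, 3, 2):               # prefer higher precision
        weights_gb = params_billion * bits / 8    # rough: 1B params ~ 1 GB per byte/weight
        if weights_gb <= vram_gb * headroom:
            return f"Q{bits}"
    return None  # even Q2 won't fit; look for a smaller model instead

print(best_fit_quant(8, 16))   # Q8: an 8B model fits comfortably at 8-bit on 16 GB
print(best_fit_quant(70, 16))  # None: a 70B model won't fit on 16 GB at any precision
```

Treat the output as a starting point for experimentation, not a guarantee — context length and the specific runtime change the real numbers.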
How much quantization actually matters depends on the model and what you're using it for. In some cases, a 4-bit model will be nearly as good as an 8-bit model. In others, the 8-bit model will be far ahead.
On a consumer GPU, you might only be able to run an aggressively quantized 2-bit version of a big model. With higher-end hardware like an NVIDIA DGX Spark, you may be able to run an 8-bit version of that same model. The difference in output quality can be night and day, even though it's "the same model."
And remember: Even if you aren't running AI models locally, quantized models are running in data centers. Cloud providers are serving 8-bit or perhaps even 4-bit quantized models. That's one reason why different AI compute providers can give you outputs of varying quality when they're serving you the same open-source model.
If you run the models locally, you know what you're getting. You can choose the right model and the right quantization for your needs. The better your hardware, the less you have to depend on low-bit quantizations.
Chris Hoffman is a veteran tech journalist and the former Editor-in-Chief of How-To Geek. He's been going hands-on with PC hardware and getting his hands dirty with Windows for 15 years, and he has written for publications including PCWorld, PCMag, Computerworld, The New York Times, Fast Company, and Reader's Digest.
