Run AI Locally: The Best LLMs for 8GB, 16GB, 32GB Memory and Beyond
How to choose between Qwen, DeepSeek, Gemma and other open-weight models to run on your AI rig right now.
Photo: Dan Ackerman

Large language models (LLMs) small enough to run locally on your home PC or laptop can be incredibly useful tools. The latest models on the smaller side of the spectrum are smart, but also run smoothly on a wide range of computer hardware.
Indeed, I’d argue the biggest problem is not a lack of choice, but the opposite. Literally thousands of models are available for easy download. Most releases come in a variety of model sizes and are often quantized (to shrink their memory footprint and improve performance) or fine-tuned (to target a specific use case).
So, how do you choose the model best for you?
Minimum hardware system requirements
Not surprisingly, the very best examples of open-weight models are usually the largest; GPT-OSS-120B, Qwen3 235B, and DeepSeek V3.1 often top benchmark scores.
But this isn’t useful for most people, as these models consume more memory than what’s available on a consumer PC. The largest LLMs demand specialized GPUs like the Nvidia H100 or, for models at the smaller end of this scale, the Nvidia RTX A6000 -- a GPU with 48GB of GDDR6 memory that can cost more than $5,000 (or the RTX PRO 6000, with 96GB).
Fortunately, smaller models can fit into much smaller memory footprints, and this guide makes recommendations from that perspective. We’ll look at preferred models for PCs with 4GB, 8GB, 16GB, or 32GB of memory -- and beyond.
Keep in mind that your PC will still need memory to run its normal system processes. For example, it’s technically possible to load a 14GB model on a laptop with 16GB of RAM, but it’s likely to cause performance issues for the LLM and everything else on your PC.
This is less of a problem if you’re running your model on a discrete GPU, because the model can be loaded into VRAM, which is separate from system memory. Also, the memory footprint increases with context window size and length of the chat. That means you’ll need more memory than just that required to load the model to have any kind of in-depth usage.
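As a rough sketch of why that is, total memory is the quantized weights plus a KV cache that grows linearly with context length. The architecture numbers below (layer count, head sizes, overhead) are illustrative assumptions, not the specs of any particular model:

```python
# Back-of-the-envelope estimate of local LLM memory use: quantized
# weights plus the KV cache that grows with context length. All
# architecture numbers below are illustrative assumptions.

def model_memory_gb(params_billion: float, bits_per_weight: int,
                    overhead_gb: float = 0.75) -> float:
    """Quantized weight size plus a fixed runtime overhead, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 + overhead_gb

def kv_cache_gb(context_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9

# Hypothetical 8B-class model at 4-bit quantization
weights = model_memory_gb(8, 4)               # 4.75 GB
short_chat = kv_cache_gb(2_048, 32, 8, 128)   # a couple of pages of text
long_chat = kv_cache_gb(32_768, 32, 8, 128)   # a long document in context

print(f"weights: {weights:.2f} GB")
print(f"KV cache @ 2K tokens:  {short_chat:.2f} GB")  # ~0.27 GB
print(f"KV cache @ 32K tokens: {long_chat:.2f} GB")   # ~4.29 GB
```

Note how, at a 32K-token context, the cache for this hypothetical model approaches the size of the weights themselves -- which is why long chats need headroom well beyond the model file.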
But while more memory and higher-end PC hardware are always better, recommendations for models with a 4GB or 8GB memory footprint should also work on recently released smartphones, as most include an NPU that can provide reasonable performance (although trying this will chew through your battery life).
4GB: Qwen3 4B 2507
A mere 4GB of memory is a slim budget for a large language model and, in truth, no model in this footprint can provide the quality, accuracy, and intelligence the best LLMs offer. Still, you might not need a rocket scientist if all you want is smarter-than-average code completion, quick document summaries, or an email writing assistant. For this I recommend Alibaba’s Qwen3 4B 2507. The 4-bit quantization requires only about 2.75GB of memory.
Available in both Thinking and non-Thinking flavors, Qwen3 4B 2507 can manage surprisingly decent scores in some benchmarks. According to Artificial Analysis, the Reasoning model scores 74% in MMLU-Pro, 67% in GPQA Diamond, and 64% in LiveCodeBench.
Qwen3 4B 2507 scores so well, in fact, that it may be worth a try even if you have a larger memory footprint available.
It also supports tool use (or as Alibaba defines it, function calling) to tap external tools, which can expand the way it is used -- though this does require some setup.
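The setup usually amounts to describing your functions as JSON schemas and passing them to the OpenAI-compatible API that most local runtimes (LM Studio, Ollama, llama.cpp’s server) expose. The model name, endpoint details, and `get_weather` helper below are hypothetical, for illustration only -- check your runtime’s docs for the exact wiring:

```python
import json

# Sketch of the "tools" payload that OpenAI-compatible local servers
# accept for function calling. The model name and get_weather function
# are hypothetical examples, not real registered identifiers.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical helper you'd implement
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "qwen3-4b-2507",  # whatever name your runtime registers
    "messages": [{"role": "user", "content": "What's the weather in Portland?"}],
    "tools": tools,
}

# The model replies with a tool_calls entry naming the function and its
# arguments; your code runs the function and feeds the result back in a
# follow-up message so the model can answer.
print(json.dumps(payload["tools"][0]["function"]["name"]))
```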
While Qwen3 4B 2507 offers a Thinking model, which uses multi-step inferencing to improve the quality of responses, the non-Thinking model is also worth a try and is actually preferable (and faster) in many situations. Thinking models are slower to respond, which can be a problem on less capable hardware.
If even Qwen3 4B 2507 requires too much memory for your device, take a step down to Qwen3 1.7B. The 4-bit quantization consumes about 1GB of memory. Just be warned that the capability of a model that small is limited; Qwen3 1.7B scored just 57% in MMLU-Pro and 31% in LiveCodeBench. It will leave you disappointed if you’re expecting a general-purpose LLM.
8GB: DeepSeek R1 0528 Qwen3 8B
To be honest, you might want to stick with Qwen3 4B 2507 at this memory budget. It’s excellent for its size. Also, many models that are a step above Qwen3 4B 2507 in intelligence require a bit too much memory to fit comfortably on a device with 8GB of memory.
Still, I do have a recommendation, and it’s a DeepSeek R1 distillation.
DeepSeek R1, released in January of 2025, is an excellent reasoning model. However, the full-fat DeepSeek R1 (which is the reasoning variant of DeepSeek V3) weighs in at 671 billion parameters, which puts it well into the territory of enterprise-grade AI hardware.
DeepSeek R1 0528 Qwen3 8B gives a taste, however, through a process called distillation. Distillation is a process where a smaller "student" model learns to mimic the behavior of a much larger "teacher" model. The larger model's knowledge and reasoning patterns are compressed into the smaller model, allowing it to achieve performance that punches well above its weight class. The 4-bit quantized model requires about 5GB when loaded into memory.
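A toy illustration of the core objective (a conceptual sketch, not DeepSeek’s actual training recipe): the student is trained to minimize the KL divergence between its output distribution and the teacher’s, so it learns to mimic the teacher’s “soft” predictions rather than just the right answer.

```python
import math

# Conceptual sketch of distillation: minimize KL(teacher || student)
# over next-token distributions. Logit values are made up for the demo.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the student q is from the teacher p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = softmax([4.0, 2.0, 1.0])        # teacher's next-token distribution
student_bad = softmax([1.0, 1.0, 1.0])    # untrained: uniform guessing
student_good = softmax([3.9, 2.1, 1.0])   # after training: near the teacher

# Training pushes the student's logits toward the teacher's,
# driving the KL loss toward zero.
print(f"untrained student loss: {kl_divergence(teacher, student_bad):.3f}")
print(f"trained student loss:   {kl_divergence(teacher, student_good):.3f}")
```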
To be clear, I think Qwen3 4B 2507 is a better model for vibe coding. However, I find DeepSeek R1 0528 Qwen3 8B is a fun model for brainstorming and creative tasks. It’s also rather quick for a reasoning model of its size, which makes it plausible to run it on mid-range PC laptops with an NPU or entry-level GPU.
16GB: Gemma 3 12B
A memory budget of 16GB gets tricky. It’s a more useful, and common, memory budget for a PC in 2025 -- but it’s still a bit south of what’s required to comfortably handle the most capable SLMs (small language models), particularly as the context window begins to fill up.
So my pick here is an older model: Google’s Gemma 3 12B.
Gemma 3 12B is not a reasoning model, which means it doesn’t handle multi-step reasoning to improve the quality of its answers. This has an impact on its benchmark scores. It only scores 60% in MMLU-Pro and a mere 14% in LiveCodeBench. So, this is not the preferred model for coding.
However, Gemma 3 12B is well-tuned for general chat. It produces easy-to-read responses in a friendly, explanatory tone. It currently ranks 66th in LMArena’s Text Arena, the best ranking of any similarly sized open-weights LLM. And the 4-bit quantization loads in at under 10GB of memory, so there’s a bit of overhead to handle longer conversations.
It’s also a vision model -- the first such model on this list. That means it can view and understand images. I find that vision is an important part of the general chatbot use case, since it’s so easy to just take a screenshot of what’s on your screen and upload it to the model. If you find Gemma 3 12B requires a bit too much memory, you might also consider Gemma 3n E4B. It loads in at about 5.5 GB, which is starting to push what’s practical with 8GB of memory on a PC laptop. But 16GB will handle it fine, and the model will respond more quickly than Gemma 3 12B.
A popular alternative in this category is Qwen2.5 Coder 14B. In contrast to Gemma 3 12B, Qwen2.5 Coder is tuned more for -- you guessed it -- coding. That means a crisp, matter-of-fact style of reply. The 4-bit quantization consumes about 8GB of memory. It’s an older model, though, without reasoning or tool calling.
32GB: Qwen3-30B-A3B-2507
We’re now into the heavy hitters, so a reminder: these models are going to require more serious hardware, particularly if you want to use long-context prompts. Performance is not all about tokens per second; prompt processing and time to first token are also significant. And you’re going to need something with a lot of memory and memory bandwidth, like a mid-range to high-end GPU, an Apple M4 Pro/Max, or AMD Ryzen AI Max+ CPU. Without that, you can see significant delays in prompt processing. It could be minutes before you see the first token.
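To see why, here’s the simple arithmetic behind time to first token: every prompt token must be processed (the “prefill” phase) before generation starts. The speeds below are illustrative assumptions, not measured figures for any specific hardware:

```python
# Why long prompts hurt on modest hardware: time to first token is
# roughly prompt length divided by prefill speed. Speeds below are
# illustrative assumptions, not benchmarks.

def time_to_first_token_s(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    return prompt_tokens / prefill_tok_per_s

long_prompt = 30_000  # e.g., pasting a long document into the chat

for hw, speed in [("fast GPU (assumed 3,000 tok/s)", 3_000),
                  ("laptop NPU/iGPU (assumed 150 tok/s)", 150)]:
    ttft = time_to_first_token_s(long_prompt, speed)
    print(f"{hw}: ~{ttft:.0f}s to first token")
```

Under these assumed speeds, the same 30K-token prompt goes from about ten seconds to over three minutes of waiting, which is the gap the hardware recommendations above are meant to close.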
But if you have that covered, Qwen3-30B-A3B-2507 is a great pick for a well-rounded default model. The reasoning variant (as it once again has both thinking and non-thinking variants) scores 81% in MMLU-Pro, and 71% in LiveCodeBench. It also scores 59% in AA-LCR Long Context, a more difficult benchmark that smaller models either score poorly on or can’t handle at all.
The 4-bit quantization of Qwen3 30B A3B 2507 consumes about 16.5GB of memory to start, so it’s arguably viable even on a system with 24GB of memory if you don’t have other memory-intensive apps open. I have used it on an M4 Macbook Air with 24GB of memory, though this is only viable with short prompts because, as mentioned earlier, long prompts can take minutes to process on less capable hardware.
I recommend the Qwen3 30B model without thinking for general chat, and with thinking if you want to tackle coding, science, and math. Both models support function calling.
Qwen3 30B doesn’t support vision, but Alibaba just released a model called Qwen3 VL 30B A3B that does. I haven’t had much time with it -- it's still very new -- but it’s probably worth a shot. Alternatively, you might stick with Google’s Gemma3-27B. It lacks reasoning, but it’s a good chat model with vision and the 4-bit quantization loads in at about 16.5GB of memory.
Larger still: Qwen3-Next-80B
If you have even more memory, and the compute performance to back it up, models up to 100B parameters can become viable on home PC hardware. I’ll admit I have to speculate a bit here, though, because no PC I own can handle models in this realm.
However, leaderboards show that Qwen rules this territory, and the Qwen3-Next-80B-A3B models stand out. They post high scores across the board and, while large, still come in under 100B parameters. GPT-OSS-120B is also a solid choice that can work on PCs with 128GB of unified memory, though it scores competitively in benchmarks only when the “High” reasoning effort mode is used. That mode also increases the number of tokens the model outputs, which raises the bar to hardware capable of delivering lots of tokens per second.
One thing to note is that neither of these models supports vision. Because of that, vision tasks may be just as well handled by Gemma3-27B. One massive Qwen model, Qwen3 VL 235B A22B, can handle vision. But at 235 billion parameters, even the best consumer PCs can’t handle it.
Honorable mentions
AI models, like CPUs and GPUs, tend toward a winner-takes-all dynamic. The only reason to use a second-best model is that the winner lacks a specific feature you need, like vision. Still, there are some second-best honorable mentions worth a look if you want to expand your horizons.
- Microsoft Phi 4: Though a bit older, Microsoft’s Phi 4 Reasoning and Phi 4 Reasoning Plus are decent reasoning models that support tool use. Phi 4 Reasoning is viable on systems with 8GB of memory or more, while the Plus model is really best off with 24GB of memory or more.
- GPT-OSS-20B: The smaller version of OpenAI’s open model didn’t make the cut. It doesn’t score as well in benchmarks as Qwen3-30B and has pretty strict safety tuning. Still, it’s a decent model for general use.
- Magistral: This model family from Mistral AI covers a range of model sizes. While they don’t score as well as Qwen models in benchmarks, they support vision. Magistral Small 2509 is worth a look if you need vision and you’re not pleased with Google’s Gemma 3, as it’s a newer model and seems to post much better benchmark scores (though my personal experience with Magistral Small 2509 is limited).
When in doubt, try Qwen
There’s a clear victor throughout this article: Alibaba’s Qwen family of models.
While Meta’s Llama family took an early lead in small, open AI models, and still enjoys some goodwill from that success, it has ceded the space to Qwen. Alibaba has released a wider variety of recent, updated models that fit into smaller memory footprints.
Google and OpenAI are on the back foot, too. Gemma 3 is a great model, but over 200 days have passed since its release, making it a grizzled old veteran of the space. It’s remarkable, actually, that the Gemma 3 models still feel useful despite their age, but a Gemma 4 release is sorely needed.
OpenAI’s GPT-OSS was a solid effort, and it’s only two months old. But OpenAI will need to put out new, open SLMs multiple times each year to keep up with Alibaba, and there’s no sign that’s OpenAI’s plan.
That leaves Alibaba’s Qwen in a unique position, at least for now. It’s able to release a barrage of smaller open models without much opposition as its competitors remain laser-focused on raising capital and boosting company valuations. And with big-money AI deals coming fast and furious, there’s not much sign that’ll change -- not this year, at least.
| System Memory | Recommended Model | Size (4-Bit Quant) | Best Use Case | Vision Support? |
|---|---|---|---|---|
| 4GB | Qwen3 4B 2507 | ~2.75 GB | Basic code completion, summaries, email | No |
| 8GB | DeepSeek R1 0528 Qwen3 8B | ~5 GB | Brainstorming, creative tasks, light reasoning | No |
| 16GB | Gemma 3 12B | ~10 GB | General chat, friendly explanations | Yes |
| 32GB+ | Qwen3-30B-A3B-2507 | ~16.5 GB | Coding, science, math, complex reasoning | No* |
| 64GB+ | Qwen3-Next-80B | ~48 GB+ | Heavy research, high-level reasoning benchmarks | No |
*Vision supported via the alternative Qwen3 VL 30B A3B model.
Matthew S. Smith is a prolific tech journalist, critic, product reviewer, and influencer from Portland, Oregon. Over 16 years covering tech he has reviewed thousands of PC laptops, desktops, monitors, and other consumer gadgets. Matthew also hosts Computer Gaming Yesterday, a YouTube channel dedicated to retro PC gaming, and covers the latest artificial intelligence research for IEEE Spectrum.
