Why VRAM and Memory Bandwidth are Key for Powering Local AI
The biggest bottleneck to your AI experience may come from your memory and memory bandwidth.
Photo: Jacob Bobo

Compute performance has hogged the spotlight for most of the PC’s history. Memory and storage also mattered, of course, but most of what your PC could or couldn’t handle was determined by the CPU and GPU inside.
AI is turning that on its head.
Running local AI models on your PC still requires a capable GPU or NPU, but the biggest limit on the AI experience comes from the memory and memory bandwidth available in your PC.
You’re going to need more memory
To run an AI large language model (LLM) or image diffusion model on your PC, you'll first need to download the model weights. (Think of the model as the architecture of an AI and the model weights as the knowledge the AI relies on.) These are the trained parameters that determine how the model processes information and generates outputs. LM Studio is a great way to get started with LLMs on your PC. It’s a one-stop shop for downloading and chatting with an LLM.

The weights are often massive. Models with parameter counts in the 20 to 40 billion range (which is what you’ll likely want to run, if your PC can handle it) are anywhere from 16 to 32 GB in size, or more. The largest models that are practical to run on a high-end PC, like GPT-OSS-120B, come in at about 50 GB to over 100 GB.
That's a lot of storage, to be sure, but it’s tolerable. After all, many video games require well over 100GB of storage for a full install.
But here’s the real problem: unlike a AAA game, an AI model generally needs to be loaded into memory in its entirety before you can use it at all. Every. Last. Gigabyte.
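To get a feel for the sizes involved, you can estimate a model’s weight footprint from its parameter count and quantization level. A rough sketch (weights only; real usage adds context cache and runtime overhead on top):

```python
def model_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-only footprint: parameter count x bits per weight.
    Real usage adds the KV cache and runtime overhead on top."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 30-billion-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{model_footprint_gb(30, bits):.0f} GB")
# 16-bit: ~60 GB, 8-bit: ~30 GB, 4-bit: ~15 GB
```

The 4- and 8-bit figures line up with the 16 to 32 GB downloads mentioned above for 20 to 40 billion parameter models.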
Why AI performance is a memory issue
So, you need a lot of memory to run the best AI models. Really, 128 GB would be ideal; 256 GB would be even better. That's a lot, but it's not absurd. You can pick up 128 GB of DDR5 memory for less than $400.
However, there’s another problem. It's not just the amount of memory that matters. Its speed, or memory bandwidth, is just as important.
As mentioned, a model’s weights can take up tens or even hundreds of gigabytes. When you ask a model a question, those weights must be shuttled to the GPU, NPU, or (hopefully not) the CPU, where billions or trillions of matrix-vector multiplications are executed per second.
That's quite a bit different from how most computer programs function. Most try to load as little into memory as possible and, at any given time, shuttle only a small portion of that off to a CPU or GPU. Indeed, tight memory management is a point of personal pride for many programmers. With current AI models, however, that’s not possible. The entire AI model needs to be loaded into memory so it can be accessed in less than the blink of an eye.
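Because generating each new token requires streaming essentially all of the weights through the processor, memory bandwidth puts a hard ceiling on generation speed. A back-of-the-envelope sketch of that ceiling (a bandwidth-only upper bound; real systems fall well short of it):

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on generation speed for a dense model: each new token
    requires reading the full set of weights from memory once."""
    return bandwidth_gb_s / weights_gb

# A ~16 GB model (roughly a 4-bit, 30B-class model):
print(tokens_per_sec_ceiling(102, 16))   # dual-channel DDR5-6400: ~6 tokens/s
print(tokens_per_sec_ceiling(1790, 16))  # RTX 5090-class VRAM: ~112 tokens/s
```

The gap between those two numbers is the whole story of this article in miniature.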
In fact, a truly optimized AI processor would have a chip architecture that places compute hardware inside the memory itself, effectively fusing the NPU and RAM into a single unified architecture. This is called compute-in-memory (CIM), and several start-ups and research initiatives are pursuing the idea in hopes of building a new type of AI processor that could have Nvidia running scared.
It’s early days, though, so it’ll be years, maybe even decades, before CIM processors arrive in consumer PCs.
Why you want VRAM
For now, PCs are tackling AI’s intense memory requirements the only way they know how. They’re using fast memory and making the pipe that connects the memory to an AI processor bigger. That’s where video random-access memory (VRAM) comes in.
First, a technical point. VRAM is a squishy term that doesn’t describe any specific technology. The term “video random-access memory” just means memory dedicated to video processing hardware.
With that said, the VRAM used by discrete GPUs is often much quicker than the memory used for a PC’s shared system memory, and it’s connected to the GPU over a wider pipe than the one that links system memory to a CPU or NPU.

Consider the Nvidia RTX 5090, the most powerful piece of AI hardware most PC enthusiasts are likely to consider. It has 32GB of GDDR7 memory with a quoted maximum memory bandwidth of 1.79 terabytes per second.
That puts DDR5 SDRAM, the type of memory most commonly used for a modern PC’s system memory, to shame.
The quickest DDR5 memory available for a PC, DDR5 9600, can deliver up to 307 gigabytes per second in a quad-channel configuration. That’s way less and arguably unrealistic, because DDR5 9600 memory is expensive, not widely available, and supported by a limited number of high-end motherboards and CPUs. A more common configuration, like 128GB of DDR5 6400 in a dual-channel configuration, provides about 102 GB/s.
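All of these peak figures come from the same simple formula: transfer rate times bus width times channel count. A quick sketch that reproduces the numbers above (the GDDR7 line assumes the RTX 5090’s 512-bit bus running at 28 GT/s effective, which matches its quoted 1.79 TB/s):

```python
def bandwidth_gb_s(transfers_mt_s: int, bus_bits: int, channels: int = 1) -> float:
    """Peak bandwidth = transfer rate (MT/s) x bytes per transfer x channels."""
    return transfers_mt_s * (bus_bits / 8) * channels / 1000

print(bandwidth_gb_s(6400, 64, channels=2))  # DDR5-6400 dual-channel: 102.4 GB/s
print(bandwidth_gb_s(9600, 64, channels=4))  # DDR5-9600 quad-channel: 307.2 GB/s
print(bandwidth_gb_s(28_000, 512))           # 512-bit GDDR7 at 28 GT/s: 1792.0 GB/s
```

Note that the GPU gets there less through faster transfers per pin than through a vastly wider bus: 512 bits versus the 128 bits of a dual-channel desktop platform.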
And that’s a problem. It’s like trying to mash a hotdog through a strainer. You can do it, but it’s not the best idea.
Exceptions to the rule
If memory bandwidth is critical, and VRAM can offer an order-of-magnitude improvement in bandwidth, then it’s VRAM or bust, right?
Mostly, yes. Except when it’s not.
It’s important to note that what really matters is the memory bandwidth itself, not the type of memory used.
VRAM rules because it offers the best memory performance and is typically connected over a large memory bus (512-bit in the case of the RTX 5090). But other types of memory, in different configurations, could be competitive in the future.
The best example of this is Apple’s M3 Ultra, which remains Apple’s best system-on-a-chip (SoC) to date. The M3 Ultra in the Mac Studio supports up to an absolutely ludicrous 512GB of unified system memory with a maximum memory bandwidth of 819 GB/s. It uses LPDDR5-6400 memory, but pairs that with a unified memory architecture that includes 64 16-bit memory controllers (that’s 1,024-bit in total). That’s why the M3 Ultra can provide far more memory bandwidth than a desktop PC whose DDR5 memory runs at the same 6,400 MT/s.
The M3 Ultra’s bandwidth is still a lot less than the Nvidia RTX 5090, but it’s a heck of a lot closer than DDR5 in a typical PC, and it demonstrates that memory bandwidth, like most aspects of PC performance, is complex.
It’s true that VRAM such as GDDR7 is going to provide a lot of memory bandwidth and great AI performance—but also true that other types of memory could compete when a system’s architecture is designed with memory bandwidth as a priority.
For the moment, though, Apple’s chips are the sole exception. AMD, Intel, and Qualcomm don’t offer hardware with that much memory bandwidth to system memory.
The quest for the perfect AI PC
To make matters more complicated, there’s currently a huge problem that faces everyone looking to build an AI PC. Generally, PCs with enough memory bandwidth to handle the best models lack enough memory capacity, and PCs with enough memory capacity lack enough memory bandwidth.
Consider that RTX 5090. It’s a truly mighty GPU, but it only has 32GB of GDDR7 memory. That’s only enough to load models up to about 40 billion parameters. It will handle some solid models, like Qwen3 30B, Gemma 3 27B, and GPT-OSS-20B.
But if you want to run GPT-OSS-120B, Llama 3.3 70B, or Qwen3 Next 80B, you’re kinda out of luck. While it’s technically possible to span the model across GPU and system memory, that will toss the GPU’s memory bandwidth advantage out the window.
On the other hand, you could snag a chip like the AMD Ryzen AI Max+ 395, which is found in the Framework Desktop, the HP ZBook Ultra G1a, and a handful of competing PCs. It can handle up to 128GB of system memory and has both an integrated GPU and NPU. That means you can load models as large as GPT-OSS-120B.
But that chip supports a maximum memory bandwidth of 256 GB/s, which, though killer by historic standards, is merely okay for AI workloads.
AMD says GPT-OSS-120B can achieve up to 30 tokens per second on the Ryzen AI Max+ 395 with 128GB of memory. That’s usable, but a lengthy reply will still take a minute or more. For comparison, an Nvidia RTX 5090 running GPT-OSS-20B can produce over 200 tokens per second.
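That 30 tokens per second figure is plausible because GPT-OSS-120B is a mixture-of-experts model: each token activates only a fraction of the weights (reportedly around 5.1 billion parameters), so the bandwidth ceiling is much higher than the model’s full size would suggest. A hedged back-of-the-envelope check, assuming roughly 4-bit weights:

```python
def moe_tokens_per_sec_ceiling(bandwidth_gb_s: float,
                               active_params_b: float,
                               bits_per_weight: float) -> float:
    """Bandwidth-only ceiling for a mixture-of-experts model: each token
    reads only its active experts' weights, not the entire model."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

# Ryzen AI Max+ 395 (256 GB/s) running GPT-OSS-120B (~5.1B active, ~4-bit):
print(moe_tokens_per_sec_ceiling(256, 5.1, 4))  # ~100 tokens/s ceiling
```

The observed 30 tokens per second sits well below that theoretical ceiling, which is typical: compute limits, cache behavior, and software overhead all eat into the bandwidth-only number.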
VRAM rules, for now
This shows the trade-off between model quality and model performance.
Relying on a discrete GPU for AI performance limits the amount of memory you can purchase, which limits the intelligence of the model you can use, but performance will be snappy. Choosing to rely on an integrated GPU or NPU can give you access to larger, smarter models, but performance can be annoyingly slow.
But most people enthusiastic enough to explore running an AI model on their own hardware are heavy users and expect a speedy response. For now, a smaller model on a GPU with a healthy amount of VRAM is the best way to achieve that.
There are plenty of compact models that offer decent quality, and small models can still produce replies at blistering speeds on more modest GPUs. Just aim for a minimum of 12GB of VRAM, which is enough to handle models like Qwen3 8B and Gemma 3 12B.
The choice that’s right for you will depend on how you plan to use the model. If you’re not pressed for time, a better model that produces results slowly might be the way to go.
Matthew S. Smith is a prolific tech journalist, critic, product reviewer, and influencer from Portland, Oregon. Over 16 years covering tech he has reviewed thousands of PC laptops, desktops, monitors, and other consumer gadgets. Matthew also hosts Computer Gaming Yesterday, a YouTube channel dedicated to retro PC gaming, and covers the latest artificial intelligence research for IEEE Spectrum.