
Local LLM hardware requirements explained

What you actually need to run AI models on your own machine, what to upgrade first, and what's overkill for typical use.

Apr 30, 2026 · 10 min read

The hardware question is the first thing people ask when they consider running AI locally, and most of the answers online are either wildly optimistic ("you can run anything on a laptop!") or wildly pessimistic ("you need an RTX 4090 minimum"). Neither is right. The actual answer depends on which models you want to run, how you want to use them, and what trade-offs you're willing to accept.

This post walks through what each component does for local AI, what specs you actually need for different use cases, and how to think about upgrades if you're planning a build or considering whether your current machine is up to it.

The components that matter for AI

Local AI inference puts unusual stress on hardware compared to typical computing tasks. Different components matter more or less than you'd expect.

Memory (RAM and VRAM) is the single most important factor. Running a model requires loading its weights into memory, and bigger models need more memory. The constraint isn't usually computational power; it's whether the model fits at all. A model that doesn't fit in memory either swaps to disk (extremely slow, often unusable) or fails to load entirely. Detailed sizing guidance is available in the Hugging Face documentation on model memory requirements.

The GPU is the second most important factor when running 7B+ models with reasonable speed. GPU acceleration via CUDA (NVIDIA), Metal (Apple Silicon), or ROCm (AMD) makes inference 5-20x faster than CPU. For small models or occasional use, CPU is workable. For sustained use of larger models, a GPU is close to essential.

Storage matters but is rarely a bottleneck. Model files range from a few gigabytes to a few tens of gigabytes each; you need enough disk space for the models you want to run, but read speed isn't a major factor once the model is loaded.

CPU matters less than people expect. Modern CPUs are fast enough that the CPU isn't usually the bottleneck even when you're running on CPU only. RAM speed and bandwidth matter more than CPU clock speed.

Power and cooling become factors at higher tiers. Running a 70B model continuously generates real heat and draws real power. Laptops with high-end GPUs can thermal-throttle under sustained AI workloads.
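The point about memory bandwidth can be made concrete with a common rule of thumb: generating each token requires reading roughly all of the model's weights from memory once, so bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below illustrates that assumption only. The bandwidth figures are hypothetical round numbers, not measurements of any specific product, and real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on generation speed for a memory-bound model.

    Assumes every generated token reads all weights from memory once,
    so throughput cannot exceed bandwidth / model size.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Hypothetical round-number bandwidths in GB/s, chosen for illustration only.
EXAMPLE_BANDWIDTHS = {
    "dual-channel system RAM": 50.0,
    "mid-tier unified memory": 200.0,
    "high-end discrete GPU": 1000.0,
}

if __name__ == "__main__":
    model_size_gb = 4.2  # a 7B model at about 0.6 GB per billion parameters
    for name, bandwidth in EXAMPLE_BANDWIDTHS.items():
        bound = max_tokens_per_second(bandwidth, model_size_gb)
        print(f"{name}: at most ~{bound:.0f} tokens/s")
```

The ordering matters more than the absolute numbers: the same model gets faster as memory bandwidth goes up, even if the CPU or GPU compute stays the same.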

The memory math

The most useful number to internalize: at the standard Q4_K_M quantization Ollama uses by default, plan for roughly 0.6 GB of memory per billion parameters, plus headroom for context. The llama.cpp project maintains detailed quantization documentation if you want to understand the trade-offs in depth.
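As a quick sanity check on that rule of thumb, here is a minimal sketch that multiplies parameter count by 0.6 GB. The function name and constant are made up for this example, and the result covers weights only, before any headroom for context.

```python
GB_PER_BILLION_PARAMS_Q4_K_M = 0.6  # rule of thumb from the text above


def estimate_weight_memory_gb(
    params_billions: float,
    gb_per_billion: float = GB_PER_BILLION_PARAMS_Q4_K_M,
) -> float:
    """Estimate memory for model weights only (no context headroom)."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    return params_billions * gb_per_billion


if __name__ == "__main__":
    for size in (3, 7, 13, 34, 70):
        print(f"{size}B: ~{estimate_weight_memory_gb(size):.1f} GB before context headroom")
```

The estimates land at or just below the low end of the ranges listed next, which is expected because those ranges include some headroom.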

Specific numbers for common model sizes:

3B parameter models: ~2-3 GB. Phi-3, smaller Qwen variants. Run on essentially any modern laptop, including phones and tablets in some cases. Quality is limited but they work for basic tasks.

7B parameter models: ~4-6 GB. Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B. The sweet spot for most users. Capable enough for general chat, coding assistance, and creative writing. Run on laptops with 8GB+ of RAM, comfortably on anything with 16GB+.

13B parameter models: ~8-10 GB. MythoMax, larger Mistral variants. Noticeably better than 7B for most tasks. Need 16GB of RAM minimum, ideally with a GPU.

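34B parameter models: ~20-24 GB. Yi 34B, and roughly Mixtral 8x7B (a mixture-of-experts model with about 47B total parameters; only a fraction are active per token, which helps speed, but all of the weights still have to be loaded, so it sits at the top of this range or above it). Significantly better reasoning. Need a high-end GPU (RTX 4090 with 24GB VRAM) or Apple Silicon with unified memory.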

70B parameter models: ~40-48 GB. Llama 3.3 70B, Qwen 2.5 72B. Approach commercial-API quality. Need workstation-tier hardware: dual GPUs, top-tier Apple Silicon, or specialized AI hardware.

These numbers are for inference. Training requires several times more memory, but most users only do inference.

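The "headroom for context" varies with context length. A model running with a 4K context uses little more than the base memory. A model running with 32K context uses noticeably more, because every token held in context adds key and value entries to the cache. This overhead depends on the model's architecture and the context length rather than on how the weights are quantized; as a rough guide, plan for an additional 1-3 GB beyond the base model size if you're using long contexts.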
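To see where the context overhead comes from, the sketch below uses the standard key/value cache size formula for transformer models: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture values in the example are illustrative assumptions in the range used by 7B-8B class models with grouped-query attention. Check the config of the model you actually run, and note that some runtimes shrink this further by quantizing the cache.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes 16-bit cache entries
) -> int:
    """Size of the key/value cache for a given context length."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


if __name__ == "__main__":
    # Hypothetical architecture, for illustration only.
    layers, kv_heads, dim = 32, 8, 128
    for tokens in (4096, 16384):
        gib = kv_cache_bytes(tokens, layers, kv_heads, dim) / 2**30
        print(f"{tokens} tokens: {gib:.2f} GiB")
    # prints "4096 tokens: 0.50 GiB" and "16384 tokens: 2.00 GiB"
```

The cache grows linearly with context length, which is why short contexts cost almost nothing and long ones need real headroom.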

What different hardware tiers actually let you do

Mapping memory budgets to use cases:

8 GB of RAM (typical mid-range laptop): You can run 3B models and 7B models with smaller contexts. The experience is fine for testing and light use. Inference is slower than ideal because the same 8 GB is shared between the model, the operating system, and (on machines with integrated graphics) the GPU. This tier works as an entry point but you'll outgrow it quickly.

16 GB of RAM (typical mid-range to premium laptop): Comfortable territory for 7B models with full context. 13B models work but with some sluggishness. This is the most common entry point for serious local AI use, and the experience is genuinely good.

24 GB of VRAM (RTX 4090, RTX 3090): 13B models run smoothly and fast. 34B models work with quantization. Inference is fast enough for real-time use even on the larger models. Strong tier for users who want top-tier local AI without going to workstation-class hardware.

32 GB of unified memory (M2 Pro, M3 Pro Mac): Similar capabilities to a 24GB GPU PC for AI workloads, with the advantage of fitting in a single laptop. Apple Silicon's unified memory architecture is genuinely good for AI inference. 13B models run fast; 34B models work.

64 GB of unified memory (M2 Max, M3 Max, M4 Max with high configurations): The territory where 70B models become workable. The unified memory architecture lets the GPU use most of that memory for inference, with a share held back for the operating system.

Workstation tier (dual RTX 4090s, dual A6000s, etc.): 70B models run fast with full context. Multiple models can run simultaneously. This tier is overkill for most consumer use but right for users running production-style local AI workloads.
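The tier list above boils down to a lookup: take the memory you can actually give the model and find the largest model size that fits. The sketch below does that using the upper-end memory figures from the sizing list earlier in this post. The `reserved_gb` parameter is a hypothetical allowance for the operating system and other applications on machines where memory is shared, and the values passed in the example are guesses, not recommendations.

```python
from typing import Optional

# Upper-end memory figures (GB) per model size, from the sizing list in this post.
MODEL_MEMORY_GB = {"3B": 3, "7B": 6, "13B": 10, "34B": 24, "70B": 48}


def largest_model_that_fits(
    total_memory_gb: float, reserved_gb: float = 0.0
) -> Optional[str]:
    """Return the largest model size that fits, or None if none does."""
    usable_gb = total_memory_gb - reserved_gb
    best = None
    # Dict order is smallest to largest, so the last match is the largest fit.
    for name, needed_gb in MODEL_MEMORY_GB.items():
        if needed_gb <= usable_gb:
            best = name
    return best


if __name__ == "__main__":
    # (label, total memory, hypothetical amount reserved for the system)
    examples = [
        ("8 GB laptop", 8, 2),
        ("16 GB laptop", 16, 4),
        ("24 GB VRAM GPU", 24, 0),
        ("64 GB unified memory", 64, 8),
    ]
    for label, total, reserved in examples:
        print(f"{label}: {largest_model_that_fits(total, reserved)}")
```

A model that only just fits will still feel cramped with long contexts, so treat the result as a ceiling rather than a comfortable target.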

Apple Silicon vs NVIDIA for local AI

This is the most common hardware question for users buying or building specifically for local AI.

Apple Silicon advantages:

Unified memory architecture means the GPU has access to the same memory as the CPU. A 64GB M3 Max can run 70B models because the GPU can use most of that 64GB, apart from a share held back for the system. An equivalent NVIDIA setup needs 48GB+ of VRAM, on one card or split across several, to do the same thing. Apple's documentation on the unified memory architecture covers the technical details.

Power efficiency. Apple Silicon Macs can run AI workloads on battery for hours. NVIDIA-based laptops with discrete GPUs typically can't, and even desktops draw significantly more power.

Out-of-box experience. Metal acceleration works automatically. No driver setup, no CUDA configuration. Install the tool, run the model, done.

Mobility. A MacBook Pro with 64GB of unified memory is a portable AI workstation. An equivalent NVIDIA setup means a desktop or a very thick laptop with limited battery life.

NVIDIA advantages:

Raw inference speed. For models that fit in VRAM, top-tier NVIDIA GPUs are still faster than equivalent Apple Silicon for inference. The gap is shrinking but real.

Software ecosystem. CUDA has more years of optimization behind it than Metal. Some specialized tools work better with NVIDIA than with Apple Silicon. Training (if you ever want to do it) is much more practical on NVIDIA.

Upgradability. You can buy a base PC and upgrade the GPU later. With Apple Silicon, the unified memory is permanent at purchase.

Multiple GPUs. Running multiple NVIDIA cards is common for serious workloads. Apple Silicon doesn't really have an equivalent to multi-GPU configurations.

Cost at the top end. A workstation with dual RTX 4090s costs roughly the same as a top-spec MacBook Pro but offers more raw performance for the workload.

Practical recommendation:

For most users buying a new machine specifically because they want to run AI locally, Apple Silicon with as much unified memory as you can afford is the strongest choice. The memory architecture, power efficiency, and out-of-box experience produce a better day-to-day workflow than building or buying NVIDIA hardware for the same task.

For users who already have a powerful gaming PC, a single RTX 4090 (whether newly added or already installed) makes that PC excellent for local AI. The infrastructure is already there.

For users in serious development or production use cases, NVIDIA workstations remain the better choice because of software ecosystem and ability to scale.

CPU-only inference

If you don't have a GPU, can you still run local AI? Yes, with caveats.

Modern CPUs (any Intel chip from the last 4 years, AMD Ryzen 5/7/9, Apple Silicon) can run small models acceptably. A 3B model on a recent CPU produces 5-15 tokens per second, which feels slow but is usable for non-real-time tasks.

7B models on CPU produce 2-8 tokens per second on most consumer hardware. Tolerable for testing, frustrating for sustained use.

13B models on CPU are slow enough that most users find them impractical: under 2 tokens per second on most consumer chips.

CPU-only AI works as an entry point or for occasional use. Anyone planning regular local AI use should plan to add GPU acceleration eventually.
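Tokens per second is easier to judge once it is converted into waiting time. The sketch below turns the CPU-only ranges quoted above into how long a response of a given length would take. The 300-token response length is an arbitrary example, and the lower bound of 1 token per second for 13B models is an assumption standing in for "under 2".

```python
from typing import Tuple

# Tokens-per-second ranges quoted above for CPU-only inference.
# The 13B lower bound of 1.0 is an assumption standing in for "under 2".
CPU_SPEED_RANGES = {"3B": (5.0, 15.0), "7B": (2.0, 8.0), "13B": (1.0, 2.0)}


def wait_seconds(response_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to generate a response at a steady speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


def wait_range(model: str, response_tokens: int) -> Tuple[float, float]:
    """Best-case and worst-case wait for a model size, in seconds."""
    slowest, fastest = CPU_SPEED_RANGES[model]
    return (
        wait_seconds(response_tokens, fastest),
        wait_seconds(response_tokens, slowest),
    )


if __name__ == "__main__":
    for model in CPU_SPEED_RANGES:
        best, worst = wait_range(model, 300)
        print(f"{model}: {best:.0f}-{worst:.0f} s for a 300-token response")
```

At these speeds a 7B model takes somewhere between half a minute and a couple of minutes per answer, which is why it reads as tolerable for testing but frustrating for sustained use.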

Storage considerations

Model files range from a few gigabytes to tens of gigabytes each. If you only run one model, this isn't a big deal. If you experiment with many models, it adds up.

Practical numbers:

A 7B model in Q4_K_M is about 4-5 GB.

A 13B model in Q4_K_M is about 8 GB.

A 34B model in Q4_K_M is about 20 GB.

A 70B model in Q4_K_M is about 40 GB.

Higher-quality quantizations (Q5, Q6, Q8) take proportionally more space. Lower-quality quantizations (Q3, Q2) take less.
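"Proportionally" here means file size scales with the average number of bits stored per weight. The sketch below shows that arithmetic. The bits-per-weight values are approximate, commonly quoted figures for llama.cpp quantization types and should be treated as assumptions; real files vary by model and by llama.cpp version.

```python
# Approximate average bits per weight (assumed values, for illustration only).
APPROX_BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def estimate_file_size_gb(params_billions: float, quant: str) -> float:
    """Estimate model file size: parameters x bits per weight / 8 bits per byte."""
    bits_per_weight = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        print(f"7B at {quant}: ~{estimate_file_size_gb(7, quant):.1f} GB")
```

With these assumed values, Q4_K_M works out to 0.6 GB per billion parameters, which matches the rule of thumb from the memory section.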

If you want to experiment with many models, plan on 100-500 GB of dedicated AI model storage. SSDs are preferred over HDDs because models load faster from SSDs, but once loaded, the disk type doesn't matter for inference speed.
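If you keep downloaded models in one folder, a few lines of Python will tell you how much space they take. This sketch assumes the models are stored as `.gguf` files under a directory you choose; the `~/models` path is hypothetical, and tools that manage their own model store may use a different layout and file naming.

```python
from pathlib import Path


def total_model_size_gb(model_dir: Path, pattern: str = "*.gguf") -> float:
    """Sum the sizes of matching files under model_dir, in decimal GB."""
    total_bytes = sum(
        path.stat().st_size
        for path in model_dir.rglob(pattern)  # searches subfolders too
        if path.is_file()
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    model_dir = Path.home() / "models"  # hypothetical location, adjust to your setup
    if model_dir.is_dir():
        print(f"{model_dir}: {total_model_size_gb(model_dir):.1f} GB of models")
    else:
        print(f"{model_dir} does not exist")
```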

What about laptops vs desktops

For most consumer use, the laptop vs desktop question comes down to: do you want portability or raw power?

Laptops are practical for AI use up through about 13B models on premium gaming laptops (RTX 4080/4090 mobile) and about 34B models on Apple Silicon with 32-48GB of unified memory. Beyond this tier, most laptops thermal-throttle under sustained AI workloads and battery life becomes a serious limit.

Desktops are practical for any tier, including the largest local models. Desktop GPUs can sustain full power with far more thermal headroom than laptop parts. PSU and cooling are easier to address. The trade-off is mobility, which matters for some users and not others.

The Apple Silicon middle ground: A high-end MacBook Pro is more capable for AI than most laptops without being noticeably less portable. This is one reason Apple Silicon dominates discussions of "best laptop for local AI."

What's overkill

Several common upgrades are overkill for most local AI use:

Top-of-the-line CPUs. Diminishing returns past mid-range. AI inference doesn't benefit from CPU clock speed past a certain point.

Massive system RAM beyond what your GPU can use. If your GPU has 24GB of VRAM and you have 128GB of system RAM, most of that system RAM sits unused for AI tasks unless you offload part of a model to the CPU, which is much slower (though the RAM is useful for other things).

Multiple GPUs for casual use. Multi-GPU setups are powerful but the configuration is complex and most consumer use doesn't need it.

The largest possible storage. Models are measured in tens of gigabytes, not terabytes. A 1TB SSD is plenty for any consumer local AI setup, and 2TB is overkill for anyone but power users.

Ultra-fast storage. NVMe is faster than SATA SSD, which is faster than HDD, but the difference doesn't matter much for AI workloads once models are loaded.

The first dollar of upgrade should usually go to memory (RAM or VRAM), then to GPU compute. Other upgrades produce smaller returns for most local AI use cases.

Frequently asked

What's the minimum hardware to run any local AI?

A computer with 8GB of RAM running Linux, macOS, or Windows. You'll be limited to small models with slow performance, but you'll be running AI locally.

What's the recommended hardware for serious use?

For most users: Apple Silicon with 32GB of unified memory, or a PC with 32GB of RAM and an RTX 4090 (24GB VRAM). This tier handles 13B models comfortably and 34B models with quantization, which covers most practical use cases.

Can I upgrade my laptop for AI?

Mostly no. Laptop RAM and GPU are usually soldered or non-upgradeable. The exception is some Windows gaming laptops where RAM and SSD can be added. For Apple Silicon, nothing is upgradeable; what you buy is what you have.

How much VRAM do I really need?

For 7B models with comfortable context: 8GB. For 13B models: 12-16GB. For 34B models: 24GB+. For 70B models: 48GB+. These are minimums; more is always better.

Does AMD or Intel Arc work for local AI?

AMD via ROCm is improving but still has rough edges. Intel Arc has limited but growing support. Both work for some setups but require more troubleshooting than NVIDIA or Apple Silicon. For users prioritizing ease of setup, NVIDIA or Apple Silicon are safer choices.

What about cloud GPU rentals for local-style AI?

Services like Vast.ai, RunPod, and similar let you rent GPU time. Useful for occasional heavy workloads but defeats the privacy benefit of local AI. Practical for training or experimentation, less useful for sustained inference.

How long until I need to upgrade?

Hardware bought in 2024-2025 will be useful for local AI for several years. New model architectures will continue to be released, but they typically run on existing hardware with the same memory requirements. The main reason to upgrade is to run larger models or run them faster, not because old hardware becomes incompatible.

What about future hardware specifically for AI?

Apple's M-series chips, NVIDIA's "AI PC" initiatives, and Intel's NPU integrations are all making consumer hardware more AI-capable. Buying current hardware is fine; AI-specific generations will bring improvements, but you don't need to wait for them.