Yes: the AMD Ryzen 5 5600G can run local LLMs without a discrete GPU, but only the small ones. CPU-only inference of a 7B model at q4_K_M lands at roughly 6-10 tokens per second on the 5600G's six Zen 3 cores, per llama.cpp's CPU benchmark thread, and a 3B model hits 15-25 tok/s. The Vega 7 iGPU gives no practical speedup for inference; system RAM bandwidth, not the iGPU, sets the ceiling.
The APU-only inference path for people who haven't bought a GPU yet
A surprising number of self-hosters land on r/LocalLLaMA and r/buildapc asking the same question: "can I run a local LLM without buying a GPU?" The honest answer is yes — for small models, on a modern APU, with realistic expectations. The AMD Ryzen 5 5600G is the catalog's cheapest credible host: six Zen 3 cores, twelve threads, a Vega 7 iGPU, and a 65 W TDP at roughly $130 used in 2026. It is enough CPU to run a 3-8B model on llama.cpp without any discrete card at all.
This piece is for two readers: the first-time builder who has not yet bought a GPU and wants to know if the 5600G is enough, and the seasoned PC builder picking a budget CPU to pair with an upcoming GPU and wondering whether the APU buys anything for inference workloads on its own.
The short version: the 5600G is a workable starting point for 3-7B model chat at low-to-mid throughput. It is the right place to start if you do not own a GPU yet, but the upgrade path to an RTX 3060 12GB is the move that unlocks usable interactive workloads.
Key Takeaways
- The 5600G runs 3B q4 models at 15-25 tok/s and 7B q4 at 6-10 tok/s, CPU-only.
- The Vega 7 iGPU is not useful for llama.cpp inference today — drivers and runtime support are weak.
- System RAM bandwidth is the bottleneck; dual-channel DDR4-3200 or 3600 roughly doubles throughput vs single-channel.
- 16 GB of RAM covers 7-8B models at q4; 32 GB is the comfortable choice for 13-14B models on the CPU.
- Adding a $300 used 3060 12GB transforms throughput by 5-8× — the obvious upgrade.
How fast is CPU-only inference on the 5600G's 6 cores?
CPU inference throughput on the 5600G depends on three things: model size, quantization, and RAM bandwidth. llama.cpp ships both AVX2 and AVX-512 code paths; the 5600G's Zen 3 cores support AVX2 but not AVX-512, so it runs the AVX2 kernels. Those keep the cores well utilized, but token generation is still bandwidth-limited.
Synthesized from public benchmarks and the llama.cpp CPU benchmark thread:
| Model size | Quant | Tok/s on 5600G CPU-only | RAM needed |
|---|---|---|---|
| 1-1.5B (TinyLlama 1.1B) | q4_K_M | 30-50 | 2 GB |
| 3-4B (Phi-3-mini 3.8B) | q4_K_M | 15-25 | 4 GB |
| 7B | q4_K_M | 6-10 | 8 GB |
| 8B (Llama 3.1) | q4_K_M | 5-9 | 10 GB |
| 13B | q4_K_M | 3-5 | 16 GB |
| 12-14B (Mistral Nemo 12B) | q4_K_M | 3-5 | 18 GB |
Numbers move within ±15% depending on DDR4 speed and timings. 8 tok/s on a 7B feels live for chat. 4 tok/s on a 13B does not — you watch the cursor crawl.
Does the Vega 7 iGPU help at all for LLM inference?
The short answer: not in practice. The 5600G's Vega 7 iGPU is GCN 5.0 architecture with 7 compute units and shared system RAM. There are three reasons it does not help with llama.cpp inference today:
- ROCm support is weak for Vega APUs. AMD's official ROCm hardware support matrix lists discrete RDNA cards but not Vega 7/8 APUs. Community ROCm forks exist but are fragile.
- Vulkan compute on the iGPU is functional but slow. llama.cpp's Vulkan backend works on Vega 7 but loses to the CPU path on most quants because the iGPU is sharing the same DDR4 bandwidth, not a dedicated VRAM pool.
- The iGPU borrows from system RAM, so there is no bandwidth gain. Discrete GPUs win because of their dedicated GDDR memory pool. An APU does not have that.
In rare configurations (specific kernels, OpenCL on certain models), the Vega 7 will hit 1.2-1.5× CPU throughput. Most of the time it loses to or ties the CPU path. Stick with --n-gpu-layers 0 on the 5600G unless you are specifically tuning for a known model where Vulkan helps.
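As a minimal sketch of what "stick with --n-gpu-layers 0" looks like in practice, the helper below assembles a CPU-only llama.cpp command line. The binary name `llama-cli`, the model path, and the helper `cpu_only_llama_cmd` are illustrative assumptions; the flags themselves are standard llama.cpp options.

```python
import shlex


def cpu_only_llama_cmd(model_path, prompt, threads=6, ctx=4096,
                       n_predict=128, binary="llama-cli"):
    """Build an argv list for a CPU-only llama.cpp run.

    `binary` and `model_path` are placeholders; adjust them to your install.
    """
    return [
        binary,
        "-m", model_path,
        "--n-gpu-layers", "0",        # keep every layer on the CPU
        "--threads", str(threads),    # one thread per physical core on the 5600G
        "--ctx-size", str(ctx),
        "--n-predict", str(n_predict),
        "-p", prompt,
    ]


cmd = cpu_only_llama_cmd("models/example-7b.Q4_K_M.gguf", "Hello")
print(shlex.join(cmd))  # paste the printed line into a shell to run it
```

Six threads matches the six physical cores; whether using all twelve SMT threads helps is worth testing on your own build rather than assuming.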
How much does dual-channel DDR4 bandwidth limit tok/s?
This is the single most important variable for CPU-only inference. Llama.cpp's generation step reads the full model weights once per output token, so throughput is capped by RAM bandwidth.
- DDR4-3200 dual-channel: 51.2 GB/s peak.
- DDR4-3600 dual-channel: 57.6 GB/s peak.
- DDR4-3200 single-channel (one stick): 25.6 GB/s peak.
For a 7B q4 model at 4.5 GB, the theoretical upper bound on tok/s is bandwidth ÷ model size: 51.2 ÷ 4.5 ≈ 11 tok/s. The measured 6-10 tok/s on real builds sits just under that bandwidth ceiling, which confirms that RAM bandwidth is the limiter.
The implication for buyers: do not skimp on RAM. Run dual-channel. Hit DDR4-3600 if your motherboard supports it. Single-channel DDR4-2666 peaks at about 21.3 GB/s and will more than halve your tok/s.
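The bandwidth arithmetic above can be reproduced in a few lines. This is a rough sketch that assumes peak theoretical DDR4 bandwidth (8 bytes per transfer per 64-bit channel) and the 4.5 GB model size used in this section; the function names are illustrative, and real builds land below these ceilings.

```python
def ddr4_peak_gbs(mt_per_s, channels=2, bus_bytes=8):
    """Peak bandwidth in GB/s: transfers per second x 8 bytes x channels."""
    return mt_per_s * bus_bytes * channels / 1000


def tok_s_ceiling(bandwidth_gbs, model_gb):
    """Upper bound on generation speed: each token reads all weights once."""
    return bandwidth_gbs / model_gb


MODEL_GB = 4.5  # 7B at q4, as quoted in the text

configs = [
    ("DDR4-3200 dual-channel", 3200, 2),
    ("DDR4-3600 dual-channel", 3600, 2),
    ("DDR4-3200 single-channel", 3200, 1),
    ("DDR4-2666 single-channel", 2666, 1),
]

for label, speed, channels in configs:
    bw = ddr4_peak_gbs(speed, channels)
    print(f"{label}: {bw:.1f} GB/s peak, "
          f"{tok_s_ceiling(bw, MODEL_GB):.1f} tok/s ceiling")
```

Swapping `MODEL_GB` for the file size of another quant gives the corresponding ceiling for that model.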
Spec-delta table: 5600G CPU-only vs 5800X CPU-only vs adding an RTX 3060
| Config | 7B q4 tok/s | Max usable model | Cost delta |
|---|---|---|---|
| 5600G CPU-only, DDR4-3200 | 6-10 | 7-8B | base |
| 5800X CPU-only, DDR4-3200 | 8-12 | 13B (slow) | +$80 |
| 5600G + RTX 3060 12GB | 40-50 | 14B q4_K_M | +$300 |
| 5800X + RTX 3060 12GB | 40-50 | 14B q4_K_M | +$380 |
The 5800X adds roughly 20-33% to CPU-only throughput at ~$80 over the 5600G. Adding a 3060 to either CPU adds 5-8× over CPU-only, and the CPU choice barely affects GPU-bound throughput. The marginal dollar is far better spent on the GPU than on the CPU step up.
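To put a number on "the marginal dollar", here is a small sketch that takes the midpoints of the 7B q4 ranges in the table and divides each upgrade's cost by the tok/s it adds. Using midpoints is a simplifying assumption, and the helper names are illustrative.

```python
def midpoint(lo, hi):
    return (lo + hi) / 2


def dollars_per_added_tok_s(new_tok_s, base_tok_s, cost):
    """Upgrade cost divided by the throughput it adds over the baseline."""
    return cost / (new_tok_s - base_tok_s)


base = midpoint(6, 10)  # 5600G CPU-only, 7B q4

upgrades = {
    "5800X CPU-only": (midpoint(8, 12), 80),
    "5600G + RTX 3060 12GB": (midpoint(40, 50), 300),
}

for name, (tok_s, cost) in upgrades.items():
    per_tok = dollars_per_added_tok_s(tok_s, base, cost)
    print(f"{name}: {tok_s / base:.2f}x baseline, "
          f"${per_tok:.0f} per added tok/s")
```

On these midpoints the CPU step up costs about $40 per added tok/s, while the GPU costs about $8.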
Quantization matrix: q4/q5/q8 for 3B/7B/8B on the 5600G
| Model | q4_K_M tok/s | q5_K_M tok/s | q8_0 tok/s | RAM at 4k ctx (q4 / q5 / q8) |
|---|---|---|---|---|
| Phi-3-mini 3.8B | 18-25 | 14-20 | 9-13 | 4 GB / 5 GB / 8 GB |
| Llama 3.1 8B | 6-10 | 5-8 | 3-5 | 8 GB / 10 GB / 12 GB |
| 13B-class (slow) | 3-5 | 2-4 | n/a | 14 GB / 18 GB / OOM |
Higher-precision quants (q5, q8) cost throughput because the per-token bandwidth read grows. On a CPU-only rig, q4_K_M is almost always the right choice: the quality drop is small and the throughput gain is real.
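The link between quant precision and throughput can be sketched from bits per weight. The figures below are approximations: q8_0 is 8.5 bits per weight by construction, while the q4_K_M and q5_K_M values are typical averages and vary by model because K-quants mix tensor types. The names `BITS_PER_WEIGHT`, `weights_gb`, and `relative_tok_s` are illustrative.

```python
# Approximate average bits per weight; assumptions, not exact file sizes.
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q5_K_M": 5.7, "q8_0": 8.5}


def weights_gb(params_billion, quant):
    """Estimated weight size in GB for a model with the given parameter count."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def relative_tok_s(quant, baseline="q4_K_M"):
    """Expected generation speed relative to the baseline quant,
    assuming generation is purely bandwidth-bound."""
    return BITS_PER_WEIGHT[baseline] / BITS_PER_WEIGHT[quant]


for quant in BITS_PER_WEIGHT:
    print(f"8B at {quant}: ~{weights_gb(8, quant):.1f} GB of weights, "
          f"~{relative_tok_s(quant):.0%} of q4_K_M speed")
```

Under these assumptions q8_0 runs at a little under 60% of q4_K_M speed, which is in line with the 8B row of the table (3-5 tok/s against 6-10).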
Prefill vs generation on a memory-bandwidth-bound APU
Prefill on the CPU is throughput-bound by compute (AVX2 matrix-multiplies), not bandwidth. The 5600G's six Zen 3 cores process roughly 80-120 tokens per second of prefill on a 7B model. A 2k-token prompt costs ~20 seconds of prefill before the first token shows. RAG pipelines that stuff retrieved context into every query will feel this delay on CPU-only.
Generation, the actual streaming-out of tokens, is bandwidth-bound, as discussed above.
Practical implication: do not run RAG with long retrieved contexts on a CPU-only 5600G. The prefill cost makes interactive use frustrating.
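A back-of-the-envelope latency model makes the prefill cost concrete. This sketch assumes the mid-range figures from this article (about 100 tok/s prefill and 8 tok/s generation on a 7B model) and treats both phases as constant-rate, which is a simplification; `response_time_s` is an illustrative helper.

```python
def response_time_s(prompt_tokens, output_tokens,
                    prefill_tok_s=100.0, gen_tok_s=8.0):
    """Return (time to first token, total time) in seconds."""
    prefill = prompt_tokens / prefill_tok_s
    generation = output_tokens / gen_tok_s
    return prefill, prefill + generation


for prompt_tokens in (200, 2000, 8000):
    ttft, total = response_time_s(prompt_tokens, output_tokens=200)
    print(f"{prompt_tokens:>5}-token prompt: first token after {ttft:.0f} s, "
          f"200-token answer done after {total:.0f} s")
```

For a short chat prompt the wait is dominated by generation; once retrieved context pushes the prompt into the thousands of tokens, prefill dominates.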
Context-length impact when there's no VRAM to spill into
On CPU-only inference, the KV cache lives in system RAM. With an fp16 cache, its size depends on the model's attention layout: at 8k context, a grouped-query model such as Llama 3.1 8B adds about 1 GB, while an older full multi-head 7B such as Llama 2 7B adds about 4 GB, and both figures double at 16k. With 32 GB of system RAM, you have plenty of room, so context is rarely the limiter on CPU.
The real cost of long context on CPU is wall-clock time, not memory. Prefill scales at least linearly with prompt length (attention adds a quadratic term on top), so a 16k-token prompt takes at least 8× longer to process than a 2k prompt. On a 5600G that means minutes, not seconds, for prefill alone.
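KV-cache size follows directly from a model's attention layout, so a single figure for "a 7B model" is only a ballpark. The sketch below uses the published layer and head counts for Llama 2 7B (full multi-head attention) and Llama 3.1 8B (grouped-query attention) with an fp16 cache; `kv_cache_gib` is an illustrative helper.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """KV-cache size in GiB: keys and values for every layer and position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


# (layers, KV heads, head dimension)
MODELS = {
    "Llama 2 7B (multi-head)": (32, 32, 128),
    "Llama 3.1 8B (grouped-query)": (32, 8, 128),
}

for name, (layers, kv_heads, head_dim) in MODELS.items():
    for n_ctx in (8192, 16384):
        size = kv_cache_gib(layers, kv_heads, head_dim, n_ctx)
        print(f"{name} at {n_ctx} ctx: {size:.0f} GiB")
```

Grouped-query models keep the cache about 4× smaller at the same context length, which is why newer 7-8B models are the friendlier choice on a RAM-only host.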
Perf-per-dollar: the 5600G as a $0-GPU starting point and the upgrade path to a 3060
The 5600G sells for roughly $130 used in 2026. Pair it with 16 GB of DDR4-3600 ($35) and a fast NVMe like the WD Blue SN550 1TB ($45). Total for those three parts: $210-250 (motherboard, PSU, and case not included) for the core of a working local-LLM host that can run 3-7B models at usable throughput.
Then the upgrade question becomes simple. Adding an RTX 3060 12GB ($300 used) gets you 5-8× throughput and unlocks the 14B model class. For roughly $300 more, you go from "barely usable for 7B chat" to "full local agent rig with headroom for 14B models and 16k context." That is the cheapest jump in local-inference performance you can make in 2026.
Common pitfalls running CPU-only inference on the 5600G
- Running single-channel RAM. A single 16 GB stick on a single channel halves your bandwidth. Always use two matched sticks.
- Picking q8 or fp16 weights. Throughput drops as precision rises; q4_K_M is the sweet spot on CPU.
- Ignoring background load. A browser, Discord, and a hungry dev tool all compete for RAM bandwidth. Close them when benchmarking.
- Assuming the iGPU helps. It does not, today.
- Using too small a context budget. RAM is cheap; do not crop context just to save 2 GB.
Bottom line + verdict matrix
The 5600G is a credible starting point for 3-7B class local LLM work in 2026, especially for first-time builders or for setting up a small "always-on" home inference host. It will not give you frontier-quality output, and it will not run agentic workloads with the kind of throughput an interactive chat demands. But it works, it is cheap, and it has the cleanest upgrade path of any budget CPU to a real GPU rig.
Start on the 5600G if:
- You do not own a GPU and are budgeting cautiously.
- Your workloads are 3-7B chat, summarization, narrow extraction.
- You want a 24/7 always-on host with low power draw.
- Your tolerance for sub-10 tok/s output is high.
Add a GPU when:
- You need 14B-class quality or above.
- You want sub-second time-to-first-token on real prompts.
- You start running agent loops where prefill becomes a bottleneck.
- You need RAG over multi-thousand-token retrieved contexts.
A reasonable progression for a self-hoster who starts on the 5600G: run 3-7B models for a few months, learn what kind of latency you can live with, and then add an RTX 3060 12GB, or step up to an AMD Ryzen 7 5800X if you decide your bottleneck is the CPU rather than the lack of a GPU. The 5600G keeps its value as a backup or homelab host even after you add a GPU. Pair either CPU with a budget SATA drive like the Crucial BX500 1TB if you want a cheap second tier of model storage and can accept slower load times.
Citations and sources
- AMD — Ryzen 5 5600G product page
- llama.cpp — CPU benchmark community thread
- AMD ROCm — Hardware support matrix
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
