Yes. llama.cpp lets you split a single LLM across two different GPUs, including cards of different generations and VRAM sizes, using the --tensor-split flag. The runtime distributes transformer layers in proportion to each card's available memory (or to the ratio you pass), lets the smaller card carry a fair share of the work, and places the KV cache according to the split mode in use. Mixed-GPU inference is real, it's practical on consumer hardware, and it's the cheapest path to extra usable VRAM for layer-splittable model sizes (roughly 30B–70B at q4).
Why this matters — when one card isn't enough and a second matching card is too expensive
The standard prescription for "I need more VRAM" is buy a second identical card. That's the right answer when the same GPU is in stock at the same generation at a sane price — but in mid-2026 the secondhand market frequently does not cooperate. A user with a ZOTAC RTX 3060 12GB sitting in their main rig who wants 24 GB total has three options:
- Buy another 3060 12GB (consistent VRAM, easy split, $250–320)
- Buy a 3060 Ti, 3070, 4060, or other card that fits the budget but mismatches the original (cheaper than a matched-pair build)
- Buy a 3090 24GB outright and retire the 3060 ($700–900 used)
The second option is the heterogeneous-GPU case. It works on llama.cpp's CUDA backend because the runtime treats each card as an independent compute pool with its own VRAM allocation, sequences inference layer-by-layer across cards, and uses PCIe to pass intermediate activations between them. The PCIe traffic per token is small (single-digit megabytes), so even a chipset-attached x4 Gen3 slot on a B550 board doesn't bottleneck inference once weights are loaded.
This piece walks through what works, what doesn't, and the trade-offs that decide whether mixed-GPU is the right answer for a given build.
Key takeaways
- llama.cpp's --tensor-split N,M flag distributes transformer layers proportionally between two GPUs, including mismatched models
- Generation throughput on a heterogeneous pair is roughly the harmonic mean of the two cards' bandwidth, so the slower card sets the pace
- PCIe x4 Gen3 is sufficient for inference traffic between cards; weight load time at startup is the only place a slow link hurts
- vLLM and TensorRT-LLM assume homogeneous GPU pools and don't split intelligently across mismatched cards — pick llama.cpp for mixed builds
- For 30B–34B class models, two 12 GB cards host q4 at 4K context (roughly 22.5 GB of the 24 GB for a 34B model); for 70B at q4 you still need 40 GB+, which exceeds typical consumer pairs
What is heterogeneous GPU weighting in practice?
In llama.cpp's CUDA backend (and the equivalent SYCL and Vulkan backends), tensor splitting assigns ranges of transformer layers to different physical GPUs. The runtime initializes by enumerating all available CUDA devices, then either auto-distributes weights according to free VRAM or accepts an explicit ratio from the user via --tensor-split 12,24 (or -ts 12,24).
Per the llama.cpp CUDA backend documentation, the split represents the weight ratio between GPUs — a 12,24 split puts roughly one-third of weights on GPU 0 and two-thirds on GPU 1. For a build with a 3060 12GB plus a 3090 24GB, that's the natural ratio: each card holds the share it can fit, the smaller card runs its share of layers, the larger card runs the rest.
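To make the ratio arithmetic concrete, here is a minimal Python sketch that normalizes a --tensor-split style string into fractions and hands out whole layers in proportion. The function names and the 60-layer model are hypothetical, and the layer assignment is only an illustration of the idea, not llama.cpp's actual allocation code.

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Normalize a ratio string such as "12,24" into fractions that sum to 1."""
    parts = [float(p) for p in tensor_split.split(",")]
    if any(p < 0 for p in parts) or sum(parts) == 0:
        raise ValueError(f"invalid ratio: {tensor_split!r}")
    total = sum(parts)
    return [p / total for p in parts]


def approx_layer_counts(n_layers: int, fractions: list[float]) -> list[int]:
    """Illustrative only: assign whole layers in proportion to each fraction.

    The small epsilon guards against float results like 19.999999999.
    """
    counts = [int(n_layers * f + 1e-9) for f in fractions]
    # Any layers lost to rounding down go to the last device.
    counts[-1] += n_layers - sum(counts)
    return counts


fractions = split_fractions("12,24")
print([round(f, 3) for f in fractions])    # [0.333, 0.667]
print(approx_layer_counts(60, fractions))  # [20, 40]
```

Only the ratio matters: 12,24 and 1,2 normalize to the same fractions.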
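Where the KV cache lives depends on the split mode. Per llama.cpp's own option help, the default --split-mode layer splits layers and KV across GPUs, so each card serves the cache for the layers it holds; with --split-mode row, the main GPU (GPU 0 unless --main-gpu says otherwise) holds the KV cache and intermediate results. This matters more than it sounds: KV cache reads dominate the per-token memory traffic at long context, and in row mode the main GPU's bandwidth caps how fast that cache can be served. The practical implication is that the faster card should be GPU 0. Invert the typical "primary card in the x16 slot" intuition if your other GPU has higher memory bandwidth.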
Does llama.cpp actually split layers between two different GPUs?
Yes — the llama.cpp commit history shows continuous improvement to multi-GPU support since 2024, with explicit tests for mismatched-card configurations. The supported pattern is two or more CUDA devices, automatic per-device VRAM detection, and a runtime that schedules layer execution serially across devices.
What llama.cpp does not do in its default layer split mode is true tensor parallelism in the vLLM sense: it doesn't split a single attention head's matrix across two GPUs simultaneously. Instead, layer N runs entirely on whichever GPU holds its weights, then activations pass over PCIe to the next layer on whichever card holds it. (The --split-mode row option splits individual tensors by rows across GPUs, but layer mode is the default and the focus here.) The advantage is simplicity and compatibility with mismatched cards. The cost is that the cards don't compute in parallel for a single token; they take turns, pipeline-style.
For interactive workloads, pipelining is fine. For batch inference where multiple requests are running concurrently, the runtime can stagger requests across the pipeline and recover some of the parallelism. For a single user generating a single chat turn, each token has to pass through both cards in turn, so throughput lands between the two cards' individual rates and is pulled toward the slower card in proportion to the share of layers it holds.
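One way to reason about the single-user case is a bandwidth-only model: each generated token reads every weight once, and each card reads its own share at its own memory bandwidth, so the per-token time is the sum of the two. The sketch below is a simplification under that assumption (it ignores compute, PCIe hops, and KV cache reads), the function names are hypothetical, and the 360 and 936 GB/s figures are the RTX 3060 and RTX 3090 numbers used in the comparison table in this piece.

```python
def effective_bandwidth(fractions, bandwidths_gbs):
    """Weighted harmonic mean of per-card bandwidth.

    fractions[i] is the share of weights on card i; time per token is the
    sum of each card's share divided by that card's bandwidth.
    """
    if len(fractions) != len(bandwidths_gbs):
        raise ValueError("need one bandwidth per fraction")
    time_per_gb = sum(f / b for f, b in zip(fractions, bandwidths_gbs))
    return 1.0 / time_per_gb


def tokens_per_second_ceiling(model_gb, fractions, bandwidths_gbs):
    """Upper bound on decode speed if memory bandwidth were the only limit."""
    return effective_bandwidth(fractions, bandwidths_gbs) / model_gb


# Even split across a 3060 and a 3090: the plain harmonic mean.
print(round(effective_bandwidth([0.5, 0.5], [360, 936])))    # 520
# One-third / two-thirds split, as with -ts 12,24.
print(round(effective_bandwidth([1/3, 2/3], [360, 936])))    # 610
# Two matched 3060s with a 21 GB model.
print(round(tokens_per_second_ceiling(21, [0.5, 0.5], [360, 360]), 1))  # 17.1
```

Real throughput sits below these ceilings, but the model shows why the slower card's share of the layers matters as much as its raw bandwidth.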
Why don't more inference runtimes support mixed GPUs?
vLLM, TensorRT-LLM, and the major datacenter runtimes target homogeneous GPU pools because that's the dominant deployment context — hyperscalers and research labs buy GPUs in matched-pair racks, with NVLink for fast inter-GPU communication. Per vLLM's distributed serving documentation, the assumption is that tensor parallelism splits attention computations across matched cards with matched bandwidth and matched topology.
Plug a 3060 and a 3090 into vLLM and it'll either reject the configuration outright or assume the lower spec across the board, wasting the 3090's bandwidth and capacity. There's no architectural reason the runtime couldn't support mismatched cards — it's a deliberate scope decision. The community has periodically requested the feature; the response from the vLLM maintainers has consistently been "use llama.cpp for that workload."
That's not a knock on vLLM. It's the right tool for the homogeneous-rack deployment context. For a single user with a heterogeneous pair, llama.cpp remains the better answer in 2026.
Will a PCIe Gen3 x4 slot bottleneck the second card?
For inference, no. Per Puget Systems' multi-GPU LLM inference analysis, the per-token traffic between layers on different GPUs is on the order of single-digit megabytes, with typical activations for a 70B model landing in the 1–4 MB range per layer transition. A Gen3 x4 link provides about 4 GB/s in each direction, which can sustain hundreds of layer transitions per second.
Where PCIe Gen3 x4 hurts is initial weight load. A 20 GB tensor stream from disk-cached pages to GPU VRAM at 4 GB/s takes 5 seconds; the same load at Gen4 x16 (~32 GB/s) takes well under a second. For a single inference session this is a startup cost paid once and forgotten. For workflows that swap between many models, the Gen3 x4 startup penalty compounds.
| Layout | Slot lanes | Effective bandwidth | Weight load 20GB | Per-token transit |
|---|---|---|---|---|
| Both cards in Gen4 x16 | x16 / x16 | ~32 GB/s | ~0.6s | <1 ms |
| Primary x16, secondary x4 Gen3 | x16 / x4-Gen3 | ~32 / 4 GB/s | ~0.6s / 5s | 1–3 ms |
| Both cards in Gen3 x8 | x8 / x8 (Gen3) | ~8 GB/s | ~2.5s | 2–4 ms |
| Primary x16, secondary x4 Gen2 | x16 / x4-Gen2 | ~32 / 2 GB/s | ~0.6s / 10s | 4–8 ms |
The PCIe x4 Gen2 slot — common on cheaper B450 boards — is the configuration to avoid. Per-token transit cost starts to show in interactive throughput at that link speed.
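The bandwidth and weight-load columns above follow from simple arithmetic, sketched here in Python. The per-lane figures are the usual theoretical per-direction rates after encoding overhead, not measurements, and the function names are hypothetical; real transfers land somewhat lower and can also be limited by storage speed.

```python
# Approximate usable GB/s per lane, per direction, by PCIe generation.
PER_LANE_GBS = {2: 0.5, 3: 0.985, 4: 1.969}


def link_bandwidth_gbs(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth of a PCIe link."""
    return PER_LANE_GBS[gen] * lanes


def transfer_seconds(size_gb: float, bandwidth_gbs: float) -> float:
    """Time to move size_gb over a link at the given bandwidth."""
    return size_gb / bandwidth_gbs


links = [("Gen4 x16", 4, 16), ("Gen3 x8", 3, 8), ("Gen3 x4", 3, 4), ("Gen2 x4", 2, 4)]
for label, gen, lanes in links:
    bw = link_bandwidth_gbs(gen, lanes)
    load = transfer_seconds(20, bw)          # 20 GB of weights
    hop_ms = transfer_seconds(0.004, bw) * 1000  # a 4 MB activation payload
    print(f"{label}: {bw:.1f} GB/s, 20 GB load {load:.1f} s, 4 MB hop {hop_ms:.2f} ms")
```

The load times come out close to the table's rounded values, and a single 4 MB hop costs about a millisecond on Gen3 x4 and about two on Gen2 x4.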
Is it cheaper to add a second 3060 12GB instead of a 3090?
| Configuration | Total VRAM | Total bandwidth (effective) | Cost | Llama 3 70B q4 | 34B-class model q4 |
|---|---|---|---|---|---|
| Single RTX 3060 12GB | 12 GB | 360 GB/s | $250–320 | Won't fit | Won't fit without CPU offload (~21 GB of weights) |
| Dual RTX 3060 12GB | 24 GB | ~360 GB/s effective | $500–650 | Won't fit (need 40+GB) | Fits clean, 12–18 tok/s |
| Single RTX 3090 24GB used | 24 GB | 936 GB/s | $700–900 | Tight q3 with offload, 8–12 tok/s | 30–45 tok/s |
| RTX 3060 + RTX 3090 mixed | 36 GB | ~480 GB/s effective | $950–1,200 | Short of the 40+ GB needed at q4; smaller quant or partial CPU offload | 25–35 tok/s |
For Llama 3 70B at q4, the dual-3060 build fails because 24 GB isn't enough. The mixed 3060+3090 build gets closest at consumer prices, but its 36 GB still sits below the 40+ GB that q4 needs, so plan on a smaller quant or on offloading some layers to the CPU. For 34B-class models, the dual-3060 setup is genuinely competitive with the 3090 on capacity and within range on throughput.
The single 3090 wins on raw tok/s for any workload that fits. The mixed-pair wins on capacity-per-dollar above 24 GB. The dual-3060 wins on price-of-entry if you already own one and just need a second VRAM bucket.
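The capacity-per-dollar comparison can be checked with a few lines of Python. This sketch takes the VRAM totals and price ranges from the table above and uses the midpoint of each price range, which is a simplifying assumption; the helper name is hypothetical.

```python
# (total VRAM in GB, low price, high price), taken from the comparison table.
configs = {
    "Dual RTX 3060 12GB": (24, 500, 650),
    "Single RTX 3090 24GB (used)": (24, 700, 900),
    "RTX 3060 + RTX 3090 mixed": (36, 950, 1200),
}


def dollars_per_gb(vram_gb: float, low: float, high: float) -> float:
    """Cost per GB of VRAM, using the midpoint of the price range."""
    return ((low + high) / 2) / vram_gb


for name, (vram, low, high) in configs.items():
    print(f"{name}: ${dollars_per_gb(vram, low, high):.2f} per GB")
# Dual RTX 3060 12GB: $23.96 per GB
# Single RTX 3090 24GB (used): $33.33 per GB
# RTX 3060 + RTX 3090 mixed: $29.86 per GB
```

On these numbers the dual-3060 build is the cheapest per gigabyte, and the mixed pair undercuts a single 3090 while offering 12 GB more.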
Does the AMD Ryzen 7 5800X have enough PCIe lanes for a dual-GPU build?
The AMD Ryzen 7 5800X exposes 20 usable PCIe Gen4 lanes from the CPU plus chipset lanes on AM4. On a B550 board the standard layout is one CPU-connected x16 Gen4 slot and one chipset-connected x4 Gen3 slot. That's enough for a primary-x16 plus secondary-x4 dual-GPU build, with the limitations discussed above.
On X570, some boards expose an x8/x8 Gen4 split when both slots are populated — that's the better layout for dual-GPU inference if you have it. Check the manual for the specific board; the x8/x8 mode is usually labeled as "PCIe bifurcation" or "dual-GPU mode" in the BIOS.
For dual-GPU LLM inference on a B550 platform, the practical recommendation is to put the faster card (or the card that will be GPU 0 with the colocated KV cache) in the primary x16 slot and accept the x4 Gen3 link for the secondary card. The per-token cost is real but small for typical chat workloads.
Worked example: dual RTX 3060 12GB for a 34B-class model
A common 2026 build pattern is a Ryzen 7 5800X with a B550 board, 64 GB DDR4-3600, and two RTX 3060 12GB cards (one in x16 Gen4, one in x4 Gen3). Total VRAM is 24 GB; combined effective bandwidth is roughly 360 GB/s (matched cards, no harmonic-mean penalty).
For a 34B-class model at q4_K_M (roughly 21 GB of weights + 1.5 GB KV cache at 4K context), the math fits. Layer split is -ts 1,1 or auto. Throughput lands in the 12–18 tok/s range — comfortably interactive, materially better than the single-3060-with-offload alternative for 30B+ models.
Per llama.cpp's GitHub discussions, this is the configuration most commonly reported by community testers as the sweet spot for users coming up from a single 12 GB card. Cost-add is one GPU ($250–320), one slot, and roughly 170 W of additional power budget.
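The fit check and launch line for this build can be sketched as follows. This is a rough estimate, not a definitive recipe: the 0.5 GB per-card overhead is an assumed allowance for CUDA context and scratch buffers, the model filename is hypothetical, and the check assumes weights and KV cache are divided between the cards in the same proportion.

```python
import shlex


def per_card_headroom(weights_gb, kv_cache_gb, card_vram_gb, shares, overhead_gb=0.5):
    """Remaining VRAM per card after its share of weights + KV cache.

    overhead_gb is an assumed per-card allowance, not a measured value.
    A negative entry means that card would run out of memory.
    """
    need = weights_gb + kv_cache_gb
    return [vram - (share * need + overhead_gb)
            for vram, share in zip(card_vram_gb, shares)]


# 34B-class model at q4_K_M: ~21 GB weights + ~1.5 GB KV cache at 4K context.
headroom = per_card_headroom(21.0, 1.5, [12, 12], [0.5, 0.5])
print(headroom)  # [0.25, 0.25]

# Launch line with an even split; the .gguf filename is a placeholder.
cmd = ["llama-cli", "-m", "model-34b-q4_k_m.gguf",
       "-ngl", "99", "-ts", "1,1", "-c", "4096"]
print(shlex.join(cmd))
```

With only about a quarter of a gigabyte spare per card under these assumptions, longer contexts are where this build runs out of room first.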
Common pitfalls in heterogeneous-GPU builds
- Forgetting to set CUDA_VISIBLE_DEVICES. If you don't want the runtime to see all cards (because one is dedicated to display), explicitly export CUDA_VISIBLE_DEVICES=0,1 or whichever subset you want llama.cpp to use.
- Putting the slower card as GPU 0. GPU 0 is the default main GPU, and in row-split mode it holds the KV cache and intermediate results; making it the bandwidth-limited card kneecaps throughput. Always have the higher-bandwidth card enumerate first.
- Mismatched driver versions. Two CUDA cards on the same driver is fine. CUDA + ROCm cards in the same box is not supported by llama.cpp in 2026 — pick one ecosystem.
- PSU under-provisioning. Two 170 W cards plus a 105 W CPU plus the rest of the system needs a quality 750–850 W PSU. The 550 W "single 3060" PSU is not enough.
- Forgetting the secondary card's slot type. Some B550 boards' secondary x4 slot only powers up when a primary card is also installed — verify both cards enumerate at boot.
- Ignoring thermals. A second card sitting under the primary card in a typical mid-tower can cook from the primary card's exhaust. Open-air mining-rig style frames or a tower case with strong vertical airflow help.
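The first two pitfalls come down to controlling which cards the process sees and in what order. A minimal Python sketch, assuming the faster card sits at physical index 1 (a hypothetical layout; check nvidia-smi on your own machine), builds an environment in which that card becomes device 0 for the launched process. CUDA_DEVICE_ORDER and CUDA_VISIBLE_DEVICES are standard CUDA environment variables; the helper name is made up for this example.

```python
import os


def llama_env(visible_devices, base_env=None):
    """Return an environment that exposes only the listed GPUs, in that order.

    PCI_BUS_ID ordering makes the indices match what nvidia-smi reports;
    the first listed index becomes device 0 inside the launched process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in visible_devices)
    return env


# Hypothetical layout: physical GPU 1 is the faster card, so list it first.
env = llama_env([1, 0])
print(env["CUDA_VISIBLE_DEVICES"])  # 1,0
# Pass env to the launcher, e.g. subprocess.run(cmd, env=env).
```

Note that --tensor-split ratios follow the order the process sees, so after reordering, the first number applies to the card listed first here.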
When NOT to mix GPUs
If you have the budget for a single 3090 or 4090 and your workload fits in 24 GB, buy that instead. The single-card configuration is simpler, runs at higher throughput on any model that fits, and avoids the layer-split overhead. The mixed-GPU build is for the operator who has a card already and wants to extend capacity cheaply, not for the operator buying from scratch.
If your workload is batch inference for many concurrent users, vLLM on a matched-pair build is the right architecture. Heterogeneous llama.cpp is single-user-friendly but doesn't scale request throughput as cleanly.
Bottom line: when heterogeneous works
Mixed-GPU inference on consumer hardware is a real, supported, useful technique in 2026. llama.cpp's --tensor-split handles mismatched VRAM ratios; PCIe Gen3 x4 is enough bandwidth for the per-token traffic; the Ryzen 7 5800X on a B550 platform exposes the lane count to support a dual-GPU build.
The right shoppers are: existing 3060 12GB owners adding capacity for 30B+ models; existing 3090 owners adding a second card for 70B work; testbench builders comparing card pairings without committing to a matched-pair purchase. The wrong shoppers are: scratch buyers (buy a single 3090 or 4090); batch-inference operators (use vLLM on matched cards); anyone whose model fits cleanly on one GPU (no benefit, just complexity).
Citations and sources
- llama.cpp, CUDA backend documentation: official --tensor-split behavior, multi-GPU enumeration
- Puget Systems, Multi-GPU LLM inference analysis: PCIe link-speed sensitivity, per-token bandwidth requirements
- vLLM, Distributed serving documentation: homogeneous-GPU tensor-parallel architecture, why mixed configs aren't supported
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
