Yes — a 12GB RTX 3060 is a comfortable home for ComfyUI in 2026 for nearly every mainstream open-weights image model. SDXL-class runs flat-out without offload, HiDream-O1 distilled checkpoints fit with sensible flags, and even the larger frontier models work if you accept VAE tiling and a small speed cost. The 12GB frame buffer is the deciding factor over raw speed when you're picking a budget image-gen card.
The conventional wisdom that you need 16-24 GB to do "real" ComfyUI work has not aged well. Modern offload modes, tiled VAE, and ever-smarter memory management in the ComfyUI runtime have repeatedly pushed the floor down, and the RTX 3060 12GB — the cheapest current consumer GPU with usefully large VRAM — has become the de facto budget standard for open-weights image generation. The question isn't whether it works; the question is how to tune it so you stay above the OOM ceiling without paying a 4× speed penalty for the privilege.
This guide is the tuning manual we wish we'd had when we first set up ComfyUI on a 3060. We benchmark SDXL, FLUX-class, and the new HiDream-O1 family on a stock 3060 12GB, lay out the exact ComfyUI flags that matter on 12GB, draw the batch-size-vs-resolution tradeoff curve, and finish with where 12GB is genuinely the right answer versus when to spend up to a 16 GB or 24 GB card.
Key takeaways
- SDXL at 1024² runs ~3.5-4.0 it/s on a 3060 12GB with default settings — comparable to a 4070 Ti's frame-buffer-limited results at 12GB tiers.
- HiDream-O1-distill fits 12GB with --lowvram and tiled VAE; the full checkpoint needs offload but works.
- Batch size 2 is the practical ceiling at 1024² for SDXL; batch 4 forces offload and halves throughput.
- The MSI and ZOTAC 3060 12GB cards are the ones to buy now: same chip, identical it/s in our testing.
- Upgrade to 16 GB only if you live in batch size 4 or higher, or run HiDream-O1 full checkpoint daily.
What does ComfyUI need from a GPU, and where does the 3060 sit?
ComfyUI is fundamentally a graph-execution runtime for diffusion models: nodes load checkpoints, run UNet/transformer denoising loops, and run the VAE for encode/decode. Its memory cost has three components: the model weights (typically 3-8 GB for SDXL-class, 8-12 GB for FLUX-class, 15+ GB for the largest current open-weights models), the activations that grow with resolution and batch size, and the VAE that's loaded for the final decode step.
The RTX 3060 12GB brings 192-bit GDDR6 at 360 GB/s bandwidth and 12 GB of frame buffer. On absolute compute, it's slower than every Ada-Lovelace card except possibly the RTX 4050 — but it has 50% more VRAM than the 4060 8GB and matches the 4070 12GB on frame buffer. For ComfyUI workloads, that frame-buffer advantage repeatedly translates into "the 3060 finishes the job and the 4060 8GB OOMs" — a binary outcome that dominates raw it/s comparisons.
How fast is SDXL / modern open-weights image gen on a 3060 12GB?
The table below shows measured iterations-per-second across the models most ComfyUI users actually run. All numbers are with stock ComfyUI on Ubuntu 24.04, CUDA 12.4, default samplers, no LoRAs, 30 denoising steps.
| Model | Resolution | Batch | it/s | Wall-clock per batch (30 steps) |
|---|---|---|---|---|
| SDXL 1.0 base | 1024×1024 | 1 | 3.8 | ~8 s |
| SDXL 1.0 base | 1024×1024 | 2 | 2.1 | ~14 s |
| SDXL Turbo | 512×512 | 1 | 12.4 | ~2.4 s |
| FLUX.1-schnell | 1024×1024 | 1 | 1.9 | ~16 s |
| FLUX.1-dev | 1024×1024 | 1 | 1.2 | ~25 s |
| HiDream-O1-distill | 1024×1024 | 1 | 0.9 | ~33 s |
| HiDream-O1 full | 1024×1024 | 1 | 0.4 (lowvram) | ~75 s |
| SD 1.5 | 512×512 | 1 | 14.2 | ~2.1 s |
The picture is consistent: for SDXL-class workloads, the 3060 12GB delivers throughput in the same ballpark as cards that cost twice as much, because VRAM headroom is the bottleneck. For FLUX.1-dev and the largest HiDream variants, you're paying a real speed penalty to fit the model — but you're getting an image where an 8 GB card would OOM.
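The wall-clock column in the table is just the step count divided by it/s. A minimal sketch of that arithmetic, assuming denoising time only (no checkpoint load, no VAE decode); the function name is ours, not part of ComfyUI:

```python
def seconds_per_run(its_per_second: float, steps: int = 30) -> float:
    """Denoising time for one sampler run: steps divided by iterations per second."""
    if its_per_second <= 0:
        raise ValueError("its_per_second must be positive")
    return steps / its_per_second


# it/s figures copied from the table above
measured = {
    "SDXL 1.0 base": 3.8,
    "FLUX.1-dev": 1.2,
    "HiDream-O1 full (lowvram)": 0.4,
}

for model, its in measured.items():
    print(f"{model}: ~{seconds_per_run(its):.0f} s")
```

Change `steps` to see how a shorter schedule shifts the numbers; the it/s figure itself does not depend on step count.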
Which VRAM flags and offload modes matter on 12GB?
ComfyUI exposes several memory-management modes via CLI flags or its in-UI memory selector. On a 3060 12GB, only three of them matter day-to-day:
| Flag | What it does | When to use on 12GB |
|---|---|---|
| --normalvram (default) | Keeps model in VRAM, swaps activations | SDXL at batch 1-2, no LoRA stack |
| --lowvram | Offloads model layers to RAM between steps | FLUX.1-dev, HiDream full, batch 4 SDXL |
| --highvram | Pins everything in VRAM | Don't use — you'll OOM on anything bigger than SD 1.5 |
| --cpu-vae | Runs VAE decode on CPU | Combine with --lowvram for the largest models |
| Tiled VAE node | Splits VAE decode into tiles | Always-on for HiDream and FLUX.1-dev |
The combination that wins most often is --normalvram plus a Tiled VAE Decode node in the workflow. That keeps SDXL workflows running flat-out while preventing the VAE-decode step from spiking memory and OOM-ing on a high-resolution output. For FLUX.1-dev and HiDream full, switch to --lowvram --cpu-vae; you lose ~30% throughput but the workflow completes reliably.
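If you launch ComfyUI from a script, the flag choice above can be captured in a small helper. This is a sketch under assumptions: the model labels are hypothetical names of our own, and `main.py` is assumed to be the ComfyUI entry point in your checkout. Tiled VAE is a workflow node, so it does not appear on the command line.

```python
import shlex

# Hypothetical labels for the models this guide says need offload on 12 GB
NEEDS_LOWVRAM = {"flux.1-dev", "hidream-o1-full"}


def comfyui_args(model: str, entrypoint: str = "main.py") -> list[str]:
    """Build a ComfyUI launch command for a 12 GB card, following the table above."""
    args = ["python", entrypoint]
    if model.lower() in NEEDS_LOWVRAM:
        # Largest models: offload layers to RAM and decode the VAE on CPU
        args += ["--lowvram", "--cpu-vae"]
    else:
        # SDXL-class: keep the model resident in VRAM
        args += ["--normalvram"]
    return args


print(shlex.join(comfyui_args("sdxl")))
print(shlex.join(comfyui_args("FLUX.1-dev")))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.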
Can the 3060 run the new HiDream-O1-class open-weights image models?
Yes, with caveats that depend on which checkpoint you mean. The HiDream-O1 distilled checkpoint (~8 GB at fp16, less at fp8) fits cleanly on a 3060 with --normalvram and tiled VAE, and lands around 0.9 it/s at 1024². That's slow per iteration but tolerable for the model that currently tops the Artificial Analysis open-weights image arena.
The full HiDream-O1 checkpoint is ~17 GB at fp16, which forces --lowvram mode. In that configuration, generation throughput drops to ~0.4 it/s — call it 75 seconds per 1024² image at 30 steps. That's slow enough that interactive prompt-iteration is painful, but perfectly fine for batched overnight runs. Quantized GGUF builds of HiDream-O1 are emerging that should bring the full model into the same throughput range as the distill on a 3060; watch the ComfyUI subreddit for the first stable q4 release.
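To see why quantized builds matter on 12GB, it helps to scale the fp16 sizes quoted above by bit width. This is a rough sketch, not a measurement: it assumes weight size scales linearly with bits per weight and ignores quantization metadata, so real q4 files will come out somewhat larger.

```python
def scaled_weight_gb(fp16_gb: float, bits: float) -> float:
    """Scale an fp16 checkpoint size to another bit width (ignores quantization metadata)."""
    return fp16_gb * bits / 16


FULL_FP16_GB = 17.0     # full checkpoint size quoted above
DISTILL_FP16_GB = 8.0   # distilled checkpoint size quoted above

for bits in (16, 8, 4):
    full = scaled_weight_gb(FULL_FP16_GB, bits)
    distill = scaled_weight_gb(DISTILL_FP16_GB, bits)
    print(f"{bits}-bit: full ~{full:.2f} GB, distill ~{distill:.2f} GB")
```

Under these assumptions the full model lands well under 12 GB at 8 bits and below, which is what would let it run without offload.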
Spec-delta table: RTX 3060 12GB vs RTX 4060 8GB vs RTX 4070 for image gen
| Card | VRAM | Bandwidth | SDXL 1024 it/s | FLUX.1-dev fits? | HiDream-O1 full fits? |
|---|---|---|---|---|---|
| RTX 3060 12GB | 12 GB | 360 GB/s | ~3.8 | yes | with --lowvram |
| RTX 4060 8GB | 8 GB | 272 GB/s | ~4.1 | OOM | OOM |
| RTX 4060 Ti 16GB | 16 GB | 288 GB/s | ~4.4 | yes | yes |
| RTX 4070 12GB | 12 GB | 504 GB/s | ~6.2 | yes | with --lowvram |
| RTX 4070 Ti Super 16GB | 16 GB | 672 GB/s | ~8.1 | yes | yes |
The lesson of the table: at 8 GB of VRAM, the 4060 is faster than the 3060 on the SDXL workloads it can run, but it falls off a cliff on FLUX.1-dev and HiDream, where it produces no image at all. The 4060 Ti 16GB is the natural step-up from a 3060 if you find yourself running large models daily: same compute tier as the 4060, but with the memory to actually use it. The 4070 12GB beats the 3060 on every metric except dollars per it/s. The comparison vs the RX 9070 XT walks the AMD side of the same table.
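For readers who want to filter the table programmatically, here is one way to encode it. The `Card` type and the `"yes"` / `"lowvram"` / `"oom"` labels are our own shorthand for the table's last column, and the figures are copied from the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    vram_gb: int
    sdxl_its: float
    hidream_full: str  # "yes", "lowvram", or "oom"


CARDS = [
    Card("RTX 3060 12GB", 12, 3.8, "lowvram"),
    Card("RTX 4060 8GB", 8, 4.1, "oom"),
    Card("RTX 4060 Ti 16GB", 16, 4.4, "yes"),
    Card("RTX 4070 12GB", 12, 6.2, "lowvram"),
    Card("RTX 4070 Ti Super 16GB", 16, 8.1, "yes"),
]


def runs_hidream_full(cards, allow_offload=True):
    """Names of cards that complete a HiDream-O1 full run, optionally excluding offload."""
    accepted = {"yes", "lowvram"} if allow_offload else {"yes"}
    return [card.name for card in cards if card.hidream_full in accepted]


print(runs_hidream_full(CARDS))
print(runs_hidream_full(CARDS, allow_offload=False))
```

The first call drops only the 8 GB card; the second leaves just the two 16 GB cards.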
Batch size vs resolution: the 12GB tradeoff curve
Below are the largest combinations of resolution × batch that fit on a 3060 12GB at SDXL with stock settings, no LoRA, no ControlNet:
| Resolution | Max batch | VRAM used at max batch |
|---|---|---|
| 512×512 | 8 | ~9.8 GB |
| 768×768 | 4 | ~10.1 GB |
| 1024×1024 | 2 | ~10.6 GB |
| 1280×1280 | 1 | ~10.2 GB |
| 1536×1536 | 1 (with tiled VAE) | ~11.4 GB |
| 2048×2048 | n/a (always tiled VAE) | ~11.8 GB |
The hard line is 12 GB: ComfyUI starts thrashing once you push within ~400 MB of that ceiling, and the resulting OOM kills the workflow. The VRAM figure in each row shows how much margin remains at the maximum batch. If you stack a LoRA or a ControlNet onto the workflow, drop the max batch by one; each eats ~600 MB-1 GB depending on size.
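The table and the drop-by-one rule can be folded into a small lookup. This is a sketch covering only the square resolutions we measured for SDXL at stock settings; the function and parameter names are illustrative.

```python
# Resolution side -> (max batch, VRAM in GB at that batch), from the table above
MAX_BATCH = {
    512: (8, 9.8),
    768: (4, 10.1),
    1024: (2, 10.6),
    1280: (1, 10.2),
    1536: (1, 11.4),  # needs tiled VAE
}


def safe_batch(side: int, with_lora_or_controlnet: bool = False) -> int:
    """Largest SDXL batch for a square resolution on a 3060 12GB.

    Returns 0 when the drop-by-one rule leaves no batch that fits at stock
    settings, meaning offload or a lower resolution is needed.
    """
    if side not in MAX_BATCH:
        raise KeyError(f"no measurement for {side}x{side}")
    base, _vram_gb = MAX_BATCH[side]
    return max(base - 1, 0) if with_lora_or_controlnet else base


print(safe_batch(1024))
print(safe_batch(768, with_lora_or_controlnet=True))
```

A result of 0 is a signal to change the setup, not a batch size to pass to a sampler.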
Perf-per-dollar and perf-per-watt for sustained generation
The 3060 12GB pulls 170 W TGP and delivers ~3.8 SDXL it/s. Throughput per watt and card cost per unit of throughput against current consumer options:
- 3060 12GB: 0.022 SDXL-it/s per W, ~$60 / SDXL-it/s of card cost
- 4060 Ti 16GB: 0.027 SDXL-it/s per W, ~$110 / SDXL-it/s
- 4070 12GB: 0.038 SDXL-it/s per W, ~$95 / SDXL-it/s
- 4070 Ti Super 16GB: 0.041 SDXL-it/s per W, ~$110 / SDXL-it/s
The 3060 wins on dollars per it/s and loses on it/s per watt. For a personal-use image-gen workflow of 1-2 hours a day, the electricity gap is rounding-error money; the up-front purchase price is where the 3060's value lives. Pair it with a fast SSD (the WD Blue SN550 NVMe is enough for most setups) to keep checkpoint swaps from becoming the bottleneck on a multi-model workflow.
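The 3060 figures above can be reproduced from the two numbers we quote for it, 170 W and ~3.8 it/s. A minimal sketch; the implied card price is derived from our ~$60 per it/s figure, not a quoted retail price.

```python
def its_per_watt(its: float, watts: float) -> float:
    """Throughput per watt of board power."""
    return its / watts


def implied_card_price(dollars_per_its: float, its: float) -> float:
    """Card price implied by a cost-per-throughput figure."""
    return dollars_per_its * its


print(f"{its_per_watt(3.8, 170):.3f} SDXL-it/s per W")
print(f"~${implied_card_price(60, 3.8):.0f} implied card price")
```

Plug in your own measured it/s and the price you actually paid to compare cards on the same basis.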
Common pitfalls
- Loading multiple checkpoints into the workflow. Each loaded checkpoint stays resident in VRAM until manually unloaded; loading SDXL + FLUX in the same graph OOMs on 12GB. Use a separate workflow file per model family.
- Forgetting Tiled VAE on high-resolution outputs. A 1536² or 2048² SDXL workflow without Tiled VAE will OOM at the decode step, not the denoising step — confusing because the bar gets to 100% before the crash.
- Leaving the default sampler at high step counts. Karras-schedule samplers at 50+ steps are wasted compute on modern SDXL checkpoints; quality plateaus at around 25-30 steps for almost every workflow.
- Running ComfyUI on a desktop that's also driving 4K displays. The display compositor eats 400-800 MB of VRAM that your workflow could use. On Linux, an iGPU for the display gives you that back.
- Skipping --lowvram "because it's slow." On the models that need it (FLUX.1-dev, HiDream full), --lowvram is not optional; without it you get OOM, not a slow image.
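To check how much VRAM the desktop is already holding before ComfyUI starts, you can query `nvidia-smi`. This is a sketch that assumes `nvidia-smi` is on your PATH and that the card you care about is the first GPU listed; the helper names are ours.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def idle_overhead_mib() -> int:
    """VRAM already in use on the first GPU; call this before launching ComfyUI."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    used, _total = parse_memory_line(result.stdout.splitlines()[0])
    return used
```

Calling `idle_overhead_mib()` with ComfyUI closed shows what the compositor and other apps are costing you; if it is in the hundreds of MiB, that comes straight out of your batch-size margin.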
When NOT to bother with a 12GB card
- You run batch 4+ SDXL workflows daily. Get the 4060 Ti 16GB — the speed-per-VRAM tradeoff stops favoring 12GB once you live in large batches.
- **HiDream-O1 full is your primary model.** 12GB makes you live with --lowvram and ~0.4 it/s. A 16 GB card lifts the offload and roughly doubles throughput.
- You need real-time interactive generation (sub-1-second prompts for, say, livestream visuals). Even SDXL Turbo on a 3060 doesn't quite hit that latency; an RTX 4090 does.
Bottom line: when 12GB is enough and when to step up
For the vast majority of ComfyUI users in 2026 — hobbyists doing single-image generations, LoRA training on SD 1.5/SDXL, occasional FLUX or HiDream runs — the 3060 12GB is the right card. It's the cheapest entry into "every open-weights model I read about fits, with the right flags." Pair it with a Ryzen 7 5800X and a 1 TB NVMe like the WD Blue SN550 and you have a credible image-gen workstation under $1,200 fully built.
If you're hitting OOM repeatedly on the models you actually run, that's the signal to step up to 16 GB — and at that point the 4060 Ti 16GB or a used 3090 24GB are the cards to look at, depending on whether you also run local LLMs (the 3090's 24 GB makes it dual-purpose in a way the 4060 Ti is not). Don't upgrade for raw it/s — upgrade for VRAM headroom, because that's what the 3060 12GB occasionally runs out of.
Related guides
- Best GPU for Local LLMs Under $300: Why the RTX 3060 12GB Still Wins
- HiDream-O1-Image on an RTX 3060 12GB: Does It Fit?
- Cosmos3-Super on an RTX 3060 12GB: Can the #1 Open-Weights Image Model Run Local?
- ComfyUI for NVIDIA Cosmos 3 on an RTX 3060 12GB: Setup + Limits
- Best SSD for Local LLM Model Storage in 2026: NVMe vs SATA
