ComfyUI on a 12GB GPU: SDXL and Flux Setup, VRAM Limits, and Real Throughput

SDXL runs comfortably; Flux needs quantization. ComfyUI's memory management is the unlock.

ComfyUI on an RTX 3060 12GB runs SDXL at 14-20s/image and Flux.1 at 30-90s/image with quantized GGUF or fp8 builds. Setup, VRAM limits, and real numbers.

Yes. ComfyUI runs SDXL comfortably on an RTX 3060 12GB at 1024x1024, generating an image in roughly 14-20 seconds with the default sampler settings. Flux.1 needs a quantized GGUF or fp8 build to fit; expect 30-45 seconds per image with Flux.1-schnell and 60-90 seconds with Flux.1-dev. The 12GB of VRAM is the gate; ComfyUI's node-based memory management is what makes both models usable on a budget card.

Why ComfyUI is the right tool on a 12GB card

ComfyUI is the open-source, node-based diffusion frontend most actively maintained for low-VRAM hardware. Its design loads only the parts of a model that a given node needs, swaps weights to system RAM when memory pressure builds, and gives the user direct control over precision and offload behavior. On a 12GB card, that level of control is the difference between models that "just barely fit" and models that fail with an out-of-memory crash. The project's design and active development are tracked on the ComfyUI GitHub.

The card we are sizing for in this guide is the RTX 3060 12GB. Per TechPowerUp's RTX 3060 specifications, it carries 12 GB of GDDR6 on a 192-bit bus with 360 GB/s of bandwidth and 3,584 CUDA cores. That bandwidth and VRAM combination is enough for SDXL and quantized Flux without compromise on output quality.
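The quoted bandwidth follows directly from the bus width. A minimal sketch of that arithmetic, assuming a 15 Gbps per-pin GDDR6 data rate (a figure not stated above, chosen because it reproduces the 360 GB/s spec):

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_gbps


# The 192-bit bus is from the spec quoted above.
# The 15 Gbps per-pin rate is an assumption for illustration.
print(memory_bandwidth_gb_s(192, 15.0))  # 360.0
```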

Key takeaways

  • SDXL runs at full quality on a 3060 12GB at 1024x1024 with 14-20 second generations.
  • Flux.1 requires quantization (GGUF Q4_K_S or NF4) or fp8 to fit on 12GB; throughput drops to roughly one image per 60-90 seconds for Flux.1-dev, or 30-45 seconds for Flux.1-schnell.
  • ComfyUI detects available VRAM and offloads to system RAM on its own; the --lowvram flag makes that offloading more aggressive. Start with the default and add --lowvram only if you OOM.
  • A fast NVMe drive for model storage matters more than people expect: SDXL base, the refiner, and quantized Flux builds are roughly 6-13 GB each, and load times dominate the experience on a hard drive.
  • The CPU does little once a generation starts; a 5800X-class chip is overkill. GPU compute and memory bandwidth are what limit throughput.
  • ControlNet, IP-Adapter, and LoRA stacks add VRAM cost. Plan for 1-2 GB of headroom on top of base model size.

Setup: installing ComfyUI on a 12GB card

Clone the ComfyUI repository from GitHub, create a Python 3.10 virtual environment, and install the CUDA 12.1 build of PyTorch. ComfyUI runs as a single Python process and serves a web UI on port 8188. It detects available VRAM at startup and picks a memory mode on its own, so a 12GB card needs only one extra flag on the launch command:

python main.py --use-pytorch-cross-attention

ComfyUI has no --medvram flag (that name belongs to Automatic1111's webUI); its VRAM modes are set with flags such as --normalvram and --lowvram. The default mode already moves weights that are not in use to system RAM when memory gets tight. Add --lowvram only if you still hit out-of-memory errors, since it offloads more aggressively and slows generation. The --use-pytorch-cross-attention flag selects PyTorch 2.x's built-in memory-efficient attention, which cuts VRAM during sampling by roughly 20%.
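Once the server is up on port 8188, it can be handy to check how much VRAM is actually free before loading a heavy graph. The sketch below assumes the server exposes a /system_stats route returning a "devices" list with "vram_total" and "vram_free" in bytes; treat the route and field names as assumptions to verify against your ComfyUI version. The sample payload is made up for illustration.

```python
import json
from urllib.request import urlopen

GIB = 1024 ** 3


def summarize_vram(stats: dict) -> list[str]:
    """Format one line per device from a /system_stats-style payload."""
    lines = []
    for dev in stats.get("devices", []):
        total = dev["vram_total"] / GIB
        free = dev["vram_free"] / GIB
        lines.append(f"{dev['name']}: {free:.1f} GiB free of {total:.1f} GiB")
    return lines


def fetch_stats(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Query a running ComfyUI server (assumed endpoint: /system_stats)."""
    with urlopen(f"{base_url}/system_stats", timeout=5) as resp:
        return json.load(resp)


# Hypothetical payload, shaped like the assumed response.
sample = {
    "devices": [
        {"name": "cuda:0 NVIDIA GeForce RTX 3060",
         "vram_total": 12 * GIB,
         "vram_free": 9 * GIB}
    ]
}
for line in summarize_vram(sample):
    print(line)
```

Against a live server, pass the result of fetch_stats() to summarize_vram() instead of the sample dictionary.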

For Flux specifically, also install the ComfyUI-GGUF custom node and the relevant quantized Flux checkpoints. Without GGUF, Flux.1-dev at fp16 occupies ~24 GB and will not fit.

Can the RTX 3060 12GB run SDXL in ComfyUI?

Yes, comfortably. SDXL base is roughly 6.6 GB at fp16; the optional refiner is another 6 GB. ComfyUI loads only the active model into VRAM at any moment and swaps the refiner in when the workflow advances to that node. The result is a peak occupancy of around 8-9 GB during sampling, leaving 3 GB for ControlNet, embeddings, and a moderate batch size.

At 1024x1024 with the Euler sampler, 25 steps, and the default CFG, expect 14-20 seconds per image. Switching to DPM++ 2M Karras and 20 steps drops that to roughly 12 seconds. Adding the refiner pass brings the total to 25-35 seconds per finished image.
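To see how step count drives these figures, the ranges can be converted to a per-step cost. This is a rough sketch that assumes time scales linearly with steps and ignores fixed costs such as text encoding and VAE decode; the function names are illustrative only.

```python
def seconds_per_step(total_seconds: float, steps: int) -> float:
    """Average cost of one sampling step, assuming no fixed overhead."""
    return total_seconds / steps


def estimate_seconds(per_step: float, steps: int) -> float:
    """Projected time for a different step count at the same per-step cost."""
    return per_step * steps


# 14-20 s at 25 steps is the SDXL range quoted above.
low = seconds_per_step(14, 25)
high = seconds_per_step(20, 25)
print(f"{low:.2f}-{high:.2f} s/step")
print(f"20 steps: {estimate_seconds(low, 20):.1f}-{estimate_seconds(high, 20):.1f} s")
```

Under that assumption a 20-step run lands between about 11 and 16 seconds, so the roughly 12-second figure sits at the fast end of the range.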

Generation throughput numbers vary across community measurements; the ranges above reflect what the broader Stable Diffusion community has reported on the same card with stock SDXL pipelines.

Will Flux.1 work on a 12GB card?

Flux.1 in fp16 is too large for a 12GB card: the model alone is around 24 GB at that precision. The two practical paths are GGUF quantization and fp8.

GGUF builds of Flux.1-dev and Flux.1-schnell come in Q4_K_S, Q4_K_M, Q5_K_S, Q6_K, and Q8_0 variants, sized 7-13 GB. The Q4_K_S build at ~7 GB fits on a 12GB card with margin and produces quality close to the full model for most prompts. Flux.1-schnell is the speed-tuned distilled variant, documented on the Hugging Face FLUX.1 schnell page, and it produces a usable image in 4 steps where Flux.1-dev needs 20-28. Run Q4_K_S Flux.1-schnell on a 3060 12GB and expect roughly 30-45 seconds per 1024x1024 image at 4 steps. Flux.1-dev on the same hardware, at 20 steps, runs 60-90 seconds.
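The file sizes above can be sanity-checked from parameter count and bits per weight. This is a back-of-the-envelope sketch: it assumes roughly 12 billion parameters for the Flux.1 transformer and uses nominal bits-per-weight values for the GGUF quant types. Both are assumptions, and real files differ a little because some tensors are kept at higher precision.

```python
PARAMS = 12e9  # assumption: Flux.1 transformer has roughly 12 billion parameters

# Approximate bits per weight. The K-quant values are nominal figures
# and are assumptions here, not measured file sizes.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "fp8": 8.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q5_K_S": 5.5,
    "Q4_K_S": 4.5,
}


def size_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{size_gb(PARAMS, bits):.1f} GB")
```

The estimate comes out near 24 GB at fp16 and just under 7 GB at Q4_K_S, which lines up with the 7-13 GB span for the quantized builds.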

fp8 builds (weights stored in a PyTorch float8 dtype) and NF4 builds (a 4-bit NormalFloat quantization from bitsandbytes) also fit on 12GB with similar quality. NF4 in particular is popular because it loads faster on Ampere-class cards.

Real throughput numbers

The numbers below summarize what active Stable Diffusion and Flux users report on a 3060 12GB. Treat them as representative ranges, not promises.

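| Model | Resolution | Steps | Sampler | VRAM peak | Time per image |
| --- | --- | --- | --- | --- | --- |
| SD 1.5 | 512x512 | 20 | Euler a | ~3 GB | 3-5 s |
| SDXL base | 1024x1024 | 25 | Euler | ~8 GB | 14-20 s |
| SDXL base + refiner | 1024x1024 | 25 + 10 | DPM++ 2M | ~9 GB | 25-35 s |
| Flux.1-schnell Q4_K_S | 1024x1024 | 4 | Euler | ~9 GB | 30-45 s |
| Flux.1-dev Q4_K_S | 1024x1024 | 20 | Euler | ~9.5 GB | 60-90 s |
| Flux.1-dev Q6_K | 1024x1024 | 20 | Euler | ~11.5 GB | 75-105 s |
| Flux.1-dev fp8 | 1024x1024 | 20 | Euler | ~11 GB | 70-95 s |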

The trend is clean: SD 1.5 is trivial on this card; SDXL is comfortably interactive; Flux is slower but workable with quantized builds.
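For queue planning, the per-image ranges convert to hourly output as follows. A small sketch using the ranges reported above; it assumes back-to-back generations with no model reloads in between.

```python
# (fastest, slowest) seconds per image, taken from the ranges reported above.
SECONDS_PER_IMAGE = {
    "SD 1.5": (3, 5),
    "SDXL base": (14, 20),
    "SDXL base + refiner": (25, 35),
    "Flux.1-schnell Q4_K_S": (30, 45),
    "Flux.1-dev Q4_K_S": (60, 90),
}


def images_per_hour(seconds_per_image: float) -> float:
    """Sustained output, assuming no idle time between generations."""
    return 3600 / seconds_per_image


for model, (fastest, slowest) in SECONDS_PER_IMAGE.items():
    low = images_per_hour(slowest)
    high = images_per_hour(fastest)
    print(f"{model}: {low:.0f}-{high:.0f} images/hour")
```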

VRAM gotchas on a 12GB card

ComfyUI workflows compound. The base model is only one cost; every additional node that loads weights or activations claims more VRAM.

| Extra | Typical VRAM cost |
| --- | --- |
| ControlNet (one) | ~1.2-1.6 GB |
| ControlNet (stacked, two) | ~2.5-3.0 GB |
| IP-Adapter | ~0.8 GB |
| Multiple LoRAs | ~0.5-1.5 GB each |
| Tile upscale (Ultimate SD Upscale) | ~1.5-3.0 GB |
| Hi-res fix at 1.5x | ~2.0-3.0 GB |

A workflow that runs SDXL base plus refiner plus a ControlNet plus a LoRA can easily push past 12 GB and OOM. ComfyUI's --lowvram mode handles this by aggressively swapping to system RAM, but generation time roughly doubles. The right answer is usually to pare the workflow down to what you actually need rather than fight VRAM pressure.
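A quick way to sanity-check a planned graph is to add the typical costs to the base model's peak. The sketch below uses the figures from this guide and assumes costs simply add up, which ignores ComfyUI's offloading and any overlap between nodes. The dictionary keys and the three verdict labels are made up for illustration.

```python
CARD_GB = 12.0

# Peak occupancy of the base pipeline, from the throughput figures in this guide.
BASE_PEAK_GB = {"sdxl": 8.0, "sdxl+refiner": 9.0, "flux-dev-q4": 9.5}

# (low, high) typical cost per extra, from the list of extras above.
EXTRA_GB = {
    "controlnet": (1.2, 1.6),
    "ip_adapter": (0.8, 0.8),
    "lora": (0.5, 1.5),
    "tile_upscale": (1.5, 3.0),
    "hires_fix": (2.0, 3.0),
}


def estimate_peak(base: str, extras: list[str]) -> tuple[float, float]:
    """Return (optimistic, pessimistic) peak VRAM in GB under an additive model."""
    low = high = BASE_PEAK_GB[base]
    for extra in extras:
        lo, hi = EXTRA_GB[extra]
        low += lo
        high += hi
    return low, high


def verdict(base: str, extras: list[str], card_gb: float = CARD_GB) -> str:
    low, high = estimate_peak(base, extras)
    if high <= card_gb:
        return "fits"
    if low <= card_gb:
        return "borderline"
    return "likely OOM"


print(verdict("sdxl", ["lora"]))
print(verdict("sdxl+refiner", ["controlnet", "lora"]))
print(verdict("flux-dev-q4", ["controlnet", "hires_fix"]))
```

The base-plus-refiner graph with one ControlNet and one LoRA comes out as borderline: the optimistic total stays under 12 GB while the pessimistic one crosses it.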

Does the CPU matter for ComfyUI?

Once a generation begins, the CPU does very little. It handles workflow orchestration, JSON parsing of the pipeline, and feeds prompts to the GPU. An AMD Ryzen 7 5800X is far more than enough — the CPU spends most of its time idle. A faster CPU buys you essentially no throughput improvement; a slower CPU (modern 6-core) is fine too.

What the CPU does affect is workflow load time. Heavy custom-node graphs with many nodes parse faster on a fast chip, but this is a one-time cost per workflow change.

How storage impacts UX

SDXL base is ~6.6 GB, the SDXL refiner is ~6 GB, and Flux.1-dev is roughly 7-24 GB depending on precision (Q4_K_S at the low end, fp16 at the high end). Switching workflows on a hard drive means waiting 30-60 seconds for a model load. On a fast NVMe drive like the WD Blue SN550 1TB NVMe, model loads land in 4-8 seconds. For an interactive workflow, NVMe is not optional.
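Those load times imply an effective read throughput that you can compare against your own drive. A minimal sketch, assuming the load is dominated by reading the ~6.6 GB SDXL base checkpoint from disk:

```python
def implied_throughput_mb_s(size_gb: float, seconds: float) -> float:
    """Effective read speed in MB/s needed to load size_gb in the given time."""
    return size_gb * 1000 / seconds


SDXL_BASE_GB = 6.6  # checkpoint size quoted above

for label, (fast_s, slow_s) in {"hard drive": (30, 60), "NVMe": (4, 8)}.items():
    low = implied_throughput_mb_s(SDXL_BASE_GB, slow_s)
    high = implied_throughput_mb_s(SDXL_BASE_GB, fast_s)
    print(f"{label}: {low:.0f}-{high:.0f} MB/s effective")
```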

Common pitfalls

  • Loading Flux.1-dev at fp16 on a 12GB card. OOMs. Always use Q4_K_S, Q5_K_S, NF4, or fp8 builds.
  • Relying on the default memory mode for heavy graphs. ComfyUI auto-detects VRAM, but stacked workflows can still exceed 12 GB; relaunch with --lowvram if you hit OOM errors.
  • Stacking ControlNets without checking VRAM. Two ControlNets plus SDXL often exceed 11 GB and force --lowvram mode, roughly doubling generation time.
  • Mixing model precisions. Running an fp16 SDXL VAE with an fp8 base model wastes VRAM. Pick a precision and stay consistent.
  • Ignoring the VAE. The SDXL VAE alone is ~330 MB and the fp16 variant is the most stable on the 3060. Do not switch to bf16 unless you have tested it.

When NOT to use ComfyUI on a 12GB card

If your workflow is single-prompt SDXL generation with no node experimentation, Automatic1111's webUI or Forge is faster to set up and just as fast on a 12GB card. ComfyUI's strength is graph-based workflows where you compose ControlNets, IP-Adapters, LoRAs, custom samplers, and post-processing in a single pipeline. For pure quick-image use, the workflow editor is overhead.

If you need Flux at full fp16 quality, no 12GB card will deliver it without quantization. The realistic step up is a 16-24 GB card.

Worked example: an SDXL portrait workflow on a 3060 12GB

Workflow: load SDXL base, apply a face-detail LoRA, generate at 1024x1024 with DPM++ 2M Karras for 25 steps, pass through the SDXL refiner for 10 steps, then upscale 2x with ESRGAN to 2048x2048. On the ZOTAC Gaming GeForce RTX 3060 Twin Edge 12GB with --use-pytorch-cross-attention, this runs at a peak of ~10 GB occupancy in roughly 45-55 seconds per finished image including the upscale. Quality is competitive with paid hosted services for portraits and concept art.

Worked example: Flux.1-schnell quick generations

Workflow: Q4_K_S Flux.1-schnell, 4 steps, 1024x1024, no ControlNet. On the MSI GeForce RTX 3060 Ventus 2X 12G, this runs around 30-45 seconds per image with peak occupancy near 9 GB. The output quality is striking for a 4-step model and is the practical default if you want Flux on this card.

Storage and platform notes

Stable Diffusion model libraries grow fast. Once you collect a few SDXL checkpoints, a Flux build or two, and a handful of ControlNet and LoRA models, you can occupy 100-200 GB. Plan for at least 1 TB on the model drive; the WD Blue SN550 1TB NVMe is a low-cost option that comfortably handles model loads and saves.

CPU choice barely matters for sampling speed. The AMD Ryzen 7 5800X keeps the rest of the system snappy and never bottlenecks ComfyUI workflows.

Sampler choices on the 3060 12GB

Sampler choice meaningfully affects both quality and throughput. Euler and Euler a are the fastest default samplers and produce competent results for SDXL out of the box. DPM++ 2M Karras tends to converge on subjectively cleaner detail at the cost of one or two extra steps. UniPC samplers reduce the step count further but sometimes introduce visible artifacts on complex compositions. On a bandwidth-constrained card like the 3060, every removed step is a real fraction of a second saved per image, so picking a sampler whose convergence profile matches your prompt complexity matters. A common rule of thumb across the community is to start at 25 steps with DPM++ 2M Karras for SDXL portraits, drop to 20 for landscapes, and only push higher when you can see specific quality issues in the output.

LoRA stacking budget

LoRAs are the most common way to add style or subject specificity to a base model. Each one adds VRAM and a small slowdown. On a 3060 12GB, two SDXL LoRAs are comfortable and three is the practical ceiling without dropping resolution or batch size. The order in which LoRAs are applied affects the result; ComfyUI's explicit node graph lets you control that, which is one of the practical reasons users on this card prefer it over webUI-style tools that hide the ordering.

Power and thermals

The RTX 3060 12GB pulls a 170 W TGP. Combined with a 105 W TDP CPU, a typical full system draws 280-340 W under sustained ComfyUI generation. A 650 W 80+ Bronze PSU is adequate; an 80+ Gold supply is the right pick if you plan long generation queues. Thermals are unremarkable — the card sits in the mid-60s C under a sustained queue with a typical dual-fan cooler.
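The PSU recommendation can be checked by expressing the sustained draw as a fraction of the supply's rating. A small sketch using the figures above; the function name is illustrative.

```python
def psu_load_fraction(system_watts: float, psu_watts: float) -> float:
    """Sustained system draw as a fraction of the PSU's rated output."""
    return system_watts / psu_watts


PSU_WATTS = 650
for draw in (280, 340):  # sustained system draw range quoted above
    print(f"{draw} W on a {PSU_WATTS} W PSU: {psu_load_fraction(draw, PSU_WATTS):.0%} load")
```

Both ends of the range sit at roughly half of the supply's rating, which is why a 650 W unit is described as adequate.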

Bottom line

For diffusion work on a 12GB GPU in 2026, ComfyUI plus an RTX 3060 12GB gives you SDXL at full quality with interactive turnaround and a workable Flux.1-schnell or Flux.1-dev experience through quantized GGUF or fp8 builds. The MSI Ventus 2X and ZOTAC Twin Edge variants of the 3060 are functionally identical for this workload, so pick on price and warranty. Pair the GPU with an AMD Ryzen 7 5800X and a WD Blue SN550 1TB NVMe for fast model loads, run ComfyUI with --use-pytorch-cross-attention (adding --lowvram only if you hit out-of-memory errors), and you have a budget diffusion station that holds its own against much pricier hardware on the workloads it actually targets.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

Can the RTX 3060 12GB run SDXL in ComfyUI?
Yes, comfortably. SDXL's base and refiner fit within 12 GB at 1024x1024 with ComfyUI's default memory management, and the node-based pipeline only loads what each step needs. You may want to enable model offloading for very large batch sizes or high-resolution upscales, but for standard single-image SDXL work the 3060 12GB stays inside its VRAM budget without special tricks.

Will Flux.1 work on a 12GB card?
Flux.1 in full precision exceeds 12 GB, but quantized GGUF and fp8 builds of Flux.1-dev and Flux.1-schnell run on a 3060 12GB through ComfyUI's GGUF and weight-streaming nodes. Expect slower per-image times than SDXL and longer initial load, but generation is workable, especially with the schnell variant that needs far fewer steps to produce a usable result.

What low-VRAM settings should I enable in ComfyUI?
ComfyUI auto-detects available VRAM, but launching with the lowvram or normalvram flags, enabling tiled VAE decode, and keeping batch size at one all reduce peak memory on a 12 GB card. Using fp8 or GGUF model variants and avoiding loading multiple large checkpoints in the same graph are the other big levers when you hit out-of-memory errors at higher resolutions.

Does NVMe speed matter for ComfyUI?
It mainly affects how long checkpoints take to load into VRAM, not generation speed once a model is resident. Large SDXL and Flux checkpoints are several gigabytes, so a fast NVMe drive like the WD Blue SN550 noticeably shortens the wait when you switch models mid-session. Once a model is loaded, drive speed is irrelevant and the GPU does all the work.

Is a 12GB GPU enough or should I buy more VRAM?
For SDXL, SD 1.5, and quantized Flux at standard resolutions, 12 GB is enough and the 3060 12GB is the value champion. You only need more VRAM if you plan to run full-precision Flux, very large batch jobs, high-resolution native generation above 1536px, or to train and fine-tune models, where 16 GB or more removes constant memory juggling and speeds iteration.

— SpecPicks Editorial · Last verified 2026-06-09
