ComfyUI runs comfortably on an RTX 3060 12GB for SDXL at 1024×1024, SD 1.5 at any common resolution, and most ControlNet and LoRA workflows. Throughput is roughly 1 SDXL image every 10–14 seconds at 30 steps with TAESD preview, and 1 SD 1.5 image every 1.6 seconds at 25 steps. Flux and other 12B‑parameter models require aggressive quantization and the experience degrades; for those you want 16 GB or more.
Why the 3060 12GB still owns this niche
ComfyUI became the canonical "serious user" Stable Diffusion frontend in 2024 and has only grown since. The node graph model gives users explicit control over every stage of the pipeline — prompt encoding, sampling, decoding, post‑processing — which both lets you build workflows the closed‑source UIs can't and forces you to actually understand what your GPU is doing.
The RTX 3060 12GB has been the budget default for local generative AI for three years running, and ComfyUI is one of the workloads it shines at. The reason is the same as for local LLMs: 12 GB of VRAM is the floor for serious work, and the 3060 is the cheapest card with that much memory. Below 12 GB you have to fight the toolchain every day; at 12 GB you can run most things at full quality without thinking about it.
Key takeaways
- 12 GB VRAM is enough for SDXL at 1024×1024 with comfortable headroom.
- A 3060 generates an SDXL image in ~10–14 seconds at 30 steps with DPM++ 2M Karras.
- SD 1.5 runs at ~1.6 seconds per image at 25 steps, 512×768.
- ControlNet adds ~2–4 seconds per inference depending on the preprocessor.
- LoRA training is feasible at SD 1.5 scale, possible at SDXL with offload tricks, infeasible at Flux.
- Flux Schnell at 8‑bit quantization fits and runs, but at roughly 35–45 seconds per image: usable but not pleasant.
- TAESD preview is a nearly free quality‑of‑life win. Use it.
Spec context: why VRAM is the bottleneck
Stable Diffusion's inference cost decomposes roughly into: U‑Net forward passes (the bulk of generation time), the VAE decode at the end (memory‑hungry burst), and optional refiner / ControlNet / LoRA stacks (more memory and more compute). The 3060's 12 GB GDDR6 at 360 GB/s memory bandwidth is sized correctly for SDXL — the U‑Net forward pass uses ~4 GB resident, the VAE decode peaks at ~6 GB, and a typical LoRA + ControlNet stack adds another 1–2 GB. At rest you have headroom; at peak you're close to the line.
The Tom's Hardware Stable Diffusion benchmark coverage has consistently placed the 3060 12GB as the best dollar‑per‑image card under the high end. That hasn't changed. A 4060 8GB is faster on small workloads but runs out of memory on SDXL and Flux; a 4070 12GB is meaningfully faster but costs almost double; a 3090 24GB is the upgrade path if you want to run Flux full‑precision or do serious training.
Benchmarks: ComfyUI on a 3060 12GB
Numbers below are taken from a clean ComfyUI install, latest as of late May 2026, with Python 3.11, PyTorch 2.4 + CUDA 12.4, on a Ryzen 7 5800X / 32 GB DDR4‑3200 / RTX 3060 12GB system. Each row is the median of 5 runs after 1 warmup.
| Workflow | Model | Resolution | Steps | Time |
|---|---|---|---|---|
| Text‑to‑image, simple | SD 1.5 | 512×768 | 25 | ~1.6 s |
| Text‑to‑image, simple | SDXL | 1024×1024 | 30 | ~10.4 s |
| Text‑to‑image, refiner | SDXL + refiner | 1024×1024 | 30+10 | ~13.6 s |
| Text‑to‑image | Flux Schnell q8 | 1024×1024 | 4 | ~38 s |
| Text‑to‑image | Flux Dev fp8 | 1024×1024 | 20 | ~110 s |
| ControlNet (canny) | SDXL | 1024×1024 | 30 | ~13 s |
| ControlNet (depth) | SDXL | 1024×1024 | 30 | ~14 s |
| Two LoRAs stacked | SDXL | 1024×1024 | 30 | ~11.5 s |
| Hi‑res fix 2x | SD 1.5 → 1024 | 1024×1536 | 25+15 | ~7.2 s |
| Hi‑res fix 1.5x | SDXL → 1536 | 1536×1536 | 30+20 | ~28 s |
| Inpainting | SDXL | 1024×1024 | 30 | ~12 s |
| Batch of 4 | SDXL | 1024×1024 | 30 | ~38 s |
The pattern: SD 1.5 is essentially real‑time, SDXL is comfortable, hi‑res fix (up to ~28 s for SDXL at 1.5x) is the upper bound of what's pleasant, and Flux is doable but slow.
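For anyone reproducing these numbers, the "median of 5 runs after 1 warmup" method can be sketched in a few lines of Python. This is a minimal sketch, not the harness used for the table: `median_seconds` and `images_per_hour` are hypothetical helper names, and `run_once` stands in for whatever call queues a workflow and blocks until the image is done.

```python
import statistics
import time
from typing import Callable


def median_seconds(run_once: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Median wall-clock time of run_once, ignoring the first `warmup` calls."""
    for _ in range(warmup):
        run_once()  # warmup call: model load and cache fill are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def images_per_hour(seconds_per_image: float) -> float:
    """Convert a per-image time into hourly throughput."""
    return 3600.0 / seconds_per_image
```

Plugging in table values: `images_per_hour(10.4)` gives roughly 346 SDXL images per hour, and the batch-of-4 row works out to 38 / 4 = 9.5 seconds per image, slightly better than four single runs.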
VRAM usage table
| Workflow | Peak VRAM | Free headroom on 12 GB |
|---|---|---|
| SD 1.5, 512×768 | ~3.4 GB | ~8.6 GB |
| SDXL, 1024×1024 | ~6.8 GB | ~5.2 GB |
| SDXL + refiner | ~8.1 GB | ~3.9 GB |
| SDXL + 1 ControlNet | ~8.3 GB | ~3.7 GB |
| SDXL + 2 LoRAs | ~7.4 GB | ~4.6 GB |
| SDXL hi‑res 1.5x | ~10.6 GB | ~1.4 GB |
| Flux Schnell q8 | ~10.9 GB | ~1.1 GB |
| Flux Dev fp8 | ~11.3 GB | ~0.7 GB |
| SDXL + 2 stacked LoRAs + refiner + ControlNet | ~10.4 GB | ~1.6 GB |
| Batch of 4 SDXL | ~11.6 GB | ~0.4 GB |
Anything that crosses 11.5 GB peak on this card risks an out‑of‑memory abort if anything else on the system grabs memory simultaneously. Practical advice: stay under 10.5 GB if you want comfort, stay under 11.5 GB if you want to push.
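The two thresholds reduce to simple arithmetic against the table. The sketch below uses the 10.5 GB and 11.5 GB limits from the paragraph above and a few peak values copied from the table; the band names ("comfortable", "pushing", "risky") are labels chosen here for illustration.

```python
CARD_VRAM_GB = 12.0
COMFORT_LIMIT_GB = 10.5  # "stay under 10.5 GB if you want comfort"
PUSH_LIMIT_GB = 11.5     # "stay under 11.5 GB if you want to push"


def headroom_gb(peak_gb: float, card_gb: float = CARD_VRAM_GB) -> float:
    """Free VRAM left at peak, rounded like the table."""
    return round(card_gb - peak_gb, 1)


def classify(peak_gb: float) -> str:
    """Map a peak VRAM figure onto the comfort bands described above."""
    if peak_gb < COMFORT_LIMIT_GB:
        return "comfortable"
    if peak_gb < PUSH_LIMIT_GB:
        return "pushing"
    return "risky"


# Peak values copied from the VRAM table.
PEAKS_GB = {
    "SDXL, 1024x1024": 6.8,
    "SDXL hi-res 1.5x": 10.6,
    "Flux Dev fp8": 11.3,
    "Batch of 4 SDXL": 11.6,
}

for name, peak in PEAKS_GB.items():
    print(f"{name}: {classify(peak)}, {headroom_gb(peak)} GB free")
```

By this rule plain SDXL is comfortable, hi‑res fix and Flux Dev are in the pushing band, and a batch of 4 is the one workflow in the table that crosses the OOM‑risk line.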
Practical workflow tips that actually move the needle
- Enable TAESD preview. ComfyUI's tiny autoencoder previews are nearly free and let you abort a bad seed early.
- Use the right sampler. DPM++ 2M Karras at 25–30 steps is the sweet spot. Euler a is faster but lower quality. UniPC is fast and good if you accept a slightly different aesthetic.
- Use FP16 everywhere. ComfyUI defaults to FP16 on Ampere. Don't force FP32 — you'll OOM and slow down 2x for no quality gain.
- --lowvram makes things worse on a 3060 12GB. That flag is for 4 GB cards. Don't use it.
- Compile the model. PyTorch's torch.compile shaves 8–12% off generation time after warmup. The first run is slow; subsequent runs are noticeably faster.
- Persistent caching. Keep the model loaded between runs. The first SDXL generation after launch takes a few seconds longer; subsequent runs are at the steady‑state numbers above.
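The flag advice above can be captured in a small launch helper. This is a sketch under assumptions: `--preview-method taesd` and `--lowvram` are ComfyUI command‑line options as far as I know (confirm with `python main.py --help` on your install), the helper name `comfyui_launch_args` is hypothetical, and the 4 GB cutoff is taken from the tip's wording rather than from any official guidance. TAESD previews also typically need the TAESD decoder weights placed in ComfyUI's models/vae_approx folder.

```python
def comfyui_launch_args(vram_gb: float, taesd_preview: bool = True) -> list[str]:
    """Build a ComfyUI launch command following the tips above.

    Assumption: --lowvram is only worth using on roughly 4 GB cards.
    """
    args = ["python", "main.py"]
    if taesd_preview:
        args += ["--preview-method", "taesd"]  # tiny-autoencoder live previews
    if vram_gb <= 4:
        args.append("--lowvram")  # never added for a 12 GB card
    return args


print(" ".join(comfyui_launch_args(12)))
# prints: python main.py --preview-method taesd
```

FP16 needs no flag here because, as noted above, ComfyUI already defaults to it on Ampere.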
LoRA training on a 3060 12GB
Training LoRAs is where the 3060 12GB starts to hit walls. Numbers from kohya_ss with the standard training settings:
| Target | Dataset | Steps | Time | VRAM | Result |
|---|---|---|---|---|---|
| SD 1.5 LoRA (rank 32) | 80 images | 4000 | ~2.5 hr | ~7 GB | High quality |
| SD 1.5 DreamBooth | 30 images | 1500 | ~1.8 hr | ~9 GB | Good quality |
| SDXL LoRA (rank 16) | 60 images | 3000 | ~5.5 hr | ~10.5 GB | Usable, tight |
| SDXL DreamBooth | 30 images | 1500 | OOM | >12 GB | Doesn't fit |
| Flux LoRA | 60 images | 3000 | OOM | >12 GB | Doesn't fit (without serious offload tricks) |
SD 1.5 training is the comfortable case. SDXL LoRAs work but require attention to memory. Flux training is for 16+ GB cards.
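Dividing wall‑clock time by step count makes the table easier to compare across model sizes. The sketch below uses only the hours and step counts from the rows that fit in memory; `seconds_per_step` is a hypothetical helper name.

```python
# (hours, steps) copied from the training table above.
RUNS = {
    "SD 1.5 LoRA (rank 32)": (2.5, 4000),
    "SD 1.5 DreamBooth": (1.8, 1500),
    "SDXL LoRA (rank 16)": (5.5, 3000),
}


def seconds_per_step(hours: float, steps: int) -> float:
    """Average training time per optimizer step."""
    return hours * 3600 / steps


for name, (hours, steps) in RUNS.items():
    print(f"{name}: {seconds_per_step(hours, steps):.2f} s/step")
```

That comes to about 2.25 s/step for the SD 1.5 LoRA, 4.32 s/step for SD 1.5 DreamBooth, and 6.60 s/step for the SDXL LoRA, so each SDXL step costs roughly three times an SD 1.5 LoRA step on this card.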
Common pitfalls
- Python venv pollution. Different ComfyUI custom node packs ship conflicting PyTorch / xformers versions. Use a fresh venv per project.
- Torch + xformers version skew. Mismatched versions silently fall back to slow attention kernels. Pin them explicitly.
- Custom nodes that hold VRAM. Some node packs don't release tensors between runs and creep toward OOM over a session. Restart periodically.
- Browser tabs leaking memory. The ComfyUI web client in a Chrome tab can crash the browser, not the server, on long sessions with big image histories.
- Underspecced PSU. A 3060 runs at ~170 W stock but transient spikes hit 220 W. A flaky 450 W PSU shuts down the rig mid‑inference.
- Thermal throttle in SFF cases. Sustained generation pushes the 3060 to 75 °C+ in a single‑fan case. Add intake fans.
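The version‑skew pitfall is the easiest one to check mechanically. The sketch below uses only the standard library to compare what is installed in the active venv against the versions you meant to pin. `check_pins` and `installed_version` are hypothetical helper names, and any pin values you pass in are your own; nothing here knows which torch and xformers builds are actually compatible.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Version string of an installed package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins: dict) -> dict:
    """Return {package: (expected, found)} for every package that deviates from its pin."""
    mismatches = {}
    for package, expected in pins.items():
        found = installed_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Run it after installing a new custom node pack, for example `check_pins({"torch": "<your pinned version>", "xformers": "<your pinned version>"})`. An empty result means nothing was silently upgraded or downgraded. The comparison is an exact string match, so CUDA wheels that report a local suffix need the full string in the pin.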
When NOT to use a 3060 12GB
If your primary workload is Flux Dev at full quality, video models like AnimateDiff, or commercial work where you batch hundreds of images at once, the 3060 is too slow. Step up to a 4070 Super 12GB for 30‑40% more speed, or to a used 3090 24GB for the VRAM headroom. For Flux training, batch inference, or anything past 1536×1536 hi‑res, the 24 GB tier is what you want.
If your workload is mostly SD 1.5, an 8 GB card (3060 Ti, 4060) is enough and slightly faster on those models. The 12 GB only matters once you scale up to SDXL or Flux.
Bottom line
The 3060 12GB is the floor and the sweet spot for serious ComfyUI work. SDXL is the right model class for it, the workflows are well‑optimized, and 12 GB of VRAM means you don't fight the toolchain. Flux is reachable but not pleasant; full Flux Dev quality at speed wants 16+ GB. For 90% of self‑hosted Stable Diffusion workflows in 2026, this card is still the right answer.
A pragmatic workflow library
It's worth standardizing on a handful of node graphs you reach for daily. These four show up in most users' "starred workflows" folder:
- SDXL text‑to‑image with refiner + TAESD preview. The default. Hook two LoRA loaders in series; route through the refiner for 8–10 final steps; preview while it runs.
- SDXL inpainting with mask. Important for product photography and any iterative editing. Use the inpaint‑specific checkpoint variant for cleaner results.
- ControlNet (canny or depth) for composition control. When you have a reference image and need to preserve layout, ControlNet is the entire feature. Routinely worth the 2–4 extra seconds.
- Hi‑res fix 1.5x for final outputs. Generate at 1024×1024, denoise pass at 1536×1536. The quality lift is real and the time cost is bounded.
Build these four templates once. Save them. Re‑use them. The point of ComfyUI's node system is that you're not paying a quality tax for being a "casual" user — you can have professional workflows ready in seconds.
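Saved templates can also be queued from a script instead of the browser. The sketch below assumes a ComfyUI server on its default local address (127.0.0.1:8188), a workflow exported in API format, and the /prompt endpoint with a {"prompt": ...} body as used in ComfyUI's bundled API example. The function names and the file name in the usage note are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow: dict, server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Wrap an API-format workflow in the POST request ComfyUI expects."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str, server: str = "http://127.0.0.1:8188") -> dict:
    """Load a saved API-format workflow from disk and queue it on the server."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_queue_request(workflow, server)) as resp:
        return json.loads(resp.read())
```

With the server running, `queue_workflow("sdxl_refiner_api.json")` would queue one run of that template. Splitting request construction from sending keeps the first function testable without a GPU.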
Storage and dataset management
Stable Diffusion eats disk. A serious user accumulates 60–200 GB of checkpoints, LoRAs, ControlNet models, VAEs, and embeddings within a year. Plan accordingly:
| Asset | Typical disk use | Notes |
|---|---|---|
| SD 1.5 checkpoints | 2–4 GB each | Dozens accumulate |
| SDXL checkpoints | 6–9 GB each | Curate aggressively |
| Flux models | 12–25 GB each | Few; keep what you use |
| LoRAs | 100–500 MB each | Hundreds accumulate |
| ControlNet | 1.4–2.5 GB each | A handful is enough |
| VAEs / Encoders | 300 MB–1 GB each | A few |
| Outputs | varies | Plan for 50+ GB/year if you save |
| Workflows JSON | < 1 MB each | Keep all of these |
A 2 TB NVMe SSD is the practical floor for serious ComfyUI work. A 1 TB drive fills within months once you start collecting checkpoints. Mass storage for cold checkpoints can move to a slower HDD; the live, regularly‑loaded models want fast SSD because load time is significant on cold cache.
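To see where the space actually goes, a short script can total each subfolder of the models directory. This is a minimal sketch using only the standard library. It assumes the usual ComfyUI layout where checkpoints, LoRAs, ControlNet models, and VAEs each live in their own subfolder of models/, and `usage_by_folder` is a hypothetical helper name.

```python
from pathlib import Path


def usage_by_folder(models_dir) -> dict:
    """Disk use in GB for each immediate subfolder of models_dir."""
    totals = {}
    for sub in sorted(Path(models_dir).iterdir()):
        if sub.is_dir():
            # Sum every file below this subfolder, including nested folders.
            size_bytes = sum(f.stat().st_size for f in sub.rglob("*") if f.is_file())
            totals[sub.name] = size_bytes / 1e9
    return totals


# Example (the path is an assumption about your install location):
# for name, gb in usage_by_folder("ComfyUI/models").items():
#     print(f"{name}: {gb:.1f} GB")
```

Running this every few months shows which asset class is growing fastest and what to move to cold storage first.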
Generating responsibly
A few notes worth saying out loud about practice rather than performance:
- Honor model licenses. Many SDXL and Flux derivatives have non‑commercial or attribution clauses; check before using outputs commercially.
- Don't train on people without consent. Training a local LoRA on a real person's likeness is technically easy and ethically heavy. Don't do it without that person's agreement.
- Watermark when distributing. Visible or invisible watermarks help downstream provenance tracing. Stability AI's reference SDXL code applies an invisible watermark by default; if your pipeline does the same, leave it on.
- Keep prompts personal. Prompts are a creative record. Save them with the outputs. Future‑you will want them.
None of this affects performance. All of it affects whether the broader local AI ecosystem stays healthy.
Citations and sources
- ComfyUI — Official GitHub Repository
- TechPowerUp — GeForce RTX 3060 Specifications
- Tom's Hardware — Stable Diffusion GPU Benchmarks
