ComfyUI on an RTX 3060 12GB: SDXL Throughput and VRAM Tuning

Concrete throughput numbers, launch flags, and node tricks for SDXL on an RTX 3060 12GB.

ComfyUI on an RTX 3060 12GB handles SDXL, FLUX fp8, and small LoRA stacks just fine, provided you tune VRAM offload correctly. Here's the throughput math, launch flags, and quantization options.


Yes, an MSI RTX 3060 12GB comfortably runs ComfyUI with SDXL. Per community measurements on r/StableDiffusion, a 1024x1024 base plus refiner render lands in the 30 to 50 second range at roughly 3 to 4 s/it, LoRA stacks fit, and 1.5x Ultimate SD Upscale is feasible. The tuning that matters is VAE placement, text-encoder offload, and model precision.

Why ComfyUI's node graph is the budget-VRAM creator's tool

Per the ComfyUI repository docs, the tool exposes the diffusion pipeline as an explicit node graph: model loader, conditioning, sampler, VAE decode, save image. Every tensor handoff is visible, the opposite of the monolithic pipeline in AUTOMATIC1111's WebUI. On a 12 GB card that is the whole game: the difference between a successful 1024x1024 SDXL render and a CUDA out-of-memory crash is usually one well-placed offload node.

The practical consequence is that an MSI RTX 3060 12GB can match what a 16 GB card does in a less-tuned UI, just slower. Per ComfyUI's README, the supported zoo spans SD 1.5, SDXL base and refiner, SD3 Medium, FLUX.1 dev and schnell, HiDream, plus ControlNets and IP-Adapters. Most run on 12 GB if you know which weights to keep resident. The 3060 12GB has settled in as the entry card everyone tunes for, the way the GTX 1060 6GB was for SD 1.5 in 2023.

Step 0 diagnostic: are you out-of-memory or just slow?

Separate the two failure modes before tuning. A "CUDA out of memory" error means you need to reduce the model footprint. A successful but slow render is a throughput problem, addressed by precision and sampler choice, not offload flags. The most common reason ComfyUI feels slow on the 3060 is that people enable --lowvram after a single OOM, never turn it off, and pay a permanent latency tax. Run nvidia-smi -l 1 during a render: peak usage that stays under 11.5 GB points to a throughput problem; usage that spikes to 12 GB before a crash points to memory.
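The same check can be scripted against a saved log. This is a minimal sketch, assuming you captured readings with `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1` (one MiB value per line). The function name is hypothetical, and treating the 11.5 GB threshold as 11.5 × 1024 MiB is an assumption.

```python
def classify_render(log_text: str, threshold_mib: float = 11.5 * 1024) -> str:
    """Classify a render from nvidia-smi memory.used samples (one MiB value per line).

    Returns "memory" if peak usage reached the threshold, else "throughput".
    The 11.5 GB threshold mirrors the rule of thumb in the text.
    """
    samples = [int(line.strip()) for line in log_text.splitlines() if line.strip()]
    if not samples:
        raise ValueError("no memory samples found")
    peak_mib = max(samples)
    return "memory" if peak_mib >= threshold_mib else "throughput"


if __name__ == "__main__":
    # Hypothetical sample log: usage peaks well under the threshold.
    print(classify_render("8120\n9544\n10210\n9980\n"))
```

If the result is "throughput", leave offload flags alone and look at precision and sampler settings instead.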

Key takeaways

  • The 3060 12GB renders SDXL at roughly 3 to 4 seconds per iteration at 1024x1024 with fp16 weights, putting a 30-step base plus 10-step refiner pass near 30 to 50 seconds.
  • VRAM headroom, not raw compute, is the difference between this card and 8 GB peers. Per TechPowerUp, the 12 GB SKU also has a wider 192-bit bus, but it is the extra 4 GB of capacity that lets SDXL fit.
  • An NVMe drive like the WD Blue SN550 NVMe shaves model-swap waits when cycling between SDXL, FLUX, and SD 1.5 checkpoints.
  • Quantized weights (GGUF Q5_K_M or fp8) extend the card's reach to FLUX.1 dev and HiDream at usable speeds.
  • Keep the VAE on CPU, drop preview, and force CLIP off the GPU for FLUX, and a 3060 12GB sustains up to roughly 120 SDXL images per hour.

What ComfyUI needs from a GPU for SDXL pipelines

SDXL is the dividing line. SD 1.5 runs on any 6 GB card; SDXL's larger UNet, dual CLIP-L and CLIP-G encoders, and 1024x1024-native latent space push memory pressure into 10 to 12 GB at fp16. Per the Hugging Face model card for stable-diffusion-xl-base-1.0, the recommendation is "at least 8 GB", but that assumes aggressive offload and CPU VAE. For a node-graph workflow with refiner inline, 12 GB is the comfortable floor.

Compute matters less than headline TFLOPS suggest. The 3060's 13 TFLOPS FP16 is well behind a 4070 Ti, yet the per-iteration gap is roughly 2x to 3x, not 5x. SDXL sampling is bandwidth-bound for much of the step, and the 3060's 360 GB/s is respectable for the tier.

Spec table: RTX 3060 12GB vs adjacent ComfyUI cards

| Card | VRAM | Bus width | Bandwidth | Typical street price | ComfyUI fit |
| --- | --- | --- | --- | --- | --- |
| RTX 3060 12GB | 12 GB GDDR6 | 192-bit | 360 GB/s | $260 to $310 | Entry SDXL, fp8 FLUX |
| RTX 3060 8GB | 8 GB GDDR6 | 128-bit | 240 GB/s | $230 to $270 | SD 1.5 only; SDXL needs lowvram |
| RTX 4060 8GB | 8 GB GDDR6 | 128-bit | 272 GB/s | $290 to $320 | SDXL with offload; FLUX painful |
| RTX 4060 Ti 16GB | 16 GB GDDR6 | 128-bit | 288 GB/s | $440 to $480 | SDXL comfortable, FLUX fp16 |
| RTX 3090 24GB | 24 GB GDDR6X | 384-bit | 936 GB/s | $700 to $900 used | Video diffusion, training |

Per TechPowerUp's specs, the 8 GB 3060 is not simply the 12 GB card with less memory: it also cuts the bus from 192-bit to 128-bit, dropping bandwidth from 360 GB/s to 240 GB/s. The extra 4 GB is what lets SDXL's UNet, text encoders, and VAE stay resident without offload, and the wider bus keeps sampling fed. For ComfyUI, the 12 GB SKU is the only 3060 worth buying.
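The bandwidth column follows from bus width times per-pin data rate. This is a short sketch of that arithmetic. The data rates (15, 17, 18, and 19.5 Gbps) are not in the table; they are assumed values that reproduce its bandwidth figures.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8


# (bus width in bits, assumed per-pin data rate in Gbps)
CARDS = {
    "RTX 3060 12GB": (192, 15.0),
    "RTX 3060 8GB": (128, 15.0),
    "RTX 4060 8GB": (128, 17.0),
    "RTX 4060 Ti 16GB": (128, 18.0),
    "RTX 3090 24GB": (384, 19.5),
}

if __name__ == "__main__":
    for name, (bus, rate) in CARDS.items():
        print(f"{name}: {bandwidth_gb_s(bus, rate):.0f} GB/s")
```

At the same assumed 15 Gbps memory speed, the two 3060 variants differ only by bus width, which accounts for the 360 versus 240 GB/s gap.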

Throughput on the 3060 12GB

Reference numbers, per community measurements on r/StableDiffusion and ComfyUI GitHub discussions, assuming fp16 weights and PyTorch 2.4:

| Workflow | Resolution | Iteration time | Total render |
| --- | --- | --- | --- |
| SD 1.5 base | 512x512 | 1.5 to 2.5 it/s | 8 to 14 s |
| SDXL base | 1024x1024 | 3.5 s/it | 70 to 80 s (30 steps) |
| SDXL refiner | 1024x1024 | 3.0 s/it | 25 to 30 s (10 steps) |
| SDXL base + refiner | 1024x1024 | combined | 30 to 50 s |
| FLUX.1 dev fp8 | 1024x1024 | 12 to 18 s/it | 4 to 6 min (20 steps) |
| FLUX.1 schnell | 1024x1024 | 8 to 12 s/it | 30 to 50 s (4 steps) |
| HiDream / SD3 Medium | 1024x1024 | 4 to 6 s/it | 80 to 120 s (20 steps) |

FLUX.1 schnell is usable because it converges in 4 steps. FLUX.1 dev is "set and make coffee". SDXL is the home base; SD 1.5 is interactive.
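Sampling time is step count times per-iteration time, which is why a 4-step model at 8 to 12 s/it still finishes faster than a 20-step model at 4 to 6 s/it. This small sketch uses the FLUX and HiDream rows; the helper name is hypothetical, and it ignores model load and VAE decode time.

```python
def render_seconds(steps: int, sec_per_it_low: float, sec_per_it_high: float) -> tuple:
    """Return the (low, high) sampling time in seconds for a given step count."""
    return steps * sec_per_it_low, steps * sec_per_it_high


if __name__ == "__main__":
    print(render_seconds(20, 12, 18))  # FLUX.1 dev fp8 -> (240, 360), i.e. 4 to 6 min
    print(render_seconds(4, 8, 12))    # FLUX.1 schnell -> (32, 48)
    print(render_seconds(20, 4, 6))    # HiDream / SD3 Medium -> (80, 120)
```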

Launch flags and VRAM toggles

A handful of main.py flags change the memory and throughput envelope on 12 GB. Per cli_args.py:

  • --lowvram: aggressive UNet offload between steps. Use only when you OOM; costs 30 to 50 percent throughput.
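  • --normalvram: forces the default memory mode if ComfyUI drops into low-VRAM mode on its own; a reasonable middle ground when stacking ControlNets. Note that --medvram is an AUTOMATIC1111 flag and does not exist in ComfyUI.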
  • --cpu-vae: keep the VAE on CPU. Saves 500 MB to 1.5 GB of VRAM; adds a couple seconds per image.
  • --bf16-vae: bfloat16 VAE; small win on Ampere.
  • --use-pytorch-cross-attention: PyTorch native SDPA. Per ComfyUI docs, recommended on modern PyTorch and usually beats legacy xFormers on Ampere.
  • --fp8_e4m3fn-text-enc: fp8 text encoders. Essential for FLUX on 12 GB; harmless on SDXL.

Start with python main.py --use-pytorch-cross-attention --bf16-vae. Do not preemptively add --lowvram.
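One way to keep that discipline is to build the command from explicit conditions, so that --lowvram only appears after an actual OOM. This is a minimal sketch using only flags named above; the function and its parameters are hypothetical and are not part of ComfyUI.

```python
import shlex


def build_launch_command(hit_oom: bool = False, flux: bool = False) -> list:
    """Assemble a ComfyUI launch command for a 12 GB card.

    hit_oom: set only after a real "CUDA out of memory" error.
    flux:    set when the workflow loads FLUX text encoders.
    """
    cmd = ["python", "main.py", "--use-pytorch-cross-attention", "--bf16-vae"]
    if flux:
        cmd.append("--fp8_e4m3fn-text-enc")  # fp8 text encoders for FLUX
    if hit_oom:
        cmd.append("--lowvram")  # last resort; costs throughput
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_launch_command()))
    print(shlex.join(build_launch_command(flux=True)))
```

Keeping the OOM switch as an explicit argument makes it harder to leave --lowvram on by accident.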

VRAM accounting at 1024x1024

A rough budget, per community memory-profiling threads on r/StableDiffusion, for SDXL on the 3060 12GB:

| Component | VRAM (fp16) |
| --- | --- |
| SDXL UNet (resident) | 5.0 to 7.0 GB |
| VAE | 0.5 to 1.5 GB |
| Text encoders (CLIP-L + CLIP-G) | 1.5 to 2.5 GB |
| T5XXL (SD3 / FLUX only) | 4.5 to 9.0 GB |
| KV cache and latents | 0.5 to 1.0 GB |
| ControlNet (per active model) | 1.0 to 3.0 GB |

That adds up to 7.5 to 12.0 GB for vanilla SDXL before any ControlNet is loaded, and each active ControlNet adds another 1.0 to 3.0 GB, which is exactly why 12 GB is the comfort floor. A ZOTAC RTX 3060 12GB is a near-identical SKU and a fine alternative if the MSI is out of stock; both use GA106 with the 192-bit bus.
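Summing the table rows makes the budget explicit. This sketch adds the SDXL components (T5XXL is excluded because it applies only to SD3 and FLUX) plus an optional number of ControlNets; the names are illustrative.

```python
# (low, high) VRAM in GB at fp16, copied from the table above.
SDXL_COMPONENTS_GB = {
    "SDXL UNet (resident)": (5.0, 7.0),
    "VAE": (0.5, 1.5),
    "Text encoders (CLIP-L + CLIP-G)": (1.5, 2.5),
    "KV cache and latents": (0.5, 1.0),
}
CONTROLNET_GB = (1.0, 3.0)  # per active ControlNet


def sdxl_budget_gb(controlnets: int = 0) -> tuple:
    """Return the (low, high) VRAM estimate in GB for SDXL with N ControlNets."""
    low = sum(lo for lo, _ in SDXL_COMPONENTS_GB.values()) + controlnets * CONTROLNET_GB[0]
    high = sum(hi for _, hi in SDXL_COMPONENTS_GB.values()) + controlnets * CONTROLNET_GB[1]
    return low, high


if __name__ == "__main__":
    print(sdxl_budget_gb(0))  # (7.5, 12.0)
    print(sdxl_budget_gb(1))  # (8.5, 15.0)
```

The high end with one ControlNet exceeds 12 GB, which is the case where CPU VAE or encoder offload starts to matter.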

Quantized model use

Quantization turns the 3060 12GB from "SDXL-comfortable" into "FLUX-capable". Per the ComfyUI GGUF custom node docs and Hugging Face cards for quantized FLUX variants:

| Format | UNet VRAM (FLUX dev) | Speed vs fp16 | Quality notes |
| --- | --- | --- | --- |
| fp16 reference | ~23 GB | 1.0x | Won't fit on 12 GB |
| fp8 e4m3fn | ~12 GB | 0.85x to 0.95x | Visually indistinguishable on most prompts |
| GGUF Q8_0 | ~13 GB | 0.80x | Slight color drift at high CFG |
| GGUF Q5_K_M | ~8 GB | 0.75x | Mild quality loss; room for text encoders |
| GGUF Q4_0 | ~6 GB | 0.70x | Noticeable artifacts on fine detail |

The sweet spot for FLUX on the 3060 12GB is fp8 UNet plus fp8 text encoders, or Q5_K_M GGUF for LoRA headroom. Per community measurements on r/StableDiffusion, fp8 FLUX lands around 14 to 16 s/it.
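The choice can be framed as picking the first format, in the table's order, whose UNet fits the VRAM you are willing to give it. This is a sketch under that assumption. The sizes are the table's approximate figures, and using table order as the preference order is an editorial simplification rather than a measured quality ranking.

```python
# (format, approximate FLUX dev UNet size in GB), in the table's order.
FLUX_DEV_UNET_GB = [
    ("fp16", 23),
    ("fp8 e4m3fn", 12),
    ("GGUF Q8_0", 13),
    ("GGUF Q5_K_M", 8),
    ("GGUF Q4_0", 6),
]


def pick_format(unet_budget_gb: float):
    """Return the first listed format whose UNet fits the budget, or None."""
    for name, size_gb in FLUX_DEV_UNET_GB:
        if size_gb <= unet_budget_gb:
            return name
    return None


if __name__ == "__main__":
    print(pick_format(12))  # whole card given to the UNet, encoders offloaded
    print(pick_format(9))   # ~3 GB held back for LoRAs and encoders
```

With the full 12 GB available to the UNet the pick is fp8; reserving a few gigabytes for LoRAs or encoders shifts it to Q5_K_M, matching the two options in the text.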

Workflow optimizations that compound

  • Keep batch size at 1. Batched generation does not scale on 12 GB; every element holds another copy of latents and KV state.
  • Disable live preview or set it to "latent2rgb" rather than "TAESD" (an extra mini-VAE pass per step).
  • Use the "Unload Model" node between stages. SDXL base, then unload, then refiner is faster on 12 GB than holding both resident.
  • Force CLIP and T5 to CPU for SD3 or FLUX; encoders only run once per prompt.
  • Use Tiled VAE above 1536x1536. Per the Tiled VAE node docs, it splits the decode into 512-pixel tiles, dropping VAE VRAM by 60 to 80 percent.
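To get a feel for how much work tiling adds, you can count the tiles a given output needs. This is a rough sketch that assumes non-overlapping 512-pixel tiles; real tiled decoders typically overlap tiles to hide seams, so actual counts may be higher. The helper name is hypothetical.

```python
import math


def tile_grid(width: int, height: int, tile: int = 512) -> tuple:
    """Return (columns, rows) of non-overlapping tiles covering the image."""
    return math.ceil(width / tile), math.ceil(height / tile)


if __name__ == "__main__":
    for w, h in [(1536, 1536), (2048, 2048), (1920, 1080)]:
        cols, rows = tile_grid(w, h)
        print(f"{w}x{h}: {cols} x {rows} = {cols * rows} tiles")
```

Peak decode memory then scales with one tile rather than the whole image, at the cost of running the decoder once per tile.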

Throughput math: how many images per day?

At 30 to 50 seconds per 1024x1024 SDXL image with base plus refiner, the 3060 12GB sustains roughly 72 to 120 images per hour (3,600 seconds divided by 50 and by 30), or 576 to 960 over an 8-hour workday. For thumbnails, blog imagery, or social posts, the card is not the bottleneck; prompt iteration is. FLUX.1 dev fp8 at 4 to 6 minutes per image cuts that to 10 to 15 per hour, or roughly 80 to 120 per workday.
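The conversion from seconds per image to hourly and daily output is simple division. This sketch assumes back-to-back renders with no idle time between them.

```python
def images_per_hour(seconds_per_image: float) -> float:
    """Images per hour for back-to-back renders."""
    return 3600 / seconds_per_image


def images_per_workday(seconds_per_image: float, hours: float = 8) -> float:
    """Images over a workday of the given length."""
    return images_per_hour(seconds_per_image) * hours


if __name__ == "__main__":
    # SDXL base + refiner at 30 to 50 s per image
    print(images_per_hour(50), images_per_hour(30))          # 72.0 120.0
    print(images_per_workday(50), images_per_workday(30))    # 576.0 960.0
    # FLUX.1 dev fp8 at 4 to 6 minutes per image
    print(images_per_workday(360), images_per_workday(240))  # 80.0 120.0
```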

Storage and the model zoo problem

ComfyUI's models/ directory grows fast. A typical creator setup tracked across r/StableDiffusion build threads:

| Asset | Size |
| --- | --- |
| SDXL base + refiner | 12.5 GB |
| SD 1.5 + a few fine-tunes | 4 to 12 GB |
| FLUX.1 dev fp16 / fp8 | 23 / 12 GB |
| FLUX.1 schnell fp8 | 12 GB |
| HiDream | 16 GB |
| LoRA collection (50 to 200 files) | 5 to 30 GB |
| ControlNets (SDXL set) | 12 to 18 GB |

A 1 TB drive fills quickly. A WD Blue SN550 NVMe at 1 TB is the right tier: PCIe 3.0 x4 with sequential reads around 2,400 MB/s, enough to load an SDXL checkpoint into VRAM in 2 to 3 seconds. Storage does not affect generation once a model is resident, but for creators who swap checkpoints between batches it is the difference between flow state and constant waiting.
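The load-time figure is checkpoint size divided by sequential read speed. This sketch assumes a single SDXL checkpoint is about half of the 12.5 GB base-plus-refiner pair and that the drive sustains its rated 2,400 MB/s; real loads add filesystem and deserialization overhead.

```python
def load_seconds(size_gb: float, read_mb_s: float = 2400) -> float:
    """Best-case time to read a checkpoint of size_gb at read_mb_s (decimal units)."""
    return size_gb * 1000 / read_mb_s


if __name__ == "__main__":
    print(f"SDXL checkpoint (~6.25 GB): {load_seconds(6.25):.1f} s")
    print(f"FLUX.1 dev fp8 (12 GB): {load_seconds(12):.1f} s")
```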

Perf-per-dollar: the entry ComfyUI card

At $260 to $310 street, the 3060 12GB has held the entry crown since 2023 because nothing in its price band ships with more than 8 GB. The RTX 4060 has more compute and better efficiency but lands at 8 GB. Arc A770 16GB has more memory and worse diffusion driver maturity. Used 3090s at $700 to $900 are the step up for FLUX or video diffusion.

Common pitfalls

  • Triton not installed: cross-attention silently falls back. Verify with python -c "import triton".
  • "Tensor not contiguous": almost always a custom node returning a view. Update the offending node.
  • Swap thrashing with --lowvram on systems under 16 GB of RAM. Per-iteration time triples. Pair 32 GB system RAM with 12 GB VRAM for heavy offload.
  • Tensor cores misconfigured: ensure PyTorch is built with CUDA 11.8+ and torch.backends.cudnn.allow_tf32 = True.
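The Triton and TF32 checks can be combined into one small report. This is a minimal sketch: `has_module` and `environment_report` are hypothetical helper names, and the PyTorch attributes are read only if torch is importable.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def environment_report() -> dict:
    """Collect the settings mentioned in the pitfalls list."""
    report = {"triton": has_module("triton"), "torch": has_module("torch")}
    if report["torch"]:
        import torch  # imported lazily so the script runs without PyTorch

        report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
        report["cudnn_tf32"] = torch.backends.cudnn.allow_tf32
        report["matmul_tf32"] = torch.backends.cuda.matmul.allow_tf32
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Run it inside the same Python environment that launches ComfyUI; a different interpreter can report a different set of installed packages.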

When not to run ComfyUI on a 3060 12GB

The card is wrong for video diffusion at usable speeds (SVD, AnimateDiff long sequences, Wan 2.1, CogVideoX) where 16 to 24 GB is the floor; 2K+ outputs at full step counts (cloud GPU is cheaper than patience); or simultaneous training and inference, which needs a 4070 Ti Super 16GB or 3090 24GB.

Verdict

Run ComfyUI on an MSI RTX 3060 12GB if your workflow is SDXL, SD 1.5, or FLUX.1 schnell at 1024x1024 to 1536x1536 with reasonable LoRA stacks and one or two ControlNets. Step up to 16 GB-plus for FLUX.1 dev at fp16, video diffusion, or training on top of inference. The 3060 12GB is the card people learn ComfyUI on for a reason: 12 GB is enough to never tune memory on the common path, and a single month of saved cloud-GPU spend pays for it.

Frequently asked questions

Is an RTX 3060 12GB enough for ComfyUI and SDXL?

Yes, the 12GB of VRAM makes the 3060 a capable entry card for ComfyUI, comfortably running SDXL base and refiner pipelines that overwhelm 8GB cards. Generation is slower than on high-end GPUs, but the node graph lets you manage memory with tiled VAE and low-VRAM options. For hobby and learning workflows it is one of the best value choices available.

What ComfyUI settings help avoid out-of-memory errors on 12GB?

Enable tiled VAE decoding, use fp16 or fp8 where supported, keep batch sizes small, and avoid stacking many high-resolution ControlNets at once. ComfyUI also offers low-VRAM launch flags that offload parts of the pipeline to system RAM. Combining these keeps SDXL within the 3060's 12GB budget, trading a little speed for the reliability of completing renders without crashing.

How fast is SDXL generation on the RTX 3060 12GB?

Community figures place the 3060 12GB in a usable but unhurried range for SDXL, well behind flagship cards yet fine for iterative creative work. Actual iterations-per-second depend on resolution, sampler, and whether optimizations like xFormers are active. Always check recent benchmarks for your exact ComfyUI build, since updates and custom nodes regularly shift performance up or down.

Does ComfyUI benefit from a fast SSD?

Yes, indirectly. ComfyUI workflows load large checkpoints, LoRAs, and ControlNet models, and an NVMe drive like the WD Blue SN550 cuts the wait when switching between them. Storage speed does not affect generation once a model is in VRAM, but creators who frequently swap models or build complex graphs feel the difference in overall responsiveness compared with a slower SATA drive.

When should I upgrade from the 3060 12GB for ComfyUI?

Upgrade when you regularly hit memory limits with high-resolution outputs, heavy ControlNet stacks, video pipelines, or model training, where a 16GB-or-more card removes the constant tuning. The 3060 12GB excels at standard SDXL image work; demanding production or experimental pipelines justify stepping up. Match the card to your most memory-hungry workflow rather than your typical one to avoid frequent OOM interruptions.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

— SpecPicks Editorial · Last verified 2026-06-14
