21 GPU's benchmarked running a small TTS model (vram peak: 5GB)

SpecPicks News — summary + source link

By SpecPicks News Desk · Published 2026-05-18 · Last verified 2026-05-20 · 10 min read

In brief — 2026-05-19 · A r/LocalLLaMA community member rented 21 different GPUs on the vast.ai cloud marketplace and ran the OmniVoice text-to-speech model on each, producing one of the more practical GPU comparison datasets the local-AI community has seen for audio inference workloads. The benchmark targets a model with a peak VRAM ceiling of approximately 5 GB — a threshold that puts a wide range of consumer and prosumer GPUs in contention — and reports results in xRT (times real-time), a metric that measures how many seconds of audio a GPU can synthesize per second of wall-clock time.

What happened

A Redditor posted results from a self-funded benchmark sprint on vast.ai, renting GPU instances for a few minutes each to run OmniVoice, a compact text-to-speech model with a peak VRAM draw of roughly 5 GB. The author's personal RTX 3090 served as the anchor reference point, and all 21 GPUs were evaluated against it using the xRT metric: values above 1 mean synthesis runs faster than real time, and values below 1 mean the GPU cannot keep pace with real-time audio output.
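As a rough illustration of the metric, the sketch below computes xRT from a clip length and a wall-clock synthesis time, then expresses one card's result relative to an anchor card. The function names and the sample numbers are hypothetical and are not taken from the benchmark post.

```python
def xrt(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio synthesized per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def relative_to_anchor(card_xrt: float, anchor_xrt: float) -> float:
    """How many times faster (or slower) a card is than the anchor card."""
    if anchor_xrt <= 0:
        raise ValueError("anchor_xrt must be positive")
    return card_xrt / anchor_xrt


# Hypothetical example: a 30-second clip synthesized in 12 seconds.
score = xrt(audio_seconds=30.0, wall_seconds=12.0)
print(score)                      # 2.5, i.e. faster than real time
print(score >= 1.0)               # True: the card keeps up with live audio
```

A score of 1.0 is the break-even point; anything below it means audio is produced more slowly than it plays back.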

The author is explicit that this is not a controlled scientific study. Rental environments introduce noise: background workloads on shared hosts, variable PCIe bandwidth, and thermal states that differ from a dedicated machine. Nevertheless, the dataset spans a representative cross-section of cards that the local-AI community actually uses — a mix of consumer gaming cards from multiple generations alongside prosumer and datacenter-class options available through vast.ai's rental marketplace.

What makes the dataset notable is the 5 GB VRAM ceiling on OmniVoice. This is low enough that GPUs from the 8 GB tier upward can run inference without memory pressure. That positions TTS workloads as a practical entry point for local AI experimentation on hardware many PC builders already own, without requiring the 16–24 GB configurations that large language models typically demand. The xRT framing is useful precisely because TTS is latency-sensitive: synthesis at 2× real-time keeps up with fast-paced use; below 1× real-time, live applications become unusable.

The RTX 3090 with its 24 GB of GDDR6X served as the community benchmark anchor. The card is widely regarded as the high-water mark of last-generation consumer performance for VRAM-hungry inference tasks.

Why it matters for builders

The 5 GB VRAM ceiling is the central shopping signal here. Buyers evaluating GPUs for a home AI stack — running a local assistant, a voice interface, or a personal TTS pipeline alongside a larger language model — now have data suggesting TTS inference is not the workload that forces an upgrade. A card that already handles lighter LLM tasks (quantized 7B or 8B models typically fit within 8–10 GB depending on quantization level) can double as a TTS engine without contention, provided the two workloads are not run simultaneously on the same device.
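The caveat about running both workloads at once comes down to simple VRAM arithmetic, sketched below. The 8 GB LLM figure and the 5 GB TTS figure come from the ranges quoted in this article; the 1 GB headroom for the driver and framework overhead is an assumption, and the helper name is hypothetical.

```python
def fits(card_gb: float, workloads_gb: list[float], headroom_gb: float = 1.0) -> bool:
    """Return True if the summed workloads plus headroom fit in the card's VRAM."""
    return sum(workloads_gb) + headroom_gb <= card_gb


LLM_GB = 8.0   # low end of the 8-10 GB range quoted for quantized 7B/8B models
TTS_GB = 5.0   # approximate OmniVoice peak reported by the benchmark

print(fits(12.0, [TTS_GB]))           # True: TTS alone fits on a 12 GB card
print(fits(12.0, [LLM_GB, TTS_GB]))   # False: both resident at once do not
print(fits(16.0, [LLM_GB, TTS_GB]))   # True: a 16 GB card can hold both
```

Under these assumptions a 12 GB card has to swap between the two models, while a 16 GB card can keep both loaded.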

For buyers specifically shopping a dedicated TTS or audio-AI node, the benchmark argues for the mid-range tier. Cards in the 8–12 GB VRAM bracket can handle OmniVoice without memory pressure, which means the purchasing decision shifts away from raw capacity and toward compute throughput. A current-gen card with faster CUDA or tensor cores should post a higher xRT ratio than an older high-VRAM card running the same model, on the assumption that the bottleneck is compute rather than memory capacity.

The vast.ai methodology itself is worth noting for cost-conscious builders on the fence about upgrading. Renting a GPU for minutes at a time — the approach this benchmark used — costs a fraction of a dollar per session and provides realistic inference data before committing to a hardware purchase. For someone deciding between an RTX 4070, an RTX 4080, and a prosumer card like an A6000 for a TTS workload, a $2 vast.ai experiment can answer the question directly.
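The cost of a short rental sweep is easy to estimate ahead of time. The sketch below multiplies an hourly rate by the minutes used and sums across sessions; the rates and durations are hypothetical placeholders, not vast.ai prices or figures from the benchmark post.

```python
def rental_cost(hourly_rate_usd: float, minutes: float) -> float:
    """Cost of one rental session billed pro rata by the minute."""
    return hourly_rate_usd * minutes / 60.0


def sweep_cost(sessions: list[tuple[float, float]]) -> float:
    """Total cost of several (hourly_rate_usd, minutes) sessions."""
    return sum(rental_cost(rate, minutes) for rate, minutes in sessions)


# Hypothetical sweep: three cards at different hourly rates, 6 minutes each.
sessions = [(0.20, 6), (0.50, 6), (0.80, 6)]
print(f"${sweep_cost(sessions):.2f}")
```

Real bills may also include storage and bandwidth charges, which this sketch ignores.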

Hardware angle

The 5 GB peak VRAM requirement for OmniVoice means almost every discrete GPU released in the last four years with 8 GB or more of VRAM is a viable host. The RTX 3090's 24 GB is, in this context, significant overkill for VRAM; its role as the anchor reflects its compute performance, not its memory capacity.

For builders eyeing the used and rental markets, older datacenter cards like the Tesla P40 occupy an interesting niche: 24 GB of GDDR5 at a low used-market price, but slower compute than modern consumer cards. Community results documenting Qwen 3.6 27B at Q5 quantization running around 20 tokens per second on a P40 illustrate the tradeoff: high VRAM at the cost of architectural age. The xRT data from the OmniVoice benchmark would show whether that older architecture holds up on a compute-bound TTS workload versus newer mid-range consumer GPUs with less memory.

Builders who pair a compact display like the HAMTYSAN 7 Inch Raspberry Pi Screen with a discrete inference GPU on the same desk, whether for a dedicated TTS host or a display-attached AI assistant, are a non-trivial slice of the local-AI audience and the kind of buyer this benchmark directly informs. Sustained inference also raises VRM and memory-module temperatures; pads like the Thermal Grizzly Minus Pad Basic 2.0mm are common picks for builders retrofitting older 24 GB cards for 24/7 inference duty.

What other coverage is saying

A parallel r/LocalLLaMA thread benchmarked Kokoro 82M and Supertonic 3 specifically on CPU, producing a head-to-head comparison the poster noted was not available elsewhere. The author writes that the outcome did not match expectations, underscoring that TTS model performance across hardware is less intuitive than general-purpose LLM inference benchmarks. That CPU-only thread complements the GPU-focused OmniVoice data by establishing a no-GPU reference point.

Elsewhere, a community thread on the Tesla P40 documented Qwen 3.6 27B running in Q5 quantization at approximately 20 tokens per second, with the caveat that MTP speculative decoding with K-cache quantization was not functional on that architecture. A separate r/LocalLLaMA community poll asking how many GPUs readers run on their local systems supplies the broader context: the audience skews toward single-GPU setups, which makes per-GPU xRT comparisons directly actionable.

Sources

21 GPU's benchmarked running a small TTS model (vram peak: 5GB) — Primary benchmark post; 21 consumer and prosumer GPUs evaluated on OmniVoice TTS at ~5 GB VRAM peak, results reported in xRT.
Benchmarked Kokoro 82M vs Supertonic 3 TTS on CPU — Head-to-head CPU-only TTS benchmark; provides a no-GPU baseline for compact TTS workloads.
Tesla P40 running Qwen 3.6 — Community thread documenting an older datacenter GPU's inference performance at roughly 20 t/s on a quantized 27B model; illustrates older high-VRAM cards' compute limitations.
How many GPUs do you have on your local system/server/AI PC? — r/LocalLLaMA community poll on local GPU counts; contextualizes who the benchmark data is most relevant for.

Filed by the SpecPicks News Desk. We summarize and link — never paywall-bypass.