Laguna XS.2 Lands in llama.cpp: What the Tiny Hybrid Model Means for Local Inference

An upstream PR puts a 3B-class hybrid-attention model in reach of every llama.cpp wrapper — here's how it benchmarks against Qwen3.6-3B and Gemma-4-Tiny on consumer cards.

Laguna XS.2 is a 3B-class hybrid-attention language model whose llama.cpp integration PR landed this week. Per the PR submitter's benchmark, the q4_K_M quant runs at roughly 95-110 tokens per second on a single RTX 3060 12GB at a 4k context — fast enough for real-time chat and small RAG pipelines on a $300 used GPU. This piece is editorial synthesis of the PR thread, public reproductions, and adjacent hardware data.

Why a llama.cpp PR matters more than a HuggingFace upload

A model released only on HuggingFace reaches maybe 200 enthusiasts. A model with a merged llama.cpp PR reaches Ollama, LM Studio, Open-WebUI, GPT4All, KoboldCpp, Jan.ai, Cortex, and every wrapper that pulls llama.cpp as a submodule. The distribution moat shows up within a week of merge — Ollama's CI pulls the latest llama.cpp tag on every nightly build, and LM Studio bundles a fresh binary in each point release.

This is the real release signal for local inference. The Laguna team had been posting weight uploads to HuggingFace for months without much traction. Once Mike Pearce (the maintainer who landed the PR) confirmed working ggml conversion plus a passing quantization-quality smoke test, the model hit r/LocalLLaMA's front page within twelve hours and now sits in the top tier of "small model worth trying this weekend" recommendations.

For technical buyers shopping a small inference rig, the practical question becomes which GPU runs Laguna XS.2 fluently at a useful context length. That is the question this synthesis answers — with concrete tok/s numbers, quantization-aware VRAM budgets, and a comparison against the two closest small-model competitors most readers are already running.

Key takeaways

  • VRAM floor: q4_K_M lands near 2.6 GB at 4k context, 4.1 GB at 16k. fp16 needs roughly 6.8 GB. Comfortable headroom on any 8 GB card.
  • Tok/s on RTX 3060 12GB: 95-110 tok/s at q4_K_M, ~78 tok/s at q5_K_M, ~52 tok/s at fp16. Per the PR thread author's reference numbers.
  • vs Qwen3.6-3B: Laguna is roughly 22-24 percent faster at the same quant and about 15 percent quicker on time-to-first-token, per the benchmark table below, thanks to its hybrid-attention pattern.
  • License: Apache-2.0-compatible per the upstream model card. Commercial use is permitted; verify the LICENSE file before shipping in a paid product.
  • CPU-only viability: Genuinely usable on a 6-core+ desktop chip such as the AMD Ryzen 7 5800X at q4_K_M — 18-26 tok/s generation, with prefill the main bottleneck on long contexts.

What is Laguna XS.2? Architecture and sizing

Laguna XS.2 is a 3.1B-parameter dense decoder with what the authors call a "hybrid attention skip" pattern — roughly two-thirds of layers run standard grouped-query attention while the remaining third use a fixed-window local attention with stride-2 KV reuse. The net effect is a flat-ish memory profile under long contexts and a meaningfully smaller KV cache than a standard 3B model running the same window.

Per the upstream model card the architecture targets:

  • 28 decoder layers, hidden dim 2560
  • 20 query heads, 4 KV heads (GQA group 5)
  • 64k context (RoPE base 500000, NTK-aware scaling)
  • 128256 token vocabulary inherited from the SentencePiece tokenizer used by the broader Laguna family
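
The listed dimensions are enough for a back-of-envelope KV-cache estimate. The sketch below assumes a head dimension of 2560 / 20 = 128 and fp16 K/V entries. The number of local-attention layers and the window size are not given in the source, so `local_layers` and `window` are hypothetical parameters, and the stride-2 KV reuse is ignored. Treat the output as a rough bound, not a reproduction of the reported figures.

```python
def kv_cache_bytes(context, n_layers=28, n_kv_heads=4, head_dim=128,
                   local_layers=0, window=0, bytes_per_elem=2):
    """Estimate KV-cache size in bytes for a grouped-query-attention decoder."""
    if not 0 <= local_layers <= n_layers:
        raise ValueError("local_layers must be between 0 and n_layers")
    # Each layer stores one K and one V vector per KV head per cached token.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    full_layers = n_layers - local_layers
    # Full-attention layers cache every token; local layers cache at most `window`.
    full = full_layers * per_token_per_layer * context
    local = local_layers * per_token_per_layer * min(context, window)
    return full + local


if __name__ == "__main__":
    for ctx in (4096, 16384, 65536):
        dense = kv_cache_bytes(ctx)
        # Hypothetical split: 9 local layers with a 4096-token window.
        hybrid = kv_cache_bytes(ctx, local_layers=9, window=4096)
        print(f"{ctx:>6} tokens: dense {dense / 2**30:.2f} GiB, "
              f"hybrid {hybrid / 2**30:.2f} GiB")
```

The useful part is the shape of the curve: the full-attention term grows linearly with context, while the local-attention term stops growing once the context exceeds the window.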

Quantization is where llama.cpp's contribution earns its keep. The PR adds K-quants support all the way down to q2_K and exposes the model to the IQ-quants pipeline so future builds will pick up imatrix-based calibration automatically.

How does Laguna XS.2 compare to Qwen3.6-3B and Gemma-4-Tiny on tok/s?

The PR submitter posted side-by-side numbers on a single RTX 3060 12GB and a Ryzen 7 5800X CPU host. Independent reproductions from r/LocalLLaMA users are still rolling in but converge on the same ranking. Treat the headline tok/s figures as upstream-author-reported until a clean LocalLLaMA rig publishes a confirmation thread.

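| Model | Quant | Tok/s (RTX 3060 12GB) | Time-to-first-token (ms) |
| --- | --- | --- | --- |
| Laguna XS.2 (3.1B) | q4_K_M | 105 | 168 |
| Laguna XS.2 (3.1B) | q5_K_M | 78 | 191 |
| Laguna XS.2 (3.1B) | fp16 | 52 | 224 |
| Qwen3.6-3B | q4_K_M | 85 | 198 |
| Qwen3.6-3B | q5_K_M | 64 | 226 |
| Gemma-4-Tiny (2.6B) | q4_K_M | 118 | 142 |
| Gemma-4-Tiny (2.6B) | fp16 | 61 | 197 |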

Gemma-4-Tiny is still the absolute speed leader in this weight class — it is 600M parameters smaller and Google's tokenizer is tighter — but the quality gap on multi-turn reasoning is visible. Laguna's hybrid attention buys back about half of that quality gap while staying well within real-time interactive territory.
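
The relative gaps between Laguna and Qwen3.6-3B can be checked directly from the table figures. The snippet below is a small arithmetic sketch; the `BENCH` dictionary and the `relative_gain` helper are illustrative names, and the numbers are copied from the table above.

```python
# (tok/s, time-to-first-token in ms), copied from the benchmark table.
BENCH = {
    ("Laguna XS.2", "q4_K_M"): (105, 168),
    ("Laguna XS.2", "q5_K_M"): (78, 191),
    ("Qwen3.6-3B", "q4_K_M"): (85, 198),
    ("Qwen3.6-3B", "q5_K_M"): (64, 226),
}


def relative_gain(quant, model="Laguna XS.2", baseline="Qwen3.6-3B"):
    """Return (throughput gain %, TTFT reduction %) of model over baseline."""
    tps, ttft = BENCH[(model, quant)]
    base_tps, base_ttft = BENCH[(baseline, quant)]
    speedup = (tps / base_tps - 1) * 100
    ttft_cut = (1 - ttft / base_ttft) * 100
    return speedup, ttft_cut


if __name__ == "__main__":
    for quant in ("q4_K_M", "q5_K_M"):
        speedup, ttft_cut = relative_gain(quant)
        print(f"{quant}: {speedup:.1f}% faster, {ttft_cut:.1f}% lower TTFT")
```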

Spec-delta table: Laguna XS.2 vs Qwen3.6-3B vs Gemma-4-Tiny

| Spec | Laguna XS.2 | Qwen3.6-3B | Gemma-4-Tiny |
| --- | --- | --- | --- |
| Parameters | 3.1B | 3.0B | 2.6B |
| Layers | 28 | 32 | 26 |
| Hidden dim | 2560 | 2560 | 2304 |
| Query heads / KV heads | 20 / 4 | 20 / 4 | 16 / 4 |
| Vocab size | 128,256 | 151,936 | 256,000 |
| Context length (native) | 64k | 32k | 32k |
| KV cache (16k context, fp16) | ~1.4 GB | ~1.7 GB | ~1.5 GB |
| License | Apache-2.0 | Apache-2.0 | Gemma TOU |

Two things to call out. First, Laguna's smaller KV cache at long contexts is the whole point of the hybrid attention design — a 64k context that would blow past 5 GB of KV on a vanilla 3B model fits in roughly 3.4 GB here. Second, the Gemma-4-Tiny license is Google's bespoke Gemma Terms of Use, not Apache-2.0; that matters for commercial deployments where redistribution rights are part of the business model.

Quantization matrix: VRAM, tok/s, quality loss

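| Quant | VRAM @ 4k ctx | VRAM @ 16k ctx | Tok/s (RTX 3060 12GB) | Quality loss vs fp16 |
| --- | --- | --- | --- | --- |
| q2_K | 1.2 GB | 2.4 GB | 132 | Severe (chat-only) |
| q3_K_M | 1.7 GB | 2.9 GB | 124 | Noticeable on code |
| q4_K_M | 2.6 GB | 4.1 GB | 105 | Negligible |
| q5_K_M | 3.1 GB | 4.7 GB | 78 | None measurable |
| q6_K | 3.7 GB | 5.3 GB | 64 | None |
| q8_0 | 4.6 GB | 6.2 GB | 58 | None |
| fp16 | 6.8 GB | 8.5 GB | 52 | Reference |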

q4_K_M is the sweet spot for this model. The PR thread reports that q3_K_M loses about 4 percentage points on HumanEval-Plus and noticeable accuracy on multi-turn instruction following, while q5_K_M is indistinguishable from fp16 on every benchmark the authors ran. If you have the VRAM, run q5_K_M; if you want maximum throughput, q4_K_M.
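
As a rough illustration of the "if you have the VRAM" rule, the sketch below picks the highest-fidelity quant from the matrix that fits a given card. The VRAM figures are copied from the table; the `pick_quant` helper and the 1 GB default headroom are assumptions for illustration, not something the PR thread specifies.

```python
# (quant, VRAM in GB at 4k context, VRAM in GB at 16k context),
# copied from the quantization matrix and ordered from highest to lowest fidelity.
QUANTS = [
    ("fp16", 6.8, 8.5),
    ("q8_0", 4.6, 6.2),
    ("q6_K", 3.7, 5.3),
    ("q5_K_M", 3.1, 4.7),
    ("q4_K_M", 2.6, 4.1),
    ("q3_K_M", 1.7, 2.9),
    ("q2_K", 1.2, 2.4),
]


def pick_quant(vram_gb, long_context=False, headroom_gb=1.0):
    """Return the highest-fidelity quant that fits, or None if nothing fits."""
    budget = vram_gb - headroom_gb  # leave room for the desktop, buffers, etc.
    for name, at_4k, at_16k in QUANTS:
        need = at_16k if long_context else at_4k
        if need <= budget:
            return name
    return None


if __name__ == "__main__":
    for card_gb in (6.0, 8.0, 12.0):
        print(card_gb, "GB ->", pick_quant(card_gb, long_context=True))
```

This selects on fit alone. Throughput drops as fidelity rises, so a card that can hold fp16 may still be better served by q4_K_M or q5_K_M.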

q2_K is included for the curious, but the PR author flagged it as a "novelty quant": fine for autocomplete-style use, but it breaks down quickly on anything requiring multi-step reasoning.

Benchmark table: tok/s on three reference rigs

| Rig | Quant | Tok/s | Notes |
| --- | --- | --- | --- |
| RTX 3060 12GB | q4_K_M | 105 | Hybrid attention shines at 16k+ |
| RTX 3060 12GB | q5_K_M | 78 | Recommended for production |
| RTX 3060 12GB | fp16 | 52 | Use only if you need full fidelity |
| RTX 3060 Ti 8GB | q4_K_M | 122 | Faster cores, but no fp16 headroom |
| RTX 3060 Ti 8GB | q5_K_M | 91 | Tight VRAM at 16k context |
| Ryzen 7 5800X (CPU only, 8C/16T) | q4_K_M | 22 | Workable for short prompts |
| Ryzen 7 5800X (CPU only, 8C/16T) | q5_K_M | 17 | Slow prefill on 8k+ contexts |

The 3060 Ti's wider memory bus and higher SM count buy roughly 16 percent on generation when both cards fit the same quant. But the moment you want fp16, or a 16k context plus a second model loaded for speculative decoding, the 12GB part becomes the practical winner. Per TechPowerUp's GPU specs the 12 GB card's memory bus is narrower (192-bit vs 256-bit), but for a model this small capacity matters more than the bandwidth gap: it decides whether the weights, the KV cache, and any draft model fit on the card at all.

Prefill vs generation: where the hybrid attention actually wins

Hybrid attention is a memory-bandwidth optimization, not a compute optimization. Standard grouped-query attention reads the full KV cache for each new token; Laguna's local-attention layers read only a fixed window. On short contexts the difference is invisible, because decode time in both architectures is dominated by streaming the model weights. On long contexts the local-attention layers cut KV reads dramatically, and that compounds into faster per-token decode.
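
If decode is bandwidth-bound in this way, a crude ceiling on tok/s is memory bandwidth divided by the bytes read per token. The sketch below is an idealized roofline, not a measurement: 360 GB/s is the RTX 3060 12GB's spec-sheet bandwidth, 1.8 GB is the approximate q4_K_M GGUF size quoted later in this article, and the KV read sizes are placeholder values chosen for illustration.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb, kv_read_gb=0.0):
    """Upper bound on decode tok/s if each token streams weights + KV once."""
    gb_per_token = weights_gb + kv_read_gb
    if gb_per_token <= 0:
        raise ValueError("bytes read per token must be positive")
    # Ignores compute time, kernel launch overhead, and cache effects.
    return bandwidth_gb_s / gb_per_token


if __name__ == "__main__":
    # Placeholder KV read sizes in GB; larger contexts mean larger reads.
    for kv_gb in (0.0, 0.5, 2.0):
        ceiling = decode_ceiling_tok_s(360.0, 1.8, kv_gb)
        print(f"KV read {kv_gb:.1f} GB -> ceiling {ceiling:.0f} tok/s")
```

Real throughput sits well below this ceiling, but the model shows why shrinking per-token KV reads matters more as context grows: the KV term starts small relative to the weights and ends up comparable to them.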

The PR thread's measurements show:

  • At 1k context, Laguna and a comparable dense 3B model are within 4 percent.
  • At 8k context, Laguna pulls ahead by 11 percent.
  • At 32k context, Laguna pulls ahead by 28 percent.
  • At 64k context, Laguna pulls ahead by 41 percent.

Prefill (the bulk encode of the prompt before the first token) sees a similar but smaller curve: Laguna is around 18 percent faster at 64k prefill on the same hardware, because the local-attention layers also shrink the quadratic attention cost of prefill.

Context-length impact analysis (4k / 16k / 64k KV growth)

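| Context | KV cache (fp16) | Total VRAM (q4_K_M weights, fp16 KV) | Decode tok/s (q4_K_M) |
| --- | --- | --- | --- |
| 4k | 0.32 GB | 2.6 GB | 105 |
| 8k | 0.65 GB | 2.9 GB | 103 |
| 16k | 1.30 GB | 4.1 GB | 98 |
| 32k | 2.60 GB | 5.6 GB | 89 |
| 64k | 5.20 GB | 8.4 GB | 71 |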

The graceful tok/s degradation matters. A standard 3B model at 64k context typically drops to roughly half its 4k tok/s because the KV cache read for every new token grows linearly with context. Laguna's hybrid layers cushion that drop: at 64k it retains about 68 percent of its 4k throughput (71 vs 105 tok/s).

For RAG pipelines that stuff 32k-48k tokens of retrieved chunks into the prompt, this is the difference between "barely usable" and "interactive."

What workloads Laguna XS.2 is NOT good for

The PR thread collects these criticisms from r/LocalLLaMA reviewers:

  • Heavy code generation. The base model was not trained on a strong code corpus. HumanEval-Plus scores trail Qwen3.6-3B by roughly 9 percentage points, and the gap widens on multi-file repository understanding.
  • Vision or multimodal tasks. Laguna XS.2 is text-only. There is no vision tower planned per the model card.
  • Tool use with structured outputs. The model card explicitly warns that function-calling support is unfinetuned. JSON-mode outputs work but get sloppy past ~600 tokens, with quote-escape bugs and trailing-comma errors that break strict parsers.
  • Math benchmarks past GSM8K. Performance on MATH and Olympiad-level benchmarks is poor. Stick to Qwen3.6-3B or step up to Qwen3.6-7B if math reasoning matters.
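
For the structured-output weakness above, one common mitigation is to validate every response with a strict parser and retry on failure. The sketch below uses only Python's standard `json` module; `generate` is a hypothetical callable (prompt in, text out) standing in for whatever client you use, and the retry wording is an assumption, not something the model card recommends.

```python
import json


def parse_strict(text, required_keys=()):
    """Return (obj, None) on success or (None, error message) on failure."""
    try:
        obj = json.loads(text)  # rejects trailing commas and bad escapes
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON at char {exc.pos}: {exc.msg}"
    if not isinstance(obj, dict):
        return None, "expected a JSON object"
    missing = [key for key in required_keys if key not in obj]
    if missing:
        return None, f"missing keys: {missing}"
    return obj, None


def generate_json(generate, prompt, required_keys=(), max_attempts=3):
    """Call `generate` until its output parses, feeding errors back in."""
    error = None
    for _ in range(max_attempts):
        full_prompt = prompt
        if error is not None:
            # Tell the model why the previous attempt was rejected.
            full_prompt = (f"{prompt}\n\nPrevious output was rejected "
                           f"({error}). Return valid JSON only.")
        obj, error = parse_strict(generate(full_prompt), required_keys)
        if obj is not None:
            return obj
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {error}")
```

Keeping requested outputs short also helps, since the reported failures appear past roughly 600 tokens.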

If your workload is conversational chat, RAG, summarization, or general-knowledge Q&A at modest context lengths, Laguna is competitive. If it leans toward code or tools, Qwen3.6-3B remains the better small-model pick.

Verdict matrix

| Pick this if you want... | Recommended model |
| --- | --- |
| Maximum context throughput on consumer hardware | Laguna XS.2 q4_K_M |
| Best small-model code generation | Qwen3.6-3B q4_K_M |
| Absolute fastest tok/s in a 3B class | Gemma-4-Tiny q4_K_M |
| Apache-2.0 license for commercial redistribution | Laguna XS.2 or Qwen3.6-3B |
| Long-context RAG with 32k+ chunks | Laguna XS.2 q5_K_M |
| Tool-use / function-calling pipelines | Qwen3.6-3B q5_K_M |

Real-world numbers from PR-thread reproductions

Five independent reproductions from r/LocalLLaMA threads in the 48 hours after merge:

| Reporter rig | Quant | Tok/s | Delta vs author |
| --- | --- | --- | --- |
| RTX 3060 12GB (PCIe 4.0, DDR4-3600) | q4_K_M | 102 | -3% |
| RTX 3060 12GB (PCIe 3.0, DDR4-3200) | q4_K_M | 96 | -9% |
| RTX 3060 Ti 8GB (PCIe 4.0, DDR5-6000) | q4_K_M | 124 | +2% |
| Ryzen 7 5800X CPU only (DDR4-3600) | q4_K_M | 24 | +9% |
| Apple M2 (CPU + Metal accel) | q4_K_M | 68 | n/a |

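The PCIe 3.0 result is the interesting one: that rig, which also runs slower DDR4-3200 memory, lands about 6 percent behind the PCIe 4.0 card and 9 percent behind the author's figure, which suggests throughput is mildly sensitive to host-side bandwidth. If you are pairing a 3060 with an older motherboard, expect about a 6-9 percent generation hit.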

Common pitfalls when running Laguna XS.2 locally

  1. Wrong rope_base. The model needs RoPE base 500000, not the default 10000. If you see context-overflow gibberish past 4k tokens, check that your wrapper picked up the model card's metadata correctly. LM Studio and Ollama handle this automatically; raw llama-cli users may need to pass --rope-freq-base 500000 if the GGUF metadata does not carry the value.
  2. Stale GGUF conversions. Convert from the post-merge weights, not the pre-PR snapshot. Two early uploads on TheBloke-style mirrors used the old conversion path and produced broken quants — symptoms include sudden repetition loops past 200 tokens.
  3. Wrong tokenizer. Use the SentencePiece tokenizer the PR ships, not the older HF tokenizer the original release used. They have different special-token IDs, and mixing them produces a "model knows English but answers in tokens" failure mode.
  4. KV cache fp16 vs q8 mixup. By default llama.cpp keeps the KV cache in fp16 (equivalent to -ctk f16 -ctv f16). Setting -ctk q8_0 -ctv q8_0 roughly halves the cache memory at a 1-2 percent quality cost, which helps a 16k context fit on an 8 GB card.
  5. Prefill timeouts on long prompts. A 16k-token prompt on CPU only can take 10-15 seconds before the first token, and longer prompts or slower hosts take more. Wrappers with low default timeouts (Ollama's default API timeout is 30 seconds for the first response) may bail out mid-prefill. Either raise the timeout or pre-warm with a short prompt first.
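
The flags named in pitfalls 1 and 4 can be collected into a single invocation. The sketch below only builds and prints the argument list without running anything. The model filename is hypothetical, and `-m`, `-c`, and `-ngl` are llama.cpp's usual model-path, context-size, and GPU-layer options rather than anything specific to this PR. Some llama.cpp builds also require flash attention to be enabled before a quantized V cache is accepted, so check `llama-cli --help` for your build.

```python
import shlex


def llama_cli_args(model_path, ctx=16384, rope_freq_base=500000,
                   kv_type="q8_0", gpu_layers=99, binary="llama-cli"):
    """Build an argv list for llama-cli using the settings discussed above."""
    return [
        binary,
        "-m", model_path,                         # path to the GGUF file
        "-c", str(ctx),                           # context window in tokens
        "--rope-freq-base", str(rope_freq_base),  # pitfall 1
        "-ctk", kv_type,                          # K-cache type, pitfall 4
        "-ctv", kv_type,                          # V-cache type, pitfall 4
        "-ngl", str(gpu_layers),                  # layers to offload to the GPU
    ]


if __name__ == "__main__":
    # Hypothetical filename; substitute the path to your own conversion.
    args = llama_cli_args("laguna-xs2-q4_k_m.gguf")
    print(shlex.join(args))
```

The list can be passed straight to `subprocess.run(args)` once the binary and model path exist on your machine.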

When NOT to upgrade your hardware for Laguna XS.2

If you are already running Qwen3.6-3B on a card you own, Laguna is not a forcing function. The model is faster and better at long context, but the use cases where that matters (>16k context, dense RAG) are a minority of small-model workloads. Save the upgrade money for a real step up — Qwen3.6-14B on a 16 GB card, or a 32B-class model on a 24 GB card — rather than churning hardware to chase a 20 percent throughput gain on a 3B.

Bottom line + the cheapest catalog GPU that runs Laguna XS.2 fluently

The cheapest catalog card that runs Laguna XS.2 q5_K_M at 16k context with headroom is the RTX 3060 12GB. Either the ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB or the MSI GeForce RTX 3060 Ventus 2X 12G will do the job — both routinely sit under $320 on the used market and both pull about 170 W under inference load, well within reach of a 550 W PSU.

For CPU-only experimentation, an AMD Ryzen 7 5800X holds up surprisingly well in this weight class — 22-26 tok/s at q4_K_M, with the prefill quirks already documented above. If you only need short-prompt chat, the CPU path is genuinely viable.

A budget Crucial BX500 1TB SATA SSD handles the model-weight loading hit without drama. Cold-load time from disk for the q4_K_M GGUF (about 1.8 GB) is a little over 3 seconds at SATA's practical ceiling of roughly 550 MB/s; you will not notice it after the first prompt.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

SpecPicks earns a commission on qualifying purchases through both Amazon and eBay affiliate links. Prices and stock update independently.

Frequently asked questions

How much VRAM does Laguna XS.2 actually need?
Per the llama.cpp PR, the q4_K_M quant fits in about 2.6 GB of VRAM with a 4k context, climbing to roughly 4.1 GB at 16k. fp16 lands near 6.8 GB. That means an RTX 3060 12GB hosts the model with comfortable room for a second small model in parallel — useful for draft-model speculative decoding. CPU-only inference is workable on any 6-core+ desktop chip, though prompt-prefill on long contexts will be the bottleneck rather than generation throughput.
Is Laguna XS.2 actually faster than Qwen3.6-3B at q4_K_M?
The PR submitter posted comparison numbers showing Laguna XS.2 at roughly 95-110 tok/s on an RTX 3060 12GB vs Qwen3.6-3B at 78-92 tok/s, both q4_K_M at 4k context. Time-to-first-token favored Laguna by about 15 percent (168 ms vs 198 ms) thanks to its hybrid attention scheme. Independent reproductions in the PR thread are still rolling in; treat the headline number as upstream-author-reported until LocalLLaMA confirms with a clean rig.
Can I run this on my Ryzen 7 5800X without a GPU?
Yes — at q4_K_M the model is small enough that eight cores deliver something in the 18-26 tok/s range on a stock 5800X, per llama.cpp reference CPU benchmark numbers for similar 3B-class models. Prefill is the catch: feeding it 8k of context can take 6-12 seconds before the first token. For interactive chat with short prompts, CPU-only is genuinely usable; for RAG with long retrieved chunks, plan to add a discrete GPU.
What license does Laguna XS.2 ship under and can I use it commercially?
The upstream model card lists an Apache-2.0-compatible weight license per the linked HuggingFace repo, which permits commercial use, redistribution, and derivative fine-tunes. As always with new model releases, verify the exact LICENSE file in the repo before shipping in a paid product — wrapper licenses, training-data acceptable-use clauses, and downstream fine-tune restrictions sometimes diverge from the headline label that gets reported on social media.
Why does llama.cpp support matter more than the original release?
A model that only runs in the author's research code reaches maybe 200 people. A model with a merged llama.cpp PR reaches Ollama, LM Studio, Open-WebUI, GPT4All, KoboldCpp, Jan.ai, Cortex, and every wrapper that pulls llama.cpp as a submodule — typically within a week of merge. That distribution moat is why upstream PRs are the real release signal for local inference, not the HuggingFace upload that preceded it by weeks or months.

— SpecPicks Editorial · Last verified 2026-06-04