Laguna XS.2 Lands in llama.cpp: What the Tiny Hybrid Model Means for Local Inference

Author: Mike Perry

An upstream PR puts a 3B-class hybrid-attention model in reach of every llama.cpp wrapper — here's how it benchmarks against Qwen3.6-3B and Gemma-4-Tiny on consumer cards.

By Mike Perry · Published 2026-05-28 · Last verified 2026-07-06 · 12 min read

Laguna XS.2 is a 3B-class hybrid-attention language model whose llama.cpp integration PR landed this week. Per the PR submitter's benchmark, the q4_K_M quant runs at roughly 95-110 tokens per second on a single RTX 3060 12GB at a 4k context — fast enough for real-time chat and small RAG pipelines on a $300 used GPU. This piece is editorial synthesis of the PR thread, public reproductions, and adjacent hardware data.

Why a llama.cpp PR matters more than a HuggingFace upload

A model released only on HuggingFace reaches maybe 200 enthusiasts. A model with a merged llama.cpp PR reaches Ollama, LM Studio, Open-WebUI, GPT4All, KoboldCpp, Jan.ai, Cortex, and every wrapper that pulls llama.cpp as a submodule. The distribution moat shows up within a week of merge — Ollama's CI pulls the latest llama.cpp tag on every nightly build, and LM Studio bundles a fresh binary in each point release.

This is the real release signal for local inference. The Laguna team had been posting weight uploads to HuggingFace for months without much traction. Once Mike Pearce (the maintainer who landed the PR) confirmed working ggml conversion plus a passing quantization-quality smoke test, the model hit r/LocalLLaMA's front page within twelve hours and now sits in the top tier of "small model worth trying this weekend" recommendations.

For technical buyers shopping a small inference rig, the practical question becomes which GPU runs Laguna XS.2 fluently at a useful context length. That is the question this synthesis answers — with concrete tok/s numbers, quantization-aware VRAM budgets, and a comparison against the two closest small-model competitors most readers are already running.

Key takeaways

VRAM floor: q4_K_M lands near 2.6 GB at 4k context and 4.1 GB at 16k, which leaves comfortable headroom on any 8 GB card. fp16 needs roughly 6.8 GB at 4k context.
Tok/s on RTX 3060 12GB: 95-110 tok/s at q4_K_M, ~78 tok/s at q5_K_M, ~52 tok/s at fp16. Per the PR thread author's reference numbers.
vs Qwen3.6-3B: Laguna is roughly 22-24 percent faster at the same quant and about 15 percent quicker on time-to-first-token (105 vs 85 tok/s and 168 vs 198 ms at q4_K_M) thanks to its hybrid-attention pattern.
License: Apache-2.0-compatible per the upstream model card. Commercial use is permitted; verify the LICENSE file before shipping in a paid product.
CPU-only viability: Genuinely usable on a 6-core+ desktop chip such as the AMD Ryzen 7 5800X at q4_K_M — 18-26 tok/s generation, with prefill the main bottleneck on long contexts.

What is Laguna XS.2? Architecture and sizing

Laguna XS.2 is a 3.1B-parameter dense decoder with what the authors call a "hybrid attention skip" pattern — roughly two-thirds of layers run standard grouped-query attention while the remaining third use a fixed-window local attention with stride-2 KV reuse. The net effect is a flat-ish memory profile under long contexts and a meaningfully smaller KV cache than a standard 3B model running the same window.

Per the upstream model card the architecture targets:

28 decoder layers, hidden dim 2560
20 query heads, 4 KV heads (GQA group 5)
64k context (RoPE base 500000, NTK-aware scaling)
128,256-token vocabulary inherited from the SentencePiece tokenizer used by the broader Laguna family

Quantization is where llama.cpp's contribution earns its keep. The PR adds K-quants support all the way down to q2_K and exposes the model to the IQ-quants pipeline so future builds will pick up imatrix-based calibration automatically.

How does Laguna XS.2 compare to Qwen3.6-3B and Gemma-4-Tiny on tok/s?

The PR submitter posted side-by-side numbers on a single RTX 3060 12GB and a Ryzen 7 5800X CPU host. Independent reproductions from r/LocalLLaMA users are still rolling in but converge on the same ranking. Treat the headline tok/s figures as upstream-author-reported until a clean LocalLLaMA rig publishes a confirmation thread.

Model	Quant	Tok/s (RTX 3060 12GB)	Time-to-first-token (ms)
Laguna XS.2 (3.1B)	q4_K_M	105	168
Laguna XS.2 (3.1B)	q5_K_M	78	191
Laguna XS.2 (3.1B)	fp16	52	224
Qwen3.6-3B	q4_K_M	85	198
Qwen3.6-3B	q5_K_M	64	226
Gemma-4-Tiny (2.6B)	q4_K_M	118	142
Gemma-4-Tiny (2.6B)	fp16	61	197

Gemma-4-Tiny is still the absolute speed leader in this weight class — it is 600M parameters smaller and Google's tokenizer is tighter — but the quality gap on multi-turn reasoning is visible. Laguna's hybrid attention buys back about half of that quality gap while staying well within real-time interactive territory.
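The relative gaps are easy to recompute from the table. A minimal Python sketch, using only the upstream-author-reported figures above (the dictionary layout and the helper name relative_gain are illustrative, not something from the PR):

```python
# Figures copied from the comparison table above (upstream-author-reported).
RESULTS = {
    ("Laguna XS.2", "q4_K_M"): {"tok_s": 105, "ttft_ms": 168},
    ("Laguna XS.2", "q5_K_M"): {"tok_s": 78, "ttft_ms": 191},
    ("Qwen3.6-3B", "q4_K_M"): {"tok_s": 85, "ttft_ms": 198},
    ("Qwen3.6-3B", "q5_K_M"): {"tok_s": 64, "ttft_ms": 226},
}


def relative_gain(model, baseline, quant):
    """Return (throughput gain %, time-to-first-token reduction %)."""
    m = RESULTS[(model, quant)]
    b = RESULTS[(baseline, quant)]
    speedup = (m["tok_s"] / b["tok_s"] - 1) * 100
    ttft_cut = (1 - m["ttft_ms"] / b["ttft_ms"]) * 100
    return round(speedup, 1), round(ttft_cut, 1)


for quant in ("q4_K_M", "q5_K_M"):
    print(quant, relative_gain("Laguna XS.2", "Qwen3.6-3B", quant))
```

On these figures Laguna comes out roughly 22-24 percent ahead of Qwen3.6-3B on throughput and about 15 percent lower on time-to-first-token at both quants.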

Spec-delta table: Laguna XS.2 vs Qwen3.6-3B vs Gemma-4-Tiny

Spec	Laguna XS.2	Qwen3.6-3B	Gemma-4-Tiny
Parameters	3.1B	3.0B	2.6B
Layers	28	32	26
Hidden dim	2560	2560	2304
Query heads / KV heads	20 / 4	20 / 4	16 / 4
Vocab size	128,256	151,936	256,000
Context length (native)	64k	32k	32k
KV cache (16k context, fp16)	~1.4 GB	~1.7 GB	~1.5 GB
License	Apache-2.0	Apache-2.0	Gemma TOU

Two things to call out. First, Laguna's smaller KV cache at long contexts is the whole point of the hybrid attention design — a 64k context that would blow past 5 GB of KV on a vanilla 3B model fits in roughly 3.4 GB here. Second, the Gemma-4-Tiny license is Google's bespoke Gemma Terms of Use, not Apache-2.0; that matters for commercial deployments where redistribution rights are part of the business model.
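To see where the saving comes from, here is a back-of-envelope sketch of KV cache size. It assumes a head dimension of 128 (hidden dim 2560 divided by 20 query heads), 9 of the 28 layers using local attention (roughly a third), and a hypothetical 4,096-token window. The window size is not given in this article's sources, and the stride-2 KV reuse is ignored. Runtime buffers are left out as well, so the ratio is more meaningful than the absolute figures.

```python
def kv_cache_bytes(ctx_tokens, n_layers=28, n_kv_heads=4, head_dim=128,
                   local_layers=0, window=None, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a GQA decoder.

    Layers counted in `local_layers` keep at most `window` tokens;
    every other layer keeps the full context.
    """
    # K and V each hold n_kv_heads * head_dim elements per token per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    full_layers = n_layers - local_layers
    local_tokens = ctx_tokens if window is None else min(ctx_tokens, window)
    return per_token_per_layer * (
        full_layers * ctx_tokens + local_layers * local_tokens
    )


GIB = 1024 ** 3
for ctx in (4096, 16384, 65536):
    dense = kv_cache_bytes(ctx)
    # local_layers=9 and window=4096 are assumptions for illustration only.
    hybrid = kv_cache_bytes(ctx, local_layers=9, window=4096)
    print(f"{ctx:>6} tokens: dense {dense / GIB:.2f} GiB, "
          f"hybrid {hybrid / GIB:.2f} GiB, ratio {hybrid / dense:.2f}")
```

Under these assumptions the hybrid cache at 64k is about 70 percent of the dense one, in the same neighbourhood as the roughly 3.4 GB versus 5 GB comparison above.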

Quantization matrix: VRAM, tok/s, quality loss

Quant	VRAM @ 4k ctx	VRAM @ 16k ctx	Tok/s (RTX 3060 12GB)	Quality loss vs fp16
q2_K	1.2 GB	2.4 GB	132	Severe — chat-only
q3_K_M	1.7 GB	2.9 GB	124	Noticeable on code
q4_K_M	2.6 GB	4.1 GB	105	Negligible
q5_K_M	3.1 GB	4.7 GB	78	None measurable
q6_K	3.7 GB	5.3 GB	64	None
q8_0	4.6 GB	6.2 GB	58	None
fp16	6.8 GB	8.5 GB	52	Reference

q4_K_M is the sweet spot for this model. The PR thread reports that q3_K_M loses about 4 percentage points on HumanEval-Plus and noticeable accuracy on multi-turn instruction following, while q5_K_M is indistinguishable from fp16 on every benchmark the authors ran. If you have the VRAM, run q5_K_M; if you want maximum throughput, q4_K_M.

q2_K is included for the curious, but the PR author flagged it as a "novelty quant": fine for autocomplete-style use, but it breaks down quickly on anything requiring multi-step reasoning.
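The selection rule in this section (q5_K_M if it fits, otherwise q4_K_M, then the lower quants as fallbacks) can be written as a small lookup. This is a sketch over the reported VRAM figures from the matrix above; the 1 GB reserve for the desktop and runtime buffers is an assumed safety margin, not a measured value.

```python
# Reported VRAM in GB at 4k and 16k context, copied from the matrix above.
VRAM_GB = {
    "q2_K": {"4k": 1.2, "16k": 2.4},
    "q3_K_M": {"4k": 1.7, "16k": 2.9},
    "q4_K_M": {"4k": 2.6, "16k": 4.1},
    "q5_K_M": {"4k": 3.1, "16k": 4.7},
}

# Preference order follows the article: q5_K_M when VRAM allows,
# q4_K_M otherwise, lower quants only as a last resort.
PREFERENCE = ["q5_K_M", "q4_K_M", "q3_K_M", "q2_K"]


def pick_quant(free_vram_gb, ctx="16k", reserve_gb=1.0):
    """Return the first preferred quant that fits, or None if nothing does."""
    budget = free_vram_gb - reserve_gb  # reserve_gb is an assumed margin
    for quant in PREFERENCE:
        if VRAM_GB[quant][ctx] <= budget:
            return quant
    return None


print(pick_quant(8.0))         # 8 GB card, 16k context
print(pick_quant(5.5))         # tighter budget
print(pick_quant(3.0, "4k"))   # small budget, short context
```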

Benchmark table: tok/s on three reference rigs

Rig	Quant	Tok/s	Notes
RTX 3060 12GB	q4_K_M	105	Hybrid attention shines at 16k+
RTX 3060 12GB	q5_K_M	78	Recommended for production
RTX 3060 12GB	fp16	52	Use only if you need full fidelity
RTX 3060 Ti 8GB	q4_K_M	122	Faster cores, but no fp16 headroom
RTX 3060 Ti 8GB	q5_K_M	91	Tight VRAM at 16k context
Ryzen 7 5800X (CPU only, 8C/16T)	q4_K_M	22	Workable for short prompts
Ryzen 7 5800X (CPU only, 8C/16T)	q5_K_M	17	Slow prefill on 8k+ contexts

The 3060 Ti's wider memory bus and higher SM count buy roughly 16 percent on generation when both cards fit the same quant. But the moment you want fp16, or you want to keep a 16k context plus a second model loaded for speculative decoding, the 12GB part becomes the practical winner. Per TechPowerUp's GPU specs the 12 GB card's memory bus is narrower (192-bit vs 256-bit), so it gives up some decode speed, but capacity decides whether the weights, the KV cache, and any draft model fit at all, and that matters more once a workload outgrows 8 GB.
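A rough way to reason about the gap between the two cards is a memory-bandwidth ceiling: each generated token has to stream the weights once, so decode speed cannot exceed bandwidth divided by weight size. The sketch below assumes 15 Gbps per pin on the 192-bit RTX 3060 12GB (as in the ZOTAC listing), 14 Gbps on the 256-bit 3060 Ti (taken from public spec sheets, not from the PR thread), and the roughly 1.8 GB q4_K_M file size quoted later in this article. KV cache reads and compute are ignored.

```python
def peak_bandwidth_gb_s(bus_bits, gbps_per_pin):
    """Theoretical memory bandwidth in GB/s for a given bus width."""
    return bus_bits * gbps_per_pin / 8


def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed if every token streams all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 1.8  # approximate q4_K_M GGUF size quoted in this article

cards = {
    # name: (bus width in bits, Gbps per pin, reported q4_K_M tok/s)
    "RTX 3060 12GB": (192, 15, 105),
    "RTX 3060 Ti 8GB": (256, 14, 122),  # 14 Gbps is an assumption
}

for name, (bus, rate, reported) in cards.items():
    bw = peak_bandwidth_gb_s(bus, rate)
    ceiling = decode_ceiling_tok_s(bw, WEIGHTS_GB)
    print(f"{name}: {bw:.0f} GB/s, ceiling {ceiling:.0f} tok/s, "
          f"reported {reported} tok/s ({100 * reported / ceiling:.0f}% of ceiling)")
```

Both reported figures land at about half of this idealized ceiling, which is plausible for a simple model that ignores KV reads, kernel launch overhead, and sampling.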

Prefill vs generation: where the hybrid attention actually wins

Hybrid attention is a memory-bandwidth optimization, not a compute optimization. Standard grouped-query attention reads the full KV cache for each new token; Laguna's local-attention layers read only a fixed window. On short contexts the difference is invisible, because decode time in both architectures is dominated by streaming the model weights. On long contexts the local-attention layers cut KV reads dramatically, and that compounds into faster per-token decode.

The PR thread's measurements show:

At 1k context, Laguna and a comparable dense 3B model are within 4 percent.
At 8k context, Laguna pulls ahead by 11 percent.
At 32k context, Laguna pulls ahead by 28 percent.
At 64k context, Laguna pulls ahead by 41 percent.

Prefill (the bulk encode of the prompt before the first token) sees a similar but smaller curve. Laguna is around 18 percent faster at 64k prefill on the same hardware because the local-attention layers also shrink the quadratic attention cost of processing the prompt.
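The prefill effect can be illustrated by counting query-key pairs. Causal attention over n tokens scores n(n+1)/2 pairs, while a fixed window caps each token at the most recent w positions. The sketch below assumes 9 local layers out of 28 and a hypothetical 4,096-token window, neither of which is confirmed by the sources quoted here, and it covers only the attention-score work.

```python
def causal_pairs(n):
    """Query-key pairs under full causal attention: token i sees 0..i."""
    return n * (n + 1) // 2


def windowed_pairs(n, window):
    """Query-key pairs when each token sees at most `window` recent tokens."""
    if n <= window:
        return causal_pairs(n)
    # The first `window` tokens behave causally; the rest see exactly `window`.
    return causal_pairs(window) + (n - window) * window


def hybrid_attention_cost(n, n_layers=28, local_layers=9, window=4096):
    """Attention-score work of the hybrid stack relative to all-full layers."""
    full = (n_layers - local_layers) * causal_pairs(n)
    local = local_layers * windowed_pairs(n, window)
    return (full + local) / (n_layers * causal_pairs(n))


for ctx in (1024, 8192, 32768, 65536):
    print(f"{ctx:>6} tokens: {hybrid_attention_cost(ctx):.2f}x of dense attention work")
```

Projections and MLP blocks cost the same in both designs, which is one reason an end-to-end prefill gain is smaller than the attention-only saving.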

Context-length impact analysis (4k / 16k / 64k KV growth)

Context	KV cache (fp16)	Total VRAM (q4_K_M weights + fp16 KV)	Decode tok/s (q4_K_M)
4k	0.32 GB	2.6 GB	105
8k	0.65 GB	2.9 GB	103
16k	1.30 GB	4.1 GB	98
32k	2.60 GB	5.6 GB	89
64k	5.20 GB	8.4 GB	71

The graceful tok/s degradation matters. A standard 3B model at 64k context typically drops to roughly half its 4k tok/s because the KV cache it must read for every generated token grows linearly with context. Laguna's hybrid layers cushion that drop: at 64k it retains about 68 percent of its 4k throughput (71 vs 105 tok/s).
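The retention figure follows directly from the decode column of the table. A short sketch, with the numbers copied from the table and the helper name chosen for illustration:

```python
# Decode tok/s at q4_K_M by context length (in thousands of tokens),
# copied from the table above.
DECODE_TOK_S = {4: 105, 8: 103, 16: 98, 32: 89, 64: 71}


def retention_pct(table, base_ctx=4):
    """Throughput at each context as a percentage of the base context."""
    base = table[base_ctx]
    return {ctx: round(100 * tps / base, 1) for ctx, tps in table.items()}


for ctx, pct in retention_pct(DECODE_TOK_S).items():
    print(f"{ctx:>2}k context: {pct}% of 4k throughput")
```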

For RAG pipelines that stuff 32k-48k tokens of retrieved chunks into the prompt, this is the difference between "barely usable" and "interactive."

What workloads Laguna XS.2 is NOT good for

Criticisms raised in the PR thread and by r/LocalLLaMA reviewers:

Heavy code generation. The base model was not trained on a strong code corpus. HumanEval-Plus scores trail Qwen3.6-3B by roughly 9 percentage points, and the gap widens on multi-file repository understanding.
Vision or multimodal tasks. Laguna XS.2 is text-only. There is no vision tower planned per the model card.
Tool use with structured outputs. The model card explicitly warns that function-calling support is unfinetuned. JSON-mode outputs work but get sloppy past ~600 tokens, with quote-escape bugs and trailing-comma errors that break strict parsers.
Math benchmarks past GSM8K. Performance on MATH and Olympiad-level benchmarks is poor. Stick to Qwen3.6-3B or step up to Qwen3.6-7B if math reasoning matters.
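For the structured-output caveat in particular, it helps to validate model output with a strict parser before passing it downstream. Python's standard json module rejects trailing commas, so a thin wrapper is enough to catch that class of error; the payloads below are made-up examples, not real model output.

```python
import json


def parse_strict(raw):
    """Return (obj, None) on success or (None, error message) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # exc.msg and exc.pos describe what went wrong and where.
        return None, f"{exc.msg} at char {exc.pos}"


# Hypothetical tool-call payloads for illustration.
good = '{"tool": "search", "args": {"q": "llama.cpp"}}'
bad = '{"tool": "search", "args": {"q": "llama.cpp"},}'  # trailing comma

for raw in (good, bad):
    obj, err = parse_strict(raw)
    print("ok" if err is None else f"rejected: {err}")
```

A caller can retry the generation or fall back to a default when the second element of the tuple is not None.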

If your workload is conversational chat, RAG, summarization, or general-knowledge Q&A at modest context lengths, Laguna is competitive. If it leans toward code or tools, Qwen3.6-3B remains the better small-model pick.

Verdict matrix

Pick this if you want...	Recommended model
Maximum context throughput on consumer hardware	Laguna XS.2 q4_K_M
Best small-model code generation	Qwen3.6-3B q4_K_M
Absolute fastest tok/s in a 3B class	Gemma-4-Tiny q4_K_M
Apache-2.0 license for commercial redistribution	Laguna XS.2 or Qwen3.6-3B
Long-context RAG with 32k+ chunks	Laguna XS.2 q5_K_M
Tool-use / function-calling pipelines	Qwen3.6-3B q5_K_M

Real-world numbers from PR-thread reproductions

Five independent reproductions from r/LocalLLaMA threads in the 48 hours after merge:

Reporter rig	Quant	Tok/s	Delta vs author
RTX 3060 12GB (PCIe 4.0, DDR4-3600)	q4_K_M	102	-3%
RTX 3060 12GB (PCIe 3.0, DDR4-3200)	q4_K_M	96	-9%
RTX 3060 Ti 8GB (PCIe 4.0, DDR5-6000)	q4_K_M	124	+2%
Ryzen 7 5800X CPU only (DDR4-3600)	q4_K_M	24	+9%
Apple M2 (CPU + Metal accel)	q4_K_M	68	n/a

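The PCIe 3.0 result is the interesting one. It suggests generation throughput is mildly sensitive to the host platform (PCIe generation and system RAM speed) even when the model sits entirely in VRAM, although the two 3060 rigs differ in both, so the run does not isolate PCIe. If you are pairing a 3060 with an older motherboard, expect about a 6-9 percent generation hit.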

Common pitfalls when running Laguna XS.2 locally

Wrong rope_base. The model needs RoPE base 500000, not the default 10000. If you see context-overflow gibberish past 4k tokens, check that your wrapper picked up the model card's metadata correctly. LM Studio and Ollama handle this automatically; raw llama-cli users have to pass --rope-freq-base 500000.
Stale GGUF conversions. Convert from the post-merge weights, not the pre-PR snapshot. Two early uploads on TheBloke-style mirrors used the old conversion path and produced broken quants — symptoms include sudden repetition loops past 200 tokens.
Wrong tokenizer. Use the SentencePiece tokenizer the PR ships, not the older HF tokenizer the original release used. They have different special-token IDs, and mixing them produces a "model knows English but answers in tokens" failure mode.
KV cache fp16 vs q8 mixup. By default llama.cpp keeps the KV cache in fp16 (equivalent to -ctk f16 -ctv f16). Setting -ctk q8_0 -ctv q8_0 roughly halves the cache memory at a 1-2 percent quality cost; that lets a 16k context fit on an 8 GB card.
Prefill timeouts on long prompts. A 16k-token prompt on CPU only can take 10-15 seconds before the first token. Wrappers with low default timeouts (Ollama's default API timeout is 30 seconds for the first response) may bail out mid-prefill. Either raise the timeout or pre-warm with a short prompt first.
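The flags mentioned in the pitfalls above can be collected into one llama-cli invocation. The sketch below only assembles and prints the command; the GGUF file name is a placeholder, and depending on the llama.cpp build a quantized V cache may also need flash attention enabled, so check llama-cli --help before relying on it.

```python
import shlex


def llama_cli_args(model_path, ctx=16384, rope_base=500000, kv_type="q8_0"):
    """Build an argument list covering the RoPE base and KV cache type."""
    return [
        "llama-cli",
        "-m", model_path,
        "-c", str(ctx),                       # context size in tokens
        "--rope-freq-base", str(rope_base),   # Laguna expects 500000
        "-ctk", kv_type,                      # K cache type
        "-ctv", kv_type,                      # V cache type
    ]


# "laguna-xs2-q4_K_M.gguf" is a hypothetical file name.
args = llama_cli_args("laguna-xs2-q4_K_M.gguf")
print(shlex.join(args))
```

Passing the list to subprocess.run(args) would launch the process; printing it first makes it easy to confirm the RoPE base before a long prefill.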

When NOT to upgrade your hardware for Laguna XS.2

If you are already running Qwen3.6-3B on a card you own, Laguna is not a forcing function. The model is faster and better at long context, but the use cases where that matters (>16k context, dense RAG) are a minority of small-model workloads. Save the upgrade money for a real step up — Qwen3.6-14B on a 16 GB card, or a 32B-class model on a 24 GB card — rather than churning hardware to chase a 20 percent throughput gain on a 3B.

Bottom line + the cheapest catalog GPU that runs Laguna XS.2 fluently

The cheapest catalog card that runs Laguna XS.2 q5_K_M at 16k context with headroom is the RTX 3060 12GB. Either the ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB or the MSI GeForce RTX 3060 Ventus 2X 12G will do the job — both routinely sit under $320 on the used market and both pull about 170 W under inference load, well within reach of a 550 W PSU.

For CPU-only experimentation, an AMD Ryzen 7 5800X holds up surprisingly well in this weight class — 22-26 tok/s at q4_K_M, with the prefill quirks already documented above. If you only need short-prompt chat, the CPU path is genuinely viable.

A budget Crucial BX500 1TB SATA SSD handles the model-weight loading hit without drama. Cold-load time from disk for the q4_K_M GGUF (about 1.8 GB) is roughly 3-4 seconds at SATA's practical ceiling of about 550 MB/s; you will not notice it after the first prompt.

Citations and sources

This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.

Frequently asked questions

How much VRAM does Laguna XS.2 actually need?

Per the llama.cpp PR, the q4_K_M quant fits in about 2.6 GB of VRAM with a 4k context, climbing to roughly 4.1 GB at 16k. fp16 lands near 6.8 GB. That means an RTX 3060 12GB hosts the model with comfortable room for a second small model in parallel — useful for draft-model speculative decoding. CPU-only inference is workable on any 6-core+ desktop chip, though prompt-prefill on long contexts will be the bottleneck rather than generation throughput.

Is Laguna XS.2 actually faster than Qwen3.6-3B at q4_K_M?

The PR submitter posted comparison numbers showing Laguna XS.2 at roughly 95-110 tok/s on an RTX 3060 12GB vs Qwen3.6-3B at 78-92 tok/s, both q4_K_M at 4k context. Time-to-first-token favored Laguna by about 15 percent (168 ms vs 198 ms) thanks to its hybrid attention scheme. Independent reproductions in the PR thread are still rolling in; treat the headline number as upstream-author-reported until LocalLLaMA confirms with a clean rig.

Can I run this on my Ryzen 7 5800X without a GPU?

Yes — at q4_K_M the model is small enough that eight cores deliver something in the 18-26 tok/s range on a stock 5800X, per llama.cpp reference CPU benchmark numbers for similar 3B-class models. Prefill is the catch: feeding it 8k of context can take 6-12 seconds before the first token. For interactive chat with short prompts, CPU-only is genuinely usable; for RAG with long retrieved chunks, plan to add a discrete GPU.

What license does Laguna XS.2 ship under and can I use it commercially?

The upstream model card lists an Apache-2.0-compatible weight license per the linked HuggingFace repo, which permits commercial use, redistribution, and derivative fine-tunes. As always with new model releases, verify the exact LICENSE file in the repo before shipping in a paid product — wrapper licenses, training-data acceptable-use clauses, and downstream fine-tune restrictions sometimes diverge from the headline label that gets reported on social media.

Why does llama.cpp support matter more than the original release?

A model that only runs in the author's research code reaches maybe 200 people. A model with a merged llama.cpp PR reaches Ollama, LM Studio, Open-WebUI, GPT4All, KoboldCpp, Jan.ai, Cortex, and every wrapper that pulls llama.cpp as a submodule — typically within a week of merge. That distribution moat is why upstream PRs are the real release signal for local inference, not the HuggingFace upload that preceded it by weeks or months.
