Laguna XS.2 is a 3B-class hybrid-attention language model whose llama.cpp integration PR landed this week. Per the PR submitter's benchmark, the q4_K_M quant runs at roughly 95-110 tokens per second on a single RTX 3060 12GB at a 4k context, fast enough for real-time chat and small RAG pipelines on a used GPU that costs roughly $300. This piece is an editorial synthesis of the PR thread, public reproductions, and adjacent hardware data.
Why a llama.cpp PR matters more than a HuggingFace upload
A model released only on HuggingFace reaches maybe 200 enthusiasts. A model with a merged llama.cpp PR reaches Ollama, LM Studio, Open-WebUI, GPT4All, KoboldCpp, Jan.ai, Cortex, and every wrapper that pulls llama.cpp as a submodule. The distribution moat shows up within a week of merge — Ollama's CI pulls the latest llama.cpp tag on every nightly build, and LM Studio bundles a fresh binary in each point release.
This is the real release signal for local inference. The Laguna team had been posting weight uploads to HuggingFace for months without much traction. Once Mike Pearce (the maintainer who landed the PR) confirmed working ggml conversion plus a passing quantization-quality smoke test, the model hit r/LocalLLaMA's front page within twelve hours and now sits in the top tier of "small model worth trying this weekend" recommendations.
For technical buyers shopping a small inference rig, the practical question becomes which GPU runs Laguna XS.2 fluently at a useful context length. That is the question this synthesis answers — with concrete tok/s numbers, quantization-aware VRAM budgets, and a comparison against the two closest small-model competitors most readers are already running.
Key takeaways
- VRAM floor: q4_K_M lands near 2.6 GB at 4k context, 4.1 GB at 16k. fp16 needs roughly 6.8 GB at 4k. The quantized builds leave comfortable headroom on any 8 GB card.
- Tok/s on RTX 3060 12GB: 95-110 tok/s at q4_K_M, ~78 tok/s at q5_K_M, ~52 tok/s at fp16. Per the PR thread author's reference numbers.
- vs Qwen3.6-3B: Laguna is roughly 22-24 percent faster at the same quant and about 15 percent better on time-to-first-token thanks to its hybrid-attention pattern.
- License: Apache-2.0-compatible per the upstream model card. Commercial use is permitted; verify the LICENSE file before shipping in a paid product.
- CPU-only viability: Genuinely usable on a 6-core+ desktop chip such as the AMD Ryzen 7 5800X at q4_K_M, at roughly 22-26 tok/s generation, with prefill the main bottleneck on long contexts.
What is Laguna XS.2? Architecture and sizing
Laguna XS.2 is a 3.1B-parameter dense decoder with what the authors call a "hybrid attention skip" pattern — roughly two-thirds of layers run standard grouped-query attention while the remaining third use a fixed-window local attention with stride-2 KV reuse. The net effect is a flat-ish memory profile under long contexts and a meaningfully smaller KV cache than a standard 3B model running the same window.
Per the upstream model card the architecture targets:
- 28 decoder layers, hidden dim 2560
- 20 query heads, 4 KV heads (GQA group 5)
- 64k context (RoPE base 500000, NTK-aware scaling)
- 128256 token vocabulary inherited from the SentencePiece tokenizer used by the broader Laguna family
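The sizing above is enough for a back-of-envelope KV cache estimate. The sketch below assumes a head dimension of 2560 / 20 = 128 and an fp16 cache. The number of local-attention layers (9, roughly a third of 28) and the 4096-token window are hypothetical values chosen for illustration, and the stride-2 KV reuse is ignored, so treat the result as a rough estimate rather than a reproduction of measured VRAM.

```python
def kv_cache_bytes(context_tokens, n_layers=28, n_kv_heads=4, head_dim=128,
                   bytes_per_elem=2, local_layers=0, local_window=None):
    """Estimate KV cache size in bytes for a GQA decoder.

    local_layers / local_window are illustrative assumptions: layers that
    only keep the most recent `local_window` tokens in their cache.
    """
    # One K vector and one V vector per KV head, per token, per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem
    global_layers = n_layers - local_layers
    global_bytes = global_layers * context_tokens * per_token_per_layer
    if local_window is None:
        local_tokens = context_tokens
    else:
        local_tokens = min(context_tokens, local_window)
    local_bytes = local_layers * local_tokens * per_token_per_layer
    return global_bytes + local_bytes


ctx = 16 * 1024
full = kv_cache_bytes(ctx)
hybrid = kv_cache_bytes(ctx, local_layers=9, local_window=4096)
print(f"all layers global: {full / 2**30:.2f} GiB")
print(f"hybrid (assumed split): {hybrid / 2**30:.2f} GiB")
```

Swapping `bytes_per_elem` lets you compare cache precisions, and changing `local_layers` shows how strongly the estimate depends on the assumed layer split.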
Quantization is where llama.cpp's contribution earns its keep. The PR adds K-quants support all the way down to q2_K and exposes the model to the IQ-quants pipeline so future builds will pick up imatrix-based calibration automatically.
How does Laguna XS.2 compare to Qwen3.6-3B and Gemma-4-Tiny on tok/s?
The PR submitter posted side-by-side numbers on a single RTX 3060 12GB and a Ryzen 7 5800X CPU host. Independent reproductions from r/LocalLLaMA users are still rolling in but converge on the same ranking. Treat the headline tok/s figures as upstream-author-reported until a clean LocalLLaMA rig publishes a confirmation thread.
| Model | Quant | Tok/s (RTX 3060 12GB) | Time-to-first-token (ms) |
|---|---|---|---|
| Laguna XS.2 (3.1B) | q4_K_M | 105 | 168 |
| Laguna XS.2 (3.1B) | q5_K_M | 78 | 191 |
| Laguna XS.2 (3.1B) | fp16 | 52 | 224 |
| Qwen3.6-3B | q4_K_M | 85 | 198 |
| Qwen3.6-3B | q5_K_M | 64 | 226 |
| Gemma-4-Tiny (2.6B) | q4_K_M | 118 | 142 |
| Gemma-4-Tiny (2.6B) | fp16 | 61 | 197 |
Gemma-4-Tiny is still the absolute speed leader in this weight class (it is 500M parameters smaller and Google's tokenizer is tighter), but the quality gap on multi-turn reasoning is visible. Laguna's hybrid attention buys back about half of that quality gap while staying well within real-time interactive territory.
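For readers who want to check the relative gaps themselves, here is a minimal sketch that derives the Laguna vs Qwen percentages from the table above. The dictionary simply copies the reported figures; the function name is illustrative.

```python
# Decode tok/s and time-to-first-token (ms), copied from the table above.
results = {
    ("Laguna XS.2", "q4_K_M"): (105, 168),
    ("Laguna XS.2", "q5_K_M"): (78, 191),
    ("Qwen3.6-3B", "q4_K_M"): (85, 198),
    ("Qwen3.6-3B", "q5_K_M"): (64, 226),
}


def relative_gain(quant):
    """Return (% faster decode, % lower TTFT) for Laguna vs Qwen at one quant."""
    laguna_tps, laguna_ttft = results[("Laguna XS.2", quant)]
    qwen_tps, qwen_ttft = results[("Qwen3.6-3B", quant)]
    speedup_pct = (laguna_tps / qwen_tps - 1) * 100
    ttft_cut_pct = (1 - laguna_ttft / qwen_ttft) * 100
    return speedup_pct, ttft_cut_pct


for quant in ("q4_K_M", "q5_K_M"):
    speedup, ttft_cut = relative_gain(quant)
    print(f"{quant}: {speedup:.1f}% faster decode, {ttft_cut:.1f}% lower TTFT")
```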
Spec-delta table: Laguna XS.2 vs Qwen3.6-3B vs Gemma-4-Tiny
| Spec | Laguna XS.2 | Qwen3.6-3B | Gemma-4-Tiny |
|---|---|---|---|
| Parameters | 3.1B | 3.0B | 2.6B |
| Layers | 28 | 32 | 26 |
| Hidden dim | 2560 | 2560 | 2304 |
| Query heads / KV heads | 20 / 4 | 20 / 4 | 16 / 4 |
| Vocab size | 128,256 | 151,936 | 256,000 |
| Context length (native) | 64k | 32k | 32k |
| KV cache (16k context, fp16) | ~1.4 GB | ~1.7 GB | ~1.5 GB |
| License | Apache-2.0 | Apache-2.0 | Gemma TOU |
Two things to call out. First, Laguna's smaller KV cache at long contexts is the whole point of the hybrid attention design — a 64k context that would blow past 5 GB of KV on a vanilla 3B model fits in roughly 3.4 GB here. Second, the Gemma-4-Tiny license is Google's bespoke Gemma Terms of Use, not Apache-2.0; that matters for commercial deployments where redistribution rights are part of the business model.
Quantization matrix: VRAM, tok/s, quality loss
| Quant | VRAM @ 4k ctx | VRAM @ 16k ctx | Tok/s (RTX 3060 12GB) | Quality loss vs fp16 |
|---|---|---|---|---|
| q2_K | 1.2 GB | 2.4 GB | 132 | Severe — chat-only |
| q3_K_M | 1.7 GB | 2.9 GB | 124 | Noticeable on code |
| q4_K_M | 2.6 GB | 4.1 GB | 105 | Negligible |
| q5_K_M | 3.1 GB | 4.7 GB | 78 | None measurable |
| q6_K | 3.7 GB | 5.3 GB | 64 | None |
| q8_0 | 4.6 GB | 6.2 GB | 58 | None |
| fp16 | 6.8 GB | 8.5 GB | 52 | Reference |
q4_K_M is the sweet spot for this model. The PR thread reports that q3_K_M loses about 4 percentage points on HumanEval-Plus and noticeable accuracy on multi-turn instruction following, while q5_K_M is indistinguishable from fp16 on every benchmark the authors ran. If you have the VRAM, run q5_K_M; if you want maximum throughput, q4_K_M.
q2_K is included for the curious, but the PR author flagged it as a "novelty quant": fine for autocomplete-style use, but it breaks down quickly on anything requiring multi-step reasoning.
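The matrix lends itself to a small lookup helper. The sketch below picks the highest-fidelity quant that fits a given VRAM budget and throughput floor, using only the figures reported in the table. The 1 GB `reserve_gb` margin for the desktop and driver overhead is an assumption, not a number from the PR thread.

```python
# (quant, VRAM GB @ 4k, VRAM GB @ 16k, tok/s on RTX 3060 12GB), best fidelity first.
QUANTS = [
    ("fp16", 6.8, 8.5, 52),
    ("q8_0", 4.6, 6.2, 58),
    ("q6_K", 3.7, 5.3, 64),
    ("q5_K_M", 3.1, 4.7, 78),
    ("q4_K_M", 2.6, 4.1, 105),
    ("q3_K_M", 1.7, 2.9, 124),
    ("q2_K", 1.2, 2.4, 132),
]


def best_quant(vram_gb, context="16k", min_tok_s=0, reserve_gb=1.0):
    """Return the highest-fidelity quant that fits, or None if nothing does."""
    column = {"4k": 1, "16k": 2}[context]
    for row in QUANTS:
        fits = row[column] + reserve_gb <= vram_gb
        fast_enough = row[3] >= min_tok_s
        if fits and fast_enough:
            return row[0]
    return None


print(best_quant(12))                 # 12 GB card, 16k context
print(best_quant(8, min_tok_s=75))    # 8 GB card with a throughput floor
```

The tok/s column was measured on the RTX 3060 12GB, so the throughput filter is only a rough guide on other cards.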
Benchmark table: tok/s on three reference rigs
| Rig | Quant | Tok/s | Notes |
|---|---|---|---|
| RTX 3060 12GB | q4_K_M | 105 | Hybrid attention shines at 16k+ |
| RTX 3060 12GB | q5_K_M | 78 | Recommended for production |
| RTX 3060 12GB | fp16 | 52 | Use only if you need full fidelity |
| RTX 3060 Ti 8GB | q4_K_M | 122 | Faster cores, but no fp16 headroom |
| RTX 3060 Ti 8GB | q5_K_M | 91 | Tight VRAM at 16k context |
| Ryzen 7 5800X (CPU only, 8C/16T) | q4_K_M | 22 | Workable for short prompts |
| Ryzen 7 5800X (CPU only, 8C/16T) | q5_K_M | 17 | Slow prefill on 8k+ contexts |
The 3060 Ti's wider memory bus and higher SM count buy roughly 16 percent on generation when both cards fit the same quant. But the moment you want fp16, or you want to keep a 16k context plus a second model loaded for speculative decoding, the 12GB part becomes the practical winner. Per TechPowerUp's GPU specs the 12 GB card's memory bus is narrower (192-bit vs 256-bit), but raw bandwidth only helps when the weights and KV cache fit in VRAM at all, and that is where the extra 4 GB of capacity matters.
Prefill vs generation: where the hybrid attention actually wins
Hybrid attention is primarily a memory-bandwidth optimization. Standard grouped-query attention reads the full KV cache for each new token; Laguna's local-attention layers read only a fixed window. On short contexts the difference is invisible, because decode time in both architectures is dominated by streaming the model weights. On long contexts the local-attention layers cut KV reads dramatically, and that compounds into faster per-token decode.
The PR thread's measurements show:
- At 1k context, Laguna and a comparable dense 3B model are within 4 percent.
- At 8k context, Laguna pulls ahead by 11 percent.
- At 32k context, Laguna pulls ahead by 28 percent.
- At 64k context, Laguna pulls ahead by 41 percent.
Prefill (the bulk encode of the prompt before the first token) sees a similar but smaller curve: Laguna is around 18 percent faster at 64k prefill on the same hardware, because the local-attention layers also shrink the quadratic attention cost for the layers that use them.
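The bandwidth argument can be made concrete with a deliberately simplified model: if every generated token has to stream the weights plus the attended part of the KV cache from VRAM once, decode speed is bounded by bandwidth divided by bytes read. The sketch below assumes 360 GB/s for the RTX 3060 12GB (a spec-sheet figure, not a number from the PR thread), and `kv_read_fraction` is a hypothetical knob for how much of the cache the attention layers actually touch. It ignores compute, kernel overheads, and caching, so it illustrates the trend and is not a prediction.

```python
def decode_ceiling_tok_s(weight_gb, kv_gb, bandwidth_gb_s, kv_read_fraction=1.0):
    """Rough upper bound on decode tok/s for a purely bandwidth-bound model."""
    gb_read_per_token = weight_gb + kv_gb * kv_read_fraction
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH = 360.0   # GB/s, assumed spec-sheet value for the RTX 3060 12GB
WEIGHTS = 1.8       # GB, approximate q4_K_M file size quoted in this article

for kv_gb in (0.5, 2.0, 5.0):  # illustrative cache sizes
    full = decode_ceiling_tok_s(WEIGHTS, kv_gb, BANDWIDTH)
    partial = decode_ceiling_tok_s(WEIGHTS, kv_gb, BANDWIDTH, kv_read_fraction=0.67)
    print(f"KV {kv_gb} GB: full read {full:.0f} tok/s, two-thirds read {partial:.0f} tok/s")
```

With a small cache the two columns are nearly identical; the gap opens only once the KV cache rivals the weights in size, which is the pattern the measurements above describe.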
Context-length impact analysis (4k / 16k / 64k KV growth)
| Context | KV cache (fp16) | Total VRAM (q4_K_M weights + fp16 KV) | Decode tok/s (q4_K_M) |
|---|---|---|---|
| 4k | 0.32 GB | 2.6 GB | 105 |
| 8k | 0.65 GB | 2.9 GB | 103 |
| 16k | 1.30 GB | 4.1 GB | 98 |
| 32k | 2.60 GB | 5.6 GB | 89 |
| 64k | 5.20 GB | 8.4 GB | 71 |
The graceful tok/s degradation matters. A standard 3B model at 64k context typically drops to roughly half its 4k tok/s because the KV cache read for each new token grows linearly with context. Laguna's hybrid layers cushion that drop: at 64k it retains about 68 percent of its 4k throughput.
For RAG pipelines that stuff 32k-48k tokens of retrieved chunks into the prompt, this is the difference between "barely usable" and "interactive."
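The retention figure follows directly from the decode column of the table. A minimal sketch, using only the reported numbers:

```python
# Decode tok/s at q4_K_M by context length, copied from the table above.
decode_tok_s = {"4k": 105, "8k": 103, "16k": 98, "32k": 89, "64k": 71}

baseline = decode_tok_s["4k"]
retention = {ctx: tps / baseline for ctx, tps in decode_tok_s.items()}

for ctx, share in retention.items():
    print(f"{ctx}: {share:.0%} of 4k throughput")
```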
What workloads Laguna XS.2 is NOT good for
Per criticisms raised in the PR thread and by r/LocalLLaMA reviewers:
- Heavy code generation. The base model was not trained on a strong code corpus. HumanEval-Plus scores trail Qwen3.6-3B by roughly 9 percentage points, and the gap widens on multi-file repository understanding.
- Vision or multimodal tasks. Laguna XS.2 is text-only. There is no vision tower planned per the model card.
- Tool use with structured outputs. The model card explicitly warns that function-calling support is unfinetuned. JSON-mode outputs work but get sloppy past ~600 tokens, with quote-escape bugs and trailing-comma errors that break strict parsers.
- Math benchmarks past GSM8K. Performance on MATH and Olympiad-level benchmarks is poor. Stick to Qwen3.6-3B or step up to Qwen3.6-7B if math reasoning matters.
If your workload is conversational chat, RAG, summarization, or general-knowledge Q&A at modest context lengths, Laguna is competitive. If it leans toward code or tools, Qwen3.6-3B remains the better small-model pick.
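If you do route structured output through a model like this, it is worth validating every response with a strict parser before acting on it. The sketch below uses Python's standard `json` module; the `name` and `arguments` keys are a hypothetical tool-call schema chosen for illustration, not something the model card specifies.

```python
import json


def parse_tool_call(raw_text, required_keys=("name", "arguments")):
    """Return the parsed dict, or None if the text is not strict JSON
    containing every key in required_keys."""
    try:
        obj = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    if any(key not in obj for key in required_keys):
        return None
    return obj


good = '{"name": "search", "arguments": {"query": "llama.cpp"}}'
bad = '{"name": "search", "arguments": {"query": "llama.cpp"},}'  # trailing comma

print(parse_tool_call(good))
print(parse_tool_call(bad))
```

A caller can retry the generation or fall back to a larger model whenever the helper returns `None`, rather than passing malformed output downstream.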
Verdict matrix
| Pick this if you want... | Recommended model |
|---|---|
| Maximum context throughput on consumer hardware | Laguna XS.2 q4_K_M |
| Best small-model code generation | Qwen3.6-3B q4_K_M |
| Absolute fastest tok/s in the 3B class | Gemma-4-Tiny q4_K_M |
| Apache-2.0 license for commercial redistribution | Laguna XS.2 or Qwen3.6-3B |
| Long-context RAG with 32k+ tokens of retrieved chunks | Laguna XS.2 q5_K_M |
| Tool-use / function-calling pipelines | Qwen3.6-3B q5_K_M |
Real-world numbers from PR-thread reproductions
Five independent reproductions from r/LocalLLaMA threads in the 48 hours after merge:
| Reporter rig | Quant | Tok/s | Delta vs author |
|---|---|---|---|
| RTX 3060 12GB (PCIe 4.0, DDR4-3600) | q4_K_M | 102 | -3% |
| RTX 3060 12GB (PCIe 3.0, DDR4-3200) | q4_K_M | 96 | -9% |
| RTX 3060 Ti 8GB (PCIe 4.0, DDR5-6000) | q4_K_M | 124 | +2% |
| Ryzen 7 5800X CPU only (DDR4-3600) | q4_K_M | 24 | +9% |
| Apple M2 (CPU + Metal accel) | q4_K_M | 68 | n/a |
The PCIe 3.0 result is the interesting one: it suggests throughput is mildly sensitive to host-side bandwidth (PCIe generation, system memory speed) even though both 3060 12GB rigs share the same 192-bit VRAM bus. If you are pairing a 3060 with an older motherboard, expect about a 6-9 percent generation hit.
Common pitfalls when running Laguna XS.2 locally
- Wrong rope_base. The model needs RoPE base 500000, not the default 10000. If you see context-overflow gibberish past 4k tokens, check that your wrapper picked up the model card's metadata correctly. LM Studio and Ollama handle this automatically; raw llama-cli users have to pass `--rope-freq-base 500000`.
- Stale GGUF conversions. Convert from the post-merge weights, not the pre-PR snapshot. Two early uploads on TheBloke-style mirrors used the old conversion path and produced broken quants; symptoms include sudden repetition loops past 200 tokens.
- Wrong tokenizer. Use the SentencePiece tokenizer the PR ships, not the older HF tokenizer the original release used. They have different special-token IDs, and mixing them produces a "model knows English but answers in tokens" failure mode.
- KV cache fp16 vs q8 mixup. By default llama.cpp keeps the KV cache in fp16 (`-ctk f16 -ctv f16`). Setting `-ctk q8_0 -ctv q8_0` roughly halves the cache memory at a 1-2 percent quality cost; that lets a 16k context fit on an 8 GB card.
- Prefill timeouts on long prompts. A 16k-token prompt on CPU only can take 10-15 seconds before the first token. Wrappers with low default timeouts (Ollama's default API timeout is 30 seconds for the first response) may bail out mid-prefill on longer prompts or slower hosts. Either raise the timeout or pre-warm with a short prompt first.
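Several of the pitfalls above come down to passing the right flags. As a minimal sketch, the helper below assembles a llama-cli argument list with the RoPE base and KV cache types set explicitly. The flags are the ones named in the list plus the standard `-m` (model path) and `-c` (context size) options; the GGUF file name is hypothetical, and the helper only builds the command without running it.

```python
import shlex


def llama_cli_args(model_path, ctx_size=16384, rope_freq_base=500000,
                   cache_type="q8_0", binary="llama-cli"):
    """Build an argument list for llama-cli with explicit RoPE and KV settings."""
    return [
        binary,
        "-m", model_path,
        "-c", str(ctx_size),
        "--rope-freq-base", str(rope_freq_base),
        "-ctk", cache_type,   # K cache type
        "-ctv", cache_type,   # V cache type
    ]


# Hypothetical file name; substitute the path to your own conversion.
args = llama_cli_args("laguna-xs2-q4_k_m.gguf")
print(shlex.join(args))
```

The list can be handed to `subprocess.run(args, timeout=...)` with a timeout generous enough to cover long prefills. Depending on the llama.cpp build, a quantized V cache may also require flash attention to be enabled, so check `llama-cli --help` before relying on it.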
When NOT to upgrade your hardware for Laguna XS.2
If you are already running Qwen3.6-3B on a card you own, Laguna is not a forcing function. The model is faster and better at long context, but the use cases where that matters (>16k context, dense RAG) are a minority of small-model workloads. Save the upgrade money for a real step up — Qwen3.6-14B on a 16 GB card, or a 32B-class model on a 24 GB card — rather than churning hardware to chase a 20 percent throughput gain on a 3B.
Bottom line + the cheapest catalog GPU that runs Laguna XS.2 fluently
The cheapest catalog card that runs Laguna XS.2 q5_K_M at 16k context with headroom is the RTX 3060 12GB. Either the ZOTAC Gaming GeForce RTX 3060 Twin Edge OC 12GB or the MSI GeForce RTX 3060 Ventus 2X 12G will do the job — both routinely sit under $320 on the used market and both pull about 170 W under inference load, well within reach of a 550 W PSU.
For CPU-only experimentation, an AMD Ryzen 7 5800X holds up surprisingly well in this weight class — 22-26 tok/s at q4_K_M, with the prefill quirks already documented above. If you only need short-prompt chat, the CPU path is genuinely viable.
A budget Crucial BX500 1TB SATA SSD handles the model-weight loading hit without drama. Cold-load time from disk for the q4_K_M GGUF (about 1.8 GB) is a little over 3 seconds at the roughly 550 MB/s a SATA drive sustains on sequential reads; you will not notice it after the first prompt.
Citations and sources
- ggerganov/llama.cpp upstream repo and PR thread
- TechPowerUp — GeForce RTX 3060 12GB specifications
- Phoronix — AMD Ryzen 7 5800X review and CPU inference benchmarks
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported.
