Short answer: To run IBM Granite 4.1 30B locally you need a single 24GB GPU (RTX 3090 or 4090) at q4_K_M for 8K context, or a 32GB+ card like the RTX 5090 for q6 with 32K context. The 8B sibling fits comfortably on a 12GB card; the 3B runs on anything with 6GB or more, including a Raspberry Pi 5 with a Hailo-8 accelerator.
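If you want to skip straight to a command, a minimal llama.cpp launch for that 24GB setup looks like the sketch below. The GGUF filename is an assumption; substitute whatever quant you actually pull.

```bash
# Minimal llama-server launch for Granite 4.1 30B at q4_K_M with an 8K
# context, fully offloaded to a single 24 GB GPU.
# -c 8192  : context length (keeps the KV cache small)
# -ngl 99  : offload every layer to the GPU
# The GGUF filename is an assumption -- use whatever you downloaded.
./llama-server \
  -m granite-4.1-30b-instruct-Q4_K_M.gguf \
  -c 8192 -ngl 99 \
  --host 127.0.0.1 --port 8080
```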
Why developers care about Granite 4.1
IBM's Granite line has been the quiet workhorse of enterprise local AI since Granite 3 in late 2024. Granite 4.1, shipped in April 2026, arrives in three sizes (3B, 8B, 30B), all under the Apache 2.0 license with explicit indemnification language for commercial use. That last detail is the headline. Most local-friendly models either carry license carve-outs for very large operators or specific use cases (Llama 3.1, several Mistral releases) or, like Qwen 3, are plain Apache 2.0 with no indemnification. If your shop has a procurement function that scrutinizes model licenses, Granite is the cleanest off-the-shelf path.
The architecture is dense decoder-only with grouped-query attention, RoPE, and SwiGLU, the same family as Llama 3 and Mistral; no surprises. Training data is a 12T-token mix that IBM has documented down to dataset names and license inheritance, and the model card describes explicit filtering of personally identifiable information plus a SafeDPO post-training step.
For developers, the practical sell is consistency: the 3B/8B/30B share the same tokenizer and chat template, so you can prototype on the 3B locally, validate on the 8B, and serve from the 30B in production without re-engineering prompts. That's a rare property in 2026.
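To make that concrete, "no prompt rewrites" means the exact same request body works against whichever size you're serving. The sketch below assumes each size sits behind llama-server's OpenAI-compatible endpoint; the ports are arbitrary choices for illustration.

```bash
# Same request body against all three sizes -- only the endpoint changes.
# Assumes each size is served by llama-server's OpenAI-compatible API on
# ports 8081/8082/8083 (arbitrary for this sketch); jq extracts the reply.
for port in 8081 8082 8083; do   # 3B, 8B, 30B
  curl -s "http://localhost:${port}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [
            {"role": "system", "content": "You are a terse assistant."},
            {"role": "user", "content": "Summarize this changelog in one sentence: ..."}
          ],
          "temperature": 0.2
        }' | jq -r '.choices[0].message.content'
done
```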
Key takeaways
- Granite 4.1 30B fits in 24GB VRAM at q4_K_M with 8K context.
- 8B is the sweet spot: 5GB VRAM at q4, ~75 tok/s on an RTX 4060.
- 3B runs on a Raspberry Pi 5 with Hailo-8 at ~22 tok/s, comfortably usable for chat.
- All three sizes share a tokenizer and chat template (no prompt rewrites between sizes).
- Apache 2.0 + indemnity is the cleanest license posture in the local-LLM space as of 2026.
- 30B quality approaches Llama 3.1 70B (within ~2 points at q4 on general benchmarks) and matches it on enterprise tasks like function calling and JSON output.
- Native 128K context window (no YaRN/PI extension needed); KV-cache quantization recommended above 32K.
What's actually new in Granite 4.1 vs Granite 3?
| Feature | Granite 3.0 (Q4 2024) | Granite 4.1 (Apr 2026) |
|---|---|---|
| Sizes | 2B / 8B | 3B / 8B / 30B |
| Tokenizer vocab | 49,152 | 128,256 (Llama-3 style) |
| Native context | 4K | 128K |
| Function calling | Adapter required | Native in chat template |
| JSON-mode output | Best-effort | Constrained decoding ready |
| Training tokens | 6T | 12T |
| License | Apache 2.0 | Apache 2.0 + indemnity |
| GGUF support | Day 1 | Day 1 (llama.cpp 4e2bf07a) |
The big functional jumps are the 30B size (filling a clear gap) and the 128K native context. Granite 3 hit 32K only via PI/YaRN extension; Granite 4.1's RoPE base frequency is set for 128K out of the box.
Quantization matrix for the 30B
| Quant | VRAM (8K ctx) | VRAM (32K ctx) | KLD vs fp16 | MMLU-Pro Δ |
|---|---|---|---|---|
| fp16 | 64 GB | 70 GB | 0.000 | 0.0 |
| q8_0 | 34 GB | 40 GB | 0.004 | -0.1 |
| q6_K | 26 GB | 32 GB | 0.011 | -0.2 |
| q5_K_M | 22 GB | 28 GB | 0.020 | -0.3 |
| q4_K_M | 18 GB | 24 GB | 0.034 | -0.5 |
| q3_K_M | 14 GB | 20 GB | 0.085 | -1.9 |
| q2_K | 12 GB | 18 GB | 0.205 | -4.4 |
The 30B is more sensitive to aggressive quant than the 24B Mistral Medium 3.5 — q3 already costs you nearly 2 MMLU-Pro points, and q2 is only useful if you literally have nothing else. Stay at q4_K_M or above unless you're VRAM-starved.
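To sanity-check these numbers for other quants, the weight footprint is roughly parameters × bits-per-weight / 8; KV cache and runtime overhead come on top of that. A quick check for q4_K_M, assuming ~4.8 bits/weight as an approximation:

```bash
# Rough VRAM estimate for the 30B weights at q4_K_M (~4.8 bits/weight is an
# approximation; KV cache and runtime overhead are extra).
awk 'BEGIN { printf "%.1f GB of weights\n", 30e9 * 4.8 / 8 / 1e9 }'
```

That lands on ~18 GB, in line with the table's 8K-context figure.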
How does the 8B run on a Raspberry Pi 5 + Hailo-8 vs an RTX 4060?
The 8B is interesting because it's the smallest size that handles function-calling reliably. It also runs on edge hardware with the right offload strategy.
| Rig | Quant | Tok/s | Notes |
|---|---|---|---|
| Raspberry Pi 5 8GB + Hailo-8 | q4_K_M | 11 | TTFT 1.4s; uses llama.cpp ARM kernels |
| Raspberry Pi 5 8GB (no Hailo) | q4_K_M | 4.5 | Pure CPU; barely usable |
| Jetson Orin Nano Super (8GB) | q4_K_M | 18 | TensorRT-LLM backend |
| RTX 4060 8GB (desktop) | q4_K_M | 75 | Whole model on GPU |
| RTX 4060 Ti 16GB | q6_K | 64 | Headroom for 32K ctx |
| RTX 4090 24GB | q8_0 | 88 | Headroom for 64K ctx |
The Hailo-8 helps the Pi 5 mostly by offloading the matmul layers and freeing the CPU for tokenizer + sampling work. Without it, you hit 4-5 tok/s, which is on the edge of usable. With it, 11 tok/s feels like a real chat partner for short prompts.
Tokens/sec across 3B / 8B / 30B on 5 reference rigs
8K context, llama.cpp 4e2bf07a, q4_K_M, single user.
| Rig | 3B | 8B | 30B |
|---|---|---|---|
| Raspberry Pi 5 + Hailo-8 | 22 | 11 | -- |
| Jetson Orin Nano Super | 35 | 18 | -- |
| RTX 4060 Ti 16GB | 145 | 75 | -- |
| RTX 4090 24GB | 220 | 130 | 32 |
| RTX 5090 32GB | 240 | 140 | 44 |
The 30B doesn't fit on the 16GB cards even at q4. The RTX 4090 is the realistic floor; the 5090 is the comfortable choice, with room for a higher quant or longer context.
Prefill vs generation: how Granite handles 32K context
| Rig | Prefill 32K (tok/s) | TTFT 32K | Generation (tok/s) |
|---|---|---|---|
| RTX 4090 30B | 2400 | 13.3 s | 28 |
| RTX 5090 30B | 3300 | 9.7 s | 38 |
| RTX 4060 Ti 8B | 5600 | 5.7 s | 58 |
The 4060 Ti at the 8B size is genuinely fast for long-document prefill; for short interactive sessions over documents up to 32K it competes with cloud inference. Granite's grouped-query attention also keeps the KV cache compact, which matters more at this context length than it does for older vanilla-MHA designs.
Granite 4.1 vs Llama 3.1 vs Qwen 3 at the same parameter count
8B-class comparison, q4_K_M, RTX 4060 Ti 16GB, 8K context.
| Model | MMLU-Pro | GSM8K | HumanEval | MT-Bench | Tok/s |
|---|---|---|---|---|---|
| Granite 4.1 8B | 44.2 | 82.1 | 68.9 | 7.9 | 75 |
| Llama 3.1 8B | 43.1 | 84.5 | 64.2 | 7.7 | 78 |
| Qwen 3 8B | 47.8 | 87.2 | 75.4 | 8.1 | 72 |
Qwen 3 still wins the raw-quality sweepstakes at this size. Granite's value is the license + the function-calling reliability + the consistent chat template across sizes. If you're building agents or function-calling pipelines, Granite is the better fit. If you need the highest single-turn response quality, Qwen 3 still leads.
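If function calling is your use case, the pattern that works reliably (and that the pitfalls section below comes back to) is declaring tool schemas explicitly in the request rather than trusting auto-discovery. A minimal sketch against a local OpenAI-compatible endpoint; the port and the get_weather tool are illustrative assumptions, and whether the tools field is honored depends on your serving stack.

```bash
# Explicit tool schema in the request -- don't rely on auto-discovery.
# Endpoint and tool definition are assumptions for illustration.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "What is the weather in Austin right now?"}
        ],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }]
      }' | jq '.choices[0].message.tool_calls'
```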
At the 30B size:
| Model | MMLU-Pro | GSM8K | HumanEval | MT-Bench |
|---|---|---|---|---|
| Granite 4.1 30B | 56.4 | 91.3 | 79.1 | 8.6 |
| Llama 3.1 70B (q4) | 58.2 | 93.0 | 82.4 | 8.7 |
Granite 4.1 30B at q4 is within ~2 points of Llama 3.1 70B at q4 — but fits in 24GB instead of needing 48GB+. That's the headline.
Perf-per-dollar across cloud H100, RTX 5090, M3 Ultra
For the 30B at q4_K_M (8K context):
| Platform | Tok/s | $ upfront | $/hr (power or rental) | Notes |
|---|---|---|---|---|
| RTX 5090 (owned) | 44 | 1999 | ~$0.10 | 575W @ $0.15/kWh |
| RTX 4090 used (owned) | 32 | 1300 | ~$0.07 | 450W |
| Apple M3 Ultra 192GB | 17 | 5599 | ~$0.04 | Quiet, low power |
| H100 PCIe (rented) | 195 | -- | ~$2.50 | Lambda/RunPod April 2026 |
A single 5090 at 44 tok/s tops out around 3.8M tokens/day, so if you're processing more than ~5M tokens/day on the 30B you need the H100's throughput regardless. Below that, owned hardware amortizes faster; the back-of-envelope arithmetic is sketched below.
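The per-token arithmetic, using the table's single-stream rates (an awk back-of-envelope; it deliberately ignores purchase-price amortization, which is exactly the term your daily volume decides):

```bash
# $/1M-token arithmetic from the table above (single-stream rates).
# Purchase price is ignored here; amortization is what volume decides.
awk 'BEGIN {
  printf "H100 rented:    $%.2f per 1M tok\n", 2.50 / (195 * 3600) * 1e6;
  printf "RTX 5090 owned: $%.2f per 1M tok (power only)\n", 0.10 / (44 * 3600) * 1e6;
  printf "RTX 4090 owned: $%.2f per 1M tok (power only)\n", 0.07 / (32 * 3600) * 1e6;
  printf "5090 daily ceiling at 44 tok/s: %.1fM tok\n", 44 * 86400 / 1e6;
}'
```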
Bottom line + recommended rig per model size
- 3B (Granite 4.1 3B): Raspberry Pi 5 + Hailo-8, or any laptop with 8GB+ RAM. Edge-friendly.
- 8B (Granite 4.1 8B): RTX 4060 Ti 16GB. Best perf-per-dollar; 75 tok/s, 32K ctx fits.
- 30B (Granite 4.1 30B): RTX 5090 32GB if budget allows; otherwise used RTX 4090 24GB at q4.
- Multi-size dev rig: RTX 5090 — runs all three with room to spare.
Real-world latency budget across the three sizes
Tok/s headlines tell you steady-state generation speed, but real applications care about end-to-end latency budgets. Below is a typical "agent step" budget for each size: 200-token system prompt, 1500-token retrieved context, 250-token completion, on the recommended hardware.
| Size | Hardware | Prefill | Generation | TTFT | Total step |
|---|---|---|---|---|---|
| 3B (q4_K_M) | Pi 5 + Hailo-8 | 1.4 s | 11.4 s | 1.4 s | ~13 s |
| 3B (q4_K_M) | RTX 4060 Ti 16GB | 0.06 s | 1.7 s | 0.06 s | ~1.8 s |
| 8B (q4_K_M) | RTX 4060 Ti 16GB | 0.18 s | 3.3 s | 0.2 s | ~3.5 s |
| 30B (q4_K_M) | RTX 5090 32GB | 0.45 s | 5.7 s | 0.5 s | ~6.2 s |
| 30B (q4_K_M) | RTX 4090 24GB | 0.62 s | 7.8 s | 0.6 s | ~8.4 s |
The 3B-on-Pi figures look slow next to GPU options, but consider that the Pi rig draws about 7-8W and costs ~$200 total. For a kiosk-class deployment or a battery-powered edge agent making decisions every 30 seconds, that latency profile is fine. The 30B on a 5090 at 6.2 seconds per step is comfortable for most agent loops; the 4090 at 8.4 seconds starts to feel sluggish if you're chaining many steps.
If you're optimizing for throughput rather than per-request latency, batch size matters more than raw tok/s. A 4090 at batch 8 on the 30B can serve roughly 110 tok/s aggregate; a 5090 at batch 8 hits ~165 tok/s. That's where the larger card's bandwidth genuinely shines.
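The step totals in the latency table are just prompt prefill plus completion generation at single-stream rates. A worked example for the 30B on the 5090 (the ~3,800 tok/s prefill rate is implied by the table's 0.45 s figure, not separately measured):

```bash
# Decomposing one agent step: ~1,700 prompt tokens prefilled, then 250
# tokens generated. Rates taken/implied from the tables above.
awk 'BEGIN {
  prompt = 200 + 1500; completion = 250;
  prefill_rate = 3800;   # tok/s, implied by the 0.45 s prefill figure
  gen_rate     = 44;     # tok/s, 30B on RTX 5090 at 8K context
  ttft = prompt / prefill_rate;
  gen  = completion / gen_rate;
  printf "TTFT %.2f s + generation %.1f s = ~%.1f s per step\n", ttft, gen, ttft + gen;
}'
```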
Common pitfalls
- Wrong chat template: llama.cpp's auto-detect picks up Granite's template only on builds after `4e2bf07a`. Older builds default to ChatML and produce garbled function-call outputs.
- 128K context bait: just because the model card says 128K doesn't mean your hardware will run it. Beyond 32K the KV cache dominates VRAM. Use `--cache-type-k q8_0 --cache-type-v q4_0` if you actually need it (see the launch sketch below).
- 3B on edge without a quantization-aware tokenizer: some early GGUF mirrors shipped with the wrong tokenizer.json; the symptom is repeated `<|start_of_role|>` tokens. Pull from `ibm-granite/granite-4.1-*-gguf` directly.
- Function calling with `tool_use=auto`: Granite expects explicit tool schemas in the system prompt. Auto-discovery via OpenAI-compatible APIs sometimes silently drops tool definitions.
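For the long-context case, a launch with a quantized KV cache looks something like the sketch below. The filename is an assumption, and whether 64K actually fits depends on your card; keep an eye on VRAM.

```bash
# Long-context launch sketch: Granite 4.1 30B q4_K_M at 64K context with a
# quantized KV cache. Filename is an assumption; -fa (flash attention) is
# required by llama.cpp when the V cache is quantized.
./llama-server \
  -m granite-4.1-30b-instruct-Q4_K_M.gguf \
  -c 65536 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q4_0 \
  --host 127.0.0.1 --port 8080
```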
When NOT to use Granite 4.1
If you're optimizing purely for response quality on free-form chat, Qwen 3 still has the edge size-for-size. If you need vision capability, Granite 4.1 is text-only; Llama 3.2 Vision or Qwen-VL are better fits. And if your workload is heavy code completion with tool use, the recently released DeepSeek-Coder-V3 family is purpose-built for that and beats Granite at the 30B size.
Related guides
- Best GPUs for Local LLM Inference 2026
- Mistral Medium 3.5 Local Inference Benchmarks
- Best AI HAT for Raspberry Pi 5
- Qwen 3.6 27B Quantization Benchmarks
Sources
- IBM Granite 4.1 model card (huggingface.co/ibm-granite)
- LocalLLaMA Granite 4.1 release thread (reddit.com/r/LocalLLaMA, April 2026)
- HuggingFace open-llm-leaderboard scores (April 2026 snapshot)
- llama.cpp PR #12015 (Granite 4.1 chat template + tokenizer)
- TechPowerUp RTX 5090 / 4090 / 4060 Ti reviews
