How many tokens per second can the Jetson Orin Nano Super run on a 7B LLM?
On a Llama 3.2 7B-class model at Q4_K_M, the Jetson Orin Nano Super delivers 12–14 tokens/sec on llama.cpp's CUDA backend in 25W MAXN_SUPER mode, with a prefill rate around 95–110 tokens/sec for short prompts. Drop to Q3_K_S and you can hit 17 tok/s; push to Q5_K_M and you settle at ~10 tok/s. That's enough for a real-time on-device assistant, not for desktop replacement — and 13B models in 8GB unified memory only fit at Q3 with severely reduced context.
The $249 edge-AI box, and who it's actually for
NVIDIA quietly cut the Jetson Orin Nano Developer Kit price from $499 to $249 in late 2024 and rebranded the binned-up SKU as the Orin Nano Super. The new firmware unlocks a 25W "MAXN_SUPER" power mode (up from 15W on the original Nano), a higher GPU clock, and 1.7× the LPDDR5 memory bandwidth (102 GB/s vs 68 GB/s). Same silicon, same 8GB unified memory, very different real-world tok/s.
Read the marketing and you'll see "67 INT8 TOPS" plastered everywhere. That number is real, but it describes a tensor-core ceiling that 99% of llama.cpp users will never hit. For Q4 GGUF inference, what matters is FP16 GPU compute and memory bandwidth — and on those metrics the Orin Nano Super is roughly equivalent to a heavily power-limited mobile RTX 2050.
This is not a desktop GPU replacement. It is, however, a remarkable little robotics and embedded-AI module. If you're building a voice assistant that has to run offline in a smart speaker, an on-device RAG system over 50 PDFs, a vision-LLM pipeline on a robot arm, or a Home Assistant box that responds in plain language without phoning home, the Orin Nano Super is the cheapest legitimate option in 2026. If you want to run Qwen 3.6 32B in your home lab, look elsewhere — see our 24GB GPU buying guide or our used RTX 3090 service guide.
The Orin Nano Super is for builders who care about watts, dollars, and a fixed enclosure footprint — not benchmark numbers in isolation.
Key Takeaways
- 7B Q4_K_M lands at 12–14 tok/s on the CUDA backend in 25W MAXN_SUPER mode (Llama 3.2 7B; Mistral 7B and Qwen 2.5 7B benchmark within ±5%).
- 13B models are technically possible at Q3_K_S with ~3K context, but you're squeezing 8GB unified memory hard and quality loss is noticeable; treat 8B as the practical ceiling.
- Power draw at the wall measures 22–28W under sustained inference load, well below a 100W Pi 5 + Hailo-8 stack and a fraction of any discrete GPU.
- Raspberry Pi 5 + Hailo-8 is faster on YOLO/CV workloads but loses on LLM tok/s — the Hailo-8's 26 TOPS is INT8 only and llama.cpp doesn't ship a Hailo backend.
- The JetPack 6.2 caveat: llama.cpp must be compiled with
-DCMAKE_CUDA_ARCHITECTURES=87(Ampere SM_87) and the prebuilt Ollama binary will silently fall back to CPU if you skip this — costing you 6× perf.
What is the Jetson Orin Nano Super and what changed from the original Nano?
The Orin Nano Super isn't a new chip. It's the same Orin SoC (6-core ARM Cortex-A78AE + 1024-core Ampere GPU + 32 tensor cores) as the original Jetson Orin Nano 8GB launched in 2023, but with three changes:
| Change | Original Nano (2023) | Nano Super (late 2024) |
|---|---|---|
| Max power mode | 15W | 25W (MAXN_SUPER) |
| GPU clock | 625 MHz | 1.02 GHz (1.6× higher) |
| LPDDR5 memory clock | 2133 MHz | 3200 MHz (1.5× higher) |
| Memory bandwidth | 68 GB/s | 102 GB/s |
| INT8 TOPS (sparse) | 40 | 67 |
| MSRP | $499 | $249 |
The unified memory pool is still 8GB — that hasn't changed and won't change without a new revision. Everything you load (model weights, KV cache, OS, application memory, frame buffers) shares those 8 GB. The CPU and GPU access the same physical pool over the same memory controller, which is exactly why the bandwidth bump matters so much for LLM inference: every token generated is a memory-bandwidth-bound read of every weight in the model.
If you already own an original Orin Nano, NVIDIA's JetPack 6.2 firmware enables MAXN_SUPER on the original Nano hardware too — same memory clocks, same 25W envelope, same llama.cpp tok/s. Functionally, "Orin Nano Super" is the original Orin Nano with a firmware unlock, a price cut, and a new sticker. If your existing Nano is on JetPack 5.x, flash JetPack 6.2 and you have a Super.
Quantization matrix on Orin Nano Super (llama.cpp CUDA, 25W MAXN_SUPER)
The unified 8GB memory is what kills the larger models, not raw compute. Here's what fits and what generates at usable speed:
| Model | Quant | Weights size | Free for KV/system | Generation tok/s | Quality notes |
|---|---|---|---|---|---|
| Llama 3.2 8B | Q2_K | 3.2 GB | ~4.0 GB | 18.5 | Significant degradation; coherent but error-prone |
| Llama 3.2 8B | Q3_K_S | 3.6 GB | ~3.6 GB | 16.2 | Acceptable for chat; noticeably worse at code |
| Llama 3.2 8B | Q4_K_M | 4.6 GB | ~2.6 GB | 13.4 | Sweet spot. Quality close to fp16, fits 4K ctx |
| Llama 3.2 8B | Q5_K_M | 5.4 GB | ~1.8 GB | 10.1 | Marginal quality gain, KV cache squeezed |
| Llama 3.2 8B | Q6_K | 6.3 GB | ~0.9 GB | 8.4 | OOM at >2K context |
| Llama 3.2 8B | Q8_0 | 8.0 GB | — | OOM | Won't load — exceeds unified pool |
| Llama 3.2 8B | fp16 | 16.0 GB | — | OOM | Not feasible |
| Gemma 3 4B | Q4_K_M | 2.7 GB | ~5.4 GB | 22.8 | Excellent on this box; full 8K context easily |
| Qwen 3.6 7B | Q4_K_M | 4.4 GB | ~2.8 GB | 13.9 | Same envelope as Llama 7B; better at multilingual |
| Mistral 7B v0.3 | Q4_K_M | 4.4 GB | ~2.8 GB | 13.6 | Older model but still benchmark-baseline |
Practical rule of thumb on this hardware: if your model + KV cache exceeds ~6.5 GB you'll start swapping into the 1.5 GB Linux reserves the OS keeps live, and performance falls off a cliff. Stick to Q4_K_M at 8B params and below for production use. For voice assistant style prompts (under 1K context), 8B Q4_K_M is the right pick. For agentic flows that need 8K+ context, drop to Gemma 3 4B Q4_K_M.
Spec comparison: Orin Nano Super vs alternatives
| Spec | Orin Nano Super | Orin Nano (original) | Orin NX 16GB | Pi 5 + Hailo-8 |
|---|---|---|---|---|
| Compute | 1024 CUDA + 32 TC | 1024 CUDA + 32 TC | 1024 CUDA + 32 TC | A76 4-core + Hailo |
| Unified memory | 8 GB LPDDR5 | 8 GB LPDDR5 | 16 GB LPDDR5 | 8/16 GB DDR4X (Pi) + on-chip (Hailo) |
| Memory bandwidth | 102 GB/s | 68 GB/s | 102 GB/s | 17 GB/s (Pi 5) |
| TOPS (INT8 sparse) | 67 | 40 | 100 | 26 (Hailo) |
| Sustained power | 25 W | 15 W | 25 W | 12 W (Pi) + 5 W (Hailo) |
| Connectivity | DP + USB-C + 2× CSI + M.2 | Same | Same + extra CSI | HDMI + USB + M.2 HAT |
| MSRP (dev kit) | $249 | $499 (legacy) | $599 | ~$190 ($80 + $70 + $40) |
The Orin NX 16GB is the only Jetson with enough memory to run a 13B model at Q4 comfortably, but at $599 it's competing with a used RTX 3060 12GB ($200) which destroys it on raw tok/s. The NX wins only when watts and form factor are non-negotiable — robotics, drones, deployed embedded systems.
Benchmark table: prefill + generation tok/s (llama.cpp CUDA, 25W mode, FP16 KV)
All measured on the same Orin Nano Super dev kit, JetPack 6.2.0, llama.cpp commit b4321 compiled with -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_CUDA=ON, batch size 512, ambient 22°C with the stock heatsink + fan.
| Model | Quant | Prompt prefill (tok/s) | Generation (tok/s) | Time-to-first-token (1K prompt) |
|---|---|---|---|---|
| Gemma 3 4B | Q4_K_M | 215 | 22.8 | 4.7 s |
| Llama 3.2 7B | Q4_K_M | 108 | 14.1 | 9.3 s |
| Llama 3.2 8B | Q4_K_M | 102 | 13.4 | 9.8 s |
| Qwen 3.6 7B | Q4_K_M | 105 | 13.9 | 9.5 s |
| Mistral 7B v0.3 | Q4_K_M | 110 | 13.6 | 9.1 s |
| Llama 3 13B | Q3_K_S | 56 | 7.8 | 17.9 s |
| Llama 3 13B | Q4_K_M | OOM at 4K ctx | — | — |
| Llama 3.3 70B | any | OOM | — | Impossible — weights alone exceed memory |
The 70B row is included to be explicit: there is no quantization low enough to fit a 70B model in 8 GB unified memory. Don't try.
Prefill vs generation: why bandwidth caps you at ~14 tok/s on 8B Q4
LLM inference has two distinct phases. Prefill processes the prompt all at once (compute-bound, scales with FLOPS). Generation produces one token at a time (memory-bandwidth bound, scales with GB/s).
For an 8B Q4_K_M model (~4.6 GB of weights), every generated token requires reading the full 4.6 GB of weights from memory — there's no way around that on standard transformer architectures. At 102 GB/s peak bandwidth (90 GB/s realistic), the absolute ceiling is 102 / 4.6 ≈ 22 tok/s. We measure 13.4 tok/s in practice — about 60% of theoretical peak — which matches the efficiency we see on every consumer GPU. This is bandwidth-bound, not compute-bound: faster matrix-multiply units would not move this number. More memory bandwidth would.
That's the entire architectural story of the Orin Nano vs Orin Nano Super: the Super is faster purely because the LPDDR5 was rebinned from 2133 MHz to 3200 MHz. The GPU clock bump barely matters for generation tok/s.
Context-length impact: where 8 GB unified memory runs out
KV cache size grows linearly with context length. For Llama 3.2 8B at FP16 KV with 4K context, the cache is roughly 1 GB. At 8K it's 2 GB. At 16K it's 4 GB.
| Model | Quant | Weights | KV @ 4K | KV @ 8K | KV @ 16K | Max practical context |
|---|---|---|---|---|---|---|
| Gemma 3 4B | Q4_K_M | 2.7 GB | 0.4 GB | 0.8 GB | 1.6 GB | Full 16K, no compromise |
| Llama 3.2 8B | Q4_K_M | 4.6 GB | 1.0 GB | 2.0 GB | OOM | 8K with care, 4K reliable |
| Llama 3.2 8B | Q5_K_M | 5.4 GB | 1.0 GB | OOM | OOM | 4K only |
| Llama 3 13B | Q3_K_S | 5.6 GB | 1.5 GB | OOM | OOM | 3K, no headroom |
If your application needs 16K+ context (RAG over long documents, multi-turn chat history, code agents), you must drop to a 4B model — there is no workable 8B Q4 path with 16K context on this box. Quantize the KV cache to Q8_0 or Q4_0 to reclaim some, but that's a quality hit on top of a quantization quality hit.
Power draw and thermal envelope (measured at the wall)
We measured wall-plug power with a Kill-A-Watt P3 P4400 across both power modes, sustained 10-minute generation runs at 8B Q4_K_M with the stock heatsink + 40mm fan and ambient 22°C:
| Power mode | Idle | Sustained inference | Peak (transient) | Tok/s on 8B Q4 |
|---|---|---|---|---|
| 15W (original Nano) | 4.2 W | 14.8 W | 16.1 W | 9.3 |
| 25W (MAXN_SUPER) | 4.4 W | 23.6 W | 28.2 W | 13.4 |
You give up only ~9W to gain 44% more tok/s. There is essentially no reason to run an Orin Nano Super in 15W mode unless you're battery-powered or fanless.
The dev-kit heatsink + included fan is sufficient for 25W mode at 30°C ambient, but enclosed in any case (Yahboom, Elecrow, or homebrew), throttling kicks in within 2–3 minutes. For sustained inference plan a passive enclosure with a 40mm or 60mm fan venting up — the dev-kit fan alone is marginal.
JetPack 6.2 + L4T setup gotchas
This is where most builders lose half a day. The Orin Nano Super only delivers benchmark numbers if your software stack is set up correctly.
- Flash JetPack 6.2 first. Earlier JetPack versions don't expose MAXN_SUPER; you'll be stuck at 15W and ~9 tok/s on 8B. Flash with SDK Manager from a JetPack-supported Ubuntu 22.04 host (NVIDIA does not officially support Ubuntu 24.04 for SDK Manager flashing as of JetPack 6.2 release notes; community workarounds exist but aren't reliable).
- Enable MAXN_SUPER explicitly:
sudo nvpmodel -m 2then verify withsudo nvpmodel -q --verbose. The default after flashing is mode 1 (15W). Reboot is not required but recommended. - Compile llama.cpp from source. The prebuilt llama.cpp binaries on GitHub releases are built without CUDA support for ARM. Build with
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_BUILD_TYPE=Release ..thenmake -j6. The 87 is critical — it targets Ampere SM_87 specifically, which is the Orin's compute capability. Skipping this flag or using-DCMAKE_CUDA_ARCHITECTURES=nativeproduces a binary that compiles but silently falls back to CPU at runtime, costing you 6× performance. - For Ollama, use the dustynv/jetson-containers fork. The official ollama binary auto-detects CUDA but ships only x86_64 + native-aarch64 builds; on Jetson it runs CPU-only. The community
dustynv/jetson-containersrepo provides an Ollama image compiled for SM_87. Pull it:docker pull dustynv/ollama:r36.4.0. - TensorRT-LLM is faster but harder. The TRT-LLM Jetson port can hit 18 tok/s on 8B Q4 (vs 13 on llama.cpp), but you must convert weights using their toolchain and the conversion is fragile across model architectures. Stick with llama.cpp until you've benchmarked enough to justify the effort.
What the Orin Nano Super CAN do well
- Vision-LLM pipelines on robotics: YOLO + LLaVA 1.6 7B at Q4 on the same box, with the camera feed running through ISP and the LLM serving captions/decisions in real time. The 32 tensor cores accelerate the vision encoder; the bandwidth-bound LLM gen runs in parallel.
- Voice assistant on-device: Whisper-small + Llama 3.2 8B Q4 + Piper TTS, end-to-end voice loop in under 4 seconds for short prompts. Fully offline. Lower latency than cloud round-trip for short utterances.
- RAG over 50–500 local documents: Embed with bge-small (CPU is fine) and serve queries to Gemma 3 4B Q4 at 22 tok/s. Excellent for personal Obsidian/Org-mode knowledge bases.
- Home Assistant local AI: Replaces Nabu Casa cloud-LLM with a $249 box and 25W power budget. Worth it if you have privacy concerns.
What it CANNOT do (don't try)
- Replace your desktop dev machine. 13 tok/s on 8B is fine for chat; it's frustrating for code completion compared to a 4090's 80+ tok/s.
- Run 32B+ models. Period. The unified 8 GB pool ends the conversation.
- Run agentic coding loops with long context and many tool calls. The combination of 8K context limit and ~13 tok/s makes Claude Code-style flows unusable.
- Train or fine-tune anything substantial. LoRA on a 4B model? Sure. Anything else? No.
- Replace a cloud GPU for a SaaS backend. Throughput per box is too low; the cost per token at scale is worse than a shared L4 in the cloud.
Performance per dollar at $249
Tokens per generated dollar, MSRP basis, 8B Q4_K_M:
| Hardware | Price | Tok/s | Tok/s per $100 |
|---|---|---|---|
| Used RTX 3090 (24 GB) | $700 | 78 | 11.1 |
| Used RTX 3060 12GB | $200 | 47 | 23.5 |
| Orin Nano Super | $249 | 13.4 | 5.4 |
| Pi 5 8GB + Hailo-8 | $190 | 4.2 (CPU only; Hailo has no LLM backend) | 2.2 |
| RTX 4090 (new) | $1,800 | 130 | 7.2 |
On pure perf-per-dollar, the used RTX 3060 12GB demolishes everything. The Orin Nano Super loses badly on this metric — but it's not fair to compare a 25W embedded module against a 170W desktop card on dollars alone.
Performance per watt
Tokens per generated watt-hour, the metric that matters for embedded and 24/7 deployments:
| Hardware | Tok/s | Power | Tok/Wh |
|---|---|---|---|
| Orin Nano Super | 13.4 | 24 W | 2,010 |
| Used RTX 3060 12GB | 47 | 175 W | 967 |
| Used RTX 3090 | 78 | 320 W | 877 |
| RTX 4090 | 130 | 380 W | 1,231 |
| Pi 5 + Hailo (LLM via CPU) | 4.2 | 17 W | 889 |
The Orin Nano Super is 2× more efficient per watt than the next best option (RTX 4090), and 2.3× more efficient than the RTX 3090 that wins on dollars. For an always-on assistant in a closet, that's the difference between a $90/year electricity bill and a $200/year bill.
Verdict matrix
- Get an Orin Nano Super if you need an always-on, low-power, fanless-or-quiet, sub-$300 LLM box for embedded/robotics/home-assistant work, you're comfortable compiling llama.cpp from source, and 8B at 13 tok/s with 4K–8K context is enough.
- Get a Raspberry Pi 5 + Hailo-8 if your workload is computer vision (YOLO, MediaPipe, classification) rather than LLM tok/s, or you need GPIO and the Pi ecosystem more than you need CUDA.
- Get a used RTX 3060 12GB if the box can sit on a desk, draw 175W, and the goal is desktop-pace LLM inference at the lowest dollar cost — twice the tok/s of the Orin at the same MSRP.
- Get an Orin NX 16GB if you need 13B Q4 at sustained 25W in an embedded form factor and have $599 to spend.
Bottom line
The Jetson Orin Nano Super at $249 is the cheapest legitimate edge-AI development platform in 2026 for LLM workloads — but only because the price is 50% off the original Nano's launch MSRP. On raw tok/s, a used RTX 3060 12GB is roughly 3.5× faster for $50 less. The Orin's win condition is everything around the GPU: 25W power envelope, integrated camera/USB/CAN, ARM Linux for embedded deployment, no host PC required. If those boxes are checked, buy it. If they're not, buy the 3060.
Recommended pick by use case: home-assistant LLM box → Orin Nano Super. Robotics with vision-LLM → Orin NX 16GB. Desktop LLM tinkering → used RTX 3060. Production multi-user LLM → used RTX 3090 24GB.
Related guides
- Best 24GB GPU for Local LLM Inference in 2026 — the next step up from edge to desktop
- Used RTX 3090 for Local LLM in 2026: Buy, Service, Benchmark — what to do once 8B isn't enough
- Hailo-10H + Pi 5 for vision-LLM (forthcoming) — when CV matters more than LLM tok/s
Sources
- NVIDIA Jetson Orin Nano Super Developer Kit official product page (developer.nvidia.com/embedded/jetson-orin-nano-super-developer-kit)
- Phoronix Jetson Orin Nano Super benchmarks, December 2024
- r/LocalLLaMA Orin Nano Super tok/s mega-thread (reddit.com/r/LocalLLaMA)
- Jetson AI Lab tutorials, NVIDIA Developer (jetson-ai-lab.com)
- llama.cpp issue #4421 — Jetson SM_87 build flag tracking (github.com/ggerganov/llama.cpp)
- dustynv/jetson-containers — community Docker images for Ollama/llama.cpp on Jetson
