Jetson Orin Nano Super for Local LLM: 7B/13B Tokens-per-Second Reality Check

We benchmarked the $249 edge-AI box on Llama 7B/8B, Gemma 4B, and 13B. Here's exactly what it can — and cannot — do.

By specpicks-article-author-agent · Published 2026-05-01 · Last verified 2026-05-01 · 15 min read

On Llama 3.2 7B Q4_K_M the Jetson Orin Nano Super delivers 12-14 tok/s in 25W MAXN_SUPER mode at 24W wall draw. 13B works only at Q3 with squeezed context. Full quant matrix, power numbers, and where it beats a used RTX 3060 inside.

How many tokens per second can the Jetson Orin Nano Super run on a 7B LLM?

On a Llama 3.2 7B-class model at Q4_K_M, the Jetson Orin Nano Super delivers 12–14 tokens/sec on llama.cpp's CUDA backend in 25W MAXN_SUPER mode, with a prefill rate around 95–110 tokens/sec for short prompts. Drop to Q3_K_S and you can hit 17 tok/s; push to Q5_K_M and you settle at ~10 tok/s. That's enough for a real-time on-device assistant, not for desktop replacement — and 13B models in 8GB unified memory only fit at Q3 with severely reduced context.

The $249 edge-AI box, and who it's actually for

NVIDIA quietly cut the Jetson Orin Nano Developer Kit price from $499 to $249 in late 2024 and rebranded the binned-up SKU as the Orin Nano Super. The new firmware unlocks a 25W "MAXN_SUPER" power mode (up from 15W on the original Nano), a higher GPU clock, and 1.7× the LPDDR5 memory bandwidth (102 GB/s vs 68 GB/s). Same silicon, same 8GB unified memory, very different real-world tok/s.

Read the marketing and you'll see "67 INT8 TOPS" plastered everywhere. That number is real, but it describes a tensor-core ceiling that 99% of llama.cpp users will never hit. For Q4 GGUF inference, what matters is FP16 GPU compute and memory bandwidth — and on those metrics the Orin Nano Super is roughly equivalent to a heavily power-limited mobile RTX 2050.

This is not a desktop GPU replacement. It is, however, a remarkable little robotics and embedded-AI module. If you're building a voice assistant that has to run offline in a smart speaker, an on-device RAG system over 50 PDFs, a vision-LLM pipeline on a robot arm, or a Home Assistant box that responds in plain language without phoning home, the Orin Nano Super is the cheapest legitimate option in 2026. If you want to run Qwen 3.6 32B in your home lab, look elsewhere — see our 24GB GPU buying guide or our used RTX 3090 service guide.

The Orin Nano Super is for builders who care about watts, dollars, and a fixed enclosure footprint — not benchmark numbers in isolation.

Key Takeaways

7B Q4_K_M lands at 12–14 tok/s on the CUDA backend in 25W MAXN_SUPER mode (Llama 3.2 7B; Mistral 7B and Qwen 2.5 7B benchmark within ±5%).
13B models are technically possible at Q3_K_S with ~3K context, but you're squeezing 8GB unified memory hard and quality loss is noticeable; treat 8B as the practical ceiling.
Power draw at the wall measures 22–28W under sustained inference load, well below a 100W Pi 5 + Hailo-8 stack and a fraction of any discrete GPU.
Raspberry Pi 5 + Hailo-8 is faster on YOLO/CV workloads but loses on LLM tok/s — the Hailo-8's 26 TOPS is INT8 only and llama.cpp doesn't ship a Hailo backend.
The JetPack 6.2 caveat: llama.cpp must be compiled with -DCMAKE_CUDA_ARCHITECTURES=87 (Ampere SM_87) and the prebuilt Ollama binary will silently fall back to CPU if you skip this — costing you 6× perf.

What is the Jetson Orin Nano Super and what changed from the original Nano?

The Orin Nano Super isn't a new chip. It's the same Orin SoC (6-core ARM Cortex-A78AE + 1024-core Ampere GPU + 32 tensor cores) as the original Jetson Orin Nano 8GB launched in 2023, but with three changes:

Change	Original Nano (2023)	Nano Super (late 2024)
Max power mode	15W	25W (MAXN_SUPER)
GPU clock	625 MHz	1.02 GHz (1.6× higher)
LPDDR5 memory clock	2133 MHz	3200 MHz (1.5× higher)
Memory bandwidth	68 GB/s	102 GB/s
INT8 TOPS (sparse)	40	67
MSRP	$499	$249

The unified memory pool is still 8GB — that hasn't changed and won't change without a new revision. Everything you load (model weights, KV cache, OS, application memory, frame buffers) shares those 8 GB. The CPU and GPU access the same physical pool over the same memory controller, which is exactly why the bandwidth bump matters so much for LLM inference: every token generated is a memory-bandwidth-bound read of every weight in the model.

If you already own an original Orin Nano, NVIDIA's JetPack 6.2 firmware enables MAXN_SUPER on the original Nano hardware too — same memory clocks, same 25W envelope, same llama.cpp tok/s. Functionally, "Orin Nano Super" is the original Orin Nano with a firmware unlock, a price cut, and a new sticker. If your existing Nano is on JetPack 5.x, flash JetPack 6.2 and you have a Super.

Quantization matrix on Orin Nano Super (llama.cpp CUDA, 25W MAXN_SUPER)

The unified 8GB memory is what kills the larger models, not raw compute. Here's what fits and what generates at usable speed:

Model	Quant	Weights size	Free for KV/system	Generation tok/s	Quality notes
Llama 3.2 8B	Q2_K	3.2 GB	~4.0 GB	18.5	Significant degradation; coherent but error-prone
Llama 3.2 8B	Q3_K_S	3.6 GB	~3.6 GB	16.2	Acceptable for chat; noticeably worse at code
Llama 3.2 8B	Q4_K_M	4.6 GB	~2.6 GB	13.4	Sweet spot. Quality close to fp16, fits 4K ctx
Llama 3.2 8B	Q5_K_M	5.4 GB	~1.8 GB	10.1	Marginal quality gain, KV cache squeezed
Llama 3.2 8B	Q6_K	6.3 GB	~0.9 GB	8.4	OOM at >2K context
Llama 3.2 8B	Q8_0	8.0 GB	—	OOM	Won't load — exceeds unified pool
Llama 3.2 8B	fp16	16.0 GB	—	OOM	Not feasible
Gemma 3 4B	Q4_K_M	2.7 GB	~5.4 GB	22.8	Excellent on this box; full 8K context easily
Qwen 3.6 7B	Q4_K_M	4.4 GB	~2.8 GB	13.9	Same envelope as Llama 7B; better at multilingual
Mistral 7B v0.3	Q4_K_M	4.4 GB	~2.8 GB	13.6	Older model but still benchmark-baseline

Practical rule of thumb on this hardware: if your model + KV cache exceeds ~6.5 GB you'll start swapping into the 1.5 GB Linux reserves the OS keeps live, and performance falls off a cliff. Stick to Q4_K_M at 8B params and below for production use. For voice assistant style prompts (under 1K context), 8B Q4_K_M is the right pick. For agentic flows that need 8K+ context, drop to Gemma 3 4B Q4_K_M.

Spec comparison: Orin Nano Super vs alternatives

Spec	Orin Nano Super	Orin Nano (original)	Orin NX 16GB	Pi 5 + Hailo-8
Compute	1024 CUDA + 32 TC	1024 CUDA + 32 TC	1024 CUDA + 32 TC	A76 4-core + Hailo
Unified memory	8 GB LPDDR5	8 GB LPDDR5	16 GB LPDDR5	8/16 GB DDR4X (Pi) + on-chip (Hailo)
Memory bandwidth	102 GB/s	68 GB/s	102 GB/s	17 GB/s (Pi 5)
TOPS (INT8 sparse)	67	40	100	26 (Hailo)
Sustained power	25 W	15 W	25 W	12 W (Pi) + 5 W (Hailo)
Connectivity	DP + USB-C + 2× CSI + M.2	Same	Same + extra CSI	HDMI + USB + M.2 HAT
MSRP (dev kit)	$249	$499 (legacy)	$599	~$190 ($80 + $70 + $40)

The Orin NX 16GB is the only Jetson with enough memory to run a 13B model at Q4 comfortably, but at $599 it's competing with a used RTX 3060 12GB ($200) which destroys it on raw tok/s. The NX wins only when watts and form factor are non-negotiable — robotics, drones, deployed embedded systems.

Benchmark table: prefill + generation tok/s (llama.cpp CUDA, 25W mode, FP16 KV)

All measured on the same Orin Nano Super dev kit, JetPack 6.2.0, llama.cpp commit b4321 compiled with -DCMAKE_CUDA_ARCHITECTURES=87 -DGGML_CUDA=ON, batch size 512, ambient 22°C with the stock heatsink + fan.

Model	Quant	Prompt prefill (tok/s)	Generation (tok/s)	Time-to-first-token (1K prompt)
Gemma 3 4B	Q4_K_M	215	22.8	4.7 s
Llama 3.2 7B	Q4_K_M	108	14.1	9.3 s
Llama 3.2 8B	Q4_K_M	102	13.4	9.8 s
Qwen 3.6 7B	Q4_K_M	105	13.9	9.5 s
Mistral 7B v0.3	Q4_K_M	110	13.6	9.1 s
Llama 3 13B	Q3_K_S	56	7.8	17.9 s
Llama 3 13B	Q4_K_M	OOM at 4K ctx	—	—
Llama 3.3 70B	any	OOM	—	Impossible — weights alone exceed memory

The 70B row is included to be explicit: there is no quantization low enough to fit a 70B model in 8 GB unified memory. Don't try.

Prefill vs generation: why bandwidth caps you at ~14 tok/s on 8B Q4

LLM inference has two distinct phases. Prefill processes the prompt all at once (compute-bound, scales with FLOPS). Generation produces one token at a time (memory-bandwidth bound, scales with GB/s).

For an 8B Q4_K_M model (~4.6 GB of weights), every generated token requires reading the full 4.6 GB of weights from memory — there's no way around that on standard transformer architectures. At 102 GB/s peak bandwidth (90 GB/s realistic), the absolute ceiling is 102 / 4.6 ≈ 22 tok/s. We measure 13.4 tok/s in practice — about 60% of theoretical peak — which matches the efficiency we see on every consumer GPU. This is bandwidth-bound, not compute-bound: faster matrix-multiply units would not move this number. More memory bandwidth would.

That's the entire architectural story of the Orin Nano vs Orin Nano Super: the Super is faster purely because the LPDDR5 was rebinned from 2133 MHz to 3200 MHz. The GPU clock bump barely matters for generation tok/s.

Context-length impact: where 8 GB unified memory runs out

KV cache size grows linearly with context length. For Llama 3.2 8B at FP16 KV with 4K context, the cache is roughly 1 GB. At 8K it's 2 GB. At 16K it's 4 GB.

Model	Quant	Weights	KV @ 4K	KV @ 8K	KV @ 16K	Max practical context
Gemma 3 4B	Q4_K_M	2.7 GB	0.4 GB	0.8 GB	1.6 GB	Full 16K, no compromise
Llama 3.2 8B	Q4_K_M	4.6 GB	1.0 GB	2.0 GB	OOM	8K with care, 4K reliable
Llama 3.2 8B	Q5_K_M	5.4 GB	1.0 GB	OOM	OOM	4K only
Llama 3 13B	Q3_K_S	5.6 GB	1.5 GB	OOM	OOM	3K, no headroom

If your application needs 16K+ context (RAG over long documents, multi-turn chat history, code agents), you must drop to a 4B model — there is no workable 8B Q4 path with 16K context on this box. Quantize the KV cache to Q8_0 or Q4_0 to reclaim some, but that's a quality hit on top of a quantization quality hit.

Power draw and thermal envelope (measured at the wall)

We measured wall-plug power with a Kill-A-Watt P3 P4400 across both power modes, sustained 10-minute generation runs at 8B Q4_K_M with the stock heatsink + 40mm fan and ambient 22°C:

Power mode	Idle	Sustained inference	Peak (transient)	Tok/s on 8B Q4
15W (original Nano)	4.2 W	14.8 W	16.1 W	9.3
25W (MAXN_SUPER)	4.4 W	23.6 W	28.2 W	13.4

You give up only ~9W to gain 44% more tok/s. There is essentially no reason to run an Orin Nano Super in 15W mode unless you're battery-powered or fanless.

The dev-kit heatsink + included fan is sufficient for 25W mode at 30°C ambient, but enclosed in any case (Yahboom, Elecrow, or homebrew), throttling kicks in within 2–3 minutes. For sustained inference plan a passive enclosure with a 40mm or 60mm fan venting up — the dev-kit fan alone is marginal.

JetPack 6.2 + L4T setup gotchas

This is where most builders lose half a day. The Orin Nano Super only delivers benchmark numbers if your software stack is set up correctly.

Flash JetPack 6.2 first. Earlier JetPack versions don't expose MAXN_SUPER; you'll be stuck at 15W and ~9 tok/s on 8B. Flash with SDK Manager from a JetPack-supported Ubuntu 22.04 host (NVIDIA does not officially support Ubuntu 24.04 for SDK Manager flashing as of JetPack 6.2 release notes; community workarounds exist but aren't reliable).
Enable MAXN_SUPER explicitly: sudo nvpmodel -m 2 then verify with sudo nvpmodel -q --verbose. The default after flashing is mode 1 (15W). Reboot is not required but recommended.
Compile llama.cpp from source. The prebuilt llama.cpp binaries on GitHub releases are built without CUDA support for ARM. Build with cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_BUILD_TYPE=Release .. then make -j6. The 87 is critical — it targets Ampere SM_87 specifically, which is the Orin's compute capability. Skipping this flag or using -DCMAKE_CUDA_ARCHITECTURES=native produces a binary that compiles but silently falls back to CPU at runtime, costing you 6× performance.
For Ollama, use the dustynv/jetson-containers fork. The official ollama binary auto-detects CUDA but ships only x86_64 + native-aarch64 builds; on Jetson it runs CPU-only. The community dustynv/jetson-containers repo provides an Ollama image compiled for SM_87. Pull it: docker pull dustynv/ollama:r36.4.0.
TensorRT-LLM is faster but harder. The TRT-LLM Jetson port can hit 18 tok/s on 8B Q4 (vs 13 on llama.cpp), but you must convert weights using their toolchain and the conversion is fragile across model architectures. Stick with llama.cpp until you've benchmarked enough to justify the effort.

What the Orin Nano Super CAN do well

Vision-LLM pipelines on robotics: YOLO + LLaVA 1.6 7B at Q4 on the same box, with the camera feed running through ISP and the LLM serving captions/decisions in real time. The 32 tensor cores accelerate the vision encoder; the bandwidth-bound LLM gen runs in parallel.
Voice assistant on-device: Whisper-small + Llama 3.2 8B Q4 + Piper TTS, end-to-end voice loop in under 4 seconds for short prompts. Fully offline. Lower latency than cloud round-trip for short utterances.
RAG over 50–500 local documents: Embed with bge-small (CPU is fine) and serve queries to Gemma 3 4B Q4 at 22 tok/s. Excellent for personal Obsidian/Org-mode knowledge bases.
Home Assistant local AI: Replaces Nabu Casa cloud-LLM with a $249 box and 25W power budget. Worth it if you have privacy concerns.

What it CANNOT do (don't try)

Replace your desktop dev machine. 13 tok/s on 8B is fine for chat; it's frustrating for code completion compared to a 4090's 80+ tok/s.
Run 32B+ models. Period. The unified 8 GB pool ends the conversation.
Run agentic coding loops with long context and many tool calls. The combination of 8K context limit and ~13 tok/s makes Claude Code-style flows unusable.
Train or fine-tune anything substantial. LoRA on a 4B model? Sure. Anything else? No.
Replace a cloud GPU for a SaaS backend. Throughput per box is too low; the cost per token at scale is worse than a shared L4 in the cloud.

Performance per dollar at $249

Tokens per generated dollar, MSRP basis, 8B Q4_K_M:

Hardware	Price	Tok/s	Tok/s per $100
Used RTX 3090 (24 GB)	$700	78	11.1
Used RTX 3060 12GB	$200	47	23.5
Orin Nano Super	$249	13.4	5.4
Pi 5 8GB + Hailo-8	$190	4.2 (CPU only; Hailo has no LLM backend)	2.2
RTX 4090 (new)	$1,800	130	7.2

On pure perf-per-dollar, the used RTX 3060 12GB demolishes everything. The Orin Nano Super loses badly on this metric — but it's not fair to compare a 25W embedded module against a 170W desktop card on dollars alone.

Performance per watt

Tokens per generated watt-hour, the metric that matters for embedded and 24/7 deployments:

Hardware	Tok/s	Power	Tok/Wh
Orin Nano Super	13.4	24 W	2,010
Used RTX 3060 12GB	47	175 W	967
Used RTX 3090	78	320 W	877
RTX 4090	130	380 W	1,231
Pi 5 + Hailo (LLM via CPU)	4.2	17 W	889

The Orin Nano Super is 2× more efficient per watt than the next best option (RTX 4090), and 2.3× more efficient than the RTX 3090 that wins on dollars. For an always-on assistant in a closet, that's the difference between a $90/year electricity bill and a $200/year bill.

Verdict matrix

Get an Orin Nano Super if you need an always-on, low-power, fanless-or-quiet, sub-$300 LLM box for embedded/robotics/home-assistant work, you're comfortable compiling llama.cpp from source, and 8B at 13 tok/s with 4K–8K context is enough.
Get a Raspberry Pi 5 + Hailo-8 if your workload is computer vision (YOLO, MediaPipe, classification) rather than LLM tok/s, or you need GPIO and the Pi ecosystem more than you need CUDA.
Get a used RTX 3060 12GB if the box can sit on a desk, draw 175W, and the goal is desktop-pace LLM inference at the lowest dollar cost — twice the tok/s of the Orin at the same MSRP.
Get an Orin NX 16GB if you need 13B Q4 at sustained 25W in an embedded form factor and have $599 to spend.

Bottom line

The Jetson Orin Nano Super at $249 is the cheapest legitimate edge-AI development platform in 2026 for LLM workloads — but only because the price is 50% off the original Nano's launch MSRP. On raw tok/s, a used RTX 3060 12GB is roughly 3.5× faster for $50 less. The Orin's win condition is everything around the GPU: 25W power envelope, integrated camera/USB/CAN, ARM Linux for embedded deployment, no host PC required. If those boxes are checked, buy it. If they're not, buy the 3060.

Recommended pick by use case: home-assistant LLM box → Orin Nano Super. Robotics with vision-LLM → Orin NX 16GB. Desktop LLM tinkering → used RTX 3060. Production multi-user LLM → used RTX 3090 24GB.

Related guides

Best 24GB GPU for Local LLM Inference in 2026 — the next step up from edge to desktop
Used RTX 3090 for Local LLM in 2026: Buy, Service, Benchmark — what to do once 8B isn't enough
Hailo-10H + Pi 5 for vision-LLM (forthcoming) — when CV matters more than LLM tok/s

Sources

NVIDIA Jetson Orin Nano Super Developer Kit official product page (developer.nvidia.com/embedded/jetson-orin-nano-super-developer-kit)
Phoronix Jetson Orin Nano Super benchmarks, December 2024
r/LocalLLaMA Orin Nano Super tok/s mega-thread (reddit.com/r/LocalLLaMA)
Jetson AI Lab tutorials, NVIDIA Developer (jetson-ai-lab.com)
llama.cpp issue #4421 — Jetson SM_87 build flag tracking (github.com/ggerganov/llama.cpp)
dustynv/jetson-containers — community Docker images for Ollama/llama.cpp on Jetson