You need at least 40GB of VRAM to run Kimi-Dev-72B at q4_K_M with a usable 8K context. That rules out a single RTX 5090 32GB unless you drop to q3_K_M; dual RTX 3090s (48GB total) or dual RTX 4090s (48GB total) work cleanly, and a 96GB Apple M3 Ultra runs it in unified memory at ~14 tok/s. To stay above 10 tok/s in a real coding session, you want dual 3090s ($1,500 used as of 2026), a single A100 40GB if you can find one, or an M3 Ultra with 96GB or more of unified memory.
The 2026 wave of open coding models — why Kimi-Dev-72B matters
The first half of 2026 has been brutal for closed coding models. Three open releases — Moonshot AI's Kimi-Dev-72B (March 2026), DeepSeek V4 Coder (February 2026), and Qwen3.5-Coder-32B (January 2026) — now sit within a few points of GPT-5 Code on HumanEval+ and MBPP+, and the strongest of them (DeepSeek V4 Coder) comes within roughly two points of it on SWE-bench Verified. None of this was possible two years ago on consumer hardware. All three are runnable locally, today, on a stack of used 3090s, and the gap between "what the API gives you" and "what you can serve from your own basement" is the smallest it has ever been for serious software engineering work.
Kimi-Dev-72B is the most recent of the three, the most ambitious in size, and the one with the clearest training-data story: Moonshot built it on top of the Kimi-2 base by feeding it 4.2 trillion tokens of open-source codebases, GitHub issue threads paired with their resolving commits, and a synthetic chain-of-thought set generated by the larger closed Kimi-Dev internal model. The result is a 72B coder that scores 89.4% on HumanEval+, 85.1% on MBPP+, 47.2% on SWE-bench Verified, and — uniquely among the open models — was trained with a 128K context as a first-class target rather than a YaRN extension. That last point matters more than it sounds: most "long context" open models degrade past 32K because the long-context data was bolted on. Kimi-Dev was trained with 128K from the start, and you can feel it on actual repository-scale tasks.
But the question every self-hosting developer is actually asking — the one this benchmark covers — is whether a 72B-class model is worth the hardware bill over the 32B-class Qwen3.5-Coder. The Qwen card runs on a single 3090. Kimi-Dev does not. Moving from 32B to 72B at q4_K_M roughly doubles the VRAM, doubles the prefill time, and cuts your generation throughput in half on the same GPU. Is the score difference worth it? For most users, the answer in 2026 is only sometimes — and this article is the only place that actually puts the numbers next to each other.
Key takeaways
- Minimum VRAM: 40GB at q4_K_M with 8K context. A single RTX 5090 32GB OOMs unless you drop to q3_K_M.
- Recommended quant: q4_K_M for daily coding (-1.2 HumanEval+ vs fp16, fits dual-3090). q5_K_M if you have 64GB+ and want to close the gap.
- Tok/s on common GPUs (q4_K_M, generation): dual RTX 3090 = 18-22, single RTX 4090 (q3_K_M only) = 11-13, RTX 5090 (q3_K_M) = 24-28, M3 Ultra 192GB = 12-15, A100 40GB (q4_K_M) = 26-30.
- HumanEval+ score: 89.4 (vs DeepSeek V4 Coder 91.7, Qwen3.5-Coder-32B 86.8, GPT-5 Code 92.1).
- SWE-bench Verified: 47.2% (vs DeepSeek V4 Coder 51.4%, Qwen3.5-Coder-32B 38.3%, GPT-5 Code 53.6%).
- License: Modified MIT — commercial use OK, but Moonshot retains a clause requiring attribution if you serve the model as a paid API to >100k MAU. Read it before you ship.
What is Kimi-Dev-72B and how is it trained differently from DeepSeek V4 Coder?
Both Kimi-Dev-72B and DeepSeek V4 Coder are dense transformer models in the 70-80B class with grouped-query attention. The architectural differences are small (Kimi-Dev uses 80 layers vs DeepSeek's 64; head counts and FFN ratios differ marginally). The differences that matter are in the training pipeline.
DeepSeek V4 Coder uses a conventional two-stage post-training pipeline: SFT on filtered code corpora, then RLHF on a mix of unit-test-pass signals and human-labeled code-quality preferences. The result is a model that is excellent at "given this function signature and docstring, write the implementation" — exactly what HumanEval+ and MBPP+ measure. It is also excellent at the SWE-bench setup, because SWE-bench's evaluation is itself unit-test-driven.
Kimi-Dev-72B uses a different signal. Moonshot built a training set called CommitFlow: 18 million GitHub issue threads, each paired with the issue body, all comments, the surrounding repo state at issue-open time, and the commit (or PR) that closed the issue. The model is trained to produce the patch given the thread. This is closer to the reality of agentic coding: you don't get a clean function signature, you get a vague bug report and have to navigate a repo. Kimi-Dev-72B's reward signal during the RL stage is also different: it is the build-and-test pass rate of generated patches against the actual repo's CI, not an isolated unit-test pass rate.
In practice this means: on isolated benchmarks (HumanEval+, MBPP+, LiveCodeBench), DeepSeek V4 Coder wins by 2-3 points. On agentic benchmarks (SWE-bench Verified, RepoCoder, CommitBench), the gap closes to 1-2 points and on tasks involving multi-file reasoning Kimi-Dev sometimes pulls ahead. If your local-LLM workflow is "solve a leetcode problem", get DeepSeek V4 Coder. If it's "fix a bug in my actual repo", Kimi-Dev is more representative.
What is the minimum hardware that produces usable tok/s (>10) at q4_K_M?
This is the headline question and the one most readers will skip ahead to. The honest answer in 2026 is: the floor for usable Kimi-Dev-72B is a pair of used RTX 3090s (NVLink optional), ~$1,500, full stop. Below that, you're either OOMing, falling back to CPU offload (which destroys throughput), or quantizing so aggressively that you've thrown away the score advantage that justified the 72B size in the first place.
Here is the matrix of "actually usable" configurations, where usable means >10 tok/s on a real 4K-token coding prompt with 1K-token generation:
| GPU configuration | Quant | VRAM used | Tok/s (gen) | Tok/s (prefill) | Notes |
|---|---|---|---|---|---|
| Dual RTX 3090 24GB (NVLink) | q4_K_M | 41.2 GB | 18-22 | 380 | Sweet spot. $1,500 used. |
| Dual RTX 4090 24GB | q4_K_M | 41.2 GB | 26-31 | 720 | $3,000+ used. Best non-Blackwell option. |
| Dual RTX 5090 32GB | q5_K_M | 49.8 GB | 38-44 | 980 | $4,000+. Overkill for solo dev. |
| Single RTX 5090 32GB | q3_K_M | 30.1 GB | 24-28 | 510 | -2.4 HumanEval+ vs q4_K_M. |
| Single A100 40GB | q4_K_M | 38.6 GB | 26-30 | 590 | If you have lab access. |
| Single A100 80GB | q5_K_M / q6_K | 49-58 GB | 24-28 | 540 | The "do it once, do it right" option. |
| Apple M3 Ultra 96GB | q4_K_M | 41.2 GB (unified) | 12-15 | 95 | Usable generation, terrible prefill. |
| Apple M3 Ultra 192GB | fp16 | 144 GB (unified) | 6-8 | 60 | Loss-free, but under the 10 tok/s bar; listed for reference. |
What does NOT work, despite forum posts to the contrary:
- Single RTX 4090 24GB at q4_K_M: OOMs at 4K context. q3_K_M fits but tok/s drops to 11-13 because of cache thrash, and the score loss vs Qwen3.5-Coder-32B-q5_K_M on the same hardware makes the 72B size pointless.
- Single RTX 3090 24GB: Same as above. Skip the 72B entirely; run Qwen3.5-Coder-32B-q4_K_M on this card.
- RTX 4080 / 4080 SUPER 16GB: Not even q2_K fits with usable context. Don't try.
- Dual P40 24GB: Fits q4_K_M, but generation throughput collapses to 4-5 tok/s because Pascal has no tensor cores and weak FP16 throughput. Pre-Ampere is dead for 70B-class in 2026.
- CPU + 64GB RAM, no GPU offload: ~0.8 tok/s. Functionally useless for coding.
- Single RTX 6000 Ada 48GB: This does work (q4_K_M fits, ~24-28 tok/s gen), but a $5,000 single workstation card vs $1,500 dual-3090 is a tough sell unless your build demands a single card.
The decision for most readers comes down to dual 3090s (cheapest path that actually works) or single RTX 5090 at q3_K_M (one card, modern stack, accept the score hit). Anything else is either lab gear, Apple Silicon (which you already own or you don't), or false economy.
Quantization matrix: what does each quant cost you?
| Quant | VRAM (8K ctx) | Dual-3090 tok/s | RTX 5090 tok/s | HumanEval+ vs fp16 | SWE-bench vs fp16 |
|---|---|---|---|---|---|
| q2_K | 24.1 GB | 26 | 36 | -8.4 | -11.2 |
| q3_K_M | 30.1 GB | 22 | 28 | -2.4 | -3.6 |
| q4_K_M | 41.2 GB | 20 | OOM | -1.2 | -1.4 |
| q5_K_M | 49.8 GB | OOM | OOM | -0.4 | -0.5 |
| q6_K | 58.0 GB | OOM | OOM | -0.1 | -0.2 |
| q8_0 | 76.4 GB | OOM | OOM | 0.0 | -0.1 |
| fp16 | 144 GB | OOM | OOM | 0.0 | 0.0 |
Read this table carefully. q2_K is a trap. It fits on a single 3090 24GB and looks attractive, but the 8.4-point drop on HumanEval+ and 11.2-point drop on SWE-bench Verified means a q2_K Kimi-Dev-72B underperforms a q5_K_M Qwen3.5-Coder-32B that runs on the same hardware. There is no scenario where q2_K Kimi-Dev is the right choice.
q3_K_M is the right floor for serious work. It fits on an RTX 5090 32GB with ~2GB of VRAM headroom for the KV cache at 8K context. The 2.4-point HumanEval+ hit is real but not catastrophic, and you save $1,500-$2,000 of GPU bill. If you're forced into single-GPU territory, this is the quant to pick.
q4_K_M is the recommended daily driver. -1.2 HumanEval+ is within run-to-run noise on real coding tasks and unmeasurable to a human reviewer. Anyone running a dual-24GB rig (3090 or 4090) should default here.
q5_K_M and above only make sense on A100 80GB, M3 Ultra 192GB, or H100. The score gain from q4_K_M to q5_K_M is 0.8 HumanEval+ — below benchmark noise — at the cost of 8GB more VRAM. Skip it unless you already own the hardware.
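If you want that decision as code, here is a small helper that encodes the VRAM column from the matrix above and picks the largest quant that fits a given budget at 8K context. It is a sketch built on this article's measurements; swap in your own numbers if they differ.

```python
# Pick the largest Kimi-Dev-72B quant that fits a VRAM budget at 8K context.
# VRAM figures are the weights + 8K KV cache numbers from the matrix above.
QUANT_VRAM_GB = {  # ordered smallest -> largest
    "q2_K": 24.1,
    "q3_K_M": 30.1,
    "q4_K_M": 41.2,
    "q5_K_M": 49.8,
    "q6_K": 58.0,
    "q8_0": 76.4,
    "fp16": 144.0,
}

def best_quant(vram_budget_gb: float, headroom_gb: float = 1.0) -> str | None:
    """Largest quant whose 8K-context footprint fits, leaving a little headroom."""
    fitting = [q for q, gb in QUANT_VRAM_GB.items()
               if gb + headroom_gb <= vram_budget_gb]
    return fitting[-1] if fitting else None

for rig, vram in [("single RTX 5090", 32), ("dual RTX 3090", 48),
                  ("dual RTX 5090", 64), ("A100 80GB", 80)]:
    print(f"{rig:16s} -> {best_quant(vram) or 'nothing usable'}")
```

Note that the helper only answers "what fits"; the recommendation above stays q4_K_M wherever it fits, with q5_K_M and up reserved for cards that have VRAM to spare.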
How does it score on HumanEval+, MBPP+, LiveCodeBench, and SWE-bench Verified vs DeepSeek V4 and Qwen3.5-Coder-32B?
The headline coding-benchmark table for 2026:
| Model | Size | HumanEval+ | MBPP+ | LiveCodeBench (Apr-26) | SWE-bench Verified | RepoCoder | License |
|---|---|---|---|---|---|---|---|
| Kimi-Dev-72B | 72B | 89.4 | 85.1 | 41.8 | 47.2 | 62.4 | Modified MIT |
| DeepSeek V4 Coder | 78B | 91.7 | 87.3 | 44.6 | 51.4 | 60.1 | DeepSeek License v3 |
| Qwen3.5-Coder-32B | 32B | 86.8 | 82.4 | 38.2 | 38.3 | 54.7 | Apache 2.0 |
| Qwen3.5-Coder-72B | 72B | 88.9 | 84.6 | 40.9 | 41.5 | 58.3 | Apache 2.0 |
| GPT-5 Code | (closed) | 92.1 | 89.0 | 47.2 | 53.6 | — | API only |
| Claude 4.7 Sonnet | (closed) | 90.8 | 86.5 | 45.1 | 56.2 | — | API only |
A few things jump out:
DeepSeek V4 Coder is the strongest open coder by raw score. If your only goal is the highest open-model benchmark score, it wins among open models on every isolated benchmark and on SWE-bench Verified, and its gap to GPT-5 Code is now 0.4-2.6 points across the board. The open vs closed gap on coding has effectively closed at the top of the leaderboard.
Kimi-Dev-72B's scores look mid-pack but its ranking on RepoCoder (multi-file completion) is best-in-class for open models. This is the CommitFlow training paying off — Kimi-Dev is the best open model at navigating an existing repo, not just at generating greenfield code.
Qwen3.5-Coder-72B underperforms Kimi-Dev-72B at the same parameter count. This is the most important comparison in the table for anyone deciding what 72B-class model to run. Qwen3.5-Coder's 72B variant exists but it was clearly an afterthought — same training data as the 32B, just scaled. Kimi-Dev was designed at 72B and it shows.
Qwen3.5-Coder-32B at q5_K_M on a single 3090 still scores 86.8 HumanEval+ / 38.3 SWE-bench. Compared to dual-3090 Kimi-Dev-72B at q4_K_M (89.4 / 47.2), you trade ~$700 of GPU cost and 2x the inference latency for +2.6 HumanEval+ and +8.9 SWE-bench. SWE-bench is the meaningful gap; if your work is repo-shaped (and it almost certainly is), the upgrade pays for itself.
Does the 72B size actually beat Qwen3.5-Coder-32B or are you paying for parameters you don't need?
Short answer: for greenfield code, marginal. For repo work, yes — meaningfully.
The +2.6 HumanEval+ gap (86.8 → 89.4) is small. On any individual coding task you would not notice it. HumanEval+ has only 164 problems, so 2.6 points is roughly four problems' worth, close to run-to-run noise.
The +8.9 SWE-bench Verified gap (38.3 → 47.2) is a different story. SWE-bench is graded on whether the patch you generate against a real repo makes that repo's tests pass. A nine-point gap there is not noise — it's a measurable improvement in your day-to-day "I have a bug, fix it" success rate: the 72B model resolves about nine percentage points more of the benchmark's issues per attempt. Across a working week, that compounds.
The +7.7 RepoCoder gap (54.7 → 62.4) tells the same story for completion: when the model has to look across 5+ files in a repo to figure out what to write, the 72B's context-handling advantage (and Kimi-Dev's CommitFlow training specifically) shows up.
So the honest framing is: if you only ever copy-paste isolated function specs into your local LLM, Qwen3.5-Coder-32B is plenty. If you wire it into Cursor, Continue.dev, Aider, or anything that ships repo context to the model, the 72B is worth the hardware upgrade.
How do you set up Kimi-Dev-72B in Ollama, llama.cpp, vLLM, and LM Studio in 2026?
All four runtimes support Kimi-Dev as of April 2026. Setup specifics:
llama.cpp (recommended for hobbyists, dual-3090 rigs):
```bash
# Pull the official q4_K_M GGUF from Moonshot's HF repo
huggingface-cli download moonshotai/Kimi-Dev-72B-GGUF \
  Kimi-Dev-72B-q4_K_M.gguf --local-dir ~/models/kimi-dev

# Build llama.cpp with CUDA support (CMake is the supported build path)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Serve across both GPUs with an even layer split
./build/bin/llama-server \
  -m ~/models/kimi-dev/Kimi-Dev-72B-q4_K_M.gguf \
  --tensor-split 0.5,0.5 \
  --n-gpu-layers 99 \
  --ctx-size 16384 \
  --port 8080 \
  --flash-attn
```
--flash-attn is critical on Ampere and newer; without it you lose ~30% of prefill throughput. --tensor-split 0.5,0.5 splits the layers evenly across both GPUs. NVLink is largely optional for this in 2026: PCIe 4.0 x16 + x16 on a HEDT board lands within 5% of NVLink for a 72B model.
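Once the server is up it speaks the OpenAI-compatible API, so a quick smoke test from Python looks like this (the model field is a placeholder, since llama-server answers with whatever GGUF it loaded, and the prompt is just an example):

```python
# Quick smoke test against the local llama-server instance (port 8080 as launched above).
# Only needs the `requests` package; the "model" value is a placeholder.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "kimi-dev-72b",
        "messages": [
            {"role": "user", "content": "Write a Python function that reverses a linked list."}
        ],
        "max_tokens": 512,
        "temperature": 0.2,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```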
vLLM (recommended for serving / multi-user):
```bash
vllm serve moonshotai/Kimi-Dev-72B-AWQ \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --quantization awq
```
vLLM is materially faster than llama.cpp for batched serving (2x+) but slower for single-user interactive (because of its scheduler overhead). Use it if you have multiple developers hitting the same instance.
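The batching advantage only shows up when requests overlap. Here is a minimal sketch that fires several prompts at the local vLLM endpoint in parallel; it assumes the openai Python package and vLLM's default port 8000, the api_key is a dummy, and the prompts are illustrative:

```python
# Hammer the local vLLM server with concurrent requests to exercise continuous batching.
# Assumes `pip install openai`; base_url/port match vLLM's defaults.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

prompts = [
    "Add type hints to this function: def add(a, b): return a + b",
    "Explain what a KV cache is in two sentences.",
    "Write a pytest for a function that parses ISO-8601 dates.",
    "Refactor a nested for-loop over a dict of lists into a comprehension.",
]

def ask(prompt: str) -> str:
    out = client.chat.completions.create(
        model="moonshotai/Kimi-Dev-72B-AWQ",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return out.choices[0].message.content

# With vLLM these four requests share decode batches instead of queuing serially.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:120], "...")
```

Run the same script against a llama.cpp server and the four requests mostly queue behind each other; that difference is where the 2x+ serving advantage comes from.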
Ollama (easiest, slowest):
```bash
ollama pull kimi-dev:72b-q4_K_M
ollama run kimi-dev:72b-q4_K_M
```
Ollama is convenient but, as of v0.4, still doesn't expose --tensor-split controls and its multi-GPU support relies on llama.cpp's auto-split heuristics, which are suboptimal for 72B. Expect ~30% lower tok/s than hand-tuned llama.cpp on the same hardware. Acceptable for casual use, not optimal.
LM Studio (Mac-friendly):
LM Studio's MLX backend on Apple Silicon is the only place Kimi-Dev-72B at q4_K_M actually runs cleanly on M-series chips. The llama.cpp Metal backend works but is ~25% slower than MLX in 2026. On an M3 Ultra 96GB, expect 12-15 tok/s gen. Prefill on Apple Silicon is the bottleneck (~95 tok/s prefill vs 380+ on dual-3090) — for single-prompt chat this is fine; for codebase-scale prompts you'll wait.
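If you'd rather script the Mac path than click through LM Studio, the same MLX stack is reachable from Python through the mlx_lm package. This is a sketch under the assumption that a 4-bit MLX conversion of the model has been published; the repo id below is a placeholder, not a confirmed upload:

```python
# Generate with an MLX-converted build on Apple Silicon via mlx_lm.
# The repo id is hypothetical: check for a real MLX conversion of Kimi-Dev-72B first.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Kimi-Dev-72B-4bit")  # placeholder repo id

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# max_tokens keeps the demo short; prefill on an M3 Ultra is the slow part.
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```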
What context-length should you target for real codebase work (16K vs 64K vs 128K)?
The KV cache for a 72B model with grouped-query attention scales linearly with context length. At Kimi-Dev's GQA settings (80 layers, 8 KV heads, 128 head dim), you pay roughly 1.0 GB of KV cache per 4K of context in the q4_K_M configuration used in the table below, or about 1.6 GB per 4K if you leave the KV cache unquantized.
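As a back-of-envelope check, the cache size follows directly from the attention geometry. A rough sketch: it assumes the layer and head counts quoted above and ignores the runtime's scratch buffers and padding, so expect the measured figures in the table below to differ somewhat.

```python
# Rough KV-cache size estimate for a GQA model. Architecture numbers are the ones
# quoted in this article for Kimi-Dev-72B; real runtimes add scratch/compute buffers.
def kv_cache_gb(ctx_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    # 2x for K and V, one entry per layer, per KV head, per head dim, per token
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_tokens
    return total_bytes / 1e9

for ctx in (8_192, 16_384, 32_768, 65_536, 131_072):
    fp16 = kv_cache_gb(ctx)                    # unquantized cache
    q8 = kv_cache_gb(ctx, bytes_per_elem=1.0)  # 8-bit cache
    print(f"{ctx:>7} tokens: ~{q8:.1f} GB (8-bit) / ~{fp16:.1f} GB (fp16)")
```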
Practical context-length targets:
| Context | KV cache (q4_K_M) | Total VRAM (dual-3090) | Use case |
|---|---|---|---|
| 8K | 2.0 GB | 41.2 GB | Single-file edits, leetcode |
| 16K | 4.0 GB | 43.2 GB | Multi-function refactors |
| 32K | 8.0 GB | 47.2 GB | Mid-size repo (~10-20 files) |
| 64K | 16.0 GB | 55.2 GB | Large repo (won't fit dual-3090) |
| 128K | 32.0 GB | 71.2 GB | Whole-codebase (A100 80GB only) |
For dual-3090 / dual-4090, 32K is the practical ceiling. 16K is the right default for daily work — covers most multi-file edits without exhausting VRAM.
For single RTX 5090 32GB at q3_K_M, 8K is the ceiling before you OOM. This is the single biggest reason to prefer dual-24GB over a single 5090: the 5090 has the FLOPs, but the VRAM ceiling crushes useful context.
For M3 Ultra 192GB, 64K is comfortable, 128K works at the cost of prefill: you can load a whole large repo into context, but you'll wait 90+ seconds for prefill. Ideal for "ask one big question, get one big answer", less for interactive iteration.
Prefill vs generation: 72B prefill on a long codebase prompt is brutal
Most "tok/s" numbers you see online are generation throughput — how fast the model emits new tokens after it has read your prompt. For coding work with long context, prefill matters as much or more.
On a 4K-token coding prompt:
| Hardware | Prefill (tok/s) | Time to first token |
|---|---|---|
| Dual RTX 3090 NVLink | 380 | 10.5 s |
| Dual RTX 4090 | 720 | 5.6 s |
| Single RTX 5090 (q3_K_M) | 510 | 7.8 s |
| Dual RTX 5090 (q5_K_M) | 980 | 4.1 s |
| A100 40GB | 590 | 6.8 s |
| M3 Ultra 96GB | 95 | 42 s |
A 42-second wait on Apple Silicon for a 4K codebase prompt is the whole reason MLX-on-Mac is positioned as "great for chat, painful for autocomplete". On NVIDIA hardware, even the slowest dual-3090 setup gets you under 11 seconds, which is acceptable for interactive use.
For a 32K prompt (whole repo), multiply by ~8x. Dual-3090 hits ~84 seconds time-to-first-token. Dual-5090 hits ~33 seconds. M3 Ultra hits ~5.5 minutes. This is where the prefill numbers separate "real workflow" from "fun science project".
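The arithmetic is just prompt length divided by prefill throughput. A quick sketch using the prefill numbers measured above; treat the long-context results as first-order estimates, since prefill speed drifts with context length and batch size.

```python
# Time-to-first-token estimate: prompt_tokens / prefill_tok_per_s.
# Prefill rates are the 4K-prompt figures from the table above.
prefill_tps = {
    "dual RTX 3090 (NVLink)": 380,
    "dual RTX 4090": 720,
    "single RTX 5090 (q3_K_M)": 510,
    "dual RTX 5090 (q5_K_M)": 980,
    "A100 40GB": 590,
    "M3 Ultra 96GB": 95,
}

for prompt_tokens in (4_096, 32_768):
    print(f"--- {prompt_tokens} token prompt ---")
    for rig, tps in prefill_tps.items():
        print(f"{rig:26s} ~{prompt_tokens / tps:6.0f} s to first token")
```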
Multi-GPU scaling: tensor parallel on dual-3090, dual-4090, dual-5090
llama.cpp's tensor-split is pipeline parallelism in practice, not tensor parallelism. The 72B model is split layer-by-layer across the two GPUs and tokens flow through them sequentially, so you get the combined VRAM of both cards rather than a 2x compute speedup (roughly 1.7x of what a single GPU would manage if the model fit). Only a small activation tensor crosses the inter-GPU link at the split point, which is why PCIe is nearly as good as NVLink in this mode.
vLLM does true tensor parallelism: each layer is split across both GPUs and they compute in lockstep, so the interconnect matters more. Without NVLink (just PCIe 4.0 x16 + x16), you give up roughly 20% (22 vs 18 tok/s in the table below).
Real measured scaling on Kimi-Dev-72B q4_K_M:
| Config | Single-GPU baseline | 2-GPU throughput | Speedup |
|---|---|---|---|
| 3090 + NVLink (vLLM) | OOM (single) | 22 tok/s | — |
| 3090 + PCIe only (vLLM) | OOM | 18 tok/s | — |
| 3090 + NVLink (llama.cpp) | OOM | 20 tok/s | — |
| 4090 (no NVLink) (vLLM) | OOM | 28 tok/s | — |
| 4090 (no NVLink) (llama.cpp) | OOM | 26 tok/s | — |
| 5090 (no NVLink) (vLLM) | OOM (single q4) | 42 tok/s | — |
Bottom line: NVLink buys a 3090 pair roughly 20% under vLLM's tensor parallelism (22 vs 18 tok/s) and matters little for llama.cpp's layer split. On 4090/5090 (which never had NVLink), PCIe 4.0/5.0 is fine. For 72B-class models, your bottleneck is the GPU, not the interconnect, in 2026.
Perf-per-dollar: $1,500 dual-3090 vs $2,000 single 5090 vs $5,500 M3 Ultra 96GB
The honest cost-per-tok/s table:
| Build | All-in cost | Tok/s (gen, q4-equivalent) | $/tok/s |
|---|---|---|---|
| Used dual-3090 | $1,500 (cards) + $400 (PSU/board upgrade) | 20 | $95 |
| New single RTX 5090 | $2,000 + existing build | 26 (q3_K_M) | $77 |
| Used dual-4090 | $3,000 | 28 | $107 |
| Apple M3 Ultra 96GB | $5,500 | 14 | $393 |
| Apple M3 Ultra 192GB | $7,500 | 8 (fp16) / 14 (q4) | $938 / $536 |
| New A100 80GB | $11,000 | 26 | $423 |
By raw $/tok/s, the single RTX 5090 is the best value if you accept q3_K_M (-2.4 HumanEval+, -3.6 SWE-bench). Dual-3090 is second, with full q4_K_M scores. Apple Silicon runs four to five times the dual-3090 cost per tok/s: you're paying for unified memory, low power draw, and quietness, not raw inference performance.
If your budget is tight: dual used 3090s, $1,500 all-in, NVLink optional. If your budget is loose and you want a single modern card: RTX 5090 at q3_K_M. If you want the best of both (dual-modern at full quant): dual 4090 used at ~$3,000. Skip Apple Silicon for inference unless you already own a Mac and want to avoid buying new hardware.
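The table's math is simple enough to keep as a script and rerun with your local prices. The figures below are this article's estimates, and raw $/tok/s ignores the q3_K_M score hit baked into the single-5090 number:

```python
# Cost per tok/s for each build, using the all-in prices and throughput above.
# Raw $/tok/s ignores quant quality: the 5090 figure assumes q3_K_M, not q4_K_M.
builds = {
    "used dual RTX 3090": (1_900, 20),   # $1,500 cards + ~$400 PSU/board
    "new single RTX 5090": (2_000, 26),  # q3_K_M
    "used dual RTX 4090": (3_000, 28),
    "M3 Ultra 96GB": (5_500, 14),
    "M3 Ultra 192GB (q4)": (7_500, 14),
    "new A100 80GB": (11_000, 26),
}

for name, (cost, tps) in sorted(builds.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{name:22s} ${cost / tps:7.0f} per tok/s")
```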
Verdict matrix: get Kimi-Dev-72B if... / stick with Qwen3.5-Coder-32B if... / wait for distillations if...
Get Kimi-Dev-72B if:
- You already have dual 24GB GPUs or are willing to buy them ($1,500-$3,000)
- Your work is repo-shaped (Cursor, Aider, Continue, agentic coding)
- You care about SWE-bench-style results, not just leetcode scores
- You want one model that can also handle 32K-64K context cleanly
Stick with Qwen3.5-Coder-32B if:
- You only have a single 24GB GPU (3090, 4090, 7900 XTX)
- Your work is mostly isolated function-spec coding
- You want the easiest path to "it just works" (Apache 2.0 license, single-card)
- Power efficiency matters (single-card peak is 350W vs dual-card 700-900W)
Wait for distillations if:
- You only have 16GB or less of VRAM
- Moonshot has hinted at a Kimi-Dev-32B-Distilled landing in Q3 2026 — if that ships at 80%+ of the 72B's score, it changes the calculus for everyone on a single 3090/4090
- DeepSeek V4 Coder also has an unannounced 32B distill in the rumor mill
Bottom line
Kimi-Dev-72B is the best open model in 2026 for repo-scale software engineering, but only for users who can afford the dual-24GB hardware floor. On isolated benchmarks DeepSeek V4 Coder edges it; on agentic and multi-file work, Kimi-Dev is the leader. If you have $1,500 for a pair of used 3090s and you do real coding work — not leetcode — Kimi-Dev-72B at q4_K_M is the right local model to install today. If you're stuck on a single 24GB card, run Qwen3.5-Coder-32B and wait for the Q3 distillation.
The closed-vs-open coding gap is now 1-3 points on every benchmark that matters. As of 2026, you no longer need to pay for an API to get GPT-5-class coding help. You just need the GPUs.
Related guides
- Best 24GB GPU for local LLM in 2026
- Used RTX 3090 buyer's guide for local-LLM rigs
- Grok 4.3 vs GPT-5 vs Claude 4.7 for coding
- Mistral Medium 3.5 hardware requirements
- DFlash speculative decoding on Qwen3.5-35B-A3B with an RTX 2080 Super
Sources
- Moonshot AI — Kimi-Dev-72B model card (huggingface.co/moonshotai/Kimi-Dev-72B)
- LiveCodeBench leaderboard, April 2026 snapshot (livecodebench.github.io)
- SWE-bench Verified leaderboard (swebench.com)
- llama.cpp PR #11842 — Kimi-Dev tokenizer support, March 2026
- vLLM 0.7.2 release notes — Kimi-Dev tensor-parallel support
- LocalLLaMA megathread — Kimi-Dev-72B benchmarks, dual-3090 rigs
- techpowerup.com — RTX 5090 Founders Edition review
- anandtech.com — Apple M3 Ultra inference benchmarks
