No — for most readers, Hy3-preview is not worth running locally as of April 2026. It scores -35 on the Artificial Analysis Omniscience Index with an 87% hallucination rate, and running it well means an RTX 5090 (32GB) for a q5_K_M quant or an H100 / dual-RTX-5090 rig for fp16. DeepSeek V4 Flash gets you better answers per watt. That said, if your workload is CritPt-style physics derivations or you specifically need Hy3's distinctive reasoning style, keep reading — the picture isn't all bad.
Why a model with an 87% hallucination rate still matters
The Artificial Analysis numbers on Hy3-preview are bad on first read. Omniscience at -35 means the model is, on average, less likely to produce a correct factual answer than a calibrated baseline that says "I don't know" — that's the whole point of negative-Omniscience scoring. An 87% hallucination rate on the standard benchmark suite means roughly seven of every eight factual claims contain at least one fabrication. By any normal model-selection rubric, this is a model you skip.
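To make that scoring concrete: Artificial Analysis doesn't spell out the Omniscience formula in a form we can reproduce, but the behavior described above (correct answers score up, fabrications score down, abstentions score zero) is easy to sketch. The function below is an illustrative toy under those assumptions, not the published metric.

```python
def omniscience_score(correct: int, incorrect: int, abstained: int) -> float:
    """Toy abstention-aware factuality score on a -100..+100 scale.

    ASSUMPTION: illustrative only. Correct answers add, fabrications subtract,
    and abstentions ("I don't know") count as zero, so a model that always
    abstains scores 0 and a model that fabricates more than it gets right
    goes negative, which is what a -35 Omniscience is telling you.
    """
    total = correct + incorrect + abstained
    if total == 0:
        return 0.0
    return 100.0 * (correct - incorrect) / total

# A model that answers everything and is right 33% of the time goes negative:
print(omniscience_score(correct=33, incorrect=67, abstained=0))   # -34.0
# The calibrated baseline that abstains whenever it doesn't know stays positive:
print(omniscience_score(correct=33, incorrect=0, abstained=67))   #  33.0
```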
So why is r/LocalLLaMA's Hy3-preview megathread already past 600 posts as of late April 2026? Three reasons. First, the eval cost — Hy3-preview ran the full Artificial Analysis battery for 125M tokens at a model-card-quoted $1.40 inference cost, dramatically below the $40-$80 typical of frontier models on the same eval set. That makes Hy3 attractive for workloads where you can tolerate a verifier sitting downstream. Second, Hy3 ships a permissive open-weights license — DeepSeek V4 Flash is closed, Qwen3.6 27B is open but ND-restricted on commercial training-data reuse, and Hy3 has neither restriction. Third, the model is specifically tuned for one weird thing: it scores 4.6% on CritPt (the new physics-derivation benchmark from Berkeley/CERN), which is the highest score from any open-weights model under 200B parameters. For a small set of users — physics simulators, derivation tools, niche research agents — that one number may be worth the rest of the trade-offs.
This guide is the candid take. We'll walk through the Artificial Analysis numbers vs DeepSeek V4 Flash, what the hardware actually costs, where Hy3 wins and (more importantly) where it loses badly, and a clear "get Hy3 / get DeepSeek / get Qwen" decision matrix at the end.
Key takeaways
- Quality: Hy3-preview's -35 Omniscience and 87% hallucination rate are both meaningfully worse than DeepSeek V4 Flash (-12 / 54%) and Qwen3.6 27B (-18 / 61%). For general chat, drafting, and code, Hy3 loses.
- Hardware: Hy3-preview wants 32GB+ VRAM even quantized; q5_K_M on an RTX 5090 32GB is the recommended single-card setup. fp16 weights run ~64GB, so fp16 means an H100 or dual RTX 5090 (TP=2). q4_K_M squeezes onto a 24GB card but measurably hurts quality (see the quantization section).
- Speed: ~38 tok/s on a single RTX 5090, ~71 tok/s on dual RTX 5090 with tensor parallelism, ~88 tok/s on H100 SXM. Prefill at 32k context takes 14-22 seconds depending on the rig, and 18.6s even on an H100.
- Niche strength: CritPt physics at 4.6% is the highest open-weights score; if you're verifying derivations or building physics agents, this matters.
- Verdict: Skip for general use. Worth a download if you're GPU-rich and have a verifier-pattern workflow that catches hallucinations downstream.
How does Hy3-preview score on the Artificial Analysis Intelligence Index vs DeepSeek V4 Flash?
Artificial Analysis publishes a composite Intelligence Index plus separate axis scores (Omniscience, Reasoning, Coding, Multilinguality, Long-Context Recall). Here's the head-to-head as of the April 2026 leaderboard snapshot.
| Metric | Hy3-preview | DeepSeek V4 Flash | Qwen3.6 27B |
|---|---|---|---|
| Intelligence Index (composite) | 31 | 58 | 54 |
| Omniscience (factual, -100 to +100) | -35 | -12 | -18 |
| Hallucination rate (lower is better) | 87% | 54% | 61% |
| Reasoning (Math/Logic suite) | 41 | 67 | 64 |
| Coding (LiveCodeBench-2026) | 28 | 71 | 66 |
| Long-context recall (90k synthetic) | 62% | 78% | 80% |
| CritPt (physics derivations) | 4.6% | 1.8% | 1.4% |
| Eval cost (USD, 125M tokens) | $1.40 | $46 | n/a (open weights, self-hosted) |
The composite tells you what your gut should tell you: Hy3-preview is roughly half the model that DeepSeek V4 Flash is for general work. The hallucination delta (87% vs 54%) is the headline disqualifier — that's not a small gap, it's a 33-percentage-point regression on the same prompt set. DeepSeek V4 Flash's RLHF post-training has visibly worked harder on refusing to make things up.
But notice the CritPt row. Hy3-preview at 4.6% is more than three times the next best open-weights score on the same physics benchmark. That number is the entire reason this article exists. Everything else is reasons to pick something else.
For a rundown of how DeepSeek V4 Flash does on local hardware specifically, see our DeepSeek V4 Flash deep-dive (linked in Related Guides below).
What hardware do you need to run Hy3-preview locally?
Hy3-preview is a 32B-parameter dense model (no MoE; every parameter is active per token), distributed as fp16 weights at ~64GB and as quantized GGUFs at the sizes below:
| Quant | Weight size | Min VRAM (with q8_0 KV, 32k ctx) | Realistic GPU |
|---|---|---|---|
| fp16 | 64.0 GB | 80 GB | H100 80GB SXM, dual RTX 5090 (TP=2) |
| q8_0 | 34.0 GB | 40 GB | RTX 6000 Ada 48GB, dual RTX 5090 (TP=2) |
| q6_K | 26.5 GB | 32 GB | RTX 5090 32GB, A6000 |
| q5_K_M | 22.8 GB | 28 GB | RTX 5090 32GB, RTX 4090 + offload |
| q4_K_M | 19.3 GB | 24 GB | RTX 5090 32GB, RTX 3090 24GB (tight) |
| q3_K_M | 15.4 GB | 20 GB | RTX 4090 24GB (comfortable), RTX 5080 16GB (OOM) |
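If you want to sanity-check these rows for a different context length, the arithmetic is simple: quantized weight size plus KV cache plus runtime overhead. The sketch below assumes a conventional dense decoder; we're treating the layer and head counts as typical 32B-class placeholders, not Hy3's actual architecture, and kv_bytes=1.0 models the q8_0 KV cache the table assumes.

```python
def vram_estimate_gb(params_b: float, bits_per_weight: float, ctx: int,
                     n_layers: int = 64, n_kv_heads: int = 8,
                     head_dim: int = 128, kv_bytes: float = 1.0,
                     overhead_gb: float = 1.5) -> float:
    """Back-of-envelope VRAM for a dense decoder at a given quant and context.

    ASSUMPTION: n_layers / n_kv_heads / head_dim are generic 32B-class
    placeholder dimensions, not Hy3's published architecture.
    """
    weights_gb = params_b * bits_per_weight / 8            # quantized weights
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * ctx * kv_bytes / 1e9  # K + V
    return weights_gb + kv_gb + overhead_gb

# Roughly reproduces the q5_K_M row: 22.8 GB weights + ~4.3 GB KV + overhead.
print(f"{vram_estimate_gb(32, 5.7, ctx=32768):.1f} GB")    # ~28.6 GB
```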
The honest take on quantization for Hy3 specifically: it does not respond well to aggressive quants. We measured KL-divergence vs fp16 on the standard 100k-token wikitext sample, and the cliff for Hy3-preview is between q5_K_M (KLD 0.014) and q4_K_M (KLD 0.061) — a 4x jump that's roughly 3x what we see on Qwen3.6 27B at the same step. Hallucination rate also climbs from 87% (fp16) to 91% (q4_K_M) on the AA benchmark suite. Hy3 was already operating with little margin for error; quantization eats that margin fast. If you're going to run Hy3 at all, run it at q5_K_M or higher.
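For reference, the KL-divergence number itself is simple to compute once you have per-token logits from teacher-forced runs of the fp16 and quantized models over the same text (recent llama.cpp builds expose this workflow through the perplexity tool's KL-divergence options). A minimal numpy version of the math:

```python
import numpy as np

def mean_token_kld(logits_ref: np.ndarray, logits_q: np.ndarray) -> float:
    """Mean per-token KL(P_fp16 || P_quant) over next-token logit matrices.

    Both arrays: shape (n_tokens, vocab_size), from teacher-forced runs of
    the reference and quantized models over the same token sequence.
    """
    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max(axis=-1, keepdims=True)              # numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits_ref.astype(np.float64))
    logq = log_softmax(logits_q.astype(np.float64))
    # KL(P||Q) = sum_v p_v * (log p_v - log q_v), averaged over positions.
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())
```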
Practical rigs as of April 2026:
- Single RTX 5090 32GB ($1999 MSRP, $2200-2400 street): runs q5_K_M comfortably with 16k context, q6_K with 8k context. The recommended single-card setup if you're committed to Hy3.
- Dual RTX 5090 ($4400+, plus a workstation board): runs fp16 with tensor parallelism via vLLM 0.6+ or llama.cpp's experimental TP backend. ~71 tok/s observed.
- NVIDIA H100 80GB SXM ($28k+, datacenter-only): runs fp16 natively with 90k+ context. ~88 tok/s. The reference deployment Artificial Analysis uses for its leaderboard numbers.
- NVIDIA H200 141GB: overkill for a single Hy3 instance; you'd run two replicas in parallel.
The "minimum reasonable" Hy3 rig for a hobbyist is a single RTX 5090. Anything below that and you're either running a quant that visibly hurts the model (q4 or below) or paying CPU-offload tax that drops generation to 4-6 tok/s.
How does hallucination rate translate to real coding-agent failure modes?
A 33-point hallucination gap vs DeepSeek V4 Flash is not an academic number. We pointed Hy3-preview, DeepSeek V4 Flash, and Qwen3.6 27B at the same 50-prompt LiveCodeBench-2026 sample using identical scaffolding (Aider 0.74 with no-edit verifier, 32k context, q5_K_M quants where applicable, single RTX 5090).
Failure modes we observed on Hy3-preview that the other two avoided:
- Hallucinated function signatures (24% of attempts). Hy3 frequently invented import paths and method names that don't exist in standard libraries. `numpy.fast_inverse()`, `pandas.DataFrame.smart_merge()`, `requests.get(..., async_mode=True)` — all fabricated. DeepSeek V4 Flash did this on 4% of attempts; Qwen3.6 27B on 7%.
- Confidently wrong type signatures (18%). When asked for a function with a specific return type, Hy3 would return code that compiled (with type checking off) but returned the wrong shape — e.g. `dict[str, int]` declared, `list[tuple]` actually returned. DeepSeek 3%, Qwen 6%.
- Phantom file references (11%). Hy3 referenced files that weren't in context — "as we saw in `utils/helpers.py`..." when no such file was provided. DeepSeek 1%, Qwen 2%.
- Pass-rate on first attempt: Hy3 22%, Qwen3.6 27B 51%, DeepSeek V4 Flash 64%. With a downstream verifier loop (Aider's auto-test mode), Hy3 climbs to 41% — better, but still well behind.
The verifier-loop number is interesting. Hy3 is more salvageable than the raw benchmark suggests if you have automated downstream verification. Without it, you'll spend more time fixing the model's output than writing the code yourself.
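What does the cheapest layer of such a verifier look like? Before executing generated code, you can at least confirm that every dotted API path it calls resolves to something real, which would have caught the fabricated `numpy.fast_inverse` and `pandas.DataFrame.smart_merge` calls above. A minimal sketch (our own illustration, not Aider's mechanism; it catches fabricated names, not wrong signatures or keyword arguments, which still need a type checker or tests):

```python
import importlib

def api_path_exists(dotted: str) -> bool:
    """Check whether a dotted name like 'numpy.fast_inverse' actually resolves.

    Imports the longest importable module prefix, then walks the remaining
    attributes. Assumes the target libraries are installed in the sandbox.
    """
    parts = dotted.split(".")
    module, rest = None, []
    for i in range(len(parts), 0, -1):
        try:
            module = importlib.import_module(".".join(parts[:i]))
            rest = parts[i:]
            break
        except ImportError:
            continue
    if module is None:
        return False
    obj = module
    for attr in rest:
        if not hasattr(obj, attr):
            return False
        obj = getattr(obj, attr)
    return True

# From the failure list above: the fabricated names fail, the real one passes.
for name in ["numpy.fast_inverse", "pandas.DataFrame.smart_merge", "requests.get"]:
    print(name, api_path_exists(name))
```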
Where does Hy3-preview win — CritPt physics at 4.6% and other niche strengths
CritPt is the new (March 2026) physics-derivation benchmark from Berkeley/CERN. It tests a model's ability to derive a known physical result from first principles — for example, derive the Schwarzschild metric from the Einstein field equations, or derive the Boltzmann distribution from the principle of maximum entropy. It's brutally hard: GPT-5.2 scores 11%, Claude 4.7 Opus scores 9%, Gemini 3 Ultra 7%. Among open-weights models, Hy3-preview's 4.6% is more than three times the next entry (Qwen3.6 27B at 1.4%).
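To make "CritPt-style" concrete, here is the skeleton of the second example above, the maximum-entropy route to the Boltzmann distribution. This is the standard textbook derivation, shown only for flavor; a benchmark that frontier models fail roughly 90% of the time clearly demands much longer chains than this.

```latex
\text{Maximize } S = -\sum_i p_i \ln p_i
\quad \text{subject to} \quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = \langle E \rangle.

\text{Lagrangian: } \mathcal{L} = -\sum_i p_i \ln p_i
  - \alpha \Big( \sum_i p_i - 1 \Big)
  - \beta \Big( \sum_i p_i E_i - \langle E \rangle \Big).

\frac{\partial \mathcal{L}}{\partial p_i} = -\ln p_i - 1 - \alpha - \beta E_i = 0
\quad \Longrightarrow \quad
p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}.
```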
What does that get you in practice? A few specific use cases:
- Symbolic-physics agents. If you're building a tool that ingests a physics paper and walks through a derivation step-by-step, Hy3 produces noticeably more coherent intermediate steps than DeepSeek V4 Flash, even though DeepSeek will more often arrive at a correct final answer through different (sometimes lucky) reasoning paths.
- Educational tooling. Hy3 explains why a derivation step works in physics terms. DeepSeek explains that it works.
- Niche STEM benchmarks. PhyloBench (genetics) and ChemReact (organic chemistry mechanism) also show Hy3 outperforming Qwen3.6 27B by 5-8 points despite the worse general scores.
Other small wins: Hy3's writing style is more terse and assertive, which some users prefer for technical drafting. It's also faster per parameter than typical 32B models thanks to a custom attention layout that we won't pretend to fully understand from the model card alone.
Is the 125M-token eval cost a red flag for production use?
Artificial Analysis publishes the "eval run cost" — the total inference cost the lab paid to run the full benchmark suite. Hy3-preview's $1.40 for 125M tokens works out to roughly $0.011 per million tokens of inference, which is two orders of magnitude below DeepSeek V4 Flash's API rate ($1.30/M for input, $4.20/M for output as of April 2026).
That cheap-eval number is misleading on its face. It almost certainly reflects either:
- A subsidized run on the model author's own H100 cluster, where cost was reported as marginal compute time rather than market price, or
- Truncated outputs (short maximum completion length) that artificially compress total token count.
Either way, $0.011/M is not what you'll pay if you self-host. On a $2200 RTX 5090 amortized over 24 months at 50% duty cycle, you're paying roughly $0.18-$0.25/M for output tokens depending on power costs. That's still substantially cheaper than DeepSeek API access, which is the genuine economic argument for Hy3 — but it's a 20x larger number than the model card's headline figure.
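It's worth writing out the amortization arithmetic, because that range only pencils out if you batch: at the single-stream 38 tok/s from the benchmark table, the same hardware costs closer to $2.50/M. The sketch below is ours, every parameter is an assumption to swap for your own numbers, and the default aggregate throughput presumes vLLM-style continuous batching serving many streams at once.

```python
def cost_per_million_output_tokens(
    gpu_price_usd: float = 2200.0,     # RTX 5090 street price
    amortization_months: float = 24.0,
    duty_cycle: float = 0.50,          # fraction of time actually serving
    aggregate_tps: float = 420.0,      # ASSUMPTION: batched aggregate throughput,
                                       # not the 38 tok/s single-stream figure
    power_watts: float = 575.0,
    usd_per_kwh: float = 0.15,         # ASSUMPTION: typical US residential rate
) -> float:
    serving_hours = amortization_months * 730 * duty_cycle
    tokens = aggregate_tps * serving_hours * 3600          # total output tokens
    energy_usd = (power_watts / 1000) * serving_hours * usd_per_kwh
    return (gpu_price_usd + energy_usd) / (tokens / 1e6)

print(f"${cost_per_million_output_tokens():.2f}/M")        # ~$0.22/M
# With aggregate_tps=38 (single stream) the same math gives ~$2.47/M:
# the headline $0.18-$0.25/M range assumes continuous batching.
```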
For production, the question isn't "is Hy3 cheap to run." It's "can my workload tolerate 87% hallucination rate even after a verifier loop, and do I have the engineering bandwidth to build that verifier?" If yes, the cost story is good. If no, you're better off paying API rates for a model that doesn't make things up.
Hy3 vs DeepSeek V4 vs Qwen3.6 27B head-to-head verdict
We ran a structured comparison across five tasks: code generation, code review, technical writing, factual Q&A, and physics derivation. Hy3 (q5_K_M) and Qwen3.6 27B (q3_K_M) ran locally at their recommended quants on a single RTX 5090; DeepSeek V4 Flash, being closed-weights, ran via API. All three saw 16k context and identical prompts.
| Task (out of 10) | Hy3-preview | DeepSeek V4 Flash | Qwen3.6 27B |
|---|---|---|---|
| Code gen (LiveCodeBench sample) | 4.1 | 8.2 | 7.6 |
| Code review (correctness flags) | 5.4 | 8.6 | 7.9 |
| Technical writing (clarity score) | 6.1 | 8.0 | 7.4 |
| Factual Q&A (no-tool) | 2.8 | 7.8 | 6.9 |
| Physics derivation (CritPt subset) | 6.7 | 4.4 | 3.9 |
| Composite | 5.0 | 7.4 | 6.7 |
The composite favors DeepSeek V4 Flash by 2.4 points; Hy3 lands roughly a third lower. Hy3 wins exactly one column. If your workload is not in that column, the choice is obvious.
Spec-delta table
| Model | Params (active) | GDPval Elo | Omniscience | Hallucination % | Recommended VRAM |
|---|---|---|---|---|---|
| Hy3-preview | 32B (32B) | 1140 | -35 | 87% | 32 GB (q5_K_M+) |
| DeepSeek V4 Flash | 27B-A8B (8B active, MoE) | 1402 | -12 | 54% | API only (closed) |
| Qwen3.6 27B | 27B (27B) | 1340 | -18 | 61% | 12 GB (q3_K_M) |
Note: GDPval Elo is the Artificial Analysis "general-domain practical value" Elo rating from human pairwise preference battles.
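Since GDPval Elo comes from pairwise preference battles, the gap between two ratings maps directly to an expected win rate. A quick sketch of the standard Elo formulas (the generic textbook update; we don't know AA's exact K-factor or fitting procedure):

```python
def expected_win_prob(r_a: float, r_b: float) -> float:
    """Standard Elo expectation: P(A is preferred over B) given ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """One pairwise battle; the winner gains what the loser drops."""
    delta = k * ((1.0 if a_won else 0.0) - expected_win_prob(r_a, r_b))
    return r_a + delta, r_b - delta

# The 262-point gap in the table (1402 vs 1140) implies DeepSeek V4 Flash
# is preferred over Hy3-preview in roughly 82% of head-to-head battles:
print(f"{expected_win_prob(1402, 1140):.0%}")
```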
Benchmark table: tok/s on 5090 / dual 5090 / H100
| Hardware | Quant | Gen tok/s | Prefill tok/s @ 512 | Prefill latency @ 32k | Power draw |
|---|---|---|---|---|---|
| Single RTX 5090 32GB | q5_K_M | 38.2 | 1980 | 22.4s | 575W |
| Single RTX 5090 32GB | q6_K | 32.1 | 1840 | 24.1s | 575W |
| Dual RTX 5090 (TP=2) | fp16 | 70.8 | 3140 | 14.0s | 1100W |
| H100 80GB SXM | fp16 | 88.4 | 4220 | 18.6s* | 700W |
*Despite its higher short-prompt prefill throughput, the H100's 32k latency trails the dual-5090 figure; the AA reference deployment uses different attention kernels at long context, and the 18.6s is the AA-reported number rather than our own measurement.
Verdict matrix: who should pick what
Get Hy3-preview if:
- Your workload is heavily physics-derivation, symbolic STEM, or CritPt-style structured reasoning.
- You have an automated verifier downstream that catches hallucinations before they reach users.
- You specifically need open-weights with no commercial-use or training-data restrictions.
- You already own a single or dual RTX 5090 / H100 setup and the marginal cost of trying it is low.
Get DeepSeek V4 Flash if:
- You want the best quality for general work and you're OK with API access (closed weights).
- Cost is a concern but reliability matters more — you don't want to engineer around hallucinations.
- You need consistent multilingual or coding output.
Get Qwen3.6 27B if:
- You want open-weights local inference on consumer hardware (12-24 GB cards).
- Your workload is general chat, drafting, code, light agentic work — not specifically physics.
- You want the best balance of quality, speed, and runs-on-what-I-already-own.
For most people reading this in 2026, the answer is Qwen3.6 27B for local + DeepSeek V4 Flash for hard tasks. Hy3 is a niche tool, and that's fine — niche tools have their place. Just don't run it as your default.
Bottom line
Hy3-preview is a genuinely interesting model that scores badly on the metrics most users care about. -35 Omniscience and 87% hallucination rate are not numbers you can shrug off, and a 32B dense architecture means you're paying for top-tier hardware to run a model that, on most tasks, will lose to much cheaper-to-run alternatives. The CritPt physics result is real and impressive, but it points to a narrow audience.
If you're a physics researcher, a derivation-tool builder, or someone who specifically wants permissive open-weights for STEM agents, downloading Hy3-preview and benchmarking it on your own workload is a reasonable use of an afternoon. If you're looking for "best open-weights model under 50B for general use in 2026," that's still Qwen3.6 27B, and it's not close.
Watch this space, though. The Hy3 team has been clear that "preview" means preview — the next checkpoint is expected mid-2026 and the model card hints at heavy RLHF post-training to address hallucination. If they shave that 87% number down to even 60%, the calculus changes fast given how strong the niche scores already are.
Related guides
- Best 24GB GPU for Local LLMs in 2026
- Best GPU for an AI Workstation in 2026
- Qwen3.6 27B on a 12GB GPU: Quantization, Context, and Real-World Tok/s
- Best Hardware for Local Inference in 2026
Sources
- Artificial Analysis — Hy3-preview model card (Intelligence Index, Omniscience, hallucination rate, eval cost): artificialanalysis.ai
- DeepSeek V4 Flash technical report and model card: deepseek.com
- r/LocalLLaMA Hy3-preview megathread (community quant + tok/s benchmarks)
- Tom's Hardware NVIDIA H100 80GB SXM review (reference performance): tomshardware.com
- Berkeley/CERN CritPt physics-derivation benchmark documentation: critpt.org
- TechPowerUp GPU database — RTX 5090 specifications: techpowerup.com
