The TL;DR. AMD's MI350P is the first PCIe-form-factor Instinct accelerator since the MI210 in 2022, and per ServeTheHome's launch coverage it's effectively half of the OAM-form-factor MI350X. 128 compute units, 8,192 stream processors, 512 matrix cores, 144GB of HBM3E at 4 TB/s, and a 600W TBP that drops to 450W via a config switch. It's also the first AMD datacenter card to ship with a 16-pin 12V-2×6 power connector. AMD's pitch — reported by Tom's Hardware — is roughly 40% more theoretical FP16/FP8 compute than NVIDIA's H200 NVL in a similar PCIe envelope. The bigger story for most operators isn't the FLOPs lead, though. It's that there's finally a CDNA 4 part you can air-cool and rack in a normal server.
Why this card matters: the four-year PCIe Instinct gap
For three full silicon generations — MI250X, MI300X, MI325X — AMD's Instinct line shipped exclusively in the OAM modular form factor. OAM means liquid cooling, 700-1000W per accelerator, an 8-GPU tray, and a server platform engineered around it. That's fine if you're a hyperscaler. It's a non-starter for almost everyone else.
Per The Register's launch coverage, AMD's PCIe absence over that window let NVIDIA's H100 PCIe, then H200 NVL, then RTX PRO 6000 Server (96GB GDDR7) take essentially uncontested ownership of the on-prem AI buyer who wants:
- A card they can drop into a Dell PowerEdge R750 / HPE DL380 Gen11 / Supermicro H13 chassis.
- An air-cooled accelerator that doesn't require a closed-loop liquid plumbing kit.
- More than the 80-96GB ceiling of the Hopper / Blackwell PCIe SKUs.
The MI350P is the answer. And per the AMD product page, it's already shipping through partner channels.
What's actually in the card
The numbers below are sourced from the AMD blog post announcement, TweakTown's spec breakdown, and ServeTheHome's MI350P intro.
| Spec | MI350P (PCIe) |
|---|---|
| Architecture | CDNA 4 |
| Compute Units | 128 |
| Stream Processors | 8,192 |
| Matrix Cores | 512 |
| Peak engine clock | 2.2 GHz |
| Process | 3nm compute chiplets, 6nm I/O die |
| Configuration | 4 XCDs |
| Memory | 144GB HBM3E |
| Memory bandwidth | 4 TB/s (8 Gbps × 4096-bit bus) |
| Last-level cache | 128MB Infinity Cache |
| Peak FP32-equivalent | 2,299 TFLOPS |
| Peak MXFP4 | 4,600 TFLOPS |
| Native low-precision | MXFP4, MXFP6 |
| Sparsity | Supported on most mainstream FP8 / FP16 paths |
| Form factor | FHFL dual-slot PCIe 5.0 ×16 (10.5") |
| Cooling | Air, passive heatsink |
| TBP | 600W (configurable to 450W) |
| Power connector | 16-pin 12V-2×6 |
| GPU partitioning | 4 partitions |
| Inter-card | PCIe only — no Infinity Fabric exposed |
| Cards per server tray | Up to 8 |
Two specs that quietly matter
The 12V-2×6 connector. This is the first AMD datacenter card to use the 16-pin 12V-2×6 (the post-melting-incident successor to 12VHPWR). For server builders this is meaningful for two reasons: a single dense connector instead of 2-3 EPS12V cables per card, and forward-compatibility with the same PSU lineup that powers consumer RTX 50-series and the upcoming Hopper-successor PCIe parts. PSUs spec'd for the RTX 4090 / 5090 era will hand the MI350P clean 600W on one cable.
The 450W power-cap mode. Almost no current-generation server chassis is rated for 600W per accelerator slot, sustained, dual-card. PowerEdge R750, HPE DL380 Gen11, Supermicro H13 all top out around 350-450W per slot in their default PSU/airflow envelope. Per AMD's product brief, the 450W mode trades roughly 15-20% of peak throughput for compatibility with that enormous installed-chassis base. For a fleet operator who wants to retrofit existing hardware rather than buy new chassis, this is decisive. The OAM MI355X at 1000W liquid has no equivalent fallback.
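For operators scripting that retrofit across a fleet, here is a minimal sketch of how the cap could be toggled from the host. It assumes the power-cap path is the same rocm-smi interface used on MI300-era parts (`--setpoweroverdrive`, `--showmaxpower`, `-d` for device selection); verify the flag names against `rocm-smi --help` on the installed ROCm 7 build, and expect to need root.

```python
"""Toggle an MI350P between its 600W and 450W TBP modes (sketch)."""
import subprocess

def set_power_cap(device_index: int, watts: int) -> None:
    # rocm-smi takes the cap in watts; -d selects a single GPU. Flag names assumed
    # to carry over from MI300-era ROCm tooling.
    subprocess.run(
        ["rocm-smi", "-d", str(device_index), "--setpoweroverdrive", str(watts)],
        check=True,
    )

def show_power_cap(device_index: int) -> None:
    subprocess.run(["rocm-smi", "-d", str(device_index), "--showmaxpower"], check=True)

if __name__ == "__main__":
    set_power_cap(0, 450)  # retrofit mode for 350-450W-per-slot chassis
    show_power_cap(0)      # confirm the new cap took effect
```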
"Half an MI350X" — what that framing actually buys you
Per ServeTheHome, the MI350P is half the silicon of an MI350X OAM module:
| Spec | MI350P (PCIe) | MI350X (OAM) |
|---|---|---|
| Compute Units | 128 | 256 |
| Memory | 144GB HBM3E | 288GB HBM3E |
| Memory bandwidth | 4 TB/s | ~8 TB/s |
| TBP | 600W (450W cap) | ~1000W |
| Cooling | Air | Liquid |
| Inter-card | PCIe only | 7× Infinity Fabric mesh |
| Partitioning | 4 | 8 |
| Form factor | PCIe FHFL | OAM module |
If you're a hyperscaler running a 405B-parameter model with tensor parallelism across 8 GPUs, the MI350X is your part. The 7-link Infinity Fabric mesh is what keeps inter-GPU activations from saturating PCIe. If you're running a single-card or 2-card on-prem deployment of a 70B-class model, the MI350P trades that mesh — which you weren't going to use anyway — for an air-cooled card you can install in a normal server. Same CDNA 4 silicon. Same FP4/FP6/FP8/FP16 pipelines. Same 4-bit-aware matrix cores.
The clearest read: the MI350P isn't a degraded MI350X for budget buyers. It's an MI350X die binned and packaged for a different deployment problem.
How it compares to NVIDIA's PCIe lineup
The H200 NVL is the obvious head-to-head. Both are 600W PCIe dual-slot. Both are HBM3E. Both target the same on-prem AI-server buyer. AMD's published claim, reported by Tom's Hardware: roughly 40% more theoretical FP16/FP8 compute than the H200 NVL.
| Card | Memory | Bandwidth | TBP | FP4 native? | Form factor |
|---|---|---|---|---|---|
| AMD MI350P | 144GB HBM3E | 4 TB/s | 600W / 450W | Yes (MXFP4 + MXFP6) | PCIe dual-slot |
| NVIDIA H200 NVL | 141GB HBM3E | ~4.8 TB/s | 600W | No | PCIe dual-slot |
| NVIDIA RTX PRO 6000 Server | 96GB GDDR7 | ~1.8 TB/s | 600W | Yes (FP4) | PCIe dual-slot |
| NVIDIA H100 PCIe | 80GB HBM2e | ~2.0 TB/s | 350W | No | PCIe dual-slot |
| NVIDIA L40S | 48GB GDDR6 | ~864 GB/s | 350W | No | PCIe dual-slot |
A few things that surprise on first read:
Bandwidth is not a win for AMD. 4 TB/s vs NVIDIA's claimed ~4.8 TB/s: the H200 NVL leads on raw HBM bandwidth despite slightly less capacity, so throughput on memory-bound LLM-inference paths (large-batch, long-context) will be closer than the compute specs suggest.
Capacity advantage is small but useful. 144GB vs 141GB is about 2%, not a category-defining gap. But it's enough to squeeze in an aggressively quantized 7B-class draft or adapter model, or to leave more room for KV-cache at 128K-token context. For a single-card deployment that's already pinning the H200 NVL at 95% memory utilization, the extra 3GB of headroom is real.
FP4 is the differentiating spec. The MI350P natively supports MXFP4 and MXFP6, the OCP-standard 4- and 6-bit microscaling formats. H200 NVL does not. RTX PRO 6000 Server does. For inference-only deployments running quantized models, MXFP4 doubles effective throughput vs FP8 with negligible quality loss on most LLMs. AMD's headline 4,600 TFLOPS MXFP4 figure is where the comparison gets lopsided: the published ~40% FP16/FP8 advantage narrows in practice on memory-bound BF16 inference, but on the FP4-quantized inference workloads that dominate production deployments today, the H200 NVL has no native format to answer with.
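To put numbers on what the format saves, a quick weight-footprint sketch. The 4.25 effective bits per element follows from the OCP MX block layout (32 FP4 elements sharing one 8-bit scale); parameter counts are nominal and ignore embeddings and norm layers.

```python
"""Back-of-envelope weight footprint at FP8 vs MXFP4."""

def weight_gb(params_billions: float, bits_per_elem: float) -> float:
    # decimal GB of storage for the weight tensors alone
    return params_billions * 1e9 * bits_per_elem / 8 / 1e9

for name, params in [("Llama-3.1-70B", 70), ("Llama-3.1-405B", 405)]:
    fp8 = weight_gb(params, 8.0)
    mxfp4 = weight_gb(params, 4.25)   # 4-bit elements + one 8-bit scale per 32-element block
    print(f"{name}: FP8 ~{fp8:.0f} GB, MXFP4 ~{mxfp4:.0f} GB")

# Llama-3.1-70B:  FP8 ~70 GB,  MXFP4 ~37 GB  (fits a 144 GB card with headroom)
# Llama-3.1-405B: FP8 ~405 GB, MXFP4 ~215 GB (does not fit a single 144 GB card)
```

Halving the bytes touched per weight is also where the near-2× decode-throughput claim on memory-bound workloads comes from.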
FP64 still favors AMD. CDNA 4 retains AMD's traditional FP64 strength. For HPC, scientific computing, and molecular dynamics workloads, the MI350P remains in a different league than the GDDR7-derived consumer-architecture NVIDIA parts.
Per Spheron's MI350X vs B200 analysis, the wider MI350-series-vs-NVIDIA picture in MLPerf Inference 6.0 had the MI355X tying NVIDIA's B200 on Llama-2-70B Offline, hitting 97% on Server, and reaching 119% on the Interactive benchmark. The MI350P inherits the same compute pipeline at half the scale, so single-card numbers should land in the same competitive range proportional to its halved CU count.
What's it good for
1. On-prem inference of 70B-class models at long context
A single MI350P holds Llama-3.1-70B-Instruct at FP8 with substantial KV-cache headroom for 128K-token contexts. Per the ROCm vLLM project, MXFP4 quantization is supported on launch — meaning the same 70B model fits in roughly 35GB of HBM with the remaining 109GB free for batch-32 KV-cache or speculative-decoding draft models.
For comparison: a dual-RTX-5090 build (2 × 32GB = 64GB) requires tensor-parallel sharding across PCIe with no NVLink, which costs 20-30% throughput and adds operational complexity. One MI350P removes the sharding entirely.
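To size that headroom in cached tokens, a sketch using the published Llama-3.1-70B shape (80 layers, grouped-query attention with 8 KV heads, head_dim 128) and an FP8 KV-cache; the ~6GB runtime/activation reserve is an assumption.

```python
"""KV-cache capacity on a 144 GB card for Llama-3.1-70B (sketch)."""
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
FP8_BYTES = 1

# K and V, every layer, every token
per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP8_BYTES   # 163,840 B, about 160 KiB/token

for label, weights_gb in [("FP8 weights", 70), ("MXFP4 weights", 37)]:
    free_bytes = (144 - weights_gb - 6) * 1e9               # ~6 GB runtime/activation reserve
    tokens = int(free_bytes // per_token)
    print(f"{label}: ~{tokens // 1000}K cached tokens "
          f"(~{tokens // 131072} full 128K-token contexts)")
```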
2. Fine-tuning 70B-class models on a single card
Full-parameter AdamW fine-tuning of a 70B model needs on the order of 1TB of state (bf16 weights and gradients plus fp32 master weights and optimizer moments), which no single card from any vendor can host. What 144GB buys is single-card LoRA/QLoRA fine-tuning with the frozen base model resident: roughly 70GB of FP8 or 35GB of 4-bit base weights leaves comfortable room for adapter weights, gradients, AdamW state, and activation buffers without offloading, while a full-bf16 base (~140GB) is marginal. On consumer hardware the same job means sharding the base model across 4-8 RTX 4090/5090 cards or a multi-A6000/RTX PRO 6000 setup. Per HuggingFace's ROCm inference blog, the PEFT/LoRA paths on ROCm have parity with the CUDA equivalent for 70B-class targets. The memory math is sketched below.
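A sketch of that budget, under stated assumptions: full AdamW keeps bf16 weights and gradients plus fp32 master weights and two fp32 moments (16 bytes per parameter); LoRA freezes the base model (no gradients or optimizer state) and trains an adapter assumed here to be about 0.5% of the base parameter count.

```python
"""Single-card fine-tuning memory budget for a 70B model (sketch)."""
BASE = 70e9
ADAPTER = 0.005 * BASE                 # assumed LoRA adapter size (roughly rank 16-64)
TRAIN_STATE = 2 + 2 + 4 + 8            # bytes/param: bf16 w+g, fp32 master, fp32 moments
RESERVE_GB = 8                         # assumed activation / runtime reserve

full_ft_gb = BASE * TRAIN_STATE / 1e9
print(f"Full-parameter AdamW: ~{full_ft_gb:.0f} GB -> multi-GPU / offload territory")

for label, base_bytes in [("bf16 base", 2), ("FP8 base", 1), ("4-bit base", 0.5)]:
    lora_gb = (BASE * base_bytes + ADAPTER * TRAIN_STATE) / 1e9
    verdict = "fits" if lora_gb + RESERVE_GB <= 144 else "does not fit"
    print(f"LoRA, {label}: ~{lora_gb:.0f} GB -> {verdict} in 144 GB")
```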
3. HPC + scientific computing where FP64 throughput is the bottleneck
Per Phoronix's MI350P review, the FP64 advantage over NVIDIA's PCIe lineup remains substantial. Molecular dynamics (GROMACS, AMBER), weather modeling (ICON, WRF), and computational fluid dynamics workloads where double-precision is the rate-limiting step continue to favor CDNA over Hopper-PCIe.
4. AI workstations that need to coexist with a display GPU
The dual-slot air-cooled form factor and PCIe-only interconnect are a feature, not a limitation, for the workstation buyer. A Threadripper PRO 7995WX board with 128 PCIe 5.0 lanes can host one MI350P + a separate display GPU + 100GbE NIC + multiple NVMe drives without bifurcation tradeoffs.
What it's not for
- Gaming or graphics. No display outputs. Compute-only.
- Lowest-possible-latency single-stream chat on small models. The MI350P's win is throughput-at-scale and very-large-model capacity, not minimal latency on a 7B at batch=1. That workload still favors a tuned RTX 5090 + llama.cpp setup.
- Multi-GPU training of >70B models. No Infinity Fabric mesh on the PCIe SKU. PCIe 5.0 ×16 (~63 GB/s per direction) is the inter-card pipe: more than usable for inference data parallelism, but a bottleneck vs the OAM MI355X / H200-SXM mesh for training tensor-parallel >70B targets (see the back-of-envelope sketch after this list).
- Hobbyist budgets. Pricing isn't on a public retail channel as of the launch announcement. Channel-partner expectations from MI355X tier pricing put the MI350P in the $25,000–$35,000 range, in line with H200 NVL.
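The back-of-envelope sketch referenced above, with assumed numbers: ~63 GB/s per direction of PCIe 5.0 ×16, a 2-card ring all-reduce, the Llama-70B shape (hidden size 8192, 80 layers), and link time only, ignoring overlap and per-message latency.

```python
"""PCIe 5.0 x16 as the inter-card pipe: training vs inference traffic (sketch)."""
LINK = 63e9                      # bytes/s, one direction
N = 2                            # cards
RING = 2 * (N - 1) / N           # ring all-reduce bytes sent per GPU, per payload byte

# Data-parallel training: one gradient all-reduce per optimizer step
grad_bytes = 70e9 * 2            # bf16 gradients for a 70B model
print(f"DP gradient sync: ~{RING * grad_bytes / LINK:.1f} s of link time per step")

# Tensor-parallel decode: two activation all-reduces per layer, per token
act_per_token = 2 * 80 * 8192 * 2      # bf16 activations, hidden=8192, 80 layers
t = RING * act_per_token / LINK
print(f"TP=2 decode: ~{t * 1e6:.0f} us/token of link time "
      f"(per-sync latency, not bandwidth, is the real cost here)")

# Inference data parallelism: each card serves whole requests, so there is
# essentially no inter-card traffic on the critical path at all.
```

Roughly two seconds of gradient traffic per optimizer step is why the OAM mesh matters for training; for data-parallel inference the pipe is essentially never the limiter.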
Software: ROCm 7 and the "does it just work" question
The historic complaint about ROCm — "works in the docs, doesn't work on my server" — has materially improved since the MI300X launch. Per Phoronix's coverage of the ROCm release cadence, the launch ROCm version targeting MI350-series silicon is feature-complete for the production-relevant inference paths:
- vLLM-ROCm with FP8 weights + FP8 KV-cache, Speculative Decoding, MXFP4 quantization, and prefix caching (via the ROCm/vllm fork).
- PyTorch native CDNA 4 kernels, installed via `pip install torch --index-url https://download.pytorch.org/whl/rocm7.0`.
- FlashAttention 2 and 3 ports merged.
- HIP-graph capture stable for inference servers.
- HuggingFace Transformers native support — most existing CUDA-only inference scripts run unmodified after the ROCm wheel install.
Per HuggingFace's ROCm LLM-inference blog, the gap on production inference paths (vLLM, TGI, sglang, llama.cpp) is now days of configuration rather than weeks of porting. Getting bleeding-edge research code (newly published papers, custom kernels, niche fine-tuning frameworks) running on ROCm is still slower than CUDA. For most operators that's not the relevant workload.
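As a concrete version of the "does it just work" question, here is a minimal single-card serving sketch for the 70B FP8 deployment from the workloads section. It assumes the ROCm/vllm fork keeps the upstream Python API (the `LLM` constructor's `quantization`, `kv_cache_dtype`, `max_model_len`, and `gpu_memory_utilization` arguments) and uses the stock HuggingFace checkpoint name; MXFP4 flag names are deliberately not shown because they may differ in the fork.

```python
"""Single-MI350P Llama-3.1-70B serving sketch with vLLM on ROCm."""
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="fp8",            # FP8 weights (~70 GB), leaving HBM for KV-cache
    kv_cache_dtype="fp8",          # FP8 KV-cache roughly doubles cacheable tokens
    max_model_len=131072,          # 128K context
    gpu_memory_utilization=0.92,
)

out = llm.generate(
    ["Summarize the memory hierarchy of a CDNA 4 accelerator in two sentences."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(out[0].outputs[0].text)
```

The same script is what you would run on an H200 NVL box with CUDA vLLM; that interchangeability is the point.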
Build / deploy considerations
If you're spec'ing an MI350P workstation or a 1-2 card on-prem inference node, the things to check before pulling the trigger:
- PCIe 5.0 ×16 slot with proper power delivery. Current high-end workstation boards (TRX50, WRX90, W790) support this; older consumer boards (Z690 / X670) often drop the primary slot to ×8 or fall back to PCIe 4.0 signaling once other Gen5 devices are populated.
- PSU and 12V-2×6 cabling. ATX 3.1 PSUs in the 1500W+ class with native 16-pin output are the simplest path. ATX 3.0 with the original 12VHPWR connector works, but use the revised cable and ensure it is firmly seated.
- Airflow. Passive heatsink. Workstation chassis need front-to-back airflow with at least one ~140mm intake at 1500+ RPM capability. Server chassis usually already over-spec for this.
- CPU lane availability. Threadripper PRO 7995WX with 128 PCIe 5.0 lanes is the comfortable workstation choice. EPYC 9004/9005 series is the comfortable server choice. Consumer X670/Z890 boards force compromises.
- OS. Ubuntu 22.04 / 24.04 LTS, RHEL 9.4+, SLES 15.5+ per the ROCm install docs.
- BIOS Above-4G-Decoding. Confirm the board allows >256GB Above-4G mapping. Some workstation boards default to 64GB and require a BIOS toggle.
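Once those boxes are ticked and the ROCm 7 PyTorch wheel from the index above is installed, a minimal sanity check, assuming the usual ROCm-PyTorch convention where the HIP backend answers through the `torch.cuda` namespace and `torch.version.hip` is populated instead of `torch.version.cuda`:

```python
"""Post-install sanity check on a fresh MI350P box (sketch)."""
import torch

assert torch.cuda.is_available(), "ROCm runtime not visible to PyTorch"
print("HIP version:", torch.version.hip)

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB")  # expect ~144 GB

# Quick matmul to confirm kernels actually dispatch to the card
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
print("matmul ok, mean |x@x| =", (x @ x).float().abs().mean().item())
```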
Pricing, availability, and the buying channel
AMD's launch announcement says the MI350P is "now available across a variety of partners." No specific OEM list and no MSRP appeared in the launch press materials.
The retail channel for PCIe datacenter accelerators — MI350P, H200 NVL, RTX PRO 6000 Server — runs through partner system integrators (Supermicro, Lambda Labs, Dell, HPE, ASUS workstation channel) rather than Amazon or NewEgg. Workstation builders looking to buy one card direct from a reseller will go through a Lambda or BoxX-class quote channel. A retail listing in the next 60 days is unlikely.
The eBay datacenter-pull market is where previous-generation parts (MI210, A100 PCIe, V100, H100 PCIe) end up at meaningful discounts to MSRP. Expect the same trajectory for MI350P 12-18 months after launch.
Pricing volatility on these PCIe accelerators is high and channel-dependent. For now, the realistic comparison shoppers can run today is retail-available cards (RTX 5090, RTX PRO 6000 Workstation) on Amazon vs datacenter pulls (RTX A6000, H100 PCIe, MI210) on eBay.
How AMD's MLPerf claims read in context
Per AMD's MLPerf Inference 6.0 results blog, the MI355X (the OAM sibling) crossed 1 million tokens/second aggregate on Llama-2-70B and posted 100,282 tok/s single-GPU — 3.1× the previously submitted MI325X. The MI350P shares the same FP4/FP6/FP8 pipeline at half the CU count, so a reasonable single-card MI350P expectation lands in the 45,000-55,000 tok/s range on the same benchmark profile. AMD has not yet posted MLPerf Inference numbers under the MI350P submission category specifically.
Should you wait for MI400 or buy now?
The standard generational-cadence answer applies. AMD's roadmap targets a CDNA 5 / MI400-class part in late 2027. If your workload is "buy a card today so an existing service stops thrashing into KV-cache spillover or paging weights to system RAM," the MI350P is the answer. If it's "spec a 5-year HPC cluster that needs to be at the leading edge in 2028," wait for MI400 detail.
Most operators are in the first bucket.
FAQ
Is the MI350P available at retail like Amazon or NewEgg?
No — the MI350P ships through partner system integrators (Supermicro, Lambda Labs, Dell, HPE) and datacenter resellers, not consumer retail. Previous-generation MI210 PCIe cards do appear on eBay as datacenter pulls 12-18 months after launch; expect the same trajectory for MI350P.
How does the MI350P actually compare to NVIDIA H200 NVL?
Per AMD's published claim (reported by Tom's Hardware), roughly 40% more theoretical FP16/FP8 compute. Memory capacity: 144GB MI350P vs 141GB H200 NVL (3GB advantage to AMD). Memory bandwidth: 4 TB/s vs ~4.8 TB/s (slight edge to NVIDIA). Both 600W PCIe dual-slot. The deciding factor is software — CUDA-native shops pick H200 NVL; shops already running ROCm see MI350P as a no-porting-work generational upgrade.
What is MXFP4 and why does AMD keep mentioning it?
MXFP4 is an OCP-standard 4-bit microscaling floating-point format. Compared to FP8, it doubles effective throughput per memory access with negligible accuracy loss on most LLMs. The MI350P natively supports MXFP4 and MXFP6; the NVIDIA H200 NVL does not (FP8 is its lowest native precision); the RTX PRO 6000 Server does (FP4). The 4,600 TFLOPS MXFP4 figure is the headline performance number, and FP4-quantized inference is where AMD's advantage over the H200 NVL is widest; on BF16 the published ~40% FP16/FP8 compute edge narrows in practice because memory bandwidth becomes the limiter.
Can a single MI350P run Llama-3.1-405B?
Not at 4-bit. At MXFP4 (about 4.25 bits per weight once block scales are counted) the 405B weights alone come to roughly 215GB, well beyond a single card's 144GB. Fitting 405B on one MI350P would take a roughly 2-bit quantization, with the quality loss that implies. The practical options are two MI350Ps with the model split over PCIe, or the 288GB MI350X OAM part; a 70B-class model is the realistic single-card ceiling at production-quality precision.
Why is the 450W power-cap mode a big deal?
Most current-generation PCIe servers (Dell PowerEdge R750, HPE DL380 Gen11, Supermicro H13) were spec'd for 350-450W per accelerator slot. Switching the MI350P from 600W to 450W mode trades ~15-20% of peak FP8 throughput for compatibility with that enormous installed base of chassis. For fleet retrofits where buying new servers isn't an option, this is decisive.
Do I need ROCm 7?
Yes. ROCm 7.0 is the first release with day-one CDNA 4 support. ROCm 6.x will not detect MI350-series silicon.
Does it have Infinity Fabric for multi-card scaling?
No. The PCIe SKU exposes PCIe 5.0 ×16 only. Infinity Fabric mesh is OAM-only. For inference data-parallelism across cards, PCIe 5.0 ×16 (~63 GB/s per direction) is plenty. For tensor-parallel training of >70B models, you'll feel the difference vs MI355X OAM or H200-SXM.
What's MXFP6 for?
Microscaling 6-bit floating point. Sits between FP8 and FP4 on the accuracy/throughput curve. For models that lose too much quality at MXFP4 but where FP8 leaves throughput on the table, MXFP6 is the new middle option. Native support on MI350-class hardware means no software emulation overhead.
Does the 12V-2×6 connector mean my old 8-pin PCIe PSU can't drive it?
Correct. You need either a native 16-pin 12V-2×6 PSU (ATX 3.1 generation) or a quality adapter from 4× EPS12V/PCIe-8-pin to 12V-2×6. For 600W sustained draw, the native cable is strongly preferred — cheap adapters were the failure mode in the original 12VHPWR melting incidents.
Will it work in a Threadripper PRO 7995WX workstation?
Almost certainly yes — WRX90 motherboard with PCIe 5.0 ×16 slot, 1500W+ PSU with native 12V-2×6 output, and adequate front-to-back chassis airflow are the requirements. Confirm BIOS supports >256GB Above-4G-Decoding before purchase.
How does MI350P compare to MI350X?
Same CDNA 4 silicon. MI350P is roughly half: 128 CUs vs 256, 144GB vs 288GB, 600W (450W cap) vs ~1000W liquid, 4 partitions vs 8, no Infinity Fabric vs 7-link IF mesh. PCIe trades scale for deployability into existing servers and workstations.
Bottom line
The MI350P is the first AMD datacenter accelerator since the MI210 that a typical on-prem operator can rack-mount or workstation-install without a liquid-cooling refit. AMD's published "40% faster than H200 NVL" figure is a theoretical FP16/FP8 compute claim; the practical gap is widest on FP4-quantized inference, where the H200 NVL has no native format, and narrows on memory-bound BF16 work. The 144GB capacity edge over H200 NVL is a small but meaningful 3GB. The 450W configurable power mode is the unsung killer feature for fleet retrofits. ROCm 7 is the most production-ready AMD software stack to date, with the major LLM-inference paths at near-CUDA parity.
For inference-heavy on-prem deployments where capacity-per-card and software stability matter more than peak FLOPs-per-dollar or last-mile niche-kernel availability, this is the most credible AMD AI accelerator launch since the MI300X.
Citations and sources
- AMD Instinct MI350P PCIe GPUs: Run Enterprise AI on Your Existing Infrastructure (AMD blog)
- AMD Instinct MI350 Series GPUs (product page)
- AMD Intros Instinct MI350P Accelerator: CDNA 4 Comes to PCIe Cards — ServeTheHome
- AMD announces MI350P PCIe AI accelerator card with 144GB of HBM3E — Tom's Hardware
- AMD launches the Instinct MI350P GPU with 144GB of HBM3E and a 600W TBP — TweakTown
- AMD takes aim at enterprise AI with PCIe-based Instinct GPUs — The Register
- AMD Instinct MI350P PCIe Add-In Card Review — Phoronix
- AMD Delivers Breakthrough MLPerf Inference 6.0 Results (AMD blog)
- AMD MI350X vs NVIDIA B200: Specs, Benchmarks, and Cloud Pricing — Spheron Blog
- ROCm vLLM project on GitHub
- ROCm Linux install documentation
- NVIDIA RTX PRO 6000 Server (product page)
- HuggingFace — ROCm LLM inference blog
This piece is editorial synthesis based on publicly available information from AMD's launch announcement and partner press coverage. No independent first-party benchmarking is reported. Performance figures are AMD-published peak numbers or projections from the MI355X figures published in MLPerf Inference 6.0.
