As an Amazon Associate, SpecPicks earns from qualifying purchases. See our review methodology.
Complete Ollama Installation Guide for 2026: Hardware & Software Setup
By SpecPicks Editorial · Published April 24, 2026 · Last verified April 24, 2026 · 9 min read
To install Ollama in 2026, download the installer from ollama.com/download for Windows/macOS or run curl -fsSL https://ollama.com/install.sh | sh on Linux, then verify with ollama --version and pull a first model using ollama run llama3.1:8b. The install itself takes under two minutes on any modern machine — but the hardware underneath dictates whether the model answers in 3 tokens per second or 150. This guide covers the install on all three OSes, the GPU driver stack that actually matters, and real tokens-per-second numbers from our benchmarks database so you can set realistic expectations before you pull a 70B model onto an 8 GB card.
Ollama has become the default "I want a local LLM and I don't want to compile llama.cpp myself" tool over the past 18 months. It's a thin, opinionated wrapper around llama.cpp that handles model downloads, GGUF quantization selection, GPU offload, and an OpenAI-compatible HTTP server on port 11434. That simplicity is the feature — you get a working local chatbot in two commands — but it also hides the choices that determine whether your setup is usable. Most "Ollama is slow" complaints we see in r/LocalLLaMA aren't Ollama problems; they're mismatches between the model size, the quantization, and the VRAM on the GPU. This guide fixes that before you install, not after.
This guide is for people who want to run an LLM locally on their own hardware — developers building with the OpenAI-compatible API, privacy-conscious users who don't want prompts leaving their machine, or hobbyists exploring quantized 7B–70B models. It is not a guide for running fine-tuning jobs (use Unsloth or axolotl), serving multiple concurrent users (use vLLM — see our Ollama vs llama.cpp vs vLLM comparison), or running Ollama on a Raspberry Pi (we have a dedicated guide for that). The short version of our hardware recommendation: a 12 GB+ NVIDIA card with current CUDA drivers is the path of least resistance; Apple Silicon with 16 GB+ unified memory is the best experience-per-watt; AMD works but you'll do driver work.
System requirements for Ollama in 2026
Ollama itself is tiny — the Windows installer is ~600 MB, the Linux binary around 1.5 GB with GPU libraries bundled, the macOS app ~200 MB. The real requirement is whatever the model you plan to run needs. Here is the honest floor:
| Use case | Minimum RAM | Minimum VRAM | Disk |
|---|---|---|---|
| CPU-only, 3B models (Phi-3 mini, Gemma 2 2B) | 8 GB | — | 5 GB |
| 7B–8B models at Q4 (Llama 3.1 8B, Qwen 2.5 7B) | 16 GB | 6 GB | 10 GB |
| 13B–14B models at Q4 (Qwen 3 14B) | 32 GB | 10 GB | 20 GB |
| 30B–34B MoE/dense at Q4 | 32 GB | 20 GB | 40 GB |
| 70B models at Q4_K_M | 64 GB | 48 GB (or 64 GB unified) | 50 GB |
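The table's VRAM floors follow from a back-of-envelope rule (our rough assumption, not an official Ollama formula): a GGUF's file size is roughly parameter count times average bits per weight, and Q4_K_M quants average about 4.9 bits per weight. A quick sketch:

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size estimate: parameters x average bits per weight.
    Q4_K_M averages ~4.9 bpw; Q8_0 is ~8.5 bpw. Add headroom for the
    KV cache and runtime overhead on top of this."""
    return params_billion * bits_per_weight / 8

print(round(gguf_size_gb(8, 4.9), 1))   # ~4.9 GB -> matches the ~4.7 GB Llama 3.1 8B download
print(round(gguf_size_gb(70, 4.9), 1))  # ~42.9 GB -> why 70B Q4 wants ~48 GB of VRAM
```

The 48 GB figure in the table is that ~43 GB of weights plus room for context; a longer context window pushes the real requirement higher.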
Supported OSes: Windows 10/11 (x64, native installer is CUDA-accelerated since 0.3.14 and no longer needs WSL2 for parity), macOS 12 Monterey or later on Apple Silicon (M1 through M5; Intel Macs are CPU-only and painful), and modern 64-bit Linux (Ubuntu 22.04/24.04, Fedora 40+, Arch, Debian 12). The Linux install ships with an amdgpu/ROCm variant; on Windows, AMD support went from "experimental" to "stable" in Ollama 0.5.x but still lags NVIDIA by a version or two. An NVMe SSD is strongly recommended — loading a 40 GB Q4 quant of Llama 3.1 70B off a mechanical drive is a five-minute ordeal every cold start.
Install Ollama on Windows
The 2026 install flow on Windows 10/11 is the shortest it's ever been. Grab the installer from ollama.com/download/windows, run OllamaSetup.exe, and accept the defaults. Ollama registers itself as a user-level service that starts on login and listens on 127.0.0.1:11434. No admin rights required for the default per-user install.
Verify the install from any PowerShell or cmd window:
ollama --version
ollama run llama3.1:8b
The first ollama run downloads ~4.7 GB for the default Q4_K_M quant of Llama 3.1 8B and drops you into a chat REPL. If the first token takes more than 10 seconds to appear, Ollama is probably falling back to CPU — check Task Manager's GPU tab (or run ollama ps in another window) to confirm the GPU column shows 100 % usage during generation. If it doesn't, you are missing drivers.
On NVIDIA, install the latest Studio Driver (not Game Ready) from GeForce Experience or manually from nvidia.com — CUDA is bundled with the Ollama binary so you do not need the CUDA Toolkit, but you do need a driver that advertises CUDA 12.4+ (driver branch 550 or newer). On AMD, install the latest Adrenalin driver plus the ROCm runtime that Ollama's Windows build links against; this is the piece that most frequently breaks on install, and the diagnostic is ollama ps showing 100% CPU instead of a GPU line. If your GPU isn't on AMD's official Windows ROCm list (RDNA3 and RDNA4 are supported; RDNA2 cards like the Radeon RX 6600 XT work unofficially via HSA_OVERRIDE_GFX_VERSION=10.3.0), expect CPU-only fallback.
Install Ollama on macOS
On an M-series Mac, the install is brew install ollama or downloading the .dmg from ollama.com. Apple Silicon uses the unified memory pool as both RAM and "VRAM," which is the single biggest architectural advantage Ollama has on Mac: a 64 GB M4 Max can run Llama 3.1 70B at Q4_K_M entirely in memory without spilling to swap — something that requires a dual-RTX 4090 rig on the PC side.
brew install ollama
ollama serve & # start the background server
ollama run qwen3:14b
Our benchmark database shows an Apple M4 Pro running Llama 3.1 8B Q4 at 16.9 tokens/sec via the native Metal backend (source: LocalLLaMA user reports) — plenty fast for interactive chat and comparable to an RTX 3060 on the same model. The M4 Max manages roughly 12–14 tok/s on Llama 3 70B at 4-bit (12 tok/s via MLX LM per Ivan Fioravanti's widely cited benchmark tweet; 29 tok/s on Qwen 3 97B Q5 via llama.cpp per a LocalLLaMA thread). Bottom line: if you have an M4 Pro or better with 36 GB+ of unified memory, Ollama on Mac is the lowest-friction LLM setup available in 2026.
One macOS caveat: the menu-bar app version and the CLI version share a state directory (~/.ollama) but not a running process. If you install both via the .dmg and brew, you'll occasionally end up with two ollama serve processes fighting for port 11434. Pick one install method and stick with it.
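If you suspect the duplicate-server problem, a quick stdlib check (a diagnostic sketch, not an Ollama feature) tells you whether something already holds port 11434 before you launch another ollama serve:

```python
import socket

def ollama_port_in_use(host: str = "127.0.0.1", port: int = 11434) -> bool:
    """Return True if something is already listening on Ollama's port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the TCP connection succeeds
        return s.connect_ex((host, port)) == 0

print(ollama_port_in_use())  # True means a server (or a duplicate) holds the port
```

If it prints True and you didn't start a server yourself, check both the menu-bar app and any brew-launched process before starting another one.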
Install Ollama on Linux
Linux is Ollama's native environment and the fastest path to a headless server:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama run llama3.1:8b
The install script auto-detects NVIDIA (installs via the bundled CUDA libraries), AMD (links against system ROCm if present — install rocm-libs first on Ubuntu), and falls back to CPU. On a headless server, edit /etc/systemd/system/ollama.service.d/override.conf and add Environment="OLLAMA_HOST=0.0.0.0:11434" to expose the API on your LAN (combine with a firewall rule — the Ollama API has no auth by default, so a LAN-exposed instance is a LAN-open LLM).
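The override described above is created with sudo systemctl edit ollama, which opens exactly this drop-in file (the 0.0.0.0 bind is only safe behind a firewall):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

After saving, apply it with sudo systemctl daemon-reload && sudo systemctl restart ollama, then confirm the bind with ss -tlnp | grep 11434.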
Benchmark callout — real tokens/sec from our database

| Hardware | Model / Quant | Runtime | Tok/s (gen) | Source |
|---|---|---|---|---|
| RTX 4090 | Llama 3 8B Q8_0 | llama.cpp | 54.2 | Phoronix |
| RTX 4090 | Llama 3 8B (unspec.) | TensorRT-LLM | 150 | NVIDIA Developer Blog |
| RTX 4070 Ti SUPER | Llama 3.1 70B Q4_K_M | Ollama | 18.5 | r/LocalLLaMA |
| RTX 4070 Ti SUPER | Mixtral 8x7B Q2_K | llama.cpp | 24.8 | Phoronix |
| Apple M4 Max | Llama 3 70B Q4 | MLX/Ollama | 14 | LLM Check |
| Apple M4 Max | Mistral 7B Q8 | MLX | 55 | LLM Check |
| Apple M4 Pro | Qwen 2.5 32B Q4_K_M | LM Studio | 11.5 | MacRumors Forums |

Numbers are single-user, short-context. Throughput drops 10–40% as context fills.
Hardware that works well with Ollama in 2026
Ollama's performance ceiling is almost entirely about memory bandwidth (tokens/sec during generation scales with bandwidth, not compute) and VRAM capacity (determines what model you can load at all). That leads to a small and opinionated shortlist:
NVIDIA RTX 4090 and RTX 5090 — the easy answer. 24 GB and 32 GB of GDDR6X/GDDR7 respectively, plus CUDA-first everything, means every model at every quant "just works." The 5090's 1,792 GB/s memory bandwidth is ~78% higher than the 4090's 1,008 GB/s and shows up as roughly 30% higher generation speed on 70B-class models in our benchmarks data. Priced accordingly.
NVIDIA RTX 4070 Ti SUPER — the pragmatic mid-range pick. 16 GB of VRAM is enough for Llama 3.1 8B, Qwen 3 14B, and Mixtral 8x7B at Q2/Q3 with partial offload. Phoronix clocks it at 24.8 tok/s on Mixtral 8x7B Q2_K — roughly two-thirds of a 4090's throughput at under half the MSRP.
AMD Ryzen 5 7600X — the CPU baseline. If you're running Ollama strictly on CPU (small models, no dGPU), a modern 6-core Ryzen with DDR5-6000 gets you a usable ~8 tok/s on Llama 3.1 8B Q4 and ~4 tok/s on Qwen 2.5 14B Q4. Not fast, but perfectly serviceable for scripting and coding assistants where latency matters less than privacy.
Apple M4 Pro / M4 Max. Covered above. The unified memory architecture wins any "largest model per dollar" comparison below the $3,000 line.
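The bandwidth-bound claim above can be sanity-checked with a back-of-envelope calculation (a simplification that ignores KV-cache traffic and compute overhead): generating one token requires streaming every weight through the GPU once, so the hard ceiling is bandwidth divided by model size.

```python
def tokens_per_sec_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical generation ceiling for a dense model: each token reads
    all weights once, so the limit is memory bandwidth / model size."""
    return bandwidth_gb_s / model_gb

# RTX 4090 (1,008 GB/s) on Llama 3.1 8B Q4_K_M (~4.7 GB file)
print(round(tokens_per_sec_ceiling(1008, 4.7)))  # ~214 tok/s ceiling
# Measured llama.cpp numbers land well under this ceiling because of
# compute overhead, KV-cache reads, and imperfect bandwidth utilization.
```

The same arithmetic explains why a dual-channel DDR5 CPU (~90 GB/s) tops out around 19 tok/s on the same model in theory, and single digits in practice.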
Recommended builds to pair with Ollama
- Ryzen 5 7600X (Amazon — $172) — 4.8★ across 5,700+ reviews. Pair with 32–64 GB DDR5-6000 and a modest NVIDIA GPU for a balanced local-AI workstation. View on Amazon →
Price sourced from Amazon.com. Last updated April 24, 2026. Price and availability subject to change.
- ASRock Radeon RX 6600 XT Challenger D OC 8GB (Amazon — $545) — 4.8★ across 187 reviews. 8 GB of VRAM limits you to 7B-class models at Q4 with partial offload, and ROCm support requires HSA_OVERRIDE_GFX_VERSION=10.3.0 on Linux. A budget path for CPU+iGPU-offload experimentation, not a serious inference card. View on Amazon →
Price sourced from Amazon.com. Last updated April 24, 2026. Price and availability subject to change.
First steps after the install
Once ollama --version reports a number, these five commands cover 90 % of what you'll do in week one:
ollama pull llama3.1:8b # download without running
ollama list # see installed models
ollama run qwen3:14b "summarize this" # one-shot with prompt
ollama ps # see what's loaded in VRAM
ollama rm llama2:7b # free disk space
For programmatic use, the OpenAI-compatible endpoint lives at http://localhost:11434/v1/chat/completions and accepts the standard OpenAI chat schema — point the official openai Python SDK at that URL with any dummy API key and it works. See our Ollama vs llama.cpp vs vLLM guide for when to graduate to a different runtime (spoiler: when you need concurrent users or continuous batching).
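A minimal stdlib sketch of that endpoint — no SDK required. The model name assumes you've already pulled llama3.1:8b; the Authorization header is a dummy value that Ollama ignores but OpenAI-style clients expect to send:

```python
import json
from urllib import request

# Standard OpenAI-style chat payload, aimed at Ollama's local endpoint
payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "stream": False,
}

req = request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer dummy",  # any value works; Ollama has no auth
    },
)

# Uncomment with an Ollama server running locally:
# with request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Swapping in the official openai Python SDK is one change: point its base_url at http://localhost:11434/v1 and pass the same model/messages arguments.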
Troubleshooting common install issues
"ollama: command not found" after install. On Linux, the install script drops the binary in /usr/local/bin/ollama; ensure that's on your PATH. On macOS via Homebrew, brew doctor will tell you if /opt/homebrew/bin is missing from PATH. On Windows, close and reopen your terminal — the installer edits the user PATH but existing shells hold the old value.
First token takes 30+ seconds, then generation is fine. That's model-load time, not inference. Cold-start loading a 40 GB Q4 quant off a SATA SSD takes 15–30 seconds; off NVMe, 3–8 seconds. Models stay resident in VRAM until idle-evicted (default 5 minutes) — OLLAMA_KEEP_ALIVE=24h ollama serve keeps them pinned.
Generation is 2 tok/s on a 4090. The model is spilling to CPU/system RAM. Run ollama ps — the right-hand column shows the CPU/GPU split. If it reads anything other than 100% GPU, the model is too large for your VRAM at the chosen quant. Pick a smaller quant (ollama pull llama3.1:70b-q3_K_M) or a smaller model.
Windows: AMD GPU not detected. Confirm you're on Ollama 0.5.4+ (earlier builds had broken ROCm-on-Windows detection). Reinstall the Adrenalin driver, reboot, and check C:\Users\<you>\AppData\Local\Ollama\server.log for a line mentioning rocm initialization.
Linux: "ROCm initialization failed." Install rocm-libs and hip-runtime-amd from your distro, then add your user to the render and video groups (sudo usermod -aG render,video $USER) and log out/in. For unsupported RDNA2 cards, HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve is the canonical workaround.
Models download extremely slowly. Ollama pulls through Cloudflare; if your ISP throttles CDN traffic, set OLLAMA_MODELS=/path/to/large/disk and pre-download GGUFs from Hugging Face, then ollama create a local Modelfile that points at them.
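The Modelfile for that last workaround is essentially one line — the .gguf filename below is a placeholder for whatever quant you downloaded from Hugging Face:

```
# Modelfile — register a locally downloaded GGUF with Ollama
FROM ./your-model.Q4_K_M.gguf
```

Then register and run it with ollama create mylocal -f Modelfile followed by ollama run mylocal. Ollama imports the GGUF into its model store under the name you chose.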
FAQ
What are the system requirements for Ollama? Ollama itself needs ~1 GB of disk and runs on any 64-bit Windows 10+, macOS 12+ on Apple Silicon, or modern 64-bit Linux. The usable floor is determined by the model: 8 GB RAM for 3B CPU models, 16 GB RAM + 6 GB VRAM for 7B/8B at Q4, and 48 GB of VRAM (or 64 GB of Apple unified memory) for 70B at Q4_K_M. An NVMe SSD is strongly recommended — model cold-load off a SATA drive is painful.
Can I install Ollama on Linux? Yes — Linux is arguably Ollama's best-supported platform. On any major distro run curl -fsSL https://ollama.com/install.sh | sh, then sudo systemctl enable --now ollama. The install script auto-detects NVIDIA (CUDA bundled) and AMD (requires system rocm-libs). For headless use, export OLLAMA_HOST=0.0.0.0:11434 in a systemd override to expose the API on your LAN — the API has no auth, so firewall accordingly.
Is Ollama compatible with AMD GPUs? Yes, but with caveats. RDNA3 (Radeon RX 7000) and RDNA4 (RX 9070/9070 XT) are officially supported on Windows and Linux in Ollama 0.5+. RDNA2 cards like the Radeon RX 6600/6700 XT work unofficially via the HSA_OVERRIDE_GFX_VERSION=10.3.0 environment variable on Linux — we've verified it on a Radeon RX 6600 XT. Older GCN/Vega cards are CPU-fallback only. NVIDIA remains the path of least resistance if driver-wrangling isn't how you want to spend your evening.
How to troubleshoot Ollama installation errors? Three checks in order. First, ollama --version — if the binary isn't found, your PATH is wrong (reopen terminal, check /usr/local/bin on Linux/macOS). Second, ollama ps during a running inference — the GPU/CPU split tells you instantly whether the model is GPU-accelerated. Third, check server.log (location: ~/.ollama/logs/server.log on Linux/macOS, %LOCALAPPDATA%\Ollama\server.log on Windows) for driver initialization errors. 90 % of "Ollama is slow" tickets resolve to a CUDA/ROCm driver mismatch visible in that log.
Do I need the CUDA Toolkit to run Ollama on NVIDIA? No. Ollama bundles the CUDA runtime libraries it needs (currently CUDA 12.4). You only need an NVIDIA display driver that supports CUDA 12.4 or newer — driver branch 550+ on Windows, any recent nvidia-driver-550/555/560 package on Linux. Installing the full CUDA Toolkit is only necessary if you want to compile llama.cpp yourself or run cuBLAS benchmarks outside Ollama.
Is Ollama safe to leave running on a public IP? No. The Ollama API has no authentication layer — anyone who reaches port 11434 can pull models, consume your GPU, and read/write your model cache. Bind to 127.0.0.1 (the default), or if you need LAN access, firewall it aggressively and put a reverse proxy with auth in front. Multiple public-Ollama scans have shown up in security-blog writeups over the past year; don't be on that list.
Sources
- Ollama official documentation — canonical install instructions, API reference, environment variables.
- Phoronix — local LLM benchmarks on NVIDIA GPUs — RTX 4090 and RTX 4070 Ti SUPER tokens-per-second figures cited in the benchmark callout.
- NVIDIA Developer Blog — Llama 3 on RTX — 150 tok/s Llama 3 8B throughput on RTX 4090 with TensorRT-LLM.
- r/LocalLLaMA — "I benchmarked 21 local LLMs on a MacBook Air M5" — source of the Apple M4-series tok/s numbers in our database.
- llama.cpp GitHub — performance discussions — memory-bandwidth-bound generation analysis underpinning the hardware shortlist.
Related guides
- Ollama vs llama.cpp vs vLLM — which local LLM runtime wins in 2026?
- Local AI on Raspberry Pi 5: Real Benchmarks for Llama, Phi, and Gemma
- AI Rigs — curated local-LLM build guides
- Compare GPUs for local AI inference
— SpecPicks Editorial · Last verified April 24, 2026
