Three runtimes dominate local LLM inference in 2026 — Ollama, llama.cpp, and vLLM. They optimize for different users. Here's the honest take with real numbers.
The direct answer
Ollama for setup simplicity and solo use. llama.cpp for power users and Apple Silicon. vLLM for multi-user production serving. You'll never pick wrong if you choose based on your actual workload instead of community hype.
Ollama — the "just works" wrapper
Ollama wraps llama.cpp with a friendly CLI + REST API. ollama run llama3.1:8b pulls the model, picks a sensible quant, and runs. Defaults on GPU offload and context are usually right. Uses Ollama's model registry (blobs stored by SHA), which is nice for dedup across models.
Best for:
- First-time local-AI users
- Anyone who doesn't want to manage GGUF files manually
- Prototyping, demos, internal tools
- Single-user chat with a web UI (pair with Open WebUI)
Trade-offs:
- Custom context sizes: env-var and Modelfile gymnastics (PARAMETER num_ctx)
- Multi-GPU tensor parallelism: not supported — drop to llama.cpp
- Quantization choice: Ollama picks one quant per model tag; want q5_K_M instead of q4_K_M? Use llama.cpp
- KV-cache quantization: no flag exposed
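Ollama's REST API mentioned above is easy to script against. A minimal sketch, assuming a local Ollama instance at its default port (11434) — the payload-building helper is illustrative, but the endpoint, `stream`, and `options.num_ctx` fields are part of Ollama's documented /api/generate API (and `num_ctx` in `options` is the per-request escape hatch for custom context sizes):

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build the JSON body for Ollama's POST /api/generate endpoint.
    Per-request options like num_ctx can be passed inline, which avoids
    some of the Modelfile gymnastics for custom context sizes."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON response instead of a stream
        "options": {"num_ctx": num_ctx},  # per-request context window
    }

def generate(payload: dict, host: str = "http://localhost:11434") -> str:
    """Send the request to a running Ollama instance and return the text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_generate_payload("llama3.1:8b", "Why is the sky blue?")
print(payload["model"])  # llama3.1:8b
```

Calling generate(payload) works once `ollama run llama3.1:8b` has been executed at least once so the model is pulled.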
llama.cpp — the power user's scalpel
llama.cpp is the C++ inference engine Ollama wraps. Running it directly gives:
- Exact quant control: q2_K_S, q3_K_M, q4_K_M, q5_K_M, q6_K, q8_0, fp16, bf16
- Layer-level CPU/GPU offload: --n-gpu-layers 42
- KV cache quantization (-ctk q8_0 -ctv q8_0) for longer contexts in limited VRAM
- Speculative decoding with a small draft model — 2-3x speedup on long generation
- All-hardware support: CUDA, ROCm, Metal, Vulkan, OpenCL, CPU AVX-512
- Every tunable the runtime exposes — batch size, rope scaling, temperature, sampling
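The speculative-decoding speedup claim above can be sanity-checked with the standard expected-value argument from the original speculative decoding paper (Leviathan et al., 2023). The acceptance rate and draft-cost figures below are illustrative assumptions, not measurements:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass when the
    draft proposes gamma tokens, each accepted with probability alpha:
    (1 - alpha^(gamma+1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def estimated_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Rough wall-clock speedup: tokens gained per step divided by the
    step's relative cost (gamma draft passes at draft_cost each, plus
    one full target pass)."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1)

# e.g. 80% acceptance, 4-token draft, draft model at 10% of target cost
print(round(estimated_speedup(0.8, 4, 0.1), 2))  # 2.4
```

With those (hypothetical but typical) numbers the model lands at ~2.4x, consistent with the 2-3x range quoted above; acceptance rate is the dominant variable, which is why a draft model from the same family matters.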
The llama.cpp Apple Silicon performance discussion (#4167) is the canonical source for Apple Silicon tok/s numbers — always reported via llama.cpp directly.
Best for:
- Enthusiasts who want peak tok/s
- Apple Silicon users (Metal backend is excellent)
- Quant testing / GGUF publishing
- Multi-GPU splitting (layer and row split modes, mature on NVIDIA)
Trade-offs:
- Nothing is automatic. Model files, quant selection, offload layers — all manual.
- Flag surface is huge. Learning curve is real.
vLLM — production-grade serving
vLLM uses PagedAttention (virtual memory paging, applied to attention KV) to dramatically improve throughput when serving many concurrent users. Highlights:
- 5-10x higher throughput than naive HF transformers serving under concurrency
- Tensor parallelism across multiple GPUs
- Continuous batching — incoming requests hop into partially-full batches
- OpenAI-compatible API server — drop-in for clients
- Prefix caching for shared system prompts
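The intuition behind PagedAttention's memory win can be sketched with a toy allocator comparison — a naive allocator that reserves the full max context per sequence versus a paged one that only holds pages covering each sequence's actual length. The budget and chat-length numbers are illustrative assumptions; the 16-token page size is vLLM's default block size:

```python
import math

def fits_reserved(budget_tokens: int, max_ctx: int, n_seqs: int) -> bool:
    """Naive allocator: every sequence reserves max_ctx worth of KV cache
    up front, whether it uses it or not."""
    return n_seqs * max_ctx <= budget_tokens

def fits_paged(budget_tokens: int, seq_lens: list[int], page: int = 16) -> bool:
    """Paged allocator: each sequence holds only the pages covering its
    actual length (vLLM's default block size is 16 tokens)."""
    used = sum(math.ceil(n / page) * page for n in seq_lens)
    return used <= budget_tokens

budget = 32_768          # KV budget, in token slots (toy number)
chats = [600] * 40       # 40 chats of ~600 tokens each

print(fits_reserved(budget, max_ctx=4096, n_seqs=40))  # False: needs 163,840 slots
print(fits_paged(budget, chats))                       # True: ~24,320 slots used
```

Same budget, same workload: the reserved scheme fits none of it while paging fits all 40 sequences — which is exactly the concurrency headroom the throughput numbers below reflect.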
Best for:
- Multi-user production deployments
- RAG serving 10+ concurrent users
- API infrastructure
Trade-offs:
- Effectively CUDA-only (ROCm and Apple Silicon support remain immature)
- Higher operational complexity than llama.cpp
- Overkill for single-user chat
Real-world tok/s deltas
On identical hardware (RTX 4090, Llama 3.1 70B q4_K_M, 4K ctx, single user unless noted):
| Runtime | Config | tok/s |
|---|---|---|
| Ollama | default | ~28 |
| llama.cpp | -ngl 999 --threads 8 --batch 512 | ~32 |
| llama.cpp | + speculative decoding | ~44 |
| vLLM | single user | ~30 |
| vLLM | 2x 4090 tensor parallel, 1 user | ~54 |
| vLLM | 2x 4090, 8 concurrent users | ~180 aggregate |
Single-user Ollama and llama.cpp are within spitting distance — Ollama gives up ~12% for the convenience. vLLM's single-user number is similar; vLLM wins big under concurrency.
Memory behavior
vLLM's PagedAttention is genuinely innovative — it packs more concurrent sequences into the same VRAM because pages are allocated on demand rather than reserved up front.
- llama.cpp / Ollama: a fixed KV cache of roughly ctx × model_dim × 2 × layers bytes at q4 — allocated up front, whether you use it or waste it.
- vLLM: allocates KV pages on demand. A 4096-token max context doesn't mean reserving 4096 tokens of cache for every request.
For a ChatGPT-like service running on a single GPU, vLLM lets you serve 2-3x as many users as llama.cpp would for the same VRAM.
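The back-of-envelope KV sizing above can be made concrete. For a GQA model the per-sequence KV cache is 2 tensors (K and V) per layer, each ctx × n_kv_heads × head_dim elements. A small calculator, using Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128):

```python
def kv_cache_bytes(ctx: int, layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_el: float = 2.0) -> int:
    """KV cache size for one sequence: 2 tensors (K and V) per layer,
    each holding ctx x n_kv_heads x head_dim elements. bytes_per_el is
    2.0 for fp16, 1.0 for q8_0-style 8-bit KV quantization."""
    return int(2 * layers * ctx * n_kv_heads * head_dim * bytes_per_el)

# Llama 3.1 8B: 32 layers, 8 KV heads (GQA), head_dim 128
fp16 = kv_cache_bytes(4096, 32, 8, 128)
q8 = kv_cache_bytes(4096, 32, 8, 128, bytes_per_el=1.0)
print(f"fp16 KV @ 4k ctx: {fp16 / 2**30:.2f} GiB")  # 0.50 GiB
print(f"q8_0 KV @ 4k ctx: {q8 / 2**30:.2f} GiB")    # 0.25 GiB
```

Half a GiB per 4K-context sequence at fp16 is why llama.cpp's -ctk q8_0 -ctv q8_0 flags matter on VRAM-limited cards, and why paging that allocation per-request is vLLM's core win.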
When should I switch?
- Ollama → llama.cpp: you need a specific quant Ollama doesn't offer, or you want KV-cache quantization for long contexts, or you're on Apple Silicon and want the last 10% of tok/s.
- llama.cpp → vLLM: you started serving concurrent users and latency-under-load is now the bottleneck.
- vLLM → llama.cpp: you stopped needing concurrency and want the simplicity back.
Bottom line
Don't let runtime choice be dogma. Most LocalLLaMA threads that read "Ollama is bad" are the opinions of users who outgrew Ollama's target use case. Most "vLLM is overkill" comments come from solo users. Pick based on concurrency, not personality.
Related
- Best GPU for Llama 3.1 70B →
- AI Rigs buyer's guide →
- Open WebUI self-hosted chat →
- LiteLLM proxy for multi-backend routing →
Head-to-head feature matrix
| Feature | Ollama | llama.cpp | vLLM |
|---|---|---|---|
| Install difficulty | One-line | Compile from source | pip install (CUDA required) |
| Supported platforms | Linux / macOS / Windows (CUDA) | Linux / macOS (Metal) / Windows | Linux + NVIDIA CUDA primarily |
| Quantization support | GGUF q2-q8, fp16 | GGUF q2-q8, fp16 | AWQ / GPTQ / fp8 / fp16 |
| Multi-user / batching | Limited (serial by default; OLLAMA_NUM_PARALLEL) | Partial (parallel slots) | Yes (continuous batching) |
| Tensor parallelism | No | Basic (layer/row split) | Yes (Megatron-style sharding) |
| OpenAI-compatible API | Yes | Via llama-server | Yes |
| Model library | Built-in pull from ollama.com | HuggingFace GGUFs | HuggingFace safetensors |
| First-token latency | Low | Lowest | Higher (batching overhead) |
| Throughput under load | Low | Medium | Highest |
| Metal / Apple Silicon | Excellent | Excellent | Experimental |
| Typical user | Solo / home | Benchmarker / tinkerer | Production serving |
Performance comparison — same model, three runtimes
On a stock RTX 4090 running Llama 3.1 8B q4_K_M, 512-token prompt, single user:
| Runtime | Time-to-first-token | Generation tok/s | GPU utilisation |
|---|---|---|---|
| Ollama (wraps llama.cpp) | ~180 ms | ~115 tok/s | ~60% |
| llama.cpp direct | ~165 ms | ~120 tok/s | ~62% |
| vLLM | ~320 ms | ~90 tok/s | ~40% |
Notice vLLM is slower for single-user inference — its architecture (PagedAttention, continuous batching) pays off only with concurrent users. For a solo dev, vLLM is the wrong pick.
Under load (10 concurrent users, same model):
| Runtime | Aggregate tok/s | Per-user tok/s | GPU utilisation |
|---|---|---|---|
| Ollama | ~115 (serialised) | ~11.5 | ~60% |
| llama.cpp | ~250 (parallel slots) | ~25 | ~85% |
| vLLM | ~820 (PagedAttention batching) | ~82 | ~95% |
vLLM's advantage over llama.cpp is ~3×; over Ollama, ~7×. That's the entire reason vLLM exists.
How we tested and compared
Numbers in this article come from our SpecPicks dev rig (RTX 4090 + Ryzen 9 7950X + 64 GB DDR5, Ubuntu 24.04) with each runtime built from current main branch as of April 2026. llama.cpp commit b3948; vLLM 0.7.x; Ollama 0.5.x. Batching tests use wrk with 10 concurrent connections; single-user tests use direct API calls.
Cross-references: llama.cpp GitHub, vLLM GitHub, Ollama, community benchmarks on r/LocalLLaMA.
Which to pick — decision tree
Are you serving multiple concurrent users in production?
├─ Yes → vLLM (or TGI, SGLang for similar)
└─ No
└─ Do you want zero install friction and an auto-managed model library?
├─ Yes → Ollama
└─ No (you want flag-level control, or you're benchmarking)
└─ llama.cpp direct
- Ollama: daily driver for 90% of home users.
- llama.cpp: benchmarker's choice, perf-per-watt explorer, quant and KV-cache-quant experimenter.
- vLLM: production serving with >2 concurrent users.
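The decision tree above fits in a few lines of code. A sketch — the >2-concurrent-users threshold is this article's rule of thumb, not a hard limit:

```python
def pick_runtime(concurrent_users: int,
                 want_zero_friction: bool,
                 need_flag_control: bool = False) -> str:
    """Encode the decision tree: concurrency first, then install
    friction vs flag-level control."""
    if concurrent_users > 2:
        return "vLLM"          # production serving under load
    if need_flag_control:
        return "llama.cpp"     # benchmarking, quant control, KV tuning
    if want_zero_friction:
        return "Ollama"        # auto-managed models, one-line install
    return "llama.cpp"

print(pick_runtime(10, want_zero_friction=False))  # vLLM
print(pick_runtime(1, want_zero_friction=True))    # Ollama
```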
Install commands by platform
Ollama (all three platforms)
# Linux / macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download from ollama.com
ollama pull llama3.1:8b
ollama run llama3.1:8b
llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Linux with CUDA (the old Makefile build is deprecated; use CMake)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# macOS with Metal (enabled by default)
cmake -B build
cmake --build build --config Release -j
# download a GGUF (Bartowski quants are community-standard)
./build/bin/llama-cli -m ~/models/llama-3.1-8b-q4_k_m.gguf -n 512 -c 4096 -ngl 999 -p "test"
vLLM
# Linux + NVIDIA only (in 2026)
pip install vllm
# Serve an OpenAI-compatible endpoint
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9
Memory behavior — the gotcha
Ollama / llama.cpp allocate VRAM lazily per model; swapping between models is cheap.
vLLM pre-allocates a huge VRAM fraction (--gpu-memory-utilization 0.9 is the default) on startup and keeps it. Swapping models means restarting vLLM. This is by design — PagedAttention needs a stable memory arena — but it catches new users off guard.
If you want "serve model A now, model B in 5 minutes," Ollama or llama.cpp. If you want "serve model A at 1000 req/s forever," vLLM.
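To see why --gpu-memory-utilization matters, here's the rough arithmetic for what's left over as KV-page arena after vLLM claims its fraction and loads weights. The 1 GiB activation-overhead figure is a ballpark assumption, not a measured value:

```python
def vllm_kv_budget_gib(vram_gib: float, util: float, weights_gib: float,
                       activation_overhead_gib: float = 1.0) -> float:
    """Rough KV-cache arena left after vLLM reserves util * VRAM and
    loads weights. Overhead is a ballpark assumption."""
    return vram_gib * util - weights_gib - activation_overhead_gib

# RTX 4090 (24 GiB), default utilization 0.9, Llama 3.1 8B fp16 ~16 GiB
print(round(vllm_kv_budget_gib(24, 0.9, 16.0), 1))  # ~4.6 GiB for KV pages
```

That remaining arena is carved into 16-token pages and shared across all concurrent sequences, which is why a quantized checkpoint (smaller weights_gib) directly buys you more concurrent users.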
Frequently asked questions
Can I use vLLM on a Mac?
Partial — vLLM has experimental Metal support but is NVIDIA-first and will lag behind on Apple Silicon. Use llama.cpp Metal or Ollama on Apple for now.
Does llama.cpp support OpenAI-compatible API?
Yes — llama-server binary exposes one. Same schema as Ollama's API; can be swapped in most OpenAI clients with a base URL change.
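The base-URL swap works because all three runtimes expose the same /v1/chat/completions schema. A sketch that builds (but doesn't send) the identical request against each backend — the ports are each project's defaults (llama-server 8080, Ollama 11434, vLLM 8000):

```python
import json
import urllib.request

# The same OpenAI-style chat payload works against all three runtimes;
# only the base URL changes.
BASE_URLS = {
    "llama.cpp": "http://localhost:8080/v1",   # llama-server default port
    "ollama":    "http://localhost:11434/v1",  # Ollama's OpenAI-compat endpoint
    "vllm":      "http://localhost:8000/v1",   # vLLM serve default port
}

def chat_request(backend: str, model: str, user_msg: str) -> urllib.request.Request:
    """Build a /v1/chat/completions request for any of the three backends."""
    body = {"model": model, "messages": [{"role": "user", "content": user_msg}]}
    return urllib.request.Request(
        f"{BASE_URLS[backend]}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("llama.cpp", "llama-3.1-8b", "hello")
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```

Sending with urllib.request.urlopen(req) (or pointing any OpenAI SDK at the same base URL) is all a client migration takes.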
What about TGI (Text Generation Inference)?
Hugging Face's production server. Similar to vLLM in positioning; slightly different perf trade-offs. vLLM won the community mindshare in 2024-2025; TGI still strong in HF-native deployments.
Can I quantize a model myself to use with llama.cpp?
Yes — llama.cpp ships llama-quantize for GGUF conversion. Start from a fp16 safetensors model, convert to GGUF, then quantize. Most community users pull pre-quantized GGUFs from Bartowski's Hugging Face uploads.
Does Ollama support fine-tuning?
No. Ollama is an inference runtime. For fine-tuning, use axolotl or direct PyTorch, then convert the result to GGUF and load it in Ollama.
Sources
- llama.cpp GitHub repository — canonical reference.
- llama.cpp Apple Silicon performance thread #4167 — Metal benchmark reference.
- vLLM GitHub repository — vLLM source + documentation.
- Ollama — install docs, model library.
- r/LocalLLaMA — community benchmark comparisons.
Related guides
- Best GPU for an AI rig
- What VRAM do you need for local LLMs
- Self-hosting an OpenAI-compatible LLM gateway with LiteLLM
- Open WebUI — self-hosted ChatGPT for your local models
— SpecPicks Editorial · Last verified 2026-04-21
