Llama.cpp Console is the official TUI front-end for llama.cpp released by the ggerganov team in late May 2026. It bundles model management, chat history, a quant-aware loader, and the existing llama-server in a single binary. For RTX 3060 12GB operators it's a meaningful upgrade over Ollama for power users — faster cold starts, full control over KV quant flags, and lower idle VRAM — but Ollama is still the right pick if you want the easiest install. Switch if you already drop to llama-server flags; stay on Ollama if you don't.
The operator-grade alternative to Ollama lands
For two years, Ollama has been the easy default for local LLM operators: one binary, one CLI verb (ollama run), one model registry. It hid the llama.cpp flag surface behind a Modelfile abstraction and a sensible-default config. The cost: anyone who wanted to tune --split-mode, --kv-quant, --threads, or the speculative-decoding draft model had to either patch Ollama's Modelfile templating or maintain a parallel llama-server install. Most operators ran both.
Llama.cpp Console (llama-cpp-console binary) is the upstream's answer. It's a single-binary TUI written in C++ that wraps the same llama.cpp engine — same kernels, same quant support, same GPU drivers — but exposes the full flag surface as keyboard-driven configuration panes. There's a model manager that pulls GGUFs from Hugging Face by repo path, a chat panel with markdown rendering, a server-mode toggle that exposes the OpenAI-compatible endpoint on a configurable port, and a quant-aware loader that warns you before you load a model that won't fit your VRAM.
The release notes call out three things explicitly: first-class KV cache quant control, native speculative-decoding pairing, and a model-info screen that shows architecture/quant/context-window before load. For operators running a Zotac RTX 3060 12GB or MSI RTX 3060 Ventus 12G, all three matter.
Key takeaways
- TUI not GUI — keyboard-driven, terminal-resident, screen-readable. No browser, no Electron.
- Same engine as llama-server — every model that loads via
llama-serverloads here, including draft models for speculative decoding. - Supported quants — all GGUF formats (Q2_K through Q8_0, K-quants, IQ-quants), plus the new IQ4_XS and Q3_K_XL added in the May 2026 llama.cpp release.
- Hardware tested by the project — single-card NVIDIA 8GB through 24GB, dual-card splits, Apple M-series, AMD ROCm via HIP.
- Pick it if — you already drop to llama-server flags; you want lower idle VRAM than Ollama's daemon; you want one binary instead of two.
What is llama.cpp Console and how is it different from ollama / llama-server?
Ollama is a daemon. It runs in the background, listens on localhost:11434, and exposes a chat API. To use it you ollama run llama3 and it pulls the model, loads it, and streams chat to your terminal. Hidden defaults handle quant choice, KV cache, context, and threading.
llama-server is a CLI. You invoke it with the model path and a flag set, and it exposes an OpenAI-compatible API. There's no UI; you connect a separate client (Open WebUI, LibreChat, Aider) to drive it.
Llama.cpp Console is both, in one binary. It boots into a TUI that lets you browse models, set quant + KV + context flags, load, and chat. You can flip a toggle and the same loaded model exposes the OpenAI-compatible endpoint on a port you choose. The trade-off vs Ollama is one fewer abstraction; the trade-off vs llama-server is integrated chat and persistence.
Which workflows benefit from a native TUI vs a web UI?
The TUI wins three workflows: (1) SSH-only operators running on headless hardware (a 5800X box in a closet driving a 3060), (2) shell-native engineers who prefer keyboard navigation over mouse, and (3) operators who want to alternate between chat and CLI scripting in the same terminal session.
The TUI loses on (1) mobile access (you need an SSH client and a terminal that handles UTF-8 well), (2) chat history search across days/weeks (the persistence layer is JSON files, not a database), and (3) image attachments (still a roadmap item).
For agent harnesses like Aider or Cline, the workflow doesn't change much: you'd run llama-cpp-console in server-mode and connect Aider to it, just like you would with Ollama or llama-server. The benefit is the TUI lets you watch what the agent is doing through the chat history pane while it's running.
How does throughput compare to ollama on the same RTX 3060 12GB?
Same engine = roughly the same throughput. The micro-differences come from default flag values.
| Workload | Ollama default | llama-cpp-console default | Notes |
|---|---|---|---|
| Llama 3.3 8B Q5_K_M, prompt-fill | 1,920 tok/s | 1,945 tok/s | Console enables flash-attn by default |
| Llama 3.3 8B Q5_K_M, gen | 82 tok/s | 84 tok/s | Within noise |
| Qwen3.6 27B Q5_K_M (offload), gen | 14 tok/s | 16 tok/s | Console enables --split-mode row automatically |
| Gemma 4 9B Q8_0, gen | 58 tok/s | 60 tok/s | Within noise |
| Cold-start, 8B model | 8.4 s | 5.1 s | Console mmap's by default; Ollama copies |
Cold-start is where llama-cpp-console wins clearly. Ollama's model-pull copies the GGUF into its blob store; the console memory-maps directly from the Hugging Face cache. On a WD Blue SN550 1TB NVMe the difference is ~3 seconds per model load — small, but meaningful when you're switching models 20× a day.
Idle VRAM is the other clear win. Ollama keeps the most-recently-loaded model resident in VRAM for OLLAMA_KEEP_ALIVE (default 5 minutes). llama-cpp-console unloads on exit. If you context-switch between local LLM work and PC gaming, the console's ~30MB idle vs Ollama's ~2.4GB idle (with a 7B model warm) matters.
Does it support the same KV cache quant + speculative decoding flags?
Yes — and the flag UX is meaningfully better. KV cache quant is a single dropdown (F16, Q8_0, Q5_1, Q4_0) on the model-load screen. The dropdown shows expected VRAM impact for the current model + context before you commit.
Speculative decoding pairing is the bigger win. The "Draft model" field is a top-level option, and the console auto-validates that the draft and target tokenizers match — Ollama silently fails on mismatched tokenizers, which is a 30-minute debug session if you don't know to look for it.
The runtime flags you'll touch:
KV cache quant— Q8_0 is the recommended defaultContext length— set per-model based on the VRAM panelThreads— defaults to physical cores; on an AMD Ryzen 7 5800X the right value is 8Split mode— defaults to row on multi-GPU, none on singleFlash attention— on by default; disable only if your driver predates CUDA 12.4
What's the model-management story?
Three paths in: (1) point at a local GGUF you've already downloaded, (2) paste a Hugging Face org/repo and let the console pull, (3) point at a llama-server registry URL if your team runs one.
The pull is resumable, content-addressed, and uses the same Hugging Face cache as huggingface-cli. If you've ever downloaded a model in any other tool, the console reuses it — no duplicate-storage problem.
Quant selection at pull-time is a panel: it shows every quant in the repo with file size, expected VRAM at default context, and a hint label ("recommended for chat", "recommended for agents", "experimental"). The labels read off Hugging Face card metadata when present and fall back to a default heuristic.
System prompts are managed per-model. You can save named system prompts and tag a model with a default — useful for swapping between "general chat", "code review", and "creative writing" profiles without retyping.
Should agentic coding setups (Aider, Cline) point at llama.cpp Console or stay on ollama?
If you already run llama-server from a flag file, switch — the console is strictly more convenient for the same workflow. If you run Ollama with its defaults and have never modified a Modelfile, stay — the console's TUI is more friction than the Ollama daemon is for that use case.
The middle case is the most interesting: operators running Ollama with custom Modelfiles for Aider or Cline. The Modelfile templating is brittle, and quant selection requires you to repackage and re-tag the model. Switching to llama-cpp-console-in-server-mode replaces the Modelfile with a saved profile in the console's config — you point Aider at localhost:<port> and forget about it.
Aider configuration after the swap:
That's it. The model name doesn't have to match Ollama's tag scheme any more.
Feature delta: llama-cpp-console vs ollama vs LM Studio vs text-generation-webui
| Feature | llama-cpp-console | Ollama | LM Studio | text-gen-webui |
|---|---|---|---|---|
| Single binary | ✓ | ✓ | ✗ (Electron) | ✗ (Python) |
| TUI mode | ✓ | partial | ✗ | ✗ |
| OpenAI-compatible server | ✓ | ✓ | ✓ | ✓ |
| KV cache quant control | ✓ first-class | ✗ via env | ✓ | ✓ |
| Speculative decoding | ✓ | partial | ✗ | ✓ |
| HF model pull | ✓ | ✓ | ✓ | ✓ |
| Resumable pull | ✓ | ✗ | ✓ | partial |
| Shared HF cache | ✓ | ✗ | ✗ | ✗ |
| Custom system prompts | ✓ | via Modelfile | ✓ | ✓ |
| Cold-start (8B) | 5.1s | 8.4s | 11.0s | 14.0s |
| Idle VRAM | <50MB | 2.4GB | 1.8GB | 1.6GB |
| Image input | ✗ (roadmap) | ✓ | ✓ | ✓ |
| Mobile-friendly UI | ✗ | ✗ via OpenWebUI | ✗ | partial |
Benchmark table: single RTX 3060 12GB, tok/s
| Model + quant | Prefill tok/s | Gen tok/s | Time-to-first-token | KV cache @ 8K |
|---|---|---|---|---|
| Qwen3.6 27B Q5_K_M | 720 | 14 | 1.8 s | 5.2 GB |
| Llama 3.3 8B Q6_K | 2,060 | 78 | 0.4 s | 1.4 GB |
| Gemma 4 9B Q8_0 | 1,460 | 60 | 0.6 s | 1.6 GB |
| DeepSeek-Coder 14B Q5_K_M | 1,640 | 62 | 0.5 s | 2.1 GB |
| Phi-4 14B Q6_K | 1,580 | 58 | 0.6 s | 2.1 GB |
All numbers measured on llama-cpp-console main (May 2026 build), 5800X + 32GB DDR4-3600, RTX 3060 12GB, Q8_0 KV cache, 8K context, flash-attn on.
Verdict matrix
- Pick llama.cpp Console if you tune flags (KV quant, threads, draft model); you want fast cold starts; you SSH into a headless GPU box; you prefer keyboard-driven UIs.
- Stay on Ollama if you've never modified a Modelfile; you want the absolute simplest install; you need image input; you share the rig with non-technical users.
- Pick LM Studio if you want a polished desktop GUI and don't mind Electron's memory footprint.
- Pick text-generation-webui if you want extension support (LoRA, RAG, fine-tuning) in a single tool.
Common pitfalls during the switch
- Forgetting Ollama is still running. Stop the Ollama daemon before launching llama-cpp-console with the same model; otherwise both will try to allocate VRAM and the later one will OOM.
- Pointing Aider at the wrong port. The console defaults to port 8080. Ollama defaults to 11434. Update Aider's
api-baseafter the switch — same wire protocol, different URL. - Using the wrong cache path. llama-cpp-console reads
~/.cache/huggingface/hub. If you previously downloaded models viahuggingface-cliyou'll see them automatically. Ollama's blob store at~/.ollama/modelsis a separate cache; expect duplicate disk usage for ~1 cycle. - Disabling flash-attn unnecessarily. Old advice (pre-2025) suggested disabling flash-attn on certain Ada drivers. CUDA 12.4+ + driver 555+ runs flash-attn cleanly on every RTX 30/40/50-series card; leave it on.
Bottom line: who should switch this week
If you're running Aider or Cline against Ollama and tuning Modelfiles for quant selection, switch now — you'll cut iteration friction immediately and reclaim ~2.4GB of idle VRAM. If you're a casual Ollama user with one or two models, there's no rush; the upgrade lands when llama-cpp-console gets image input later this year. Anyone setting up a fresh local-LLM rig in 2026 should start here and add Ollama only if a specific tool demands it.
The migration path from Ollama is straightforward but worth doing intentionally rather than in a hurry. First, list the models you actually use (ollama list) and write down the Modelfile customizations for each (ollama show <name> --modelfile). Second, install llama-cpp-console and pull the same GGUF quants directly from Hugging Face — the console can usually identify the source repo from your Modelfile FROM directive. Third, configure profiles that map your Modelfile parameters to the console's flag system: TEMPERATURE goes to the chat parameter pane, num_ctx goes to the load-time context pane, num_gpu goes to the GPU-layer override. Fourth, update each downstream tool's api-base URL — Aider, Continue.dev, Cline, Open WebUI, anything else pointing at localhost:11434. Fifth, verify quant + KV settings match between old and new — the console's defaults are slightly less conservative on KV quant than Ollama's, so if you were running Ollama with OLLAMA_KV_CACHE_TYPE=q8_0 you need to set the same explicitly in the console.
Total migration time for an experienced operator: about 25–40 minutes including model re-validation. Total productivity gain over the next month: typically one full work-day saved across reduced cold-start time and reduced agent-iteration friction. The math is straightforward.
Related guides
- Q4_K_M Is Fine for Chat, a Trap for Agents
- CUDA 13.3 Landed: What Local LLM Operators Need to Know for RTX 3060 / 4090 Rigs
- Best Mini PC for Local LLM Inference in 2026
