As an Amazon Associate, SpecPicks earns from qualifying purchases. Some hardware listed is sold primarily on eBay; affiliate links to both marketplaces are included where each is appropriate. See our review methodology.
Inside reusable-agents: A Self-Hostable LLM Agent Framework Driving Two Live Consumer Sites in 2026
By Mike Perry · Published May 6, 2026 · Last verified May 6, 2026 · 18 min read
The short answer
reusable-agents (github.com/voidsstr/reusable-agents) is an open-source MIT-licensed framework for running scheduled, LLM-driven agents with shared memory, human-in-the-loop confirmations, and a control dashboard. The repository's public README cites aisleprompt.com (an AI meal planner that builds your Instacart cart) as a live deployment running entirely on the framework's automation pipeline — agents curate recipes, optimize SEO, generate articles, and ship code edits to the Vite/React frontend on a 2-hour cron cycle. This article synthesizes the framework's public design (sourced from its README, docs/architecture.md, and recent commit history) with publicly available benchmarks for the GPUs that make sense for self-hosting it at home — from a single RTX 5090 box up to a multi-node Ollama cluster.
Key takeaways
- Architecture (per the public README): every registered agent subclasses `AgentBase`, gets a per-run lifecycle (status updates → decision log → run-index entry → goal-progress recording), and writes all state to either Azure Blob Storage or the local filesystem.
- LLM provider chain (per `framework/core/code_editor.py` in the repo): a fall-through chain of code-editor backends — `jcode-copilot` → `aider-copilot-proxy` → `aider-github-copilot` → `opencode-azure` → `crush-azure` → `aider-azure` → `codex-azure` → `jcode-azure` → `jcode-ollama`. The chain runs first to last; the first backend that returns rc=0 with a non-empty file diff wins.
- Local fallback works. Per commit `d92d687` (May 6, 2026) in the public repo, the maintainer documented an end-to-end validation run: with the cloud chain skipped (`IMPLEMENTER_FORCE_FALLBACK=1`) and `jcode-ollama` pinned, devstral-small-2:24b running on a local Ollama server applied a 1-line meta-description trim through jcode's tool-calling harness, the implementer committed the change, and the deployer pushed a new container-app revision. The commit message records the full pipeline timeline.
- Hardware-tier reality: the public commit messages reference an RTX 5090 32 GB primary box with `qwen3-coder:30b` and `devstral-small-2:24b` pulled as the maintainer's documented dev rig. Per the framework's CLAUDE.md, this hardware is sufficient for the cloud chain (Copilot → Azure → Ollama fallback) plus local Ollama for the final fallback step.
- Scaling paths covered below: (1) per-box VRAM upgrade (RTX 5090 → RTX A6000 → RTX PRO 6000 Blackwell), (2) multi-box Ollama distribution (cheap consumer cards aggregated), and (3) the Mac Studio M3 Ultra alternative for capacity-bound workloads.
What reusable-agents is, in plain terms
The repo's README.md describes the framework as "a self-hostable framework for running LLM-driven agents with shared memory, scheduled execution, human-in-the-loop confirmations, and a control dashboard."
Key public design decisions (sourced from docs/architecture.md):
- Agents live in their own repos and POST a `manifest.json` to the framework instance's HTTP API. The framework auto-creates a systemd `--user` timer + service unit so the agent runs on its declared cron schedule.
- All state is persisted to a pluggable storage backend (Azure Blob in production, LocalFilesystem for tests). This includes status, decision logs, per-run artifacts (recommendations, emails, code diffs), confirmation records, and goal-progress history.
- A FastAPI service + React dashboard ship in `framework/api/` and `framework/ui/` respectively. The dashboard surfaces live agent status via WebSocket, a dependency graph, an inbox for confirmations, an implementer queue, and a goals view.
- The implementer agent is the framework's code editor — it takes approved recommendations and applies them as commits to the target site repo. It uses a fallback chain of LLM-driven code-editor binaries (the chain is configurable in storage at `config/code-editor-config.json`).
The repo's first paragraph in the README states the design intent clearly: "Most agent systems are monoliths. You install one product and your agents have to live inside it. This framework inverts the relationship: your agent code lives in your own repo. The framework runs next to your agents and provides the cross-cutting infrastructure."
For a home-lab operator, the practical implication is that you do not need to write any framework code to publish agents. You write your agent's `agent.py` (subclassing `AgentBase`) plus a `manifest.json`, register it once with `bash install/register-agent.sh`, and the framework takes over.
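That registration story can be sketched in miniature. Everything below is illustrative: the real `AgentBase` lives in the framework package and records far more lifecycle state, and the manifest field names are assumptions, not the repo's actual schema — check the repo's agent blueprints for the real contract.

```python
import json


class AgentBase:
    """Stand-in for the framework's AgentBase (hypothetical shape)."""

    def run(self) -> dict:
        raise NotImplementedError


class MetaDescriptionAgent(AgentBase):
    """Toy producer agent: proposes a trimmed meta description."""

    def run(self) -> dict:
        draft = ("An AI meal planner that builds your Instacart cart, "
                 "with weekly menus tuned to your pantry and budget.")
        # Search engines truncate around ~155 characters, so trim the draft.
        return {"recommendations": [
            {"id": "meta-trim-1", "new_value": draft[:155]}
        ]}


# A manifest like the one POSTed at registration (field names assumed);
# "0 */2 * * *" is the 2-hour cron cadence mentioned above.
manifest = {"id": "meta-description-agent", "schedule": "0 */2 * * *"}
print(json.dumps(manifest))
```

The framework, not your code, owns the schedule: once the manifest is registered, the systemd timer invokes `run()` on the declared cadence.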
The dispatch pipeline — what actually happens when an agent finishes a run
Per `framework/core/dispatch.py` in the public repo, the canonical end-of-run path for an agent that produces work for the implementer looks like this:
Step-by-step (sourced from the file's docstring + the related CLAUDE.md):
- The producer agent finishes its `run()` and decides which recommendations should ship. It calls `framework.core.dispatch.dispatch_now(...)` with a list of rec ids and a path to the run-dir.
- A site lock is acquired via `framework.core.locks.site_dispatch_lock` so two dispatches against the same site don't trample each other; different sites run in parallel.
- `systemd-run --user --scope` spawns the implementer in a detached scope so the implementer outlives the producer's process.
- The implementer's `run.sh` drives an LLM session against `AGENT.md` (the implementer's runbook) using either the claude-pool path (subscription-billed Claude Max accounts via the `claude` CLI) or the framework's `code_editor` fallback chain.
- The `code_editor` chain tries each backend in order until one returns rc=0 with a non-empty file diff. The chain (per the public default config) is `jcode-copilot` → `aider-copilot-proxy` → `aider-github-copilot` → `opencode-azure` → `crush-azure` → `aider-azure` → `codex-azure` → `jcode-azure` → `jcode-ollama`.
- The change is committed and tagged under `agents/<id>/release/<run-ts>` so every shipped change is traceable.
- The deployer chain runs `test → build → push → deploy → smoke` per the recipe in `examples/deployer/azure-container-apps.yaml`. The deployer is cloud-agnostic: any shell command works in the recipe.
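The producer-side handoff that starts this pipeline reduces to a simple filter over the run's recommendations. The confidence-score shape and the 0.8 cutoff below are invented for illustration; the real `dispatch_now(...)` signature is whatever the repo ships.

```python
def select_for_dispatch(recs: dict, threshold: float = 0.8) -> list:
    """Pick the rec ids whose confidence clears the shipping threshold —
    the kind of filter a producer runs before handing ids to dispatch_now().
    The score dict and 0.8 cutoff are illustrative, not from the repo."""
    return [rid for rid, score in sorted(recs.items()) if score >= threshold]


if __name__ == "__main__":
    scores = {"meta-trim-1": 0.92, "retitle-2": 0.41, "alt-text-3": 0.85}
    ship = select_for_dispatch(scores)
    # With the shortlist in hand, the real producer would then call:
    # dispatch_now(rec_ids=ship, run_dir=run_dir)
    print(ship)
```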
Validated outcomes from the public commit log: commit `f9d6415` in the repository's connected example deployment documents a 1-line meta-description trim that fully ran through the framework's local-only chain (`jcode-ollama` + devstral-small-2:24b). Commit `bd0b0e8` shows the deployer's shipped-marker bug fix arriving via the same pipeline. Both are MIT-licensed open-source records, reproducible by anyone forking the repo.
Hardware — what does it take to self-host?
This is the section most home-lab operators care about. The framework itself is light on compute (the FastAPI service + dashboard run on a small Container App; the heavy work is the LLM calls). What matters is the hardware you use for the local fallback step in the code-editor chain and for any agents that prefer self-hosted models over the subscription/cloud paths.
The maintainer's setup, per public commit history
The repo's CLAUDE.md and the recent commits (specifically `d92d687`, which switched the `jcode-ollama` default model to `devstral-small-2:24b`) document the maintainer's primary box as an NVIDIA RTX 5090 with 32 GB of GDDR7. The local Ollama instance has the following coding-relevant models pulled, per the commit message:
- `qwen3-coder:30b` (~18 GB at Q4) — Qwen 3 coder, 30B total params with a 3B-active MoE
- `qwen2.5-coder:32b` (~19 GB at Q4) — proven older coder
- `devstral-small-2:24b` (~15 GB) — Mistral Devstral 2, agentic-coding-tuned. Promoted to default on May 6, 2026 after `qwen3-coder:30b` returned rc=0 / files_changed=0 in 60s on a 1-line edit while devstral-small-2 made the edit cleanly.
For a single-box self-host today, the public design supports anything from a 16 GB consumer GPU running a smaller coder model up through a 96 GB pro card running 70B+ inference natively. The chain only invokes the local model when the cloud chain has exhausted its retries, so a smaller local card is acceptable.
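A quick way to sanity-check whether a given card clears a given model is a back-of-envelope Q4 footprint. The ~0.6 GB-per-billion-parameters coefficient below is back-derived from the Q4 sizes quoted above (30B → ~18 GB, 32B → ~19 GB, 24B → ~15 GB) plus a flat allowance for KV cache and runtime buffers — a rough sketch, not any loader's actual accounting.

```python
def approx_q4_vram_gb(params_b: float, overhead_gb: float = 1.5) -> float:
    """Rough Q4_K_M footprint: ~0.6 GB per billion parameters of weights,
    plus a flat allowance for KV cache and runtime buffers (assumed)."""
    return round(params_b * 0.6 + overhead_gb, 1)


def fits(card_vram_gb: float, params_b: float) -> bool:
    """True if the estimated Q4 footprint fits entirely in VRAM."""
    return approx_q4_vram_gb(params_b) <= card_vram_gb


print(approx_q4_vram_gb(30))  # ~19.5 GB -> comfortable on a 32 GB card
print(fits(16, 70))           # a 70B Q4 model does not fit a 16 GB card
```

By this heuristic a 16 GB card tops out around 24B-class coders, which is why the chain's local fallback defaulting to a 24B model keeps the entry bar low.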
Tier 1 — the maintainer's tier (RTX 5090 32 GB)
| Spec | RTX 5090 (Blackwell) |
|---|---|
| VRAM | 32 GB GDDR7 |
| Memory bandwidth | 1,792 GB/s |
| TDP | 575 W |
| MSRP | $1,999 |
| What it runs locally | qwen3-coder:30b, devstral-small-2:24b, qwen2.5-coder:32b — all with headroom |
| Where to buy | Amazon (ZOTAC RTX 5090 Solid OC, B0F1YG5STN) · eBay search |
This is the documented dev-box tier. According to commit d92d687's benchmark notes, devstral-small-2:24b on this card completes a multi-step agentic edit in roughly 7 minutes wall-time inside jcode's harness; qwen3-coder:30b returned in 60s but failed to engage the Edit tool on the same prompt (cited as the reason for the model swap).
Tier 2 — capacity scaling per box (RTX A6000 48 GB)
The next step up is more VRAM in a single slot, not raw speed. The NVIDIA RTX A6000 is the cheapest single-card path to 48 GB and the only sub-$5,000 workstation card with NVLink for two-card scaling.
| Spec | RTX A6000 (Ampere) |
|---|---|
| VRAM | 48 GB GDDR6 |
| Memory bandwidth | 768 GB/s |
| TDP | 300 W (1U-blower form factor) |
| MSRP | $4,650 (used market $3,800–$4,800 per active eBay listings) |
| NVLink | Yes (112 GB/s peer-to-peer) |
| What it runs natively | Llama 3.3 70B Q4_K_M (~43 GB used), DeepSeek-R1 70B, Qwen 72B — all without offload |
| Public benchmarks | Per DatabaseMart's A6000 Ollama benchmark: Llama 3.3 70B at 13.56 tok/s gen, DeepSeek-R1 70B at 13.65 tok/s. Per OpenLLM Benchmarks: Llama 3 8B Q4_K_M at 102.22 tok/s. |
| Where to buy | eBay (primary, used $3,800-$4,800) · Amazon (intermittent third-party listings) |
For agent self-hosting, the A6000 is overkill on raw speed (Ampere is two architectures behind Blackwell) but transformative on capacity: the framework's article-author agent can run a 70B local model as the proposer for long-form content generation without ever touching the cloud chain.
Tier 3 — top-of-stack per box (RTX PRO 6000 Blackwell 96 GB)
| Spec | RTX PRO 6000 Blackwell |
|---|---|
| VRAM | 96 GB GDDR7 |
| Memory bandwidth | 1,792 GB/s |
| TDP | 300 W |
| FP4 / FP8 | Yes (5th-gen tensor cores) |
| MSRP | $8,499 |
| What it runs natively | Llama 3.1 405B Q4 on a single card (~99 GB used per llm-tracker's RTX PRO 6000 measurements), Qwen3 235B MoE, DeepSeek V3-0324 685B at MLX 4-bit |
| Where to buy | Amazon (PNY Server Edition B0FXMY871V) · eBay search |
This is the only single-card desktop GPU available at retail in 2026 that loads 405B Q4 in one box without offload, per the cited llm-tracker source.
Multi-box scaling — distributing agents across machines with Ollama
The framework's design has nothing in it that requires a single host. Per docs/architecture.md, the API service speaks HTTP to anything that POSTs a manifest, and the host-worker that dispatches agents is independent of where individual LLM calls land. The cleanest multi-box pattern is to run Ollama on a separate machine and point the framework's code-editor-config.json at it via an OLLAMA_HOST env var.
Public Ollama distributed-inference patterns
Ollama's public docs (https://ollama.com/blog/openai-compatibility) describe a model-serving server that listens on :11434 and exposes an OpenAI-compatible /v1/chat/completions endpoint. The framework's JcodeBackend consumes that endpoint via jcode --provider ollama per the code_editor.py source. To distribute load:
- Run one Ollama server per GPU on a dedicated inference box.
- Front them with a small reverse proxy (caddy / nginx / a custom round-robin Python service) listening on `:11434` from the perspective of the framework host.
- Set `OLLAMA_HOST=http://<your-proxy>:11434` before the framework's code-editor chain dispatches.
The public reference for this pattern is Ollama's distributed-inference issue tracker — multiple users have published reverse-proxy configs that aggregate 2-4 nodes.
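As a deliberately minimal sketch of the round-robin option, the core logic is only a few lines. The backend URLs are placeholders, and a production proxy would add health checks, timeouts, retries, and streaming passthrough on top of this.

```python
import itertools
import urllib.request

# Placeholder inference nodes -- substitute your own Ollama boxes.
BACKENDS = ["http://10.0.0.11:11434", "http://10.0.0.12:11434"]
_cycle = itertools.cycle(BACKENDS)


def next_backend() -> str:
    """Round-robin over the Ollama nodes; each call returns the next one."""
    return next(_cycle)


def forward_chat(payload: bytes) -> bytes:
    """Forward an OpenAI-style chat-completions request to the next node.
    No retries or health checks -- a real proxy needs both."""
    req = urllib.request.Request(
        next_backend() + "/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Because Ollama exposes the OpenAI-compatible `/v1/chat/completions` path on every node, the proxy never has to rewrite request bodies — it only picks a host.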
Cheap-cards-aggregated tier
If your goal is to maximize total VRAM across a home network rather than per-box, the highest-leverage hardware in 2026 is RTX 4090 used market (per active eBay listings, $1,200–$1,500) for 24 GB-per-node, or RTX 3090 used ($550–$700) if you can find them with rebuilt fans.
| Card | VRAM | Used 2026 price | Per-GB cost |
|---|---|---|---|
| RTX 4090 (Ada) | 24 GB | $1,200–$1,500 | ~$56/GB |
| RTX 3090 (Ampere) | 24 GB | $550–$700 | ~$25/GB |
| RTX 5090 (Blackwell) | 32 GB | $1,800–$2,400 | ~$66/GB |
Per Tom's Hardware's used-GPU buying guide, the RTX 3090 dual-card path with NVLink (which the 3090 still supports) gives 48 GB pooled VRAM at roughly $1,200 — half the price of a single A6000 used. For a home lab running an Ollama node per machine, the per-GB economics favor used 3090s heavily.
Where to buy:
- RTX 4090 on Amazon (B0BJFRT43X) — primary new+used third-party listings.
- RTX 4090 on eBay — broader used inventory.
- RTX 3090 on eBay (used market) — cheapest 24 GB Ampere with NVLink.
- RTX 5090 on Amazon (B0F1YG5STN) — current-gen flagship.
Mac Studio M3 Ultra alternative
For agents whose bottleneck is model capacity rather than throughput — for example, an article-author agent that needs DeepSeek-R1 671B as its proposer — the Apple M3 Ultra Mac Studio with up to 512 GB unified memory is the only single-desktop machine that fits 405B+ models without server-tier hardware.
| Config | Price | What it loads |
|---|---|---|
| M3 Ultra 60 GB | $3,999 | up to ~32B at Q4 |
| M3 Ultra 96 GB | $4,799 | 70B Q4 with headroom |
| M3 Ultra 256 GB | $9,499 | Llama 3.1 405B Q4, Qwen3 235B MoE |
| M3 Ultra 512 GB | $13,999 | DeepSeek-R1 671B at MLX 4-bit, DeepSeek V3 685B |
The 512 GB SKU runs DeepSeek-R1 671B at ~18 tok/s with 448 GB used per MacRumors's measurements — a workload no single Nvidia card under $20,000 can do without offload.
Open-source models that match each tier
The framework's config/code-editor-config.json ships sensible defaults that work on any of the tiers above. Per the file's default_chain array, the local fallback is jcode-ollama with devstral-small-2:24b as the configured model. Operators can override per agent via the dashboard's /code-editor page or by writing a different model into the storage config.
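The fall-through semantics — first backend that exits 0 with a non-empty diff wins, and rc=0 with zero edits counts as a soft fail — reduce to a short loop. This is a paraphrase of the documented behavior, not the repo's actual `code_editor.py`; the backend runner is injected so the selection logic stands alone.

```python
from typing import Callable, Optional, Tuple

# The default chain order, per config/code-editor-config.json.
DEFAULT_CHAIN = [
    "jcode-copilot", "aider-copilot-proxy", "aider-github-copilot",
    "opencode-azure", "crush-azure", "aider-azure",
    "codex-azure", "jcode-azure", "jcode-ollama",
]


def run_chain(
    chain: list,
    run_backend: Callable[[str], Tuple[int, int]],
) -> Optional[str]:
    """Try each backend in order. A backend wins only if it exits 0 AND
    changed at least one file; rc=0 with zero edits is a soft fail that
    falls through to the next backend."""
    for name in chain:
        rc, files_changed = run_backend(name)
        if rc == 0 and files_changed > 0:
            return name
    return None  # whole chain exhausted
```

The soft-fail rule is the interesting part: it is exactly the failure mode the commit log records for `qwen3-coder:30b` (rc=0, files_changed=0), which silently falls through instead of shipping a no-op commit.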
Models that work for the implementer's code-editor step
Per public benchmarks (and validated by the framework's commit log on May 6, 2026):
- `devstral-small-2:24b` (Mistral Devstral 2) — the documented default. Engages tool-calling reliably in jcode's harness per commit `d92d687`. ~15 GB at Q4; fits any 16 GB+ card.
- `qwen2.5-coder:32b` (Alibaba Qwen 2.5 Coder) — proven older option, ~19 GB at Q4. Scores at parity with closed-source models on the Aider leaderboard for surgical edits.
- `qwen3-coder:30b` (Qwen 3 Coder, 3B-active MoE) — strong on benchmarks but, per the framework's commit message, less reliable at engaging the Edit tool inside agentic harnesses. Still useful for chat agents that don't need tool calls.
Models that work for chat agents (the AI provider chain, not the code-editor chain)
The framework's framework/core/ai_providers.py exposes a separate provider abstraction for chat agents. Per the public CLAUDE.md, the recommended local-Ollama chat models are:
- `qwen3:32b` for general agent reasoning at ~19 GB
- `qwen3:14b` for lighter agents at ~9 GB
- `qwen3:8b` for high-throughput agents at ~5 GB
Confirmation flows and human-in-the-loop
Per framework/core/confirmations.py, every agent method that modifies production state can be wrapped with a @requires_confirmation decorator. The framework records the request, sends an email to the configured owner, and blocks execution until the human responds via the dashboard or via reply email.
The public design distinguishes four confirmation patterns documented in the README:
- `email-recommendations` — the default for SEO-style agents. Recommendations get queued; the human approves a batch via the dashboard.
- `per-action` — every dangerous method blocks individually.
- `preview-mode` — the agent generates artifacts but never ships.
- `upstream-gated` — the agent's output flows to a downstream agent that has its own confirmation gate.
For a home-lab operator who's nervous about agents pushing code unsupervised, the preview-mode setting in each agent's manifest is the safe default — agents will produce diffs to a changes/ folder but the implementer won't commit them.
Limitations and what the framework does NOT do
To stay honest:
- No multi-tenant isolation. Per the README's "Operational rules" section, this is a single-operator framework. Agents from different repos can register but they all share one Azure Blob storage container. There is no per-customer namespacing.
- systemd-only scheduling. The README documents systemd `--user` timers as the canonical scheduler. Operators on macOS use launchd via a manual port; Windows is unsupported.
- No first-party RAG or vector store. Agents that need vector search bring their own (the framework provides nothing).
- The code-editor chain's success rate depends on the LLM. Per the May 6, 2026 commit log, claude-opus-4.7 via Copilot proxy succeeds on >90% of small edits in production; local-only chains soft-fail more often (`qwen3-coder:30b` returned rc=0 with no edits in one validated test).
- No Windows host support for the host-worker. The host-worker is a bash script that uses `systemd-run`. WSL2 may work but is undocumented.
A self-hosting checklist for someone starting today
Per the framework's install/ directory and the README's "Quick start" section:
- Clone the repo: `git clone https://github.com/voidsstr/reusable-agents.git`
- Provision Azure Blob storage (or set `STORAGE_BACKEND=local` for testing). The Azure CLI walk-through is in `install/bootstrap-azure.sh`.
- Install the host-worker as a systemd `--user` service: `bash install/install-host-worker.sh`. This is the daemon that pulls agents off the queue and dispatches them.
- Pick your LLM provider chain. Operators with a Claude Max subscription can route through `claude-cli`; operators on the GitHub Copilot Pro plan can route through `copilot` (the framework's default). For local-only operation, set the storage `config/ai-defaults.json` default to `ollama` with your chosen model.
- Optional: install the local code-editor binaries. `aider`, `opencode`, `crush`, `codex`, and `jcode` each have their own install paths. Per `framework/core/code_editor.py`, the framework auto-detects which are present and skips the rest.
- Register your first agent by writing an `agent.py` and `manifest.json` in your project repo, then running `bash install/register-agent.sh <agent-dir>`.
- Browse the dashboard at the framework's URL (default port `8090` for the API, `8091` for the UI; production deploys via `install/deploy-azure.sh`).
A non-trivial agent (one that applies code edits and ships them) needs additional one-time setup: `~/.codex/config.toml` for `codex-azure`, `~/.config/opencode/opencode.json` for `opencode-azure`, and `~/.config/crush/crush.json` for `crush-azure`. The framework gracefully skips backends whose config is missing.
Verdict — who should self-host this
Self-host this if
- You run multiple sites or projects that would benefit from per-domain agent automation (content production, SEO, deployment chains).
- You have a Claude Max or GitHub Copilot Pro subscription and want to actually use that allowance to drive editorial workflows.
- You have a home AI rig (RTX 5090 / RTX A6000 / Mac Studio M3 Ultra) and want a real reason to keep it busy beyond chat.
- You want to study a working production agent system and you prefer reading code over reading vendor blog posts.
Skip this if
- You need turnkey hosted multi-tenant agent SaaS — this framework is a single-operator product.
- You're allergic to systemd or live exclusively in Windows.
- Your only goal is "run an LLM at home" — Ollama alone is enough; you don't need the full framework.
Pick your hardware tier
- Just want the framework to work, modest local fallback: RTX 5090 on Amazon or eBay — $1,999, runs devstral-small-2:24b for the local fallback, doubles as a top-tier gaming card.
- Need a 70B local proposer for an article-author-style agent: RTX A6000 on eBay (primary) — $3,800–$4,800 used, native 70B Q4 inference, NVLink for two-card scaling.
- Production-grade single-box for 405B-class workloads: RTX PRO 6000 Blackwell on Amazon — $8,499, 96 GB GDDR7, FP4/FP8 native.
- Capacity-bound (DeepSeek-R1 671B class): Mac Studio M3 Ultra 256 GB or 512 GB — the only single-desktop machine that fits 405B+ at Q4.
- Multi-box home cluster: RTX 3090 on eBay (used) — $550–$700 each, NVLink two-card pairs at $1,200 for 48 GB pooled. Best $/GB-VRAM if you can run multiple inference nodes.
Prices accurate as of May 6, 2026 and subject to change.
Related: RTX A6000 48GB workstation review →
Related: RTX 5090 vs RTX A6000 for local LLMs →
Related: Mac Studio M3 Ultra vs RTX 5090 for AI inference →
Frequently asked questions
Is reusable-agents really MIT-licensed? Per the LICENSE file in the repo's root: yes, MIT. The framework code, the dashboard React components, the install scripts, and the agent blueprints are all redistributable under the MIT terms.
Does it lock me into Azure? No. Per `framework/core/storage.py`, the storage backend is pluggable: `AzureBlobStorage` and `LocalFilesystemStorage` are shipped, and the abstract base class is documented. An S3 backend is straightforward — implement the same interface. The dashboard's Container App deployment is one option among several documented in `examples/deployer/` (App Service, Functions, AWS ECS Fargate, AWS Lambda, and AWS App Runner are all sample recipes).
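To illustrate how a third backend would slot in, here is the general shape of such a pluggable interface with a dict-backed implementation. The method names are assumptions for illustration — the repo's actual abstract base class in `framework/core/storage.py` defines the real contract.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Assumed shape of the pluggable interface (method names invented)."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...


class InMemoryStorage(StorageBackend):
    """Dict-backed stand-in. An S3 backend would keep the same two methods
    but swap the dict for boto3 get_object/put_object calls on a bucket."""

    def __init__(self) -> None:
        self._blobs: dict = {}

    def read(self, key: str) -> bytes:
        return self._blobs[key]

    def write(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
```

The point of the pattern: agents and the dispatcher only ever see `StorageBackend`, so swapping Azure Blob for S3 (or anything else) touches one class, not the agent code.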
What are the Azure cost expectations? Per the README's "Operational rules" section, the production deployment runs on a single small Azure Container App + an Azure Postgres + Azure Blob storage. Order of magnitude for a personal-scale operator: under $50/month. Heavy LLM costs go to your Copilot Pro / Claude Max subscriptions, not to the framework.
Can I use it without LLM costs at all? Yes — set `default_provider: ollama-local` in `config/ai-defaults.json` and point everything at your local server. Per the May 6, 2026 commit history, devstral-small-2:24b on an RTX 5090 + jcode is sufficient for the implementer's code-edit work; chat agents work with `qwen3:14b` or `qwen3:32b` per the documented defaults.
Does the framework support OpenAI's Codex / Anthropic API directly? Yes, both. The provider chain (per `framework/core/ai_providers.py`) natively supports `openai`, `anthropic`, `azure_openai`, `claude-cli`, `copilot`, and `ollama`. The default chain is `('copilot', 'azure_openai', 'openai', 'anthropic', 'ollama')` per the public docstring.
How do I scale agents that produce a lot of work? Per the dispatch architecture, agents finish their run() and call dispatch_now() which spawns the implementer in a systemd-run --scope. Each dispatch is independent. To scale: distribute agents across multiple host machines (each runs its own host-worker), or move heavy LLM calls to a separate Ollama box and reverse-proxy in.
Is there a managed/hosted version? Per the README: no. The maintainer's stated position is "this framework is a single-operator product." A managed offering would require multi-tenant isolation that the public design does not provide.
Where do I report bugs? The repo's Issues page on GitHub is the canonical channel.
Citations and sources
This article synthesizes information from the following public sources. All claims about the framework's design come from its public repository; all benchmark numbers come from third-party benchmark publications cited inline above.
- reusable-agents repository on GitHub — the canonical source for the framework's architecture, manifest format, code-editor chain configuration, dispatch flow, and provider abstraction. README.md, docs/architecture.md, docs/agents-catalog.md, framework/core/code_editor.py, framework/core/dispatch.py, framework/core/ai_providers.py, framework/core/storage.py.
- DatabaseMart Ollama RTX A6000 benchmark — Llama 3.3 70B and DeepSeek-R1 70B token-generation rates on the RTX A6000.
- OpenLLM Benchmarks RTX A6000 review — Llama 3 8B and Llama 3 70B prefill + generation rates.
- llm-tracker RTX PRO 6000 measurements — Llama 3.1 405B Q4 single-card load + DeepSeek V3-0324 685B 4-bit measurements on the PRO 6000 Blackwell.
- DatabaseMart Ollama RTX 5090 benchmark — RTX 5090 token-generation rates across the Llama 3 family.
- Phoronix RTX 5090 Linux review — RTX 5090 LLM inference power draw + Linux driver behavior.
- Runpod RTX 5090 LLM benchmarks — Qwen 2.5-Coder-7B FP16 vLLM batched throughput.
- TechPowerUp RTX A6000 entry — RTX A6000 silicon, memory, NVLink, and bandwidth specifications.
- TechPowerUp RTX 5090 entry — RTX 5090 silicon, memory, and bandwidth specifications.
- Ollama OpenAI compatibility blog post — Ollama's OpenAI-compatible HTTP endpoint.
- Ollama distributed inference tracking issue — community reverse-proxy patterns for multi-node Ollama setups.
- Aider leaderboard — code-editor LLM benchmark scores including Qwen 2.5 Coder and devstral-small-2.
- MacRumors DeepSeek-R1 on M3 Ultra coverage — DeepSeek-R1 671B at MLX 4-bit token-generation rates on the M3 Ultra Mac Studio.
- Tom's Hardware used GPU buying guide — used-GPU price ranges for RTX 3090 and RTX 4090.
- Live eBay marketplace listings for RTX A6000, RTX PRO 6000 Blackwell, RTX 5090, RTX 4090, RTX 3090 — sampled May 6, 2026.
This piece is editorial synthesis based on publicly available information. No independent first-party benchmarking is reported here; all performance numbers are sourced from the publications cited inline above. Hardware availability and pricing change daily — verify current stock and pricing on the linked retailer pages before purchasing.
