A self-hosted multi-agent LLM stack

The full writeup behind Self-hosted LLMs and the context discipline that makes them work. Internal specifics are scrubbed; everything here is generic enough to run on your own gear.

A GPU host behind a gateway, running LLM inference, web search, scraping, embeddings, and speech-to-text with zero external API dependency. On top of it: a single-slot model swap, model tiering, a handful of narrow agents, subagents with isolated context, skills that wrap an API as one tool call, and a memory layer that learns from its own runs. The discipline matters more than the model size, and the serving methodology near the end is what makes that discipline affordable on hardware you own.

Table of Contents#

Overview
Infrastructure
llama-swap: model server
Open WebUI: API gateway
Firecrawl: scrape + AI extraction
SearXNG: metasearch
The core idea: clean contexts beat big models
Five concepts that make it work
Model tiering
OpenCode: the agent runtime
Custom tools
The research pipeline
Skills: one tool call over an API
Provider configuration
Lessons learned
Serving optimization methodology
Memory and self-learning
Sources

1. Overview#

A single server runs every AI service I touch: LLM inference, embeddings, speech-to-text, web search, web scraping, and a unified API gateway in front of all of it. Everything self-hosted, no external API dependency.

Clients (OpenCode, browser, automation, scripts)
    |
    v  <llmbox>
+--------------------------------------------------+
|                                                  |
|  Open WebUI :8080  ----+                          |
|  (API gateway + UI)    |                          |
|                        v                          |
|  Firecrawl :443  ----> llama-swap :1234           |
|  (scrape + extract)    (model server)             |
|                        |                          |
|  SearXNG :8888         +-> llama-server instances |
|  (metasearch)          |   (one per loaded model) |
|                        |                          |
|                        +-> GPU 0                   |
|                        +-> GPU 1                   |
+--------------------------------------------------+

The interesting part of working with language models in 2026 is not the model. It is the discipline you wrap around it, and that discipline is far easier to build and inspect on a stack you own end to end. The hardware is the boring part. The rest of this doc is the layers on top: the single-slot constraint that shapes the design, model tiering, the narrow agents built around it, subagents with isolated context, skills, the serving tuning that makes it fit in VRAM, and a memory layer that learns from its own runs.

2. Infrastructure#

The reference machine in this doc is a single multi-GPU box, called <llmbox> throughout. The exact CPU, RAM, and SSD models are not load-bearing - any box with two 24-32 GB consumer GPUs, a large amount of system RAM, and a fast NVMe for model weights will run the same stack. The numbers below are the shape of one working build.

Compute#

Component	Detail
CPU	high-core-count workstation/server class
RAM	256 GB (typical usage ~20 GB for OS + Docker, the rest available for model layers and page cache)
GPU 0	NVIDIA, 32 GB VRAM, PCIe Gen5 x16
GPU 1	NVIDIA, 32 GB VRAM, PCIe Gen5 x16
Total VRAM	64 GB

Both GPUs hang off the CPU PCIe host bridge (PHB topology), no NVLink. Models are split across the two cards using llama.cpp's layer splitting (--split-mode layer --tensor-split 1,7). With the worker tier loaded, GPU 0 holds the persistent small models plus a slice of the big model; GPU 1 holds the bulk of the big model layers.

Storage#

Dedicate a fast NVMe to model weights. Layout that works:

/ (root, ~2 TB): OS, Docker images and volumes, configs.
/var/lib/llama (separate fast NVMe, PCIe Gen5, ~12 GB/s read): model weights on an LVM volume. The read speed matters because every hot-swap reloads weights from disk.

Disk usage on the reference box: model weights ~840 GB under /var/lib/llama/models/, embedding weights ~7 GB under /var/lib/llama/embed/, Docker images ~45 GB on root.

Network#

A single primary interface on the LAN is all that is needed. Throughout this doc the box's address is written <llmbox>; on the reference build it is a static LAN IP in the documentation range 192.0.2.0/24. Docker uses its default 172.17.0.0/16 bridge.

OS#

Ubuntu LTS, a recent kernel, and a matching CUDA driver. Nothing distro-specific in the stack; it runs the same on any glibc Linux with working NVIDIA + Docker.

3. llama-swap: model server#

How it works#

llama-swap sits in front of llama.cpp's llama-server. A YAML file maps model names to launch commands. When a request arrives for a model:

If that model is already loaded -> route directly to its process.
If a different model from the same exclusive group is loaded -> unload it, start llama-server for the requested model, wait for it to load into VRAM, then route.
Persistent models (embeddings, whisper, the small utility model) stay loaded permanently alongside whatever big model is current.

Request: model="qwen3.5-122b-a10b"
    |
    v
llama-swap (:1234)
    +-- Currently loaded: qwen3.6-27b (big group, exclusive)
    +-- Requested: qwen3.5-122b-a10b (same group)
    +-- 1. Unload 27B process
    +-- 2. Start llama-server for 122B with the right GPU config
    +-- 3. Wait for model to load into VRAM
    +-- 4. Route request
    v
Response (swap took ~10-15s, subsequent requests instant)

Model groups#

groups:
  big:                    # Hot-swapped - only one loaded at a time
    swap: true
    exclusive: true
    members:
      - qwen3.6-27b              # + -fast, -uncensored - the default worker
      - qwen3.6-35b-a3b          # + -fast
      - qwen3.5-122b-a10b        # + -fast, -uncensored - reason escalation
      - qwen3.5-397b-a17b        # + -fast - deep escalation
      - gemma4-31b
      - glm-5-1
      - minimax-m2.7-229b
      - nemotron3-super-120b     # + -fast
      - step3.5-flash            # + -fast
      - reranker-qwen3-8b
      - reranker-bge-v2-m3

  small:                  # Persistent - always loaded
    persistent: true
    members:
      - qwen3.5-4b

  embeddings:             # Persistent - always loaded
    persistent: true
    members:
      - embedding-bge-m3         # default: small, multilingual, best quality-to-size
      - embedding-qwen3-4b
      - embedding-qwen3-8b

  audio:                  # Persistent - always loaded
    persistent: true
    members:
      - whisper-large-v3

Startup preload: the default worker plus the three persistent models (small utility, an embedder, whisper).

Available models#

Model	Type	Active params	Tier / use
`qwen3.5-4b`	Dense	4B	utility: page extraction, classification, the reflector
`qwen3.6-27b`	Dense	27B	the warm worker tier (default), + `-fast` / `-uncensored`
`qwen3.6-35b-a3b`	MoE	3B active	faster worker variant, + `-fast`
`qwen3.5-122b-a10b`	MoE	10B active	reason escalation, + `-fast` / `-uncensored`
`qwen3.5-397b-a17b`	MoE	17B active	deep escalation, + `-fast`
`gemma4-31b`, `glm-5-1`, `minimax-m2.7-229b`, `nemotron3-super-120b`, `step3.5-flash`	various	alternates on the swap menu
`embedding-bge-m3`	Embed	~0.6B	embedder of choice: small, multilingual, best quality-to-size
`embedding-qwen3-4b` / `-8b`	Embed	4B / 8B	higher-dimension embedding options
`reranker-bge-v2-m3` / `reranker-qwen3-8b`	Rerank	-	result reranking
`whisper-large-v3`	Audio	-	speech-to-text

The chat models run from unsloth *-MTP-GGUF builds (Unsloth Dynamic UD-Q4_K_XL) with multi-token-prediction speculative decoding (--spec-type draft-mtp) and a vision projector (--mmproj), so they are multimodal and draft-accelerated. The -fast variants are the same weights with thinking disabled (enable_thinking: false) and adjusted sampling (temp 0.7, top_p 0.8); subagents use them. -uncensored variants exist where a primary needs them.

The MoE letter math is total params - active params. A 122b-a10b activates 10 billion parameters per token while holding 122 billion. Memory holds the whole model, compute touches only the experts the router picks. That is the trick that makes the big tiers affordable on this hardware.

GPU memory layout (worker tier loaded)#

GPU 0 (32 GB):
  qwen3.5-4b (persistent) .......... ~3 GB
  embedding (persistent) ........... ~2 GB
  whisper-large-v3 (persistent) .... ~2 GB
  qwen3.6-35b-a3b (partial layers) . ~4 GB   (tensor-split 1,7 = 1/8 of model)
  KV cache + overhead .............. ~4 GB
  Free ............................. ~17 GB

GPU 1 (32 GB):
  qwen3.6-35b-a3b (main layers) .... ~24 GB  (7/8 of model)
  KV cache + overhead .............. ~5 GB
  Free ............................. ~3 GB

Key configuration#

# /etc/llama-swap.yaml (excerpt)
defaults:
  server: /opt/llama.cpp/build/bin/llama-server --port ${PORT}
          --no-warmup --jinja --metrics -np 1
  whisper: /opt/whisper.cpp/build/bin/whisper-server --port ${PORT}
  proxy: http://127.0.0.1:${PORT}

qwen3.6-27b:
  cmd: |
    ${server}
    --device CUDA0,CUDA1 --split-mode layer --tensor-split 1,1
    -m ${MODEL_BASE}/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf
    --mmproj ${MODEL_BASE}/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf
    --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-n-min 0
    --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0
    --ctx-size 262144
  proxy: ${proxy}

*-MTP-GGUF is the key part: an Unsloth Dynamic build with a trained multi-token-prediction draft head baked in. --spec-type draft-mtp runs that head as the speculative drafter, which (unlike ngram drafting) measurably speeds up free-form generation. --mmproj loads the vision projector, so the model is multimodal.

llama-swap runs as a native systemd service (not Docker) for direct GPU access:

/usr/bin/llama-swap -config /etc/llama-swap.yaml -watch-config -listen 0.0.0.0:1234

4. Open WebUI: API gateway#

External clients (OpenCode, browser, scripts)
    |
    v
Open WebUI (<llmbox>:8080)
    +-- /api/chat/completions -----> llama-swap :1234
    +-- /api/embeddings -----------> llama-swap :1234 (embedding models)
    +-- Web UI for direct chat
    +-- User management, conversation history
    +-- Model aliases and routing

Open WebUI is the single entry point for all LLM interaction. OpenCode, automation, and custom scripts connect to it as an OpenAI-compatible API; it routes requests to llama-swap internally and adds user management, conversation history, and model routing. It is reachable publicly at chat.archworks.co, which points its OPENAI_API_BASE at the local llama-swap.

Docker image: ghcr.io/open-webui/open-webui:latest, port 8080.

5. Firecrawl: scrape + AI extraction#

Docker stack#

Container	Role	Port
firecrawl-api	Main API service	:443 -> 3002
firecrawl-playwright	Headless Chromium for JS rendering	internal
firecrawl-redis	Job queue	internal
firecrawl-postgres	Job metadata	internal
firecrawl-rabbitmq	Task orchestration	internal

Endpoints#

Endpoint	What it does
`POST /v1/scrape`	Scrape URL -> clean markdown
`POST /v1/extract`	Scrape + LLM extraction -> structured JSON
`POST /v1/search`	DuckDuckGo search via headless browser
`POST /v1/crawl`	Crawl an entire site (async)
`POST /v1/map`	Discover URLs on a domain

LLM integration#

The /v1/extract endpoint uses the local 4B model for intelligent extraction:

Client -> <llmbox>:443/v1/extract
  -> Firecrawl scrapes the page with Playwright
  -> Passes content + extraction prompt to llama-swap (host.docker.internal:1234)
  -> llama-swap routes to qwen3.5-4b (persistent, instant)
  -> Returns structured JSON with only the requested facts

Config in docker-compose.yml:

OPENAI_API_KEY: "sk-local"
OPENAI_BASE_URL: "http://host.docker.internal:1234/v1"
MODEL_NAME: "qwen3.5-4b"

6. SearXNG: metasearch#

SearXNG aggregates results from multiple providers into one response, on <llmbox>:8888.

Engine	Status	Notes
Brave	Primary	Best quality, rate-limits under heavy automated use
Qwant	Primary	Strong for European content
DuckDuckGo	Enabled	Connection issues from some datacenter IPs
Google	Enabled	Silent failures from datacenter IPs (consent page)
Startpage	Enabled	CAPTCHA from datacenter IPs
Bing	Disabled	Consistently poor results

# /etc/searxng/settings.yml
search:
  default_lang: all             # Language set per-query by clients
  ban_time_on_fail: 30          # Fast recovery (default: 180s)
  max_ban_time_on_fail: 120     # Suspension cap (default: 600+s)

outgoing:
  request_timeout: 30.0
  enable_http2: true
  pool_connections: 100

Valkey (a Redis fork) stores engine-suspension state. Flush it when engines get stuck banned:

valkey-cli -s /run/valkey/valkey.sock FLUSHALL

Docker compose#

All Docker services live in one compose file (/opt/ai-stack/docker-compose.yml):

cd /opt/ai-stack
docker compose up -d                    # start all
docker compose up -d firecrawl-api      # restart a single service
docker compose logs -f open-webui       # follow logs
docker compose ps                       # status

7. The core idea: clean contexts beat big models#

A naive way to use an LLM for research: hand it a question, let it search the web, scrape ten pages, dump the content into context, ask for a report. By page seven it has forgotten what it read for page two. By the report it is inventing quotes that appeared in none of the sources.

This is not a small-model problem. Frontier models hallucinate under the same load. A model with a polluted context performs worse than a smaller model with a clean one. The shape of the problem is attention dilution: the model spreads its limited attention across everything in context, and the more noise you stuff in, the less attention each fact gets.

The fix is not bigger models. The fix is smaller contexts.

The mental model is a filing cabinet.

BAD:  one person, 50 documents, 10 questions
      = overloaded, mixes details, forgets, starts guessing
GOOD: 10 people, 5 documents each, 1 question each
      + 1 analyst combining the answers
      = focused, accurate, verifiable

The senior analyst is the orchestrator model. The ten people are worker models, each with a clean desk. That is what context discipline is: many workers with clean desks, one orchestrator combining their findings.

8. Five concepts that make it work#

8.1 Orchestrated delegation#

One agent plans the work and combines the results. The workers it spawns execute focused subtasks. Because the backend keeps one big model warm at a time (see section 3), the coordinator and its workers run on the SAME model - a subagent inherits the model of the agent that called it. The win here is a clean context per worker, not a bigger brain on top.

You: "Research complex topic X"
         |
         v
+--Coordinator (the warm worker model)----+
|  "I'll break this into 12 questions"     |
+--+------+------+------+------+------+----+
   v      v      v      v      v      v
 [Worker][Worker][Worker][Worker][Worker][Worker]
  (same warm model, inherited - not a separate tier)

Each worker:
  - Gets ONE specific question
  - Starts with a clean context
  - Searches, reads 3-5 pages
  - Writes findings with source URLs
  - Returns ONLY the distilled findings

Coordinator:
  - Receives 12 concise answers (not 12 raw page dumps)
  - Checks for gaps, retries failed topics
  - Verifies critical claims
  - Writes the final report

The coordinator does the synthesis, the workers do focused retrieval, and neither is overwhelmed. Escalating to a bigger model is a separate, deliberate step (see section 9), not something that happens inside one fan-out: a swap-bound backend cannot run a 122B coordinator and a fleet of smaller workers at the same moment. The gain holds at any scale - context isolation improves attention quality even on frontier models.

8.2 Context isolation#

The single biggest quality improvement. Each subtask gets its own context. Nothing leaks between workers.

WITHOUT isolation (one agent does everything):

  +--------------------------------------------------+
  | User question                      500 tokens    |
  | Page 1 scrape                    8,000 tokens    |
  | Page 2 scrape                   12,000 tokens    |
  | Page 3 scrape                    6,000 tokens    |
  | Search results                   2,000 tokens    |
  | Page 4 scrape                   15,000 tokens    |
  | Page 5 scrape                    9,000 tokens    |
  +--------------------------------------------------+
  = 52,500 tokens of accumulated noise
  = Attention spread thin across everything
  = Quality of final answer: LOW

WITH isolation (workers have separate contexts):

  Worker 1 context:        Worker 2 context:
  +------------------+     +------------------+
  | Question  200 tk |     | Question  200 tk |
  | Extract   800 tk |     | Extract 1,200 tk |
  +------------------+     +------------------+
  = fully focused          = fully focused

  Orchestrator context:
  +---------------------------+
  | Original question  500 tk |
  | Worker 1 findings  400 tk |  <- distilled, not raw pages
  | Worker 2 findings  600 tk |  <- distilled, not raw pages
  +---------------------------+
  = 1,500 tokens of clean signal
  = Quality of final answer: HIGH

The total compute is the same. The quality is not. It works because of three failure modes the isolation avoids:

Attention dilution: models spread attention across everything; less noise means better focus.
Retrieval failure: key facts from early in a long context get buried; a short context loses nothing.
Instruction drift: models gradually forget the original task as context grows; a fresh context keeps the task clear.

8.3 Extract, don't dump#

When a worker needs a fact from a web page, don't put the whole page in its memory. Use a tiny model to extract just the relevant facts first.

RAW SCRAPE (wasteful):               UTILITY EXTRACTION (efficient):
+---------------------------+         +---------------------------+
| Navigation menu           |         | Fine amount: 726 EUR      |
| Cookie banner             |         | Repeat offense: 2,180 EUR |
| *** Actual content ***    |         | Source: example.com       |
| Footer / ads / comments   |         +---------------------------+
+---------------------------+         = 200 tokens
= 8,000 tokens (95% noise)

A small 4B model reads the page and returns only what was asked for. The worker (35B) never sees the noise. This compounds: five extracted pages cost a thousand tokens of worker context; five raw pages would cost forty thousand. That compounding is the difference between fitting in context and not. The total compute is the same; the quality is not.

8.4 Verification pipeline#

Models confidently state wrong facts. They cite wrong sources, mix up jurisdictions, and invent information when they can't find the real thing. A single-pass pipeline has no way to catch this.

The fix: a verification step that checks claims against their sources.

Research output: "The fine is 726 EUR (Source: example.com/law)"
                          |
                          v
                   Verify Agent
        +-----------+-----+-----------+
        |           |                 |
   Visit URL    Compare          Rate it
   example.     "claim says     CONFIRMED /
   com/law      726 EUR,        DEBUNKED /
                page says        NOT FOUND
                726 EUR"
                          |
                          v
              CONFIRMED - claim matches source

What this catches in practice: wrong legal citations (a paragraph from the wrong jurisdiction), URLs that don't contain the claimed information, numbers the model fabricated when search was down, and paraphrased text presented as exact quotes.

8.5 Graceful degradation#

Things fail. Search engines rate-limit. Pages go down. Workers use up their budget. The system must handle this without collapsing or lying.

Search engine down?
  -> Automatic fallback to alternative search backend (different code path)
Worker returns empty?
  -> Orchestrator retries with different keywords
  -> After retry fails: marked "NOT VERIFIED" (honest)
Worker out of step budget?
  -> Writes partial findings + lists remaining questions
  -> Orchestrator delegates the rest to a fresh worker
All search backends down?
  -> Worker reports "search unavailable"
  -> Does NOT generate content from training data
  -> Honest gap > confident hallucination

The key rule: an honest "I couldn't find this" is ALWAYS better than a confident fabrication. The whole point of building a research pipeline is to be more trustworthy than the model talking to itself. Systems that degrade gracefully are trustworthy; systems that fill gaps silently are dangerous.

Measured impact#

A complex 20-topic research task, single-agent vs multi-agent on the same models and the same search tools:

Metric	Single agent	Multi-agent
Topics covered	~15%	75%
Verifiable sources	0-2	23
Hallucinations	5+ per report	Caught by verify
Useful output	Muddled summary	Structured report with citations

The only difference was how the work was organized: many focused workers with clean context, vs one overloaded agent with polluted context.

9. Model tiering#

The single-slot constraint shapes the tiers. One big model is warm at a time, so the default driver and the subagents it spawns share that warm model - a subagent inherits the caller's model rather than picking its own. Reaching for a bigger model means a deliberate swap, so the big tiers are escalation-only.

Tier	Model	Thinking	Role
Utility	4B (resident)	off	page extraction, classification, the reflector that writes memory
Worker	27B (the warm slot)	off	the default driver AND the workers it spawns (inherited)
Reason	122B MoE	on	escalation: only after the worker has failed twice on its own
Deep	397B MoE	on	last resort: only after reason came back and was verified wrong

The worker tier is the default. It runs the primary agents and the read-only subagents they fan out to. Thinking off, because for retrieval and most coding the reasoning monologue burns context without improving the result.

The utility tier is the always-resident 4B. It runs alongside the warm slot at no swap cost and does the cheap mechanical jobs: turn a raw HTML page into the three facts a worker asked for, classify, and (see section 17) review a finished session and write down what worked.

The reason and deep tiers are MoE escalation. You do not reach for them first - each step up costs a swap. A 122b-a10b activates only 10B parameters per token while holding 122 billion: memory holds the whole model, compute touches only the experts the router picks, which is what makes the big tiers affordable on this hardware.

When delegation pays off#

The whole pipeline is overhead. For a simple question it loses to a single model with a clean prompt.

Situation	Pattern	Why
Complex research, many subtopics	Full pipeline: delegate + isolate + verify	Each subtopic gets clean context, claims get checked
Code review / audit	Orchestrator + focused reviewers	Each reviewer checks one aspect
Data analysis	Orchestrator + extraction workers	Each worker processes one source
Simple Q&A	Single model, no delegation	The overhead is not worth it
One-page lookup	Single worker with extract tool	No orchestration needed

Rule of thumb: more than 3-5 documents to read, or more than 3 questions to answer, delegation pays. Below that, one model with clean prompting wins.

10. OpenCode: the agent runtime#

OpenCode is a TUI-based agent runtime. You select an agent in a terminal; the agent uses tools to accomplish a task and can spawn subagents, which is what makes the multi-agent delegation pattern real.

~/.config/opencode/
  opencode.json          # provider config, model menu, per-agent tool permissions
  agents/                # agent definitions (markdown files)
    # primaries (you pick one in the TUI; each refuses off-topic work):
    chat.md              # knowledge Q&A, comparisons, recommendations
    build.md             # code, refactors, infrastructure-as-code
    research.md          # sourced multi-step research
    ops.md               # live infra, remote troubleshooting, runbooks
    student.md           # coursework, math, write-ups
    # escalation tiers (called only after the cheap model fails):
    reason.md            # 122B - subtle bugs, race conditions, trade-offs
    deep.md              # 397B - last resort
    # read-only subagents (inherit the caller's model):
    web-researcher.md  verify.md  plan.md  review.md  researcher.md  ...
  tools/                 # custom tools (TypeScript) - see section 11
  plugins/               # lifecycle hooks: memory, traits, reflect, auto-format
  traits/                # one-line learned corrections (auto-loaded)
  learned-skills/        # multi-step procedures the stack wrote itself
  memory/                # flat-file facts + index, loaded every session

Agent definitions#

Every agent is a markdown file with YAML frontmatter:

---
description: What this agent does (shown in the agent picker)
mode: primary          # primary = user-facing, subagent = only called by others, all = both
model: openwebui/qwen3.6-27b
steps: 50              # maximum actions before forced stop
permission:
  edit: deny           # block file editing
  bash: deny           # block command execution
  websearch: deny      # block built-in web search (use the custom search tool)
---

The rest is the agent's system prompt: instructions, rules,
workflow steps, output format.

Agent modes#

Mode	Who can use it	Example
`primary`	User selects it in the TUI	research, build, review
`subagent`	Only other agents can spawn it	web-researcher, verify
`all`	Both user and other agents	plan

Model assignment#

Primaries pin the warm worker model. Read-only subagents leave the model unset and inherit the caller's, so a fan-out never swaps mid-run. Only the escalation tiers pin a bigger model on purpose:

chat.md / build.md / research.md / ops.md -> openwebui/qwen3.6-27b        (the warm slot)
web-researcher.md / verify.md / plan.md   -> (unset, inherits the caller's model)
reason.md                                 -> openwebui/qwen3.5-122b-a10b   (deliberate swap)
deep.md                                   -> openwebui/qwen3.5-397b-a17b   (deliberate swap)

Subagents run thinking-disabled, because OpenCode returns subagent results via assistant-message prefill, which is incompatible with llama.cpp's thinking mode.

Step budgets#

steps: 50 is the hard maximum. The system prompt sets a softer budget:

## HARD LIMITS

You have a budget of about 15-20 tool calls. After around 15, start wrapping up.
By 20 at most, you MUST stop and write your findings as text.

CRITICAL: Your findings are ONLY returned to the orchestrator if you write
them as a text message. Tool call results are NOT forwarded. If you never write
a text response, ALL your work is lost.

The gap between soft budget (20) and hard limit (50) gives the model 30 steps to write its findings even if it overshoots the tool-call budget. Without this buffer, workers consume all steps on tool calls and return nothing.

11. Custom tools#

Tools extend what agents can do. Three power the research pipeline and get the detail here; the stack ships about two dozen more, grouped at the end of this section. Each tool is denied globally and re-enabled per agent in opencode.json, so an agent only holds the tools its job needs.

search.ts - web search#

SearXNG (primary) -> Firecrawl /v1/search (fallback)

Queries SearXNG first (Brave, Qwant, DDG, Google).
If 3+ SearXNG engines are down, falls back to Firecrawl's built-in DuckDuckGo search.
2-second rate limiter between calls to prevent engine exhaustion.
Language parameter per query (de-AT, pl-PL, etc.).
When all backends fail, returns "Do NOT retry" to stop the agent from spiraling.

extract.ts - AI-powered extraction#

Firecrawl /v1/extract (LLM extraction) -> scrape + truncation (fallback)

Sends URL + extraction prompt to Firecrawl.
Firecrawl scrapes the page, passes it through the 4B model, returns structured JSON with only the requested facts.
If LLM extraction fails, falls back to a raw scrape truncated to 16K characters.
This is the key context-management tool: returns 500 tokens instead of 8,000.

extract("https://example.com/...", "Extract fine amounts for registration violations")

Returns:
{
  "penalties": [
    {"offense": "First offense", "fine": "up to 726 EUR"},
    {"offense": "Repeat offense", "fine": "up to 2,180 EUR"}
  ]
}

vs a raw scrape returning 20,000 characters of full text plus navigation and footers.

scrape.ts - raw page scrape#

Direct Firecrawl scrape, returns the full page as markdown (up to 15K chars). Used only when extract fails, or when the agent needs to browse a page without knowing what to look for.

The rest, grouped#

Beyond the research three, the stack adds tools that make agents capable and deterministic. Each is denied globally and re-enabled per agent.

Group	Tools	Purpose
Web	`crawl`, `map`, `mirror`, `scrape_html`, `har_parse`	recursive crawl, URL discovery, single-file or whole-site mirror, post-JS raw HTML, HAR API reverse-engineering
Vision	`vision_describe`, `screenshot`, `screenshot_local`	describe an image via a vision model, screenshot a URL or a local static site headlessly
Determinism	`resolve_path`, `json_query`, `backup_before`, `wait_for`	fuzzy-resolve a partial path, jq over JSON, timestamped backup before a destructive change, poll-until-true instead of `sleep`
Math (student)	`math_eval`, `math_symbolic`, `math_solve`, `math_linalg`, `math_numerical`, `math_stats`, `math_plot`	numeric eval, SymPy, equation / linear-algebra / numerical solving, stats, ASCII plots

The determinism group exists because a model left to write one-liners reaches for fragile shell: sleep 5 blocks the runtime, python -c breaks on odd input, rm runs before any backup. Each tool replaces a class of flaky improvisation with one call that behaves the same every time.

12. The research pipeline#

The most complex workflow. It demonstrates orchestration, context isolation, verification, and graceful degradation working together.

research.md (coordinator, warm model)
    +---> plan.md (optional, builds a detailed plan)
    +---> web-researcher.md x N (workers, inherited model)
    |       Each answers ONE question
    |       Tools: search, extract, scrape
    +---> verify.md (fact-checker, inherited model)
    |       Checks claims against source URLs
    |       Tools: search, extract, scrape
    v
    Final report with citations and verification verdicts

Step 1 - parse the request. The orchestrator reads the message line by line, extracting every issue, entity, and constraint. The prompt explicitly requires cross-checking so nothing gets missed.

Step 2 - build 12-20 narrow research questions. Not broad topics ("research social welfare") but specific questions: "What are the income thresholds for benefit X in 2026?", "What penalties exist for misrepresentation on application Y?", "What alternatives exist if someone is excluded from benefit Z?"

Step 3 - spawn all workers in parallel. Each web-researcher gets ONE question, starts with a clean context, and has the search/extract/scrape tools.

Worker lifecycle:
1. Search (1-2 calls)
2. Extract from 3-5 result URLs
3. Reason about source quality and consistency
4. Write findings with citations
5. Return text to orchestrator (tool outputs discarded)

Step 4 - gap analysis. Did every subtopic get answered? For gaps, spawn new workers with different keywords. After one failed retry, mark as "NOT VERIFIED".

Step 5 - verification (mandatory). The verify agent receives the 5-10 most critical claims with their source URLs, visits each URL, compares the claim against actual page content, and rates each CONFIRMED / PARTIALLY CONFIRMED / DEBUNKED / UNVERIFIABLE.

Step 6 - synthesize. The orchestrator combines verified findings into the final report. Debunked claims are marked; unverified items are listed honestly.

What goes wrong without this#

Problem	Cause	Solution
Worker returns empty	Used all steps on tool calls	Step budget (soft 20, hard 50)
Wrong law cited	Model hallucinated from training data	Verify agent checks source URLs
Worker forgets first 5 of 10 pages	Context overflow from raw page dumps	Extract tool returns structured JSON
All engines rate-limited	Too many rapid queries	2s rate limit + Firecrawl fallback
Worker keeps retrying failed search	Ignores "stop" instructions	Search tool returns "Do NOT retry"
Orchestrator skips verification	Prompt said "optional"	Changed to "MANDATORY"

13. Skills: one tool call over an API#

Subagents handle the process-discipline side. Skills handle the API-surface side. Together they let an agent touch real systems without drowning in plumbing.

A skill is a small documented capability with a clear contract. Three layers per skill:

+-------------------+     +-------------------+     +-------------------+
|    SKILL.md       |     |    AGENT.md       |     |    Python CLI     |
|  When to use      | --> |  API reference    | --> |  Makes the calls  |
|  Workflow steps   |     |  Field IDs        |     |  Handles auth     |
|  Output format    |     |  Rate limits      |     |  Parses responses |
+-------------------+     +-------------------+     +-------------------+

SKILL.md is the public face. The orchestrator reads it to decide when and why to use the skill.
AGENT.md is the reference layer. The executing agent reads it to know how to make correct API calls: API IDs, custom field names, endpoints. Loaded only when the skill is invoked.
Python CLI handles auth and transport. The model never sees credentials.

Directory structure#

~/.claude/skills/
  jira/
    SKILL.md          # "Use this skill for ticket management..."
    AGENT.md          # "POST /rest/api/2/issue, fields: {project: {key: ...}}"
    jira.py           # subprocess.run(["curl", ...]) with auth headers
    config.json       # {"url": "...", "token": "..."} - never shared
  calendar/
    SKILL.md
    AGENT.md
    calendar.py
    config.json

Each skill is a folder. Each opens with a short description so the agent knows when to reach for it, and ends with one or two example invocations so the agent has a working template. A practical set covers the things you touch daily: a ticket tracker, email, a calendar, a wiki, a time-tracking system, a tmux driver, a notes API.

Example: creating a ticket#

User: "Create a ticket for the DNS migration"

1. skill-runner reads jira/SKILL.md
   -> Determines this is a "create" operation
   -> Reads jira/AGENT.md for field mappings

2. skill-runner constructs the command:
   python3 jira.py create \
     --project OPS \
     --type Task \
     --summary "DNS migration to new provider" \
     --labels infra,dns

3. jira.py:
   - Reads config.json for URL + token
   - POST /rest/api/2/issue with the correct payload
   - Returns: "OPS-1234 created"

4. skill-runner reports back: "Created OPS-1234"

The model constructs the command; the Python script handles auth. Credentials never enter the model's context. The agent's whole view of "send a ticket" or "send an email" or "create a calendar event" is one tool call: a function that takes a small dict and returns a small dict.

The gain is concrete. Prompts dropped from "search for the ticket about deploying X, paste in the ticket ID, here is my username, here is the API token format" to "create a ticket about deploying X". Less plumbing in context, more attention left for the actual task.

14. Provider configuration#

OpenCode connects to backends via providers in opencode.json:

{
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local LLM",
      "options": {
        "baseURL": "http://127.0.0.1:1234/v1",
        "apiKey": "no-key"
      },
      "models": {
        "qwen3.6-27b": { "name": "qwen3.6-27b" },
        "qwen3.5-122b-a10b": { "name": "qwen3.5-122b-a10b" }
      }
    },
    "openwebui": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Open WebUI",
      "options": {
        "baseURL": "https://chat.archworks.co/api",
        "apiKey": "<redacted>"
      },
      "models": {
        "qwen3.5-122b-a10b": { "name": "qwen3.5-122b-a10b" }
      }
    }
  }
}

Agents reference models as provider/model-name (e.g. openwebui/qwen3.5-122b-a10b).

Two providers, same kind of backend. openwebui is the gateway every agent uses, so a subagent inherits the exact provider/model string of its caller and the warm slot is never fought over mid-run. The direct local provider points at a llama-swap on the workstation, kept for manual model-switching and offline use rather than wired into any agent.

15. Lessons learned#

Each of these was discovered through a specific failure.

Lesson	Discovery
Use `-fast` (thinking disabled) for all subagents	Thinking-enabled models reject assistant prefill, causing empty returns
Steps must be >> tool-call budget	Workers consumed all steps on tool calls, leaving none for writing findings
"MANDATORY" not "optional" for verify	The model skipped verification every time it was described as optional
Extract (structured JSON) not scrape (raw markdown)	5 raw scrapes = 50K tokens of noise, 5 extracts = 2K tokens of signal
Search tool must say "Do NOT retry"	Workers retried failed search 13+ times, exhausting all engines
Orchestrator must cross-check plan against user input	Broad planning missed specific details in the request
Subagent must include "write partial + list remaining questions"	Workers that couldn't finish returned nothing instead of partial results
Rate limit in the search tool, not just the prompt	Models ignore "wait 2 seconds" instructions; the tool enforces it in code
Two independent search backends	SearXNG and Firecrawl search use different code paths; they don't fail simultaneously

16. Serving optimization methodology#

How to squeeze the most context and tokens-per-second out of a llama.cpp + llama-swap stack on two 24-32 GB consumer GPUs (64 GB VRAM total on the reference box). The methodology was developed while tuning the Qwen3.5 family and generalizes to any large dense or MoE model too big for full GPU residency.

16.1 What you're actually optimizing#

For a single forward pass, the critical-path cost has three parts:

Attention over the KV cache - memory-bandwidth bound. Dominates token generation (TG) for long contexts.
Feed-forward (FFN / MoE expert) compute - compute-bound on GPU, PCIe-bound if experts live on CPU and their activations round-trip.
Per-request scheduling overhead - small, grows with -np (parallel slots).

On a 64 GB box a 100B+ model does not fully fit in VRAM, so you make three trade-offs at once: how much of the model lives on GPU vs CPU RAM, how large the context window is (KV scales linearly with context), and how many parallel slots the server exposes. Every knob is a move along one of those axes.

16.2 Free wins - apply to every model#

These cost nothing in quality and little to no speed, so they go into the server macro or every per-model entry.

Flash attention:

-fa on

Fused softmax + tiled matmul. Default-on for supported models on recent CUDA llama.cpp; specifying it makes intent clear. Zero quality impact, reduces memory bandwidth, unlocks the quantized KV path.

Quantized KV cache:

-ctk q8_0 -ctv q8_0

q8_0 is indistinguishable from fp16 in practice (<0.1% perplexity delta). It halves KV-cache VRAM: room to pin more expert layers, longer contexts at the same budget, faster load, same TG speed on bandwidth-bound attention. Verified clean up to 131k tokens at q8/q8.

Unsloth Dynamic quants. Prefer UD-Q4_K_XL or UD-Q4_K_S over plain Q4_K_M. The Dynamic family keeps sensitive tensors (attention projections, embedding, first/last layers) at higher precision and aggressively quantizes the rest, measurably beating same-size quants from other publishers. Between Q4_K_S (K-quant, faster CUDA MMQ kernels, slightly larger) and IQ4_XS (i-quant, ~5% smaller, slightly slower PP, slightly higher quality per byte): on a recent GPU with the MMQ path optimised, Q4_K_S is usually faster.

Build flags that matter when compiling llama.cpp against CUDA:

cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="120" \    # match your GPU's compute capability
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \        # quantized KV cache in flash attention
  -DGGML_CUDA_F16=ON \                  # F16 intermediate precision
  -DLLAMA_CURL=ON

GGML_CUDA_FA_ALL_QUANTS=ON is what enables -ctk q8_0 -ctv q8_0 under flash attention. Without it the server silently falls back to slow paths.

16.3 MoE trick: selective expert offload#

This is where the biggest wins live for >100B MoE models like qwen3.5-122b-a10b, qwen3.5-397b-a17b, glm-4.7:358b, nemotron-3-super:120b, minimax-m2.7:229b.

A Mixture-of-Experts layer has the form:

y = router_gate(x) · sum_{e in top_k_experts(x)} expert_e(x)

where expert_e is a separate FFN block. In qwen3.5-122b-a10b the 122B count is mostly experts; only ~10B are active per token (top-k routing, k=8 across 128 experts in the named config).

llama.cpp lets you override where individual tensors go:

-ot "<regex-on-tensor-name>=<device>"

applied in declaration order (first match wins). A useful pattern:

-ot "blk\.(1[0-5]|[0-9])\.ffn_.*_exps\.=CUDA1"   # experts of layers 0-15 -> GPU1
-ot "blk\.(1[6-9]|2[0-4])\.ffn_.*_exps\.=CUDA0"  # experts of layers 16-24 -> GPU0
-ot ".ffn_.*_exps.=CPU"                          # every remaining expert -> CPU

Non-expert tensors (attention projections, embedding, norm, shared MLP, output head) still obey --split-mode layer --tensor-split 1,2, so attention lives entirely on GPU and only routed expert blocks are CPU-bound.

Why the trade is worth it. At TG time the critical path is attention (GPU) -> router (GPU) -> 8 expert FFN blocks (wherever they live). Moving some experts back onto GPU eliminates PCIe round-trips for those layers on every token, keeps attention fully GPU-resident, and leaves CPU-pinned experts only for layers where there's no GPU room.

On qwen3.5-122b-a10b, moving from 10 to 26 expert layers on GPU pushed prompt processing 138 -> 232 t/s (+68%), token generation 46 -> 55 t/s (+20%), at ~98% VRAM utilisation (54 -> 63 GB of 64).

Finding your own sweet spot:

Start with all experts on CPU (-ngl 999 -ot ".ffn_.*_exps.=CPU").
Pin the first N expert layers to the GPU with the most spare VRAM. Measure PP, TG, GPU memory.
Repeat with N+2 or N+4 until 1-2 GB headroom on that GPU.
Start pinning additional layers on the other GPU, using indices that don't collide with the first rule. Same procedure.
The ceiling is when any GPU gets within ~1 GB of total VRAM at idle (leaves room for compute-buffer jitter during long-prompt PP).

Pin early layers first. Empirically blk.0..N beats a scattered set - early layers have more predictable expert activation, reducing cross-device fetches.

-ngl with -ot: use -ngl 999 (all layers to GPU) together with the -ot regex. -ngl 999 puts every layer on GPU by default, then -ot ".ffn_.*_exps.=CPU" overrides just the expert FFN tensors to CPU. This keeps all attention, shared, and non-expert MLP tensors on GPU for every layer. The legacy alternative (a low -ngl like -ngl 25) is strictly worse: CPU-resident layers pay the PCIe cost for their attention too, not just their experts.

16.4 Context and parallelism: `--ctx-size` and `-np`#

Minimum target: native context at -np 1. Every model has a training context length (262,144 for Qwen3.5; 131k for GLM-4.7; 196,608 for MiniMax-M2.7). Always run at native context with -np 1 at minimum - dropping below native wastes capability and saves no meaningful VRAM once q8/q8 KV is on.

When to prefer -np 2 at 2x native. If the measured single-user speed penalty is <=10%, prefer -np 2 with --ctx-size = 2 x native, so each of two concurrent chats gets the full native window. A topology with a chat UI plus automation plus an agent regularly has two concurrent requests, and the queueing cost of -np 1 shows up in practice.

Worked example on qwen3.5-122b-a10b:

Config	Slots x ctx	PP	TG	GPU0	GPU1
np 1 / ctx 262k / 16+10 expert pin	1 x 262k	240	55.8	31.7	29.6
np 2 / ctx 524k / 16+9 expert pin	2 x 262k	232	54.8	31.5	31.8

-3% PP and -2% TG in exchange for a second concurrent slot at full native context. Per the concurrency-preference rule (<=10% cost -> pick concurrency), np 2 wins.

What -np actually does. With --ctx-size N -np K, llama.cpp partitions the KV arena into K equal slots of N/K tokens; a request always targets one slot. So -np 2 at --ctx-size 262144 gives each slot 131k (a reduction from native). To give each slot native 262k you must set --ctx-size 524288 -np 2. Getting this wrong silently halves the per-request context - the most common misconfiguration.

16.5 Verify: stress test at real workload size#

A small-prompt bench (~40 input, ~150 generated) is not enough. Compute buffers grow during long-prompt processing in a way that is invisible on short prompts, and that delta is exactly what pushes a marginally-tight config over the OOM line in production.

For every tuned model, run at least three stress sizes and capture peak VRAM (sampled via nvidia-smi every ~250 ms): ~20k prompt tokens (a large code file), ~50k (a long RAG response), ~100k+ (near the ceiling). Force a deterministic response length ("write a 300-word analysis") so the TG measurement uses enough generated tokens to be reliable.

Measured data, qwen3.5-122b-a10b np2/524k/16+9:

Prompt (tokens)	PP (t/s)	TG (t/s)	Peak GPU0	Peak GPU1	Notes
40	233	55	31 544	31 826	small-prompt nominal
~20k	451	80*	31 844	29 598	*TG noisy, 2 gen tokens
~50k	436	76*	31 922	29 682	*ditto
~131k	428	39.5	32 052	29 816	300 gen tokens, reliable
~260k (200k target)	-	-	-	-	HTTP 400 (prompt exceeded per-slot ctx)

Peak GPU0 under load is 32 052 / 32 607 MiB - only 555 MiB of headroom - and it does not OOM, because the compute-buffer delta between idle and peak is only 80 MiB on this workload. Thin but stable. TG dropping 55 -> 39.5 as context grows is the expected cost of attention scaling, not a config issue.

What failure looks like: the model loads, short prompts work, VRAM sits fine at idle, then the first 50k+ prompt crashes the process with a CUDA OOM mid-forward; the supervisor restarts it and the next request succeeds (different compute-buffer allocation). If this happens, back off 1-2 expert layers from the tightest GPU and re-run the stress test.

16.6 Full results table - `qwen3.5-122b-a10b`#

All measurements with the persistent models co-resident (qwen3.5-4b, embedding-qwen3-4b, whisper-large-v3 holding ~12 GB on GPU0), a fresh llama.cpp build.

Config	ctx per slot	slots	exp layers on GPU	PP (t/s)	TG (t/s)	GPU0 MiB	GPU1 MiB
OLD `-ngl 25` no FA/quant	131k	1	0	47.5	20.4	30 086	25 602
NEW minimal: FA + q8/q8 + expert=CPU	131k	1	0	105	38	22 273	12 327
+ ctx 262k native	262k	1	0	105	39	22 817	13 415
+ ctx 524k, np 2	262k	2	0	106	39	23 959	15 687
np 2, 4 experts on GPU1	262k	2	4	117	41	24 173	21 255
np 2, 8 experts on GPU1	262k	2	8	128	43	24 173	26 823
np 2, 10 experts on GPU1	262k	2	10	138	46	24 173	29 607
np 1, 16 experts on GPU1	262k	1	16	165	47.5	18 030	29 492
np 1, 16 GPU1 + 6 GPU0	262k	1	22	208	52.7	26 168	29 492
np 1, 16 GPU1 + 10 GPU0	262k	1	26	240	55.8	31 736	29 492
np 2, 16 GPU1 + 9 GPU0 (chosen)	262k	2	25	232	54.8	31 486	31 764

Net result vs the pre-update config: +388% PP, +168% TG, and a full 2x native context with 2 concurrent slots instead of 1 slot at half-native. Same hardware, just flags.

16.7 Testing harness#

Two small scripts on the server drive every probe.

probe122.sh runs a candidate config on a side port (1235), alongside the live llama-swap (1234), so the main stack and its persistent models stay loaded and the VRAM numbers reflect co-resident reality:

#!/bin/bash
# probe122.sh <label> <ctx> [extra-args...]
set -u
LABEL="$1"; CTX="$2"; shift 2; EXTRA="$*"
PORT=1235
MODEL=<path to main gguf shard 1>
MMPROJ=<path to mmproj>
BIN=/opt/llama.cpp/build/bin/llama-server
LOG=/tmp/probe_${LABEL}.log

pkill -f "llama-server --port ${PORT}" 2>/dev/null
sleep 1

nohup "$BIN" --port "$PORT" --no-warmup --jinja --metrics -np 1 -fa on \
  -ctk q8_0 -ctv q8_0 \
  --device CUDA0,CUDA1 --split-mode layer --tensor-split 1,2 \
  -ngl 999 \
  $EXTRA \
  -ot '.ffn_.*_exps.=CPU' \
  -m "$MODEL" --mmproj "$MMPROJ" \
  --ctx-size "$CTX" \
  >"$LOG" 2>&1 &
PID=$!

# wait for /health, bench, kill
for i in $(seq 1 180); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:${PORT}/health)
  [ "$code" = "200" ] && break
  kill -0 "$PID" || { tail -20 "$LOG"; exit 1; }
  sleep 1
done

nvidia-smi --query-gpu=memory.used --format=csv,noheader
PORT_OVERRIDE=$PORT python3 ~/bench_big.py

kill "$PID"; wait "$PID" 2>/dev/null

The -ot regex passed via $EXTRA is placed before the default .ffn_.*_exps.=CPU rule, so first-match semantics promote specific layers to GPU while everything else falls through to CPU.

bench_big.py fires a request with a configurable prompt-token target (via repeated passages), forces a long deterministic response, and samples nvidia-smi in a parallel thread at 250 ms to capture peak VRAM. It reads prompt_per_second / predicted_per_second from the /v1/chat/completions response timings. Env vars: TARGET_TOKENS, MAX_NEW_TOKENS, PORT_OVERRIDE (1234 for llama-swap, 1235 for probe), MODEL.

Workflow for tuning a new model:

Evict the model from llama-swap (pkill -f 'llama-server.*<model-filename>') so the GPU is clean of its previous config.
Run probe<model>.sh with candidate flags on port 1235, co-resident with the persistent models on 1234.
Iterate expert-pin counts until the target GPU gets tight.
Promote the winner to the /etc/llama-swap.yaml entry for that model.
Restart llama-swap once (not per-iteration - watch-config triggers a full persistent-model reload each time). Accept the ~25 s persistent-preload penalty.
Issue one request to the tuned model so it loads under real conditions, then run the stress bench at ~20k / ~50k / ~100k tokens and confirm peak VRAM stays below each GPU's limit.

16.8 What didn't work / isn't worth the complexity#

ngram speculative decoding on free-form analytical prose: zero VRAM cost, zero measurable speed gain - the ngram draft has no useful patterns to predict on chat workloads. MTP (multi-token-prediction) speculative decoding is the version that pays off, and the stack now runs it: the chat models are unsloth *-MTP-GGUF builds with a trained draft head, loaded via --spec-type draft-mtp --spec-draft-n-max 3. A trained head drafts far better than ngram lookup and helps on free-form generation too. The benchmark tables above predate MTP, so treat them as the expert-offload baseline; MTP is an additional gain layered on top.
Going beyond native context via YaRN scaling: KV scaling costs are linear and attention time grows visibly past native. Quality degrades past training length and TG latency at extended context becomes unusable.
-mlock / --no-mmap: pinning weights in RAM fights page-cache eviction on a multi-model swap box. mmap benefits from OS page caching across swaps; forcing it off slows cold loads noticeably.
--split-mode row (tensor-parallel): experimental on two GPUs with no NVLink; PCIe becomes the bottleneck. Layer split is the right choice on this topology.
Reducing context below native to "save VRAM": once q8/q8 KV is on, the marginal gain isn't worth the capability loss. Either keep native or go 2x native with -np 2.

16.9 Checklist for optimising a new model#

Before touching config, know: is it dense or MoE (MoE -> expert offload is the main lever; dense -> KV tuning only), how many transformer layers (bounds the -ot regex), the native training context (sets the --ctx-size minimum), the available quant family (prefer Unsloth Dynamic), and the gguf path pattern on the server.

Then, re-measuring after each step:

Macro flags: -fa on -ctk q8_0 -ctv q8_0.
Native context at -np 1.
For MoE: all experts to CPU, then pin in layers until the GPU with most headroom is at 1 GB free.
For MoE: pin additional layers on the other GPU until it's also at 1 GB free.
If single-user cost is <10%, double context and switch to -np 2.
Stress test at 20k / 50k / ~80% native prompts. Confirm peak VRAM under load stays below each GPU's ceiling.
Promote to the llama-swap config, restart once, verify swap-in and swap-out both work without errors.

17. Memory and self-learning#

Serving tuning makes the stack fast. This layer makes it improve by being used. Three kinds of durable memory, all built from my own input, all plain files on disk under ~/.config/opencode/.

Kind	What it holds	How it loads
`memory/`	facts: who I am, a project's constraints, decisions in no repo	an index file loaded into every session
`traits/`	one-line corrections: `WHEN a search 403s, DO retry via the extract tool`	the trait index is always in the system prompt
`learned-skills/`	reusable multi-step procedures composed from existing tools	auto-injected when the next message matches the trigger

Who writes them#

The primary model will not stop mid-task to record a lesson - it has a job to do. So a separate cheap model does it. After a session that did real work goes idle, the resident 4B reviews the tool sequence and records ONE thing: a learned-skill if the run was clean, a trait if a step failed and got recovered. The big model does the work, the small model writes down what worked. This runs at no swap cost, because the 4B is one of the always-resident models alongside the warm slot.

The promotion ladder#

fact         --memory_remember-->  memory/<slug>.md         (loaded every session)
correction   --reflector-------->  traits/<slug>.md         (index always in prompt)
procedure    --reflector-------->  learned-skills/<slug>/    (auto-injected on match)
proven skill --promote (mv)----->  skills/<slug>/           (a permanent skill)

A fact becomes a memory, a correction becomes a trait, a working procedure becomes a learned-skill, and a learned-skill that keeps proving itself gets promoted by hand to a permanent skill. The store is capped - a few dozen traits, a handful of learned-skills - so it stays curated, not a junk drawer.

Recall: semantic layer + knowledge graph#

On top of the flat files sits an optional semantic layer and a knowledge graph. The semantic layer indexes the memory, traits, and learned-skills into one searchable store; recall is automatic per turn, gated so an ordinary coding turn pulls in nothing and only a real match surfaces. If that layer is absent, everything degrades to a flat-file substring scan - the plain files are the source of truth, the index is a convenience.

The knowledge graph is built only from my own words: my notes and the things I have actually typed, never the assistant's output. A pipeline extracts (subject, predicate, object) triples, dedupes them, and normalises name variants so they collapse to one entity. The agent reads relationships from the graph during a conversation and writes new ones back as it learns them, so the graph keeps growing from use.

The point of the whole layer: the stack gets better at my work by running my work. A correction I make once sticks, because the stack wrote it down on its own.