The full writeup behind Self-hosted LLMs and the context discipline that makes them work. Internal specifics are scrubbed; everything here is generic enough to run on your own gear.
A GPU host behind a gateway, running LLM inference, web search, scraping, embeddings, and speech-to-text with zero external API dependency. On top of it: a single-slot model swap, model tiering, a handful of narrow agents, subagents with isolated context, skills that wrap an API as one tool call, and a memory layer that learns from its own runs. The discipline matters more than the model size, and the serving methodology near the end is what makes that discipline affordable on hardware you own.
Table of Contents#
- Overview
- Infrastructure
- llama-swap: model server
- Open WebUI: API gateway
- Firecrawl: scrape + AI extraction
- SearXNG: metasearch
- The core idea: clean contexts beat big models
- Five concepts that make it work
- Model tiering
- OpenCode: the agent runtime
- Custom tools
- The research pipeline
- Skills: one tool call over an API
- Provider configuration
- Lessons learned
- Serving optimization methodology
- Memory and self-learning
- Sources
1. Overview#
A single server runs every AI service I touch: LLM inference, embeddings, speech-to-text, web search, web scraping, and a unified API gateway in front of all of it. Everything self-hosted, no external API dependency.
Clients (OpenCode, browser, automation, scripts)
|
v <llmbox>
+--------------------------------------------------+
| |
| Open WebUI :8080 ----+ |
| (API gateway + UI) | |
| v |
| Firecrawl :443 ----> llama-swap :1234 |
| (scrape + extract) (model server) |
| | |
| SearXNG :8888 +-> llama-server instances |
| (metasearch) | (one per loaded model) |
| | |
| +-> GPU 0 |
| +-> GPU 1 |
+--------------------------------------------------+The interesting part of working with language models in 2026 is not the model. It is the discipline you wrap around it, and that discipline is far easier to build and inspect on a stack you own end to end. The hardware is the boring part. The rest of this doc is the layers on top: the single-slot constraint that shapes the design, model tiering, the narrow agents built around it, subagents with isolated context, skills, the serving tuning that makes it fit in VRAM, and a memory layer that learns from its own runs.
2. Infrastructure#
The reference machine in this doc is a single multi-GPU box, called <llmbox> throughout. The exact CPU, RAM, and SSD models are not load-bearing - any box with two 24-32 GB consumer GPUs, a large amount of system RAM, and a fast NVMe for model weights will run the same stack. The numbers below are the shape of one working build.
Compute#
| Component | Detail |
|---|---|
| CPU | high-core-count workstation/server class |
| RAM | 256 GB (typical usage ~20 GB for OS + Docker, the rest available for model layers and page cache) |
| GPU 0 | NVIDIA, 32 GB VRAM, PCIe Gen5 x16 |
| GPU 1 | NVIDIA, 32 GB VRAM, PCIe Gen5 x16 |
| Total VRAM | 64 GB |
Both GPUs hang off the CPU PCIe host bridge (PHB topology), no NVLink. Models are split across the two cards using llama.cpp's layer splitting (--split-mode layer --tensor-split 1,7). With the worker tier loaded, GPU 0 holds the persistent small models plus a slice of the big model; GPU 1 holds the bulk of the big model layers.
Storage#
Dedicate a fast NVMe to model weights. Layout that works:
/(root, ~2 TB): OS, Docker images and volumes, configs./var/lib/llama(separate fast NVMe, PCIe Gen5, ~12 GB/s read): model weights on an LVM volume. The read speed matters because every hot-swap reloads weights from disk.
Disk usage on the reference box: model weights ~840 GB under /var/lib/llama/models/, embedding weights ~7 GB under /var/lib/llama/embed/, Docker images ~45 GB on root.
Network#
A single primary interface on the LAN is all that is needed. Throughout this doc the box's address is written <llmbox>; on the reference build it is a static LAN IP in the documentation range 192.0.2.0/24. Docker uses its default 172.17.0.0/16 bridge.
OS#
Ubuntu LTS, a recent kernel, and a matching CUDA driver. Nothing distro-specific in the stack; it runs the same on any glibc Linux with working NVIDIA + Docker.
3. llama-swap: model server#
How it works#
llama-swap sits in front of llama.cpp's llama-server. A YAML file maps model names to launch commands. When a request arrives for a model:
- If that model is already loaded -> route directly to its process.
- If a different model from the same exclusive group is loaded -> unload it, start
llama-serverfor the requested model, wait for it to load into VRAM, then route. - Persistent models (embeddings, whisper, the small utility model) stay loaded permanently alongside whatever big model is current.
Request: model="qwen3.5-122b-a10b"
|
v
llama-swap (:1234)
+-- Currently loaded: qwen3.6-27b (big group, exclusive)
+-- Requested: qwen3.5-122b-a10b (same group)
+-- 1. Unload 27B process
+-- 2. Start llama-server for 122B with the right GPU config
+-- 3. Wait for model to load into VRAM
+-- 4. Route request
v
Response (swap took ~10-15s, subsequent requests instant)Model groups#
groups:
big: # Hot-swapped - only one loaded at a time
swap: true
exclusive: true
members:
- qwen3.6-27b # + -fast, -uncensored - the default worker
- qwen3.6-35b-a3b # + -fast
- qwen3.5-122b-a10b # + -fast, -uncensored - reason escalation
- qwen3.5-397b-a17b # + -fast - deep escalation
- gemma4-31b
- glm-5-1
- minimax-m2.7-229b
- nemotron3-super-120b # + -fast
- step3.5-flash # + -fast
- reranker-qwen3-8b
- reranker-bge-v2-m3
small: # Persistent - always loaded
persistent: true
members:
- qwen3.5-4b
embeddings: # Persistent - always loaded
persistent: true
members:
- embedding-bge-m3 # default: small, multilingual, best quality-to-size
- embedding-qwen3-4b
- embedding-qwen3-8b
audio: # Persistent - always loaded
persistent: true
members:
- whisper-large-v3Startup preload: the default worker plus the three persistent models (small utility, an embedder, whisper).
Available models#
| Model | Type | Active params | Tier / use |
|---|---|---|---|
qwen3.5-4b | Dense | 4B | utility: page extraction, classification, the reflector |
qwen3.6-27b | Dense | 27B | the warm worker tier (default), + -fast / -uncensored |
qwen3.6-35b-a3b | MoE | 3B active | faster worker variant, + -fast |
qwen3.5-122b-a10b | MoE | 10B active | reason escalation, + -fast / -uncensored |
qwen3.5-397b-a17b | MoE | 17B active | deep escalation, + -fast |
gemma4-31b, glm-5-1, minimax-m2.7-229b, nemotron3-super-120b, step3.5-flash | various | alternates on the swap menu | |
embedding-bge-m3 | Embed | ~0.6B | embedder of choice: small, multilingual, best quality-to-size |
embedding-qwen3-4b / -8b | Embed | 4B / 8B | higher-dimension embedding options |
reranker-bge-v2-m3 / reranker-qwen3-8b | Rerank | - | result reranking |
whisper-large-v3 | Audio | - | speech-to-text |
The chat models run from unsloth *-MTP-GGUF builds (Unsloth Dynamic UD-Q4_K_XL) with multi-token-prediction speculative decoding (--spec-type draft-mtp) and a vision projector (--mmproj), so they are multimodal and draft-accelerated. The -fast variants are the same weights with thinking disabled (enable_thinking: false) and adjusted sampling (temp 0.7, top_p 0.8); subagents use them. -uncensored variants exist where a primary needs them.
The MoE letter math is total params - active params. A 122b-a10b activates 10 billion parameters per token while holding 122 billion. Memory holds the whole model, compute touches only the experts the router picks. That is the trick that makes the big tiers affordable on this hardware.
GPU memory layout (worker tier loaded)#
GPU 0 (32 GB):
qwen3.5-4b (persistent) .......... ~3 GB
embedding (persistent) ........... ~2 GB
whisper-large-v3 (persistent) .... ~2 GB
qwen3.6-35b-a3b (partial layers) . ~4 GB (tensor-split 1,7 = 1/8 of model)
KV cache + overhead .............. ~4 GB
Free ............................. ~17 GB
GPU 1 (32 GB):
qwen3.6-35b-a3b (main layers) .... ~24 GB (7/8 of model)
KV cache + overhead .............. ~5 GB
Free ............................. ~3 GBKey configuration#
# /etc/llama-swap.yaml (excerpt)
defaults:
server: /opt/llama.cpp/build/bin/llama-server --port ${PORT}
--no-warmup --jinja --metrics -np 1
whisper: /opt/whisper.cpp/build/bin/whisper-server --port ${PORT}
proxy: http://127.0.0.1:${PORT}
qwen3.6-27b:
cmd: |
${server}
--device CUDA0,CUDA1 --split-mode layer --tensor-split 1,1
-m ${MODEL_BASE}/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf
--mmproj ${MODEL_BASE}/unsloth/Qwen3.6-27B-MTP-GGUF/mmproj-F16.gguf
--spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-n-min 0
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0
--ctx-size 262144
proxy: ${proxy}*-MTP-GGUF is the key part: an Unsloth Dynamic build with a trained multi-token-prediction draft head baked in. --spec-type draft-mtp runs that head as the speculative drafter, which (unlike ngram drafting) measurably speeds up free-form generation. --mmproj loads the vision projector, so the model is multimodal.
llama-swap runs as a native systemd service (not Docker) for direct GPU access:
/usr/bin/llama-swap -config /etc/llama-swap.yaml -watch-config -listen 0.0.0.0:12344. Open WebUI: API gateway#
External clients (OpenCode, browser, scripts)
|
v
Open WebUI (<llmbox>:8080)
+-- /api/chat/completions -----> llama-swap :1234
+-- /api/embeddings -----------> llama-swap :1234 (embedding models)
+-- Web UI for direct chat
+-- User management, conversation history
+-- Model aliases and routingOpen WebUI is the single entry point for all LLM interaction. OpenCode, automation, and custom scripts connect to it as an OpenAI-compatible API; it routes requests to llama-swap internally and adds user management, conversation history, and model routing. It is reachable publicly at chat.archworks.co, which points its OPENAI_API_BASE at the local llama-swap.
Docker image: ghcr.io/open-webui/open-webui:latest, port 8080.
5. Firecrawl: scrape + AI extraction#
Docker stack#
| Container | Role | Port |
|---|---|---|
| firecrawl-api | Main API service | :443 -> 3002 |
| firecrawl-playwright | Headless Chromium for JS rendering | internal |
| firecrawl-redis | Job queue | internal |
| firecrawl-postgres | Job metadata | internal |
| firecrawl-rabbitmq | Task orchestration | internal |
Endpoints#
| Endpoint | What it does |
|---|---|
POST /v1/scrape | Scrape URL -> clean markdown |
POST /v1/extract | Scrape + LLM extraction -> structured JSON |
POST /v1/search | DuckDuckGo search via headless browser |
POST /v1/crawl | Crawl an entire site (async) |
POST /v1/map | Discover URLs on a domain |
LLM integration#
The /v1/extract endpoint uses the local 4B model for intelligent extraction:
Client -> <llmbox>:443/v1/extract
-> Firecrawl scrapes the page with Playwright
-> Passes content + extraction prompt to llama-swap (host.docker.internal:1234)
-> llama-swap routes to qwen3.5-4b (persistent, instant)
-> Returns structured JSON with only the requested factsConfig in docker-compose.yml:
OPENAI_API_KEY: "sk-local"
OPENAI_BASE_URL: "http://host.docker.internal:1234/v1"
MODEL_NAME: "qwen3.5-4b"6. SearXNG: metasearch#
SearXNG aggregates results from multiple providers into one response, on <llmbox>:8888.
| Engine | Status | Notes |
|---|---|---|
| Brave | Primary | Best quality, rate-limits under heavy automated use |
| Qwant | Primary | Strong for European content |
| DuckDuckGo | Enabled | Connection issues from some datacenter IPs |
| Enabled | Silent failures from datacenter IPs (consent page) | |
| Startpage | Enabled | CAPTCHA from datacenter IPs |
| Bing | Disabled | Consistently poor results |
# /etc/searxng/settings.yml
search:
default_lang: all # Language set per-query by clients
ban_time_on_fail: 30 # Fast recovery (default: 180s)
max_ban_time_on_fail: 120 # Suspension cap (default: 600+s)
outgoing:
request_timeout: 30.0
enable_http2: true
pool_connections: 100Valkey (a Redis fork) stores engine-suspension state. Flush it when engines get stuck banned:
valkey-cli -s /run/valkey/valkey.sock FLUSHALLDocker compose#
All Docker services live in one compose file (/opt/ai-stack/docker-compose.yml):
cd /opt/ai-stack
docker compose up -d # start all
docker compose up -d firecrawl-api # restart a single service
docker compose logs -f open-webui # follow logs
docker compose ps # status7. The core idea: clean contexts beat big models#
A naive way to use an LLM for research: hand it a question, let it search the web, scrape ten pages, dump the content into context, ask for a report. By page seven it has forgotten what it read for page two. By the report it is inventing quotes that appeared in none of the sources.
This is not a small-model problem. Frontier models hallucinate under the same load. A model with a polluted context performs worse than a smaller model with a clean one. The shape of the problem is attention dilution: the model spreads its limited attention across everything in context, and the more noise you stuff in, the less attention each fact gets.
The fix is not bigger models. The fix is smaller contexts.
The mental model is a filing cabinet.
BAD: one person, 50 documents, 10 questions
= overloaded, mixes details, forgets, starts guessing
GOOD: 10 people, 5 documents each, 1 question each
+ 1 analyst combining the answers
= focused, accurate, verifiableThe senior analyst is the orchestrator model. The ten people are worker models, each with a clean desk. That is what context discipline is: many workers with clean desks, one orchestrator combining their findings.
8. Five concepts that make it work#
8.1 Orchestrated delegation#
One agent plans the work and combines the results. The workers it spawns execute focused subtasks. Because the backend keeps one big model warm at a time (see section 3), the coordinator and its workers run on the SAME model - a subagent inherits the model of the agent that called it. The win here is a clean context per worker, not a bigger brain on top.
You: "Research complex topic X"
|
v
+--Coordinator (the warm worker model)----+
| "I'll break this into 12 questions" |
+--+------+------+------+------+------+----+
v v v v v v
[Worker][Worker][Worker][Worker][Worker][Worker]
(same warm model, inherited - not a separate tier)
Each worker:
- Gets ONE specific question
- Starts with a clean context
- Searches, reads 3-5 pages
- Writes findings with source URLs
- Returns ONLY the distilled findings
Coordinator:
- Receives 12 concise answers (not 12 raw page dumps)
- Checks for gaps, retries failed topics
- Verifies critical claims
- Writes the final reportThe coordinator does the synthesis, the workers do focused retrieval, and neither is overwhelmed. Escalating to a bigger model is a separate, deliberate step (see section 9), not something that happens inside one fan-out: a swap-bound backend cannot run a 122B coordinator and a fleet of smaller workers at the same moment. The gain holds at any scale - context isolation improves attention quality even on frontier models.
8.2 Context isolation#
The single biggest quality improvement. Each subtask gets its own context. Nothing leaks between workers.
WITHOUT isolation (one agent does everything):
+--------------------------------------------------+
| User question 500 tokens |
| Page 1 scrape 8,000 tokens |
| Page 2 scrape 12,000 tokens |
| Page 3 scrape 6,000 tokens |
| Search results 2,000 tokens |
| Page 4 scrape 15,000 tokens |
| Page 5 scrape 9,000 tokens |
+--------------------------------------------------+
= 52,500 tokens of accumulated noise
= Attention spread thin across everything
= Quality of final answer: LOW
WITH isolation (workers have separate contexts):
Worker 1 context: Worker 2 context:
+------------------+ +------------------+
| Question 200 tk | | Question 200 tk |
| Extract 800 tk | | Extract 1,200 tk |
+------------------+ +------------------+
= fully focused = fully focused
Orchestrator context:
+---------------------------+
| Original question 500 tk |
| Worker 1 findings 400 tk | <- distilled, not raw pages
| Worker 2 findings 600 tk | <- distilled, not raw pages
+---------------------------+
= 1,500 tokens of clean signal
= Quality of final answer: HIGHThe total compute is the same. The quality is not. It works because of three failure modes the isolation avoids:
- Attention dilution: models spread attention across everything; less noise means better focus.
- Retrieval failure: key facts from early in a long context get buried; a short context loses nothing.
- Instruction drift: models gradually forget the original task as context grows; a fresh context keeps the task clear.
8.3 Extract, don't dump#
When a worker needs a fact from a web page, don't put the whole page in its memory. Use a tiny model to extract just the relevant facts first.
RAW SCRAPE (wasteful): UTILITY EXTRACTION (efficient):
+---------------------------+ +---------------------------+
| Navigation menu | | Fine amount: 726 EUR |
| Cookie banner | | Repeat offense: 2,180 EUR |
| *** Actual content *** | | Source: example.com |
| Footer / ads / comments | +---------------------------+
+---------------------------+ = 200 tokens
= 8,000 tokens (95% noise)A small 4B model reads the page and returns only what was asked for. The worker (35B) never sees the noise. This compounds: five extracted pages cost a thousand tokens of worker context; five raw pages would cost forty thousand. That compounding is the difference between fitting in context and not. The total compute is the same; the quality is not.
8.4 Verification pipeline#
Models confidently state wrong facts. They cite wrong sources, mix up jurisdictions, and invent information when they can't find the real thing. A single-pass pipeline has no way to catch this.
The fix: a verification step that checks claims against their sources.
Research output: "The fine is 726 EUR (Source: example.com/law)"
|
v
Verify Agent
+-----------+-----+-----------+
| | |
Visit URL Compare Rate it
example. "claim says CONFIRMED /
com/law 726 EUR, DEBUNKED /
page says NOT FOUND
726 EUR"
|
v
CONFIRMED - claim matches sourceWhat this catches in practice: wrong legal citations (a paragraph from the wrong jurisdiction), URLs that don't contain the claimed information, numbers the model fabricated when search was down, and paraphrased text presented as exact quotes.
8.5 Graceful degradation#
Things fail. Search engines rate-limit. Pages go down. Workers use up their budget. The system must handle this without collapsing or lying.
Search engine down?
-> Automatic fallback to alternative search backend (different code path)
Worker returns empty?
-> Orchestrator retries with different keywords
-> After retry fails: marked "NOT VERIFIED" (honest)
Worker out of step budget?
-> Writes partial findings + lists remaining questions
-> Orchestrator delegates the rest to a fresh worker
All search backends down?
-> Worker reports "search unavailable"
-> Does NOT generate content from training data
-> Honest gap > confident hallucinationThe key rule: an honest "I couldn't find this" is ALWAYS better than a confident fabrication. The whole point of building a research pipeline is to be more trustworthy than the model talking to itself. Systems that degrade gracefully are trustworthy; systems that fill gaps silently are dangerous.
Measured impact#
A complex 20-topic research task, single-agent vs multi-agent on the same models and the same search tools:
| Metric | Single agent | Multi-agent |
|---|---|---|
| Topics covered | ~15% | 75% |
| Verifiable sources | 0-2 | 23 |
| Hallucinations | 5+ per report | Caught by verify |
| Useful output | Muddled summary | Structured report with citations |
The only difference was how the work was organized: many focused workers with clean context, vs one overloaded agent with polluted context.
9. Model tiering#
The single-slot constraint shapes the tiers. One big model is warm at a time, so the default driver and the subagents it spawns share that warm model - a subagent inherits the caller's model rather than picking its own. Reaching for a bigger model means a deliberate swap, so the big tiers are escalation-only.
| Tier | Model | Thinking | Role |
|---|---|---|---|
| Utility | 4B (resident) | off | page extraction, classification, the reflector that writes memory |
| Worker | 27B (the warm slot) | off | the default driver AND the workers it spawns (inherited) |
| Reason | 122B MoE | on | escalation: only after the worker has failed twice on its own |
| Deep | 397B MoE | on | last resort: only after reason came back and was verified wrong |
The worker tier is the default. It runs the primary agents and the read-only subagents they fan out to. Thinking off, because for retrieval and most coding the reasoning monologue burns context without improving the result.
The utility tier is the always-resident 4B. It runs alongside the warm slot at no swap cost and does the cheap mechanical jobs: turn a raw HTML page into the three facts a worker asked for, classify, and (see section 17) review a finished session and write down what worked.
The reason and deep tiers are MoE escalation. You do not reach for them first - each step up costs a swap. A 122b-a10b activates only 10B parameters per token while holding 122 billion: memory holds the whole model, compute touches only the experts the router picks, which is what makes the big tiers affordable on this hardware.
When delegation pays off#
The whole pipeline is overhead. For a simple question it loses to a single model with a clean prompt.
| Situation | Pattern | Why |
|---|---|---|
| Complex research, many subtopics | Full pipeline: delegate + isolate + verify | Each subtopic gets clean context, claims get checked |
| Code review / audit | Orchestrator + focused reviewers | Each reviewer checks one aspect |
| Data analysis | Orchestrator + extraction workers | Each worker processes one source |
| Simple Q&A | Single model, no delegation | The overhead is not worth it |
| One-page lookup | Single worker with extract tool | No orchestration needed |
Rule of thumb: more than 3-5 documents to read, or more than 3 questions to answer, delegation pays. Below that, one model with clean prompting wins.
10. OpenCode: the agent runtime#
OpenCode is a TUI-based agent runtime. You select an agent in a terminal; the agent uses tools to accomplish a task and can spawn subagents, which is what makes the multi-agent delegation pattern real.
~/.config/opencode/
opencode.json # provider config, model menu, per-agent tool permissions
agents/ # agent definitions (markdown files)
# primaries (you pick one in the TUI; each refuses off-topic work):
chat.md # knowledge Q&A, comparisons, recommendations
build.md # code, refactors, infrastructure-as-code
research.md # sourced multi-step research
ops.md # live infra, remote troubleshooting, runbooks
student.md # coursework, math, write-ups
# escalation tiers (called only after the cheap model fails):
reason.md # 122B - subtle bugs, race conditions, trade-offs
deep.md # 397B - last resort
# read-only subagents (inherit the caller's model):
web-researcher.md verify.md plan.md review.md researcher.md ...
tools/ # custom tools (TypeScript) - see section 11
plugins/ # lifecycle hooks: memory, traits, reflect, auto-format
traits/ # one-line learned corrections (auto-loaded)
learned-skills/ # multi-step procedures the stack wrote itself
memory/ # flat-file facts + index, loaded every sessionAgent definitions#
Every agent is a markdown file with YAML frontmatter:
---
description: What this agent does (shown in the agent picker)
mode: primary # primary = user-facing, subagent = only called by others, all = both
model: openwebui/qwen3.6-27b
steps: 50 # maximum actions before forced stop
permission:
edit: deny # block file editing
bash: deny # block command execution
websearch: deny # block built-in web search (use the custom search tool)
---
The rest is the agent's system prompt: instructions, rules,
workflow steps, output format.Agent modes#
| Mode | Who can use it | Example |
|---|---|---|
primary | User selects it in the TUI | research, build, review |
subagent | Only other agents can spawn it | web-researcher, verify |
all | Both user and other agents | plan |
Model assignment#
Primaries pin the warm worker model. Read-only subagents leave the model unset and inherit the caller's, so a fan-out never swaps mid-run. Only the escalation tiers pin a bigger model on purpose:
chat.md / build.md / research.md / ops.md -> openwebui/qwen3.6-27b (the warm slot)
web-researcher.md / verify.md / plan.md -> (unset, inherits the caller's model)
reason.md -> openwebui/qwen3.5-122b-a10b (deliberate swap)
deep.md -> openwebui/qwen3.5-397b-a17b (deliberate swap)Subagents run thinking-disabled, because OpenCode returns subagent results via assistant-message prefill, which is incompatible with llama.cpp's thinking mode.
Step budgets#
steps: 50 is the hard maximum. The system prompt sets a softer budget:
## HARD LIMITS
You have a budget of about 15-20 tool calls. After around 15, start wrapping up.
By 20 at most, you MUST stop and write your findings as text.
CRITICAL: Your findings are ONLY returned to the orchestrator if you write
them as a text message. Tool call results are NOT forwarded. If you never write
a text response, ALL your work is lost.The gap between soft budget (20) and hard limit (50) gives the model 30 steps to write its findings even if it overshoots the tool-call budget. Without this buffer, workers consume all steps on tool calls and return nothing.
11. Custom tools#
Tools extend what agents can do. Three power the research pipeline and get the detail here; the stack ships about two dozen more, grouped at the end of this section. Each tool is denied globally and re-enabled per agent in opencode.json, so an agent only holds the tools its job needs.
search.ts - web search#
SearXNG (primary) -> Firecrawl /v1/search (fallback)- Queries SearXNG first (Brave, Qwant, DDG, Google).
- If 3+ SearXNG engines are down, falls back to Firecrawl's built-in DuckDuckGo search.
- 2-second rate limiter between calls to prevent engine exhaustion.
- Language parameter per query (
de-AT,pl-PL, etc.). - When all backends fail, returns "Do NOT retry" to stop the agent from spiraling.
extract.ts - AI-powered extraction#
Firecrawl /v1/extract (LLM extraction) -> scrape + truncation (fallback)- Sends URL + extraction prompt to Firecrawl.
- Firecrawl scrapes the page, passes it through the 4B model, returns structured JSON with only the requested facts.
- If LLM extraction fails, falls back to a raw scrape truncated to 16K characters.
- This is the key context-management tool: returns 500 tokens instead of 8,000.
extract("https://example.com/...", "Extract fine amounts for registration violations")
Returns:
{
"penalties": [
{"offense": "First offense", "fine": "up to 726 EUR"},
{"offense": "Repeat offense", "fine": "up to 2,180 EUR"}
]
}vs a raw scrape returning 20,000 characters of full text plus navigation and footers.
scrape.ts - raw page scrape#
Direct Firecrawl scrape, returns the full page as markdown (up to 15K chars). Used only when extract fails, or when the agent needs to browse a page without knowing what to look for.
The rest, grouped#
Beyond the research three, the stack adds tools that make agents capable and deterministic. Each is denied globally and re-enabled per agent.
| Group | Tools | Purpose |
|---|---|---|
| Web | crawl, map, mirror, scrape_html, har_parse | recursive crawl, URL discovery, single-file or whole-site mirror, post-JS raw HTML, HAR API reverse-engineering |
| Vision | vision_describe, screenshot, screenshot_local | describe an image via a vision model, screenshot a URL or a local static site headlessly |
| Determinism | resolve_path, json_query, backup_before, wait_for | fuzzy-resolve a partial path, jq over JSON, timestamped backup before a destructive change, poll-until-true instead of sleep |
| Math (student) | math_eval, math_symbolic, math_solve, math_linalg, math_numerical, math_stats, math_plot | numeric eval, SymPy, equation / linear-algebra / numerical solving, stats, ASCII plots |
The determinism group exists because a model left to write one-liners reaches for fragile shell: sleep 5 blocks the runtime, python -c breaks on odd input, rm runs before any backup. Each tool replaces a class of flaky improvisation with one call that behaves the same every time.
12. The research pipeline#
The most complex workflow. It demonstrates orchestration, context isolation, verification, and graceful degradation working together.
research.md (coordinator, warm model)
+---> plan.md (optional, builds a detailed plan)
+---> web-researcher.md x N (workers, inherited model)
| Each answers ONE question
| Tools: search, extract, scrape
+---> verify.md (fact-checker, inherited model)
| Checks claims against source URLs
| Tools: search, extract, scrape
v
Final report with citations and verification verdictsStep 1 - parse the request. The orchestrator reads the message line by line, extracting every issue, entity, and constraint. The prompt explicitly requires cross-checking so nothing gets missed.
Step 2 - build 12-20 narrow research questions. Not broad topics ("research social welfare") but specific questions: "What are the income thresholds for benefit X in 2026?", "What penalties exist for misrepresentation on application Y?", "What alternatives exist if someone is excluded from benefit Z?"
Step 3 - spawn all workers in parallel. Each web-researcher gets ONE question, starts with a clean context, and has the search/extract/scrape tools.
Worker lifecycle:
1. Search (1-2 calls)
2. Extract from 3-5 result URLs
3. Reason about source quality and consistency
4. Write findings with citations
5. Return text to orchestrator (tool outputs discarded)Step 4 - gap analysis. Did every subtopic get answered? For gaps, spawn new workers with different keywords. After one failed retry, mark as "NOT VERIFIED".
Step 5 - verification (mandatory). The verify agent receives the 5-10 most critical claims with their source URLs, visits each URL, compares the claim against actual page content, and rates each CONFIRMED / PARTIALLY CONFIRMED / DEBUNKED / UNVERIFIABLE.
Step 6 - synthesize. The orchestrator combines verified findings into the final report. Debunked claims are marked; unverified items are listed honestly.
What goes wrong without this#
| Problem | Cause | Solution |
|---|---|---|
| Worker returns empty | Used all steps on tool calls | Step budget (soft 20, hard 50) |
| Wrong law cited | Model hallucinated from training data | Verify agent checks source URLs |
| Worker forgets first 5 of 10 pages | Context overflow from raw page dumps | Extract tool returns structured JSON |
| All engines rate-limited | Too many rapid queries | 2s rate limit + Firecrawl fallback |
| Worker keeps retrying failed search | Ignores "stop" instructions | Search tool returns "Do NOT retry" |
| Orchestrator skips verification | Prompt said "optional" | Changed to "MANDATORY" |
13. Skills: one tool call over an API#
Subagents handle the process-discipline side. Skills handle the API-surface side. Together they let an agent touch real systems without drowning in plumbing.
A skill is a small documented capability with a clear contract. Three layers per skill:
+-------------------+ +-------------------+ +-------------------+
| SKILL.md | | AGENT.md | | Python CLI |
| When to use | --> | API reference | --> | Makes the calls |
| Workflow steps | | Field IDs | | Handles auth |
| Output format | | Rate limits | | Parses responses |
+-------------------+ +-------------------+ +-------------------+- SKILL.md is the public face. The orchestrator reads it to decide when and why to use the skill.
- AGENT.md is the reference layer. The executing agent reads it to know how to make correct API calls: API IDs, custom field names, endpoints. Loaded only when the skill is invoked.
- Python CLI handles auth and transport. The model never sees credentials.
Directory structure#
~/.claude/skills/
jira/
SKILL.md # "Use this skill for ticket management..."
AGENT.md # "POST /rest/api/2/issue, fields: {project: {key: ...}}"
jira.py # subprocess.run(["curl", ...]) with auth headers
config.json # {"url": "...", "token": "..."} - never shared
calendar/
SKILL.md
AGENT.md
calendar.py
config.jsonEach skill is a folder. Each opens with a short description so the agent knows when to reach for it, and ends with one or two example invocations so the agent has a working template. A practical set covers the things you touch daily: a ticket tracker, email, a calendar, a wiki, a time-tracking system, a tmux driver, a notes API.
Example: creating a ticket#
User: "Create a ticket for the DNS migration"
1. skill-runner reads jira/SKILL.md
-> Determines this is a "create" operation
-> Reads jira/AGENT.md for field mappings
2. skill-runner constructs the command:
python3 jira.py create \
--project OPS \
--type Task \
--summary "DNS migration to new provider" \
--labels infra,dns
3. jira.py:
- Reads config.json for URL + token
- POST /rest/api/2/issue with the correct payload
- Returns: "OPS-1234 created"
4. skill-runner reports back: "Created OPS-1234"The model constructs the command; the Python script handles auth. Credentials never enter the model's context. The agent's whole view of "send a ticket" or "send an email" or "create a calendar event" is one tool call: a function that takes a small dict and returns a small dict.
The gain is concrete. Prompts dropped from "search for the ticket about deploying X, paste in the ticket ID, here is my username, here is the API token format" to "create a ticket about deploying X". Less plumbing in context, more attention left for the actual task.
14. Provider configuration#
OpenCode connects to backends via providers in opencode.json:
{
"provider": {
"local": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local LLM",
"options": {
"baseURL": "http://127.0.0.1:1234/v1",
"apiKey": "no-key"
},
"models": {
"qwen3.6-27b": { "name": "qwen3.6-27b" },
"qwen3.5-122b-a10b": { "name": "qwen3.5-122b-a10b" }
}
},
"openwebui": {
"npm": "@ai-sdk/openai-compatible",
"name": "Open WebUI",
"options": {
"baseURL": "https://chat.archworks.co/api",
"apiKey": "<redacted>"
},
"models": {
"qwen3.5-122b-a10b": { "name": "qwen3.5-122b-a10b" }
}
}
}
}Agents reference models as provider/model-name (e.g. openwebui/qwen3.5-122b-a10b).
Two providers, same kind of backend. openwebui is the gateway every agent uses, so a subagent inherits the exact provider/model string of its caller and the warm slot is never fought over mid-run. The direct local provider points at a llama-swap on the workstation, kept for manual model-switching and offline use rather than wired into any agent.
15. Lessons learned#
Each of these was discovered through a specific failure.
| Lesson | Discovery |
|---|---|
Use -fast (thinking disabled) for all subagents | Thinking-enabled models reject assistant prefill, causing empty returns |
| Steps must be >> tool-call budget | Workers consumed all steps on tool calls, leaving none for writing findings |
| "MANDATORY" not "optional" for verify | The model skipped verification every time it was described as optional |
| Extract (structured JSON) not scrape (raw markdown) | 5 raw scrapes = 50K tokens of noise, 5 extracts = 2K tokens of signal |
| Search tool must say "Do NOT retry" | Workers retried failed search 13+ times, exhausting all engines |
| Orchestrator must cross-check plan against user input | Broad planning missed specific details in the request |
| Subagent must include "write partial + list remaining questions" | Workers that couldn't finish returned nothing instead of partial results |
| Rate limit in the search tool, not just the prompt | Models ignore "wait 2 seconds" instructions; the tool enforces it in code |
| Two independent search backends | SearXNG and Firecrawl search use different code paths; they don't fail simultaneously |
16. Serving optimization methodology#
How to squeeze the most context and tokens-per-second out of a llama.cpp + llama-swap stack on two 24-32 GB consumer GPUs (64 GB VRAM total on the reference box). The methodology was developed while tuning the Qwen3.5 family and generalizes to any large dense or MoE model too big for full GPU residency.
16.1 What you're actually optimizing#
For a single forward pass, the critical-path cost has three parts:
- Attention over the KV cache - memory-bandwidth bound. Dominates token generation (TG) for long contexts.
- Feed-forward (FFN / MoE expert) compute - compute-bound on GPU, PCIe-bound if experts live on CPU and their activations round-trip.
- Per-request scheduling overhead - small, grows with
-np(parallel slots).
On a 64 GB box a 100B+ model does not fully fit in VRAM, so you make three trade-offs at once: how much of the model lives on GPU vs CPU RAM, how large the context window is (KV scales linearly with context), and how many parallel slots the server exposes. Every knob is a move along one of those axes.
16.2 Free wins - apply to every model#
These cost nothing in quality and little to no speed, so they go into the server macro or every per-model entry.
Flash attention:
-fa onFused softmax + tiled matmul. Default-on for supported models on recent CUDA llama.cpp; specifying it makes intent clear. Zero quality impact, reduces memory bandwidth, unlocks the quantized KV path.
Quantized KV cache:
-ctk q8_0 -ctv q8_0q8_0 is indistinguishable from fp16 in practice (<0.1% perplexity delta). It halves KV-cache VRAM: room to pin more expert layers, longer contexts at the same budget, faster load, same TG speed on bandwidth-bound attention. Verified clean up to 131k tokens at q8/q8.
Unsloth Dynamic quants. Prefer UD-Q4_K_XL or UD-Q4_K_S over plain Q4_K_M. The Dynamic family keeps sensitive tensors (attention projections, embedding, first/last layers) at higher precision and aggressively quantizes the rest, measurably beating same-size quants from other publishers. Between Q4_K_S (K-quant, faster CUDA MMQ kernels, slightly larger) and IQ4_XS (i-quant, ~5% smaller, slightly slower PP, slightly higher quality per byte): on a recent GPU with the MMQ path optimised, Q4_K_S is usually faster.
Build flags that matter when compiling llama.cpp against CUDA:
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="120" \ # match your GPU's compute capability
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA_FA_ALL_QUANTS=ON \ # quantized KV cache in flash attention
-DGGML_CUDA_F16=ON \ # F16 intermediate precision
-DLLAMA_CURL=ONGGML_CUDA_FA_ALL_QUANTS=ON is what enables -ctk q8_0 -ctv q8_0 under flash attention. Without it the server silently falls back to slow paths.
16.3 MoE trick: selective expert offload#
This is where the biggest wins live for >100B MoE models like qwen3.5-122b-a10b, qwen3.5-397b-a17b, glm-4.7:358b, nemotron-3-super:120b, minimax-m2.7:229b.
A Mixture-of-Experts layer has the form:
y = router_gate(x) · sum_{e in top_k_experts(x)} expert_e(x)where expert_e is a separate FFN block. In qwen3.5-122b-a10b the 122B count is mostly experts; only ~10B are active per token (top-k routing, k=8 across 128 experts in the named config).
llama.cpp lets you override where individual tensors go:
-ot "<regex-on-tensor-name>=<device>"applied in declaration order (first match wins). A useful pattern:
-ot "blk\.(1[0-5]|[0-9])\.ffn_.*_exps\.=CUDA1" # experts of layers 0-15 -> GPU1
-ot "blk\.(1[6-9]|2[0-4])\.ffn_.*_exps\.=CUDA0" # experts of layers 16-24 -> GPU0
-ot ".ffn_.*_exps.=CPU" # every remaining expert -> CPUNon-expert tensors (attention projections, embedding, norm, shared MLP, output head) still obey --split-mode layer --tensor-split 1,2, so attention lives entirely on GPU and only routed expert blocks are CPU-bound.
Why the trade is worth it. At TG time the critical path is attention (GPU) -> router (GPU) -> 8 expert FFN blocks (wherever they live). Moving some experts back onto GPU eliminates PCIe round-trips for those layers on every token, keeps attention fully GPU-resident, and leaves CPU-pinned experts only for layers where there's no GPU room.
On qwen3.5-122b-a10b, moving from 10 to 26 expert layers on GPU pushed prompt processing 138 -> 232 t/s (+68%), token generation 46 -> 55 t/s (+20%), at ~98% VRAM utilisation (54 -> 63 GB of 64).
Finding your own sweet spot:
- Start with all experts on CPU (
-ngl 999 -ot ".ffn_.*_exps.=CPU"). - Pin the first N expert layers to the GPU with the most spare VRAM. Measure PP, TG, GPU memory.
- Repeat with N+2 or N+4 until 1-2 GB headroom on that GPU.
- Start pinning additional layers on the other GPU, using indices that don't collide with the first rule. Same procedure.
- The ceiling is when any GPU gets within ~1 GB of total VRAM at idle (leaves room for compute-buffer jitter during long-prompt PP).
Pin early layers first. Empirically blk.0..N beats a scattered set - early layers have more predictable expert activation, reducing cross-device fetches.
-ngl with -ot: use -ngl 999 (all layers to GPU) together with the -ot regex. -ngl 999 puts every layer on GPU by default, then -ot ".ffn_.*_exps.=CPU" overrides just the expert FFN tensors to CPU. This keeps all attention, shared, and non-expert MLP tensors on GPU for every layer. The legacy alternative (a low -ngl like -ngl 25) is strictly worse: CPU-resident layers pay the PCIe cost for their attention too, not just their experts.
16.4 Context and parallelism: --ctx-size and -np#
Minimum target: native context at -np 1. Every model has a training context length (262,144 for Qwen3.5; 131k for GLM-4.7; 196,608 for MiniMax-M2.7). Always run at native context with -np 1 at minimum - dropping below native wastes capability and saves no meaningful VRAM once q8/q8 KV is on.
When to prefer -np 2 at 2x native. If the measured single-user speed penalty is <=10%, prefer -np 2 with --ctx-size = 2 x native, so each of two concurrent chats gets the full native window. A topology with a chat UI plus automation plus an agent regularly has two concurrent requests, and the queueing cost of -np 1 shows up in practice.
Worked example on qwen3.5-122b-a10b:
| Config | Slots x ctx | PP | TG | GPU0 | GPU1 |
|---|---|---|---|---|---|
| np 1 / ctx 262k / 16+10 expert pin | 1 x 262k | 240 | 55.8 | 31.7 | 29.6 |
| np 2 / ctx 524k / 16+9 expert pin | 2 x 262k | 232 | 54.8 | 31.5 | 31.8 |
-3% PP and -2% TG in exchange for a second concurrent slot at full native context. Per the concurrency-preference rule (<=10% cost -> pick concurrency), np 2 wins.
What -np actually does. With --ctx-size N -np K, llama.cpp partitions the KV arena into K equal slots of N/K tokens; a request always targets one slot. So -np 2 at --ctx-size 262144 gives each slot 131k (a reduction from native). To give each slot native 262k you must set --ctx-size 524288 -np 2. Getting this wrong silently halves the per-request context - the most common misconfiguration.
16.5 Verify: stress test at real workload size#
A small-prompt bench (~40 input, ~150 generated) is not enough. Compute buffers grow during long-prompt processing in a way that is invisible on short prompts, and that delta is exactly what pushes a marginally-tight config over the OOM line in production.
For every tuned model, run at least three stress sizes and capture peak VRAM (sampled via nvidia-smi every ~250 ms): ~20k prompt tokens (a large code file), ~50k (a long RAG response), ~100k+ (near the ceiling). Force a deterministic response length ("write a 300-word analysis") so the TG measurement uses enough generated tokens to be reliable.
Measured data, qwen3.5-122b-a10b np2/524k/16+9:
| Prompt (tokens) | PP (t/s) | TG (t/s) | Peak GPU0 | Peak GPU1 | Notes |
|---|---|---|---|---|---|
| 40 | 233 | 55 | 31 544 | 31 826 | small-prompt nominal |
| ~20k | 451 | 80* | 31 844 | 29 598 | *TG noisy, 2 gen tokens |
| ~50k | 436 | 76* | 31 922 | 29 682 | *ditto |
| ~131k | 428 | 39.5 | 32 052 | 29 816 | 300 gen tokens, reliable |
| ~260k (200k target) | - | - | - | - | HTTP 400 (prompt exceeded per-slot ctx) |
Peak GPU0 under load is 32 052 / 32 607 MiB - only 555 MiB of headroom - and it does not OOM, because the compute-buffer delta between idle and peak is only 80 MiB on this workload. Thin but stable. TG dropping 55 -> 39.5 as context grows is the expected cost of attention scaling, not a config issue.
What failure looks like: the model loads, short prompts work, VRAM sits fine at idle, then the first 50k+ prompt crashes the process with a CUDA OOM mid-forward; the supervisor restarts it and the next request succeeds (different compute-buffer allocation). If this happens, back off 1-2 expert layers from the tightest GPU and re-run the stress test.
16.6 Full results table - qwen3.5-122b-a10b#
All measurements with the persistent models co-resident (qwen3.5-4b, embedding-qwen3-4b, whisper-large-v3 holding ~12 GB on GPU0), a fresh llama.cpp build.
| Config | ctx per slot | slots | exp layers on GPU | PP (t/s) | TG (t/s) | GPU0 MiB | GPU1 MiB |
|---|---|---|---|---|---|---|---|
OLD -ngl 25 no FA/quant | 131k | 1 | 0 | 47.5 | 20.4 | 30 086 | 25 602 |
| NEW minimal: FA + q8/q8 + expert=CPU | 131k | 1 | 0 | 105 | 38 | 22 273 | 12 327 |
| + ctx 262k native | 262k | 1 | 0 | 105 | 39 | 22 817 | 13 415 |
| + ctx 524k, np 2 | 262k | 2 | 0 | 106 | 39 | 23 959 | 15 687 |
| np 2, 4 experts on GPU1 | 262k | 2 | 4 | 117 | 41 | 24 173 | 21 255 |
| np 2, 8 experts on GPU1 | 262k | 2 | 8 | 128 | 43 | 24 173 | 26 823 |
| np 2, 10 experts on GPU1 | 262k | 2 | 10 | 138 | 46 | 24 173 | 29 607 |
| np 1, 16 experts on GPU1 | 262k | 1 | 16 | 165 | 47.5 | 18 030 | 29 492 |
| np 1, 16 GPU1 + 6 GPU0 | 262k | 1 | 22 | 208 | 52.7 | 26 168 | 29 492 |
| np 1, 16 GPU1 + 10 GPU0 | 262k | 1 | 26 | 240 | 55.8 | 31 736 | 29 492 |
| np 2, 16 GPU1 + 9 GPU0 (chosen) | 262k | 2 | 25 | 232 | 54.8 | 31 486 | 31 764 |
Net result vs the pre-update config: +388% PP, +168% TG, and a full 2x native context with 2 concurrent slots instead of 1 slot at half-native. Same hardware, just flags.
16.7 Testing harness#
Two small scripts on the server drive every probe.
probe122.sh runs a candidate config on a side port (1235), alongside the live llama-swap (1234), so the main stack and its persistent models stay loaded and the VRAM numbers reflect co-resident reality:
#!/bin/bash
# probe122.sh <label> <ctx> [extra-args...]
set -u
LABEL="$1"; CTX="$2"; shift 2; EXTRA="$*"
PORT=1235
MODEL=<path to main gguf shard 1>
MMPROJ=<path to mmproj>
BIN=/opt/llama.cpp/build/bin/llama-server
LOG=/tmp/probe_${LABEL}.log
pkill -f "llama-server --port ${PORT}" 2>/dev/null
sleep 1
nohup "$BIN" --port "$PORT" --no-warmup --jinja --metrics -np 1 -fa on \
-ctk q8_0 -ctv q8_0 \
--device CUDA0,CUDA1 --split-mode layer --tensor-split 1,2 \
-ngl 999 \
$EXTRA \
-ot '.ffn_.*_exps.=CPU' \
-m "$MODEL" --mmproj "$MMPROJ" \
--ctx-size "$CTX" \
>"$LOG" 2>&1 &
PID=$!
# wait for /health, bench, kill
for i in $(seq 1 180); do
code=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:${PORT}/health)
[ "$code" = "200" ] && break
kill -0 "$PID" || { tail -20 "$LOG"; exit 1; }
sleep 1
done
nvidia-smi --query-gpu=memory.used --format=csv,noheader
PORT_OVERRIDE=$PORT python3 ~/bench_big.py
kill "$PID"; wait "$PID" 2>/dev/nullThe -ot regex passed via $EXTRA is placed before the default .ffn_.*_exps.=CPU rule, so first-match semantics promote specific layers to GPU while everything else falls through to CPU.
bench_big.py fires a request with a configurable prompt-token target (via repeated passages), forces a long deterministic response, and samples nvidia-smi in a parallel thread at 250 ms to capture peak VRAM. It reads prompt_per_second / predicted_per_second from the /v1/chat/completions response timings. Env vars: TARGET_TOKENS, MAX_NEW_TOKENS, PORT_OVERRIDE (1234 for llama-swap, 1235 for probe), MODEL.
Workflow for tuning a new model:
- Evict the model from llama-swap (
pkill -f 'llama-server.*<model-filename>') so the GPU is clean of its previous config. - Run
probe<model>.shwith candidate flags on port 1235, co-resident with the persistent models on 1234. - Iterate expert-pin counts until the target GPU gets tight.
- Promote the winner to the
/etc/llama-swap.yamlentry for that model. - Restart llama-swap once (not per-iteration -
watch-configtriggers a full persistent-model reload each time). Accept the ~25 s persistent-preload penalty. - Issue one request to the tuned model so it loads under real conditions, then run the stress bench at ~20k / ~50k / ~100k tokens and confirm peak VRAM stays below each GPU's limit.
16.8 What didn't work / isn't worth the complexity#
- ngram speculative decoding on free-form analytical prose: zero VRAM cost, zero measurable speed gain - the ngram draft has no useful patterns to predict on chat workloads. MTP (multi-token-prediction) speculative decoding is the version that pays off, and the stack now runs it: the chat models are unsloth
*-MTP-GGUFbuilds with a trained draft head, loaded via--spec-type draft-mtp --spec-draft-n-max 3. A trained head drafts far better than ngram lookup and helps on free-form generation too. The benchmark tables above predate MTP, so treat them as the expert-offload baseline; MTP is an additional gain layered on top. - Going beyond native context via YaRN scaling: KV scaling costs are linear and attention time grows visibly past native. Quality degrades past training length and TG latency at extended context becomes unusable.
-mlock/--no-mmap: pinning weights in RAM fights page-cache eviction on a multi-model swap box. mmap benefits from OS page caching across swaps; forcing it off slows cold loads noticeably.--split-mode row(tensor-parallel): experimental on two GPUs with no NVLink; PCIe becomes the bottleneck. Layer split is the right choice on this topology.- Reducing context below native to "save VRAM": once q8/q8 KV is on, the marginal gain isn't worth the capability loss. Either keep native or go 2x native with
-np 2.
16.9 Checklist for optimising a new model#
Before touching config, know: is it dense or MoE (MoE -> expert offload is the main lever; dense -> KV tuning only), how many transformer layers (bounds the -ot regex), the native training context (sets the --ctx-size minimum), the available quant family (prefer Unsloth Dynamic), and the gguf path pattern on the server.
Then, re-measuring after each step:
- Macro flags:
-fa on -ctk q8_0 -ctv q8_0. - Native context at
-np 1. - For MoE: all experts to CPU, then pin in layers until the GPU with most headroom is at 1 GB free.
- For MoE: pin additional layers on the other GPU until it's also at 1 GB free.
- If single-user cost is <10%, double context and switch to
-np 2. - Stress test at 20k / 50k / ~80% native prompts. Confirm peak VRAM under load stays below each GPU's ceiling.
- Promote to the llama-swap config, restart once, verify swap-in and swap-out both work without errors.
17. Memory and self-learning#
Serving tuning makes the stack fast. This layer makes it improve by being used. Three kinds of durable memory, all built from my own input, all plain files on disk under ~/.config/opencode/.
| Kind | What it holds | How it loads |
|---|---|---|
memory/ | facts: who I am, a project's constraints, decisions in no repo | an index file loaded into every session |
traits/ | one-line corrections: WHEN a search 403s, DO retry via the extract tool | the trait index is always in the system prompt |
learned-skills/ | reusable multi-step procedures composed from existing tools | auto-injected when the next message matches the trigger |
Who writes them#
The primary model will not stop mid-task to record a lesson - it has a job to do. So a separate cheap model does it. After a session that did real work goes idle, the resident 4B reviews the tool sequence and records ONE thing: a learned-skill if the run was clean, a trait if a step failed and got recovered. The big model does the work, the small model writes down what worked. This runs at no swap cost, because the 4B is one of the always-resident models alongside the warm slot.
The promotion ladder#
fact --memory_remember--> memory/<slug>.md (loaded every session)
correction --reflector--------> traits/<slug>.md (index always in prompt)
procedure --reflector--------> learned-skills/<slug>/ (auto-injected on match)
proven skill --promote (mv)-----> skills/<slug>/ (a permanent skill)A fact becomes a memory, a correction becomes a trait, a working procedure becomes a learned-skill, and a learned-skill that keeps proving itself gets promoted by hand to a permanent skill. The store is capped - a few dozen traits, a handful of learned-skills - so it stays curated, not a junk drawer.
Recall: semantic layer + knowledge graph#
On top of the flat files sits an optional semantic layer and a knowledge graph. The semantic layer indexes the memory, traits, and learned-skills into one searchable store; recall is automatic per turn, gated so an ordinary coding turn pulls in nothing and only a real match surfaces. If that layer is absent, everything degrades to a flat-file substring scan - the plain files are the source of truth, the index is a convenience.
The knowledge graph is built only from my own words: my notes and the things I have actually typed, never the assistant's output. A pipeline extracts (subject, predicate, object) triples, dedupes them, and normalises name variants so they collapse to one entity. The agent reads relationships from the graph during a conversation and writes new ones back as it learns them, so the graph keeps growing from use.
The point of the whole layer: the stack gets better at my work by running my work. A correction I make once sticks, because the stack wrote it down on its own.