Self-hosted LLMs and the context discipline that makes them work

I run my own LLMs. The interesting part of working with language models in 2026 is not the model. It is the discipline you wrap around it, and that discipline is far easier to build and inspect on a stack you own end to end.

This is that discipline as it sits on my desk today: a single-slot model swap, model tiering, narrow agents with isolated context, skills that hide API plumbing behind one tool call, and a memory layer that learns from its own runs. The hardware is the boring part.

The stack#

The models run on a GPU host fronted by a gateway. Nothing in the design above the gateway cares where that host is - it talks to an OpenAI-compatible endpoint over HTTP, and whether the GPUs are under my desk or in a rack somewhere does not change a line.

The pieces:

llama.cpp is the inference engine. C++, runs anywhere, supports every quantization that matters.
llama-swap is the hot-swap router. A YAML file maps model names to launch commands. A request comes in for a model that is not loaded, llama-swap unloads the current one, loads the requested one, proxies the request through. A swap costs 20-40 seconds.
Open WebUI is the front door: an OpenAI-compatible API gateway and chat UI that also fronts speech-to-text, text-to-speech, embeddings, and a reranker. Everything points its OPENAI_API_BASE here.
OpenCode is the agentic CLI. Reads prompts, plans, calls tools, delegates. Its provider config points at the same gateway, so it sees the same model menu.

Two more services earn their place: SearXNG for web search, Firecrawl for clean content extraction. Both self-hosted, both behind the same gateway.

The constraint that shapes everything: one slot#

One fact drives the whole design. The swap keeps exactly ONE big model warm at a time. Ask for a different big model and you pay the swap. A few small models stay resident alongside the warm slot - a 4B, the embedder, speech-to-text - because they are cheap enough to keep loaded.

Short answer to why that matters: it kills naive parallelism. You cannot fan eight subtasks out across eight different big models at once. Each swap would evict the last and the slot would thrash.

So the architecture has one rule that makes a swap-bound backend usable for multi-agent work: a subagent inherits the model of the agent that called it. An agent and the workers it spawns share one warm model. Parallelism happens within that model, not across slots. The only time you deliberately swap is to escalate to a bigger model, and you do that on purpose, once.

Design around the constraint instead of fighting it. Inherit the caller's model, and the single slot stops being a problem.

The filing cabinet#

A naive way to use an LLM for research: hand it a question, let it search the web, scrape ten pages, dump the content into context, ask for a report. By page seven it has forgotten what it read for page two. By the report it is inventing quotes that appeared in none of the sources.

This is not a small-model problem. Frontier models hallucinate under the same load. The shape of the problem is attention dilution: the model spreads its limited attention across everything in context, and the more noise you stuff in, the less attention each fact gets.

The fix is not bigger models. The fix is smaller contexts.

The mental model I keep coming back to is a filing cabinet.

BAD:  one person, 50 documents, 10 questions
      = overloaded, mixes details, forgets, starts guessing
GOOD: 10 people, 5 documents each, 1 question each
      + 1 analyst combining the answers
      = focused, accurate, verifiable

That is what context discipline is. Many workers with clean desks, one coordinator combining their findings.

Layer one: model tiering#

Different roles need different model sizes. The menu is small and deliberate.

Tier	Model	Thinking	Purpose
Utility	4B	off	page extraction, classification, the always-resident reflector
Worker	27B	off	the default driver and the focused subtask runner
Reason	122B MoE	on	escalation: subtle bugs, race conditions, architecture trade-offs
Deep	397B MoE	on	last resort, when reason was tried and verified wrong

The worker tier is the focused reader, and it is also the default. It gets one subtask with a small starting context, does its retrieval, extracts the relevant facts with source URLs, returns a short distilled answer. Thinking off, because reasoning is not the worker's job and thinking burns context on internal monologue that does not improve search quality.

The utility tier does one thing: turn a raw HTML page into the three sentences the worker wanted. A worker never reads a raw page. It calls a utility model with the page and a prompt like "extract any fine amounts and their legal basis", and gets back two hundred tokens of clean facts.

RAW SCRAPE (wasteful):               UTILITY EXTRACTION (efficient):
+---------------------------+         +---------------------------+
| Navigation menu           |         | Fine amount: 726 EUR      |
| Cookie banner             |         | Repeat offense: 2,180 EUR |
| *** Actual content ***    |         | Source: ris.bka.gv.at     |
| Footer / ads / comments   |         +---------------------------+
+---------------------------+         = 200 tokens
= 8,000 tokens (95% noise)

Five extracted pages cost a thousand tokens of worker context. Five raw pages would cost forty thousand. That compounding is the difference between fitting in context and not.

The reason and deep tiers are MoE models. The letter math is total params - active params: a 122b-a10b activates 10 billion parameters per token while holding 122 billion in memory. That is what makes the big tiers affordable on this kind of hardware - memory holds the whole model, compute touches only the experts the router picks.

Layer two: the agents#

The stack is not one general assistant. It is a handful of narrow agents, each Tab-switchable, each with a focused system prompt and a hard instruction to refuse work outside its lane and name the right agent instead.

Agent	Job
`chat`	knowledge Q&A, comparisons, recommendations, one-shot non-code
`build`	code, refactors, infrastructure-as-code
`research`	sourced multi-step research; delegates every web call and fact-check
`ops`	live infrastructure, remote troubleshooting over tmux, runbooks
`student`	coursework: PDF exercises, math, write-ups

Narrow agents beat one do-everything agent for the same reason small contexts beat big ones: a focused system prompt is a clean desk. The chat agent's whole job is epistemic discipline - never fabricate a name, a version, or a number - and an agent that refuses to silently start editing files is more trustworthy than one general agent pattern-matching its way into the wrong mode.

The escalation tiers are agents too, but you do not reach for them first:

reason (122B) runs only after the primary has failed twice on its own.
deep (397B) runs only after reason came back and was verified wrong.

Escalation is deliberate and rare, because each step up costs a model swap. You reach for the big model when the cheap one has demonstrably failed, not before.

Layer three: subagents#

Subagents are the mechanism that makes context isolation real. Not a UI thing. A runtime split where each subtask runs in its own model invocation with its own context window.

You: "Research X"
         |
         v
+--research agent (27B, warm slot)------+
|                                       |
|  Breaks X into 8 questions            |
|  For each: spawn a worker subagent    |
|                                       |
+--+-------+-------+-------+-------+----+
   v       v       v       v       v
 worker  worker  worker  worker  worker
  (27B)   (27B)   (27B)   (27B)   (27B)   <- same warm model, inherited
   |       |       |       |       |
   |       |  each: search, extract, verify
   v       v       v       v       v
 answer  answer  answer  answer  answer

Coordinator: receives 8 short answers (not 8 page dumps)
             cross-checks, writes the report

Each worker starts with maybe 500 tokens: the question, a short system prompt. It works, hits maybe 5,000 tokens of accumulated context by the end, returns 200-400 tokens of distilled findings. The coordinator never sees the workers' contexts. It sees the answers.

The first time I instrumented this in OpenCode I expected a modest gain. The output-quality jump was the same as switching from a 27B model to a frontier API on the same prompt. Same compute, same hardware, very different output.

Five rules make subagents work, and getting any one wrong undoes the gain.

Orchestrated delegation: one agent plans and distributes, the workers execute. The coordinator does the synthesis; workers do the focused retrieval. Because of the single-slot rule, all of them run on the same warm model - the gain here is clean context per worker, not a bigger brain on top.

Context isolation: a worker's context never bleeds into another worker or the coordinator. The coordinator receives distilled answers, not raw scrapes. Anything that pollutes it with worker noise defeats the pattern.

Extract, do not dump: workers never put raw pages in their context. They call a utility model to extract first.

Verification pipeline: a separate verifier agent checks claims against their sources after the workers finish. Visits the cited URL, compares the claim, returns CONFIRMED / DEBUNKED / NOT FOUND. This catches the case where a worker said "the fine is 726 EUR according to example.com" and example.com does not say 726 EUR.

Graceful degradation: when something fails - search rate-limited, page down, worker out of step budget - the system says so. An empty result marked NOT VERIFIED beats a confident fabrication.

The last rule takes the longest to internalise. A model that fills gaps from training data sounds confident. The point of building a research pipeline is to be more trustworthy than the model talking to itself, so an honest "I could not find this" is ALWAYS better than a silent guess.

Layer four: skills#

Subagents handle the process-discipline side. Skills handle the API-surface side. Together they let an agent touch real systems without drowning in plumbing.

A skill is a small documented capability with a clear contract. Three layers per skill:

SKILL.md is the public face. Tells the agent when to invoke the skill, what arguments it takes, what it returns. The agent reads this and decides.
An execution layer the agent runs. Usually a thin script: a Python wrapper around an API, a shell command, a templated HTTP request. Returns structured output.
A reference layer for stable identifiers the script needs but the agent should not learn: API IDs, custom field names, common endpoints. Loaded only when the skill is invoked.

The win is that the agent's view of "send a Jira ticket" or "create a calendar event" is one tool call. It does not see the API client, the authentication, or the retry logic. It sees a function that takes a small dict and returns a small dict.

The gain was concrete. Prompts dropped from "search for the Jira ticket about deploying X, paste in the ticket ID, here is my username, here is the API token format" to "create a Jira ticket about deploying X". Less plumbing in context, more attention left for the actual task.

Layer five: memory that learns#

This is the newest layer and the one that makes the stack improve by being used. Three kinds of durable memory, all built from my own input, all plain files on disk:

Memory: facts. Who I am, a project's constraints, decisions that live in no repo. A flat-file store with an index loaded into every session. Copy the directory and it works.
Traits: one-line corrections. WHEN a search returns a 403, DO retry through the extract tool. The trait index is always in the system prompt, so the fix is always in view.
Learned-skills: reusable procedures. A multi-step recipe composed from existing tools, written once and auto-injected when the next message matches its trigger.

The interesting part is how they get written. The primary model will not stop mid-task to record a lesson - it has a job to do. So a separate cheap model does it. After a session that did real work goes idle, the always-resident 4B reviews the tool sequence and records ONE thing: a learned-skill if the run was clean, a trait if a step failed and got recovered. The big model does the work, the small model writes down what worked.

On top of the flat files sits a semantic recall layer and a knowledge graph. The graph is built only from my own words - my notes and the things I have actually typed, never the assistant's output - extracted into (subject, predicate, object) triples, deduped, and normalised so name variants collapse to one entity. Recall is automatic: each turn, the relevant facts and relationships are injected into the prompt, gated so an ordinary coding turn pulls in nothing and only a real match surfaces.

The promotion ladder is the point. A fact becomes a memory, a correction becomes a trait, a working procedure becomes a learned-skill, and a learned-skill that keeps proving itself gets promoted to a permanent skill. The stack gets better at my work by running my work.

When delegation pays off#

The whole pipeline is overhead. For a simple question it loses to a single model with a clean prompt.

Situation	Pattern	Why
Complex research, many subtopics	Full pipeline: delegate + isolate + verify	Each subtopic gets clean context, claims get checked
Code review / audit	Coordinator + focused reviewers	Each reviewer checks one aspect
Simple Q&A	Single agent, no delegation	The overhead is not worth it

Rule of thumb: more than 3-5 documents to read, or more than 3 questions to answer, delegation pays. Below that, one model with clean prompting wins.

What this changes in practice#

A research task that took a frontier API thirty seconds of single-model thinking and returned a report with two confident fabrications takes this stack a few minutes and returns a report with verified citations. The extra minutes are mostly worker latency. The two fabrications are gone.

The hardware matters less than the access it buys. The discipline is mine, the failure modes are mine, the iteration is mine. Something hallucinates, I read the worker's context. Something stalls, I read the coordinator's plan. A skill is wrong, I edit the markdown. A correction sticks because the stack wrote it down on its own.

Most of this does not need to be self-hosted. The five subagent rules work against any API. Three-layer skills work against any agent runtime. Model tiering works on any provider with multiple model sizes. Self-hosting is what made the discipline visible to me, and building it bottom-up is what taught me which parts matter.

Full writeup, copy-paste grade: A self-hosted multi-agent LLM stack.