LLM & Inference Subsystem
Every LLM call in Codebolt goes through one service. That service decides: which provider, which model, local or remote, stream or not, with which tool schemas, against which budget.
Source code:
controllers/llmController, services/llmService, services/inference/, services/embeddingService, services/localEmbeddingService, services/localModelInferenceService, services/localModelManager, services/localModelService, services/tokenizerService, plus the sibling package packages/multillm.
Responsibilities
- Provider routing — pick the right provider for the current request (explicit config, model alias, fallback).
- Remote + local parity — same interface whether the model is a remote API or running on the user's machine.
- Embeddings — a separate but parallel path (embeddingService, localEmbeddingService).
- Tokenization — tokenizerService owns token counting so cost/budget checks are accurate.
- Model lifecycle — localModelManager downloads, warms, and evicts local models.
Components
llmService
The single entry point. Takes a typed LLMRequest (messages, tools, model, options) and returns a typed LLMResponse or a stream. Internally dispatches to inference/ for the actual provider call.
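A minimal sketch of what that entry point could look like. Only LLMRequest and LLMResponse are named in this doc; every other type, field, and function below is an illustrative assumption, not Codebolt's actual definition:

```typescript
// Hypothetical shapes for the single entry point. Field names beyond
// LLMRequest/LLMResponse are assumptions for illustration.
interface LLMRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  model?: string; // explicit name, alias, or undefined → fallback
  stream?: boolean;
}

interface LLMResponse {
  content: string;
  usage: { promptTokens: number; completionTokens: number };
}

interface ProviderAdapter {
  call(req: LLMRequest): Promise<LLMResponse>;
}

// Stub adapter standing in for inference/<provider>: echoes the prompt.
const echoAdapter: ProviderAdapter = {
  async call(req) {
    const prompt = req.messages.map((m) => m.content).join(" ");
    // Token usage here is a stand-in, not a real count.
    return { content: prompt, usage: { promptTokens: prompt.length, completionTokens: 0 } };
  },
};

// Assumed alias table; a real one would map aliases to configured providers.
const aliases: Record<string, ProviderAdapter> = { fast: echoAdapter };

function resolveProvider(model?: string): ProviderAdapter {
  if (model && aliases[model]) return aliases[model]; // alias hit
  return echoAdapter; // fallback
}

// The service stays provider-agnostic: resolve, then delegate the call.
async function chat(req: LLMRequest): Promise<LLMResponse> {
  return resolveProvider(req.model).call(req);
}
```

The point of the sketch is the shape, not the logic: callers see one typed function, and everything provider-specific lives behind the adapter interface.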
inference/
Per-provider adapters (OpenAI, Anthropic, Google, local, etc.). Each adapter:
- Translates Codebolt's canonical message format into the provider's wire format.
- Translates the provider's tool-call format back into Codebolt's.
- Handles streaming, retries, rate limits, and error normalisation.
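The first translation step can be sketched concretely. Anthropic-style APIs take the system prompt as a separate top-level field rather than as a message, so an adapter has to split it out of the canonical list. The canonical shape below is an assumption, not Codebolt's real type:

```typescript
// Assumed canonical message shape for illustration.
type CanonicalMessage = { role: "system" | "user" | "assistant"; content: string };

// Translate the canonical list into an Anthropic-style wire format:
// system content becomes a separate field, the rest stay as messages.
function toAnthropicWire(messages: CanonicalMessage[]) {
  const system = messages
    .filter((m) => m.role === "system")
    .map((m) => m.content)
    .join("\n");
  const rest = messages
    .filter((m) => m.role !== "system")
    .map((m) => ({ role: m.role, content: m.content }));
  return { system, messages: rest };
}
```

The reverse direction (provider tool-call JSON back into Codebolt's tool-call format) follows the same pattern in the other adapters.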
packages/multillm
Sibling package that holds the actual provider client code. Kept separate from the server so CLI and SDK can reuse it.
localModelInferenceService + localModelManager + localModelService
The local inference path. localModelManager handles download + caching + eviction. localModelInferenceService runs the inference. localModelService is the high-level controller surface.
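The cache-and-evict responsibility can be sketched as an LRU over loaded models with a fixed slot budget. This is a toy illustration under assumed names; the real manager also handles downloads and warm-up, which are omitted here:

```typescript
// Toy sketch of localModelManager's eviction policy: an LRU keyed by
// model name, capped at a fixed number of loaded models. All names and
// the policy itself are assumptions for illustration.
class LocalModelManagerSketch {
  private loaded = new Map<string, { warmedAt: number }>();
  constructor(private maxLoaded = 2) {}

  // Ensure a model is resident; returns the names of any models evicted.
  ensureLoaded(model: string): string[] {
    const evicted: string[] = [];
    const existing = this.loaded.get(model);
    if (existing) {
      // Refresh recency: Map preserves insertion order, so re-insert.
      this.loaded.delete(model);
      this.loaded.set(model, existing);
      return evicted;
    }
    // Evict least-recently-used models until a slot is free.
    while (this.loaded.size >= this.maxLoaded) {
      const oldest = this.loaded.keys().next().value as string;
      this.loaded.delete(oldest);
      evicted.push(oldest);
    }
    this.loaded.set(model, { warmedAt: Date.now() });
    return evicted;
  }
}
```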
embeddingService + localEmbeddingService
Parallel to the chat path but for embeddings. Used by the memory ingestion pipeline to produce vectors.
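The parallel shape of the embeddings path can be sketched with the same dispatch idea as chat, but returning vectors. The embedder below is a toy character-hash stand-in for localEmbeddingService, not a real model:

```typescript
// Assumed interface for the embeddings path: texts in, vectors out.
interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}

// Toy "local" embedder for illustration only: buckets character codes
// into a fixed-size vector. A real service would run an embedding model.
const toyLocalEmbedder: Embedder = {
  async embed(texts) {
    return texts.map((t) => {
      const v = new Array(8).fill(0);
      for (let i = 0; i < t.length; i++) v[t.charCodeAt(i) % 8] += 1;
      return v;
    });
  },
};
```

The memory ingestion pipeline would consume these vectors for similarity search; the chat path never sees them.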
tokenizerService
One place for token counting. Critical because budget enforcement, context-window truncation, and cost reporting all depend on accurate counts.
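The budget check that runs before a call can be sketched as follows. The chars/4 heuristic is a common rough approximation, not Codebolt's actual tokenizer, and both function names are assumptions:

```typescript
// Rough token estimate for illustration; a real tokenizer is
// model-specific (BPE vocabularies differ per provider).
function approxTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// Does the assembled message list fit the context window, leaving
// room for the model's output?
function fitsBudget(
  messages: string[],
  contextWindow: number,
  reservedForOutput: number,
): boolean {
  const used = messages.reduce((sum, m) => sum + approxTokenCount(m), 0);
  return used <= contextWindow - reservedForOutput;
}
```

Because truncation and cost reporting reuse the same counts, centralizing them in one service keeps all three checks consistent.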
The request path
agent loop
│
▼
llmService.chat({ messages, tools, model })
│
├── tokenizerService.count → budget check
│
├── provider resolution (explicit model? alias? fallback?)
│
▼
inference/<provider>.call()
│
├── local? → localModelInferenceService
└── remote? → multillm provider client
│
▼
normalized LLMResponse (or stream)
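The local-vs-remote fork in the diagram can be sketched as a simple routing predicate. The `local:` prefix convention below is an assumption for illustration, not Codebolt's actual naming scheme:

```typescript
// Assumed convention: locally-hosted models carry a "local:" prefix.
function isLocalModel(model: string): boolean {
  return model.startsWith("local:");
}

// Route to the local inference service or the multillm provider client.
function routeCall(model: string): "localModelInferenceService" | "multillm" {
  return isLocalModel(model) ? "localModelInferenceService" : "multillm";
}
```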
What this subsystem does NOT own
- Prompt assembly. That's contextAssembly. llmService only receives a fully assembled message list.
- Tool execution. The LLM returns intent to call a tool; the agent loop passes it to toolService.
- Memory. The response goes back through the agent loop, which then writes to memory. llmService is stateless per-call.
- Guardrails. A separate sidecar runs before and after the call.
This strict separation is why you can swap providers, add local inference, or change prompt assembly without touching the other layers.
See also
- Memory — consumer of embeddings
- Context Assembly — the thing that feeds llmService
- LLM Providers integration — user-facing setup
- Custom Provider — build your own