Use Ollama Locally
Running LLMs on your own machine via Ollama. Free, private, offline-capable.
For the full local-model guide with hardware requirements, tuning, and alternatives, see Connect a local model. This page is the focused Ollama-specific version.
Step 1 — install Ollama
Download from ollama.com. Runs as a background service.
Verify:
ollama --version
Step 2 — pull a model
For coding, pick a coder-tuned model:
ollama pull qwen2.5-coder:14b # strong, moderate RAM
# or
ollama pull deepseek-coder:33b # larger, better, needs more RAM
# or
ollama pull codellama:13b # reliable baseline
First pull is 5-20 GB; subsequent pulls of related models share layers.
Step 3 — configure in Codebolt
Settings → Providers → Add provider → Ollama.
Default URL is http://localhost:11434. Codebolt auto-detects installed models. Click Test.
Step 4 — use it
In a chat tab, pick your Ollama model from the model picker. First turn will be slow (model loads into memory); subsequent turns are fast.
Keep models warm
Ollama unloads models after inactivity (default 5 minutes). To keep them warm longer:
# Linux/macOS
export OLLAMA_KEEP_ALIVE=1h
Or set it in a systemd drop-in for the Ollama service. Use 24h, or -1 to keep models loaded indefinitely, if your RAM can spare it.
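On a systemd-based Linux install, the drop-in looks like this (a sketch; `ollama.service` is the unit name the standard Linux installer creates):

```ini
# /etc/systemd/system/ollama.service.d/keep-alive.conf
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
```

Then `systemctl daemon-reload && systemctl restart ollama` to apply it.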
Embedding models
Codebolt uses embeddings for memory ingestion and vector search. Run them locally too:
ollama pull nomic-embed-text
Settings → Providers → Embeddings → Ollama.
After switching embedding models, re-index via Settings → Indexing → Re-index full project.
Fallback chain: local + remote
A common setup: local by default, remote for hard tasks.
Settings → Providers → Fallback chains:
Primary: ollama (qwen2.5-coder:14b)
Fallback on error: anthropic (claude-sonnet)
Fallback on timeout: anthropic (claude-sonnet)
Most runs stay local; only the hard ones escalate.
Tuning
GPU acceleration
If you have an NVIDIA GPU or Apple Silicon, Ollama uses it automatically. Check with nvidia-smi (or Activity Monitor on macOS) while Ollama runs — the GPU should be busy.
If GPU isn't being used:
- Ensure you installed the GPU-enabled Ollama build.
- Check CUDA / Metal install.
- On Linux, you might need nvidia-container-toolkit if running under Docker.
Quantization
Smaller, faster, slightly worse: :q4_k_m (the default for most Ollama models).
Larger, slower, slightly better: :q8_0 or :fp16.
Tiny, noticeably worse: :q2_k.
For coding, q4_k_m is usually right. Experiment if you have RAM to spare.
Context window
Local models often have smaller context windows than remote flagships. Check the model's metadata:
ollama show qwen2.5-coder:14b
If the context is too small for your project, compression kicks in earlier — either accept it or move to a remote provider for that task.
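If the underlying model supports a larger window than the tag's default, you can raise num_ctx with a Modelfile (FROM and PARAMETER are standard Modelfile syntax; the 32768 value and the derived model name below are examples, not requirements):

```text
FROM qwen2.5-coder:14b
PARAMETER num_ctx 32768
```

Save it as Modelfile, build a derived model with `ollama create qwen2.5-coder-32k -f Modelfile`, then pick it in Codebolt like any other local model. Note that a larger context window also raises RAM/VRAM usage.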
Troubleshooting
"Connection refused"
Ollama isn't running. Start it: ollama serve (or launch the app).
"Model not found"
You didn't pull it, or the name is wrong. ollama list shows installed models.
"Out of memory"
Model too big. Use a smaller variant or more aggressive quantization.
Extremely slow generation
- CPU-only? Expected. Use smaller models or get a GPU.
- GPU idle? Driver / install issue. See GPU acceleration above.
- First call? Cold start. Subsequent calls will be faster.
Lower quality than expected
Local open-weight models trail frontier closed models. For hard tasks, use a fallback to a remote provider.