Use Ollama Locally

Running LLMs on your own machine via Ollama. Free, private, offline-capable.

For the full local-model guide with hardware requirements, tuning, and alternatives, see Connect a local model. This page is the focused Ollama-specific version.

Step 1 — install Ollama

Download from ollama.com. Runs as a background service.

Verify:

ollama --version

Step 2 — pull a model

For coding, pick something from the coder-tuned list:

ollama pull qwen2.5-coder:14b          # strong, moderate RAM
# or
ollama pull deepseek-coder:33b         # larger, better, needs more RAM
# or
ollama pull codellama:13b              # reliable baseline

First pull is 5-20 GB; subsequent pulls of related models share layers.

Step 3 — configure in Codebolt

Settings → Providers → Add provider → Ollama.

Default URL is http://localhost:11434. Codebolt auto-detects installed models. Click Test.

Or via Settings → Providers → Add provider → Ollama.

Step 4 — use it

In a chat tab, pick your Ollama model from the model picker. First turn will be slow (model loads into memory); subsequent turns are fast.

Keep models warm

Ollama unloads models after inactivity (default 5 minutes). To keep them warm longer:

# Linux/macOS
export OLLAMA_KEEP_ALIVE=1h

Or in a systemd drop-in. 24h / unlimited if your RAM can spare it.

Embedding models

Codebolt uses embeddings for memory ingestion and vector search. Run them locally too:

ollama pull nomic-embed-text

Settings → Providers → Embeddings → Ollama.

After switching embedding models, re-index:

After switching embedding models, re-index via Settings → Indexing → Re-index full project.

Fallback chain: local + remote

A common setup: local by default, remote for hard tasks.

Settings → Providers → Fallback chains:

Primary: ollama (qwen2.5-coder:14b)
Fallback on error: anthropic (claude-sonnet)
Fallback on timeout: anthropic (claude-sonnet)

Most runs stay local; only the hard ones escalate.

Tuning

GPU acceleration

If you have an Nvidia GPU or Apple Silicon, Ollama uses it automatically. Check with nvidia-smi (or activity monitor on Mac) while Ollama runs — the GPU should be busy.

If GPU isn't being used:

Ensure you installed the GPU-enabled Ollama build.
Check CUDA / Metal install.
On Linux, might need nvidia-container-toolkit if running under Docker.

Quantization

Smaller, faster, slightly worse: :q4_k_m (the default for most Ollama models). Larger, slower, slightly better: :q8_0 or :fp16. Tiny, noticeably worse: :q2_k.

For coding, q4_k_m is usually right. Experiment if you have RAM to spare.

Context window

Local models often have smaller context windows than remote flagships. Check the model's metadata:

ollama show qwen2.5-coder:14b

If the context is too small for your project, compression kicks in earlier — either accept it or move to a remote provider for that task.

Troubleshooting

"Connection refused"

Ollama isn't running. Start it: ollama serve (or launch the app).

"Model not found"

You didn't pull it, or the name is wrong. ollama list shows installed models.

"Out of memory"

Model too big. Use a smaller variant or more aggressive quantization.

Extremely slow generation

CPU-only? Expected. Use smaller models or get a GPU.
GPU idle? Driver / install issue. See GPU acceleration above.
First call? Cold start. Subsequent calls will be faster.

Lower quality than expected

Local open-weight models trail frontier closed models. For hard tasks, use a fallback to a remote provider.

Step 1 — install Ollama​

Step 2 — pull a model​

Step 3 — configure in Codebolt​

Step 4 — use it​

Keep models warm​

Embedding models​

Fallback chain: local + remote​

Tuning​

GPU acceleration​

Quantization​

Context window​

Troubleshooting​

"Connection refused"​

"Model not found"​

"Out of memory"​

Extremely slow generation​

Lower quality than expected​

See also​