Skip to main content

Vector DB

The Vector DB component manages collections of vector embeddings for semantic search across project files and arbitrary text. It is backed by SQLite with the sqlite-vec extension for K-nearest-neighbor (KNN) similarity search, and FTS5 for hybrid search that combines BM25 keyword scoring with vector similarity via Reciprocal Rank Fusion.

Vector DB


Storage Layout

Database

All vector data lives in the shared project database:

{projectPath}/.codebolt/db/project.db

The database runs in WAL (Write-Ahead Logging) mode for concurrent read/write performance.

Tables

TablePurpose
vec_documentsStores documents — columns: collection, uri, doc_text, hash, metadata
vec_chunksStores chunks of documents — columns: document_id, collection, chunk_text, start_pos, end_pos
vec_collectionsCollection metadata — columns: embedding_dim, embedding_model
vec_{collectionId}Per-collection virtual table (sqlite-vec vec0) used for KNN search
vec_chunks_ftsFTS5 virtual table over vec_chunks for full-text keyword search

Collection Metadata on Disk

{projectPath}/.codebolt/vectordb/{collectionId}/

Each collection directory contains JSON files for collection info and settings.

Database Migrations

Five migrations run in order:

  1. vec_documents table
  2. vec_chunks table
  3. vec_collections table
  4. FTS5 virtual table + triggers
  5. vec_migrations tracking table

Embedding

Embedding converts text into fixed-dimension vectors. The Vector DB routes embedding requests to one of several providers:

  • Local ONNX modelall-MiniLM-L6-v2, dimension 384
  • Cloud providers — OpenAI (dimension 1536), Voyage, and others

The provider and model are selected from project settings. Different models produce different embedding dimensions:

ModelDimension
all-MiniLM-L6-v2 (local)384
OpenAI1536

The default dimension is 1536 when no model is explicitly configured.


Document Processing Flow

When a document is indexed, the following steps execute:

  1. upsertDocument(uri, text) is called.
  2. The document type is detected from the URI file extension.
  3. If the extension is supported (js, jsx, ts, tsx, py, rs, go, cpp, hpp, c, h, cs, rb, java, php, swift — 16 languages): AST-aware chunking via tree-sitter.
  4. Otherwise: TextSplitter with language-specific separators.
  5. Chunks are processed in batches of 10 for embedding.
  6. Each chunk is stored in vec_chunks and its embedding is inserted into the vec0 virtual table.
  7. Progress callbacks are emitted during processing.

Indexing Events

EventWhen
indexing-startedProcessing begins for a document
indexing-progressAfter each batch of chunks is embedded and stored
indexing-completedAll chunks processed successfully
indexing-failedAn error occurred during processing

Chunking Strategies

TextSplitter

Used for unsupported file types and as a fallback.

  • Configurable chunk_size and chunk_overlap
  • Language-specific separator hierarchies (default, markdown, typescript, javascript, python, JSON)
  • Splits by separators in priority order, respects token counts

ASTChunker (tree-sitter)

Used for supported source code files:

Supported extensions: js, jsx, ts, tsx, py, rs, go, cpp, hpp, c, h, cs, rb, java, php, swift

Behavior:

  • Parses the file into an AST using tree-sitter
  • Extracts function, class, and method definitions as individual chunks
  • Splits large classes into a header chunk plus individual method chunks
  • Preserves signature context for sub-chunks (the parent function/class signature is prepended)
  • Falls back to TextSplitter if the parse fails

Query Mechanics

When a query is executed:

  1. The query text is embedded into a vector using the collection's configured model.
  2. KNN search runs via sqlite-vec: finds the top-K most similar chunks by cosine distance.
  3. Optional hybrid search: if enabled, an FTS5 BM25 keyword search runs in parallel.
  4. Reciprocal Rank Fusion (RRF): merges KNN and BM25 result rankings into a combined score.
  5. Results are grouped by document and sorted by the best chunk score within each document.
  6. The response includes: document URI, relevance score, and matching text sections with line numbers.

Defaults

ParameterDefault
Max documents returned10
Max chunks returned50

Collection Settings

Each collection has a VectorDbSettings object that controls its behavior:

SettingDescription
embeddingModelWhich embedding model to use
embeddingProviderLocal ONNX or a cloud provider
chunkSizeNumber of tokens per chunk (default from project config)
chunkOverlapOverlap in tokens between consecutive chunks
searchModevector (KNN only) or hybrid (KNN + BM25 via RRF)
topKDefault number of results to return
minScoreMinimum similarity threshold — results below this score are excluded

WebSocket Events

The Vector DB emits real-time events over WebSocket:

Collection lifecycle:

  • collection-created
  • collection-updated
  • collection-deleted

Document lifecycle:

  • document-added
  • document-deleted

Chunk lifecycle:

  • chunks-added
  • chunk-deleted

Indexing progress:

  • indexing-started
  • indexing-progress
  • indexing-completed
  • indexing-failed

REST API

Collections

MethodEndpointDescription
POST/vectorDB/collectionsCreate a new collection
GET/vectorDB/collectionsList all collections
PUT/vectorDB/collectionsUpdate a collection
DELETE/vectorDB/collectionsDelete a collection

Documents

MethodEndpointDescription
POST/vectorDB/collections/:id/documentsUpsert a document (insert or update by URI)
GET/vectorDB/collections/:id/documentsList documents in a collection
DELETE/vectorDB/collections/:id/documentsDelete a document by URI

Chunks

MethodEndpointDescription
GET/vectorDB/collections/:id/chunksList chunks in a collection
POST/vectorDB/collections/:id/chunksAdd chunks manually
DELETE/vectorDB/collections/:id/chunks/:chunkIdDelete a specific chunk

Query

MethodEndpointDescription
POST/vectorDB/collections/:id/queryRun a semantic (or hybrid) search query

Settings

MethodEndpointDescription
GET/vectorDB/collections/:id/settingsGet collection settings
PUT/vectorDB/collections/:id/settingsUpdate collection settings

SDK Reference

The codebolt.vectordb module exposes the following methods:

Create a Collection

const collection = await codebolt.vectordb.createCollection({
name: "my-docs",
embeddingModel: "all-MiniLM-L6-v2",
embeddingProvider: "local"
});

Upsert a Document

await codebolt.vectordb.upsertDocument(collectionId, {
uri: "src/index.ts",
text: fileContents
});

Query Documents

const results = await codebolt.vectordb.queryDocuments(collectionId, {
query: "authentication middleware",
topK: 5,
searchMode: "hybrid"
});

Delete a Document

await codebolt.vectordb.deleteDocument(collectionId, "src/index.ts");

Add a Chunk

await codebolt.vectordb.addChunk(collectionId, {
documentId: docId,
chunkText: "function authenticate(req, res, next) { ... }",
startPos: 0,
endPos: 120
});

Delete a Chunk

await codebolt.vectordb.deleteChunk(collectionId, chunkId);