Knowledge

The Knowledge component provides document-level knowledge management with configurable automatic chunking. It handles file upload, URL import, and raw text input, storing everything as JSON files on disk with no database dependency.

Knowledge Base

Storage Layout

All knowledge data lives under the project directory in a flat file structure:

{projectPath}/.codebolt/knowledge/{collectionId}/
  collection.json          # Collection metadata (name, description, timestamps)
  chunkingsettings.json    # Per-collection chunking configuration
  documents/
    {documentId}/
      document.json        # Document metadata (name, type, status, chunkCount)
      original.{ext}       # Original uploaded/fetched content
      chunks.json          # Array of generated chunks

There is no SQLite or other database involved. Every read and write goes directly to JSON files on the filesystem.
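
As an illustration of the layout above, the following minimal sketch resolves the on-disk paths for one document. The helper names (collection_dir, document_paths) are hypothetical and not part of the Knowledge component; only the directory and file names come from the layout itself.

```python
from pathlib import Path


def collection_dir(project_path: str, collection_id: str) -> Path:
    """Directory that holds one collection (hypothetical helper)."""
    return Path(project_path) / ".codebolt" / "knowledge" / collection_id


def document_paths(project_path: str, collection_id: str,
                   document_id: str, ext: str) -> dict:
    """Paths of the three files stored for one document (hypothetical helper)."""
    doc_dir = collection_dir(project_path, collection_id) / "documents" / document_id
    return {
        "metadata": doc_dir / "document.json",
        "original": doc_dir / f"original.{ext}",
        "chunks": doc_dir / "chunks.json",
    }
```

Because everything is plain JSON on disk, a collection can be inspected or backed up with ordinary file tools.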

Document Processing Flow

When a document is added to a collection, it goes through an asynchronous processing pipeline:

  1. addDocument(collectionId, request) creates the document entry with status='pending'.
  2. An asynchronous processDocument() call is triggered.
  3. The status transitions to 'processing', and the chunking-started WebSocket event is emitted.
  4. The service reads the collection's chunkingsettings.json and selects a chunking strategy based on the file extension.
  5. The service calls knowledgeChunkingService.chunkContentWithSettings() with the selected strategy and options.
  6. Progress callbacks emit chunking-progress events throughout processing.
  7. Generated chunks are saved via saveChunks(), and the document's chunkCount is updated.
  8. The status transitions to 'completed', and the chunking-completed event is emitted.
  9. On any error, the status transitions to 'failed' with the error message attached.
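
The status lifecycle in the steps above can be sketched as a small function. This is a simplified, synchronous illustration, not the service's actual code: process_document, the chunker and emit callbacks, and the "error" field name are all assumptions, and emitting chunking-failed on error is inferred from the WebSocket event list rather than stated in the steps.

```python
from typing import Callable, List


def process_document(doc: dict, content: str,
                     chunker: Callable[[str], List[str]],
                     emit: Callable[[str, dict], None]) -> dict:
    """Walk one document through pending -> processing -> completed/failed."""
    doc["status"] = "processing"
    emit("chunking-started", {"documentId": doc["id"]})
    try:
        chunks = chunker(content)
        doc["chunkCount"] = len(chunks)
        doc["status"] = "completed"
        emit("chunking-completed",
             {"documentId": doc["id"], "chunkCount": len(chunks)})
    except Exception as exc:
        # Assumption: the error message is stored on the document entry
        # and a chunking-failed event is emitted.
        doc["status"] = "failed"
        doc["error"] = str(exc)
        emit("chunking-failed", {"documentId": doc["id"], "error": str(exc)})
    return doc
```

A caller would create the entry with status 'pending' first and then hand it to this function; the real pipeline runs the equivalent work asynchronously.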

URL Import

The addDocumentFromUrl(url) method fetches remote content via HTTP with the following behavior:

  • HTML stripping: removes <script>, <style>, <nav>, <footer>, <header>, <aside>, and <iframe> tags entirely.
  • Entity handling: decodes common HTML entities (&nbsp;, &amp;, etc.).
  • Timeout: 30 seconds.
  • User-Agent: Mozilla/5.0 (compatible; CodeBolt/1.0; Knowledge Fetcher).
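
The behavior listed above can be approximated with the Python standard library. This is a rough sketch under stated assumptions: the function names are hypothetical, the listed elements are assumed to be dropped together with their contents, and regex-based stripping is used for brevity even though a real HTML parser is more robust.

```python
import html
import re
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; CodeBolt/1.0; Knowledge Fetcher)"
TIMEOUT_SECONDS = 30
STRIPPED_ELEMENTS = ("script", "style", "nav", "footer", "header", "aside", "iframe")


def fetch(url: str) -> str:
    """Download a page with the documented User-Agent and timeout."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_text(markup: str) -> str:
    """Drop the listed elements, strip remaining tags, decode entities."""
    for tag in STRIPPED_ELEMENTS:
        pattern = rf"<{tag}\b[^>]*>.*?</{tag}>"
        markup = re.sub(pattern, " ", markup, flags=re.IGNORECASE | re.DOTALL)
    markup = re.sub(r"<[^>]+>", " ", markup)   # remaining tags
    text = html.unescape(markup)               # &nbsp;, &amp;, ...
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```

Entities are decoded after tag removal so that escaped markup such as &lt;b&gt; survives as literal text.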

Chunking Strategies

Six built-in strategies are available. Each collection can set a default strategy and per-file-type overrides.

1. fixed_size

Splits text into chunks of a fixed character count.

| Parameter | Default | Description |
| --- | --- | --- |
| chunkSize | 500 | Characters per chunk |
| overlap | 50 | Overlapping characters between consecutive chunks |

Simple sequential splitting with no awareness of content boundaries.
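
A minimal sketch of fixed-size splitting with overlap, using the defaults from the table. The function name and the rule for ending the loop (stop once a chunk reaches the end of the text) are assumptions; the built-in implementation may differ in edge cases.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Slice text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With chunk_size=4 and overlap=1, the string "abcdefghij" yields "abcd", "defg", "ghij": each chunk repeats the last character of the previous one.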

2. recursive

Hierarchical splitting that tries the largest separator first and falls back to progressively smaller ones.

| Parameter | Default | Description |
| --- | --- | --- |
| chunkSize | 500 | Target characters per chunk |
| overlap | 50 | Overlapping characters between chunks |
| separators | ["\n\n", "\n", ". ", " "] | Ordered list of separators to try |

This is the default strategy for most file types.
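
The fallback through separators can be sketched as follows. This is an illustrative approximation, not the built-in implementation: the function name is hypothetical, overlap is omitted to keep the example short, and text with no usable separator is assumed to be hard-cut at chunk_size.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text: str, chunk_size: int = 500,
                    separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Split on the first separator, recursing with smaller ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are tried first, so chunks tend to end at natural boundaries; only pieces that are still too large are split on newlines, sentences, and finally spaces.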

3. semantic

Groups content by paragraph and sentence boundaries as a proxy for semantic similarity.

| Parameter | Default | Description |
| --- | --- | --- |
| maxChunkSize | 1000 | Maximum characters per chunk |
| minChunkSize | 100 | Minimum characters per chunk |
| similarityThreshold | | Threshold for grouping (simplified; true semantic chunking requires embeddings) |

This is a simplified implementation that uses structural boundaries rather than embedding-based similarity.

4. sentence

Splits by sentence-ending punctuation ([.!?]+) and groups a configurable number of sentences per chunk.

| Parameter | Default | Description |
| --- | --- | --- |
| maxSentencesPerChunk | 5 | Number of sentences per chunk |
| overlap | 1 | Number of overlapping sentences between chunks |

5. paragraph

Splits by double newlines (\n\s*\n) and groups paragraphs.

| Parameter | Default | Description |
| --- | --- | --- |
| maxParagraphsPerChunk | 3 | Number of paragraphs per chunk |
| overlap | 0 | Number of overlapping paragraphs between chunks |
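
The sentence and paragraph strategies share one pattern: split the text into units, then group a fixed number of units per chunk with an optional overlap. The sketch below illustrates that pattern with the documented defaults. All function names are hypothetical, and the way units are joined (a space for sentences, a blank line for paragraphs) is an assumption.

```python
import re
from typing import List


def group_with_overlap(units: List[str], per_chunk: int,
                       overlap: int, joiner: str) -> List[str]:
    """Group units into chunks of per_chunk items sharing `overlap` items."""
    if per_chunk <= 0 or not 0 <= overlap < per_chunk:
        raise ValueError("need per_chunk > 0 and 0 <= overlap < per_chunk")
    step = per_chunk - overlap
    chunks = []
    start = 0
    while start < len(units):
        chunks.append(joiner.join(units[start:start + per_chunk]))
        if start + per_chunk >= len(units):
            break
        start += step
    return chunks


def sentence_chunks(text: str, max_sentences_per_chunk: int = 5,
                    overlap: int = 1) -> List[str]:
    # A sentence is a run of text up to and including [.!?]+
    sentences = [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]
    return group_with_overlap(sentences, max_sentences_per_chunk, overlap, " ")


def paragraph_chunks(text: str, max_paragraphs_per_chunk: int = 3,
                     overlap: int = 0) -> List[str]:
    # Paragraphs are separated by blank lines, matching \n\s*\n
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return group_with_overlap(paragraphs, max_paragraphs_per_chunk, overlap, "\n\n")
```

With an overlap of 1, each sentence chunk begins with the last sentence of the previous chunk, which preserves some context across chunk boundaries.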

6. markdown

Parses markdown structure including headings, code blocks, lists, and paragraphs.

| Parameter | Default | Description |
| --- | --- | --- |
| maxChunkSize | 1000 | Maximum characters per chunk |
| preserveCodeBlocks | true | Keep code blocks as atomic units |
| preserveLists | true | Keep lists as atomic units |
| includeHeadingsInChunks | true | Prefix chunks with their heading hierarchy |

This strategy is automatically selected for .md files in the default configuration.
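
To make the heading-prefix and code-block options concrete, here is a deliberately reduced sketch. It only splits at headings, prefixes each chunk with its heading hierarchy, and ignores heading-like lines inside fenced code blocks. The function name, the " > " prefix format, and the omission of maxChunkSize and list handling are all assumptions made for illustration.

```python
import re
from typing import List

FENCE = "`" * 3  # a fenced code block delimiter
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_sections(text: str) -> List[str]:
    """Split markdown at headings and prefix each chunk with its heading path."""
    chunks = []
    stack = []   # (level, title) pairs for the current heading hierarchy
    body = []    # lines collected since the last heading
    in_fence = False

    def flush():
        content = "\n".join(body).strip()
        if content:
            prefix = " > ".join(title for _, title in stack)
            chunks.append(f"{prefix}\n{content}" if prefix else content)
        body.clear()

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # leave deeper or same-level sections
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Carrying the heading path into each chunk keeps a chunk understandable when it is retrieved on its own.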

Default Chunking Settings

Every new collection is initialized with the following chunkingsettings.json:

{
  "defaultStrategy": "recursive",
  "defaultOptions": {
    "chunkSize": 500,
    "overlap": 50
  },
  "fileTypeOverrides": {
    ".md": {
      "strategy": "markdown",
      "options": {
        "maxChunkSize": 1000
      }
    },
    ".json": {
      "strategy": "fixed_size",
      "options": {
        "chunkSize": 500
      }
    }
  }
}

The fileTypeOverrides map matches the original file's extension to a specific strategy and options. Any extension not listed falls back to defaultStrategy with defaultOptions.
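
The lookup described above can be sketched in a few lines. The function name resolve_chunking is hypothetical, and lowercasing the extension before the lookup is an assumption; the settings dictionary mirrors the defaults shown above.

```python
import os

DEFAULT_SETTINGS = {
    "defaultStrategy": "recursive",
    "defaultOptions": {"chunkSize": 500, "overlap": 50},
    "fileTypeOverrides": {
        ".md": {"strategy": "markdown", "options": {"maxChunkSize": 1000}},
        ".json": {"strategy": "fixed_size", "options": {"chunkSize": 500}},
    },
}


def resolve_chunking(filename: str, settings: dict) -> tuple:
    """Return (strategy, options) for a file, honoring per-extension overrides."""
    ext = os.path.splitext(filename)[1].lower()  # assumption: case-insensitive match
    override = settings.get("fileTypeOverrides", {}).get(ext)
    if override:
        return override["strategy"], override.get("options", {})
    return settings["defaultStrategy"], settings.get("defaultOptions", {})
```

For example, README.md resolves to the markdown strategy, while notes.txt has no override and falls back to recursive with the default options.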

Token Estimation

Chunk metadata includes a token count estimated at roughly 4 characters per token. The estimate is informational only and is not used for embedding generation.
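
As a worked example of the approximation, a 500-character chunk comes out at about 125 tokens. The sketch below assumes the estimate is rounded up; the function name and the rounding rule are not taken from the implementation.

```python
import math


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate: character count divided by ~4, rounded up (assumed)."""
    return math.ceil(len(text) / chars_per_token)
```

Real tokenizers vary by model and language, so this number is only useful as a rough size indicator.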

Collection Operations

| Operation | Behavior |
| --- | --- |
| Create | Generates a UUID, creates the directory structure, saves collection.json and default chunkingsettings.json |
| List | Reads all collection directories, sorts by updatedAt descending |
| Update | Modifies name and/or description fields in collection.json |
| Delete | Removes the entire collection directory recursively |
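
Because collections are plain directories, the create, list, and delete operations map directly onto filesystem calls. The sketch below is an approximation under stated assumptions: the function names and the id and createdAt fields are hypothetical, while updatedAt, the file names, and the directory layout follow this page. It runs against a temporary directory.

```python
import json
import shutil
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

DEFAULT_CHUNKING = {
    "defaultStrategy": "recursive",
    "defaultOptions": {"chunkSize": 500, "overlap": 50},
    "fileTypeOverrides": {
        ".md": {"strategy": "markdown", "options": {"maxChunkSize": 1000}},
        ".json": {"strategy": "fixed_size", "options": {"chunkSize": 500}},
    },
}


def knowledge_root(project_path: str) -> Path:
    return Path(project_path) / ".codebolt" / "knowledge"


def create_collection(project_path: str, name: str, description: str = "") -> dict:
    collection_id = str(uuid.uuid4())
    directory = knowledge_root(project_path) / collection_id
    (directory / "documents").mkdir(parents=True)
    now = datetime.now(timezone.utc).isoformat(timespec="microseconds")
    meta = {"id": collection_id, "name": name, "description": description,
            "createdAt": now, "updatedAt": now}  # id/createdAt are assumed fields
    (directory / "collection.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    (directory / "chunkingsettings.json").write_text(
        json.dumps(DEFAULT_CHUNKING, indent=2), encoding="utf-8")
    return meta


def list_collections(project_path: str) -> list:
    root = knowledge_root(project_path)
    if not root.is_dir():
        return []
    metas = [json.loads(p.read_text(encoding="utf-8"))
             for p in root.glob("*/collection.json")]
    return sorted(metas, key=lambda m: m["updatedAt"], reverse=True)


def delete_collection(project_path: str, collection_id: str) -> None:
    shutil.rmtree(knowledge_root(project_path) / collection_id)


# Usage against a throwaway project directory.
with tempfile.TemporaryDirectory() as project:
    first = create_collection(project, "first")
    create_collection(project, "second")
    names_before = sorted(c["name"] for c in list_collections(project))
    delete_collection(project, first["id"])
    names_after = [c["name"] for c in list_collections(project)]
```

Deleting the directory removes the collection's documents and chunks in one step, which matches the recursive delete described in the table.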

WebSocket Events

The following events are emitted during knowledge operations:

| Event | Trigger |
| --- | --- |
| collection-created | New collection created |
| collection-updated | Collection metadata modified |
| collection-deleted | Collection removed |
| document-added | New document added to a collection |
| document-deleted | Document removed |
| chunking-started | Document processing begins |
| chunking-progress | Progress update during chunking |
| chunking-completed | Document successfully chunked |
| chunking-failed | Chunking encountered an error |
| chunk-updated | Individual chunk edited |

REST API

Collections

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /knowledge/collections | Create a new collection |
| GET | /knowledge/collections | List all collections |
| PUT | /knowledge/collections/:id | Update collection metadata |
| DELETE | /knowledge/collections/:id | Delete a collection and all its documents |

Documents

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /knowledge/collections/:id/documents | Upload a document (file or raw text) |
| POST | /knowledge/collections/:id/documents/url | Import a document from a URL |
| GET | /knowledge/collections/:id/documents | List documents in a collection |
| GET | /knowledge/documents/:documentId?collectionId= | Get a specific document |
| DELETE | /knowledge/documents/:documentId?collectionId= | Delete a document |
| POST | /knowledge/documents/:documentId/rechunk?collectionId= | Re-chunk a document with current settings |

Chunks

| Method | Endpoint | Description |
| --- | --- | --- |
| PUT | /knowledge/chunks/:chunkId?documentId=&collectionId= | Update a specific chunk's content |

Settings and Strategies

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /knowledge/collections/:id/settings | Get chunking settings for a collection |
| PUT | /knowledge/collections/:id/settings | Update chunking settings |
| GET | /knowledge/strategies | List all available chunking strategies |
| GET | /knowledge/strategies/:strategy/options | Get configurable options for a strategy |
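
Several document and chunk endpoints identify their parent collection (and document) through query parameters rather than path segments. The sketch below shows how such URLs could be assembled. The helper names are hypothetical, and the base URL in the usage note is a placeholder: this page does not state the server's host or port.

```python
from urllib.parse import quote, urlencode


def knowledge_url(base_url: str, path: str, **query: str) -> str:
    """Join a base URL and an endpoint path, appending query parameters if any."""
    url = base_url.rstrip("/") + path
    return f"{url}?{urlencode(query)}" if query else url


def document_url(base_url: str, document_id: str, collection_id: str) -> str:
    path = f"/knowledge/documents/{quote(document_id, safe='')}"
    return knowledge_url(base_url, path, collectionId=collection_id)


def chunk_update_url(base_url: str, chunk_id: str,
                     document_id: str, collection_id: str) -> str:
    path = f"/knowledge/chunks/{quote(chunk_id, safe='')}"
    return knowledge_url(base_url, path,
                         documentId=document_id, collectionId=collection_id)
```

With a placeholder base such as http://localhost:3000, chunk_update_url(base, "c1", "d1", "col1") produces a URL ending in /knowledge/chunks/c1?documentId=d1&collectionId=col1, which is the target of the PUT request that edits a chunk.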