Semantic search lets you find recording sessions using natural language queries. It works by generating vector embeddings from session metadata (robot name, operator, tags, labels, track names) and using pgvector for similarity matching.
Semantic search requires PostgreSQL with pgvector. Make sure you’ve completed the PostgreSQL setup first.

Install Dependencies

Install the search extras, which include sentence-transformers and the PostgreSQL driver:
uv pip install repoch[search]
Use the provided configs/search.toml:
[database]
backend = "postgres"

[database.postgres]
# password should be set via environment variable for security

[search]
enabled = true
Start the server:
REPOCH__DATABASE__POSTGRES__PASSWORD=your_secure_password repoch server --config configs/search.toml
When the server starts with search enabled, it automatically creates the pgvector extension and begins generating embeddings for new and updated sessions.
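The `REPOCH__DATABASE__POSTGRES__PASSWORD` variable above follows a common double-underscore convention for nested config keys. A minimal sketch of how such a variable might map onto a config path (the parsing logic here is an assumption for illustration, not repoch's actual implementation):

```python
# Sketch: mapping a double-underscore env var to a nested config key path.
# The parsing convention is an assumption, not repoch's actual implementation.
def env_to_config_path(name: str, prefix: str = "REPOCH") -> list[str]:
    """Split REPOCH__DATABASE__POSTGRES__PASSWORD into nested config keys."""
    head, _, rest = name.partition("__")
    if head != prefix or not rest:
        raise ValueError(f"not a {prefix} variable: {name}")
    return [part.lower() for part in rest.split("__")]


path = env_to_config_path("REPOCH__DATABASE__POSTGRES__PASSWORD")
# path == ["database", "postgres", "password"], i.e. [database.postgres] password
```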

Build the Index

To generate embeddings for existing sessions, run the reindex command against a running server:
repoch db reindex
Use --force to re-embed all sessions, even those that haven’t changed:
repoch db reindex --force

Configuration Options

All search settings are optional and have sensible defaults:
| Setting | Default | Description |
| --- | --- | --- |
| `enabled` | `false` | Enable semantic search |
| `embedding_model` | `all-MiniLM-L6-v2` | Sentence-transformer model to use |
| `device` | `cpu` | Computation device: `cpu`, `cuda`, or `auto` |
| `batch_size` | `100` | Sessions per embedding batch |
| `max_reindex_sessions` | `10000` | Maximum sessions per reindex request |
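Spelled out as a config file, a `[search]` block with every setting made explicit would look roughly like this (values mirror the defaults in the table, except `enabled`):

```toml
[search]
enabled = true                        # defaults to false
embedding_model = "all-MiniLM-L6-v2"  # sentence-transformers model name
device = "cpu"                        # "cpu", "cuda", or "auto"
batch_size = 100                      # sessions per embedding batch
max_reindex_sessions = 10000          # cap per reindex request
```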

GPU Acceleration

For faster embedding generation on machines with a CUDA-compatible GPU:
[search]
enabled = true
device = "cuda"
Set device = "auto" to use the GPU when available and fall back to CPU otherwise.
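The `auto` fallback amounts to a small selection rule. A sketch of that logic (in practice the availability check would be something like `torch.cuda.is_available()`; it is passed in here so the snippet stands alone):

```python
# Sketch: "auto" device selection with CPU fallback. The cuda_available
# flag stands in for a runtime check such as torch.cuda.is_available().
def resolve_device(setting: str, cuda_available: bool) -> str:
    if setting == "auto":
        return "cuda" if cuda_available else "cpu"
    if setting not in ("cpu", "cuda"):
        raise ValueError(f"unknown device setting: {setting}")
    return setting


assert resolve_device("auto", cuda_available=False) == "cpu"
assert resolve_device("auto", cuda_available=True) == "cuda"
```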
The default all-MiniLM-L6-v2 model is lightweight and runs well on CPU. GPU acceleration is most beneficial when reindexing large numbers of sessions.

How It Works

Embeddings are generated from session metadata including:
  • Robot name
  • Operator username
  • Tags
  • Timeslice labels
  • Track, camera, audio source and time series group names
When any of these change, the embedding is automatically regenerated. A source text hash is stored alongside each embedding to detect when sessions need re-embedding. Search queries are embedded using the same model and matched against stored embeddings using cosine distance. Results can be further filtered by robot name, tags, time range and visibility.
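The matching step can be illustrated in a few lines: cosine distance is 0 for vectors pointing the same way and approaches 2 for opposite vectors, so lower is more similar. A minimal sketch, with toy 3-dimensional vectors standing in for the 384-dimensional embeddings `all-MiniLM-L6-v2` produces:

```python
# Sketch: cosine-distance matching between a query embedding and stored
# session embeddings. Toy vectors stand in for real model output.
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [0.9, 0.1, 0.0]
sessions = {
    "session-a": [0.8, 0.2, 0.1],  # similar metadata -> small distance
    "session-b": [0.0, 0.1, 0.9],  # unrelated metadata -> large distance
}
ranked = sorted(sessions, key=lambda s: cosine_distance(query, sessions[s]))
# ranked[0] == "session-a"
```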

Next Steps

Configuration System

Learn more about the full configuration system