Vector Store User Guide

Vector Store is a component designed for storing, managing, and retrieving vector embeddings. It supports features such as workspace management, similarity search, and metadata filtering.

Core Concepts

Workspace: Each workspace is an independent vector storage unit used to organize and manage related vector nodes.

VectorNode: A data unit containing text content, vector embedding, and metadata. It serves as the fundamental unit for storage and retrieval.

Embedding Model: Used to convert text into vector embeddings. It supports automatic generation of both node vectors and query vectors.

Available Implementations

FlowLLM provides multiple Vector Store implementations tailored to different use cases:

  • LocalVectorStore (source code): A file-based local implementation that persists data in JSONL format. Suitable for single-machine deployments and small-scale datasets.

  • MemoryVectorStore (source code): An in-memory implementation offering fast access speeds. Ideal for temporary data or testing scenarios.

  • QdrantVectorStore (source code): Built on the Qdrant vector database, supporting high-performance vector search. Recommended for large-scale production environments.

  • ChromaVectorStore (source code): Based on ChromaDB, providing persistent storage and metadata filtering capabilities.

  • EsVectorStore (source code): Built on Elasticsearch, enabling powerful combined full-text and vector search functionalities.

All Vector Store implementations inherit from BaseVectorStore (source code), ensuring a consistent interface specification.

Core Features

Workspace Management

  • Create Workspace: Create a new workspace for storing vector nodes.

  • Delete Workspace: Remove a workspace along with all its data.

  • Check Workspace Existence: Verify whether a specified workspace exists.

  • List Workspaces: Retrieve a list of all existing workspaces.

  • Copy Workspace: Duplicate data from one workspace to another.

Node Operations

  • Insert Nodes: Insert vector nodes into a workspace, supporting single or batch insertion with automatic vector embedding generation.

  • Delete Nodes: Remove specific nodes by their IDs.

  • Iterate Nodes: Traverse all nodes within a workspace.

Data Import/Export

  • Export Workspace: Export workspace data to a file or specified path.

  • Import Workspace: Import data into a workspace from a file or a list of nodes.

  • Callback Functions: Support callback functions during import/export for data transformation.

Synchronous and Asynchronous Interfaces

All Vector Store implementations provide both synchronous and asynchronous interfaces:

  • Synchronous Interface: Direct method calls suitable for synchronous code environments.

  • Asynchronous Interface: Prefixed with async_, designed for asynchronous environments and offering better concurrency performance.

The asynchronous interface is particularly useful in the following scenarios:

  • Using asynchronous embedding models for vector generation.

  • Performing batch operations in high-concurrency environments.

  • Integrating with other asynchronous components.

Configuration Options

General Configuration

  • embedding_model: Instance of the embedding model used to generate vector embeddings.

  • batch_size: Batch size for bulk operations (default: 1024).

LocalVectorStore Configuration

  • store_dir: Storage directory path (default: ./local_vector_store).

MemoryVectorStore Configuration

  • store_dir: Persistence directory (default: ./memory_vector_store).

QdrantVectorStore Configuration

  • url: Qdrant service URL (optional; used for Qdrant Cloud or custom deployments).

  • host: Qdrant server host (default: localhost).

  • port: Qdrant server port (default: 6333).

  • api_key: API key for Qdrant Cloud authentication.

  • distance: Distance metric—supports COSINE, EUCLIDEAN, DOT (default: COSINE).

ChromaVectorStore Configuration

  • store_dir: ChromaDB data storage directory (default: ./chroma_vector_store).

EsVectorStore Configuration

  • hosts: Elasticsearch host address(es), either a string or a list (default: http://localhost:9200).

  • basic_auth: Basic authentication credentials (username and password).

Configuration File Examples

Configure Vector Store in flowllm/config/default.yaml under the vector_store section. The basic structure is as follows:

vector_store:
  default:
    backend: <backend_name>        # Required: vector store backend type
    embedding_model: default       # Required: name of embedding model config
    params:                        # Optional: backend-specific parameters
      # Backend-specific parameters

Configuration Field Descriptions

  • backend (required): Vector store backend type. Options: local, memory, chroma, qdrant, elasticsearch.

  • embedding_model (required): Name of the embedding model configuration, referencing the embedding_model section.

  • params (optional): Dictionary of backend-specific parameters passed to the vector store constructor.

Configuration Examples by Type

1. LocalVectorStore Configuration

Simplest local file-based storage, ideal for development and testing.

Implementation: flowllm/core/vector_store/local_vector_store.py

vector_store:
  default:
    backend: local
    embedding_model: default
    params:
      store_dir: "./local_vector_store"  # Storage directory (optional; default: "./local_vector_store")

2. MemoryVectorStore Configuration

In-memory storage with fast access, suitable for temporary data or testing.

Implementation: flowllm/core/vector_store/memory_vector_store.py

vector_store:
  default:
    backend: memory
    embedding_model: default
    params:
      store_dir: "./memory_vector_store"  # Persistence directory (optional; default: "./memory_vector_store")

3. ChromaVectorStore Configuration

Persistent storage based on ChromaDB with metadata filtering support.

Implementation: flowllm/core/vector_store/chroma_vector_store.py

vector_store:
  default:
    backend: chroma
    embedding_model: default
    params:
      store_dir: "./chroma_vector_store"  # ChromaDB data directory (optional; default: "./chroma_vector_store")

4. QdrantVectorStore Configuration

Implementation: flowllm/core/vector_store/qdrant_vector_store.py

Local Qdrant Instance:

vector_store:
  default:
    backend: qdrant
    embedding_model: default
    params:
      host: "localhost"      # Qdrant server host (optional; default: localhost)
      port: 6333             # Qdrant server port (optional; default: 6333)
      distance: "COSINE"     # Distance metric (optional; default: COSINE; options: COSINE, EUCLIDEAN, DOT)

Qdrant Cloud Configuration:

vector_store:
  default:
    backend: qdrant
    embedding_model: default
    params:
      url: "https://your-cluster.qdrant.io:6333"  # Qdrant Cloud URL
      api_key: "your-api-key-here"                 # API key
      distance: "COSINE"

5. EsVectorStore Configuration

Implementation: flowllm/core/vector_store/es_vector_store.py

Basic Configuration (Local Elasticsearch):

vector_store:
  default:
    backend: elasticsearch
    embedding_model: default
    params:
      hosts: "http://localhost:9200"  # Elasticsearch host(s) (optional; default: http://localhost:9200)

Configuration with Authentication:

vector_store:
  default:
    backend: elasticsearch
    embedding_model: default
    params:
      hosts: "http://elasticsearch.example.com:9200"
      basic_auth: ["username", "password"]  # Basic auth credentials

Multi-Host Configuration:

vector_store:
  default:
    backend: elasticsearch
    embedding_model: default
    params:
      hosts:
        - "http://es-node1:9200"
        - "http://es-node2:9200"
        - "http://es-node3:9200"

Complete Configuration Example

Below is a complete default.yaml example including both embedding model and vector store configurations:

# Embedding model configuration
embedding_model:
  default:
    backend: openai_compatible
    model_name: text-embedding-v4
    params:
      dimensions: 1024

# Vector store configuration
vector_store:
  default:
    backend: elasticsearch
    embedding_model: default
    params:
      hosts: "http://localhost:9200"

Environment Variable Support

Certain Vector Stores support environment variables as a supplement to YAML configuration:

  • Elasticsearch: FLOW_ES_HOSTS – Elasticsearch host address.

  • Qdrant:

    • FLOW_QDRANT_HOST – Qdrant host (default: localhost)

    • FLOW_QDRANT_PORT – Qdrant port (default: 6333)

    • FLOW_QDRANT_API_KEY – Qdrant API key

When parameters are not explicitly specified in the YAML configuration, the system falls back to environment variables.

Metadata Filtering

Two types of metadata filtering are supported:

  • Exact Match: Specify field values for exact matching.

  • Range Queries: Use operators gte, lte, gt, lt for numeric range queries.

  • Nested Fields: Access nested metadata fields using dot notation.

Usage Recommendations

  • Development & Testing: Use MemoryVectorStore or LocalVectorStore—no additional services required.

  • Small-Scale Applications: Use LocalVectorStore or ChromaVectorStore for simplicity and ease of use.

  • Production Environments: Use QdrantVectorStore or EsVectorStore for high performance and scalability.

  • Hybrid Search: Use EsVectorStore to combine vector search with full-text search capabilities.

Important Notes

  • Ensure the embedding model’s output dimension matches the Vector Store configuration.

  • For large-scale data, use professional vector databases (e.g., Qdrant, Elasticsearch).

  • Asynchronous interfaces deliver better performance in asynchronous environments.

  • Regularly back up critical data, especially when using in-memory storage.

  • Choose an appropriate batch size based on your data scale to optimize performance.