JVector Advanced Usage

This section covers advanced usage patterns for production deployments.

On-Disk Index with Compression

For large datasets that exceed available memory, combine on-disk storage with Product Quantization (PQ) compression.

VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .maxDegree(32)
    .beamWidth(200)
    // On-disk storage
    .onDisk(true)
    .indexDirectory(Path.of("/data/vectors"))
    // PQ compression (reduces memory significantly)
    .enablePqCompression(true)
    .pqSubspaces(48)  // Must divide dimension evenly
    .build();
Enabling PQ compression automatically enforces maxDegree=32, a fixed constraint of the FusedPQ algorithm (see the FusedPQ section below).

How the On-Disk Index Works

The vector index uses the HNSW (Hierarchical Navigable Small World) algorithm, which constructs a multi-layer proximity graph for efficient approximate nearest neighbor search. When on-disk mode is enabled, this graph is persisted to disk and accessed via memory-mapped files rather than being held entirely on the Java heap.

Build and Persist Lifecycle

The graph is always built in-memory first. Entities are added to the graph one at a time, each inserted by finding its nearest neighbors through a beam search and then connecting it to those neighbors via bidirectional edges. Once the in-memory graph is ready, it is serialized to disk.
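The connection step of this lifecycle can be sketched as follows. This is an illustrative model only — the class and field names (`GraphInsert`, `edges`) are hypothetical, JVector's real graph structures differ, and the beam search and degree pruning are omitted here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GraphInsert
{
    // Hypothetical adjacency structure for illustration only.
    final Map<Integer, List<Integer>> edges = new HashMap<>();

    // Connects a newly inserted node to its (already found) nearest neighbors
    // with bidirectional edges, as in the insertion step described above.
    void connect(int newNode, List<Integer> nearestNeighbors)
    {
        this.edges.computeIfAbsent(newNode, n -> new ArrayList<>()).addAll(nearestNeighbors);
        for (int neighbor : nearestNeighbors)
        {
            this.edges.computeIfAbsent(neighbor, n -> new ArrayList<>()).add(newNode);
        }
    }
}
```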

Persistence creates two files in the configured indexDirectory:

  • {indexName}.graph — The full HNSW graph structure, including node connectivity (edges), and optionally embedded inline vectors and FusedPQ compressed data.

  • {indexName}.meta — A small metadata file containing the file format version, the vector dimension, and the current vector count. This is used on startup to verify that the on-disk graph matches the current data.

Loading from Disk

On startup, the index first checks whether valid graph and metadata files exist. The metadata is verified: the dimension must match the configured value, and the vector count must match the number of vectors currently stored. If both checks pass, the graph is loaded directly from disk using memory-mapped I/O. This avoids rebuilding the graph from scratch, which can be expensive for large datasets.

If the metadata does not match (e.g., because new entities were added while the application was offline, or the configuration changed), the on-disk files are ignored and the graph is rebuilt in-memory from the stored vectors.
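The startup decision described above amounts to a pure validation check, sketched below. The `Meta` record and `isReusable` method are hypothetical names for illustration; the actual on-disk format is internal to the library.

```java
public class MetaCheck
{
    // Illustrative record of the fields the .meta file is described as holding:
    // file format version, vector dimension, and vector count.
    record Meta(int formatVersion, int dimension, long vectorCount) {}

    // The on-disk graph is reused only if every metadata field matches the
    // live configuration; otherwise the graph is rebuilt from stored vectors.
    static boolean isReusable(Meta meta, int configuredDimension, long storedVectorCount, int supportedVersion)
    {
        return meta.formatVersion() == supportedVersion
            && meta.dimension()     == configuredDimension
            && meta.vectorCount()   == storedVectorCount;
    }
}
```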

Memory-Mapped Access

The on-disk graph is accessed through memory mapping the graph file. The operating system’s virtual memory subsystem pages graph data in and out as needed, which means:

  • The JVM heap is not burdened with the full graph structure.

  • The OS page cache automatically keeps frequently accessed portions of the graph in physical memory.

  • Datasets much larger than available RAM can still be searched, with a trade-off of slightly higher query latency when pages need to be fetched from disk.
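The mechanism itself is standard Java NIO memory mapping, sketched below; `mapGraph` and `writeTempGraph` are illustrative helpers, not part of the library's API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class GraphMapping
{
    // Maps a file into virtual memory; the OS pages data in on demand,
    // so the JVM heap never holds the full graph.
    static MappedByteBuffer mapGraph(Path graphFile)
    {
        try (FileChannel channel = FileChannel.open(graphFile))
        {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: writes some bytes to a temporary "graph" file.
    static Path writeTempGraph(byte[] bytes)
    {
        try
        {
            Path path = Files.createTempFile("graph", ".bin");
            Files.write(path, bytes);
            return path;
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }
}
```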

Product Quantization (PQ) Compression

Product Quantization is a lossy vector compression technique that drastically reduces the memory footprint of stored vectors while maintaining reasonable search accuracy.

The Core Idea

A raw float vector of dimension D requires D * 4 bytes of storage. For a 768-dimensional embedding, that is 3,072 bytes per vector. At scale (e.g., 10 million vectors), this alone consumes roughly 30 GB of memory — just for the vectors.

PQ reduces this by splitting each vector into M subspaces (contiguous slices) and independently quantizing each subspace using a codebook of 256 centroids (cluster centers). Since 256 values fit in a single byte (2^8 = 256), each subspace is encoded as one byte. A full vector is thus compressed from D * 4 bytes down to M bytes.

Example: A 768-dimensional vector with pqSubspaces=48:

  • Each subspace covers 768 / 48 = 16 dimensions.

  • Each subspace is encoded as 1 byte (index into its 256-entry codebook).

  • Compressed size: 48 bytes per vector (vs. 3,072 bytes raw) — a 64x reduction.
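The arithmetic above can be captured in a few helper methods (illustrative names, not library API):

```java
public class PqSizing
{
    // Raw size: D floats at 4 bytes each.
    static int rawBytes(int dimension)        { return dimension * 4; }

    // Compressed size: one byte per subspace (a code into a 256-entry codebook).
    static int compressedBytes(int subspaces) { return subspaces; }

    static int compressionRatio(int dimension, int subspaces)
    {
        if (dimension % subspaces != 0)
        {
            throw new IllegalArgumentException("dimension must be divisible by subspaces");
        }
        return rawBytes(dimension) / compressedBytes(subspaces);
    }

    // Total raw footprint for a dataset of the given size.
    static long totalRawBytes(int dimension, long vectorCount)
    {
        return (long) rawBytes(dimension) * vectorCount;
    }
}
```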

Training the Codebook

Before vectors can be compressed, PQ must learn the codebooks by clustering the training data. This happens automatically when the index is persisted to disk for the first time (provided at least 256 vectors exist). The training process:

  1. Collects all stored vectors.

  2. Splits each vector into M subspaces.

  3. Runs k-means clustering (k=256) independently on each subspace, producing 256 centroids per subspace.

  4. Encodes every vector by replacing each subspace with the index of its nearest centroid.

The resulting codebook and compressed codes are stored alongside the graph.
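Step 4 — encoding a vector against trained codebooks — can be sketched as follows. `PqEncoder` is a hypothetical name for illustration; the k-means training itself (steps 1-3) is omitted.

```java
public class PqEncoder
{
    // Encodes one vector: for each subspace, find the nearest centroid
    // (by squared Euclidean distance) and store its index as a single byte.
    // codebooks[s][c] is the c-th centroid of subspace s (length = subDim).
    static byte[] encode(float[] vector, float[][][] codebooks)
    {
        int m = codebooks.length;
        int subDim = vector.length / m;
        byte[] codes = new byte[m];
        for (int s = 0; s < m; s++)
        {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int c = 0; c < codebooks[s].length; c++)
            {
                float dist = 0;
                for (int j = 0; j < subDim; j++)
                {
                    float diff = vector[s * subDim + j] - codebooks[s][c][j];
                    dist += diff * diff;
                }
                if (dist < bestDist)
                {
                    bestDist = dist;
                    best = c;
                }
            }
            codes[s] = (byte) best;
        }
        return codes;
    }
}
```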

Approximate Distance Computation

Once vectors are compressed, distances between a query vector and a stored (compressed) vector can be computed without decompressing. For each subspace, the distance between the query’s subspace and the centroid (looked up by the stored byte index) is precomputed into a lookup table. The total distance is the sum of these per-subspace distances.

This is much faster than computing the full exact distance, because it replaces D floating-point multiplications with M table lookups and additions.
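This asymmetric distance computation can be sketched as below: `buildTable` precomputes the per-subspace query-to-centroid distances once per query, and `approximateDistance` then scores each compressed vector with M lookups and additions. Names are illustrative, not the library's API.

```java
public class AdcDistance
{
    // Precompute: for each subspace, the squared distance from the query's
    // slice to every centroid. The table is reused for all stored vectors.
    static float[][] buildTable(float[] query, float[][][] codebooks)
    {
        int m = codebooks.length;
        int subDim = query.length / m;
        float[][] table = new float[m][];
        for (int s = 0; s < m; s++)
        {
            table[s] = new float[codebooks[s].length];
            for (int c = 0; c < codebooks[s].length; c++)
            {
                float dist = 0;
                for (int j = 0; j < subDim; j++)
                {
                    float diff = query[s * subDim + j] - codebooks[s][c][j];
                    dist += diff * diff;
                }
                table[s][c] = dist;
            }
        }
        return table;
    }

    // Scoring a compressed vector: M table lookups and additions,
    // no decompression and no per-dimension multiplications.
    static float approximateDistance(byte[] codes, float[][] table)
    {
        float sum = 0;
        for (int s = 0; s < codes.length; s++)
        {
            sum += table[s][codes[s] & 0xFF];
        }
        return sum;
    }
}
```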

FusedPQ: Compression Fused into the Graph

When PQ compression is enabled for an on-disk index, the implementation uses FusedPQ — a technique where the compressed PQ codes for each node’s neighbors are stored directly alongside the graph edges in the on-disk file.

During graph traversal (search), the HNSW algorithm must evaluate the distance from the query to each candidate neighbor. With FusedPQ, these approximate distances can be computed directly from the fused PQ codes embedded in the graph structure, without a separate random-access lookup into a vector store. This reduces I/O and cache misses during search.

The on-disk graph file thus contains three types of data per node:

  • Graph edges — The connections to neighboring nodes (the HNSW structure).

  • Inline vectors — The full-precision float vectors, used for exact reranking.

  • FusedPQ codes — The PQ-compressed representations of the node’s neighbors, used for fast approximate scoring during traversal.

The FusedPQ implementation requires exactly maxDegree=32. This is a fixed constraint of the algorithm’s internal layout, which is why enabling PQ compression automatically enforces this value.

Two-Phase Search with Reranking

When PQ compression is active, searches use a two-phase approach to balance speed and accuracy:

Phase 1 — Approximate candidate retrieval: The HNSW graph is traversed using the FusedPQ-compressed vectors for fast approximate distance computation. This phase fetches 2 * k candidates (twice the requested result count) to ensure the true top-k results are captured despite the approximation error introduced by quantization.

Phase 2 — Exact reranking: The 2 * k approximate candidates are then re-scored using the full-precision inline vectors stored in the graph file. The exact scores are sorted and the best k results are returned.

This two-phase approach achieves nearly the same recall as an uncompressed search while benefiting from the speed and memory advantages of PQ during graph traversal.
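Phase 2 can be sketched generically: given the approximate candidates and an exact distance function, sort and truncate. `Reranker` is a hypothetical helper, not library API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class Reranker
{
    // Re-scores the 2*k approximate candidates with an exact distance
    // (computed from the full-precision inline vectors) and keeps the
    // best k by ascending distance.
    static <T> List<T> rerank(List<T> candidates, ToDoubleFunction<T> exactDistance, int k)
    {
        return candidates.stream()
            .sorted(Comparator.comparingDouble(exactDistance))
            .limit(k)
            .toList();
    }
}
```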

Choosing PQ Subspaces

The pqSubspaces parameter (often called M in PQ literature) controls the compression ratio and accuracy trade-off:

Subspaces                               Bytes per Vector   Characteristics
dimension / 2  (e.g., 384 for dim=768)  384                Highest accuracy, least compression
dimension / 4  (e.g., 192 for dim=768)  192                Good balance of accuracy and compression (recommended starting point)
dimension / 8  (e.g.,  96 for dim=768)  96                 More aggressive compression, some accuracy loss
dimension / 16 (e.g.,  48 for dim=768)  48                 Maximum compression, noticeable accuracy loss

The dimension must be evenly divisible by the number of subspaces. If pqSubspaces is set to 0, it defaults to dimension / 4.
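That defaulting and divisibility rule can be expressed as a small helper (illustrative, not library API):

```java
public class PqSubspaces
{
    // Resolves the effective subspace count: 0 means "default to dimension / 4";
    // the resulting value must divide the dimension evenly.
    static int resolve(int dimension, int pqSubspaces)
    {
        int m = (pqSubspaces == 0) ? dimension / 4 : pqSubspaces;
        if (dimension % m != 0)
        {
            throw new IllegalArgumentException(
                "dimension " + dimension + " is not divisible by pqSubspaces " + m);
        }
        return m;
    }
}
```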

Parallel On-Disk Writes

For large indices, writing the graph to disk can be slow. Enabling parallelOnDiskWrite(true) uses the OnDiskParallelGraphIndexWriter, which allocates parallel direct buffers and uses multiple worker threads (one per available processor) to write the graph concurrently. This can significantly speed up persistence for indices with millions of nodes.

VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .onDisk(true)
    .indexDirectory(Path.of("/data/vectors"))
    .enablePqCompression(true)
    .pqSubspaces(48)
    .parallelOnDiskWrite(true) // Multi-threaded disk writes
    .build();

Sequential (single-threaded) writing is the default, which is preferable in resource-constrained environments or for smaller indices.

Production Configuration

For production systems with continuous updates, enable both background persistence and optimization.

VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    // On-disk storage
    .onDisk(true)
    .indexDirectory(Path.of("/data/vectors"))
    // Background persistence (async, non-blocking, enabled by setting interval > 0)
    .persistenceIntervalMs(30_000)        // Enable, check every 30 seconds
    .minChangesBetweenPersists(100)       // Only persist if >= 100 changes
    .persistOnShutdown(true)              // Persist on close()
    // Background optimization (periodic cleanup, enabled by setting interval > 0)
    .optimizationIntervalMs(60_000)       // Enable, check every 60 seconds
    .minChangesBetweenOptimizations(1000) // Only optimize if >= 1000 changes
    .optimizeOnShutdown(false)            // Skip for faster shutdown
    .build();

Manual Optimization and Persistence

For fine-grained control, you can manually trigger optimization and persistence.

// Optimize graph (removes excess neighbors, improves query latency)
index.optimize();

// Persist to disk (for on-disk indices)
index.persistToDisk();

// Close index (runs shutdown hooks based on config)
index.close();

Multiple Vector Indices

You can create multiple vector indices for different embedding types on the same entity.

GigaMap<Document> gigaMap = GigaMap.New();
VectorIndices<Document> vectorIndices = gigaMap.index().register(VectorIndices.Category());

// Title embeddings (smaller dimension)
VectorIndexConfiguration titleConfig = VectorIndexConfiguration.builder()
    .dimension(384)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .build();
VectorIndex<Document> titleIndex = vectorIndices.add("title", titleConfig, new TitleVectorizer());

// Content embeddings (larger dimension)
VectorIndexConfiguration contentConfig = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .build();
VectorIndex<Document> contentIndex = vectorIndices.add("content", contentConfig, new ContentVectorizer());

Combine vector similarity search with traditional bitmap index filtering.

// First, filter by category using bitmap index
// First, filter by category using bitmap index
// (collect IDs into a Set for O(1) membership checks)
Set<Long> categoryIds = Set.copyOf(bitmapIndex.query(
    categoryIndexer.is("technology")
).toList());

// Then search within filtered results
VectorSearchResult<Document> result = vectorIndex.search(queryVector, 10);

// Combine results
List<Document> hybridResults = result.stream()
    .filter(e -> categoryIds.contains(e.entityId()))
    .map(VectorSearchResult.Entry::entity)
    .toList();

Vectorizer Implementations

Embedded Vectors

When the vector is stored directly in the entity, set isEmbedded() to true to avoid duplicate storage.

public class DocumentVectorizer extends Vectorizer<Document>
{
    @Override
    public float[] vectorize(Document entity)
    {
        return entity.embedding();
    }

    @Override
    public boolean isEmbedded()
    {
        return true;
    }
}

Computed Vectors

When vectors are computed on-the-fly or fetched from an external service, set isEmbedded() to false.

public class TextVectorizer extends Vectorizer<Document>
{
    private final EmbeddingService embeddingService;

    public TextVectorizer(EmbeddingService embeddingService)
    {
        this.embeddingService = embeddingService;
    }

    @Override
    public float[] vectorize(Document entity)
    {
        return embeddingService.embed(entity.text());
    }

    @Override
    public boolean isEmbedded()
    {
        return false; // Vector will be stored separately
    }
}
When using computed vectors with an external service, be aware of potential latency during indexing operations. Consider pre-computing embeddings and storing them in the entity for better performance.
The vectorize() method must never return null. If it does, an IllegalStateException is thrown when the entity is added, updated, or re-indexed. If some entities cannot produce a vector, they should be excluded before adding them to the GigaMap, or the vectorizer must provide a fallback vector.

Batch Vectorization

When adding multiple entities via GigaMap.addAll(), the vectorizer’s vectorizeAll() method is called instead of vectorize() for each entity individually. The default implementation delegates to vectorize() one by one, but you can override it to use a more efficient batch API — for example, sending all texts to an embedding service in a single request.

This is especially useful with external embedding APIs like LangChain4j, where a single batch call is significantly faster than many individual calls due to reduced network overhead and server-side batching.

public class ProductVectorizer extends Vectorizer<Product>
{
    private final EmbeddingModel embeddingModel;

    public ProductVectorizer(EmbeddingModel embeddingModel)
    {
        this.embeddingModel = embeddingModel;
    }

    @Override
    public float[] vectorize(Product product)
    {
        return embeddingModel.embed(product.name()).content().vector();
    }

    @Override
    public List<float[]> vectorizeAll(List<? extends Product> products)
    {
        List<TextSegment> segments = products.stream()
            .map(p -> TextSegment.from(p.name()))
            .toList();

        return embeddingModel.embedAll(segments).content().stream()
            .map(Embedding::vector)
            .toList();
    }
}

Using addAll() with a batch vectorizer:

EmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
    .baseUrl("http://localhost:11434")
    .modelName("all-minilm")
    .build();

GigaMap<Product> gigaMap = GigaMap.New();
VectorIndices<Product> vectorIndices = gigaMap.index().register(VectorIndices.Category());

VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(384)  // all-minilm dimension
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .build();

vectorIndices.add("embeddings", config, new ProductVectorizer(embeddingModel));

// Batch add - calls vectorizeAll() once instead of vectorize() 100 times
gigaMap.addAll(products);
Batch vectorization works with any EmbeddingModel implementation from LangChain4j, including OpenAI, Cohere, Ollama, and in-process models. Just swap the model builder.

Benchmarking

The library includes benchmark tests following ANN-Benchmarks methodology.

# Run benchmark tests (disabled by default)
mvn test -Dtest=VectorIndexBenchmarkTest \
    -Djunit.jupiter.conditions.deactivate=org.junit.*DisabledCondition

Benchmark Results (10K vectors, 128 dimensions)

Metric                       Result
Recall@10 (clustered data)   94.3%
Recall@50 (clustered data)   100%
QPS (queries/second)         ~10,000+
Average latency              < 0.1 ms
p99 latency                  < 0.2 ms