JVector Index

The JVector index adds vector similarity search capabilities to entities stored in a GigaMap. Like the Bitmap and Lucene indices, it is registered with the GigaMap and automatically kept in sync as entities are added, updated, or removed.

Under the hood, the index uses JVector, a high-performance HNSW (Hierarchical Navigable Small World) graph implementation. This enables fast approximate k-nearest-neighbor (k-NN) search on vector embeddings, making it ideal for AI/ML applications like semantic search, recommendation systems, and RAG (Retrieval-Augmented Generation).

Traditional database queries find exact matches: "find all customers where city = 'Berlin'" or "find products where price < 100". Vector search is different - it finds similar items based on meaning or characteristics, even when there’s no exact match.

How it Works

  1. Embeddings: Each entity (product, customer, document) is converted into a vector - an array of numbers that represents its characteristics. These vectors are typically generated by AI/ML models that understand the semantic meaning of text, images, or behavior patterns.

  2. Similarity Search: Instead of matching exact values, vector search finds the entities whose vectors are closest to a query vector. "Closest" is determined by a similarity function (cosine, dot product, or euclidean distance).

  3. Approximate Nearest Neighbors: Exact similarity search compares the query against every stored vector, which becomes too slow for large datasets. The JVector index instead uses an HNSW graph structure to find approximate nearest neighbors in milliseconds, even with millions of vectors.
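
To make these steps concrete, here is a minimal plain-Java sketch of exact (brute-force) nearest-neighbor search with cosine similarity. It is not the JVector index or its API; the names BruteForceSearch, cosine, and topK are illustrative assumptions. It shows the full scan over all vectors that an HNSW graph is designed to avoid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: exact k-NN by scanning every vector.
// An HNSW index approximates this result without the full scan.
final class BruteForceSearch
{
    // Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means same direction.
    // Assumes non-zero vectors of equal length.
    static float cosine(float[] a, float[] b)
    {
        if (a.length != b.length)
        {
            throw new IllegalArgumentException("Dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++)
        {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }

    // Returns the positions of the k vectors most similar to the query, best first.
    static List<Integer> topK(List<float[]> vectors, float[] query, int k)
    {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++)
        {
            ids.add(i);
        }
        // Sort by descending similarity (negated so the highest score comes first).
        ids.sort(Comparator.comparingDouble((Integer i) -> -cosine(vectors.get(i), query)));
        return ids.subList(0, Math.min(k, ids.size()));
    }

    public static void main(String[] args)
    {
        List<float[]> vectors = List.of(
            new float[] {1f, 0f},     // 0: points along x
            new float[] {0f, 1f},     // 1: points along y
            new float[] {0.9f, 0.1f}  // 2: almost along x
        );
        System.out.println(topK(vectors, new float[] {1f, 0f}, 2)); // prints [0, 2]
    }
}
```

The cost of this scan grows linearly with the number of vectors, which is why an approximate graph-based index pays off for large datasets.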

Query Type          Index Type   Example
Exact match         Bitmap       category = "Electronics"
Full-text search    Lucene       description contains "wireless bluetooth"
Similarity search   JVector      "Find products similar to this one"

Use the JVector index when you need to:

  • Find similar products, customers, or content

  • Match questions to FAQ entries or support tickets

  • Search by meaning rather than keywords

  • Build recommendation systems

  • Implement semantic search with AI embeddings

Integration with GigaMap

The JVector index integrates seamlessly with GigaMap’s indexing system. When you add, update, or remove entities from the GigaMap, the vector index is automatically updated. Search results return lazy references to your entities, so you can access the full object without additional lookups.

You can combine vector search with bitmap indices for hybrid queries - for example, finding similar products within a specific category or price range.

Features

  • HNSW Vector Index: Fast approximate k-nearest-neighbor search using JVector’s HNSW graph implementation

  • Persistent Storage: Vectors are stored in GigaMap for durability and lazy loading

  • On-Disk Index: Memory-mapped graph storage for datasets larger than RAM

  • PQ Compression: Product Quantization for reduced memory footprint

  • Background Persistence: Automatic asynchronous persistence at configurable intervals

  • Background Optimization: Periodic graph cleanup for improved query performance

  • Lazy Entity Access: Search results provide direct access to entities without additional lookups

  • Stream API: Java Stream support for search results
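
To see why on-disk storage and PQ compression matter, it helps to estimate how much memory the raw vectors alone occupy. The following is a back-of-the-envelope sketch in plain Java; the class and method names are hypothetical, and the figure excludes the HNSW graph structure and any JVM or library overhead.

```java
// Illustrative estimate of raw vector storage (not part of the library).
final class VectorMemoryEstimate
{
    // Bytes needed for the float components alone: 4 bytes per component.
    static long rawVectorBytes(long vectorCount, int dimension)
    {
        return vectorCount * dimension * Float.BYTES;
    }

    public static void main(String[] args)
    {
        // One million 768-dimensional embeddings: 3,072,000,000 bytes (about 2.9 GiB).
        long bytes = rawVectorBytes(1_000_000L, 768);
        System.out.println(bytes + " bytes");
    }
}
```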

Requirements

  • Java 17+ (minimum)

  • Java 20+ (recommended for SIMD acceleration)

SIMD Acceleration (Panama Vector API)

JVector leverages the Panama Vector API (jdk.incubator.vector) for hardware-accelerated vector operations. This provides significant performance improvements for ANN indexing and search through SIMD (Single Instruction, Multiple Data) instructions.

Benefits

  • Faster distance calculations: SIMD parallelizes vector arithmetic across CPU vector registers

  • Accelerated indexing: Graph construction benefits from parallel similarity computations

  • Faster queries: Nearest neighbor searches execute more quickly

  • Optimized PQ encoding: Product Quantization compression uses SIMD for distance computations

Java Version Requirements

Java Version   SIMD Support
Java 17-19     Functional, but not optimized (scalar fallback)
Java 20+       Full SIMD acceleration via Panama Vector API

JVector uses a multi-release JAR structure:

  • Base code targets Java 11 compatibility

  • Optimized vector code in jvector-twenty activates automatically on Java 20+ JVMs

  • Earlier Java versions receive functional but non-SIMD implementations

JVM Parameters

To enable the Panama Vector API, add the incubator module to your JVM arguments:

java --add-modules jdk.incubator.vector -jar your-app.jar

See JVM Configuration for detailed setup instructions including Maven and Gradle configuration.

For optimal performance, run on Java 21 LTS or later to benefit from full SIMD acceleration. The performance difference can be substantial for large-scale vector operations.
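
To verify the JVM setup at startup, a small check like the following can help. It is a sketch using only standard JDK calls (Runtime.version() and ModuleLayer.boot()); the class name SimdCheck is an assumption. It reports only how the JVM was launched. Whether JVector actually selects its SIMD code path is decided by the library itself.

```java
// Reports whether the running JVM is configured for the Panama Vector API.
final class SimdCheck
{
    // True if the incubator module was resolved at startup,
    // i.e. the JVM was launched with --add-modules jdk.incubator.vector.
    static boolean vectorModuleEnabled()
    {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    // Feature release of the running JVM, e.g. 21 for Java 21.
    static int javaFeatureVersion()
    {
        return Runtime.version().feature();
    }

    public static void main(String[] args)
    {
        System.out.println("Java feature version: " + javaFeatureVersion());
        System.out.println("jdk.incubator.vector enabled: " + vectorModuleEnabled());
    }
}
```

Logging these two values once at application start makes it easy to spot a deployment that silently runs on the scalar fallback.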

Installation

Maven [pom.xml]
<dependency>
    <groupId>org.eclipse.store</groupId>
    <artifactId>gigamap-jvector</artifactId>
    <version>4.0.0-beta1</version>
</dependency>

Example

First, we need to implement a Vectorizer, which extracts the vector embedding from entities.

public class DocumentVectorizer extends Vectorizer<Document>
{
    @Override
    public float[] vectorize(Document entity)
    {
        return entity.embedding();
    }

    @Override
    public boolean isEmbedded()
    {
        return true; // Vector is stored in entity (no duplicate storage)
    }
}

Then we create a VectorIndex and register it with the GigaMap.

// Create GigaMap and register vector indices
GigaMap<Document> gigaMap = GigaMap.New();
VectorIndices<Document> vectorIndices = gigaMap.index().register(VectorIndices.Category());

// Configure the vector index
VectorIndexConfiguration config = VectorIndexConfiguration.builder()
    .dimension(768)
    .similarityFunction(VectorSimilarityFunction.COSINE)
    .build();

// Add the index with a name, configuration, and vectorizer
VectorIndex<Document> index = vectorIndices.add("embeddings", config, new DocumentVectorizer());

After adding entities to the GigaMap, we can search for similar vectors.

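// Add entities (automatically indexed).
// 'embedding' is a float[] whose length matches the configured dimension (768).
gigaMap.add(new Document("Hello world", embedding));

// Search for similar vectors (returns top 10 results).
// 'queryVector' is a float[] of the same dimension.
VectorSearchResult<Document> result = index.search(queryVector, 10);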

for (VectorSearchResult.Entry<Document> entry : result)
{
    Document doc = entry.entity();    // Lazy entity access
    float score = entry.score();      // Similarity score
    long id = entry.entityId();       // Entity ID
}

The search results support the Java Stream API for convenient filtering and transformation.

List<Document> topDocs = result.stream()
    .filter(e -> e.score() > 0.8f)
    .map(VectorSearchResult.Entry::entity)
    .toList();

Similarity Functions

The following similarity functions are available:

Function      Description
COSINE        Cosine similarity, normalized for direction. Best for text embeddings.
DOT_PRODUCT   Dot product similarity. Use when vectors are already normalized.
EUCLIDEAN     Euclidean distance. Best for geometric or spatial data.
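
The guidance for DOT_PRODUCT follows from the definitions: for unit-length vectors, the dot product equals the cosine similarity, so the normalization step can be skipped. The plain-Java sketch below illustrates this with textbook formulas. The class and method names are hypothetical, and it does not show how the library computes or scales its scores.

```java
// Textbook definitions of the three measures (illustrative, not the library's code).
final class NormalizationDemo
{
    static double dot(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
        {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(float[] a, float[] b)
    {
        return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    static double euclideanDistance(float[] a, float[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Scales a non-zero vector to unit length.
    static float[] normalize(float[] v)
    {
        double length = Math.sqrt(dot(v, v));
        float[] unit = new float[v.length];
        for (int i = 0; i < v.length; i++)
        {
            unit[i] = (float) (v[i] / length);
        }
        return unit;
    }

    public static void main(String[] args)
    {
        float[] a = {3f, 4f};
        float[] b = {4f, 3f};
        // Both values are approximately 0.96.
        System.out.println("cosine(a, b)            = " + cosine(a, b));
        System.out.println("dot(norm(a), norm(b))   = " + dot(normalize(a), normalize(b)));
        // Distance depends on magnitude as well as direction.
        System.out.println("euclideanDistance(a, b) = " + euclideanDistance(a, b));
    }
}
```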

Persistence with EclipseStore

Binary type handlers are registered automatically when using EclipseStore.

// 'storageDir' is the storage directory; 'config' is the VectorIndexConfiguration built above.
try (EmbeddedStorageManager storage = EmbeddedStorage.start(storageDir))
{
    GigaMap<Document> gigaMap = GigaMap.New();
    storage.setRoot(gigaMap);

    VectorIndices<Document> vectorIndices = gigaMap.index().register(VectorIndices.Category());
    VectorIndex<Document> index = vectorIndices.add("embeddings", config, new DocumentVectorizer());

    gigaMap.add(new Document("text", embedding));

    // Persist the root, including the GigaMap and its vector index
    storage.storeRoot();
}

Limitations

  • Null vectors are not accepted: The Vectorizer.vectorize() method must never return null. If it does, an IllegalStateException is thrown. Ensure that every entity added to the GigaMap can produce a valid vector.

  • ~2.1 billion vectors per index: JVector uses int for graph node ordinals, which caps an index at Integer.MAX_VALUE (2,147,483,647) nodes. For larger datasets, implement sharding across multiple indices.

  • PQ compression requires maxDegree=32: This is a constraint of the FusedPQ algorithm and is enforced automatically.
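
If a dataset is sharded across multiple indices as suggested above, each shard is searched separately and the per-shard results are merged into one global top-k list. The following is a minimal plain-Java sketch of that merge step. The Hit record and mergeTopK method are hypothetical and not part of the library; in practice each per-shard list would come from a separate index search. The sketch assumes that higher scores are better and that scores are comparable across shards (same embedding model and similarity function).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative merge of per-shard search results (not part of the library).
final class ShardedSearch
{
    // Hypothetical result holder: the shard, the entity ID within it, and its score.
    record Hit(int shard, long entityId, float score) {}

    // Combines the top results of every shard and keeps the k best overall.
    static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int k)
    {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShardHits)
        {
            all.addAll(hits);
        }
        // Highest score first.
        all.sort((x, y) -> Float.compare(y.score(), x.score()));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args)
    {
        List<Hit> shard0 = List.of(new Hit(0, 1L, 0.9f), new Hit(0, 2L, 0.5f));
        List<Hit> shard1 = List.of(new Hit(1, 1L, 0.8f), new Hit(1, 7L, 0.7f));
        // Keeps the hits scoring 0.9, 0.8, and 0.7; the 0.5 hit is dropped.
        System.out.println(mergeTopK(List.of(shard0, shard1), 3));
    }
}
```

Each shard must be asked for at least k results, since all of the global top k may live in a single shard. Entity IDs are only unique within a shard, which is why the sketch carries the shard number alongside the ID.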