Vector store abstractions hybrid search ADR #10196

236 changes: 236 additions & 0 deletions docs/decisions/00NN-hybrid-search.md
@@ -0,0 +1,236 @@
---
# These are optional elements. Feel free to remove any of them.
status: {proposed | rejected | accepted | deprecated | … | superseded by [ADR-0001](0001-madr-architecture-decisions.md)}
contact: westey-m
date: 2024-11-27
deciders: {list everyone involved in the decision}
consulted: {list everyone whose opinions are sought (typically subject-matter experts); and with whom there is a two-way communication}
informed: {list everyone who is kept up-to-date on progress; and with whom there is a one-way communication}
---

# Support Hybrid Search in VectorStore abstractions

## Context and Problem Statement

In addition to simple vector search, many databases also support hybrid search.
Hybrid search typically produces higher quality results than vector search alone, so the ability to do hybrid search via the VectorStore abstractions
is an important feature to add.

The way in which hybrid search is supported varies by database. The two most common ways of supporting hybrid search are:

1. Using dense vector search and keyword/fulltext search in parallel, and then combining the results.
1. Using dense vector search and sparse vector search in parallel, and then combining the results.
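
To make "combining the results" concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF), the fusion method that most of the databases compared in this document offer. The `RrfSketch` class and `Fuse` method are hypothetical names used for illustration only, and the constant `k = 60` is a commonly used default rather than something this ADR prescribes. Each database has its own implementation, which may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RrfSketch
{
    // Fuses several ranked lists of record keys into a single ranking.
    // Each list contributes 1 / (k + rank) per key, with rank starting at 1.
    // k = 60 is an assumption (a commonly used default), not part of this ADR.
    public static List<(string Key, double Score)> Fuse(
        IEnumerable<IReadOnlyList<string>> rankedLists, int k = 60)
    {
        var scores = new Dictionary<string, double>();
        foreach (var list in rankedLists)
        {
            for (int i = 0; i < list.Count; i++)
            {
                // Missing keys start at 0 because TryGetValue leaves the default.
                scores.TryGetValue(list[i], out var current);
                scores[list[i]] = current + 1.0 / (k + i + 1);
            }
        }

        // Highest fused score first; ties are broken by key for determinism.
        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal)
            .Select(p => (Key: p.Key, Score: p.Value))
            .ToList();
    }
}
```

A record that ranks reasonably well in both the dense vector list and the keyword (or sparse vector) list tends to end up ahead of a record that ranks first in only one of them.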

Sparse vectors are different from dense vectors in that they typically have many more dimensions, but with many of the dimensions being zero.
Sparse vectors, when used with text search, have a dimension for each word/token in a vocabulary, with the value indicating the importance of the word
in the source text.
The more common the word in a specific chunk of text, and the less common the word is in the corpus, the higher the value in the sparse vector.
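
The weighting described above can be sketched in a few lines. This is only an illustration: the `SparseVectorSketch` class, the whitespace tokenizer, and the smoothed IDF formula are all assumptions, and the output shape (parallel index and value arrays) is just one possible representation. It is not a proposal for how sparse vectors should be generated in .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SparseVectorSketch
{
    // Naive tokenizer (assumption): lowercase and split on spaces.
    public static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

    // Assigns each distinct token in the corpus its own dimension index.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> corpus)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var token in corpus.SelectMany(Tokenize))
        {
            vocabulary.TryAdd(token, vocabulary.Count);
        }
        return vocabulary;
    }

    // Returns only the non-zero dimensions of the TF-IDF vector for one chunk.
    public static (int[] Indices, float[] Values) ToSparseVector(
        string chunk,
        IReadOnlyList<string> corpus,
        IReadOnlyDictionary<string, int> vocabulary)
    {
        var tokens = Tokenize(chunk);
        var entries = new SortedDictionary<int, float>();
        foreach (var termGroup in tokens.GroupBy(t => t))
        {
            if (!vocabulary.TryGetValue(termGroup.Key, out var index))
            {
                continue; // Token is not in the vocabulary, so it has no dimension.
            }

            // Term frequency: share of the chunk's tokens that are this token.
            double tf = (double)termGroup.Count() / tokens.Length;

            // Inverse document frequency (smoothed): rarer tokens score higher.
            int docsWithToken = corpus.Count(doc => Tokenize(doc).Contains(termGroup.Key));
            double idf = Math.Log((1.0 + corpus.Count) / (1.0 + docsWithToken)) + 1.0;

            entries[index] = (float)(tf * idf);
        }
        return (entries.Keys.ToArray(), entries.Values.ToArray());
    }
}
```

For a two-document corpus such as "the cat sat" and "the dog ran", the word "cat" receives a higher value than "the" in the first document's vector, because "the" appears in every document.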

There are various mechanisms for generating sparse vectors, such as:

- [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [SPLADE](https://www.pinecone.io/learn/splade/)
- [BGE-m3 sparse embedding model](https://huggingface.co/BAAI/bge-m3)
- [pinecone-sparse-english-v0](https://docs.pinecone.io/models/pinecone-sparse-english-v0)

While these are well supported in Python, they are not well supported in .NET today.
Adding support for generating sparse vectors is out of scope of this ADR.

More background information:

- [Background article from Qdrant about using sparse vectors for Hybrid Search](https://qdrant.tech/articles/sparse-vectors)
- [TF-IDF explainer for beginners](https://medium.com/@coldstart_coder/understanding-and-implementing-tf-idf-in-python-a325d1301484)

ML.NET contains an implementation of TF-IDF that could be used to generate sparse vectors in .NET. See [here](https://github.com/dotnet/machinelearning/blob/886e2ff125c0060f5a251056c7eb2a7d28738984/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/ProduceWordBags.cs#L55-L105) for an example.

### Hybrid search support in different databases

|Feature|Azure AI Search|Weaviate|Redis|Chroma|Pinecone|PostgreSQL|Qdrant|Milvus|Elasticsearch|Cosmos DB NoSQL|MongoDB|
|-|-|-|-|-|-|-|-|-|-|-|-|
|Hybrid search supported|Y|Y|N (No parallel execution with fusion)|N|Y||Y|Y|Y|Y|Y|
|Hybrid search definition|Vector + FullText|[Vector + Keyword (BM25F)](https://weaviate.io/developers/weaviate/search/hybrid)|||[Vector + Sparse Vector for keywords](https://docs.pinecone.io/guides/get-started/key-features#hybrid-search)||[Vector + SparseVector / Keyword](https://qdrant.tech/documentation/concepts/hybrid-queries/)|[Vector + SparseVector](https://milvus.io/docs/multi-vector-search.md)|Vector + FullText|[Vector + Fulltext (BM25)](https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/hybrid-search)|[Vector + FullText](https://www.mongodb.com/docs/atlas/atlas-search/tutorial/hybrid-search)|
|Fusion method configurable|N|Y|||?||Y|Y|Y, but only one option|Y, but only one option|N|
|Fusion methods|[RRF](https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking)|Ranked/RelativeScore|||?||RRF / DBSF|[RRF / Weighted](https://milvus.io/docs/multi-vector-search.md)|[RRF](https://www.elastic.co/search-labs/tutorials/search-tutorial/vector-search/hybrid-search)|[RRF](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/rrf)|[RRF](https://www.mongodb.com/docs/atlas/atlas-search/tutorial/hybrid-search)|
|Hybrid Search Input Params|Vector + string|[Vector + string](https://weaviate.io/developers/weaviate/api/graphql/search-operators#hybrid)|||Vector + SparseVector||[Vector + SparseVector](https://qdrant.tech/documentation/concepts/hybrid-queries/)|[Vector + SparseVector](https://milvus.io/docs/multi-vector-search.md)|Vector + string|Vector + string array|Vector + string|
|Sparse Distance Function|n/a|n/a|||[dotproduct only for both dense and sparse, 1 setting for both](https://docs.pinecone.io/guides/data/understanding-hybrid-search#sparse-dense-workflow)||dotproduct|Inner Product|n/a|n/a|n/a|
|Sparse Indexing options|n/a|n/a|||no separate config to dense||ondisk / inmemory + IDF|[SPARSE_INVERTED_INDEX / SPARSE_WAND](https://milvus.io/docs/index.md?tab=sparse)|n/a|n/a|n/a|
|Sparse data model|n/a|n/a|||[indices & values arrays](https://docs.pinecone.io/guides/data/upsert-sparse-dense-vectors)||indices & values arrays|[sparse matrix / List of dict / list of tuples](https://milvus.io/docs/sparse_vector.md#Use-sparse-vectors-in-Milvus)|n/a|n/a|n/a|

Glossary:

- RRF = Reciprocal Rank Fusion
- DBSF = Distribution-Based Score Fusion
- IDF = Inverse Document Frequency
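
The "indices & values arrays" sparse data model and the dot product distance function listed in the table can be illustrated with a short sketch. The `SparseVector` record and its `Dot` method are hypothetical names, not types from any database SDK or from the proposed abstractions, and the sketch assumes that indices are sorted in ascending order.

```csharp
using System;

// Hypothetical sparse vector: only the non-zero dimensions are stored.
// Assumes Indices is sorted ascending and has the same length as Values.
public sealed record SparseVector(int[] Indices, float[] Values)
{
    // Dot product computed by walking both sorted index arrays in step.
    public static float Dot(SparseVector a, SparseVector b)
    {
        float sum = 0f;
        int i = 0, j = 0;
        while (i < a.Indices.Length && j < b.Indices.Length)
        {
            if (a.Indices[i] == b.Indices[j])
            {
                // Both vectors have a value in this dimension.
                sum += a.Values[i] * b.Values[j];
                i++;
                j++;
            }
            else if (a.Indices[i] < b.Indices[j])
            {
                i++; // Dimension only present in a: contributes zero.
            }
            else
            {
                j++; // Dimension only present in b: contributes zero.
            }
        }
        return sum;
    }
}
```

Because only shared non-zero dimensions contribute, the cost depends on the number of stored entries rather than on the size of the vocabulary.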

### Naming

|Name|Parameters|Keyword Property Selector|Dense Vector Property Selector|
|-|-|-|-|
|KeywordVectorizedHybridSearch|string + Dense Vector|TextPropertyName|DenseVectorPropertyName|
|SparseVectorizedHybridSearch|Sparse Vector + Dense Vector|SparseVectorPropertyName|DenseVectorPropertyName|
|KeywordVectorizableTextHybridSearch|string + string / string|TextPropertyName|DenseVectorPropertyName|
|SparseVectorizableTextHybridSearch|string + string / string|SparseVectorPropertyName|DenseVectorPropertyName|

### Keyword based hybrid search

```csharp
interface IKeywordVectorizedHybridSearch<TRecord>
{
    // TVector is the type of the dense vector to search with.
    Task<VectorSearchResults<TRecord>> KeywordVectorizedHybridSearch<TVector>(
        TVector vector,
        string keywords,
        KeywordVectorizedHybridSearchOptions options,
        CancellationToken cancellationToken);
}

class KeywordVectorizedHybridSearchOptions
{
    // The name of the property to target the vector search against.
    public string? DenseVectorPropertyName { get; init; }
    // The name of the property to target the text search against.
    public string? TextPropertyName { get; init; }
    // Allow fusion method to be configurable for dbs that support configuration. If null, a default is used.
    public string? FusionMethod { get; init; } = null;

    public VectorSearchFilter? Filter { get; init; }
    public int Top { get; init; } = 3;
    public int Skip { get; init; } = 0;
    public bool IncludeVectors { get; init; } = false;
    public bool IncludeTotalCount { get; init; } = false;
}
```
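
To show how a caller might use the keyword based interface, here is a self-contained sketch with an in-memory fake. Everything beyond the interface shape is an assumption made for illustration: the `Hotel` record, the `InMemoryHotelStore` class, the cut-down `VectorSearchResults<TRecord>` and options types, the choice to declare `TVector` on the method, and the use of RRF with `k = 60` to fuse a dot product ranking with a naive keyword count ranking. A real connector would delegate the ranking and fusion to the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-ins (assumptions) so that this sketch compiles on its own.
public sealed record VectorSearchResults<TRecord>(IReadOnlyList<TRecord> Results);

public sealed class KeywordVectorizedHybridSearchOptions
{
    public int Top { get; init; } = 3;
}

public interface IKeywordVectorizedHybridSearch<TRecord>
{
    Task<VectorSearchResults<TRecord>> KeywordVectorizedHybridSearch<TVector>(
        TVector vector,
        string keywords,
        KeywordVectorizedHybridSearchOptions options,
        CancellationToken cancellationToken);
}

// Hypothetical record type and in-memory store, for illustration only.
public sealed record Hotel(string Name, string Description, float[] Embedding);

public sealed class InMemoryHotelStore : IKeywordVectorizedHybridSearch<Hotel>
{
    private readonly IReadOnlyList<Hotel> _hotels;

    public InMemoryHotelStore(IReadOnlyList<Hotel> hotels) => _hotels = hotels;

    public Task<VectorSearchResults<Hotel>> KeywordVectorizedHybridSearch<TVector>(
        TVector vector,
        string keywords,
        KeywordVectorizedHybridSearchOptions options,
        CancellationToken cancellationToken)
    {
        if (vector is not float[] query)
        {
            throw new NotSupportedException("This sketch only accepts float[] vectors.");
        }
        var terms = keywords.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries);

        // Rank once by dot product and once by the number of matching keywords.
        var byVector = _hotels
            .OrderByDescending(h => h.Embedding.Zip(query, (a, b) => a * b).Sum())
            .ToList();
        var byKeyword = _hotels
            .OrderByDescending(h => terms.Count(t => h.Description.ToLowerInvariant().Contains(t)))
            .ToList();

        // Fuse the two rankings with RRF (k = 60, ranks start at 1) and keep the top results.
        var fused = _hotels
            .OrderByDescending(h => 1.0 / (61 + byVector.IndexOf(h)) + 1.0 / (61 + byKeyword.IndexOf(h)))
            .Take(options.Top)
            .ToList();

        return Task.FromResult(new VectorSearchResults<Hotel>(fused));
    }
}
```

A caller would pass an already generated embedding together with the keywords, for example `store.KeywordVectorizedHybridSearch(queryEmbedding, "pool beach", new KeywordVectorizedHybridSearchOptions { Top = 2 }, CancellationToken.None)`.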

### Sparse Vector based hybrid search

```csharp
interface ISparseVectorizedHybridSearch<TRecord>
{
    // TVector is the type used for both the dense and the sparse vector.
    Task<VectorSearchResults<TRecord>> SparseVectorizedHybridSearch<TVector>(
        TVector denseVector,
        TVector sparseVector,
        SparseVectorizedHybridSearchOptions options,
        CancellationToken cancellationToken);
}

class SparseVectorizedHybridSearchOptions
{
    // The name of the property to target the dense vector search against.
    public string? DenseVectorPropertyName { get; init; }
    // The name of the property to target the sparse vector search against.
    public string? SparseVectorPropertyName { get; init; }
    // Allow fusion method to be configurable for dbs that support configuration. If null, a default is used.
    public string? FusionMethod { get; init; } = null;

    public VectorSearchFilter? Filter { get; init; }
    public int Top { get; init; } = 3;
    public int Skip { get; init; } = 0;
    public bool IncludeVectors { get; init; } = false;
    public bool IncludeTotalCount { get; init; } = false;
}
```

### Keyword Vectorizable text based hybrid search

```csharp
interface IKeywordVectorizableTextHybridSearch<TRecord>
{
    Task<VectorSearchResults<TRecord>> KeywordVectorizableTextHybridSearch(
        string description,
        string? keywords = default,
        KeywordVectorizableTextHybridSearchOptions? options = default,
        CancellationToken cancellationToken = default);
}

class KeywordVectorizableTextHybridSearchOptions
{
    // The name of the property to target the dense vector search against.
    public string? DenseVectorPropertyName { get; init; }
    // The name of the property to target the text search against.
    public string? TextPropertyName { get; init; }
    // Allow fusion method to be configurable for dbs that support configuration. If null, a default is used.
    public string? FusionMethod { get; init; } = null;

    public VectorSearchFilter? Filter { get; init; }
    public int Top { get; init; } = 3;
    public int Skip { get; init; } = 0;
    public bool IncludeVectors { get; init; } = false;
    public bool IncludeTotalCount { get; init; } = false;
}
```

### Sparse Vector based Vectorizable text hybrid search

```csharp
interface ISparseVectorizableTextHybridSearch<TRecord>
{
    Task<VectorSearchResults<TRecord>> SparseVectorizableTextHybridSearch(
        string description,
        string? keywords = default,
        SparseVectorizableTextHybridSearchOptions? options = default,
        CancellationToken cancellationToken = default);
}

class SparseVectorizableTextHybridSearchOptions
{
    // The name of the property to target the dense vector search against.
    public string? DenseVectorPropertyName { get; init; }
    // The name of the property to target the sparse vector search against.
    public string? SparseVectorPropertyName { get; init; }
    // Allow fusion method to be configurable for dbs that support configuration. If null, a default is used.
    public string? FusionMethod { get; init; } = null;

    public VectorSearchFilter? Filter { get; init; }
    public int Top { get; init; } = 3;
    public int Skip { get; init; } = 0;
    public bool IncludeVectors { get; init; } = false;
    public bool IncludeTotalCount { get; init; } = false;
}
```

## Decision Drivers

- Support for generating sparse vectors is required to make sparse vector based hybrid search viable.
- Multiple vectors per record scenarios need to be supported.
- No database in our evaluation set has been identified as supporting generating sparse vectors in the database.

## Scoping Considered Options

- 1 Keyword Hybrid Search Only

Only implement KeywordVectorizedHybridSearch & KeywordVectorizableTextHybridSearch for now, until
we can add support for generating sparse vectors.

- 2 Keyword and SparseVectorized Hybrid Search

Implement KeywordVectorizedHybridSearch & KeywordVectorizableTextHybridSearch, but on the sparse vector side only
SparseVectorizedHybridSearch, since no database in our evaluation set supports generating sparse vectors in the database.
This will require us to produce code that can generate sparse vectors from text.

- 3 All Hybrid Search

Create all four interfaces and provide an implementation of SparseVectorizableTextHybridSearch that
generates the sparse vector in the client code.
This will require us to produce code that can generate sparse vectors from text.

## PropertyName Naming Considered Options

- 1 Explicit Dense naming

DenseVectorPropertyName
SparseVectorPropertyName

DenseVectorPropertyName
TextPropertyName

- 2 Implicit Dense naming

VectorPropertyName
SparseVectorPropertyName

VectorPropertyName
TextPropertyName

## Decision Outcome

Chosen option: "{title of option 1}", because
{justification. e.g., only option, which meets k.o. criterion decision driver | which resolves force {force} | … | comes out best (see below)}.