.Net M.E.VectorData: LINQ-based metadata filtering #10156

Open
roji opened this issue Jan 10, 2025 · 2 comments
Labels
msft.ext.vectordata Related to Microsoft.Extensions.VectorData .NET Issue or Pull requests regarding .NET code sk team issue A tag to denote issues that where created by the Semantic Kernel team (i.e., not the community)

Comments

@roji
Member

roji commented Jan 10, 2025

M.E.VectorData currently has a rudimentary metadata filtering mechanism: the VectorSearchOptions passed to the vector search method can contain a VectorSearchFilter, which can contain a number of EqualTo or AnyTagEqualTo clauses in an AND relationship only. Vector database filtering syntax typically goes beyond this, both for logical operators (OR, NOT...) and other operators (e.g. greater than, less than...).

Rather than continuing to develop our own expression tree and adding node types to address the richness of all vector databases, we could leverage the existing LINQ expression tree nodes in .NET. Aside from removing the problem of expression trees from the scope of MEVD, this would greatly improve the API usability, as users would be able to use C# to express their filter:

// Current:
var searchResult = await collection.VectorizedSearchAsync(
    searchVector,
    new()
    {
        Filter = new VectorSearchFilter().EqualTo(nameof(Glossary.Category), "AI")
    }
);

// Proposed:
var searchResult = await collection.VectorizedSearchAsync(
    searchVector,
    new()
    {
        Filter = g => g.Category == "AI"
    }
);
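
To illustrate the richer operators mentioned above (OR, NOT, greater than), here is a minimal, self-contained sketch of what such a filter looks like as a standard LINQ expression tree. The `Glossary` properties `Year` and `Deprecated` are assumptions made up for this example, and the sketch does not call any M.E.VectorData API; it only shows that the compiler produces a tree a provider could inspect.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical record type; only Category appears in the issue, the rest is assumed.
public sealed class Glossary
{
    public string Category { get; set; } = "";
    public int Year { get; set; }
    public bool Deprecated { get; set; }
}

public static class Program
{
    // Assigning a lambda to Expression<Func<...>> makes the compiler build an
    // expression tree instead of a delegate.
    public static readonly Expression<Func<Glossary, bool>> Filter =
        g => (g.Category == "AI" || g.Category == "ML") && g.Year > 2020 && !g.Deprecated;

    public static void Main()
    {
        // The root node is the last && in the lambda.
        Console.WriteLine(Filter.Body.NodeType); // AndAlso

        // Printing the tree shows the OrElse, GreaterThan and Not nodes
        // that a provider would have to translate.
        Console.WriteLine(Filter);
    }
}
```

A provider that cannot translate one of these node types would have to fail at runtime, which is the trade-off discussed in the notes.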

Notes:

  • The main downside here from the user perspective is the limited actual filtering support in vector databases.
    • The above proposal would allow expressing any C# code within the filter, but actually supported expressions will only be a small subset of all expressible things. Thus, beginners will likely try to write some complex condition, only to get a runtime exception saying that the filter isn't supported.
    • In contrast, with a custom expression tree, we control which nodes are available, and the user simply cannot express anything beyond what we support. However, as we'd need to cover all vector databases, nodes needed for some databases wouldn't be supported by others, again leading to a runtime failure. So the general problem can't be avoided here.
    • Overall, I don't believe the above will be a big problem - users will likely quickly get used to what's actually supported by their database (with proper documentation), and at that point this becomes a non-issue.
    • Compiler-generated LINQ expression trees contain some kinks, and normalization can be beneficial (e.g. users can use either the equality operator or the .NET Equals method - the latter can be normalized to the former). We may want to have some support component in the abstraction to preprocess the expression tree before handing it off to the provider. This could be a bit problematic as the abstraction currently consists of interfaces rather than base classes.
  • Since the filter lambda needs to be generically typed based on the metadata record type of the collection, VectorSearchOptions would have to become generic over TRecord.
  • Another advantage of using LINQ is that queries are expressed over the user's data model (e.g. POCOs) rather than over the storage model; this is how users interact with MEVD in all other APIs (e.g. when inserting, or when accessing metadata returned from search). But this also creates mapping difficulties (see next point).
  • This needs to be kept in mind in relation to the layering of the ORM mapping feature (i.e. the ability to use arbitrary user POCOs) - we may end up in a place where it's not possible to pass the strongly-typed expression tree directly to the provider.
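
As a rough illustration of two of the notes above (normalizing `Equals` calls to the equality operator, and failing at runtime on unsupported nodes), here is a minimal sketch. `EqualsNormalizer`, `FilterTranslator` and the output string format are all hypothetical names invented for this example, not part of M.E.VectorData; only `ExpressionVisitor` and the `System.Linq.Expressions` node types are real .NET APIs.

```csharp
using System;
using System.Linq.Expressions;

public sealed class Glossary
{
    public string Category { get; set; } = "";
}

// Hypothetical preprocessing step: rewrites x.Equals(y) into x == y
// when both sides have the same static type.
public sealed class EqualsNormalizer : ExpressionVisitor
{
    protected override Expression VisitMethodCall(MethodCallExpression node)
    {
        if (node.Object is not null
            && node.Method.Name == nameof(object.Equals)
            && node.Arguments.Count == 1
            && node.Object.Type == node.Arguments[0].Type)
        {
            return Expression.Equal(Visit(node.Object), Visit(node.Arguments[0]));
        }

        return base.VisitMethodCall(node);
    }
}

// Hypothetical toy provider: supports only ==, && and string constants,
// and emits a made-up filter syntax.
public static class FilterTranslator
{
    public static string Translate<TRecord>(Expression<Func<TRecord, bool>> filter)
        => TranslateNode(new EqualsNormalizer().Visit(filter.Body));

    private static string TranslateNode(Expression node) => node switch
    {
        BinaryExpression { NodeType: ExpressionType.Equal } b
            => $"{TranslateNode(b.Left)} = {TranslateNode(b.Right)}",
        BinaryExpression { NodeType: ExpressionType.AndAlso } b
            => $"({TranslateNode(b.Left)} AND {TranslateNode(b.Right)})",
        MemberExpression { Expression: ParameterExpression } m => m.Member.Name,
        ConstantExpression { Value: string s } => $"'{s}'",
        // Anything else compiles fine in C# but cannot be translated.
        _ => throw new NotSupportedException($"Filter node '{node.NodeType}' is not supported.")
    };
}

public static class Program
{
    public static void Main()
    {
        // The Equals call is normalized and translates like the == operator.
        Console.WriteLine(FilterTranslator.Translate<Glossary>(g => g.Category.Equals("AI")));
        // prints: Category = 'AI'

        try
        {
            FilterTranslator.Translate<Glossary>(g => g.Category.StartsWith("A"));
        }
        catch (NotSupportedException e)
        {
            Console.WriteLine(e.Message); // prints: Filter node 'Call' is not supported.
        }
    }
}
```

Note that this sketch does not handle captured variables (which appear as member accesses on a closure object) or value escaping; a real preprocessing component would need both.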
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code triage labels Jan 10, 2025
@github-actions github-actions bot changed the title .NET M.E.VectorData: LINQ-based metadata filtering .Net M.E.VectorData: LINQ-based metadata filtering Jan 10, 2025
@westey-m
Contributor

One other important aspect to consider is the generic data model scenario, and any similar scenarios the user may model, where the data model isn't a POCO with named fields.
See https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/generic-data-model

This scenario is used by developers who are building experiences that require e.g. defining schemas via configuration or by a user via UI, i.e. pro-user or low-code/no-code type experiences.

The record definition provided matches the underlying database schema, but the data model can look quite different. Expressing the query in the context of the data model may not be intuitive, and even if it were, translating it to the underlying schema may not be possible without the developer providing a custom mapping, similar to the mapping between the data model and the database schema.
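
For concreteness, here is one way a filter over a dictionary-backed record might look, and how the property name could be recovered from the tree. This is only a sketch: `GenericRecord` is a hypothetical stand-in written for this example, not the actual generic data model type, and `GenericFilterReader` is likewise invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Hypothetical stand-in for a dictionary-backed ("generic") record.
public sealed class GenericRecord
{
    public Dictionary<string, object?> Data { get; } = new();
}

public static class GenericFilterReader
{
    // Reads (property name, value) from a filter shaped like:
    //   r => (T)r.Data["name"] == constant
    public static (string Name, object? Value) ReadEquality(
        Expression<Func<GenericRecord, bool>> filter)
    {
        if (filter.Body is BinaryExpression { NodeType: ExpressionType.Equal } equal
            // The C# compiler represents the dictionary indexer as a call to get_Item.
            && Unwrap(equal.Left) is MethodCallExpression { Method: { Name: "get_Item" } } indexer
            && indexer.Arguments.Count == 1
            && indexer.Arguments[0] is ConstantExpression { Value: string name }
            && equal.Right is ConstantExpression constant)
        {
            return (name, constant.Value);
        }

        throw new NotSupportedException("Only r => (T)r.Data[\"name\"] == constant is handled here.");
    }

    // Strips the cast the user has to write because the dictionary values are objects.
    private static Expression Unwrap(Expression e) =>
        e is UnaryExpression { NodeType: ExpressionType.Convert } u ? u.Operand : e;
}

public static class Program
{
    public static void Main()
    {
        var (name, value) = GenericFilterReader.ReadEquality(
            r => (string?)r.Data["Category"] == "AI");

        Console.WriteLine($"{name} = {value}"); // prints: Category = AI
    }
}
```

The required cast and string key show why this form may be less intuitive than the POCO case, and the sketch does not verify that the indexer is applied to `Data` specifically.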

@roji
Member Author

roji commented Jan 10, 2025

Yeah, absolutely - that relates to my last comment above. It's worth mentioning that expressing the filter over the user POCO (or "generic data model") is another advantage of this proposal, as opposed to the string-based storage property names that need to be specified in the current implementation (I added a point before the last to mention this).

If we go in this direction, we'll most likely need some sort of logic outside of the provider (so in the "abstraction") which does various processing on the expression tree before it gets handed off to the provider. Aside from various tree normalizations, this would also probably do the translation from the user POCO to storage model representation that the provider requires.
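
A minimal sketch of the POCO-to-storage translation step described above, assuming the mapping is available as a simple dictionary from property name to storage name. `StorageFilterMapper`, the dictionary shape and the storage name `category_name` are assumptions made up for this example; the real mapping layer may look quite different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

public sealed class Glossary
{
    public string Category { get; set; } = "";
}

// Hypothetical component living above the provider: it resolves POCO property
// names to storage names before the provider sees the filter.
public static class StorageFilterMapper
{
    public static (string StorageName, object? Value) TranslateEquality<TRecord>(
        Expression<Func<TRecord, bool>> filter,
        IReadOnlyDictionary<string, string> storageNames)
    {
        // Handles only the shape: record => record.Property == constant
        if (filter.Body is BinaryExpression
            {
                NodeType: ExpressionType.Equal,
                Left: MemberExpression { Expression: ParameterExpression } member,
                Right: ConstantExpression constant
            }
            && storageNames.TryGetValue(member.Member.Name, out var storageName))
        {
            return (storageName, constant.Value);
        }

        throw new NotSupportedException("Filter shape or property is not supported in this sketch.");
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed mapping from POCO property name to storage property name.
        var storageNames = new Dictionary<string, string> { ["Category"] = "category_name" };

        var (storageName, value) = StorageFilterMapper.TranslateEquality<Glossary>(
            g => g.Category == "AI", storageNames);

        Console.WriteLine($"{storageName} = {value}"); // prints: category_name = AI
    }
}
```

With this layering the provider only receives storage names and values, so it never has to understand the user's POCO.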

I think this would lead us in the direction of a bit more logic inside the abstraction (or rather, above the lower-level abstraction implemented by the providers), i.e. where the ORM bits are layered on top and provide this stuff...

@markwallace-microsoft markwallace-microsoft added sk team issue A tag to denote issues that where created by the Semantic Kernel team (i.e., not the community) msft.ext.vectordata Related to Microsoft.Extensions.VectorData and removed triage labels Jan 13, 2025
3 participants