
[Bug] DeleteDocumentHandler does not delete all records from AI Search #933

Open
gabrielgirbacea opened this issue Dec 9, 2024 · 6 comments
Labels: bug, waiting for author

Comments

@gabrielgirbacea

Context / Scenario

Steps to reproduce:

  1. Set up Kernel Memory locally with Swagger
  2. Configure it to use AI Search
  3. Import a larger text file (>200 KB), enough to generate more than 50 partitions in AI Search, e.g. lotr.txt
  4. Call the "delete document" endpoint (a code sketch of this flow follows below)

Expected result: all records/partitions in AI Search for the document id are deleted.
Actual result: some records/partitions remain undeleted.
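
For reference, the same flow expressed as a minimal serverless sketch (the Azure configuration objects are placeholders, not shown here; our actual setup runs the full service with Swagger):

using Microsoft.KernelMemory;

// Minimal sketch of the repro flow; azureOpenAITextConfig,
// azureOpenAIEmbeddingConfig and azureAISearchConfig are placeholder
// configuration objects, defined elsewhere.
var memory = new KernelMemoryBuilder()
    .WithAzureOpenAITextGeneration(azureOpenAITextConfig)
    .WithAzureOpenAITextEmbeddingGeneration(azureOpenAIEmbeddingConfig)
    .WithAzureAISearchMemoryDb(azureAISearchConfig)
    .Build<MemoryServerless>();

await memory.ImportDocumentAsync("lotr.txt", documentId: "lotr");
await memory.DeleteDocumentAsync("lotr");
// Expected: zero records for "lotr" remain in the AI Search index.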

A few examples of how many records remained undeleted for different file sizes:

  • for 83 partitions, 33 remain undeleted. After the second delete, 0.
  • for 63 partitions, 13 remain undeleted. After the second delete, 0.
  • for 332 partitions, 143 remain undeleted. After the second delete is triggered, 33 remain undeleted. After the third delete, 0.

Please let me know if I can provide more details.
Thank you.

What happened?

Calling the "delete document" endpoint should delete all records from AI Search for the provided document id, regardless of how many records are stored.

Note: when importing a file for an existing document id, SaveRecordsHandler deletes all existing records, as expected, before adding the new ones.

Importance

edge case

Platform, Language, Versions

C# - .NET 8.0
Windows 11
Visual Studio Professional 2022
Kernel Memory (latest - 0.94.241201.1)

Relevant log output

@gabrielgirbacea added the bug and triage labels on Dec 9, 2024
@dluc
Collaborator

dluc commented Dec 18, 2024

@gabrielgirbacea which document store are you using? What you observe could be the result of using a volatile document store, which is the default one.

@gabrielgirbacea
Author

@dluc I am using Azure Blob Storage and, for queues, Azure Queues.

@dluc
Collaborator

dluc commented Dec 18, 2024

Do you see any errors in the logs?

Some scenarios I would check:

  • one of the services is throttling and the number of retries times out
  • the queue service is losing queues or messages, or timing out messages too soon
  • there are other apps using the same services (queues, blobs, search), writing and/or locking records
  • does the same error occur using a different vector store, e.g. Qdrant? (you can test locally running Qdrant with Docker, see under the /tools folder; a sketch follows below)
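
For the Qdrant test, something along these lines should work (a sketch, not a full setup; it assumes the default Docker port and OpenAI for embeddings):

using Microsoft.KernelMemory;

// Sketch: point Kernel Memory at a local Qdrant instance.
// Start Qdrant first, e.g.: docker run -p 6333:6333 qdrant/qdrant
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .WithQdrantMemoryDb("http://localhost:6333")
    .Build<MemoryServerless>();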

@dluc added the waiting for author label and removed the triage label on Dec 18, 2024
@gabrielgirbacea
Author

gabrielgirbacea commented Dec 23, 2024

Hi @dluc,

Following up on this issue:

  • We reviewed the logs and found no error entries. However, we were able to consistently reproduce the problem by debugging the Kernel Memory code locally while connected to a new instance of Azure AI Search. It's worth noting that no other services were using these resources during our testing.
  • You mentioned that the queue service may lose messages, but for the delete operation only a single message should be queued (in the private-delete-document queue). Can you confirm this?
  • While I haven't tested with Qdrant, I did test with SimpleVectorDb and did not encounter the deletion problem there.

The issue appears to originate within the DeleteDocumentHandler class, specifically the InvokeAsync method.
To gain further insight, I added some logging to the InvokeAsync method while debugging Kernel Memory:

IAsyncEnumerable<MemoryRecord> records = db.GetListAsync(
    index: pipeline.Index,
    limit: -1,
    filters: [MemoryFilters.ByDocument(pipeline.DocumentId)],
    cancellationToken: cancellationToken);

// Added (note: CountAsync comes from System.Linq.Async and enumerates
// the query once; the foreach below re-executes the query against the db)
var recordsCount = await records.CountAsync(cancellationToken).ConfigureAwait(false);
// Added
_log.LogInformation("Found {RecordsCount} records to delete for '{Index}/{DocumentId}'", recordsCount, pipeline.Index, pipeline.DocumentId);

// Added
int deletedRecordsCount = 0;

await foreach (var record in records.WithCancellation(cancellationToken).ConfigureAwait(false))
{
    await db.DeleteAsync(index: pipeline.Index, record, cancellationToken: cancellationToken).ConfigureAwait(false);

    // Added
    deletedRecordsCount++;
}

// Added
_log.LogInformation("Deleted {DeletedRecordsCount} records for '{Index}/{DocumentId}'", deletedRecordsCount, pipeline.Index, pipeline.DocumentId);

These logs confirm the discrepancy in record counts initially reported. Although GetListAsync seems to retrieve all expected entities, some are not deleted during the process.
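
As a temporary workaround on our side, repeating the list-and-delete pass until GetListAsync returns nothing should converge, given that a second or third delete eventually removed everything. A hypothetical sketch using the same db calls as above (the pass limit is an arbitrary guard, not a Kernel Memory setting):

// Hypothetical workaround: repeat the delete pass until no records are
// returned, with an arbitrary cap to avoid looping forever.
const int maxPasses = 5; // arbitrary guard
for (int pass = 1; pass <= maxPasses; pass++)
{
    var deletedInPass = 0;

    await foreach (var record in db.GetListAsync(
                       index: pipeline.Index,
                       limit: -1,
                       filters: [MemoryFilters.ByDocument(pipeline.DocumentId)],
                       cancellationToken: cancellationToken).ConfigureAwait(false))
    {
        await db.DeleteAsync(index: pipeline.Index, record, cancellationToken: cancellationToken).ConfigureAwait(false);
        deletedInPass++;
    }

    if (deletedInPass == 0) { break; } // nothing left to delete
}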

@gabrielgirbacea
Author

Hi @dluc,

Do you have any updates regarding my previous message?

@alexmg
Contributor

alexmg commented Jan 15, 2025

Hi @gabrielgirbacea.

I made a contribution (#877) that allows a Session ID to be used when performing requests to Azure AI Search. This is surfaced as the UseStickySessions property on the AzureAISearchConfig and will be used within the GetListAsync method if set to true.

With this setting enabled the AISearchOptions.SessionId is assigned a unique value for the duration of the GetListAsync invocation. That property will suggest to Azure AI Search that it should hit the same replica to get more consistent results.

From the SessionId documentation: "A value to be used to create a sticky session, which can help getting more consistent results. As long as the same sessionId is used, a best-effort attempt will be made to target the same replica set."

I used this in some custom code to migrate records from one Azure AI Search instance to another, and the final record counts matched, unlike prior attempts without the SessionId being set. I assume this setting would also be beneficial while retrieving records to delete, as it decreases the chances of a different replica being hit. Unfortunately, Azure AI Search uses an eventual consistency model, so data being durable does not imply that it is immediately searchable, especially with multiple replicas.

I would be interested to know if this setting helps with the above problem. I'm wondering if the delay introduced by the call to db.DeleteAsync during async enumeration is resulting in different replicas being hit and inconsistent results being returned. It's a simple bool configuration property, so it's worth a try. 😄
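
Enabling it is a one-liner on the config (a sketch; the endpoint and key values are placeholders):

using Microsoft.KernelMemory;

// Sketch: enable sticky sessions so GetListAsync suggests the same
// replica to Azure AI Search for the duration of each list operation.
var azureAISearchConfig = new AzureAISearchConfig
{
    Endpoint = "https://<your-search>.search.windows.net", // placeholder
    APIKey = "<api-key>",                                  // placeholder
    Auth = AzureAISearchConfig.AuthTypes.APIKey,
    UseStickySessions = true
};

var memory = new KernelMemoryBuilder()
    .WithAzureAISearchMemoryDb(azureAISearchConfig)
    .Build<MemoryServerless>();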
