Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add I/O tailer for index meta check #240

Closed
wants to merge 1 commit into from

Conversation

cqy123456
Copy link
Collaborator

No description provided.

@sre-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cqy123456

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

mergify bot commented Nov 30, 2023

@cqy123456 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!.

@alexanderguzhva
Copy link
Collaborator

this changes the index file format, right?

@mergify mergify bot added the ci-passed label Nov 30, 2023
src/io/tailer.h Outdated
uint64_t bin_size;
IndexVersion version;
uint32_t checksum;
char index_name[MAX_INDEX_NAME_SIZE + 1];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reserve some buffer for future extension, or else adding new meta could be difficult

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tailer has 512bytes,it can add more meta info within 512bytes. Maybe we use larger size ?

@@ -100,6 +102,9 @@ struct MemoryIOWriter : public faiss::IOWriter {
tellg() const {
return rp_;
}

void
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add same interface for FileIOWriter, to ensure the binary info is independent to IOWriter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove addtailer and valid check in IOReader/Writer, and add AddTailerForMultiFiles/AddTailerForMemoryIO functions.

@@ -846,6 +847,7 @@ IvfIndexNode<faiss::IndexIVFFlat>::Serialize(BinarySet& binset) const {
faiss::write_index(index_.get(), &writer);
LOG_KNOWHERE_INFO_ << "write IVF_FLAT, file size " << writer.tellg();
}
writer.addtailer(Type(), this->version_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can move the addtailer to index.cc? Just a suggestion, maybe different to index ser/des with file.

@mergify mergify bot removed the ci-passed label Dec 4, 2023
@cqy123456 cqy123456 force-pushed the tailer branch 3 times, most recently from 45e6f84 to 2b87809 Compare December 6, 2023 12:38
@mergify mergify bot added the ci-passed label Dec 6, 2023
@@ -71,6 +70,7 @@ getSwappedBytes(char C) {

#endif

// MemoryIOwriter and MemoryIOreader it not thread safe
Copy link
Collaborator

@zhengbuqian zhengbuqian Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "are not"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

src/io/trailer.h Outdated
void
SetIndexName(std::string index_name) {
if (index_name.size() > MAX_INDEX_NAME_SIZE) {
LOG_KNOWHERE_ERROR_ << "the size of index name larger than " << MAX_INDEX_NAME_SIZE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not just silently ignore the error. we should throw exception or return false so the caller of this method is aware of the error and can handle it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/io/trailer.h Outdated
IndexVersion version;
uint32_t checksum;
char index_name[MAX_INDEX_NAME_SIZE + 1];
} meta;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider making members private.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

uint32_t
GetFilesCheckSum(std::vector<std::string> files) {
uint32_t checksum = 0;
auto buffer = std::shared_ptr<uint8_t[]>(new uint8_t[kBlockSize]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use unique ptr as it is never shared.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

src/io/trailer.h Outdated
GetIndexName() {
return std::string(meta.index_name);
}
int32_t
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use type alias IndexVersion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

reader.read(trailer_ptr.get(), KNOWHERE_TRAILER_SIZE);
reader.seekg(pre_rp);

if (!trailer_ptr->TrailerValidCheck()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if this file has no trailer but these 4 bytes happens to be the same as TRAILER_VALID_FLAG?

is it possible to use version number to decide whether a index file should include trailer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added flags at the head and tail of the trailer for checking.

}
auto trailer_ptr = std::make_unique<Trailer>();
reader.read((char*)trailer_ptr.get()->bytes, KNOWHERE_TRAILER_SIZE);
if (!trailer_ptr->TrailerValidCheck()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid duplicating those checks in CheckTrailerForFiles and CheckTrailerForMemoryIO. trailer class can have method validate(name, checksum) to do those checks.

Copy link
Collaborator Author

@cqy123456 cqy123456 Dec 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

AddTrailerForFiles(const std::vector<std::string>& files, const std::string& trailer_file, const std::string& name,
const Version& version);
bool
CheckTrailerForMemoryIO(MemoryIOReader& reader, const std::string& name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no usage of CheckTrailer in deserialize methods of memory indexes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next pr. This pr verifies old knowhere code also can deserialize with the new binaryset.

AddTrailerForFiles(const std::vector<std::string>& files, const std::string& trailer_file, const std::string& name,
const Version& version) {
auto trailer_ptr = std::make_unique<Trailer>();
trailer_ptr->SetCheckSum(GetFilesCheckSum(files));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trailer for files is missing binary size. I prefer to make them consistent:

  • either remove binary size from single binary index and rely solely on checksum for correctness, which should be sufficient?
  • or add binary size sum of all files for multiple files index

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

binary size removed.

src/io/trailer.h Outdated
#define TRAILER_OFFSET(size) size - KNOWHERE_TRAILER_SIZE

namespace knowhere {
struct Meta {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name Meta is too big/wide to put in namespace knowhere. I suggest to hide struct Meta inside the Trailer union, as the outside world doesn't need to know its existence. Also add placeholder for the end flag as well. maybe:

union Trailer {
 public:
    ...
private:
  uint8_t bytes[KNOWHERE_TRAILER_SIZE];
  struct Meta {
    uint32_t beg_flag_placeholder;
    struct Data {
      uint32_t placeholder;
      IndexVersion version;
      uint32_t checksum;
      char index_name[MAX_INDEX_NAME_SIZE + 1];
    } data;
    char unused[512 - 2 * sizeof(uint32_t) - sizeof(Data)];
    uint32_t end_flag_placeholder;
  } meta;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong, my proposed solution could result in struct member alignment error if we add 8 bytes members

src/io/trailer.h Outdated
bool
TrailerValidCheck() {
FLAG_TYPE* buf = (FLAG_TYPE*)bytes;
return buf[BEG_FLAG_OFFEST] == FLAG_MAGIC_NUMBER && buf[END_FLAG_OFFEST] == FLAG_MAGIC_NUMBER;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can check on meta members directly

meta.placeholder == FLAG_MAGIC_NUMBER


reader.seekg(0, std::ios::beg);
if (fsize != KNOWHERE_TRAILER_SIZE) {
LOG_KNOWHERE_ERROR_ << "Trailer size (" << fsize << ")not correct.";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should "trailer file exists but has incorrect size" indicate error instead of success?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chatted offline, we should consider add a constant max_version_without_trailer to the knowhere versioning system, all indexes with a greater version should include trailer, or else error.

}

knowhere::Status
TrailerCheck(const knowhere::TrailerPtr& trailer_ptr, const std::string& name, const uint32_t checksum) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider making this a method of the Trailer class

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants