-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add I/O tailer for index meta check #240
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cqy123456 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@cqy123456 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!. |
this changes the index file format, right? |
src/io/tailer.h
Outdated
uint64_t bin_size; | ||
IndexVersion version; | ||
uint32_t checksum; | ||
char index_name[MAX_INDEX_NAME_SIZE + 1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reserve some buffer for future extension, or else adding new meta could be difficult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tailer has 512bytes,it can add more meta info within 512bytes. Maybe we use larger size ?
src/io/memory_io.h
Outdated
@@ -100,6 +102,9 @@ struct MemoryIOWriter : public faiss::IOWriter { | |||
tellg() const { | |||
return rp_; | |||
} | |||
|
|||
void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add same interface for FileIOWriter, to ensure the binary info is independent to IOWriter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove addtailer and valid check in IOReader/Writer, and add AddTailerForMultiFiles/AddTailerForMemoryIO functions.
src/index/ivf/ivf.cc
Outdated
@@ -846,6 +847,7 @@ IvfIndexNode<faiss::IndexIVFFlat>::Serialize(BinarySet& binset) const { | |||
faiss::write_index(index_.get(), &writer); | |||
LOG_KNOWHERE_INFO_ << "write IVF_FLAT, file size " << writer.tellg(); | |||
} | |||
writer.addtailer(Type(), this->version_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can move the addtailer
to index.cc? Just a suggestion, maybe different to index ser/des with file.
45e6f84
to
2b87809
Compare
src/io/memory_io.h
Outdated
@@ -71,6 +70,7 @@ getSwappedBytes(char C) { | |||
|
|||
#endif | |||
|
|||
// MemoryIOwriter and MemoryIOreader it not thread safe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "are not"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
src/io/trailer.h
Outdated
void | ||
SetIndexName(std::string index_name) { | ||
if (index_name.size() > MAX_INDEX_NAME_SIZE) { | ||
LOG_KNOWHERE_ERROR_ << "the size of index name larger than " << MAX_INDEX_NAME_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not just silently ignore the error. we should throw exception or return false so the caller of this method is aware of the error and can handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/io/trailer.h
Outdated
IndexVersion version; | ||
uint32_t checksum; | ||
char index_name[MAX_INDEX_NAME_SIZE + 1]; | ||
} meta; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider making members private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
src/io/trailer.cc
Outdated
uint32_t | ||
GetFilesCheckSum(std::vector<std::string> files) { | ||
uint32_t checksum = 0; | ||
auto buffer = std::shared_ptr<uint8_t[]>(new uint8_t[kBlockSize]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use unique ptr as it is never shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/io/trailer.h
Outdated
GetIndexName() { | ||
return std::string(meta.index_name); | ||
} | ||
int32_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use type alias IndexVersion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
src/io/trailer.cc
Outdated
reader.read(trailer_ptr.get(), KNOWHERE_TRAILER_SIZE); | ||
reader.seekg(pre_rp); | ||
|
||
if (!trailer_ptr->TrailerValidCheck()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if this file has no trailer but these 4 bytes happens to be the same as TRAILER_VALID_FLAG?
is it possible to use version number to decide whether a index file should include trailer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added flags at the head and tail of the trailer for checking.
src/io/trailer.cc
Outdated
} | ||
auto trailer_ptr = std::make_unique<Trailer>(); | ||
reader.read((char*)trailer_ptr.get()->bytes, KNOWHERE_TRAILER_SIZE); | ||
if (!trailer_ptr->TrailerValidCheck()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid duplicating those checks in CheckTrailerForFiles and CheckTrailerForMemoryIO. trailer class can have method validate(name, checksum)
to do those checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
AddTrailerForFiles(const std::vector<std::string>& files, const std::string& trailer_file, const std::string& name, | ||
const Version& version); | ||
bool | ||
CheckTrailerForMemoryIO(MemoryIOReader& reader, const std::string& name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see no usage of CheckTrailer in deserialize methods of memory indexes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
next pr. This pr verifies old knowhere code also can deserialize with the new binaryset.
src/io/trailer.cc
Outdated
AddTrailerForFiles(const std::vector<std::string>& files, const std::string& trailer_file, const std::string& name, | ||
const Version& version) { | ||
auto trailer_ptr = std::make_unique<Trailer>(); | ||
trailer_ptr->SetCheckSum(GetFilesCheckSum(files)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trailer for files is missing binary size. I prefer to make them consistent:
- either remove binary size from single binary index and rely solely on checksum for correctness, which should be sufficient?
- or add binary size sum of all files for multiple files index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binary size removed.
src/io/trailer.h
Outdated
#define TRAILER_OFFSET(size) size - KNOWHERE_TRAILER_SIZE | ||
|
||
namespace knowhere { | ||
struct Meta { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name Meta
is too big/wide to put in namespace knowhere
. I suggest to hide struct Meta inside the Trailer union, as the outside world doesn't need to know its existence. Also add placeholder for the end flag as well. maybe:
union Trailer {
public:
...
private:
uint8_t bytes[KNOWHERE_TRAILER_SIZE];
struct Meta {
uint32_t beg_flag_placeholder;
struct Data {
uint32_t placeholder;
IndexVersion version;
uint32_t checksum;
char index_name[MAX_INDEX_NAME_SIZE + 1];
} data;
char unused[512 - 2 * sizeof(uint32_t) - sizeof(Data)];
uint32_t end_flag_placeholder;
} meta;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wrong, my proposed solution could result in struct member alignment error if we add 8 bytes members
src/io/trailer.h
Outdated
bool | ||
TrailerValidCheck() { | ||
FLAG_TYPE* buf = (FLAG_TYPE*)bytes; | ||
return buf[BEG_FLAG_OFFEST] == FLAG_MAGIC_NUMBER && buf[END_FLAG_OFFEST] == FLAG_MAGIC_NUMBER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can check on meta members directly
meta.placeholder == FLAG_MAGIC_NUMBER
|
||
reader.seekg(0, std::ios::beg); | ||
if (fsize != KNOWHERE_TRAILER_SIZE) { | ||
LOG_KNOWHERE_ERROR_ << "Trailer size (" << fsize << ")not correct."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should "trailer file exists but has incorrect size" indicate error instead of success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chatted offline, we should consider add a constant max_version_without_trailer
to the knowhere versioning system, all indexes with a greater version should include trailer, or else error.
} | ||
|
||
knowhere::Status | ||
TrailerCheck(const knowhere::TrailerPtr& trailer_ptr, const std::string& name, const uint32_t checksum) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider making this a method of the Trailer class
Signed-off-by: cqy123456 <[email protected]>
No description provided.