-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Ensure arrow sizing is correct / 64 byte aligned #2426
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Using the new arrow version, thanks to apache/arrow-rs#5554, I am able to get arrow to written buffers with padding despite being contiguous. I have done a lot of manual verification the data is laid out as we expect on disk. By manually reading the arrow file using their lower level apis, I can actually parse out the padded size of each buffer and validate that they are indeed 64 byte aligned by using the metadata about the buffers. I have a protoype version of that validation here - https://github.com/chroma-core/chroma/pull/2426/files#diff-03bcd4f01acfa68c46fcc974d3722fa621056b4e0f908d708a2d15028b0e99b1R336 and everything looks correct. This lets me change the code in get_size of a block to not use arrows internal sizing methods get_array_memory_size or get_buffer_memory_size. These will always report the capacity of the buffers and also if the buffers are shared, the code is just wrong and will overreport as mentioned above. Instead, I look at the len() of each buffer and then manually pad it to 64 bytes. This is a loose coupling, it requires that when we write to buffers, they are padded to 64 bytes, and that when we read from arrow, the buffers are padded to 64 bytes. Otherwise the size computation assuming you can just pad the len to the nearest 64 byte boundary is wrong. The concern is then what if we change our code to stop allocating on 64 byte boundaries somehow, or arrow changes the alignment without us knowing. My proposal for this is to write tests for our block construction that will fail if we aren't on 64 byte boundaries - thats easy to ensure. This leads me to my main question - for arrow writing to files we have two options. Either I can put that validation code into the load block code path, and error if I read a file thats not aligned, OR I can just write tests that ensure that when we read a block, its correct. Which do people prefer? |
for column in self.data.columns() { | ||
let column_size = column.get_buffer_memory_size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this breaks if the underlying buffers are shared, which is true for if we load from Arrow IPC, so we see a size that is way larger than what it actually is.
@@ -55,6 +55,7 @@ opentelemetry = { version = "0.19.0", default-features = false, features = [ | |||
opentelemetry-otlp = "0.12.0" | |||
shuttle = "0.7.1" | |||
regex = "1.10.5" | |||
flatbuffers = "24.3.25" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to add this for my manual reading of the footer
/// Returns the size of the block in bytes | ||
pub(crate) fn get_size(&self) -> usize { | ||
let mut total_size = 0; | ||
for column in self.data.columns() { | ||
total_size += column.get_buffer_memory_size(); | ||
let array_data = column.to_data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the critical change
let mut writer = std::io::BufWriter::new(file); | ||
let writer = arrow::ipc::writer::FileWriter::try_new(&mut writer, &self.data.schema()); | ||
let options = | ||
match arrow::ipc::writer::IpcWriteOptions::try_new(64, false, MetadataVersion::V5) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where we enable 64 byte alignment manually (it is the default too)
} | ||
|
||
pub fn from_bytes(bytes: &[u8], id: Uuid) -> Result<Self, Box<dyn ChromaError>> { | ||
pub fn from_bytes_with_validation(bytes: &[u8], id: Uuid) -> Result<Self, BlockLoadError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the with_validation methods for both from_bytes and load are meant to be in testing only
Description of changes
Summarize the changes made by this PR.
get_array_memory_size
returns incorrect size if underlying buffers are shared apache/arrow-rs#5969).Test plan
How are these changes tested?
Updated all tests in delta.rs to save, load and check the sizes match
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
None