-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating how we create BloomFilter from rdb loads and upgrading bloomfilter dependency to version 3.0 #23
Conversation
188835b
to
a33e0e3
Compare
…w has capacity of filter we are loading from Signed-off-by: zackcam <[email protected]>
… as well Signed-off-by: zackcam <[email protected]>
Breaking changes included:
|
Cargo.toml
Outdated
@@ -15,7 +15,7 @@ homepage = "https://github.com/valkey-io/valkey-bloom" | |||
valkey-module = "0.1.2" | |||
valkey-module-macros = "0" | |||
linkme = "0" | |||
bloomfilter = { version = "1.0.13", features = ["serde"] } | |||
bloomfilter = { version = "3", features = ["serde"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change the version format 3
-> 3.0.0
@@ -104,7 +105,7 @@ impl BloomFilterType { | |||
|
|||
/// Create a new BloomFilterType object from an existing one. | |||
pub fn create_copy_from(from_bf: &BloomFilterType) -> BloomFilterType { | |||
let mut filters = Vec::new(); | |||
let mut filters: Vec<BloomFilter> = Vec::with_capacity(from_bf.filters.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we worry about over-allocate? ( when filters.len() = 0
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filters.len() will never / should never be 0, so this seems OK
@@ -75,6 +74,8 @@ impl ValkeyDataType for BloomFilterType { | |||
let Ok(fp_rate) = raw::load_double(rdb) else { | |||
return None; | |||
}; | |||
let mut filters: Vec<BloomFilter> = Vec::with_capacity(num_filters as usize); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments on RDB Save and Load.
- Number of hash functions are included in the bitmap dump as a result of
to_bytes
. So we can remove this from the save/load logic. Right? - Do we need number of bits for any purpose? Is it needed for bloom filter restoration (external bloom)? If not, can we remove it? If it helps with any purpose (or if it makes size validation impossible), feel free to retain it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need both fields. I removed them both and instead moved the logic of checking the size until we have converted the bloom object into a type where we can get the number of bits
3da4c19
to
0484bbb
Compare
dig.add_long_long(key1 as i64); | ||
dig.add_long_long(key2 as i64); | ||
} | ||
dig.add_string_buffer(filter.bloom.as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just Confirming. Does the slice contain the seed as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is contained in the header portion of the bitmap. BitMap::set_seed(header, &seed);
it is set like this in sync
src/bloom/data_type.rs
Outdated
// Reject RDB Load if any bloom filter within a bloom object of a size greater than what is allowed. | ||
if !BloomFilter::validate_size_with_bits(number_of_bits) { | ||
let new_fp_rate = | ||
match Self::calculate_fp_rate(fp_rate, num_filters.try_into().unwrap()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not unwrap
src/bloom/data_type.rs
Outdated
return None; | ||
} | ||
}; | ||
if !BloomFilter::validate_size(capacity.try_into().unwrap(), new_fp_rate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not unwrap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can cast the u64 to a u32 using as
src/bloom/utils.rs
Outdated
@@ -330,7 +337,8 @@ impl BloomFilter { | |||
capacity as usize, | |||
fp_rate, | |||
&configs::FIXED_SEED, | |||
); | |||
) | |||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not unwrap
src/bloom/utils.rs
Outdated
sip_keys, | ||
); | ||
pub fn from_existing(bitmap: &[u8], num_items: u32, capacity: u32) -> BloomFilter { | ||
let bloom = bloomfilter::Bloom::from_slice(bitmap).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not unwrap
…ll as removing unnecesary fields saved in rdb Signed-off-by: zackcam <[email protected]>
Signed-off-by: KarthikSubbarao <[email protected]>
Signed-off-by: KarthikSubbarao <[email protected]>
Signed-off-by: KarthikSubbarao <[email protected]>
@@ -75,24 +71,28 @@ impl ValkeyDataType for BloomFilterType { | |||
let Ok(fp_rate) = raw::load_double(rdb) else { | |||
return None; | |||
}; | |||
let mut filters: Vec<BloomFilter> = Vec::with_capacity(num_filters as usize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This will be fixed in a follow up PR. We want to make the vector expansion consistent for (1) bloom object creation and scaling (2) COPY (3) RDB Load (4) BF.LOAD.
We can also add the vector capacity into the DEBUG DIGEST. This will be very useful for correctness testing
You can keep this in mind when working on the fix @zackcam
…filter dependency to version 3.0 (valkey-io#23) * Updating how we create BloomFilter from rdb loads. BloomFilter vec now has capacity of filter we are loading from Signed-off-by: zackcam <[email protected]> * Updating bloomfilter dependency to version 3, fixing breaking changes as well Signed-off-by: zackcam <[email protected]> * Updating the digest changes to follow updated version of bloom. As well as removing unnecesary fields saved in rdb Signed-off-by: zackcam <[email protected]> * Update log in src/bloom/data_type.rs Signed-off-by: KarthikSubbarao <[email protected]> * Update comment in src/bloom/utils.rs Signed-off-by: KarthikSubbarao <[email protected]> * Clippy error in src/bloom/data_type.rs Signed-off-by: KarthikSubbarao <[email protected]> --------- Signed-off-by: zackcam <[email protected]> Signed-off-by: KarthikSubbarao <[email protected]> Co-authored-by: KarthikSubbarao <[email protected]>
Before we were initializing a vector of bloomfilters with Vec::New() which could cause the size to be higher when we pushed an item into the vector during RDB Load and Copy. This is due to the raw alloc vector growth. From testing we saw that it could be four times as large. Which makes sense from this. In order to make this size what we expect we can just set the capacity to the number of filters we know we will be adding
This PR also upgrades the dependency of bloomfilter to 3.0.
Breaking changes included:
Other changes with this upgrade include removing unneeded fields from rdb load and save as they are now included in bitmap. Due to this we also changed the size validation in rdb load to now calcualte this off of fprate and capacity. This included creating a new helper method in BloomFilterType to claculate fp_rate based on number of filters