
Updating how we create BloomFilter from rdb loads and upgrading bloomfilter dependency to version 3.0 #23

Merged
7 commits merged into valkey-io:unstable on Dec 4, 2024

Conversation

@zackcam (Contributor) commented Nov 19, 2024

Before, we initialized the vector of bloom filters with Vec::new(), which could make the allocation larger than expected when we pushed items into the vector during RDB Load and Copy. This is due to the vector's raw-allocation growth strategy; in testing we saw the allocation end up four times as large, which is consistent with that growth behavior. To get the size we expect, we can simply set the capacity to the number of filters we know we will be adding.
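
As a standalone illustration (not code from this PR), this sketch shows how push-driven growth over-allocates compared with reserving the capacity up front:

```rust
fn main() {
    // Pushing into Vec::new() grows the buffer geometrically (amortized doubling),
    // so after the final push the capacity can sit well above len().
    let mut grown: Vec<u64> = Vec::new();
    for i in 0..5 {
        grown.push(i);
    }
    // Typically prints "len = 5, capacity = 8"; the exact growth pattern is an
    // implementation detail of the standard library.
    println!("len = {}, capacity = {}", grown.len(), grown.capacity());

    // Reserving the known number of filters up front avoids the extra headroom,
    // and Vec::with_capacity(0) performs no allocation at all.
    let sized: Vec<u64> = Vec::with_capacity(5);
    assert!(sized.capacity() >= 5);
    assert_eq!(sized.len(), 0);
}
```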

This PR also upgrades the dependency of bloomfilter to 3.0.
Breaking changes included (a short usage sketch follows this list):

  1. from_existing no longer exists; it is now from_slice, which takes different arguments
  2. sip_keys is no longer exposed; we now have the seed instead. Updated tests to check for this.
  3. The bitmap is no longer exposed; we now have to use the as_slice method to get it
  4. len() is now used instead of number_of_bits()
  5. Updated serialize and deserialize to now use the module from bloom
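
As a rough sketch of these API changes in use (not code from this PR; the seed value and item type are placeholders, and the exact constructor signature in bloomfilter 3.0 may differ):

```rust
use bloomfilter::Bloom;

fn main() {
    // Placeholder seed; the module uses its own fixed seed constant.
    let seed = [0u8; 32];

    // Constructors now return a Result instead of panicking on bad parameters.
    let mut bloom: Bloom<str> = Bloom::new_for_fp_rate_with_seed(1000, 0.01, &seed)
        .expect("valid capacity and fp rate");
    bloom.set("item1");

    // The raw bitmap (with the seed in its header) comes from as_slice()
    // rather than from a directly exposed bitmap field.
    let bytes: &[u8] = bloom.as_slice();

    // from_existing is gone; from_slice rebuilds the filter from that byte form.
    let restored: Bloom<str> = Bloom::from_slice(bytes).expect("well-formed slice");

    // len() now reports the number of bits, replacing number_of_bits().
    assert_eq!(bloom.len(), restored.len());
    assert!(restored.check("item1"));
}
```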

Other changes with this upgrade include removing unneeded fields from rdb load and save, as they are now included in the bitmap. Because of this, we also changed the size validation in rdb load to calculate the size from the fp_rate and capacity. This included creating a new helper method on BloomFilterType to calculate the fp_rate based on the number of filters.
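
One plausible shape for such a helper, as a hypothetical sketch only (the tightening ratio, return type, and underflow guard are assumptions, not the module's actual code):

```rust
// Hypothetical tightening ratio; the module defines its own constant.
const TIGHTENING_RATIO: f64 = 0.5;

/// Sketch: derive the false-positive rate to validate the newest filter against,
/// from the object's base fp_rate and the number of filters it already holds.
fn calculate_fp_rate(base_fp_rate: f64, num_filters: i32) -> Option<f64> {
    // Each additional scaled-out filter is created with a tighter fp rate,
    // so the rate shrinks geometrically with the filter count.
    let rate = base_fp_rate * TIGHTENING_RATIO.powi(num_filters);
    // Guard against the rate underflowing to something unusable.
    if rate > f64::MIN_POSITIVE {
        Some(rate)
    } else {
        None
    }
}

fn main() {
    // With the assumed ratio, the rate for a third filter would be 0.01 * 0.5^2.
    let rate = calculate_fp_rate(0.01, 2).expect("rate should still be positive");
    assert!((rate - 0.0025).abs() < 1e-12);
}
```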

…w has capacity of filter we are loading from

Signed-off-by: zackcam <[email protected]>

Cargo.toml Outdated
@@ -15,7 +15,7 @@ homepage = "https://github.com/valkey-io/valkey-bloom"
valkey-module = "0.1.2"
valkey-module-macros = "0"
linkme = "0"
bloomfilter = { version = "1.0.13", features = ["serde"] }
bloomfilter = { version = "3", features = ["serde"] }
Contributor:

Please change the version format 3 -> 3.0.0

@@ -104,7 +105,7 @@ impl BloomFilterType {

/// Create a new BloomFilterType object from an existing one.
pub fn create_copy_from(from_bf: &BloomFilterType) -> BloomFilterType {
-let mut filters = Vec::new();
+let mut filters: Vec<BloomFilter> = Vec::with_capacity(from_bf.filters.len());
Contributor:

Should we worry about over-allocating? (when filters.len() = 0)

Member:

filters.len() will never / should never be 0, so this seems OK

@@ -75,6 +74,8 @@ impl ValkeyDataType for BloomFilterType {
let Ok(fp_rate) = raw::load_double(rdb) else {
return None;
};
let mut filters: Vec<BloomFilter> = Vec::with_capacity(num_filters as usize);

Member:

A few comments on RDB Save and Load.

  1. The number of hash functions is included in the bitmap dump as a result of to_bytes, so we can remove it from the save/load logic. Right?
  2. Do we need the number of bits for any purpose? Is it needed for bloom filter restoration (external bloom)? If not, can we remove it? If it helps with any purpose (or if it makes size validation impossible), feel free to retain it.

Contributor (author):

We don't need both fields. I removed them both and instead moved the size check to after we have converted the bloom object into a type where we can get the number of bits.

@zackcam force-pushed the unstable branch 3 times, most recently from 3da4c19 to 0484bbb on December 3, 2024 at 23:18
@zackcam changed the title from "Updating how we create BloomFilter from rdb loads. BloomFilter vec now has capacity of filter we are loading from" to "Updating how we create BloomFilter from rdb loads and upgrading bloomfilter dependency to version 3.0" on Dec 3, 2024
dig.add_long_long(key1 as i64);
dig.add_long_long(key2 as i64);
}
dig.add_string_buffer(filter.bloom.as_slice());
Member:

Just confirming: does the slice contain the seed as well?

Contributor (author):

Yes, it is contained in the header portion of the bitmap. It is set like this in sync: BitMap::set_seed(header, &seed);

// Reject RDB Load if any bloom filter within a bloom object is of a size greater than what is allowed.
if !BloomFilter::validate_size_with_bits(number_of_bits) {
let new_fp_rate =
match Self::calculate_fp_rate(fp_rate, num_filters.try_into().unwrap()) {
Member:

Let's not unwrap

return None;
}
};
if !BloomFilter::validate_size(capacity.try_into().unwrap(), new_fp_rate) {
Member:

Let's not unwrap

Member:

You can cast the u64 to a u32 using as

@@ -330,7 +337,8 @@ impl BloomFilter {
capacity as usize,
fp_rate,
&configs::FIXED_SEED,
-);
+)
+.unwrap();
Member:

Let's not unwrap

sip_keys,
);
pub fn from_existing(bitmap: &[u8], num_items: u32, capacity: u32) -> BloomFilter {
let bloom = bloomfilter::Bloom::from_slice(bitmap).unwrap();
Member:

Let's not unwrap

…ll as removing unnecesary fields saved in rdb

Signed-off-by: zackcam <[email protected]>
@@ -75,24 +71,28 @@ impl ValkeyDataType for BloomFilterType {
let Ok(fp_rate) = raw::load_double(rdb) else {
return None;
};
let mut filters: Vec<BloomFilter> = Vec::with_capacity(num_filters as usize);
@KarthikSubbarao (Member) commented Dec 4, 2024:

Note: This will be fixed in a follow-up PR. We want to make the vector expansion consistent for (1) bloom object creation and scaling, (2) COPY, (3) RDB Load, and (4) BF.LOAD.

We can also add the vector capacity into the DEBUG DIGEST. This will be very useful for correctness testing.

You can keep this in mind when working on the fix @zackcam.

@KarthikSubbarao merged commit 34d6fb0 into valkey-io:unstable on Dec 4, 2024
6 checks passed
nnmehta pushed a commit to nnmehta/ValkeyBloom that referenced this pull request Dec 4, 2024
…filter dependency to version 3.0 (valkey-io#23)

* Updating how we create BloomFilter from rdb loads. BloomFilter vec now has capacity of filter we are loading from

Signed-off-by: zackcam <[email protected]>

* Updating bloomfilter dependency to version 3, fixing breaking changes as well

Signed-off-by: zackcam <[email protected]>

* Updating the digest changes to follow updated version of bloom. As well as removing unnecesary fields saved in rdb

Signed-off-by: zackcam <[email protected]>

* Update log in src/bloom/data_type.rs

Signed-off-by: KarthikSubbarao <[email protected]>

* Update comment in src/bloom/utils.rs

Signed-off-by: KarthikSubbarao <[email protected]>

* Clippy error in src/bloom/data_type.rs

Signed-off-by: KarthikSubbarao <[email protected]>

---------

Signed-off-by: zackcam <[email protected]>
Signed-off-by: KarthikSubbarao <[email protected]>
Co-authored-by: KarthikSubbarao <[email protected]>