S3 file size issue #126
Comments
Thanks for the issue; we are aware of this and already looking into it. The idea behind splitting into multiple files was related to compression and encryption. We are currently assessing how this should be implemented, but in the current implementation we cannot really request a byte range because of compression and encryption.
Also, the file size is configurable and can be tweaked so that the files are not necessarily small; if the segment file is smaller than the configured file size, the segment will not be split. Since some people have really big segment sizes, and because of compression and encryption, we do not really want to fetch the whole file just to decrypt/decompress it. But as I mentioned before, we are looking into how to improve this.
Multipart upload is definitely beneficial for larger uploads compared to the standard file upload, but the elephant in the room here is compression/encryption. As @AnatolyPopov mentioned, while it's possible to make fetching of byte ranges work with encryption (which can preserve the same block sizes as the plaintext data), this doesn't work trivially with compression. We are trying to solve these issues; however, I don't see any harm in using S3 multipart upload in the specific case of no compression and no encryption.
With no compression and encryption it's certainly possible, but those are not optional as of right now, and this needs to be changed.
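For context, a ranged read of an uncompressed, unencrypted object is straightforward with the S3 API. The sketch below assumes the AWS SDK for Java v2 and uses hypothetical bucket/key names; it is not the plugin's actual code, and it only makes sense when remote byte offsets match the original segment's offsets (i.e. no plugin-level compression or encryption).

```java
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class RangedRead {
    public static void main(String[] args) {
        // Hypothetical bucket/key names for illustration only.
        try (S3Client s3 = S3Client.create()) {
            GetObjectRequest request = GetObjectRequest.builder()
                    .bucket("my-tiered-storage-bucket")
                    .key("topic-0/00000000000000000000.log")
                    .range("bytes=1048576-2097151") // fetch 1 MiB starting at offset 1 MiB
                    .build();
            ResponseBytes<GetObjectResponse> bytes = s3.getObjectAsBytes(request);
            System.out.println("Fetched " + bytes.asByteArray().length + " bytes");
        }
    }
}
```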
@AnatolyPopov Shall we make an issue for this?
@mdedetrich Yes, we should.
I think the Kafka log segment file is already compressed (snappy, gzip or zstd), not sure whether we need to compress it further. |
So we are planning to add a feature where, if Kafka is configured to compress segments, we won't recompress them. For our specific use case, since we are dealing with external users who can configure Kafka themselves, it would be nice if we could compress at the plugin level, but as stated earlier that is problematic.
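As an illustration of how such a check might look (a sketch assuming the kafka-clients `FileRecords` API, not the plugin's actual implementation), one could inspect batch-level compression in a segment file before deciding whether to recompress:

```java
import java.io.File;
import java.io.IOException;

import org.apache.kafka.common.record.CompressionType;
import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.RecordBatch;

public class SegmentCompressionCheck {

    // Returns true if every record batch in the segment already carries
    // producer-level compression (snappy, gzip, zstd, lz4), in which case
    // recompressing at the plugin level would add little benefit.
    static boolean allBatchesCompressed(File segmentFile) throws IOException {
        try (FileRecords records = FileRecords.open(segmentFile)) {
            for (RecordBatch batch : records.batches()) {
                if (batch.compressionType() == CompressionType.NONE) {
                    return false;
                }
            }
        }
        return true;
    }
}
```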
This was changed radically in the new implementation. Now, despite chunking, we still upload a single blob and support range queries. |
Are we still uploading many small files onto S3? |
No. Compression and encryption are performed per chunk, but the result is nevertheless concatenated before upload. Regardless of the log file size, each segment will produce these files on the remote: https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/core/src/test/java/io/aiven/kafka/tieredstorage/RemoteStorageManagerTest.java#L210-L224
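Conceptually, range queries over a per-chunk transformed blob work by keeping an index from original chunk offsets to transformed offsets. The sketch below is a hypothetical illustration of that idea only; it does not reflect the project's actual chunk index classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical chunk index: maps a byte position in the original segment to
// the byte range of the corresponding transformed (compressed/encrypted)
// chunk inside the single uploaded blob.
public class ChunkIndexSketch {

    record Chunk(int originalStart, int originalSize, int transformedStart, int transformedSize) {}

    private final List<Chunk> chunks = new ArrayList<>();

    // Called while uploading: record each chunk's original and transformed sizes.
    void addChunk(int originalSize, int transformedSize) {
        int origStart = 0;
        int transStart = 0;
        if (!chunks.isEmpty()) {
            Chunk last = chunks.get(chunks.size() - 1);
            origStart = last.originalStart() + last.originalSize();
            transStart = last.transformedStart() + last.transformedSize();
        }
        chunks.add(new Chunk(origStart, originalSize, transStart, transformedSize));
    }

    // Called while reading: find which transformed byte range to request
    // (e.g. via an S3 ranged GET) for a given offset in the original segment.
    Chunk chunkForOriginalOffset(int originalOffset) {
        for (Chunk chunk : chunks) {
            if (originalOffset >= chunk.originalStart()
                    && originalOffset < chunk.originalStart() + chunk.originalSize()) {
                return chunk;
            }
        }
        throw new IllegalArgumentException("Offset beyond segment end: " + originalOffset);
    }
}
```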
This is related to #125
From the code here https://github.com/aiven/tiered-storage-for-apache-kafka/blob/main/s3/src/main/java/io/aiven/kafka/tiered/storage/s3/S3ClientWrapper.java#L179, the plugin breaks the Kafka log segment file into multiple part files and uploads them one by one.
This creates multiple files on S3 corresponding to one original log segment file, which leads to the many-small-files problem on S3 and hinders S3 performance (especially object listing, which you are using).
If the goal is to improve upload performance, you can use S3's multipart upload, which uses multiple threads, while the target object on S3 remains a single file (instead of many small part files). On the read/download path, you can also use S3's range API to read a chunk of bytes instead of the whole S3 object: https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html
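For illustration, here is a minimal sketch of that approach using the AWS SDK for Java v2's low-level multipart API. Bucket and key names are hypothetical; parts are uploaded sequentially here for brevity, but they could be sent from multiple threads.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.UploadPartRequest;
import software.amazon.awssdk.services.s3.model.UploadPartResponse;

public class MultipartUploadSketch {

    // Uploads one segment file as a single S3 object using multipart upload,
    // so only one object exists on S3 regardless of how many parts were sent.
    static void uploadSegment(S3Client s3, String bucket, String key, Path segmentFile) throws IOException {
        final int partSize = 8 * 1024 * 1024; // 8 MiB parts (each part except the last must be >= 5 MiB)

        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();

        List<CompletedPart> completedParts = new ArrayList<>();
        try (InputStream in = Files.newInputStream(segmentFile)) {
            byte[] buffer = new byte[partSize];
            int partNumber = 1;
            int read;
            while ((read = in.readNBytes(buffer, 0, partSize)) > 0) {
                UploadPartResponse response = s3.uploadPart(
                        UploadPartRequest.builder()
                                .bucket(bucket).key(key)
                                .uploadId(uploadId)
                                .partNumber(partNumber)
                                .build(),
                        RequestBody.fromBytes(Arrays.copyOf(buffer, read)));
                completedParts.add(CompletedPart.builder()
                        .partNumber(partNumber)
                        .eTag(response.eTag())
                        .build());
                partNumber++;
            }
        }

        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key)
                .uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts).build())
                .build());
    }
}
```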