[C++][Parquet] Enable size statistics by default #45227

Open
pitrou opened this issue Jan 11, 2025 · 6 comments

Comments

@pitrou
Member

pitrou commented Jan 11, 2025

Describe the enhancement requested

Following #45202, the overhead of computing and writing out the size statistics seems negligible enough that we should probably enable them by default. This may allow performance improvements in readers once they are updated to take advantage of the information.

Component(s)

C++, Parquet

@pitrou
Member Author

pitrou commented Jan 11, 2025

@wgtmac @mapleFU

@wgtmac
Member

wgtmac commented Jan 12, 2025

What about page index (which is not yet enabled by default)? Or do we enable SizeStatisticsLevel::ColumnChunk by default for now?

@pitrou
Member Author

pitrou commented Jan 12, 2025

What about page index (which is not yet enabled by default)?

The benchmark enables it when PageAndColumnChunk is selected, right?

@wgtmac
Member

wgtmac commented Jan 12, 2025

The benchmark enables page index in all cases. Should I change it to add a baseline with page index disabled?

@pitrou
Member Author

pitrou commented Jan 12, 2025

The benchmark enables page index in all cases. Should I change it to add a baseline with page index disabled?

Ah, that could be informative indeed.

@wgtmac
Member

wgtmac commented Jan 12, 2025

Update: I just extended the size_stats_benchmark by adding a third variable to control page index:

------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                            Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------------
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, false>                  9397353 ns      9397174 ns           74 bytes_per_second=864.622Mi/s items_per_second=111.584M/s output_size=546.08k page_index_size=0
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, true>                   9406171 ns      9406047 ns           75 bytes_per_second=863.806Mi/s items_per_second=111.479M/s output_size=546.091k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, true>            9411777 ns      9411641 ns           74 bytes_per_second=863.293Mi/s items_per_second=111.413M/s output_size=546.107k page_index_size=33
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, true>     9412548 ns      9412415 ns           74 bytes_per_second=863.222Mi/s items_per_second=111.403M/s output_size=546.121k page_index_size=47
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType, false>                12655070 ns     12654774 ns           55 bytes_per_second=365.859Mi/s items_per_second=82.8601M/s output_size=864.052k page_index_size=0
BM_WritePrimitiveColumn<SizeStatisticsLevel::None, ::arrow::StringType, true>                 12647100 ns     12646934 ns           55 bytes_per_second=366.086Mi/s items_per_second=82.9115M/s output_size=864.083k page_index_size=30
BM_WritePrimitiveColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType, true>          12680150 ns     12679974 ns           55 bytes_per_second=365.132Mi/s items_per_second=82.6954M/s output_size=864.103k page_index_size=30
BM_WritePrimitiveColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType, true>   12701362 ns     12701186 ns           55 bytes_per_second=364.522Mi/s items_per_second=82.5573M/s output_size=864.122k page_index_size=44
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, false>                      16357127 ns     16356706 ns           42 bytes_per_second=521.957Mi/s items_per_second=64.1068M/s output_size=625.904k page_index_size=0
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::Int64Type, true>                       16340772 ns     16340560 ns           43 bytes_per_second=522.473Mi/s items_per_second=64.1701M/s output_size=625.915k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::Int64Type, true>                16859853 ns     16859622 ns           42 bytes_per_second=506.388Mi/s items_per_second=62.1945M/s output_size=625.937k page_index_size=34
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::Int64Type, true>         16847092 ns     16846868 ns           41 bytes_per_second=506.771Mi/s items_per_second=62.2416M/s output_size=625.957k page_index_size=54
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType, false>                     19851481 ns     19851224 ns           35 bytes_per_second=254.008Mi/s items_per_second=52.8217M/s output_size=944.092k page_index_size=0
BM_WriteListColumn<SizeStatisticsLevel::None, ::arrow::StringType, true>                      20174939 ns     20174669 ns           34 bytes_per_second=249.935Mi/s items_per_second=51.9749M/s output_size=944.123k page_index_size=31
BM_WriteListColumn<SizeStatisticsLevel::ColumnChunk, ::arrow::StringType, true>               20878354 ns     20878126 ns           34 bytes_per_second=241.514Mi/s items_per_second=50.2237M/s output_size=944.149k page_index_size=31
BM_WriteListColumn<SizeStatisticsLevel::PageAndColumnChunk, ::arrow::StringType, true>        20549374 ns     20549100 ns           34 bytes_per_second=245.381Mi/s items_per_second=51.0278M/s output_size=944.174k page_index_size=51
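
To make the numbers above easier to compare, here is a rough sketch that turns the "Time" column into a relative overhead per size statistics level. It uses the rows with page index enabled (`true`) as the baseline, so the only thing that varies is the SizeStatisticsLevel. The struct and function names (`BenchmarkRow`, `OverheadPercent`, `MaxOverheadPercent`, `PrintOverheadReport`) are made up for illustration and are not part of Arrow or of size_stats_benchmark. These are single-run timings, so small differences are likely noise.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

// Wall-clock times in nanoseconds, copied from the "Time" column of the
// benchmark output above. Only rows with page index enabled are used.
struct BenchmarkRow {
  const char* name;
  double none_ns;                   // SizeStatisticsLevel::None
  double column_chunk_ns;           // SizeStatisticsLevel::ColumnChunk
  double page_and_column_chunk_ns;  // SizeStatisticsLevel::PageAndColumnChunk
};

constexpr std::array<BenchmarkRow, 4> kRows = {{
    {"BM_WritePrimitiveColumn<Int64Type>", 9406171, 9411777, 9412548},
    {"BM_WritePrimitiveColumn<StringType>", 12647100, 12680150, 12701362},
    {"BM_WriteListColumn<Int64Type>", 16340772, 16859853, 16847092},
    {"BM_WriteListColumn<StringType>", 20174939, 20878354, 20549374},
}};

// Relative slowdown of `candidate_ns` versus `baseline_ns`, in percent.
constexpr double OverheadPercent(double baseline_ns, double candidate_ns) {
  return (candidate_ns - baseline_ns) / baseline_ns * 100.0;
}

// Largest overhead seen across all rows and both non-None levels.
double MaxOverheadPercent() {
  double max_pct = 0.0;
  for (const auto& row : kRows) {
    max_pct = std::max(
        {max_pct, OverheadPercent(row.none_ns, row.column_chunk_ns),
         OverheadPercent(row.none_ns, row.page_and_column_chunk_ns)});
  }
  return max_pct;
}

// Prints one line per benchmark with the overhead of each level.
void PrintOverheadReport() {
  for (const auto& row : kRows) {
    std::printf("%-40s ColumnChunk: %+.2f%%  PageAndColumnChunk: %+.2f%%\n",
                row.name,
                OverheadPercent(row.none_ns, row.column_chunk_ns),
                OverheadPercent(row.none_ns, row.page_and_column_chunk_ns));
  }
}
```

Calling `PrintOverheadReport()` from `main` gives one line per benchmark. On these timings the primitive columns stay well under 1% overhead, while the list columns land at a few percent. The list-of-string case shows PageAndColumnChunk running faster than ColumnChunk, which suggests run-to-run variance of the same order as the effect being measured.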
