[Python][Parquet][C++] pyarrow 18 high memory consumption #45236
Comments
Thanks for opening the issue. Could you explain what you mean by the following?
Did you find out anything with memray about where the memory was being used, or did you just check the peak memory consumption? It would be really helpful if you could investigate a little further where the issue is coming from.
I meant that the parquet file is around 600 KB, and checking the loaded pyarrow table's size (data.nbytes) gives 22 MB, meaning the table is actually quite small. Here's the memray flamegraph for the above test.parquet file. Most of the allocations appear to come from native code. I don't have the symbols, so it's hard to read.
Your "dummy parquet file" doesn't reproduce the issue for me. I see a peak consumption of around 226 MB, and that's with Arrow compiled in debug mode.
Maybe you had a different pyarrow build? I have the same result in a Dockerfile:
FROM python:3.12
RUN pip install pyarrow==18.1.0 memray
COPY test.parquet test.parquet
RUN echo "import pyarrow.parquet as pq\ndata = pq.read_table('test.parquet')\nprint(data.nbytes / 1024**2)" > /test.py
RUN memray run --native -o report.bin test.py
CMD memray stats report.bin
Running the above (current pyarrow distro 18.1.0) yields:
Changing to pyarrow==17.0.0
Describe the bug, including details regarding any error messages, version, and platform.
We noticed high memory consumption while reading parquet files with pyarrow 18.1.
Loading a 600 KB parquet file into a 22 MB pyarrow table consumes over 1 GB of memory. On 3 different machines (WSL, Linux, macOS), profiling with memray showed a peak memory of 1 GB, 1.1 GB and 1.8 GB.
Running the same code with pyarrow 17 consumes less than 200 MB.
It's quite simple to reproduce. I've attached a dummy parquet file which consumes slightly less, but still over 1 GB.
test.zip
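For anyone who wants to compare these numbers without memray, here is a minimal sketch that prints the table size next to the process's peak resident memory. It is an illustration only, not the script used for the measurements above: the helper names `peak_rss_mb` and `report_read` are made up for this example, it relies on the Unix-only `resource` module, and peak RSS is a coarser measure than what memray reports, so the figures will not match exactly.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor


def report_read(path: str) -> None:
    """Read a Parquet file and print the table size next to peak memory figures."""
    # Imported here so the helper above works even without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    before = peak_rss_mb()
    table = pq.read_table(path)
    print(f"table.nbytes:     {table.nbytes / 1024**2:.1f} MiB")

    # max_memory() may return None if the pool backend cannot track its peak.
    pool_peak = pa.default_memory_pool().max_memory()
    if pool_peak is not None:
        print(f"arrow pool peak:  {pool_peak / 1024**2:.1f} MiB")

    print(f"process peak RSS: {peak_rss_mb():.1f} MiB (was {before:.1f} MiB before the read)")
```

Calling `report_read("test.parquet")` once under pyarrow 17 and once under pyarrow 18.1 should show whether the gap is visible in peak RSS, and whether it is attributed to the default Arrow memory pool or to allocations made outside it.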
Component(s)
Python