[Python][[Parquet][C++] pyarrow 18 high memory consumption #45236

Open
kubat-square-sense opened this issue Jan 13, 2025 · 4 comments

kubat-square-sense commented Jan 13, 2025

Describe the bug, including details regarding any error messages, version, and platform.

We noticed high memory consumption while reading Parquet files with pyarrow 18.1.
Loading a 600 KB Parquet file into a 22 MB pyarrow table consumes over 1 GB of memory. On three different machines (WSL, Linux, macOS), profiling with memray showed peak memory of 1 GB, 1.1 GB, and 1.8 GB.

Running the same code with pyarrow 17 consumes less than 200 MB.

It's quite simple to reproduce. I've attached a dummy Parquet file, which consumes slightly less but still over 1 GB.

import pyarrow.parquet as pq

data = pq.read_table('test.parquet')

print(data.nbytes / 1024**2)
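To put the table size and the process-level memory side by side without an external profiler, something like the following could be used. This is only a sketch: `peak_rss_mib` and `report` are hypothetical helper names, the `resource` module is Unix-only, and peak RSS is not the same metric memray reports.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024**2 if sys.platform == "darwin" else 1024
    return peak / divisor


def report(path: str) -> None:
    """Load a Parquet file and print the table size next to the peak RSS."""
    # Imported lazily so that peak_rss_mib() can be used without pyarrow.
    import pyarrow.parquet as pq

    before = peak_rss_mib()
    table = pq.read_table(path)
    after = peak_rss_mib()
    print(f"table.nbytes: {table.nbytes / 1024**2:.1f} MiB")
    print(f"peak RSS before / after read: {before:.1f} / {after:.1f} MiB")
```

Calling `report('test.parquet')` under pyarrow 17 and 18 in otherwise identical environments would show whether the gap is visible at the process level too.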

test.zip

Component(s)

Python

raulcd (Member) commented Jan 13, 2025

Thanks for opening the issue.

Could you explain what you mean by the following?

Loading a 600 KB Parquet file into a 22 MB

Did you find out anything with memray about where the memory was being used, or did you just check the peak memory consumption? It would be really helpful if you could investigate a little further where the issue is coming from.

@raulcd raulcd changed the title pyarrow 18 high memory consumption [Python][[Parquet][C++] pyarrow 18 high memory consumption Jan 13, 2025

kubat-square-sense (Author) commented Jan 13, 2025

I meant that the Parquet file is around 600 KB, while the loaded pyarrow table's size (data.nbytes) is 22 MB, so the table is actually quite small.

Here's the memray flamegraph for the above test.parquet file. Most of the allocations appear to come from native code. I don't have the symbols, so it's hard to read.
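Since the native frames are unreadable without symbols, one additional data point could be Arrow's own memory pool counters, which say how much went through the pool regardless of symbols. A minimal sketch, assuming the default pool is the one in use; `pool_snapshot` and `format_snapshot` are hypothetical helper names, while `default_memory_pool()`, `backend_name`, `bytes_allocated()` and `max_memory()` are existing pyarrow APIs.

```python
from typing import Optional


def pool_snapshot() -> dict:
    """Collect the backend name and counters of pyarrow's default memory pool."""
    import pyarrow as pa  # imported lazily; requires pyarrow to be installed

    pool = pa.default_memory_pool()
    return {
        "backend": pool.backend_name,              # e.g. "system", "jemalloc", "mimalloc"
        "bytes_allocated": pool.bytes_allocated(),  # bytes currently held by the pool
        "max_memory": pool.max_memory(),            # pool peak; may be None if untracked
    }


def format_snapshot(snap: dict) -> str:
    """Render a snapshot as a single human-readable line."""

    def mib(n: Optional[int]) -> str:
        return "n/a" if n is None else f"{n / 1024**2:.1f} MiB"

    return (
        f"{snap['backend']}: current={mib(snap['bytes_allocated'])}, "
        f"peak={mib(snap['max_memory'])}"
    )
```

Printing `format_snapshot(pool_snapshot())` right after `pq.read_table(...)` on both versions would show whether the backend or the pool peak differs. Arrow also documents an `ARROW_DEFAULT_MEMORY_POOL` environment variable for selecting the backend; whether that affects this case is untested here.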

memray-flamegraph-perf_pyarrow.py.11595.zip

pitrou (Member) commented Jan 14, 2025

Your "dummy parquet file" doesn't reproduce the issue for me. I see a peak consumption of around 226 MB, and that's with Arrow compiled in debug mode.

kubat-square-sense (Author) commented

Maybe you had a different pyarrow build?

I get the same result with this Dockerfile:

FROM python:3.12

RUN pip install pyarrow==18.1.0 memray

COPY test.parquet test.parquet

RUN echo "import pyarrow.parquet as pq\ndata = pq.read_table('test.parquet')\nprint(data.nbytes / 1024**2)" > /test.py

RUN memray run --native -o report.bin test.py 

CMD memray stats report.bin

docker build --tag test_image .
docker run test_image

Running the above (with the current pyarrow release, 18.1.0) yields:

  📏 Total allocations:
          75319
  
  📦 Total memory allocated:
          1.476GB
  
  📊 Histogram of allocation size:
          min: 1.000B
          --------------------------------------------
          < 6.000B   :  4688 ▇▇▇▇▇
          < 36.000B  : 22577 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
          < 222.000B : 28911 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
          < 1.319KB  : 16851 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
          < 7.999KB  :  1397 ▇▇
          < 48.503KB :   831 ▇
          < 294.066KB:    34 ▇
          < 1.741MB  :     7 ▇
          < 10.556MB :     0 
          <=64.000MB :    23 ▇
          --------------------------------------------
          max: 64.000MB
  
  📂 Allocator type distribution:
           MALLOC: 71282
           CALLOC: 3149
           REALLOC: 838
           MMAP: 50
  
  🥇 Top 5 largest allocating locations (by size):
          - <stack trace unavailable> -> 1.378GB
          - _call_with_frames_removed:<frozen importlib._bootstrap>:488 -> 84.912MB
          - dedent:/usr/local/lib/python3.12/textwrap.py:436 -> 4.572MB
          - sub:/usr/local/lib/python3.12/re/__init__.py:186 -> 3.808MB
          - dedent:/usr/local/lib/python3.12/textwrap.py:435 -> 1.117MB
  
  🥇 Top 5 largest allocating locations (by number of allocations):
          - <stack trace unavailable> -> 29539
          - _call_with_frames_removed:<frozen importlib._bootstrap>:488 -> 17261
          - <module>:test.py:3 -> 5130
          - sub:/usr/local/lib/python3.12/re/__init__.py:186 -> 4586
          - dedent:/usr/local/lib/python3.12/textwrap.py:436 -> 4382

Changing to pyarrow==17.0.0 yields:

  📦 Total memory allocated:
          219.388MB
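For a rough sense of scale, the two "Total memory allocated" figures can be compared directly. The sketch below assumes memray's unit labels are 1024-based, which is an assumption on my part; `parse_size` is a hypothetical helper. It only compares the two printed totals and says nothing about peak usage.

```python
import re

# Assumption: KB/MB/GB in the memray output are powers of 1024.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> float:
    """Convert a size string such as '1.476GB' into a number of bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(B|KB|MB|GB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


total_18 = parse_size("1.476GB")    # pyarrow 18.1.0 run
total_17 = parse_size("219.388MB")  # pyarrow 17.0.0 run
ratio = total_18 / total_17
print(f"pyarrow 18.1.0 allocated about {ratio:.1f}x as much as 17.0.0")
# prints: pyarrow 18.1.0 allocated about 6.9x as much as 17.0.0
```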
