Describe the issue
We found that for small dots, Triton is now using a concerning number of registers. For the [16,256]x[256,16] dot in the example below, compiling the kernel through Triton and then running the resulting PTX through ptxas used 34 registers before the culprit commit. As of the culprit commit, it uses 255 registers and spills: 120 bytes stack frame, 116 bytes spill stores, 116 bytes spill loads.

Any thoughts as to what's causing this and how we can fix it?
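To compare builds before and after d9facf3 without reading ptxas output by eye, the register and spill figures can be pulled out with a couple of regular expressions. This is a minimal sketch: it assumes the output covers a single kernel and that ptxas prints its usual `Used N registers` and `... bytes stack frame, ... bytes spill stores, ... bytes spill loads` lines. `parse_ptxas_verbose`, `PtxasStats`, and the two sample strings are hypothetical names; the samples are hand-written to mirror that layout using the numbers quoted in this issue, not captured output.

```python
import re
from dataclasses import dataclass
from typing import Optional

# "ptxas info    : Used 255 registers, ..." -> 255
_REGS_RE = re.compile(r"Used (\d+) registers")
# "120 bytes stack frame, 116 bytes spill stores, 116 bytes spill loads"
_SPILL_RE = re.compile(
    r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads"
)


@dataclass
class PtxasStats:
    registers: Optional[int]  # None if no "Used N registers" line was found
    stack_frame: int = 0
    spill_stores: int = 0
    spill_loads: int = 0

    @property
    def spills(self) -> bool:
        return self.spill_stores > 0 or self.spill_loads > 0


def parse_ptxas_verbose(text: str) -> PtxasStats:
    """Extract register and spill figures from `ptxas -v` output for one kernel."""
    regs = _REGS_RE.search(text)
    stats = PtxasStats(registers=int(regs.group(1)) if regs else None)
    spill = _SPILL_RE.search(text)
    if spill:
        stats.stack_frame, stats.spill_stores, stats.spill_loads = map(int, spill.groups())
    return stats


# Hand-written samples in the usual `ptxas -v` layout, NOT captured output.
SAMPLE_BEFORE = "ptxas info    : Used 34 registers\n"
SAMPLE_AFTER = (
    "ptxas info    : Function properties for kernel\n"
    "    120 bytes stack frame, 116 bytes spill stores, 116 bytes spill loads\n"
    "ptxas info    : Used 255 registers\n"
)

if __name__ == "__main__":
    before = parse_ptxas_verbose(SAMPLE_BEFORE)
    after = parse_ptxas_verbose(SAMPLE_AFTER)
    print(f"registers: {before.registers} -> {after.registers}")
    print(f"spill stores/loads: {after.spill_stores}/{after.spill_loads} bytes")
```

Run on the samples, this prints `registers: 34 -> 255` and `spill stores/loads: 116/116 bytes`; feeding it real ptxas output from the two builds gives the same comparison.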
Culprit commit: d9facf3
Here is a sample of the TTIR:
Steps to reproduce: compile the kernel through Triton, get the PTX from .triton/cache/, and run that file through ptxas -arch=sm_80 -v --warn-on-spills.

Environment details
GPU: A100 (also appears on H100)
Triton version: affects Triton built from source at or after d9facf3
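The reproduction steps above can also be scripted. The sketch below assumes the cache lives at `~/.triton/cache` (the issue only says `.triton/cache/`) and reuses the exact ptxas flags from the steps; `find_ptx_files`, `ptxas_command`, and `run_ptxas` are hypothetical helpers, not part of Triton. It picks the most recently written PTX file, on the assumption that this is the kernel that was just compiled.

```python
import shlex
import shutil
import subprocess
from pathlib import Path
from typing import List

# Assumption: the issue only says ".triton/cache/"; the home directory is assumed here.
DEFAULT_CACHE = Path.home() / ".triton" / "cache"


def find_ptx_files(cache_dir: Path = DEFAULT_CACHE) -> List[Path]:
    """Return every .ptx file under the cache directory, newest first."""
    files = list(Path(cache_dir).rglob("*.ptx"))
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


def ptxas_command(ptx: Path, arch: str = "sm_80") -> List[str]:
    """Build the ptxas invocation used in the steps to reproduce."""
    return ["ptxas", f"-arch={arch}", "-v", "--warn-on-spills", str(ptx)]


def run_ptxas(ptx: Path, arch: str = "sm_80") -> str:
    """Run ptxas on one file and return its diagnostics.

    Both streams are captured so the -v info lines and spill warnings are
    returned regardless of which stream ptxas writes them to.
    """
    result = subprocess.run(
        ptxas_command(ptx, arch), capture_output=True, text=True, check=True
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    ptx_files = find_ptx_files()
    if not ptx_files:
        print(f"no .ptx files found under {DEFAULT_CACHE}")
    elif shutil.which("ptxas") is None:
        print("ptxas not found on PATH")
    else:
        newest = ptx_files[0]
        print("$", shlex.join(ptxas_command(newest)))
        print(run_ptxas(newest))
```

For the H100 case noted above, `ptxas_command(path, "sm_90")` builds the equivalent command for that architecture.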