Gradient Accumulation Causes Loss And Grad Norm To Multiply By GA Steps Used (BS1GA8 Is ~8x Larger Than BS8GA1) #2262
Labels: bug
Comments
- Same here... One fix might be downgrade
- pip list
- This issue might be fixed after this merged:
Please check that this issue hasn't been reported before.
Expected Behavior
I assume the loss and grad_norm should be relatively similar. I know they won't be identical, but I don't think this behaviour is intended.
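For reference, here is a minimal PyTorch sketch (toy tensors, not axolotl or transformers code) of why the reported loss should be roughly batching-independent: the usual convention is to log the mean per-sample/per-token loss, and averaging 8 single-sample losses gives (approximately, exactly here since every sample contributes one target) the same number as the mean over one batch of 8.

```python
# Minimal sketch (assumed toy setup, not axolotl code): mean loss over one
# batch of 8 vs. the average of 8 single-sample losses should match.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)           # 8 samples, 10-class toy "vocabulary"
labels = torch.randint(0, 10, (8,))

# BS8GA1: one forward pass over all 8 samples, mean loss
loss_bs8 = F.cross_entropy(logits, labels)

# BS1GA8: 8 passes of 1 sample each, losses averaged over the GA steps
per_sample = [F.cross_entropy(logits[i:i + 1], labels[i:i + 1]) for i in range(8)]
loss_bs1_ga8 = torch.stack(per_sample).mean()

print(loss_bs8.item(), loss_bs1_ga8.item())  # ~identical
# If the per-step losses were summed instead of averaged, the second number
# would be ~8x larger, which matches the behaviour described below.
```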
Current behaviour
With the provided very tiny test config, using micro_batch_size=1 with gradient_accumulation_steps=8 results in ~8x higher loss than micro_batch_size=8 with gradient_accumulation_steps=1.
That means with BS8GA1 the loss starts at ~10.5 and ends at ~6.8, but with BS1GA8 it starts at ~83.7 and ends at ~50.
The loss follows roughly the same curve, but the grad_norm seems a bit different, and the eval loss is lower on BS1GA8.
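Purely as an illustration (this is an assumption about the mechanism, not axolotl's or transformers' actual trainer code), the ~8x factor on both loss and grad_norm is what you get if per-micro-batch losses are summed across gradient_accumulation_steps without dividing by the number of steps:

```python
# Hypothetical buggy accumulation loop: summing unscaled micro-batch losses
# inflates both the logged loss and the gradient norm by ~ga_steps compared
# to a single BS8 batch (which uses the mean loss).
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)
data = [(torch.randn(1, 16), torch.randint(0, 4, (1,))) for _ in range(8)]
loss_fn = torch.nn.CrossEntropyLoss()
ga_steps = 8

model.zero_grad()
logged_loss = 0.0
for x, y in data:
    loss = loss_fn(model(x), y)
    # Correct accumulation would scale each step by 1 / ga_steps:
    # (loss / ga_steps).backward()
    loss.backward()                  # buggy: no 1/ga_steps scaling
    logged_loss += loss.item()       # buggy: summed, not averaged

grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)
print(logged_loss, grad_norm.item())  # both ~8x what a single BS8 step reports
```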
Possibly related: I've been having some issues ever since the very first transformers/TRL GA fix, and I don't think it was ever fully solved (at least for me?). For example, when I was trying to do KTO with TRL itself (not within axolotl), the change caused the loss to go from 0.5 all the way up to 32.
Steps to reproduce
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.12.3
axolotl branch-commit
main/8606093
Acknowledgements