
Add KTO Loss #475

Open

wants to merge 27 commits into main
Conversation

@hebiao064 (Collaborator) commented on Dec 13, 2024

Summary

Closes the KTO item of the roadmap: #371

Implements the Kahneman-Tversky Optimization (KTO) loss function.

KTO Loss Function

For a policy π compared to a reference policy π₀:

When y is chosen:

$L_{KTO} = 1 - \sigma\left(\beta \cdot \left(\log\frac{\pi(y|x)}{\pi_0(y|x)} - \mathrm{KL}(\pi\,\|\,\pi_0)\right)\right)$

When y is rejected:

$L_{KTO} = 1 - \sigma\left(\beta \cdot \left(\mathrm{KL}(\pi\,\|\,\pi_0) - \log\frac{\pi(y|x)}{\pi_0(y|x)}\right)\right)$

where:

  • σ is the sigmoid function
  • β is a temperature parameter
  • KL(π‖π₀) is a KL-divergence estimate that acts as the reference point against which the log-ratio of the completion y is compared
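To make the formulas concrete, here is a rough plain-PyTorch sketch of the per-example loss. This is only an illustration, not the fused/chunked kernel added in this PR; the function and tensor names, the summed-log-prob inputs, and the precomputed scalar `kl` are assumptions for the example.

```python
import torch

def kto_loss_sketch(policy_logps, ref_logps, preference_labels, kl, beta=0.1):
    """Illustrative per-example KTO loss following the formulas above.

    policy_logps / ref_logps: (B,) summed log-probs of each completion under
        the policy and the reference model.
    preference_labels: (B,) bool tensor, True where the completion is "chosen".
    kl: scalar KL(pi || pi_0) estimate used as the reference point.
    beta: temperature parameter.
    """
    log_ratio = policy_logps - ref_logps                      # log[pi(y|x) / pi_0(y|x)]
    chosen_losses = 1 - torch.sigmoid(beta * (log_ratio - kl))
    rejected_losses = 1 - torch.sigmoid(beta * (kl - log_ratio))
    losses = torch.where(preference_labels, chosen_losses, rejected_losses)
    return losses.mean()                                      # reduce to a scalar loss
```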

Intuition

KTO loss is inspired by prospect theory from behavioral economics, which models how humans make decisions under uncertainty.

The loss function is asymmetric, treating gains and losses differently, similar to
human decision-making patterns.


Credit: https://www.youtube.com/watch?v=nSrj1J6ODoM&t=422s

Benchmark Result

Special thanks to @shivam15s for the optimization PR #491; without it, my implementation would not have reached the speeds listed below.

Memory:

(benchmark chart)

Speed:

(benchmark chart)

Notable learnings from optimizing the speed (see the sketch below):

  • [Culprit] The KL term was being recomputed for every one of the N chunks after splitting the batch
  • [Good to have] Remove unnecessary variable computations such as aux_outputs
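As an illustration of the first point, here is a minimal sketch (hypothetical function and variable names, not the actual fused implementation) of computing the KL reference term once and reusing it across chunks, instead of recomputing it inside the per-chunk loop:

```python
import torch

def chunked_kto_loss(policy_logps, ref_logps, labels,
                     kl_policy_logps, kl_ref_logps, beta=0.1, num_chunks=4):
    # Compute the KL reference point ONCE, outside the chunk loop;
    # recomputing it per chunk was the main slowdown.
    kl = (kl_policy_logps - kl_ref_logps).mean().clamp(min=0).detach()

    total_loss = policy_logps.new_zeros(())
    for p_chunk, r_chunk, l_chunk in zip(
        policy_logps.chunk(num_chunks),
        ref_logps.chunk(num_chunks),
        labels.chunk(num_chunks),
    ):
        # Per-chunk KTO loss, reusing the shared KL estimate.
        log_ratio = p_chunk - r_chunk
        chosen = 1 - torch.sigmoid(beta * (log_ratio - kl))
        rejected = 1 - torch.sigmoid(beta * (kl - log_ratio))
        total_loss = total_loss + torch.where(l_chunk, chosen, rejected).sum()

    return total_loss / policy_logps.numel()  # mean over all examples
```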

Key Changes

  • Implemented the LigerFusedLinearKTOLoss class
  • Added LigerFusedLinearKTOFunction for the core KTO computation
  • Created a comprehensive test suite in test_kto_loss.py
  • Added a reference implementation (HFKTOLoss) adapted from Hugging Face TRL

Testing Done

Tests are passing now:

`pytest test/chunked_loss/test_kto_loss.py`

  • Parameterized tests covering various configurations (a minimal sketch of the parameterization pattern follows this list):
    • Different batch sizes, sequence lengths, hidden dims, and vocab sizes
    • Multiple data types (bfloat16, float32)
    • Bias and reference-bias variations
    • Different ignore indices and beta values
  • Correctness tests comparing against the reference implementation
  • Gradient checking and backward-pass verification
  • Hardware Type:
  • run `make test` to ensure correctness
  • run `make checkstyle` to ensure code style
  • run `make test-convergence` to ensure convergence
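A minimal sketch of the parameterization pattern (the parameter values and tolerances here are illustrative, not the ones in test_kto_loss.py):

```python
import pytest
import torch

@pytest.mark.parametrize("B, T, H, V", [(2, 128, 512, 1024), (4, 47, 31, 123)])
@pytest.mark.parametrize("dtype, atol, rtol", [(torch.bfloat16, 5e-2, 5e-1), (torch.float32, 1e-5, 5e-4)])
@pytest.mark.parametrize("bias, ref_bias", [(True, True), (False, False)])
@pytest.mark.parametrize("ignore_index, beta", [(-100, 0.1), (42, 0.2)])
def test_kto_loss_correctness(B, T, H, V, dtype, atol, rtol, bias, ref_bias, ignore_index, beta):
    # Build random inputs, run the Liger KTO loss and the HF reference,
    # then torch.testing.assert_close on the losses and gradients.
    ...
```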

@hebiao064 hebiao064 marked this pull request as ready for review December 13, 2024 01:41
@ByronHsu (Collaborator) left a comment:

Took a brief look. I am not very familiar with the KTO math, but why don't we have KL_log_probs when the original HF implementation does: https://github.com/huggingface/trl/blob/cd7156fb34ddf9a8c04fcd640a4067933461d44e/trl/trainer/kto_trainer.py#L1121? We also need to be careful about scaling: in the original HF code, kto_loss returns an unreduced version, but we probably need to reduce it as a mean. cc @shivam15s

@hebiao064 (Collaborator, Author) replied:

> Took a brief look. I am not very familiar with the KTO math, but why don't we have KL_log_probs when the original HF implementation does: https://github.com/huggingface/trl/blob/cd7156fb34ddf9a8c04fcd640a4067933461d44e/trl/trainer/kto_trainer.py#L1121? We also need to be careful about scaling: in the original HF code, kto_loss returns an unreduced version, but we probably need to reduce it as a mean. cc @shivam15s

About the KL term: I'll take a further look at trl to see how to support that.

About the reduction: HF does average it here: `loss = losses.nanmean()`
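For context on the reduction point, a tiny standalone illustration (not the PR code) of how `nanmean` reduces an unreduced per-example loss while ignoring NaN entries:

```python
import torch

# One unreduced loss per example; NaN entries (e.g. examples with nothing to score) are skipped.
per_example_losses = torch.tensor([0.42, 0.17, float("nan"), 0.98])
loss = per_example_losses.nanmean()  # (0.42 + 0.17 + 0.98) / 3 -> tensor(0.5233)
```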

hebiao064 and others added 11 commits December 16, 2024 21:34
## Summary

### KTO LOSS

#### Memory

![image](https://github.com/user-attachments/assets/bd8fe4f6-0c18-4cf3-a79a-fc8634dcb492)

#### Speed

![image](https://github.com/user-attachments/assets/256cf0c3-3943-4f46-b256-38a577323a03)
@hebiao064 hebiao064 enabled auto-merge (squash) December 21, 2024 06:50
@hebiao064 (Collaborator, Author) commented:

The AMD test failed because no GPU was available; it is not related to this PR: FAILED test/transformers/test_swiglu.py::test_correctness_functional[dtype1-10000.0-0.01-9-7-41] - RuntimeError: No HIP GPUs are available

@kvignesh1420 (Collaborator) left a comment:

Thanks for the contribution. Did a first pass over the functionality and left some comments.

Review comments on benchmark/scripts/benchmark_kto_loss.py (outdated; resolved)
preference_labels_chunk=None,
ref_input_chunk=None,
):
(chunk_grad_input, chunk_grad_weight, *chunk_grad_bias), (chunk_loss) = fused_fwd_bwd(
A collaborator asked:

Why is it `*chunk_grad_bias` and not `chunk_grad_bias` like the other gradients?

@hebiao064 (Collaborator, Author) replied:
The `*chunk_grad_bias` syntax is used here because of how Python handles unpacking in this situation (a minimal illustration follows this list):

  1. The fused_fwd_bwd function returns a tuple of two elements:

    • First element: a tuple of gradients (either 2 or 3 elements, depending on whether bias is used)
    • Second element: the loss value

  2. When bias is None, fused_fwd_bwd returns only two gradients: (chunk_grad_input, chunk_grad_weight).

    When bias is not None, it returns three: (chunk_grad_input, chunk_grad_weight, chunk_grad_bias).

  3. The * operator handles this variability by collecting any remaining elements into a list:

    • If bias is None, *chunk_grad_bias becomes an empty list []
    • If bias exists, *chunk_grad_bias becomes a one-element list [chunk_grad_bias]
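A minimal standalone illustration of that unpacking behavior (generic names, not the actual kernel code):

```python
def fwd_bwd_with_bias():
    # Returns (gradients_tuple, loss) with three gradients.
    return ("grad_input", "grad_weight", "grad_bias"), 0.5

def fwd_bwd_without_bias():
    # Returns (gradients_tuple, loss) with only two gradients.
    return ("grad_input", "grad_weight"), 0.5

(grad_input, grad_weight, *grad_bias), loss = fwd_bwd_with_bias()
print(grad_bias)   # ['grad_bias']  -> one-element list

(grad_input, grad_weight, *grad_bias), loss = fwd_bwd_without_bias()
print(grad_bias)   # []             -> empty list, no unpacking error
```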

"""
Compute the total loss for a chunk of input and target, while using an alignment/preference loss function.
Args:
preference_loss_fn (callable): Loss function to compute the loss on a chunk of input/target.
A collaborator asked:
It seems this class already has a staticmethod for preference_loss_fn. Why do we need an extra arg here?

@hebiao064 (Collaborator, Author) replied on Jan 15, 2025:

  1. The abstract preference_loss_fn method defined at the class level is just a placeholder that defines the interface: it is marked with @abstractmethod and raises NotImplementedError, and it is meant to be implemented by subclasses.

  2. The preference_loss_fn parameter in methods like forward() and _compute_loss() is the actual function that will be used to compute the loss; it is passed in at runtime.

This is a common pattern in Python: the base class defines an interface (an abstract method) but allows runtime flexibility by accepting the actual implementation as a parameter. As a result you can (see the sketch below):

  1. Create subclasses that implement a default preference loss function by overriding the abstract method
  2. Override that default at runtime by passing in a different loss function as a parameter
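A minimal sketch of that pattern (generic class and function names, not the actual Liger classes):

```python
from abc import ABC, abstractmethod

import torch
import torch.nn.functional as F


class PreferenceLossBase(ABC):
    @staticmethod
    @abstractmethod
    def preference_loss_fn(chosen_logps, rejected_logps, beta=0.1):
        """Interface only: subclasses supply a default implementation."""
        raise NotImplementedError

    @classmethod
    def compute_loss(cls, chosen_logps, rejected_logps, preference_loss_fn=None, beta=0.1):
        # A runtime override wins; otherwise fall back to the subclass's default.
        loss_fn = preference_loss_fn or cls.preference_loss_fn
        return loss_fn(chosen_logps, rejected_logps, beta=beta)


class SigmoidPreferenceLoss(PreferenceLossBase):
    @staticmethod
    def preference_loss_fn(chosen_logps, rejected_logps, beta=0.1):
        # Default: a simple sigmoid preference loss.
        return -F.logsigmoid(beta * (chosen_logps - rejected_logps)).mean()


# Usage: default from the subclass, or an override passed at call time.
chosen, rejected = torch.tensor([1.0, 0.5]), torch.tensor([0.2, 0.1])
print(SigmoidPreferenceLoss.compute_loss(chosen, rejected))
print(SigmoidPreferenceLoss.compute_loss(
    chosen, rejected, preference_loss_fn=lambda c, r, beta=0.1: (r - c).mean()))
```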
