VLM: Model Tracing Guide #1030

Open
wants to merge 359 commits into
base: main

Conversation

kylesayrs
Collaborator

@kylesayrs kylesayrs commented Jan 2, 2025

Purpose

This guide explains the concepts of tracing as they relate to LLM Compressor and how to modify your model to support recipes which require using the Sequential Pipeline.

Through reading this guide, you will learn

  1. Why tracing is required when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier
  2. How to determine if your model is traceable for your dataset
  3. How to modify your model definition to be traceable
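To make the idea of traceability concrete, here is a toy model of symbolic tracing written in plain Python. It is not the tracer LLM Compressor uses; `Proxy` and `try_trace` are hypothetical names invented for this illustration. The sketch assumes a tracer that runs the function on placeholder values and records each operation, which means any branch that depends on a placeholder's actual value cannot be resolved at trace time.

```python
class Proxy:
    """Hypothetical stand-in for a tensor: records operations instead of computing."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def _record(self, op, other):
        # Append one node to the graph and return a new proxy for its output.
        other_name = other.name if isinstance(other, Proxy) else repr(other)
        out = Proxy(f"v{len(self.graph)}", self.graph)
        self.graph.append((out.name, op, self.name, other_name))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)

    def __gt__(self, other):
        return self._record("gt", other)

    def __bool__(self):
        # A proxy has no concrete value, so Python cannot choose a branch.
        raise TypeError(f"cannot use symbolic value {self.name!r} in control flow")


def try_trace(fn):
    """Return (graph, None) if fn can be traced, else (None, error)."""
    graph = []
    try:
        fn(Proxy("x", graph))
    except TypeError as error:
        return None, error
    return graph, None


def traceable(x):
    return (x + 1) * 2


def untraceable(x):
    # The branch depends on the value of x, which is unknown at trace time.
    if x > 0:
        return x + 1
    return x * 2


print(try_trace(traceable))
print(try_trace(untraceable))
```

The first call yields a two-node graph, while the second fails at the `if` statement. Data-dependent conditions in a model's forward function are one example of code that a tracer of this kind cannot follow.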

Prerequisites

Changes

  • Add a model tracing guide src/llmcompressor/transformers/tracing/README.md with pictures
  • Add a readme for the sequential pipeline which points to the Tracing Guide src/llmcompressor/pipelines/sequential/README.md
  • Add a debug script to help users debug their models for traceability src/llmcompressor/transformers/tracing/debug.py
    • Add the llmcompressor.attempt_trace entrypoint for ease of use
  • Swap the order of arguments in llava_example.py and pixtral_example.py to match the order of arguments on the modifier

Testing

Use the llmcompressor.attempt_trace debug script

llmcompressor.attempt_trace \
    --model_id llava-hf/llava-1.5-7b-hf \
    --model_class TraceableLlavaForConditionalGeneration \
    --sequential-targets LlamaDecoderLayer \
    --ignore "re:.*lm_head" "re:vision_tower.*" "re:multi_modal_projector.*" \
    --multimodal_data
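
For readers unfamiliar with the flags, the following sketch shows how a parser accepting the command above might be declared with `argparse`. It is not the actual `debug.py`: the flag names are copied from the command, but the help strings, defaults, and `nargs` choices are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    # Illustrative parser only; defaults and help text are assumptions.
    parser = argparse.ArgumentParser(description="Attempt to trace a model")
    parser.add_argument("--model_id", required=True, help="model stub or local path")
    parser.add_argument("--model_class", required=True, help="name of the model class to load")
    parser.add_argument("--sequential-targets", nargs="*", default=None,
                        help="names of the modules used to split the model")
    parser.add_argument("--ignore", nargs="*", default=[],
                        help="patterns of modules to skip while tracing")
    parser.add_argument("--multimodal_data", action="store_true",
                        help="use a multimodal sample instead of a text-only sample")
    return parser.parse_args(argv)


args = parse_args([
    "--model_id", "llava-hf/llava-1.5-7b-hf",
    "--model_class", "TraceableLlavaForConditionalGeneration",
    "--sequential-targets", "LlamaDecoderLayer",
    "--ignore", "re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*",
    "--multimodal_data",
])
print(args.sequential_targets, args.ignore, args.multimodal_data)
```

Note that `argparse` exposes `--sequential-targets` as `args.sequential_targets`, so mixing hyphens and underscores in flag names still produces valid attribute names.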

Stretch

It might be nice if this tracing debugger tool also printed the model graph to an SVG.

Signed-off-by: Kyle Sayers <[email protected]>
…tokenized datasets should not be given labels

@kylesayrs kylesayrs self-assigned this Jan 10, 2025
@kylesayrs kylesayrs marked this pull request as ready for review January 10, 2025 06:08
dsikka pushed a commit that referenced this pull request Jan 10, 2025
## Purpose ##
* Allow VLM processors to be used to tokenize datasets with prompt keys

## Postrequisites ##
* #1030

## Changes ##
* Use `text` argument name for tokenizing the prompt column

## Testing ##
* w.r.t. tokenizers, using the `text` kwarg follows the precedent set by [PretrainedTokenizerBase](https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L2790)
* w.r.t. processors, most processors use the `text` kwarg

Below are all the models I know to be compatible with this change. I'm
assuming that most other processors follow the same standard:
1. [llama](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama.py#L233)
2. [pixtral](https://github.com/huggingface/transformers/blob/main/src/transformers/models/pixtral/processing_pixtral.py#L160)
3. [phi3_vision](https://huggingface.co/microsoft/Phi-3.5-vision-instruct/blob/main/processing_phi3_v.py#L321)
4. [mllama](https://github.com/huggingface/transformers/blob/main/src/transformers/models/mllama/processing_mllama.py#L232)
5. [qwen2_vl](https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/processing_qwen2_vl.py#L71)
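
Why the keyword form matters can be sketched with two stand-in classes. `ToyTokenizer`, `ToyVLMProcessor`, and `tokenize_prompt` are hypothetical and exist only for this illustration. The sketch assumes that a VLM processor may take images as its first positional parameter, so a prompt passed positionally would not be treated as text; the links above point to the real signatures.

```python
class ToyTokenizer:
    """Hypothetical tokenizer whose first parameter is the text."""

    def __call__(self, text=None, **kwargs):
        # Fake "token ids": one number per whitespace-separated word.
        return {"input_ids": [len(word) for word in text.split()]}


class ToyVLMProcessor:
    """Hypothetical VLM processor whose first parameter is the images."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, images=None, text=None, **kwargs):
        if text is None:
            raise ValueError("no text was provided")
        return self.tokenizer(text=text, **kwargs)


def tokenize_prompt(processor, prompt):
    # Passing the prompt by keyword works for both call signatures.
    return processor(text=prompt)


tokenizer = ToyTokenizer()
processor = ToyVLMProcessor(tokenizer)

print(tokenize_prompt(tokenizer, "hello world"))
print(tokenize_prompt(processor, "hello world"))

try:
    # Positional use binds the prompt to `images`, leaving `text` unset.
    processor("hello world")
except ValueError as error:
    print("positional call failed:", error)
```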

Example of using a VLM processor to tokenize a dataset with a prompt key:
```python3
from transformers import AutoProcessor
from llmcompressor.transformers import DataTrainingArguments, TextGenerationDataset

models_to_test = [
  "meta-llama/Meta-Llama-3-8B-Instruct",
  "mistralai/Mixtral-8x7B-Instruct-v0.1",
  "Qwen/Qwen2-VL-2B-Instruct",  # fails without changes
  "mgoin/pixtral-12b",  # fails without changes
]

for model_id in models_to_test:
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  
  data_args = DataTrainingArguments(
      dataset="ultrachat-200k",
      splits={"calibration": "test_sft[:1]"}
  )
  
  dataset = TextGenerationDataset.load_from_registry(
      data_args.dataset,
      data_args=data_args,
      split=data_args.splits["calibration"],
      processor=processor,
  )(add_labels=False)
```

Signed-off-by: Kyle Sayers <[email protected]>
return model_cls


def parse_args():
Collaborator

We already have `click` in setup.py; it might be worth using for the CLI.

Collaborator Author

```python
legacy_processing = (
    (input_ids == self.config.image_token_index).sum(1).max() < self.config.image_seq_length
) or (input_ids.shape[-1] == 1 and pixel_values is not None).item()
```
Collaborator

I read the whole thing.

I like how much time and thought you put into making this doc.

Right now, the audience needs to read until the 3rd paragraph to know what the problem is and when to use the tracing -- encoder-decoder models using GPTQ, SparseGPT Modifiers. If we move those to the intro, it will be clearer to the audience whether the doc applies to them.

Then a small paragraph introducing what 1, 2, and 3 will be helpful for: 1 describes why the previous methods cannot be used and why the sequential pipeline solves the problem, 2 is how to run using the CLI, and 3 is debugging/contribution.

This way I think the audience can have an easier time navigating to the appropriate section by reading less.

Collaborator Author

@kylesayrs kylesayrs Jan 13, 2025

> Right now, the audience needs to read until the 3rd paragraph to know what the problem is and when to use the tracing

As for when to use tracing, that's described in the second sentence:

> Through reading this guide, you will learn
> 1. Why tracing is required when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier

As for what the problem is, that's described in the first section:

> ## 1. Why is Tracing Required? ##

Collaborator Author

@kylesayrs kylesayrs Jan 13, 2025

> Right now, the audience needs to read until the 3rd paragraph to know what the problem is and when to use the tracing -- encoder-decoder models using GPTQ, SparseGPT Modifiers

That's incorrect: tracing is used for all model architectures, not just encoder-decoder models. As described in the second paragraph, tracing is used when compressing with recipes involving the Sequential Pipeline and modifiers such as GPTQModifier.

Collaborator Author

@kylesayrs kylesayrs Jan 13, 2025

> Then a small paragraph introducing what 1, 2, and 3 will be helpful for
>
> This way I think the audience can have an easier time navigating to the appropriate section by reading less.

I think the section titles plus the list of things you will learn from reading each section are enough context for a reader to go on. For example, if the reader doesn't care about the why, they can skip 1. If the reader doesn't care about what traceability is, they can skip 2. If the reader doesn't care about how to make a model traceable, they can skip 3.

@kylesayrs kylesayrs requested a review from horheynm January 13, 2025 19:25
kylesayrs added a commit that referenced this pull request Jan 15, 2025