
Any plans to add models from the llava series? #514

Open
jp1924 opened this issue Jan 6, 2025 · 6 comments

jp1924 commented Jan 6, 2025

🚀 The feature, motivation and pitch

Do you have any plans to add models from the llava series provided by transformers?
If possible, I would like to add llava, llava-next, and llava-onevision. Is that feasible?

Alternatives

No response

Additional context

No response

jp1924 (Author) commented Jan 13, 2025

@Tcc0403 Is it okay if I work on this?

Tcc0403 (Collaborator) commented Jan 13, 2025

Sure!

jp1924 (Author) commented Jan 13, 2025

Thanks! @Tcc0403

Before opening a PR, I have two points I'd like to discuss.

I've been working on a separate fork here, and two main issues have come up:

  1. There's a discrepancy between how huggingface llava's forward computes the loss and how liger-kernel's patched forwards compute it.
  2. LLaVA loads its language model and vision encoder dynamically through AutoModel.

Let's look at issue 1 first. Here's how loss is implemented in huggingface llava:

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
            num_logits_to_keep=num_logits_to_keep,
        )

        logits = outputs[0]

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            if attention_mask is not None:
                # we use the input attention mask to shift the logits and labels, because it is 2D.
                # we also crop attn mask in case it is longer, which happens in PrefixTuning with peft
                shift_attention_mask = attention_mask[:, -(logits.shape[1] - 1) :].to(logits.device)
                shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
                shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
            else:
                shift_logits = logits[..., :-1, :].contiguous()
                shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
            )

Huggingface llava takes the fully materialized logits returned by the language model and computes the loss on them.
Liger-kernel, on the other hand, computes the loss from the language model's hidden states together with the lm_head weight (fused linear cross-entropy), so the full logits never need to be materialized.

This creates an implementation discrepancy when applying liger-kernel to the existing llava.
You can see how the code differs here.
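To make the contrast concrete, the liger-kernel path looks roughly like this (a minimal sketch on random tensors; the shapes and vocab size are made up, and I'm assuming the (lm_head weight, hidden states, labels) argument order used by the existing causal-LM patches):

import torch
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss

# made-up shapes for illustration; the Triton kernels need a GPU
hidden_states = torch.randn(2, 16, 4096, device="cuda", requires_grad=True)   # LM backbone output
lm_head_weight = torch.randn(32000, 4096, device="cuda", requires_grad=True)  # (vocab, hidden)
labels = torch.randint(0, 32000, (2, 16), device="cuda")

# shift so that tokens < n predict n, then flatten, as the patched causal-LM forwards do
shift_hidden = hidden_states[:, :-1, :].reshape(-1, hidden_states.size(-1))
shift_labels = labels[:, 1:].reshape(-1)

lce = LigerFusedLinearCrossEntropyLoss()
loss = lce(lm_head_weight, shift_hidden, shift_labels)  # full logits are never materialized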

For issue 2, unlike Qwen2-VL and Mllama, llava doesn't have fixed language model and vision encoder components:

class LlavaForConditionalGeneration(LlavaPreTrainedModel, GenerationMixin):
    def __init__(self, config: LlavaConfig):
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config)

        self.multi_modal_projector = LlavaMultiModalProjector(config)
        self.vocab_size = config.text_config.vocab_size
        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1

        self.post_init()

The init is designed to load various models through AutoModel.

This makes it unclear which language model and vision encoder should be loaded for test cases.
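For what it's worth, one rough way the patching (and the tests) could deal with this would be to dispatch on config.text_config.model_type. The helper name and the dispatch table below are just my sketch, not existing Liger-Kernel API; only the per-model apply_liger_kernel_to_* entry points are real:

from transformers import LlavaConfig
from liger_kernel.transformers import (
    apply_liger_kernel_to_llama,
    apply_liger_kernel_to_mistral,
)

# hypothetical mapping from text-backbone model_type to an existing Liger patch
_TEXT_BACKBONE_PATCHES = {
    "llama": apply_liger_kernel_to_llama,
    "mistral": apply_liger_kernel_to_mistral,
}

def apply_liger_kernel_to_llava_backbone(config: LlavaConfig, **kwargs):
    """Patch whichever language model this particular llava checkpoint wraps."""
    model_type = config.text_config.model_type
    patch_fn = _TEXT_BACKBONE_PATCHES.get(model_type)
    if patch_fn is None:
        raise NotImplementedError(f"No Liger patch for text backbone '{model_type}'")
    patch_fn(**kwargs)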

Additionally, llava's code allows training the language model without visual data.
Combined with issue 1, this means models not officially supported by liger-kernel could potentially use liger-kernel through llava as a workaround.

This could lead to unexpected side effects. What are your thoughts on this?

While these issues can be addressed if needed, I'd like to hear your thoughts on how to proceed.

If these aspects of Llava conflict with the design philosophy of Liger-Kernel, it might be better not to add Llava at all.

jp1924 (Author) commented Jan 14, 2025

Actually, the cleanest solution would be for huggingface llava to pass the labels down to the language model and use the loss it returns, instead of computing the loss itself from the logits.
That would work naturally with liger-kernel, but since it requires a change in transformers it might get complicated. What do you think?
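Concretely, the change I have in mind would look roughly like this (a sketch only, not the current transformers code; the argument names follow the snippet quoted above):

# sketch only: delegate the loss to the (possibly Liger-patched) language model
outputs = self.language_model(
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    labels=labels,  # let the backbone's own loss path run (fused if patched)
    use_cache=use_cache,
    return_dict=return_dict,
    cache_position=cache_position,
)

logits = outputs.logits
loss = outputs.loss if labels is not None else None
# note: the 2D attention-mask-based shifting in the current llava implementation
# would still need to be handled somewhere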

jp1924 (Author) commented Jan 14, 2025

cc @ByronHsu

y-rok commented Jan 15, 2025

Hello, and thank you for considering adding Llava models. I’m new to using the Liger kernel.

As you mentioned, Llava can wrap various language and vision models. My understanding was that, when the backbone is Llama 3 with a CLIP vision encoder, calling the following before loading the model should work (i.e. using the Liger kernels without the cross-entropy patches):

apply_liger_kernel_to_llama(rope=True, swiglu=True, cross_entropy=False, fused_linear_cross_entropy=False, rms_norm=True)

However, I observed a degradation in model performance when I applied this configuration.

Do you have any idea why this might be happening? If you could provide some insights, I’d be happy to assist in adding Llava models.
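In case it helps with reproducing, this is roughly the order in which I apply the patch and load the model (the checkpoint name below is just an example placeholder):

import torch
from transformers import LlavaForConditionalGeneration
from liger_kernel.transformers import apply_liger_kernel_to_llama

# patch the llama text backbone before the model is instantiated,
# leaving both cross-entropy variants on the stock implementation
apply_liger_kernel_to_llama(
    rope=True,
    swiglu=True,
    cross_entropy=False,
    fused_linear_cross_entropy=False,
    rms_norm=True,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # example checkpoint; swap in the Llama-3-based llava you are using
    torch_dtype=torch.bfloat16,
)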

jp1924 mentioned this issue Jan 16, 2025