Any plans to add models from the llava series? #514
Comments
@Tcc0403 Is it okay if I work on this?
Sure!
Thanks! @Tcc0403 Before opening a PR, I have two points I'd like to discuss. I've been working on a separate fork here, and two main issues have come up:
Let's look at issue 1 first. Here's how the loss is implemented in Hugging Face llava:

outputs = self.language_model(
    attention_mask=attention_mask,
    position_ids=position_ids,
    past_key_values=past_key_values,
    inputs_embeds=inputs_embeds,
    use_cache=use_cache,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
    cache_position=cache_position,
    num_logits_to_keep=num_logits_to_keep,
)

logits = outputs[0]

loss = None
if labels is not None:
    # Shift so that tokens < n predict n
    if attention_mask is not None:
        # we use the input attention mask to shift the logits and labels, because it is 2D.
        # we also crop attn mask in case it is longer, which happens in PrefixTuning with peft
        shift_attention_mask = attention_mask[:, -(logits.shape[1] - 1) :].to(logits.device)
        shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
        shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()

Hugging Face llava receives pre-calculated logits from the language model and uses them for the loss calculation. This creates an implementation discrepancy when applying Liger-Kernel to the existing llava.

For issue 2, unlike Qwen2-VL and Mllama, llava doesn't have fixed language model and vision encoder components:

class LlavaForConditionalGeneration(LlavaPreTrainedModel, GenerationMixin):
    def __init__(self, config: LlavaConfig):
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config)
        self.multi_modal_projector = LlavaMultiModalProjector(config)
        self.vocab_size = config.text_config.vocab_size
        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
        self.post_init()

The __init__ is designed to load arbitrary models through AutoModel, which makes it unclear which language model and vision encoder should be loaded for test cases. Additionally, llava's code allows training the language model without visual data, which could lead to unexpected side effects - what are your thoughts on this? While these issues can be addressed if needed, I'd like to hear your thoughts on how to proceed, especially if these features of Llava conflict with the development philosophy of Liger-Kernel.
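One way issue 2 could be handled - shown here only as a hypothetical sketch, since apply_liger_kernel_to_llava is not an existing API - is to dispatch on config.text_config.model_type and reuse the per-backbone patchers Liger-Kernel already ships:

# Hypothetical sketch only; apply_liger_kernel_to_llava does not exist in Liger-Kernel today.
# The idea: since llava builds its language model via AutoModelForCausalLM.from_config,
# pick the matching existing patcher at runtime from text_config.model_type.
from transformers import LlavaConfig

from liger_kernel.transformers import (
    apply_liger_kernel_to_llama,
    apply_liger_kernel_to_mistral,
    apply_liger_kernel_to_qwen2,
)

_TEXT_BACKBONE_PATCHERS = {
    "llama": apply_liger_kernel_to_llama,
    "mistral": apply_liger_kernel_to_mistral,
    "qwen2": apply_liger_kernel_to_qwen2,
}

def apply_liger_kernel_to_llava(config: LlavaConfig, **kwargs) -> None:
    # Only the text backbone gets patched; the vision tower and projector are left untouched.
    model_type = config.text_config.model_type
    patcher = _TEXT_BACKBONE_PATCHERS.get(model_type)
    if patcher is None:
        raise NotImplementedError(f"No Liger-Kernel patch for text backbone '{model_type}'")
    patcher(**kwargs)

With something like this, the test-case question would reduce to picking one representative backbone per patcher (e.g. a small Llama text config paired with a CLIP vision tower).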
Actually, the perfect solution to this problem would be to use the loss calculated by the language model instead of calculating the loss in Hugging Face llava.
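A minimal sketch of what that could look like inside llava's forward, assuming the text backbone has already been patched with Liger's fused linear cross entropy (variable names follow the Hugging Face snippet above; this is not the actual patch):

# Sketch, not the actual implementation: pass labels straight to the language model
# so that its (Liger-patched) forward computes the loss itself. With
# fused_linear_cross_entropy=True, the loss is computed from the last hidden states
# and the lm_head weights without materializing the full logits tensor.
# Assumes labels are already aligned with inputs_embeds, with ignored positions set to -100.
outputs = self.language_model(
    attention_mask=attention_mask,
    position_ids=position_ids,
    inputs_embeds=inputs_embeds,
    labels=labels,
    use_cache=use_cache,
    return_dict=return_dict,
)
loss = outputs.loss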
cc @ByronHsu
Hello, and thank you for considering adding Llava models. I'm new to using the Liger kernel. As you mentioned, Llava models can be used with various language and vision models. However, when using Llama 3 as the language model and the CLIP vision encoder, the following kind of function call (applying the Liger kernel with cross-entropy disabled) seems to work if it is made before loading the model:
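A representative sketch of such a call, assuming Liger-Kernel's public apply_liger_kernel_to_llama patcher with both cross-entropy paths disabled (the exact arguments from the original comment are not reproduced):

# Sketch of the kind of call described above; the exact arguments are an assumption.
# Patches the Llama text backbone in place, with both cross-entropy paths disabled,
# before the llava model is instantiated.
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama(
    rope=True,
    rms_norm=True,
    swiglu=True,
    cross_entropy=False,
    fused_linear_cross_entropy=False,
)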
However, I observed a degradation in model performance when I applied this configuration. Do you have any idea why this might be happening? If you could provide some insights, I'd be happy to assist in adding Llava models.
🚀 The feature, motivation and pitch
Do you have any plans to add models from the llava series provided by transformers?
If possible, I would like to add llava, llava-next, and llava-onevision. Is that feasible?