
Is there any explanation for adjusting seqs after embedding? #18

Open
baiyuting opened this issue Feb 3, 2022 · 4 comments

Comments

baiyuting commented Feb 3, 2022

I found `seqs *= self.item_emb.embedding_dim ** 0.5` in the function `log2feats(self, log_seqs)`. Is there any reason for adjusting `seqs` after embedding?

```python
seqs = self.item_emb(torch.LongTensor(log_seqs).to(self.dev))
seqs *= self.item_emb.embedding_dim ** 0.5
```

pmixer (Owner) commented Feb 7, 2022

@baiyuting it's a kind of normalization operation inherited from the original BERT paper (https://arxiv.org/abs/1810.04805); check Prof. Lee's (https://speech.ee.ntu.edu.tw/~hylee/index.php) Transformer+BERT lectures on YouTube/Bilibili if interested. BTW, you are encouraged to remove this line if you want to try; SASRec is not that deep compared to BERT.

baiyuting (Author) commented
I cloned BERT from https://github.com/google-research/bert.git, but found no related code in `modeling.py:embedding_lookup()`. Did I miss something? Could you give a more specific elaboration, since it is a trick I had not noticed before?

pmixer (Owner) commented Feb 8, 2022

> I cloned BERT from https://github.com/google-research/bert.git, but found no related code in `modeling.py:embedding_lookup()`. Did I miss something? Could you give a more specific elaboration, since it is a trick I had not noticed before?

Oops, my fault: BERT is just the encoder of the Transformer. You should refer to https://arxiv.org/pdf/1706.03762.pdf, section 3.2.1 on Scaled Dot-Product Attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

For me, it's just a normalization operation. If you are very interested in it, please try playing with the math on your own, e.g. along the lines of https://medium.com/@shoray.goel/kaiming-he-initialization-a8d9ed0b5899.
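For a concrete feel of that √d factor, here is a minimal PyTorch sketch (illustrative only, not code from this repo; `d_k` and the tensor shapes are made up): with roughly unit-variance queries and keys, the raw dot products have standard deviation around √d_k, and dividing by √d_k keeps the softmax inputs in a reasonable range.

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., section 3.2.1).
# Not code from this repo; d_k and shapes are hypothetical.
import torch
import torch.nn.functional as F

d_k = 64
q = torch.randn(2, 10, d_k)  # roughly unit-variance queries
k = torch.randn(2, 10, d_k)  # roughly unit-variance keys
v = torch.randn(2, 10, d_k)

scores = q @ k.transpose(-2, -1)                    # raw dot products: std ~ sqrt(d_k) = 8
attn = F.softmax(scores / d_k ** 0.5, dim=-1)       # rescaled so the softmax does not saturate
out = attn @ v                                      # shape (2, 10, d_k)

print("raw score std:   ", scores.std().item())                 # ~8
print("scaled score std:", (scores / d_k ** 0.5).std().item())  # ~1
```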

h1657 commented Jul 12, 2024

I think the range of the position embeddings is relatively large, while the range of the item embeddings is relatively small, so the position embeddings may suppress the signal of the item embeddings. Scaling the item embedding vectors makes the item and position embeddings more consistent in numerical range. For specifics, see https://datascience.stackexchange.com/questions/87906/transformer-model-why-are-word-embeddings-scaled-before-adding-positional-encod/88159#88159
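A small numerical illustration of this range argument (hypothetical sizes, and assuming Xavier initialization for both embedding tables, which is not necessarily what this repo does): with a large item vocabulary, the item embeddings come out much smaller than the positional embeddings, and multiplying by √d narrows that gap.

```python
# Illustrative sketch only: hypothetical sizes, Xavier init assumed for both tables.
import torch

d, n_items, maxlen = 50, 10000, 200
item_emb = torch.nn.Embedding(n_items, d)
pos_emb = torch.nn.Embedding(maxlen, d)
torch.nn.init.xavier_normal_(item_emb.weight.data)  # std ~ sqrt(2/(n_items+d)) -> tiny
torch.nn.init.xavier_normal_(pos_emb.weight.data)   # std ~ sqrt(2/(maxlen+d))  -> larger

items = item_emb(torch.randint(0, n_items, (128, maxlen)))
poss = pos_emb(torch.arange(maxlen).expand(128, maxlen))

print("item emb std:       ", items.std().item())               # ~0.014
print("pos emb std:        ", poss.std().item())                # ~0.089
print("scaled item emb std:", (items * d ** 0.5).std().item())  # ~0.1, comparable to pos
```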
