
Is there any explanation for adjusting seqs after embedding? #18

Open
baiyuting opened this issue Feb 3, 2022 · 4 comments

Comments

baiyuting commented Feb 3, 2022

I found `seqs *= self.item_emb.embedding_dim ** 0.5` in the function `log2feats(self, log_seqs)`. Is there any reason for adjusting `seqs` after embedding?

```python
seqs = self.item_emb(torch.LongTensor(log_seqs).to(self.dev))
seqs *= self.item_emb.embedding_dim ** 0.5
```

pmixer (Owner) commented Feb 7, 2022

@baiyuting it's a kind of normalization operation inherited from the original BERT paper (https://arxiv.org/abs/1810.04805); check Prof. Lee's (https://speech.ee.ntu.edu.tw/~hylee/index.php) Transformer+BERT lectures on YouTube/Bilibili if interested. BTW, you are encouraged to remove this line if you want to try; SASRec is not that deep compared to BERT.

baiyuting (Author) commented
I cloned BERT from https://github.com/google-research/bert.git, but found no related code in `modeling.py:embedding_lookup()`. Did I miss something? Could you give a more specific elaboration, since it is a trick I had not noticed before?

pmixer (Owner) commented Feb 8, 2022

> I cloned BERT from https://github.com/google-research/bert.git, but found no related code in `modeling.py:embedding_lookup()`. Did I miss something? Could you give a more specific elaboration, since it is a trick I had not noticed before?

Oops, my fault: BERT is just the encoder of the Transformer. You should refer to https://arxiv.org/pdf/1706.03762.pdf, section 3.2.1 on Scaled Dot-Product Attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

For me, it's just a normalization operation. If you are very interested in it, please try playing with the math on your own, e.g. along the lines of https://medium.com/@shoray.goel/kaiming-he-initialization-a8d9ed0b5899.
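For a concrete feel of that √d factor, here is a minimal PyTorch sketch (illustrative only, not code from this repo; `d_k` and the tensor shapes are made up): with roughly unit-variance queries and keys, the raw dot products have standard deviation around √d_k, and dividing by √d_k keeps the softmax inputs in a reasonable range.

```python
# Minimal sketch of scaled dot-product attention (Vaswani et al., section 3.2.1).
# Not code from this repo; d_k and shapes are hypothetical.
import torch
import torch.nn.functional as F

d_k = 64
q = torch.randn(2, 10, d_k)  # roughly unit-variance queries
k = torch.randn(2, 10, d_k)  # roughly unit-variance keys
v = torch.randn(2, 10, d_k)

scores = q @ k.transpose(-2, -1)                    # raw dot products: std ~ sqrt(d_k) = 8
attn = F.softmax(scores / d_k ** 0.5, dim=-1)       # rescaled so the softmax does not saturate
out = attn @ v                                      # shape (2, 10, d_k)

print("raw score std:   ", scores.std().item())                 # ~8
print("scaled score std:", (scores / d_k ** 0.5).std().item())  # ~1
```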

h1657 commented Jul 12, 2024

I think the range of the position embeddings is relatively large, while the range of the item embeddings is relatively small, so the position embeddings may suppress the signal of the item embeddings. Scaling the item embedding vectors makes the item and position embeddings more consistent in numerical range. For specifics, see https://datascience.stackexchange.com/questions/87906/transformer-model-why-are-word-embeddings-scaled-before-adding-positional-encod/88159#88159
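A small numerical illustration of this range argument (hypothetical sizes, and assuming Xavier initialization for both embedding tables, which is not necessarily what this repo does): with a large item vocabulary, the item embeddings come out much smaller than the positional embeddings, and multiplying by √d narrows that gap.

```python
# Illustrative sketch only: hypothetical sizes, Xavier init assumed for both tables.
import torch

d, n_items, maxlen = 50, 10000, 200
item_emb = torch.nn.Embedding(n_items, d)
pos_emb = torch.nn.Embedding(maxlen, d)
torch.nn.init.xavier_normal_(item_emb.weight.data)  # std ~ sqrt(2/(n_items+d)) -> tiny
torch.nn.init.xavier_normal_(pos_emb.weight.data)   # std ~ sqrt(2/(maxlen+d))  -> larger

items = item_emb(torch.randint(0, n_items, (128, maxlen)))
poss = pos_emb(torch.arange(maxlen).expand(128, maxlen))

print("item emb std:       ", items.std().item())               # ~0.014
print("pos emb std:        ", poss.std().item())                # ~0.089
print("scaled item emb std:", (items * d ** 0.5).std().item())  # ~0.1, comparable to pos
```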
