-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[c++] Add Bagging by Query for Lambdarank #6623
Conversation
…hub.com/Microsoft/LightGBM into bagging/bagging-by-query-for-lambdarank
void GetGradients(const double* scores, const data_size_t /*num_sampled_queries*/, const data_size_t* /*sampled_query_indices*/, score_t* gradients, score_t* hessians) const override { | ||
LaunchGetGradientsKernel(scores, gradients, hessians); | ||
SynchronizeCUDADevice(__FILE__, __LINE__); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guolinke Could you please help to review this when you have time? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some very minor suggestions from me below:
assert ndcg_score_bagging_by_query >= ndcg_score - 0.1 | ||
assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR's description states that bagging_by_query=True
should improve metrics, but I don't see any comparison of bagging_by_query=True
and bagging_by_query=False
here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I found the result can be random when the dataset is small. For example, on CPU, bagging_by_query=True
gets higher NDCG with the toy test dataset (even higher than the case without bagging), while with GPU bagging_by_query=True
could get worse results compared with bagging_by_query=False
. But when the dataset is large, for example, with MS LTR
dataset, the results are less random, and bagging_by_query=True
should improve performance, as in the description of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, we also see a significant improvement in training speed with bagging_by_query=True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. So this test is for something like "bagging_by_query=True
doesn't break training".
Co-authored-by: Nikita Titov <[email protected]>
Co-authored-by: Nikita Titov <[email protected]>
Co-authored-by: Nikita Titov <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Add bagging by query instead of by items in lambdarank, suggested by @metpavel. This should be more reasonable for bagging in ranking tasks. For a comparison of performance, on MS LTR dataset:
With
bagging_freq=1
andbagging_fraction=0.1
, ifbagging_by_query=true
and if
bagging_by_query=false
Without bagging