Pagination params to base handler #394
Conversation
Verified that @nickzuber has signed the CLA. Thanks for the pull request!
Looks good overall! Want to add some tests?
app/master/cluster_master.py
Outdated

```diff
@@ -62,12 +63,15 @@ def api_representation(self):
             'slaves': slaves_representation,
         }

-    def builds(self):
+    def builds(self, start=None, amount=None):
```
I mentioned this in your proposal, but just to push a bit harder 😛: I think I prefer "limit/offset" to "amount/start" just because there is precedent in other contexts, so might help understanding. I'm not sure they're objectively better words, but since "limit/offset" is used frequently in database contexts I feel like someone reading this code is more likely to be able to interpret those words quickly.
Also, want to add type hints throughout? 😸
```diff
@@ -63,7 +63,7 @@ def __init__(self, cluster_master):
         api_v2 = [
             RouteNode(r'metrics', _MetricsHandler),
             RouteNode(r'version', _VersionHandler),
-            RouteNode(r'builds', _BuildsHandler).add_children([
+            RouteNode(r'builds', _BuildsPaginatedHandler).add_children([
```
Let's call this something like `_v2BuildsHandler`. Just because pagination is the only difference now doesn't mean it will always be the only difference.
Good point -- is `V2[OldHandlerName]` the convention we want to go with?
I think that's fine. It would help make it clear just from looking at the `RouteNode`s which endpoints had changed for v2 and which were still falling back to old logic.
```diff
@@ -62,6 +66,11 @@ def _handle_request_exception(self, ex):
         self.set_status(status_code)
         self.finish()

+    def _get_pagination_params(self):
+        start = int(self.get_query_argument('start', self.DEFAULT_PAGINATION_START, True))
+        amount = int(self.get_query_argument('amount', self.DEFAULT_PAGINATION_AMOUNT, True))
```
We should probably reject negative amounts.
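A minimal sketch of what clamping negative pagination values could look like. The names (`sanitize_pagination`, `MAX_AMOUNT`) are hypothetical, not from the ClusterRunner codebase:

```python
# Illustrative sketch only: the function and constant names here are
# hypothetical, not taken from this PR.
MAX_AMOUNT = 200  # hypothetical upper bound on page size

def sanitize_pagination(start: int, amount: int) -> tuple:
    """Clamp user-supplied pagination values to non-negative, bounded ranges."""
    start = max(start, 0)                   # reject negative starting offsets
    amount = max(amount, 0)                 # reject negative amounts
    return start, min(amount, MAX_AMOUNT)   # cap the page size

assert sanitize_pagination(-5, 300) == (0, 200)
assert sanitize_pagination(3, 50) == (3, 50)
```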
I left some comments. I should have probably mentioned this in the proposal, but I feel that `start` and `amount` should be changed to `page` and `size`. I am not sure how others feel about that.
Also, this change needs tests.
app/master/cluster_master.py
Outdated

```diff
@@ -62,12 +63,15 @@ def api_representation(self):
             'slaves': slaves_representation,
         }

-    def builds(self):
+    def builds(self, start=None, amount=None):
```
I think `page` and `size` might be more appropriate names here.
```diff
@@ -20,6 +20,10 @@ class ClusterBaseHandler(tornado.web.RequestHandler):
     """
     ClusterBaseHandler is the base handler for all request handlers of ClusterRunner services.
     """

+    DEFAULT_PAGINATION_START = 0
```
Similar to the above comment, I think we can name these constants as:

```python
DEFAULT_PAGE = 0
DEFAULT_PAGE_SIZE = 20
MAX_PAGE_SIZE = 200
```
app/master/cluster_master.py
Outdated

```diff
@@ -62,12 +63,15 @@ def api_representation(self):
             'slaves': slaves_representation,
         }

-    def builds(self):
+    def builds(self, start=None, amount=None):
+        """
```
The docs need to be updated.
app/master/cluster_master.py
Outdated

```python
        num_builds = len(self._all_builds_by_id)
        start = start or 0
        amount = amount or num_builds
        return [self._all_builds_by_id[key] for key in islice(self._all_builds_by_id, start, min((start + amount), num_builds))]
```
This is slightly difficult to read; I think this should be separated out into a few lines, maybe something like this (untested):

```python
total_builds = len(self._all_builds_by_id)
page = page or 0
size = size or total_builds
stop_index = min(page + size, total_builds)
keys_on_requested_page = islice(self._all_builds_by_id, page, stop_index)
return [self._all_builds_by_id[key] for key in keys_on_requested_page]
```
I actually like @josephharrington's suggestion around
app/master/cluster_master.py
Outdated

```diff
-        Returns a list of all builds
-        :rtype: list[Build]
+        Returns a list of all builds. Both offset and limit are assumed to be nonnegative at this point
+        since the corrections or negative offsets and limits are done when the query parameters are first initialized.
```
Should we do another check here for negative entries? Technically this method should only ever get limit and offset arguments from the pagination method, which will not return negative numbers, so another check here might be overkill
1. I like the idea of adding an extra check, especially since it's so easy to do.

   ```python
   offset = max(offset or 0, 0)
   ```

   In my perfect world every method would be callable with every combination of arguments that its type signature permits. Since this method's type signature allows for negative arguments, it should behave sanely when negative arguments are passed in. (Sometimes obeying this philosophy is more trouble than it's worth, but in this case it's a one-liner.)

2. iirc @josephharrington suggested allowing you to pass in negative offsets in order to get the most recent builds. So if I passed in `offset=0&limit=-1` it would give me the most recent build (similar to how I can do `arr[0:-1]` or `arr[-2:-1]` in Python). I liked that suggestion. Could we rename the variables to something like `start` and `range` and then do

   ```python
   start = start % num_builds
   end = min(start + range, num_builds)
   end = end % num_builds
   start, end = min(start, end), max(start, end)
   return [self._all_builds_by_id[key] for key in islice(self._all_builds_by_id, start, end)]
   ```

   I'm pretty sure I have an off-by-one error somewhere in that logic, but hopefully you get the idea.
Is negative-entry support something we want to do now? This might be better left for a separate PR.
I'm fine with leaving negative offsets for later. @ethomas2 I think in your example it should be `offset=-1&limit=1`. I think it only makes sense to support negative offsets. Negative limits (especially combined with negative offsets) feel unintuitive to me.
I'm fine with leaving negative offsets for a later change, but I just spent a long time thinking about negative offsets and the exact spec I want, so I'm going to write it down here. I don't really care if we implement it now, later, or never. Fwiw @josephharrington, I do think we can make negative limits a well-defined and intuitive concept. Maybe it's not useful, but I think it makes sense.

The mental model I have is this. Let's say you have an array of `N` builds. You can think of any `offset`/`limit` combo as defining a "window" that you place over the array of builds. The return value is just whatever part of the array you can see through that window. If `offset` is negative you mod it by `N`; if it's positive you just leave it alone.

```python
offset = offset if offset > 0 else offset % N  # Don't mod by N if offset is positive so we can handle cases where offset is off the right edge of the array
if limit >= 0:
    start, end = offset, min(offset + limit, N)
else:
    start, end = max(offset + limit, 0), min(offset, N)
return builds[start:end]
```
Some cases to test (`N` is 100 for all cases). I do agree that the last 2 cases are a little weird, but they're well defined and they fit my model.

| offset | limit | return value | case |
|---|---|---|---|
| 5/-95 | 20 | [5,24] | positive limit |
| -5/95 | -20 | [75,95] | negative limit |
| -5/95 | 20 | [95, 99] | limit overflow right |
| 5/-95 | -20 | [0,4] | limit overflow left |
| 100 | 1 | [] | offset off right edge, positive limit |
| 101 | -1 | [] | offset off right edge. Window does not overlap array |
| 100 | -2 | [99] | offset off right edge. Window does overlap array |
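For concreteness, the windowing idea above can be written as a standalone function. This is a hypothetical sketch, not code from this PR, and the assertions only cover the cases where the snippet's behavior is unambiguous (the comment itself admits a possible off-by-one):

```python
def window(builds, offset, limit):
    """Return the slice of `builds` visible through the offset/limit window."""
    n = len(builds)
    # A negative offset wraps around (mod n); a positive one is left alone so
    # that a window falling off the right edge of the array returns [].
    offset = offset if offset > 0 else offset % n
    if limit >= 0:
        start, end = offset, min(offset + limit, n)
    else:
        # A negative limit extends the window to the left of the offset.
        start, end = max(offset + limit, 0), min(offset, n)
    return builds[start:end]

builds = list(range(100))
assert window(builds, 5, 20) == list(range(5, 25))     # positive limit
assert window(builds, -95, 20) == list(range(5, 25))   # -95 % 100 == 5
assert window(builds, -5, 20) == list(range(95, 100))  # limit overflow right
assert window(builds, 100, 1) == []                    # offset off the right edge
assert window(builds, 101, -1) == []                   # window does not overlap the array
```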
🙀
```diff
@@ -62,6 +66,14 @@ def _handle_request_exception(self, ex):
         self.set_status(status_code)
         self.finish()

+    def _get_pagination_params(self):
```
There isn't really a good way to write tests for this method unless we have this data stored in some response. The problem is that this method gets the offset and limit from the query arguments in a request and then does things internally with them, so there isn't really a way for us to intercept those values without explicitly having a handler that does that.

One idea is we could split this up into two methods like:

```python
def get_pagination_params(self) -> Tuple[int, int]:
    offset = self.get_query_argument('offset', self.DEFAULT_OFFSET, True)
    limit = self.get_query_argument('limit', self.DEFAULT_LIMIT, True)
    return self._validate_arguments(int(offset), int(limit))

def _validate_arguments(self, offset: int, limit: int) -> Tuple[int, int]:
    # No negative offsets or limits are accepted
    non_negative_offset = max(offset, 0)
    non_negative_limit = max(limit, 0)
    return non_negative_offset, min(non_negative_limit, self.MAX_LIMIT)
```

Then we can easily unit test `_validate_arguments`. I think this would be a good idea for testing and also wouldn't hurt to separate the methods anyway -- any opinions before I do this on whether or not we should?
Imo this method shouldn't be tested in isolation in the first place. I don't really care if this method is behaving properly, I care if the pagination functionality is behaving properly. This method is just a means to an end, so don't test it. (It's also private, which is an indication that it's not a good thing to test).
A better test would be a functional test that passes various `offset`/`limit` combinations to ClusterRunner and verifies that we get back the set of builds we expect. Don't test functions, test functionality.
I know that flies in the face of the concept of unit testing, and some people on the team disagree with me. So do what you want.
Is it possible for a single build to ever be such a large JSON blob that we'd want to somehow limit how much of that single build is returned? Ie I ask for
but I just wanted to make sure that's not something we're overlooking.
```diff
@@ -62,12 +65,24 @@ def api_representation(self):
             'slaves': slaves_representation,
         }

-    def builds(self):
+    def builds(self, offset: int=None, limit: int=None) -> List['Build']:
```
Use `Optional[int]` for the type of `limit` and `offset`.
https://docs.python.org/3/library/typing.html#typing.Optional
whoops. I should probably read the documentation that I post. According to the docs:

> An optional argument with a default needn't use the Optional qualifier on its type annotation (although it is inferred if the default is None).

TIL.
Minor. I don't believe you need the quotes around `Build` for `List` generics.
```python
offset = self.get_query_argument('offset', self.DEFAULT_OFFSET, True)
limit = self.get_query_argument('limit', self.DEFAULT_LIMIT, True)
# No negative offsets or limits are accepted
non_negative_offset = max(int(offset), 0)
```
1. This is sort of a stupid edge case, but `offset` and `limit` are user-provided strings, which means `int` might raise a ValueError and return a 500 to the user. Catch the error and return a 400 instead.
2. Might as well just rename `non_negative_offset` and `non_negative_limit` to `offset` and `limit`.
3. If you go with my earlier suggestion (shamelessly stolen from Joey) about allowing `offset` and `limit` to be negative, you'll want to get rid of the `max` logic from this method (allow negative numbers to be passed through to the `builds` method).
```python
offset = offset or pagination_constants['DEFAULT_OFFSET']
limit = limit or pagination_constants['DEFAULT_LIMIT']
# Negative offsets or limits are not allowed
non_negative_offset = max(offset, 0)
```
I made this comment on an earlier commit, making it again to make sure you see it since github folded the old comment away with the older commit.
1. `offset` and `limit` are user-provided strings, which means `int(some_string)` might raise a ValueError and return a 500 to the user. Catch the error and return a 400 instead.
2. Might as well just rename `non_negative_offset` and `non_negative_limit` to `offset` and `limit`.
3. If you go with my earlier suggestion (shamelessly stolen from Joey) about allowing `offset` and `limit` to be negative, you'll want to get rid of the `max` logic from this method (allow negative numbers to be passed through to the `builds` method).
For (1), what do you think the best way of handling that would be? Use a try/catch, and if it throws a ValueError, use the default instead (not sure if there's a more elegant way to do this in Python besides a try/catch)?

And for (2), I think the reason I explicitly named them `non_negative_x` was because of (3): if we wanted to add that in the future, it's really clear where we disallow negatives. We could totally just rename the variables; I don't think it'd matter much if that's what we wanna do.

For (3), Joey mentioned above (now buried) that "we should probably reject negative amounts", and we could do this for now and make that negative-underflow request trick something to add in a later PR, unless you think that'd be in the scope of this one. I like these ideas but don't necessarily want this particular PR to try to do too much at once.
1. I'd suggest doing:

```python
try:
    a = int(string_variable)
    b = int(other_string_variable)
except ValueError as ex:
    raise BadRequestError from ex
```
For context, `ClusterBaseHandler` has a map of known exceptions to status codes. When an exception bubbles up to `ClusterBaseHandler`, it calls `_handle_request_exception` (a method overridden from Tornado) and translates the exception to the relevant status code. `BadRequestError` is one of the exceptions in this map. It gets translated to 400.
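The pattern described here can be sketched roughly like this. It is a simplified illustration, not the real `ClusterBaseHandler` internals, and `BadRequestError` is redefined locally just to keep the snippet self-contained:

```python
# Simplified illustration of an exception-to-status-code map; the real
# ClusterBaseHandler implementation may differ.
class BadRequestError(Exception):
    """Stand-in for ClusterRunner's BadRequestError."""

# Map of known exception types to HTTP status codes
EXCEPTION_STATUS_CODES = {
    BadRequestError: 400,
}

def status_code_for(ex: Exception) -> int:
    """Translate an exception into a status code, defaulting to 500."""
    return EXCEPTION_STATUS_CODES.get(type(ex), 500)

assert status_code_for(BadRequestError()) == 400
assert status_code_for(RuntimeError()) == 500  # unknown exceptions fall back to 500
```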
Re: 2 and 3 -- I think the negative offset stuff does belong in this change. I like separating a single PR into multiple PRs when one of the following is true:
(a) a single PR is too big
(b) a single PR is taking too long to implement
(c) a single PR feels like it's doing "too many things", i.e. two logically different pieces of functionality in the same commit.
I think none of those criteria apply here, so I don't see why we'd separate it. I'm not necessarily opposed to separating it; it doesn't hurt anything, I just don't see a reason to.
Sounds good to me.
For clarification on the negative offset, I'm a little confused about how we want it to behave. Python's negative indexing works because it's start/end, but here we have start/amount, so things are a bit different. If we have an `offset=-5` and `limit=20` request for a 100-build result, do we return the builds from 95 to 100 then 0 to 15? Or something like builds 95 to the end of builds or limit (in this case 95 to 100)?
Just to repeat my above comment, let's only support negative values for `offset` (not `limit`). And I think it's fine to save that for a later change. It seems like most of the conversation/comments/questions on this PR so far have been around the negative offset support, so that's a good sign that it would be good to split out -- get everything merged that everyone agrees on and save the highly-discussed piece for a separate change. This somewhat falls under Evan's point (b) above about splitting things out that are taking too long to implement, but I'd extend that to "too long to complete due to implementation or review", including a best effort to split out things that you expect to be more contentious or debatable. (And btw, I'll take any disagreement with this comment as proof of my point. 😝)

Also, with regard to your last question about wrap-around -- I don't think it makes sense to support wrap-around. In your example of `offset=-5&limit=20` I think it makes the most sense to only return the last five items. I think that's intuitive.

A clearer example of why wrap-around is not good: if we only had 2 builds total and asked for `offset=0&limit=8`, would we return a list of 8 builds which is just the only two builds repeated four times (1, 2, 1, 2, 1, 2, 1, 2)? Bonkers!
@josephharrington Hah, yeah, good point, that would get really confusing. I was thinking more like `offset=-1&limit=20` for 10 builds would give builds 9 to 10, then 1 to 9, etc., so the wrap wouldn't repeat builds, but like you said it'd get really confusing, and I don't think this is a good idea now that I think about it.
```diff
@@ -1,4 +1,6 @@
+from typing import List
```
Plz alphbtz
```diff
@@ -62,6 +69,24 @@ def _handle_request_exception(self, ex):
         self.set_status(status_code)
         self.finish()

+    # For methods that annotate with `Tuple`, use `disable=invalid-sequence-index`
+    # This bug is fixed in https://github.com/PyCQA/astroid/commit/563031aaf13a44adc8db4f8d0ab8020d550aae00
+    # More information on the issue in https://github.com/PyCQA/pylint/issues/1212
```
Is this fixed in the most recent version of pylint? If so, can you create a new ClusterRunner issue to update pylint (and reference this line so we remember to remove it)?
👍 #395
test/unit/master/test_pagination.py
Outdated

```python
@genty
class TestClusterMaster(BaseUnitTestCase):
```
The test class name should match the name of the file. And on that note, why a new file instead of adding these to the existing test_cluster_master.py? (Edit: see my comment below; I think I understand why you might have split these out.)
```python
        return self._validate_arguments(int(offset), int(limit))

    @staticmethod
    def _validate_arguments(offset: int, limit: int) -> Tuple[int, int]:  # pylint: disable=invalid-sequence-index
```
I think the name `_validate_arguments` is too vague since this is specific to pagination.
test/unit/master/test_pagination.py
Outdated

```python
        build_mock.build_id = build_id
        master._all_builds_by_id[build_id] = build_mock

        offset, limit = ClusterBaseHandler._validate_arguments(offset, limit)
```
This seems out of place. I think a test for ClusterBaseHandler should test logic in ClusterBaseHandler, and a test for ClusterMaster should test logic in ClusterMaster. (Is this why you split out these tests?)
With this call here, this becomes some sort of integration test between ClusterBaseHandler and ClusterMaster, but it ends up not being a realistic integration test since it picks and chooses related methods from each class to call (which may or may not reflect reality).
I think if you wanted to test this logic in ClusterBaseHandler, we should add a test class specifically for ClusterBaseHandler that would call `_V2BuildsHandler.get` directly. Alternatively, a functional test could test the API end-to-end, which wouldn't be a bad idea either.
Ah I see, so each test file is dedicated to testing a certain file in ClusterRunner, rather than a test file being dedicated to testing a certain concept in ClusterRunner? That's why I was creating new files for these pagination tests; I didn't know that's how we organized things.

If that's the case, then yeah, it'd make more sense to move these `builds` tests to `unit/master/test_cluster_master` and the pagination tests to `unit/web_framework/test_cluster_base_handler`.

The reason I used `_validate_arguments` here is because the arguments that get passed into the `builds` endpoint always go through this function first. I think it's fine and probably makes more sense to remove this function from this test and just give "validated" entries to the tests instead.
Yeah, unit+integration tests are a bit different than functional tests in that way. Functional tests are usually oriented around concepts of functionality since by their nature they cannot test specific code classes, but unit+integration should be more oriented around the structure of the code. That makes unit+integration tests a bit easier to organize since it's clearer where to put certain tests and have them share common setup etc., but it also helps us be very conscious of the entry points of our tests (i.e., "every test in file X tests some method in class Y").
```python
    'DEFAULT_OFFSET': 0,
    'DEFAULT_LIMIT': 20,
    'MAX_LIMIT': 200,
}
```
I would probably make these separate constants or, if you really want to explicitly group them, an `enum`. A dict feels a bit too free-form.
Would we rather change this to an enum or move this to the config?
I'd personally probably vote config, mostly because there are already established testing patterns/framework around making sure config gets reset for every test. (For example, see `Configuration['timings_directory']`.)
So are you saying when we do something like

```python
def setUp(self):
    super().setUp()
    Configuration['timings_directory'] = self._TIMINGS_DIR_SYS_PATH
```

`Configuration['timings_directory']` is changed in this particular test class but gets reset for all the others?
test/unit/master/test_pagination.py
Outdated

```python
        expected_last_build_id: int,
    ):
        master = ClusterMaster()
        for build_id in range(1, 501):
```
Is 501 chosen in relation to the default pagination parameters? If so, maybe express the value in terms of those parameters? Also, are there any edge cases you'd want to test where the actual number of builds is less than MAX_LIMIT?
An alternative to this pattern is to put the pagination parameters in Configuration (base_config_loader.py) and then set them in test setup to values more convenient for a testing situation. It's not a huge deal but I would guess that we don't need to create 501 build mocks to exercise the logic you are trying to test. :)
The 501 here is arbitrary -- I was generating 500 mock builds to test with, with IDs of 1 to 500 inclusive (maybe a comment explaining it would be better?). The reason I chose 500 was because it was more than our max limit (200), so I could test making sure that limit works correctly.
Also, I don't understand how moving the pagination default values to the config would affect the structure of these particular tests. I do think moving the default pagination values to the config might make more sense; I was wondering where a good place for them to live would be. With them in the config they'd be more customizable, which would be nice.
Are you suggesting these particular tests shouldn't pass any parameters to `builds`, but rather only test the default values?
Ah yeah, I meant that if those default values are in config then it's more straightforward to set them to whatever you want during these tests. For example, you could set max_limit to 10 (or some low value) that still allowed you to test the logic without needing to create hundreds of mock objects.
A nice thing about setting config values in tests is that the base test case makes sure they get reset for every test. It kind of works like `patch` that way. Without that automatic reset it can be easy to forget to set a value back to its original value, which could potentially influence other tests that run after yours.
app/master/cluster_master.py (Outdated)

```python
limit = limit if limit is not None else min(num_builds, pagination_constants['MAX_LIMIT'])

# Reset the offset to 0 if it is out of range
starting_index = offset if offset < num_builds else 0
```
I would expect that if the `offset`/`limit` combo is off the right edge of the array, the return value would be `[]`.
E.g. if there are 5 builds and the input is `offset=5, limit=1`, I would expect to get back `[]`, similar to how this Python code evaluates to `[]`:

```python
arr = [0, 1, 2, 3, 4]
print(arr[5:6])  # prints []
```

Your logic would reset starting_index to 0, though, and return the first element of the array.
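The expected behavior matches Python's slice semantics directly. A minimal sketch (the `paginate` helper name is hypothetical, not code from this PR):

```python
def paginate(items, offset, limit):
    """Return one page of items; an offset past the end yields an empty list."""
    return items[offset:offset + limit]

builds = [0, 1, 2, 3, 4]
print(paginate(builds, 5, 1))   # prints []
print(paginate(builds, 3, 10))  # prints [3, 4]
```

Because slicing clamps out-of-range bounds instead of wrapping or resetting, an off-the-edge offset naturally returns nothing.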
Agreed!
Yeah, after messing around with it a bit I noticed it seemed really weird for `offset=10` with something like 5 builds to return all 5. Makes much more sense to return nothing 👍
@josephharrington My comments on "negative limits" were more about treating that as illegal input, not really something we'd want to support. If someone requested a limit of -20 we'd just default to 0, since for a negative limit I think we can assume returning no results makes the most sense.
app/master/cluster_master.py (Outdated)

```python
starting_index = offset if offset < num_builds else 0
ending_index = min((starting_index + limit), num_builds)

requested_builds = islice(self._all_builds_by_id, starting_index, ending_index)
```
Writing this here for visibility. I don't think anyone should actually fix this right now, I just want to make sure this "bug" is known.
I noticed that:
- `_all_builds_by_id` is an OrderedDict, not an array
- `islice` returns things from an OrderedDict in the order they were added to the dict, not in the numerical order of the keys

This means that if builds somehow got added to the OrderedDict out of numerical order, `islice` would return the builds out of order.
For example, say we got 3 requests all at the same time. We created build ids 0, 1, 2 for them but added them to the OrderedDict in the order 2, 1, 0. Then if I made a request to clusterrunner with `offset=0, limit=1`, I would get build 2 instead of build 0.
That being said, this behavior is actually impossible right now. The "bug" I described relies on builds being added to the dict in a different order than they are created, but the current code is set up in a way that that can never happen. Builds are created and then immediately added to the dict in `handle_request_for_new_build`. This happens on tornado's single thread, so there's no possibility of threads interleaving operations in a way that causes this bug.
So ... nothing to do right now. Just wanted to write this down. At some point someone should maybe change `_all_builds_by_id` to be an array, though, which would fix this potential bug and make it easier for me to sleep at night.
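The scenario described above is easy to demonstrate. This sketch simulates builds being inserted out of numerical order (a hypothetical race, not something the current single-threaded code can produce):

```python
from collections import OrderedDict
from itertools import islice

# Simulate builds added to the dict in the order 2, 1, 0 rather than 0, 1, 2
builds = OrderedDict()
for build_id in (2, 1, 0):
    builds[build_id] = 'build-%d' % build_id

# islice walks insertion order, not key order, so "offset=0, limit=1"
# would surface build 2 instead of build 0
first_page = list(islice(builds, 0, 1))
print(first_page)  # prints [2]
```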
That's a good catch -- alternatively, we could create a range of build ids from the offset/limit and explicitly retrieve those from the OrderedDict. `offset=5&limit=20` would imply we want builds 6 to 25, so we could just fetch all the builds with an id from 6 to 25 from the OrderedDict (and then we could just use a regular dict).
Oh wow. Yeah, that's super easy to do, we should just do that now.

```python
return [self._all_builds_by_id[idx] for idx in range(starting_index, ending_index)]
```
@ethomas2 Agreed 👍 makes much more sense intuitively and is cleaner than islice IMO
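Keying by an explicit id range makes the result independent of the dict's insertion order. A small sketch of the idea (variable names are illustrative):

```python
# Even with builds inserted out of numerical order, indexing by key
# returns them in numerical order
all_builds_by_id = {2: 'b2', 1: 'b1', 0: 'b0'}

starting_index, ending_index = 0, 2
page = [all_builds_by_id[idx] for idx in range(starting_index, ending_index)]
print(page)  # prints ['b0', 'b1']
```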
app/master/cluster_master.py (Outdated)

```python
if offset > num_builds:
    return []

first_build_id = offset + 1
```
Why are we adding 1 here? Does that mean if I ask for `offset=10, limit=1` I'll actually get the 11th build? Also, does that mean it's impossible to ever get the first build in the array?
We start our builds with an id of 1 and so on, so with an offset of 0 we'd expect to get builds starting with 1. Maybe I should add a comment here about that?
Ah, so it's because build ids are 1-indexed and the offset is 0-indexed. I feel like that might be confusing. Personally, if I queried for `offset=10, limit=1` I'd expect to get build_id 10 back. I feel like it might make sense to either 0-index the builds or 1-index the offset. If we 1-indexed the offset, maybe it would make more sense to name it `start`? (I think that might make more sense anyway.)
I can't decide if the current behavior is actually confusing, or if I'm just looking for something to complain about. Opinion @josephharrington?
I think if we know that the builds start at 1, then something like `offset=6` giving builds starting at 7 makes sense. This is why I originally was leaning towards start/amount for the names, so it's a little more obvious where you're starting and ending.
I'm personally fine with either one and don't have strong opinions either way.
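The 0-indexed-offset-to-1-indexed-build-id mapping being discussed can be sketched as a small helper (a hypothetical function, not the PR's exact code):

```python
def build_id_range(offset, limit, num_builds):
    """Map a 0-indexed offset onto 1-indexed build ids."""
    if offset >= num_builds:
        return range(0)  # empty range: off the right edge
    first_build_id = offset + 1
    last_build_id = min(offset + limit, num_builds)
    # +1 because range's end is exclusive
    return range(first_build_id, last_build_id + 1)

print(list(build_id_range(0, 3, 500)))    # prints [1, 2, 3]
print(list(build_id_range(6, 2, 500)))    # prints [7, 8]
print(list(build_id_range(499, 5, 500)))  # prints [500]
```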
Python array slicing is inclusive with the start and exclusive with the end, i.e.:

```python
A = [0, 1, 2]
A[0:1]  # => [0]
```

From that understanding, I would have thought that offset and limit would mean the returned build IDs would be in the range [offset, offset + limit).
Though, I'm guessing that because you used the terminology "offset" and "limit", you are coming more from a SQL foundation where the "offset" is exclusive. I don't have any comment on it. I think it's fine either way. As long as we pick a path and stick to it, I think that's okay.
Yeah I think it's fine as-is. If we wanted to give builds random strings of characters for an id (or if we wanted to count by 2s or 4s like some dbs do) then we should be able to do that. The paging offset happens to be closely related to the build id in the current indexing scheme but that should be considered a coincidence.
@josephharrington Yeah after I realized this I think I want to switch back to using islice instead of getting builds by their id. The pagination isn't really for getting specific builds from their id but rather getting builds from the collection
I think using `islice` is extra dangerous now that you've changed the dict from an OrderedDict to a regular dict. `islice` returns things in iteration order, but iteration order is not defined for a dict. I expect that `islice` might return different results from machine to machine, or even from hour to hour on the same machine.
I say if you go back to `islice`, at least change the dict back to an OrderedDict. That way at least the iteration order is defined, and we can make guarantees about it. Why did you change it to a normal dict in the first place?
app/master/cluster_master.py (Outdated)

```python
last_build_id = min((offset + limit), num_builds)

# Add +1 to last_build_id because range is exclusive
return [self.get_build(uid) for uid in range(first_build_id, last_build_id + 1)]
```
Also confused why we're adding 1 to last_build_id here. It seems like this would make it so that if I pass in `limit=1`, I actually get 2 builds back.
`range`'s params are inclusive and exclusive respectively, so if we want to get builds 1 through 10 we'd do `range(1, 11)`, which would produce `[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]`. Maybe there's a clearer way to do this?
Ah, my mistake. There's not a clearer way of doing this, I was just bad at reading the code. I thought `last_build_id = first_build_id + limit`, but it's actually `last_build_id = offset + limit`.
Why use `get_build()` here instead of getting directly from `_all_builds_by_id`? It seems like the code in this method is trying to make sure that every build in the computed range is a valid build id, so if there's a mistake in that logic it is more of a server error than a client error (so we don't need the potential ItemNotFoundError raised from `get_build()`).
Also, isn't this implementation a breaking change for the v1 API's /builds endpoint? It seems like it would make that API stop returning the entire list of builds.
Mostly minor comments except for the one about potentially breaking the /builds endpoint in v1 API. 😝
app/master/cluster_master.py (Outdated)

```diff
@@ -28,7 +28,7 @@ def __init__(self):
     self._logger = get_logger(__name__)
     self._master_results_path = Configuration['results_directory']
     self._all_slaves_by_url = {}
-    self._all_builds_by_id = OrderedDict()
+    self._all_builds_by_id = dict()
```
I would just write `{}` instead of `dict()` for consistency with the line above. :p
app/master/cluster_master.py (Outdated)

```python
limit = max(limit, 0)

# If limit is set higher than the number of builds/max limit, reduce limit
limit = highest_amount if limit > highest_amount else limit
```
Is this line the same as `limit = min(limit, highest_amount)`?
Also, what do you think about renaming `highest_amount` to `max_limit`?
@josephharrington Ahh, nice catch -- yeah, we don't want to add any default offset/limit logic in the builds endpoint. If they're not set, we should set them to 0. My most recent commit should fix that, and I also fixed some tests for consistency. I had some tests that set something like offset and not limit, when it's always either none of them set or both of them set (because if you use v2, they will always be set to the defaults if not specified).
Minor comments. Feel free to ignore.
```diff
@@ -105,6 +105,11 @@ def configure_defaults(self, conf):
     conf.set('https_cert_file', None)
     conf.set('https_key_file', None)

+    # Default values for pagination requests
```
Minor. I kind of wish we used constants instead of strings as keys, or `@property` getters. It would be easier for refactoring later. Though that is a general criticism of how we approach confs and not isolated to this change.
Maybe you should expose the default values as constants.
That makes sense, although it has not really been a problem in my experience. @wjdhollow you may want to file an issue if you feel strongly about it -- we'd want to do that all in one change instead of piecemeal.
```python
# Default values for pagination requests
conf.set('pagination_offset', 0)
conf.set('pagination_limit', 20)
conf.set('pagination_max_limit', 200)
```
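These three conf keys could back the handler's fallback logic roughly as follows. This is an illustrative sketch only (the dict stands in for the real `Configuration` object, and the function name is hypothetical):

```python
# Mirror of the conf keys set above
config = {
    'pagination_offset': 0,
    'pagination_limit': 20,
    'pagination_max_limit': 200,
}

def get_pagination_params(query_args, config):
    """Fall back to configured defaults, then clamp to legal bounds."""
    offset = int(query_args.get('offset', config['pagination_offset']))
    limit = int(query_args.get('limit', config['pagination_limit']))
    offset = max(offset, 0)
    limit = min(max(limit, 0), config['pagination_max_limit'])
    return offset, limit

print(get_pagination_params({}, config))                # prints (0, 20)
print(get_pagination_params({'limit': '500'}, config))  # prints (0, 200)
```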
Why do we need a `pagination_max_limit`? How did you arrive at this value? Is it to protect the ClusterRunner service from having to serve large amounts of data? Seems artificial to me.
Currently we return back everything, which can be 1000s upon 1000s of builds, so it just seems strange to limit it to 1% of the previous max. I'm fine with this, I just want to hear your justification.
Haha. Nice. That's what I get for not keeping up on Github updates.
```python
# A negative limit will give no results
limit = max(limit, 0)
offset = max(offset, 0)
```
Can you consolidate this logic? It's duplicated by what's in the cluster_master: https://github.com/box/ClusterRunner/pull/394/files#diff-c3a071a6d6663990b42516055b3634b8R74
There's a comment buried here somewhere that mentions we might as well repeat the negative logic here since we probably want to do something different with negative offsets in the future.
```diff
@@ -263,6 +263,15 @@ def get(self):
         self.write(response)
 
 
+class _V2BuildsHandler(_BuildsHandler):
+    def get(self):
+        offset, limit = self.get_pagination_params()
```
Okay. It looks like you are validating your offset pagination values twice and both locations assign defaults. Very susceptible to drift. Please don't do that.
You mean once in the handler and again in `builds()`? Those are two separate kinds of validation. Can you elaborate?
The pagination sanitization of offset and limit makes sure the requested offset and limit values are legal, and returns defaults if they're not.
The `builds()` validation of offset and limit makes sure they're in range of the number of builds we have stored on the master, and adjusts the offset and limit values to the indices of where to slice our dict.
```python
return self._validate_pagination_params(offset, limit)

@staticmethod
def _validate_pagination_params(offset: int, limit: int) -> Tuple[int, int]:  # pylint: disable=invalid-sequence-index
```
This doesn't look like it validates anything. In fact, it looks like it does exactly the opposite of validation, because it silently fails over to default values. I suggest renaming it to something like `_get_pagination_params_or_safe_defaults` -- something to indicate that it won't raise an error and that it will return sensible defaults.
Honestly, if you just combined the two methods, that would work too.
You're right, this could do with a better name.
I intentionally separated the two methods into Tornado retrieving the values and then sanitizing the values. This way we can test the sanitizing the values function since there isn't a good way to test the former.
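The split being described keeps the Tornado-facing retrieval thin and makes the sanitizer a pure function that unit tests can call directly. A rough sketch of that pure method (names and defaults are illustrative, not necessarily the PR's exact signatures):

```python
DEFAULT_OFFSET, DEFAULT_LIMIT, MAX_LIMIT = 0, 20, 200

def get_pagination_params_or_default(offset, limit):
    """Pure sanitizer: missing values fall back to defaults, negatives clamp
    to 0, and the limit is capped at MAX_LIMIT."""
    offset = DEFAULT_OFFSET if offset is None else max(offset, 0)
    limit = DEFAULT_LIMIT if limit is None else max(limit, 0)
    return offset, min(limit, MAX_LIMIT)

print(get_pagination_params_or_default(None, None))  # prints (0, 20)
print(get_pagination_params_or_default(5, 1000))     # prints (5, 200)
```

Because the function takes and returns plain ints, it can be exercised in tests without standing up a handler or mocking Tornado.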
Add pagination parameters to the ClusterBaseHandler. This allows any RequestHandler to get query params called `start` and `amount`. In the v2 API, we also paginate the `builds` endpoint for the master service.
The current defaults are `start=0` and `amount=50`; do we want to adjust these? Also, do we want to adjust the naming of `start` and `amount`?
Closes #393