-
Notifications
You must be signed in to change notification settings - Fork 150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unified file source stage #1184
base: branch-23.11
Are you sure you want to change the base?
Unified file source stage #1184
Conversation
…ltifilesource-and-directorywatcher-functionality
Co-authored-by: Christopher Harris <[email protected]>
Co-authored-by: Christopher Harris <[email protected]>
Co-authored-by: Christopher Harris <[email protected]>
…ltifilesource-and-directorywatcher-functionality
|
||
if len(self._protocols) > 1: | ||
raise ValueError("Accepts same protocol input files, but it received multiple protocols.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: It appears that == 0
and < -1
are invalid values for max_files
.
Important: Check that max_files
is in a valid range (if you decide to keep -1
as the default, adjust accordingly).
if max_files and max_files <= 0: | |
raise ValueError(...) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raising an error if self._files
is None
or []
. We will get at least one value in the self._protocols
, so i didn't put an extra check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but what about max_files
? If max_files == 0
or max_files < -1
, then this stage won't produce any files. In that case we should either warn or raise an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max_files
flag takes effect only when set to a value greater than zero; otherwise, it is treated as continuous polling without any imposed limit. Default value is -1
, so I thought raising an error or warn would not needed. Let me know if you still want to add the warning message.
filtered_files = sorted(filtered_files, key=lambda f: f.full_name) | ||
|
||
if self._max_files > 0: | ||
filtered_files = filtered_files[:self._max_files - processed_files_count] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: If processed_files_count > self._max_files
we get filtered_files[:n]
where n < 0
, meaning we'll take the last n
files, which doesn't sound like what we want.
Important: make sure we don't accidentally read from the end of the list of filtered_files
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see we won't get a negative number because processed_files_count
is calculated based on _max_files
. My bad. No change needed.
morpheus/stages/input/file_source.py
Outdated
if self._max_files > 0 and self._max_files <= processed_files_count: | ||
logger.debug("Maximum file limit reached. Exiting polling service...") | ||
self._stop_requested = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: Oh, I see. This is how we are stopping the source when we reach the max file limit. This is fine, but in general cancellation tokens are reserved for flagging from outside of the function that checks them. I think we can move this logic up in to the previous if self._max_files > 0
condition and use break
or return
there rather than flagging the cancellation token. Up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated as suggested. I tried to avoid break
and yield
(multiple times), which is the reason i choosed this approach.
Co-authored-by: Christopher Harris <[email protected]>
…irectorywatcher-functionality' of github.com:bsuryadevara/Morpheus into 976-fea-unify-the-filesourcestage-multifilesource-and-directorywatcher-functionality
if self._max_files <= processed_files_count: | ||
logger.debug("Maximum file limit reached. Exiting polling service...") | ||
yield fsspec.core.OpenFiles(filtered_files, fs=files.fs) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I'm seeing the break
in this location, I think having it down below line 206 (as you had it before I made my suggestion to change it) makes more sense because we won't need the two yields
. That's my bad. I don't think it's necessary to change, I just wanted to point out your first solution was less repetitive than my suggestion. Up to you!
morpheus/stages/input/file_source.py
Outdated
if self._watch: | ||
generator_function = self._polling_generate_frames_fsspec | ||
else: | ||
if self._watch: | ||
generator_function = self._polling_generate_frames_fsspec | ||
else: | ||
generator_function = self._generate_frames_fsspec | ||
generator_function = self._generate_frames_fsspec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: Nice! This looks cleaner now that DirectoryWatcher
is gone!
|
||
if len(self._protocols) > 1: | ||
raise ValueError("Accepts same protocol input files, but it received multiple protocols.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but what about max_files
? If max_files == 0
or max_files < -1
, then this stage won't produce any files. In that case we should either warn or raise an exception.
curr_time = time.monotonic() | ||
next_update_epoch = curr_time | ||
|
||
while (True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't knowingly witnessed a clean shutdown of a Morpheus pipeline through the use of ctrl+c. I should check to see if that's even possible (it should be, but might have limits).
watch_interval: float = 1.0, | ||
sort: bool = False, | ||
file_type: FileTypes = FileTypes.Auto, | ||
parser_kwargs: dict = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remark: Adding this argument makes me uneasy, since it will be difficult to deprecate in the future if necessary.
Question: Is this being added as a new feature, or is this something that existed on any of the other file source implementations?
…ltifilesource-and-directorywatcher-functionality
Notes:
use_cpp=False
, as directory watcher implementation doesn't exists in C++Description
Closes #976