-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(sql): Adds url_download and url_upload to daft-sql #3690
feat(sql): Adds url_download and url_upload to daft-sql #3690
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3690 +/- ##
==========================================
- Coverage 77.79% 77.68% -0.11%
==========================================
Files 729 732 +3
Lines 90477 90757 +280
==========================================
+ Hits 70384 70502 +118
- Misses 20093 20255 +162
|
6f90f29
to
63ae5b7
Compare
Ok(Self { | ||
max_connections, | ||
raise_error_on_failure, | ||
multi_thread: true, // TODO always true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to find other examples of multi_thread in daft-sql, but the python logic defined in ExpressionUrlNamespace has
multi_thread = not using_ray_runner
tests/sql/test_uri_exprs.py
Outdated
def test_url_download(): | ||
df = daft.from_pydict( | ||
{ | ||
"urls": [ | ||
"https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/README.rst", | ||
"https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/LICENSE", | ||
] | ||
} | ||
) | ||
|
||
actual = ( | ||
daft.sql( | ||
""" | ||
SELECT | ||
url_download(urls) as downloaded, | ||
url_download(urls, max_connections=>1) as downloaded_single_conn, | ||
url_download(urls, on_error=>'null') as downloaded_ignore_errors | ||
FROM df | ||
""" | ||
) | ||
.collect() | ||
.to_pydict() | ||
) | ||
|
||
expected = ( | ||
df.select( | ||
col("urls").url.download().alias("downloaded"), | ||
col("urls").url.download(max_connections=1).alias("downloaded_single_conn"), | ||
col("urls").url.download(on_error="null").alias("downloaded_ignore_errors"), | ||
) | ||
.collect() | ||
.to_pydict() | ||
) | ||
|
||
assert actual == expected | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just as a sanity check, could we also make sure it works with scalar values
select url_download("https://...") from df;
CodSpeed Performance ReportMerging #3690 will improve performances by 80.78%Comparing Summary
Benchmarks breakdown
|
#3575
This change is a bit larger because I had to change how url_download and url_upload handled the parsing of keyword arguments to be more like image functions. I also moved some re-usable SQL argument parsing logic to a functions::args module.
I've left some notes/TODOs regarding input validation and handling of named parameters, but addressing these is outside the scope of this PR.