feat(sql): Adds url_download and url_upload to daft-sql #3690

Merged on Jan 16, 2025 (3 commits)

@RCHowell RCHowell commented Jan 15, 2025

#3575

This change is a bit larger because I had to change how url_download and url_upload parse keyword arguments so that they work more like the image functions. I also moved some reusable SQL argument-parsing logic into a functions::args module.

I've left some notes/TODOs regarding input validation and handling of named parameters, but addressing these is outside the scope of this PR.
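The named-argument handling described above (for example `max_connections=>1` or `on_error=>'null'`, as used in this PR's SQL test) could look roughly like the following Python sketch. It is an illustration only, not the daft-sql implementation: `DownloadArgs`, `parse_download_args`, the default of 32 connections, and the mapping from `on_error` to `raise_error_on_failure` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical config object; field names echo the Rust snippet in this PR,
# but the defaults here are assumptions.
@dataclass(frozen=True)
class DownloadArgs:
    max_connections: int = 32
    raise_error_on_failure: bool = True


def parse_download_args(named: dict[str, Any]) -> DownloadArgs:
    """Validate named SQL arguments and fold them into a config object."""
    allowed = {"max_connections", "on_error"}
    unknown = set(named) - allowed
    if unknown:
        raise ValueError(f"unexpected named argument(s): {sorted(unknown)}")

    max_connections = named.get("max_connections", 32)
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        isinstance(max_connections, bool)
        or not isinstance(max_connections, int)
        or max_connections <= 0
    ):
        raise ValueError("max_connections must be a positive integer")

    on_error = named.get("on_error", "raise")
    if on_error not in ("raise", "null"):
        raise ValueError("on_error must be 'raise' or 'null'")

    # Assumed mapping: on_error='null' means failures become nulls, not errors.
    return DownloadArgs(
        max_connections=max_connections,
        raise_error_on_failure=(on_error == "raise"),
    )
```

Rejecting unknown names up front is one way to approach the input-validation TODOs mentioned above; the real module may handle this differently.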

@github-actions github-actions bot added the feat label Jan 15, 2025

codecov bot commented Jan 15, 2025

Codecov Report

Attention: Patch coverage is 96.85535% with 5 lines in your changes missing coverage. Please review.

Project coverage is 77.68%. Comparing base (34d2036) to head (b8cbf40).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-sql/src/functions.rs | 86.66% | 2 Missing ⚠️ |
| src/daft-sql/src/modules/image/decode.rs | 0.00% | 1 Missing ⚠️ |
| src/daft-sql/src/modules/uri/url_download.rs | 96.96% | 1 Missing ⚠️ |
| src/daft-sql/src/modules/uri/url_upload.rs | 97.67% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3690      +/-   ##
==========================================
- Coverage   77.79%   77.68%   -0.11%     
==========================================
  Files         729      732       +3     
  Lines       90477    90757     +280     
==========================================
+ Hits        70384    70502     +118     
- Misses      20093    20255     +162     

| Files with missing lines | Coverage Δ |
|---|---|
| daft/expressions/expressions.py | 93.54% <ø> (ø) |
| src/daft-functions/src/python/uri.rs | 73.58% <100.00%> (-1.86%) ⬇️ |
| src/daft-functions/src/uri/download.rs | 85.71% <100.00%> (+2.38%) ⬆️ |
| src/daft-functions/src/uri/mod.rs | 100.00% <100.00%> (ø) |
| src/daft-functions/src/uri/upload.rs | 73.86% <100.00%> (+3.03%) ⬆️ |
| ...al-plan/src/optimization/rules/push_down_filter.rs | 97.35% <100.00%> (+0.02%) ⬆️ |
| src/daft-sql/src/modules/uri/mod.rs | 100.00% <100.00%> (ø) |
| src/daft-sql/src/modules/image/decode.rs | 17.14% <0.00%> (+4.09%) ⬆️ |
| src/daft-sql/src/modules/uri/url_download.rs | 96.96% <96.96%> (ø) |
| src/daft-sql/src/modules/uri/url_upload.rs | 97.67% <97.67%> (ø) |

... and 1 more

... and 9 files with indirect coverage changes

@RCHowell RCHowell force-pushed the rchowell/df-109-add-missing-url_-functions-to-sql branch from 6f90f29 to 63ae5b7 Compare January 15, 2025 23:12
        Ok(Self {
            max_connections,
            raise_error_on_failure,
            multi_thread: true, // TODO always true
Contributor Author

I wasn't able to find other examples of multi_thread in daft-sql, but the Python logic defined in ExpressionUrlNamespace has:

multi_thread = not using_ray_runner
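
To make the difference concrete, here is a minimal Python sketch of deriving `multi_thread` from the active runner, as the quoted expression does, instead of hard-coding it to `true` as the Rust snippet currently does. `UrlDownloadConfig` and `build_config` are hypothetical names introduced for illustration; only the `not using_ray_runner` rule comes from the comment above.

```python
from dataclasses import dataclass


# Hypothetical container mirroring the three fields in the Rust snippet.
@dataclass(frozen=True)
class UrlDownloadConfig:
    max_connections: int
    raise_error_on_failure: bool
    multi_thread: bool


def build_config(
    max_connections: int,
    raise_error_on_failure: bool,
    using_ray_runner: bool,
) -> UrlDownloadConfig:
    """Build a config whose multi_thread flag depends on the runner."""
    return UrlDownloadConfig(
        max_connections=max_connections,
        raise_error_on_failure=raise_error_on_failure,
        # Same rule as the quoted Python logic: multi_thread = not using_ray_runner
        multi_thread=not using_ray_runner,
    )
```

How the SQL planner would learn which runner is active is left open here, which is presumably why the PR keeps the TODO.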

Comment on lines 8 to 43
def test_url_download():
    df = daft.from_pydict(
        {
            "urls": [
                "https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/README.rst",
                "https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/LICENSE",
            ]
        }
    )

    actual = (
        daft.sql(
            """
            SELECT
                url_download(urls) as downloaded,
                url_download(urls, max_connections=>1) as downloaded_single_conn,
                url_download(urls, on_error=>'null') as downloaded_ignore_errors
            FROM df
            """
        )
        .collect()
        .to_pydict()
    )

    expected = (
        df.select(
            col("urls").url.download().alias("downloaded"),
            col("urls").url.download(max_connections=1).alias("downloaded_single_conn"),
            col("urls").url.download(on_error="null").alias("downloaded_ignore_errors"),
        )
        .collect()
        .to_pydict()
    )

    assert actual == expected

Contributor

Just as a sanity check, could we also make sure it works with scalar values?

select url_download("https://...") from df;

@RCHowell RCHowell enabled auto-merge (squash) January 16, 2025 00:57

codspeed-hq bot commented Jan 16, 2025

CodSpeed Performance Report

Merging #3690 will improve performance by 80.78%

Comparing rchowell/df-109-add-missing-url_-functions-to-sql (b8cbf40) with main (34d2036)

Summary

⚡ 1 improvements
✅ 26 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | rchowell/df-109-add-missing-url_-functions-to-sql | Change |
|---|---|---|---|
| test_iter_rows_first_row[100 Small Files] | 197.1 ms | 109 ms | +80.78% |

@RCHowell RCHowell merged commit c650794 into main Jan 16, 2025
43 checks passed
@RCHowell RCHowell deleted the rchowell/df-109-add-missing-url_-functions-to-sql branch January 16, 2025 01:25