feat(sql): Adds url_download and url_upload to daft-sql #3690

RCHowell · 2025-01-15T22:31:25Z

This change is a bit larger because I had to change how url_download and url_upload parse keyword arguments so that they behave more like the image functions. I also moved some reusable SQL argument-parsing logic into a functions::args module.

I've left some notes/TODOs regarding input validation and handling of named parameters, but addressing these is outside the scope of this PR.
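
To illustrate the kind of shared argument handling described above, here is a minimal sketch in Python rather than the Rust used in daft-sql. It is not the code from this PR: `SqlArg` and `split_args` are hypothetical names, and the validation rules (no positional argument after a named one, no unknown or repeated names) are assumptions about what such a helper might enforce.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class SqlArg:
    """Hypothetical stand-in for one parsed SQL call argument."""

    value: Any
    name: Optional[str] = None  # None marks a positional argument


def split_args(
    args: Iterable[SqlArg], allowed: Iterable[str]
) -> Tuple[List[Any], Dict[str, Any]]:
    """Separate positional values from named ones, rejecting unknown or repeated names."""
    allowed_names = set(allowed)
    positional: List[Any] = []
    named: Dict[str, Any] = {}
    for arg in args:
        if arg.name is None:
            if named:
                raise ValueError("positional argument after a named argument")
            positional.append(arg.value)
        elif arg.name not in allowed_names:
            raise ValueError(f"unexpected named argument: {arg.name}")
        elif arg.name in named:
            raise ValueError(f"duplicate named argument: {arg.name}")
        else:
            named[arg.name] = arg.value
    return positional, named


# Mirrors a call such as url_download(urls, max_connections=>1).
positional, named = split_args(
    [SqlArg("urls"), SqlArg(1, name="max_connections")],
    allowed=["max_connections", "on_error"],
)
```

Keeping this step in one place would let each SQL function declare only its allowed names and then read typed values out of the resulting mapping.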

codecov · 2025-01-15T23:07:07Z

Codecov Report

Attention: Patch coverage is 96.85535% with 5 lines in your changes missing coverage. Please review.

Project coverage is 77.68%. Comparing base (34d2036) to head (b8cbf40).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/daft-sql/src/functions.rs | 86.66% | 2 Missing ⚠️ |
| src/daft-sql/src/modules/image/decode.rs | 0.00% | 1 Missing ⚠️ |
| src/daft-sql/src/modules/uri/url_download.rs | 96.96% | 1 Missing ⚠️ |
| src/daft-sql/src/modules/uri/url_upload.rs | 97.67% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3690      +/-   ##
==========================================
- Coverage   77.79%   77.68%   -0.11%     
==========================================
  Files         729      732       +3     
  Lines       90477    90757     +280     
==========================================
+ Hits        70384    70502     +118     
- Misses      20093    20255     +162
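
As a rough check of the diff above, the percentages can be recomputed from the hit and line counts. This is a minimal sketch that assumes coverage is simply hits divided by lines, rounded to two decimals; `coverage_pct` is an illustrative name and not part of Codecov.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return coverage as a percentage rounded to two decimal places."""
    return round(100.0 * hits / lines, 2)


# Counts taken from the coverage diff for main and for this PR.
base = coverage_pct(70384, 90477)
head = coverage_pct(70502, 90757)
delta = round(head - base, 2)

print(base, head, delta)
```

Under that assumption the values come out to 77.79, 77.68, and -0.11, which agrees with the report.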

| Files with missing lines | Coverage Δ | |
|---|---|---|
| daft/expressions/expressions.py | `93.54% <ø> (ø)` | |
| src/daft-functions/src/python/uri.rs | `73.58% <100.00%> (-1.86%)` | ⬇️ |
| src/daft-functions/src/uri/download.rs | `85.71% <100.00%> (+2.38%)` | ⬆️ |
| src/daft-functions/src/uri/mod.rs | `100.00% <100.00%> (ø)` | |
| src/daft-functions/src/uri/upload.rs | `73.86% <100.00%> (+3.03%)` | ⬆️ |
| ...al-plan/src/optimization/rules/push_down_filter.rs | `97.35% <100.00%> (+0.02%)` | ⬆️ |
| src/daft-sql/src/modules/uri/mod.rs | `100.00% <100.00%> (ø)` | |
| src/daft-sql/src/modules/image/decode.rs | `17.14% <0.00%> (+4.09%)` | ⬆️ |
| src/daft-sql/src/modules/uri/url_download.rs | `96.96% <96.96%> (ø)` | |
| src/daft-sql/src/modules/uri/url_upload.rs | `97.67% <97.67%> (ø)` | |
... and 1 more

... and 9 files with indirect coverage changes

RCHowell · 2025-01-15T23:17:35Z

src/daft-sql/src/modules/uri/url_download.rs

+        Ok(Self {
+            max_connections,
+            raise_error_on_failure,
+            multi_thread: true, // TODO always true


I wasn't able to find other examples of multi_thread in daft-sql, but the Python logic defined in ExpressionUrlNamespace sets it as follows:

multi_thread = not using_ray_runner
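
For context, the flags in the Rust struct above could be derived along these lines. This is only a sketch: `resolve_download_flags` is a hypothetical helper, the `using_ray_runner` flag is passed in by hand rather than detected, and the `"raise"` default and the mapping of `on_error` to `raise_error_on_failure` are assumptions based on the parameter names used in this PR.

```python
from typing import Dict


def resolve_download_flags(
    on_error: str = "raise", using_ray_runner: bool = False
) -> Dict[str, bool]:
    """Translate user-facing options into the boolean flags of the download function."""
    if on_error not in ("raise", "null"):
        raise ValueError(f"on_error must be 'raise' or 'null', got {on_error!r}")
    return {
        # Assumption: 'null' means failures become nulls instead of errors.
        "raise_error_on_failure": on_error == "raise",
        # Matches the Python-side rule quoted above.
        "multi_thread": not using_ray_runner,
    }


flags_default = resolve_download_flags()
flags_ray = resolve_download_flags(on_error="null", using_ray_runner=True)
```

The SQL path in this PR hard-codes `multi_thread: true`, so it would match this sketch only when the Ray runner is not in use.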

universalmind303 · 2025-01-15T23:29:26Z

tests/sql/test_uri_exprs.py

+def test_url_download():
+    df = daft.from_pydict(
+        {
+            "urls": [
+                "https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/README.rst",
+                "https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/LICENSE",
+            ]
+        }
+    )
+
+    actual = (
+        daft.sql(
+            """
+        SELECT
+            url_download(urls) as downloaded,
+            url_download(urls, max_connections=>1) as downloaded_single_conn,
+            url_download(urls, on_error=>'null') as downloaded_ignore_errors
+        FROM df
+        """
+        )
+        .collect()
+        .to_pydict()
+    )
+
+    expected = (
+        df.select(
+            col("urls").url.download().alias("downloaded"),
+            col("urls").url.download(max_connections=1).alias("downloaded_single_conn"),
+            col("urls").url.download(on_error="null").alias("downloaded_ignore_errors"),
+        )
+        .collect()
+        .to_pydict()
+    )
+
+    assert actual == expected
+


Just as a sanity check, could we also make sure it works with scalar values?

select url_download('https://...') from df;

codspeed-hq · 2025-01-16T01:07:50Z

CodSpeed Performance Report

Merging #3690 will improve performance by 80.78%

Comparing rchowell/df-109-add-missing-url_-functions-to-sql (b8cbf40) with main (34d2036)

Summary

⚡ 1 improvements
✅ 26 untouched benchmarks

Benchmarks breakdown

| | Benchmark | `main` | `rchowell/df-109-add-missing-url_-functions-to-sql` | Change |
|---|---|---|---|---|
| ⚡ | `test_iter_rows_first_row[100 Small Files]` | 197.1 ms | 109 ms | +80.78% |

github-actions bot added the feat label Jan 15, 2025

RCHowell added 2 commits January 15, 2025 15:08

Adds url_download and url_upload to daft-sql

5f8e7a9

Removes unused common-io-config from daft-functions

63ae5b7

RCHowell force-pushed the rchowell/df-109-add-missing-url_-functions-to-sql branch from 6f90f29 to 63ae5b7 Compare January 15, 2025 23:12

RCHowell mentioned this pull request Jan 15, 2025

add missing url_* functions to SQL #2945

Closed

universalmind303 approved these changes Jan 15, 2025

Adds additional test for url_download string literal

b8cbf40

RCHowell enabled auto-merge (squash) January 16, 2025 00:57

RCHowell merged commit c650794 into main Jan 16, 2025
43 checks passed

RCHowell deleted the rchowell/df-109-add-missing-url_-functions-to-sql branch January 16, 2025 01:25

RCHowell mentioned this pull request Jan 16, 2025

feat: add url_download/url_upload as sql function #3575

Closed
