Add support for terra SpatVectorProxy, and format="file" for SpatRaster #42
Comments
Do you know how to materialize a proxy vect? It doesn't seem to have any behaviours beyond crs(), ext(), dim() and plot(), which plots the extent only. I'd expect vect() at least to work, but it does not. I'm super interested in the concerns here and the topic of GDAL generally.
I think you want query():
library(terra)
#> terra 1.7.74
x <- vect(system.file("ex", "lux.shp", package = "terra"), proxy=TRUE)
x
#> class : SpatVectorProxy
#> geometry : polygons
#> dimensions : 12, 6 (geometries, attributes)
#> extent : 5.826232, 6.16085, 49.94611, 50.18162 (xmin, xmax, ymin, ymax)
#> source : lux.shp
#> layer : lux
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> names : ID_1 NAME_1 ID_2 NAME_2 AREA POP
#> type : <num> <chr> <num> <chr> <num> <int>
y <- query(x)
y
#> class : SpatVector
#> geometry : polygons
#> dimensions : 12, 6 (geometries, attributes)
#> extent : 5.74414, 6.528252, 49.44781, 50.18162 (xmin, xmax, ymin, ymax)
#> source : lux.shp
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> names : ID_1 NAME_1 ID_2 NAME_2 AREA POP
#> type : <num> <chr> <num> <chr> <num> <int>
#> values : 1 Diekirch 1 Clervaux 312 18081
#> 1 Diekirch 2 Diekirch 218 32543
#> 1 Diekirch 3 Redange 259 18664
y2 <- query(x, sql="SELECT * FROM lux LIMIT 5")
y2
#> class : SpatVector
#> geometry : polygons
#> dimensions : 5, 6 (geometries, attributes)
#> extent : 5.74414, 6.315773, 49.69933, 50.18162 (xmin, xmax, ymin, ymax)
#> source : lux.shp
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> names : ID_1 NAME_1 ID_2 NAME_2 AREA POP
#> type : <num> <chr> <num> <chr> <num> <int>
#> values : 1 Diekirch 1 Clervaux 312 18081
#> 1 Diekirch 2 Diekirch 218 32543
#> 1 Diekirch 3 Redange 259 18664
I agree it would perhaps make sense to be able to do these same operations via
I guess rast() drops the reference to the actual data, so perhaps vect() is the wrong expectation.
The {terra} SpatVectorProxy allows you to create a "lazy" reference to a vector dataset with terra::vect(..., proxy=TRUE) that you can query with terra::query() rather than loading all attributes and geometry into memory. This is very helpful and can be much more efficient when working with portions of large data. The SpatVector is always in memory, SpatVectorProxy never in memory, and SpatRaster is in memory if it is sufficiently small; otherwise it automatically behaves as if it were a "SpatRasterProxy" to the source file.

Currently,
tar_terra_vect() cannot handle SpatVectorProxy because there is no SpatVectorProxy writeVector() method.

A philosophical question is whether creating a target for a SpatVectorProxy should copy the full source data to the target store, as we do for vector objects in memory, OR create a format="file" target for the data source returned by terra::sources().

For a proxy object, I think I might often prefer the latter option. On one hand the former might be more reproducible in general, as the source data get copied, but essentially we have this as the default SpatVector and SpatRaster approach already.
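To make the second option concrete, here is a minimal sketch of what a format="file" pipeline could look like with the API that exists today. It is only an illustration under some assumptions: the target names lux_file and lux_subset are made up, the source is the lux.shp example shipped with {terra}, and the downstream target simply reuses tar_terra_vect() for the materialized subset. It is not a proposal for the final interface.

```r
# _targets.R (sketch)
library(targets)
library(geotargets)
library(terra)

pipeline <- list(
  # Track the source file itself: targets hashes the file on disk instead of
  # copying the data into the store. Only the .shp is tracked here; a
  # multi-file format like a shapefile would really need all sidecar paths.
  tar_target(
    lux_file,
    system.file("ex", "lux.shp", package = "terra"),
    format = "file"
  ),
  # Materialize only a small portion downstream. The proxy is created inside
  # the command, so no SpatVectorProxy ever needs to be stored.
  tar_terra_vect(
    lux_subset,
    query(vect(lux_file, proxy = TRUE), sql = "SELECT * FROM lux LIMIT 5")
  )
)

pipeline
```

With this layout, lux_subset is only rebuilt when the hash of the tracked file changes, which is the "track the state of the source file" behaviour described above.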
I often work with some fairly large file-based databases using a SpatVectorProxy or large SpatRaster initially and materializing only small portions relevant to specific areas later with query() or crop() or similar. Usually I would be fine to have targets just track the state of the source file, rather than a full copy of the data, as those things are not changing much, and often can be downloaded through standard methods (and the download could be a preceding target in the pipeline, prior to creating the SpatVectorProxy/SpatRaster).

Perhaps the methods we have implemented should have an option to utilize an existing source and a format="file" approach. This would be the default behavior for SpatVectorProxy, and the default could be based on terra::inMemory() for SpatRaster.

I see a few problems with the above:
- I suppose that format="file" would only work for source formats that are a file to begin with (e.g. GeoPackage, FGDB, SQLite, DuckDB, Parquet...) but not for true database drivers like PostgreSQL. I think this can be addressed as a different issue for true database sources, where instead of a format="file" you store a checksum for a table or query result from a database source, and possibly something about the database connection.
- {terra} will automatically write temporary files for raster operations that are too big to be done entirely in memory. This means inMemory() could return FALSE, but a target could be created based on a temporary file that will be deleted after the R session cleans up, invalidating downstream targets. In the case a temp file is used, rather than a file path specifically chosen by the user, we may not want to automatically decide for the user whether the "proxy" behavior of SpatRaster should be triggered.
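
As a rough illustration of the temp-file concern, the check below goes one step beyond inMemory() and also looks at where the source files live. The helper name has_stable_source() is hypothetical, and the sketch assumes that {terra} writes its temporary rasters under the session tempdir() (the default, but configurable via terraOptions()).

```r
library(terra)

# Hypothetical helper: TRUE only when every source of SpatRaster `x` is an
# existing file that lives outside the session temp directory.
has_stable_source <- function(x) {
  # Any in-memory values mean there is no file to track at all.
  if (any(inMemory(x))) return(FALSE)
  src <- sources(x)
  # In-memory sources are reported as empty strings.
  if (length(src) == 0 || any(!nzchar(src))) return(FALSE)
  # Non-file sources (e.g. subdataset strings) fail this check on purpose.
  if (!all(file.exists(src))) return(FALSE)
  tmp <- normalizePath(tempdir(), winslash = "/")
  src <- normalizePath(src, winslash = "/")
  # Files under tempdir() disappear when the session ends.
  !any(startsWith(src, tmp))
}

r_file <- rast(system.file("ex", "elev.tif", package = "terra"))
r_mem <- r_file * 2                       # small result, held in memory
tf <- tempfile(fileext = ".tif")
r_tmp <- writeRaster(r_file, tf)          # file-backed, but in tempdir()

has_stable_source(r_mem)
has_stable_source(r_tmp)
```

A check along these lines would let a SpatRaster method fall back to copying the data whenever the source is in memory or temporary, and leave format="file" for paths the user chose explicitly.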