Support datastore without distribution references #4380

dafeder · 2025-01-13T22:20:50Z

Having distributions be referenced child nodes creates many problems for sites that update datasets regularly. Because the referencing system creates new entities on change for everything except datasets (to keep revision history for datasets simpler, as DKAN references don't support specific revisions), moving datasets through a workflow or simply deleting/unpublishing them can create a huge number of redundant distributions. Even in cases where the data itself is not changing, simply keeping distributions and datasets in sync if using multiple workflow states is quite a headache.

After years of DKAN 2 in production we've come to the conclusion that in most cases, having distributions stored as separate, referenced entities provides very few upsides. We don't want to remove the ability to do this, both for backward compatibility and because there may be some legitimate use cases. Any property in a DKAN dataset schema can be configured as a reference where the values will be stored in a separate entity and this should remain the case.

However, the datastore is tightly coupled to the distribution entity. Most of the referencing and import logic in both metastore and datastore assumes that all data resource URLs will be stored in a downloadURL property on a separate distribution item. As part of a larger initiative to decouple/disentangle the datastore in various ways from the rest of DKAN (see #3746), we should consider some re-architecting to expose URLs to the datastore based simply on their place in the dataset tree, agnostic about the referencing configuration.
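As a rough illustration of "based simply on their place in the dataset tree", here is a minimal Python sketch (not DKAN code, which is PHP). It assumes the dataset has already been dereferenced into plain JSON with a `distribution` array; the function name `iter_download_urls` and the sample dataset are hypothetical.

```python
from typing import Any, Iterator


def iter_download_urls(dataset: dict[str, Any]) -> Iterator[tuple[int, str]]:
    """Yield (index, downloadURL) for each distribution that has a URL.

    Works on the dereferenced dataset only, so it does not matter whether
    distributions are stored inline or as separate referenced entities.
    """
    for index, distribution in enumerate(dataset.get("distribution") or []):
        # Skip anything that is not an object or has no downloadURL.
        if isinstance(distribution, dict) and distribution.get("downloadURL"):
            yield index, distribution["downloadURL"]


# Hypothetical dereferenced dataset, for illustration only.
dataset = {
    "identifier": "example-dataset",
    "title": "Example",
    "distribution": [
        {"title": "CSV", "downloadURL": "https://example.com/data.csv"},
        {"title": "Documentation only"},
    ],
}

urls = list(iter_download_urls(dataset))
```

The index is kept alongside each URL because a resource could then be addressed by dataset ID plus position, without needing a distribution identifier.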

Note: this was previously discussed in #4054.

This issue could use more refinement, but here are some basic assumptions about implementation:

We try to use resource IDs instead of distribution IDs everywhere in our APIs. You can access a resource or datastore via dataset ID + index or by resource ID.
Initially, maintain backward compatibility. If someone requests /api/1/datastore/query/[id], we check whether a distribution UUID in the system matches the id, if and only if a) we've already checked for a matching resource ID and not found one, and b) the id is a valid UUID.
Rather than dereferencing resource IDs and invoking datastore imports based on the downloadURL property of distributions, metastore logic should look at the $.distribution[].downloadURL property on a dereferenced dataset, no matter how the underlying referencing/storing works.
If possible, let that JSON path be overridable. But this could be left out and ticketed as a follow-up enhancement.
Maybe we cache some sort of map of dataset and resource IDs to make it easier to find the right dataset when all you have is a downloadURL or resource ID (this might also be considered out of scope and a separate issue).
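The backward-compatible lookup order described above could look roughly like the following Python sketch (illustrative only; DKAN itself is PHP). The function names, the dictionary-based lookups, and the sample IDs are all hypothetical stand-ins for whatever storage DKAN would actually query.

```python
import uuid
from typing import Optional


def is_valid_uuid(value: str) -> bool:
    """Return True only for canonical hyphenated UUID strings."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def resolve_resource_id(
    requested_id: str,
    resources: dict[str, str],
    distributions: dict[str, str],
) -> Optional[str]:
    """Resolve an id from a datastore request to a resource ID.

    1. Prefer an exact resource ID match.
    2. Only if that fails AND the id is a valid UUID, fall back to
       treating it as a legacy distribution UUID.
    """
    if requested_id in resources:
        return requested_id
    if is_valid_uuid(requested_id):
        return distributions.get(requested_id)
    return None


# Hypothetical lookup tables: resource ID -> datastore table,
# and distribution UUID -> resource ID.
RESOURCES = {"resource-a": "datastore_table_a"}
DISTRIBUTIONS = {"0b5f3a52-7c1e-4a3b-9d2f-1e6c8a4b7d90": "resource-a"}

by_resource = resolve_resource_id("resource-a", RESOURCES, DISTRIBUTIONS)
by_legacy_uuid = resolve_resource_id(
    "0b5f3a52-7c1e-4a3b-9d2f-1e6c8a4b7d90", RESOURCES, DISTRIBUTIONS
)
not_found = resolve_resource_id("not-a-known-id", RESOURCES, DISTRIBUTIONS)
```

Checking for UUID validity before the fallback means the distribution lookup is skipped entirely for ids that could never be distribution UUIDs.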

dafeder · 2025-01-15T20:11:33Z

To see the issue, just create a dataset, then edit it and change anything in the distribution. You'll see an additional distribution in the node list. This can get very unmanageable if you multiply by hundreds of edits and/or hundreds of datasets.
