Support datastore without distribution references #4380

Open
dafeder opened this issue Jan 13, 2025 · 1 comment


dafeder commented Jan 13, 2025

Having distributions be referenced child nodes creates many problems for sites that update datasets regularly. Because the referencing system creates new entities on change for everything except datasets (to keep revision history for datasets simpler, as DKAN references don't support specific revisions), moving datasets through a workflow or simply deleting/unpublishing them can create a huge number of redundant distributions. Even when the data itself is not changing, keeping distributions and datasets in sync across multiple workflow states is quite a headache.

After years of DKAN 2 in production, we've come to the conclusion that in most cases, having distributions stored as separate, referenced entities provides very few upsides. We don't want to remove the ability to do this, both for backward compatibility and because there may be some legitimate use cases. Any property in a DKAN dataset schema can be configured as a reference, with its values stored in a separate entity, and this should remain the case.

However, the datastore is quite coupled to the distribution entity. Most of the referencing and import logic in both metastore and datastore assumes that all data resource URLs will be stored in a downloadURL property on a separate distribution item. As part of a larger initiative to decouple/disentangle the datastore in various ways from the rest of DKAN (see #3746), we should consider some re-architecting to expose URLs to the datastore based simply on their place in the dataset tree, agnostic about the referencing configuration.
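To make "based simply on their place in the dataset tree" concrete, here is a minimal sketch in Python (for illustration only; DKAN itself is PHP). It assumes the dataset has already been dereferenced into a plain dict with a `distribution` list. `iter_download_urls` and the sample dataset are hypothetical, not existing DKAN code or data.

```python
from typing import Any, Iterator


def iter_download_urls(dataset: dict[str, Any]) -> Iterator[tuple[int, str]]:
    """Yield (index, downloadURL) for every distribution that has a URL.

    Assumption: `dataset` is already dereferenced, so each distribution is a
    plain object no matter how it is stored underneath.
    """
    for index, distribution in enumerate(dataset.get("distribution", [])):
        if not isinstance(distribution, dict):
            continue  # skip anything that is not a distribution object
        url = distribution.get("downloadURL")
        if url:
            yield index, url


# Made-up dataset for illustration.
dataset = {
    "identifier": "example-dataset",
    "distribution": [
        {"title": "CSV", "downloadURL": "https://example.com/data.csv"},
        {"title": "Landing page only"},  # no downloadURL, so it is skipped
        {"title": "Second CSV", "downloadURL": "https://example.com/more.csv"},
    ],
}

urls = list(iter_download_urls(dataset))
```

The index is kept alongside each URL so that a resource could later be addressed by dataset ID plus position rather than by a distribution ID.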

Note: this was previously discussed in #4054.

This issue could use some more refinement, but here are some basic assumptions about implementation:

  • We try to use resource IDs instead of distribution IDs everywhere in our APIs. You can access a resource or datastore via dataset ID + index or by resource ID.
  • Initially maintain backward compatibility. If someone requests /api/1/datastore/query/[id], we check whether a distribution UUID in the system matches the id if and only if a) we've checked for a matching resource ID and not found one, and b) the id is a valid UUID.
  • Rather than dereferencing resource IDs and invoking datastore imports based on the downloadURL property of distributions, metastore logic should look at the $.distribution[].downloadURL property on a dereferenced dataset, no matter how the underlying referencing/storing works.
  • If possible, let that JSON path be overridable. But this could be left out and ticketed as a follow-up enhancement.
  • Maybe we cache some sort of map of dataset and resource IDs to make it easier to find the right dataset when all you have is a downloadURL or resource ID (this also might be considered out of scope and a separate issue).
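The backward-compatible lookup order described in the list could be sketched roughly as follows. This is Python for illustration only, not DKAN code: `resolve_resource`, `is_valid_uuid`, the in-memory sets and dicts, and all IDs are hypothetical stand-ins for whatever storage and cache the real implementation would use.

```python
import uuid
from typing import Optional


def is_valid_uuid(value: str) -> bool:
    """Return True only for canonical hyphenated UUID strings."""
    try:
        # uuid.UUID() also accepts braces and un-hyphenated hex, so compare
        # against the canonical form to keep the check strict.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def resolve_resource(
    identifier: str,
    resource_ids: set[str],
    distribution_to_resource: dict[str, str],
) -> Optional[str]:
    """Resolve an API id to a resource ID, preferring resource IDs."""
    # 1. A matching resource ID always wins.
    if identifier in resource_ids:
        return identifier
    # 2. Only then, and only for valid UUIDs, try legacy distribution UUIDs.
    if is_valid_uuid(identifier):
        return distribution_to_resource.get(identifier)
    return None


# Made-up data for illustration.
resource_ids = {"res-a", "res-b"}
legacy_uuid = "0b1f6f3e-7c1d-4c6e-9a57-2f3d1e8a4b10"
distribution_to_resource = {legacy_uuid: "res-a"}

# Hypothetical cached map for the "dataset ID + index" access path.
resource_by_position = {("example-dataset", 0): "res-a", ("example-dataset", 1): "res-b"}

by_resource_id = resolve_resource("res-b", resource_ids, distribution_to_resource)
by_legacy_uuid = resolve_resource(legacy_uuid, resource_ids, distribution_to_resource)
by_garbage = resolve_resource("not-a-uuid", resource_ids, distribution_to_resource)
by_position = resource_by_position.get(("example-dataset", 1))
```

Checking resource IDs first means a legacy distribution UUID can never shadow a real resource, and the UUID validity check avoids a pointless distribution lookup for ids that could not be UUIDs anyway.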

dafeder commented Jan 15, 2025

To see the issue, just create a dataset, then edit it and change anything in the distribution. You'll see an additional distribution in the node list. This can get very unmanageable if you multiply by hundreds of edits and/or hundreds of datasets.
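As a rough back-of-the-envelope illustration of that multiplication, here is a toy Python calculation. It assumes every edit changes exactly one distribution and therefore leaves one extra distribution node behind; the function name and the numbers are made up for illustration and are not measurements from a real site.

```python
def distribution_node_count(
    datasets: int,
    edits_per_dataset: int,
    distributions_per_dataset: int = 1,
) -> int:
    """Estimate total distribution nodes under the one-new-node-per-edit assumption."""
    # Each dataset starts with its own distributions, then gains one node per edit.
    return datasets * (distributions_per_dataset + edits_per_dataset)


# Hypothetical site: 200 datasets, each edited 100 times.
total_nodes = distribution_node_count(datasets=200, edits_per_dataset=100)
live_nodes = 200 * 1  # distributions actually referenced by current datasets
redundant_nodes = total_nodes - live_nodes
```

Under these assumptions the site ends up with 20,200 distribution nodes, of which only 200 are still referenced.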
