Catalog polling to infer & exclude broken datastores #308

Open
Tracked by #279
charles-turner-1 opened this issue Dec 12, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@charles-turner-1
Collaborator

Is your feature request related to a problem? Please describe.

This package indexes & includes a number of datastores in the catalog by way of Translators, e.g. those in #211 and #199.

These datastores are typically owned by other users/groups on Gadi, and so are liable to break if their owners make alterations to the relevant catalog files (examples of where these are likely to be located can be found in the config section of this package).

Although these breakages are not strictly speaking a breakage of the ACCESS-NRI Intake Catalog, they will appear to users as such, as attempting to open a broken datastore will trigger an error. To handle this, we should implement some sort of procedure to ensure that all datastores included in the live catalog are functioning.

Describe the feature you'd like

There should be some functionality that lets us poll the catalog at regular intervals (e.g. at midnight every night) to ensure that all datastores are functioning correctly.

The tests introduced by #290 (tests/e2e/test_datasets_representative) attempt to build all datastores from translators, which enforces the assumption that the datastores can still be opened & translated using the same translators that were valid when the datastores were added to the catalog. This, in essence, forms a regression test for datastore validity, and the functionality therein could be adapted & extended to ensure datastore integrity.
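
As a rough illustration of what such a poll could look like, the sketch below tries to open each datastore and records failures rather than raising on the first one. The names `PollResult` and `poll_datastores` are hypothetical, and the `opener` callable is an assumption: in practice it might be something like `intake.open_esm_datastore`, or whatever the e2e tests from #290 already use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PollResult:
    """Outcome of trying to open a single datastore (hypothetical type)."""

    path: str
    ok: bool
    error: Optional[str] = None


def poll_datastores(
    paths: Iterable[str], opener: Callable[[str], object]
) -> List[PollResult]:
    """Try to open every datastore, recording failures instead of raising."""
    results = []
    for path in paths:
        try:
            opener(path)
        except Exception as exc:  # any failure marks this datastore as broken
            results.append(PollResult(path, False, f"{type(exc).__name__}: {exc}"))
        else:
            results.append(PollResult(path, True))
    return results
```

Because every path is attempted, one broken datastore does not hide the state of the others, and the list of results can be passed on to whatever reporting step is chosen.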

Additional considerations:

  • Some translators are applied to multiple datastores, e.g. cmip6.yaml:

    builder: null
    translator: Cmip6Translator
    sources:
      - metadata_yaml: /g/data/xp65/admin/access-nri-intake-catalog/config/metadata_sources/cmip6-fs38/metadata.yaml
        path:
          - /g/data/fs38/catalog/v2/esm/catalog.json
      - metadata_yaml: /g/data/xp65/admin/access-nri-intake-catalog/config/metadata_sources/cmip6-oi10/metadata.yaml
        path:
          - /g/data/oi10/catalog/v2/esm/catalog.json

If, for example, the cmip6-fs38 datastore were broken, care would need to be taken to ensure that the cmip6-oi10 datastore was not also excluded (in this instance erroneously).
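
One way to avoid that kind of over-exclusion is to check each entry under `sources` on its own, rather than treating the translator as the unit of failure. The sketch below assumes the config has already been parsed into a dict with the same shape as cmip6.yaml; `prune_broken_sources` and the `is_healthy` callable are hypothetical names, not part of the package.

```python
from copy import deepcopy
from typing import Callable, List, Tuple


def prune_broken_sources(
    config: dict, is_healthy: Callable[[str], bool]
) -> Tuple[dict, List[str]]:
    """Return (pruned_config, broken_paths), judging each source independently."""
    pruned = deepcopy(config)  # leave the caller's config untouched
    kept, broken = [], []
    for source in config["sources"]:
        bad = [path for path in source["path"] if not is_healthy(path)]
        if bad:
            broken.extend(bad)  # drop only this source
        else:
            kept.append(deepcopy(source))
    pruned["sources"] = kept
    return pruned, broken
```

With the config above, a failure in the fs38 catalog would drop only the cmip6-fs38 source, and the cmip6-oi10 source would stay under the same `Cmip6Translator` entry.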

  • There may be additional considerations around missing files: it is easy & computationally cheap to check that a datastore still opens. It is potentially substantially harder & more computationally expensive to ensure that no files within a datastore have been altered or moved in a way that breaks the catalog. Perhaps a binhash solution like the one suggested in Restart Catalogue #272 would address these concerns.
  • We would need to consider how to address breakages - would we automatically trigger a catalog rebuild, excluding missing artifacts? How would we warn catalog users of missing data due to newly broken datastores?
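
For the missing-files consideration above, a cheap stand-in for a full content hash is to fingerprint file metadata and compare it between polls. This is only a sketch under that assumption and is not necessarily what #272 proposes: `fingerprint_files` is a hypothetical helper that hashes each file's path, size and modification time, so it detects moved, deleted or rewritten files without reading their contents.

```python
import hashlib
import os
from typing import Iterable


def fingerprint_files(paths: Iterable[str]) -> str:
    """Hash each file's path, size and modification time into one digest."""
    digest = hashlib.sha256()
    for path in sorted(paths):  # sort so the result is order-independent
        try:
            stat = os.stat(path)
            record = f"{path}|{stat.st_size}|{stat.st_mtime_ns}"
        except OSError:
            record = f"{path}|MISSING"  # a vanished file changes the digest
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

Storing the digest at build time and recomputing it during each poll would flag a change with one `stat` call per file, although that could still be expensive for datastores with very many assets.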

Additional context

  • We may also want to apply a similar procedure towards datastores created using builders, as well as translators. This would require some additional thought regarding implementation.
@marc-white
Collaborator

In the event that a user attempts to open a 'broken' datastore, should we wrap the resulting error somehow to point out that it's not really the fault of the catalog? E.g., Attempt to open datastore XXXX failed. This appears to be due to the datastore being broken, or having been changed since it was ingested by access-nri-intake. Please contact ACCESS-NRI.
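
A minimal sketch of that kind of wrapping, assuming the catalog opens datastores through some callable `opener`: `BrokenDatastoreError` and `open_with_context` are hypothetical names, not existing parts of access-nri-intake. The original exception is chained so that the underlying traceback is still available.

```python
from typing import Callable


class BrokenDatastoreError(RuntimeError):
    """Hypothetical error raised when an ingested datastore no longer opens."""


def open_with_context(name: str, path: str, opener: Callable[[str], object]) -> object:
    """Open a datastore, re-raising any failure with a user-facing explanation."""
    try:
        return opener(path)
    except Exception as exc:
        raise BrokenDatastoreError(
            f"Attempt to open datastore {name} failed. This appears to be due to "
            "the datastore being broken, or having been changed since it was "
            "ingested by access-nri-intake. Please contact ACCESS-NRI."
        ) from exc  # keep the original error as __cause__
```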

@charles-turner-1
Collaborator Author

Yeah, that sounds very sensible. I think if it does fail, we should be able to separately read the .json and the .csv.gz files associated with the datastore & parse out what's broken, which would give us

  1. Some extra confidence that we haven't inadvertently caused the error.
  2. A place to start looking to resolve the error.
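
A rough sketch of that separate read, using only the standard library. It assumes the usual ESM datastore layout, where the .json has a `catalog_file` entry pointing at the .csv.gz and an `attributes` list naming the expected columns, and it assumes `catalog_file` is a local path (relative paths are resolved against the JSON's directory). `diagnose_datastore` is a hypothetical helper.

```python
import csv
import gzip
import json
import os
from typing import List


def diagnose_datastore(json_path: str) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    try:
        with open(json_path, encoding="utf-8") as fh:
            spec = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read JSON: {exc}"]

    catalog_file = spec.get("catalog_file")
    if not catalog_file:
        return ["JSON has no 'catalog_file' entry"]
    # Assumption: a relative catalog_file lives next to the JSON file.
    csv_path = os.path.join(os.path.dirname(json_path), catalog_file)

    try:
        with gzip.open(csv_path, "rt", encoding="utf-8", newline="") as fh:
            header = next(csv.reader(fh), [])
    except (OSError, EOFError, UnicodeDecodeError, csv.Error) as exc:
        return [f"cannot read CSV: {exc}"]

    # Every column named in the JSON should appear in the CSV header.
    expected = [attr.get("column_name") for attr in spec.get("attributes", [])]
    return [
        f"CSV is missing column '{col}'"
        for col in expected
        if col is not None and col not in header
    ]
```

A non-empty result points at the file to inspect first; an empty one suggests the problem lies elsewhere, for example in the assets the CSV refers to.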

@marc-white
Collaborator

Having taken the chance to have a more detailed read of this proposal (which I agree is very much needed), I have a few more comments/suggestions:

  • I'd be against an automated catalog rebuild. There are multiple possible failure modes, which may require a quick fix, a more prolonged fix, or simply the removal of an experiment from the catalog. I think human intervention is desirable at that point, so it would be better if we just got a notification that something was amiss.
  • From a technical standpoint, I presume there is a mechanism for getting Gadi to run a small job periodically?

@charles-turner-1
Collaborator Author

> Having taken the chance to have a more detailed read of this proposal (which I agree is very much needed), I have a few more comments/suggestions:
>
>   • I'd be against an automated catalog rebuild. There are multiple possible failure modes, which may require a quick fix, a more prolonged fix, or simply the removal of an experiment from the catalog. I think human intervention is desirable at that point, so it would be better if we just got a notification that something was amiss.

I agree. Potentially a good way of handling this would be to ping the tracking services server with a notification that something has failed. The server could then send us an email or open an issue on this repo to report the failure.
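
The tracking services server's interface isn't described here, so the sketch below covers only the "open an issue" half, by building a request for the GitHub REST API's create-issue endpoint. `build_issue_request` is a hypothetical helper, and the repository name and token are placeholders to be supplied by whoever runs the poll.

```python
import json
import urllib.request
from typing import Iterable


def build_issue_request(
    repo: str, token: str, broken: Iterable[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST that opens an issue listing broken datastores."""
    body = "The scheduled poll could not open:\n" + "\n".join(
        f"- {path}" for path in broken
    )
    payload = {"title": "Broken datastores detected", "body": body}
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{repo}/issues",  # repo is "OWNER/REPO"
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sending it would be a matter of passing the result to `urllib.request.urlopen`. Keeping construction separate from sending makes the notification easy to test without network access.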

>   • From a technical standpoint, I presume there is a mechanism for getting Gadi to run a small job periodically?

Can we run a persistent session with a cron job inside it?

Projects
Status: Backlog