Catalog polling to infer & exclude broken datastores #308

charles-turner-1 · 2024-12-12T02:16:50Z

Is your feature request related to a problem? Please describe.

This package indexes & includes a number of datastores in the catalog by way of Translators, e.g. those in #211, #199, etc.

These datastores are typically owned by other users/groups on Gadi, and so are liable to break if their owners make alterations to the relevant catalog files (examples of where these are likely to be located can be found in the config section of this package).

Although these breakages are not strictly speaking a breakage of the ACCESS-NRI Intake Catalog, they will appear to users as such, as attempting to open a broken datastore will trigger an error. To handle this, we should implement some sort of procedure to ensure that all datastores included in the live catalog are functioning.

Describe the feature you'd like

There should be some functionality that allows us to continually poll the catalog at regular intervals (eg. midnight every night) to ensure that all datastores are correctly functioning.

The tests introduced by #290 (tests/e2e/test_datasets_representative) attempt to build all datastores from translators, forcing the assumption that the datastores can still be opened & translated using the same translators that were valid when the datastores were added into the catalog. This, in essence, forms a regression test for the datastore validity, and the functionality therein could be adapted & extended to ensure datastore integrity.
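As a rough illustration of how that regression test might be adapted into a polling check, here is a minimal sketch. `check_datastores` and `CheckResult` are hypothetical names, and `build` is a stand-in for whatever callable opens a datastore and applies its translator; none of these exist in the package as far as this issue states.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class CheckResult:
    """Outcome of trying to build one datastore (hypothetical structure)."""

    name: str
    ok: bool
    error: Optional[str] = None


def check_datastores(
    names: Iterable[str],
    build: Callable[[str], object],
) -> list[CheckResult]:
    """Try to build every datastore, recording failures instead of raising."""
    results = []
    for name in names:
        try:
            # Assumption: `build` opens the datastore and runs its translator.
            build(name)
        except Exception as exc:
            # Any failure marks this datastore as broken; keep checking the rest.
            results.append(CheckResult(name, False, f"{type(exc).__name__}: {exc}"))
        else:
            results.append(CheckResult(name, True))
    return results
```

Collecting results rather than stopping at the first exception means a single broken datastore does not hide the state of the others.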

Additional considerations:

Some translators are applied to multiple datastores, e.g. cmip6.yaml:

builder: null
translator: Cmip6Translator
sources:
  - metadata_yaml: /g/data/xp65/admin/access-nri-intake-catalog/config/metadata_sources/cmip6-fs38/metadata.yaml
    path:
      - /g/data/fs38/catalog/v2/esm/catalog.json
  - metadata_yaml: /g/data/xp65/admin/access-nri-intake-catalog/config/metadata_sources/cmip6-oi10/metadata.yaml
    path:
      - /g/data/oi10/catalog/v2/esm/catalog.json

If, for example, the cmip6-fs38 datastore were broken, care would need to be taken to ensure that the cmip6-oi10 datastore was not also excluded (in this instance erroneously).
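One way to avoid that is to evaluate health per source rather than per config file. The sketch below assumes the YAML above has already been parsed into a dict. `split_sources` and `is_healthy` are hypothetical names used only for illustration.

```python
from typing import Callable

# Parsed form of the cmip6.yaml shown above (paths copied from this issue).
CMIP6_CONFIG = {
    "builder": None,
    "translator": "Cmip6Translator",
    "sources": [
        {
            "metadata_yaml": "/g/data/xp65/admin/access-nri-intake-catalog/config/metadata_sources/cmip6-fs38/metadata.yaml",
            "path": ["/g/data/fs38/catalog/v2/esm/catalog.json"],
        },
        {
            "metadata_yaml": "/g/data/xp65/admin/access-nri-intake-catalog/config/metadata_sources/cmip6-oi10/metadata.yaml",
            "path": ["/g/data/oi10/catalog/v2/esm/catalog.json"],
        },
    ],
}


def split_sources(
    config: dict,
    is_healthy: Callable[[str], bool],
) -> tuple[list[dict], list[dict]]:
    """Partition a config's sources into (healthy, broken) lists."""
    healthy, broken = [], []
    for source in config["sources"]:
        # A source counts as healthy only if every one of its paths checks out.
        if all(is_healthy(path) for path in source["path"]):
            healthy.append(source)
        else:
            broken.append(source)
    return healthy, broken
```

With this shape, a failure in the fs38 source leaves the oi10 source in the healthy list even though both share `Cmip6Translator`.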

There may be additional considerations around missing files. It is easy & computationally cheap to ensure that datastores are not broken. It is potentially substantially harder & more computationally expensive to ensure that no files within a datastore have been altered or moved in a way that breaks the catalog. Perhaps a binhash solution, as suggested in Restart Catalogue #272, would address these concerns.

We would need to consider how to address breakages: would we automatically trigger a catalog rebuild, excluding missing artifacts? How would we warn catalog users of missing data due to newly broken datastores?
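For the file-level check, one cheap option is to fingerprint file metadata rather than file contents. The sketch below is an assumption about what such a check could look like and is not necessarily the binhash scheme discussed in #272. `fingerprint` is a hypothetical helper that hashes each path together with its size and modification time.

```python
import hashlib
import os
from typing import Iterable


def fingerprint(paths: Iterable[str]) -> str:
    """Return a SHA-256 digest over (path, size, mtime) for each file.

    Only metadata is read, so this detects moved, deleted, resized or
    re-written files without reading any file contents.
    """
    digest = hashlib.sha256()
    for path in sorted(paths):
        try:
            stat = os.stat(path)
            record = f"{path}\0{stat.st_size}\0{stat.st_mtime_ns}\n"
        except OSError:
            # A missing or unreadable file still changes the fingerprint.
            record = f"{path}\0MISSING\n"
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()
```

Storing the digest at build time and comparing it on each poll would flag a datastore whose files changed, at the cost of one `stat` call per file. It would not catch an in-place edit that preserves both size and modification time.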

Additional context

We may also want to apply a similar procedure towards datastores created using builders, as well as translators. This would require some additional thought regarding implementation.


marc-white · 2025-01-02T03:22:01Z

In the event that we get a user attempting to open a 'broken' datastore, should we wrap the returning error somehow to point out that it's not really the fault of the catalog? E.g., Attempt to open datastore XXXX failed. This appears to be due to the data store being broken, or having been changed since it was ingested by access-nri-intake. Please contact ACCESS-NRI.
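A minimal sketch of that kind of wrapping, assuming a hypothetical `DatastoreOpenError` exception and an `opener` callable standing in for whatever actually opens the datastore:

```python
from typing import Callable


class DatastoreOpenError(RuntimeError):
    """Hypothetical error raised when an ingested datastore cannot be opened."""


def open_datastore(name: str, opener: Callable[[str], object]) -> object:
    """Open a datastore, re-raising any failure with a more helpful message."""
    try:
        return opener(name)
    except Exception as exc:
        # `from exc` keeps the original traceback attached for debugging.
        raise DatastoreOpenError(
            f"Attempt to open datastore {name} failed. This appears to be due "
            "to the datastore being broken, or having been changed since it "
            "was ingested by access-nri-intake. Please contact ACCESS-NRI."
        ) from exc
```

Chaining with `from exc` means the user sees the friendly message first, while the underlying error remains available as `__cause__`.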

charles-turner-1 · 2025-01-02T04:34:48Z

Yeah, that sounds very sensible. I think if it does fail, we should be able to separately read the .json and the .csv.gz files associated with the datastore & parse out what's broken, which would give us:

- Some extra confidence that we haven't inadvertently caused the error.
- A place to start looking to resolve the error.
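A rough sketch of that kind of diagnosis, using only the standard library. It assumes the datastore follows the usual ESM collection layout, where the .json holds a `catalog_file` entry pointing at a local .csv.gz, plus `attributes` and `assets` entries naming the expected columns. `diagnose` is a hypothetical helper.

```python
import csv
import gzip
import json
from pathlib import Path


def diagnose(json_path: str) -> list[str]:
    """Return a list of human-readable problems found in a datastore."""
    path = Path(json_path)
    try:
        spec = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read {path}: {exc}"]
    if not isinstance(spec, dict) or not spec.get("catalog_file"):
        return [f"{path} has no 'catalog_file' entry"]

    # Assumption: catalog_file is a local path, possibly relative to the .json.
    csv_path = Path(spec["catalog_file"])
    if not csv_path.is_absolute():
        csv_path = path.parent / csv_path
    try:
        with gzip.open(csv_path, "rt", newline="") as handle:
            header = next(csv.reader(handle), [])
    except (OSError, EOFError) as exc:
        return [f"cannot read {csv_path}: {exc}"]

    # Columns the .json claims should exist in the .csv.gz.
    expected = [
        attr["column_name"]
        for attr in spec.get("attributes", [])
        if isinstance(attr, dict) and "column_name" in attr
    ]
    assets = spec.get("assets")
    if isinstance(assets, dict) and "column_name" in assets:
        expected.append(assets["column_name"])

    missing = [column for column in expected if column not in header]
    if missing:
        return [f"{csv_path} is missing columns: {missing}"]
    return []
```

An empty list means both files were readable and consistent with each other, which would point towards a problem elsewhere, e.g. in the data files themselves.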

marc-white · 2025-01-16T05:07:19Z

Having taken the chance to have a more detailed read of this proposal (which I agree is very much needed), I have a few more comments/suggestions:

I'd be against an automated catalog rebuild. There are multiple possible failure modes, which may require a quick fix, a more prolonged fix, or simply the removal of an experiment from the catalog. I think human intervention is desirable at that point, so it would be better if we just got a notification that something was amiss.
From a technical standpoint, I presume there is a mechanism for getting Gadi to run a small job periodically?

charles-turner-1 · 2025-01-20T00:54:20Z

> Having taken the chance to have a more detailed read of this proposal (which I agree is very much needed), I have a few more comments/suggestions:
>
> I'd be against an automated catalog rebuild. There are multiple possible failure modes, which may require a quick fix, a more prolonged fix, or simply the removal of an experiment from the catalog. I think human intervention is desirable at that point, so it would be better if we just got a notification that something was amiss.

I agree. Potentially a good way of handling this would be to ping the tracking services server with a notification that something has failed, which we can then get to send us an email or open an issue on this repo notifying of the failure.

> From a technical standpoint, I presume there is a mechanism for getting Gadi to run a small job periodically?

Can we run a persistent session with a cron job inside it?
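If a cron-style scheduler does turn out to be available, the polling script mainly needs to produce a readable report and a non-zero exit code on failure, so that whatever runs it can trigger a notification. A minimal sketch follows. `format_report` and `main` are hypothetical names, and the crontab paths in the comment are placeholders.

```python
import sys

# Hypothetical crontab entry to run the check at midnight every night:
#   0 0 * * * /path/to/python /path/to/poll_catalog.py


def format_report(failures: dict[str, str]) -> str:
    """Summarise failed datastores as plain text, suitable for an email or issue."""
    if not failures:
        return "All datastores opened successfully."
    lines = [f"{len(failures)} datastore(s) failed to open:"]
    lines += [f"  - {name}: {error}" for name, error in sorted(failures.items())]
    return "\n".join(lines)


def main(failures: dict[str, str]) -> int:
    """Print the report and return an exit code (0 = healthy, 1 = failures)."""
    stream = sys.stderr if failures else sys.stdout
    print(format_report(failures), file=stream)
    return 1 if failures else 0
```

A script would finish with something like `sys.exit(main(failures))`, where `failures` maps datastore names to error messages. This only reports; in line with the comments above, it does not rebuild or modify the catalog.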

charles-turner-1 added the enhancement New feature or request label Dec 12, 2024

github-project-automation bot added this to Model Evaluation & Diagnostics Dec 12, 2024

github-project-automation bot moved this to Backlog in Model Evaluation & Diagnostics Dec 12, 2024

charles-turner-1 mentioned this issue Dec 12, 2024

Lessons learned from v1.0.0 release #279
