Refactor `runCheck` method in `XMLDialect` class to remove system metadata related arguments #463

doulikecookiedough · 2025-01-09T20:19:23Z

Before runCheck is executed from runSuite, it appears that we also attach the system metadata to the XML results produced from the check. This should no longer be required unless the Solr index uses it in some form.

runCheck(Check check) {
    ...
    // include system metadata if available
    if (this.systemMetadata != null) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            TypeMarshaller.marshalTypeToOutputStream(systemMetadata, baos);
            variables.put("systemMetadata", baos.toString("UTF-8"));
            variables.put("datasource", systemMetadata.getOriginMemberNode().getValue());
            // dateUploaded
            // This unusual date format is acceptable to Solr - it must be GMT time, with
            // no offset
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
            df.setTimeZone(TimeZone.getTimeZone("GMT"));
            variables.put("dateUploaded", df.format(systemMetadata.getDateUploaded()));
            variables.put("authoritativeMemberNode", systemMetadata.getAuthoritativeMemberNode().getValue());
            variables.put("systemMetadataPid", systemMetadata.getIdentifier().getValue());
        } catch (Exception e) {
            log.error("Could not serialize SystemMetadata for check", e);
        }
    }

Investigate and then remove the system metadata related code if it is redundant.
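
For reference while investigating, the dateUploaded formatting in the snippet above can be exercised on its own. This is a minimal sketch, not code from metadig-engine: the class and method names are hypothetical, and only the pattern and the GMT time zone are taken from the snippet. It shows why the time zone has to be pinned: the trailing 'Z' in the pattern is a quoted literal, so it is printed regardless of the formatter's zone.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper class, for illustration only.
class SolrDateFormatSketch {

    // Same pattern as in runCheck. The 'Z' is a literal character, not a zone
    // designator, so the formatter must be set to GMT for the output to be correct.
    static String toSolrDate(Date date) {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        df.setTimeZone(TimeZone.getTimeZone("GMT"));
        return df.format(date);
    }

    public static void main(String[] args) {
        // The Unix epoch prints as 1970-01-01T00:00:00.000Z
        System.out.println(toSolrDate(new Date(0L)));
    }
}
```

A new SimpleDateFormat is created per call here, as in the original snippet, because SimpleDateFormat instances are not thread-safe.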


mbjones · 2025-01-09T21:08:13Z

@doulikecookiedough Take a look at the SOLR report that is stored for each run report -- you can tell that it includes some sysmeta fields with a command like this:

curl -s https://api.dataone.org/quality/runs/arctic.data.center.suite-1.2.0/doi:10.18739/A2P55DJ7H | jq .sysmeta

{
  "originMemberNode": "urn:node:ARCTIC",
  "rightsHolder": "http://orcid.org/0000-0003-1410-628X",
  "groups": [],
  "dateUploaded": "Jan 6, 2025, 9:04:29 PM",
  "formatId": "https://eml.ecoinformatics.org/eml-2.2.0",
  "obsoletes": "urn:uuid:28b531d7-61d2-412e-b3f4-8caf9c3d8ece",
  "obsoletedBy": null,
  "seriesId": null
}

In addition, sysmeta fields are also added directly into the SOLR record for each run, so that SOLR can be used to facet and group results. Here's an example SOLR query that shows the fields from sysmeta that are in the SOLR schema for the quality service (this only works if you first forward port 8983 to the cluster with kubectl port-forward service/metadig-solr 8983:8983):

❯ curl -s "http://localhost:8983/solr/quality/select?indent=true&q.op=OR&q=*%3A*&rows=2&start=0&wt-json" | jq .

{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "*:*",
      "indent": "true",
      "start": "0",
      "q.op": "OR",
      "wt-json": "",
      "rows": "2"
    }
  },
  "response": {
    "numFound": 89543,
    "start": 0,
    "numFoundExact": true,
    "docs": [
      {
        "metadataId": "doi:10.18739/A2KW57J9Q",
        "formatId": "https://nceas.ucsb.edu/mdqe/v1",
        "runId": "724bf66a-f5b3-425f-b968-25be0572733f",
        "suiteId": "FAIR-suite-0.3.1",
        "timestamp": "2022-01-11T01:14:37.701Z",
        "checksPassed": 32,
        "checksWarned": 9,
        "checksFailed": 10,
        "checksInfo": 0,
        "checksErrored": 0,
        "checkCount": 51,
        "scoreOverall": 0.7619048,
        "scoreByType_Interoperable_f": 0.78,
        "scoreByType_Reusable_f": 0.64,
        "scoreByType_Accessible_f": 0.62,
        "scoreByType_Findable_f": 0.93,
        "_version_": 1721618849384628224,
        "rightsHolder": "CN=DBO,DC=dataone,DC=org",
        "datasource": "urn:node:ARCTIC",
        "dateUploaded": "2020-07-23T17:16:13Z",
        "obsoletes": "doi:10.18739/A29S1KK5B",
        "metadataFormatId": "https://eml.ecoinformatics.org/eml-2.2.0",
        "group": [
          "CN=DBO,DC=dataone,DC=org"
        ]
      },
      {
        "metadataId": "doi:10.18739/A2KH0F05D",
        "formatId": "https://nceas.ucsb.edu/mdqe/v1",
        "runId": "fbae7d4b-9a10-4b6a-b265-229da29bd73c",
        "suiteId": "FAIR-suite-0.3.1",
        "timestamp": "2022-01-11T01:14:29.644Z",
        "checksPassed": 20,
        "checksWarned": 11,
        "checksFailed": 20,
        "checksInfo": 0,
        "checksErrored": 0,
        "checkCount": 51,
        "scoreOverall": 0.5,
        "scoreByType_Interoperable_f": 0.12,
        "scoreByType_Reusable_f": 0.2,
        "scoreByType_Accessible_f": 0.62,
        "scoreByType_Findable_f": 0.86,
        "_version_": 1721618843173912576,
        "rightsHolder": "CN=DBO,DC=dataone,DC=org",
        "datasource": "urn:node:ARCTIC",
        "dateUploaded": "2020-07-17T22:06:05Z",
        "obsoletes": "doi:10.18739/A2HH6C63Z",
        "metadataFormatId": "eml://ecoinformatics.org/eml-2.1.1",
        "group": [
          "CN=DBO,DC=dataone,DC=org"
        ]
      }
    ]
  }
}
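
To illustrate the faceting use case mentioned above, here is a rough sketch of how a facet query URL against the same select endpoint could be assembled. The class and method names are hypothetical, and it assumes that datasource is indexed in the quality schema so that it can be faceted on; facet, facet.field, rows and wt are standard Solr request parameters. It requires Java 10 or later for the URLEncoder overload that takes a Charset.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class SolrFacetQuerySketch {

    // Builds a select URL that returns no documents (rows=0) and asks Solr
    // for facet counts on a single field.
    static String facetUrl(String solrBase, String core, String facetField) {
        return solrBase + "/solr/" + core + "/select"
                + "?q=" + URLEncoder.encode("*:*", StandardCharsets.UTF_8)
                + "&rows=0"
                + "&facet=true"
                + "&facet.field=" + URLEncoder.encode(facetField, StandardCharsets.UTF_8)
                + "&wt=json";
    }

    public static void main(String[] args) {
        // Assumes the port-forward described above is active.
        System.out.println(facetUrl("http://localhost:8983", "quality", "datasource"));
    }
}
```

The printed URL can be passed to curl in the same way as the query above.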

doulikecookiedough · 2025-01-11T00:34:50Z

It appears that runCheck may not need to be refactored. It reads instance variables of its class (XMLDialect) to access a metadata document (EML) or systemMetadata, which it passes on to the dispatcher via Map<String, Object> variables.

try {
    result = dispatcher.dispatch(variables, code);
} catch (ScriptException e) {
    // report this
    result = new Result();
    result.setStatus(Status.ERROR);
    result.setOutput(new Output(e.getMessage()));
}

Since the checks themselves will, moving forward, use (or be refactored to use) hashstore to get all the data objects and system metadata they require, removing this code should not, in theory, pose an issue for the checks.

None of the checks currently access system metadata or check themselves
- We need to refactor the java code to just get streams, to pass to the checks
- Refactoring all the checks would be a huge setback
- getSystem M
runCheck itself updates the system metadata, so if it is attached beforehand it can proceed
It also makes the entire document (DOM) available to the check as a string; it is unclear at this time why this is needed.

// gather the variable name/value details
Map<String, Object> variables = new HashMap<String, Object>();
if (check.getSelector() != null) {
    for (Selector selector : check.getSelector()) {

        Document docToUse = document;
        if (selector.isNamespaceAware()) {
            docToUse = nsAwareDocument;
        }

        String name = selector.getName();
        Object value = this.selectPath(selector, docToUse);

        // make available in script
        variables.put(name, value);
    }
}

// make the entire dom available
// TODO: string seems like only viable option for all env
variables.put("document", toXmlString(document));
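
As a point of reference for the stream-based approach, the following sketch parses an XML stream into a DOM and serializes it back to a string, which is roughly what the "document" variable above ends up holding. It is an assumption-laden illustration: the class name and both methods here are hypothetical stand-ins written with the standard JAXP APIs, and the real toXmlString in XMLDialect may behave differently (for example with respect to the XML declaration or indentation).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical stand-in, for illustration only.
class DocumentVariableSketch {

    // Parse a metadata stream into a DOM; the flag mirrors the
    // document / nsAwareDocument distinction in the snippet above.
    static Document parse(InputStream in, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse metadata stream", e);
        }
    }

    // Serialize the DOM to a string so it can be handed to a script environment.
    static String toXmlString(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<eml><dataset><title>Example</title></dataset></eml>"
                .getBytes(StandardCharsets.UTF_8);
        Document document = parse(new ByteArrayInputStream(xml), false);
        System.out.println(toXmlString(document));
    }
}
```

Because the only input is an InputStream, the same path would work whether the bytes come from a file, an HTTP response, or a store such as hashstore.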

To Do:

- Investigate why the resourceMap and system metadata are attached to variables to determine if runCheck needs to be refactored/optimized or not
- Investigate whether we should import the hashstore-java library or find other means of accessing hashstore
- Investigate how we may possibly pass the location of a hashstore to the metadig-engine

doulikecookiedough · 2025-01-13T20:04:15Z

Check-in:

- The metadata document described above is not a resource map. It is an EML document that gets passed to a check directly, which may at times be used as is or parsed for select pieces of data.
- The checks themselves need to use this EML document, which is stored through Metacat as a data object and should be retrieved from hashstore moving forward.
- After speaking with Jeannette, it feels like a lot of unnecessary work to refactor the checks, especially since a check can proceed as is so long as it gets the stream to the EML document and has access to the system metadata for the Solr index.
- We may be able to find optimizations, but not all of the issues created to refactor code to remove system metadata may be required.

To Do:

- Create diagram to propose where in the flow we should import hashstore so that we can provide the streams to the sysmeta and eml itself
  - Hashstore requires the storePath to retrieve the configuration - also determine how we may receive the string to the store path.
- Determine which issues created for refactoring can be closed after discussing the proposed path forward
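
One possible way to receive the store path, sketched here only as a starting point for the diagram discussion, is a configuration property that is validated before use. Everything in this sketch is hypothetical: the property name metadig.store.directory is invented for illustration and is not an existing metadig-engine or hashstore setting, and no hashstore API is called.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
class StorePathSketch {

    // Invented property name; not an existing metadig-engine or hashstore setting.
    static final String STORE_PATH_KEY = "metadig.store.directory";

    // Reads the store path from configuration and fails early if it is
    // missing or does not point at an existing directory.
    static Path resolveStorePath(Properties config) {
        String value = config.getProperty(STORE_PATH_KEY);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required property: " + STORE_PATH_KEY);
        }
        Path path = Paths.get(value.trim()).toAbsolutePath().normalize();
        if (!Files.isDirectory(path)) {
            throw new IllegalArgumentException("Store path is not a directory: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        // The temp directory stands in for a real store location.
        config.setProperty(STORE_PATH_KEY, System.getProperty("java.io.tmpdir"));
        System.out.println(resolveStorePath(config));
    }
}
```

Failing early on a missing or invalid path keeps configuration problems separate from errors raised later while running checks.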

doulikecookiedough changed the title ~~Refactor runCheck method in MDQEngine class to remove system metadata related arguments~~ Refactor runCheck method in XMLDialect class to remove system metadata related arguments Jan 9, 2025

doulikecookiedough mentioned this issue Jan 10, 2025

Refactor processQualityRequest method in Controller class to remove system metadata related arguments #458

Open
