Refactor runCheck method in XMLDialect class to remove system metadata related arguments #463

Open
doulikecookiedough opened this issue Jan 9, 2025 · 3 comments

Comments

@doulikecookiedough

Before runCheck is executed from runSuite, it appears that we also attach the system metadata to the XML results produced by the check. This should no longer be required unless the Solr index uses it in some form.

runCheck(Check check) {
    ...
    // include system metadata if available
    if (this.systemMetadata != null) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            TypeMarshaller.marshalTypeToOutputStream(systemMetadata, baos);
            variables.put("systemMetadata", baos.toString("UTF-8"));
            variables.put("datasource", systemMetadata.getOriginMemberNode().getValue());
            // dateUploaded
            // This unusual date format is acceptable to Solr - it must be GMT time, with
            // no offset
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
            df.setTimeZone(TimeZone.getTimeZone("GMT"));
            variables.put("dateUploaded", df.format(systemMetadata.getDateUploaded()));
            variables.put("authoritativeMemberNode", systemMetadata.getAuthoritativeMemberNode().getValue());
            variables.put("systemMetadataPid", systemMetadata.getIdentifier().getValue());
        } catch (Exception e) {
            log.error("Could not serialize SystemMetadata for check", e);
        }
    }

Investigate and then remove the system metadata related code if it is redundant.
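
For reference, the dateUploaded formatting in the snippet above can be tried in isolation. This is only a minimal sketch using the JDK; the class and method names (SolrDateSketch, toSolrDate) are made up for illustration and are not part of metadig-engine.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateSketch {

    // Hypothetical helper: formats a date with the same pattern the runCheck snippet uses.
    static String toSolrDate(Date date) {
        // The trailing 'Z' is a quoted literal, not a zone specifier, so the
        // formatter has to be pinned to GMT for the value to be correct.
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        df.setTimeZone(TimeZone.getTimeZone("GMT"));
        return df.format(date);
    }

    public static void main(String[] args) {
        // Print the epoch start in the Solr-friendly format
        System.out.println(toSolrDate(new Date(0L)));
    }
}
```

Because the 'Z' is hard-coded in the pattern, leaving out the setTimeZone call would silently produce local times labeled as UTC.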

doulikecookiedough changed the title from "Refactor runCheck method in MDQEngine class to remove system metadata related arguments" to "Refactor runCheck method in XMLDialect class to remove system metadata related arguments" on Jan 9, 2025
@mbjones
Member

mbjones commented Jan 9, 2025

@doulikecookiedough Take a look at the SOLR report that is stored for each run -- you can see that it includes some sysmeta fields with a command like this:

curl -s https://api.dataone.org/quality/runs/arctic.data.center.suite-1.2.0/doi:10.18739/A2P55DJ7H | jq .sysmeta

{
  "originMemberNode": "urn:node:ARCTIC",
  "rightsHolder": "http://orcid.org/0000-0003-1410-628X",
  "groups": [],
  "dateUploaded": "Jan 6, 2025, 9:04:29 PM",
  "formatId": "https://eml.ecoinformatics.org/eml-2.2.0",
  "obsoletes": "urn:uuid:28b531d7-61d2-412e-b3f4-8caf9c3d8ece",
  "obsoletedBy": null,
  "seriesId": null
}

In addition, sysmeta fields are added directly into the SOLR record for each run, so that SOLR can be used to facet and group results. Here's an example SOLR query that shows the sysmeta fields that are in the SOLR schema for the quality service (this only works if you first forward port 8983 to the cluster with kubectl port-forward service/metadig-solr 8983:8983):

❯ curl -s "http://localhost:8983/solr/quality/select?indent=true&q.op=OR&q=*%3A*&rows=2&start=0&wt-json" | jq .

{
  "responseHeader": {
    "status": 0,
    "QTime": 2,
    "params": {
      "q": "*:*",
      "indent": "true",
      "start": "0",
      "q.op": "OR",
      "wt-json": "",
      "rows": "2"
    }
  },
  "response": {
    "numFound": 89543,
    "start": 0,
    "numFoundExact": true,
    "docs": [
      {
        "metadataId": "doi:10.18739/A2KW57J9Q",
        "formatId": "https://nceas.ucsb.edu/mdqe/v1",
        "runId": "724bf66a-f5b3-425f-b968-25be0572733f",
        "suiteId": "FAIR-suite-0.3.1",
        "timestamp": "2022-01-11T01:14:37.701Z",
        "checksPassed": 32,
        "checksWarned": 9,
        "checksFailed": 10,
        "checksInfo": 0,
        "checksErrored": 0,
        "checkCount": 51,
        "scoreOverall": 0.7619048,
        "scoreByType_Interoperable_f": 0.78,
        "scoreByType_Reusable_f": 0.64,
        "scoreByType_Accessible_f": 0.62,
        "scoreByType_Findable_f": 0.93,
        "_version_": 1721618849384628224,
        "rightsHolder": "CN=DBO,DC=dataone,DC=org",
        "datasource": "urn:node:ARCTIC",
        "dateUploaded": "2020-07-23T17:16:13Z",
        "obsoletes": "doi:10.18739/A29S1KK5B",
        "metadataFormatId": "https://eml.ecoinformatics.org/eml-2.2.0",
        "group": [
          "CN=DBO,DC=dataone,DC=org"
        ]
      },
      {
        "metadataId": "doi:10.18739/A2KH0F05D",
        "formatId": "https://nceas.ucsb.edu/mdqe/v1",
        "runId": "fbae7d4b-9a10-4b6a-b265-229da29bd73c",
        "suiteId": "FAIR-suite-0.3.1",
        "timestamp": "2022-01-11T01:14:29.644Z",
        "checksPassed": 20,
        "checksWarned": 11,
        "checksFailed": 20,
        "checksInfo": 0,
        "checksErrored": 0,
        "checkCount": 51,
        "scoreOverall": 0.5,
        "scoreByType_Interoperable_f": 0.12,
        "scoreByType_Reusable_f": 0.2,
        "scoreByType_Accessible_f": 0.62,
        "scoreByType_Findable_f": 0.86,
        "_version_": 1721618843173912576,
        "rightsHolder": "CN=DBO,DC=dataone,DC=org",
        "datasource": "urn:node:ARCTIC",
        "dateUploaded": "2020-07-17T22:06:05Z",
        "obsoletes": "doi:10.18739/A2HH6C63Z",
        "metadataFormatId": "eml://ecoinformatics.org/eml-2.1.1",
        "group": [
          "CN=DBO,DC=dataone,DC=org"
        ]
      }
    ]
  }
}
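
To illustrate the faceting use case mentioned above, here is a rough sketch that builds a SOLR facet query URL over one of the sysmeta-derived fields (datasource). It assumes the same port-forwarded host and core as the curl command above and uses the standard SOLR facet and facet.field parameters; the class and method names are hypothetical, and the code only builds the URL without sending a request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class SolrFacetQuerySketch {

    // Hypothetical helper: builds a select URL that facets on a single field.
    static String buildFacetUrl(String selectUrl, String facetField) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");              // match every run record
        params.put("rows", "0");             // only the facet counts are wanted
        params.put("facet", "true");
        params.put("facet.field", facetField);
        params.put("wt", "json");

        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return selectUrl + "?" + query;
    }

    public static void main(String[] args) {
        // Assumes the port-forward described above is active
        System.out.println(buildFacetUrl("http://localhost:8983/solr/quality/select", "datasource"));
    }
}
```

The printed URL can be passed to curl in the same way as the query above; a query like this would stop returning useful counts if the sysmeta fields were no longer written to the SOLR record.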

@doulikecookiedough
Author

doulikecookiedough commented Jan 11, 2025

It appears that runCheck may not need to be refactored. It reads instance variables of its class (XMLDialect) to access a metadata document (EML) or systemMetadata, which it passes on to the dispatcher via the Map<String, Object> variables.

try {
    result = dispatcher.dispatch(variables, code);
} catch (ScriptException e) {
    // report this
    result = new Result();
    result.setStatus(Status.ERROR);
    result.setOutput(new Output(e.getMessage()));
}
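
As a self-contained illustration of this pattern (a script failure becomes an ERROR result instead of propagating), here is a sketch with stand-in types. The Dispatcher, Result and Status types below are simplified placeholders written for this example, not the real metadig-engine classes.

```java
import java.util.HashMap;
import java.util.Map;
import javax.script.ScriptException;

public class DispatchSketch {

    // Placeholder for the engine's status values
    enum Status { SUCCESS, ERROR }

    // Placeholder result type holding a status and a message
    static final class Result {
        final Status status;
        final String output;

        Result(Status status, String output) {
            this.status = status;
            this.output = output;
        }
    }

    // Placeholder dispatcher: runs check code against the gathered variables
    interface Dispatcher {
        Result dispatch(Map<String, Object> variables, String code) throws ScriptException;
    }

    // Mirrors the try/catch above: a ScriptException is reported as an ERROR result
    static Result runWithFallback(Dispatcher dispatcher, Map<String, Object> variables, String code) {
        try {
            return dispatcher.dispatch(variables, code);
        } catch (ScriptException e) {
            return new Result(Status.ERROR, e.getMessage());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("document", "<eml/>");

        // A dispatcher that always fails, to exercise the fallback path
        Dispatcher failing = (vars, code) -> {
            throw new ScriptException("script failed");
        };
        Result result = runWithFallback(failing, variables, "irrelevant");
        System.out.println(result.status + ": " + result.output);
    }
}
```

The point of the sketch is that whatever ends up in variables (document, systemMetadata, selector values) is all the check code can see at dispatch time.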

Since the checks themselves will, moving forward, use (or be refactored to use) hashstore to get all the data objects and system metadata they require, removing this code should not, in theory, pose an issue for the checks.

  • None of the checks currently access system metadata or check themselves
    • We need to refactor the java code to just get streams, to pass to the checks
    • Refactoring all the checks would be a huge setback
    • getSystem M
  • runCheck itself updates the system metadata, so if it is attached beforehand it can proceed
  • It also makes the entire document (DOM) available to the check; it is unclear at this time why this is needed

// gather the variable name/value details
Map<String, Object> variables = new HashMap<String, Object>();
if (check.getSelector() != null) {
    for (Selector selector : check.getSelector()) {

        Document docToUse = document;
        if (selector.isNamespaceAware()) {
            docToUse = nsAwareDocument;
        }

        String name = selector.getName();
        Object value = this.selectPath(selector, docToUse);

        // make available in script
        variables.put(name, value);
    }
}

// make the entire dom available
// TODO: string seems like only viable option for all env
variables.put("document", toXmlString(document));
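
To make the selector step concrete, here is a standalone sketch of the same idea using only the JDK XML APIs: evaluate named XPath expressions against a parsed document, then add the whole document as a string. The class name, the gatherVariables helper and the plain name-to-XPath map are assumptions made for this example; the real code uses Selector objects, selectPath and toXmlString, whose internals are not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class SelectorVariablesSketch {

    // Hypothetical helper: selectors maps a variable name to an XPath expression.
    static Map<String, Object> gatherVariables(String xml, Map<String, String> selectors) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Evaluate each selector and expose its string value under the selector name
            Map<String, Object> variables = new HashMap<>();
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> selector : selectors.entrySet()) {
                variables.put(selector.getKey(), xpath.evaluate(selector.getValue(), document));
            }

            // Serialize the whole DOM to a string, as the "document" variable above does
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            variables.put("document", out.toString());
            return variables;
        } catch (Exception e) {
            throw new IllegalStateException("Could not gather variables", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> selectors = new HashMap<>();
        selectors.put("title", "/eml/dataset/title");
        String xml = "<eml><dataset><title>Example dataset</title></dataset></eml>";
        System.out.println(gatherVariables(xml, selectors));
    }
}
```

Note that this sketch ignores namespace-aware selectors, which the real code handles by switching to nsAwareDocument.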

To Do:

  • Investigate why the resourceMap and system metadata are attached to variables to determine if runCheck needs to be refactored/optimized or not
  • Investigate whether we should import the hashstore-java library or find other means of accessing hashstore
  • Investigate how we can pass the location of a hashstore to the metadig-engine

@doulikecookiedough
Author

doulikecookiedough commented Jan 13, 2025

Check-in:

  • The metadata document described above is not a resource map. It is an EML document that gets passed to a check directly, which may at times be used as is or parsed for select pieces of data.
  • The checks themselves need to use this EML document, which is stored through Metacat as a data object and should be retrieved from hashstore moving forward.
  • After speaking with Jeannette, refactoring the checks feels like a lot of unnecessary work, especially since a check can proceed as is so long as it gets the stream to the EML document and access to the system metadata for the Solr index.
  • We may be able to find optimizations, but not all of the issues created to refactor code to remove system metadata may be required.

To Do:

  • Create a diagram proposing where in the flow we should import hashstore so that we can provide streams to the sysmeta and the EML document itself
    • Hashstore requires the storePath to retrieve its configuration; also determine how we will receive the store path string.
  • Determine which issues created for refactoring can be closed after discussing the proposed path forward
