
Update data.json to comply with DCAT-US Metadata Schema #89

Open
rcheetham opened this issue Aug 25, 2023 · 4 comments

Comments

@rcheetham
Collaborator

The federal data.gov catalog is able to harvest catalogs from local jurisdictions. They used to harvest data from OpenDataPhilly but have suspended this harvesting because our data.json file is not compliant with the current metadata schema.

Our data.json file says it's version 1.1, but when we run the validator at https://catalog.data.gov/dcat-us/validator against https://opendataphilly.org/data.json, there are a lot of errors, including:

  • missing required properties on Dataset, for example: bureauCode, programCode, keyword, modified, publisher, identifier, accessLevel
  • missing point of contact
  • all of our download points seem not to match the schema
  • they don't like the custom City of Philadelphia license (we probably can't fix this)
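As a reference for working through the missing-property errors above, here is a minimal sketch of a check against the required DCAT-US fields. The field list follows the DCAT-US 1.1 schema's always-required dataset properties (federal-only fields like bureauCode and programCode are left out); the example dataset values are hypothetical.

```python
# Sketch: flag DCAT-US 1.1 always-required dataset fields that are missing
# from a data.json dataset entry. Example values are hypothetical.
REQUIRED_FIELDS = ["title", "description", "keyword", "modified",
                   "publisher", "contactPoint", "identifier", "accessLevel"]

def missing_fields(dataset: dict) -> list:
    """Return required DCAT-US fields absent from a dataset entry."""
    return [f for f in REQUIRED_FIELDS if f not in dataset]

example = {
    "title": "Example Dataset",
    "description": "A hypothetical dataset entry.",
    "keyword": ["example"],
    "modified": "2023-08-25",
    "publisher": {"@type": "org:Organization", "name": "Example Org"},
    "contactPoint": {"@type": "vcard:Contact", "fn": "Example Contact",
                     "hasEmail": "mailto:contact@example.org"},
    "accessLevel": "public",
}

print(missing_fields(example))  # → ['identifier']
```

Running a check like this over every entry in data.json would show how much of the catalog trips each error before re-submitting to the validator.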
@BryanQuigley
Member

This is an upstream JKAN issue, although I haven't checked recently whether JKAN complies with its single sample dataset. I believe some items just needed to be output in a different field.

Some items like bureauCode can IMO be ignored, as that's specific to the US government. They used to have a different validator for non-federal orgs, but I can't find that option anymore.

@rcheetham
Collaborator Author

I reached out to the Data.gov folks. Many of the "mandatory" fields are only mandatory for federal agencies. They ran the validator for non-federal datasets and identified only one mandatory field that we don't have:

We ran the harvest of the source and all the datasets failed for lack of an identifier. We don't have many mandatory fields, but identifier is one of them. Schema explanation is at resources.data.gov

We would need to have a unique id expressed in the "identifier" field.

@BryanQuigley
Member

It's a bit more complicated than I had hoped - but I understand why:

"This field allows third parties to maintain a consistent record for datasets even if title or URLs are updated. Agencies may integrate an existing system for maintaining unique identifiers. Each identifier must be unique across the agency’s catalog and remain fixed. It is highly recommended that a URI (preferably an HTTP URL) be used to provide a globally unique identifier. Identifier URLs should be designed and maintained to persist indefinitely regardless of whether the URL of the resource itself changes."
https://resources.data.gov/resources/dcat-us/#identifier

I see two options:

  • Just use the filename, figuring that will change much less often than the title or URL
  • Generate one based on org/title/first URL, and save the generated value to each dataset file

@rcheetham
Collaborator Author

@BryanQuigley I received an annual check-in email from Data.gov asking for our harvest information, which caused me to come back to this.

After editing the catalog over the past year, I have definitely seen several examples of the filename changing, as well as files being deleted. However, I also think it is the simplest approach, as the identifier could be generated in the data.json file without having to persist a unique ID and deal with all of the issues that would go along with that.

So I propose we implement this using the filename as the unique ID.
