
Update data.json to comply with DCAT-US Metadata Schema #89

Open
rcheetham opened this issue Aug 25, 2023 · 4 comments

Comments

@rcheetham
Collaborator

The federal data.gov catalog is able to harvest catalogs from local jurisdictions. They used to harvest data from OpenDataPhilly but have suspended this harvesting because our data.json file is not compliant with the current metadata schema.

Our data.json file says it's version 1.1, but when we run the validator at https://catalog.data.gov/dcat-us/validator against https://opendataphilly.org/data.json, there are a lot of errors, including:

  • missing required properties on Dataset, for example: bureauCode, programCode, keyword, modified, publisher, identifier, accessLevel
  • missing point of contact
  • all of our download points seem not to match the schema
  • they don't like the custom City of Philadelphia license (we probably can't fix this)
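As a reference for working through the missing-property errors above, here is a minimal sketch of a check against the required DCAT-US fields. The field list follows the DCAT-US 1.1 schema's always-required dataset properties (federal-only fields like bureauCode and programCode are left out); the example dataset values are hypothetical.

```python
# Sketch: flag DCAT-US 1.1 always-required dataset fields that are missing
# from a data.json dataset entry. Example values are hypothetical.
REQUIRED_FIELDS = ["title", "description", "keyword", "modified",
                   "publisher", "contactPoint", "identifier", "accessLevel"]

def missing_fields(dataset: dict) -> list:
    """Return required DCAT-US fields absent from a dataset entry."""
    return [f for f in REQUIRED_FIELDS if f not in dataset]

example = {
    "title": "Example Dataset",
    "description": "A hypothetical dataset entry.",
    "keyword": ["example"],
    "modified": "2023-08-25",
    "publisher": {"@type": "org:Organization", "name": "Example Org"},
    "contactPoint": {"@type": "vcard:Contact", "fn": "Example Contact",
                     "hasEmail": "mailto:contact@example.org"},
    "accessLevel": "public",
}

print(missing_fields(example))  # → ['identifier']
```

Running a check like this over every entry in data.json would show how much of the catalog trips each error before re-submitting to the validator.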
@BryanQuigley
Member

This is an upstream JKAN issue, although I haven't checked recently whether JKAN complies with its single sample dataset. I believe some items just needed to be output in a different field.

Some items like bureauCode can IMO be ignored, as that's specific to the US government. They used to have a different validator for non-federal orgs, but I can't find that option anymore.

@rcheetham
Collaborator Author

I reached out to the Data.gov folks. Many of the "mandatory" fields are only mandatory for federal agencies. They ran the validator for non-federal datasets and identified only one mandatory field that we don't have:

We ran the harvest of the source and all the datasets failed for lack of an identifier. We don't have many mandatory fields, but identifier is one of them. Schema explanation is at resources.data.gov

We would need to have a unique id expressed in the "identifier" field.

@BryanQuigley
Member

It's a bit more complicated than I had hoped - but I understand why:

"This field allows third parties to maintain a consistent record for datasets even if title or URLs are updated. Agencies may integrate an existing system for maintaining unique identifiers. Each identifier must be unique across the agency’s catalog and remain fixed. It is highly recommended that a URI (preferably an HTTP URL) be used to provide a globally unique identifier. Identifier URLs should be designed and maintained to persist indefinitely regardless of whether the URL of the resource itself changes."
https://resources.data.gov/resources/dcat-us/#identifier

I see two options:

  • Just use the filename, figuring that will change much less often than the title or URL
  • Generate one based on org/title/first URL, and save the generated value to each dataset file

@rcheetham
Collaborator Author

@BryanQuigley I received an annual check-in email from Data.gov asking for our harvest information, which caused me to come back to this.

After editing the catalog over the past year, I have definitely seen several examples of the filename changing, as well as files being deleted. However, I also think it is the simplest approach, as the identifier could be generated in the data.json file without having to persist a unique ID and deal with all of the issues that would go along with that.

So I propose we implement this using the filename as the unique ID.
