What is the maximum number of metadata files a JKAN site can handle? #290

Open
pzwsk opened this issue Dec 24, 2024 · 2 comments


pzwsk commented Dec 24, 2024

Hi there,

I am considering continuing to use JKAN for the Risk Data Library catalog https://jkan.riskdatalibrary.org/datasets/ (currently an MVP product).

However, we would potentially have a lot of files to handle, hopefully several thousand at some point.

I am therefore trying to understand how many metadata files a JKAN site can handle before it stops functioning well.

Hosting

This of course depends on server size. In our case, the site is hosted on GitHub Pages for now, though we might consider an alternative. GitHub advises keeping a repo under 1 GB, so if we assume 2 KB per metadata file, the limit is about 500,000 files.
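The storage estimate above can be checked with a short sketch. It assumes the whole repo budget goes to metadata files and ignores git history, site assets, and build output, so the real ceiling would be lower. The function name `max_files` is hypothetical.

```python
def max_files(repo_limit_bytes: int, avg_file_bytes: int) -> int:
    """Return how many files of the average size fit under the repo limit."""
    return repo_limit_bytes // avg_file_bytes


# Decimal units (1 GB = 10**9 bytes, 2 KB = 2,000 bytes), as in the estimate above.
decimal_estimate = max_files(10**9, 2_000)

# Binary units (1 GiB = 2**30 bytes, 2 KiB = 2,048 bytes) give a slightly higher figure.
binary_estimate = max_files(2**30, 2_048)

print(decimal_estimate)  # 500000
print(binary_estimate)   # 524288
```

Either way the result is in the hundreds of thousands, far above the several thousand files being considered.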

Search and filtering

I am not sure which search engine JKAN uses, but I am guessing the practical file limit is much lower than the storage limit, since the index needs to be downloaded and processed on the client side?

Any help appreciated.

@BryanQuigley
Collaborator

I think it should be fine with a few thousand, but I am not aware of any JKAN sites using more than 700 datasets, so I'm not sure anyone has tested it.

As I think you identified:
Hosting - really shouldn't be a problem
Search - might be an issue at that scale (though I'm still guessing not). It's something we've discussed ways to improve, and we would be open to better ideas. There is also nothing preventing you from just using an external search engine.

Happy to review any PRs you want to merge back too.

@timwis
Owner

timwis commented Jan 13, 2025

Hey @pzwsk! 👋🏻 A few thousand would be fine. A few hundred thousand would probably be jittery. The two bottlenecks are:

  1. The /datasets.json file is loaded into memory when viewing the Datasets page (e.g. https://jkan.riskdatalibrary.org/datasets.json)
  2. We don't currently have pagination on the Datasets page, so it will show every item at the moment
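To make the two bottlenecks concrete, here is a minimal sketch in Python (not JKAN's actual code, which runs in the browser). It assumes `datasets.json` is a JSON array of objects; that shape, the sample records, and the names `payload_summary`, `paginate`, and `page_count` are all hypothetical.

```python
import json
import math


def payload_summary(datasets_json_text: str) -> tuple[int, int]:
    """Return (number of records, size in bytes) for a datasets.json payload."""
    records = json.loads(datasets_json_text)
    return len(records), len(datasets_json_text.encode("utf-8"))


def paginate(items: list, page: int, per_page: int = 20) -> list:
    """Return the items for a 1-based page number; pages past the end are empty."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    return items[start:start + per_page]


def page_count(total_items: int, per_page: int = 20) -> int:
    """Return how many pages are needed to show every item."""
    return math.ceil(total_items / per_page)


# Hypothetical sample standing in for the real datasets.json content.
sample = [{"title": f"Dataset {i}"} for i in range(45)]
text = json.dumps(sample)

count, size_bytes = payload_summary(text)
print(count, page_count(count))   # 45 records spread over 3 pages
print(len(paginate(sample, 3)))   # the last page holds the remaining 5 items
```

Pagination only limits how many items are rendered at once. The whole file is still downloaded and parsed, so the first bottleneck remains unless search moves elsewhere.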

Adding pagination to the Datasets page is pretty straightforward, but if you want search to still work, you'll need to use something like Algolia or Swiftype (or an open source, self-hosted alternative). You'd give the search engine that datasets.json file and tell it how to render the results on the Datasets page. We haven't done that because we aimed to keep the setup simple, but it wouldn't be too difficult to implement.
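As a rough illustration of handing the records to a search index instead of filtering the full list on the page, the sketch below builds a tiny in-memory inverted index. This is a stand-in technique, not the Algolia or Swiftype API, and the `title` and `notes` field names and the sample records are assumptions rather than JKAN's confirmed schema.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records: list[dict], fields=("title", "notes")) -> dict[str, set[int]]:
    """Map each token to the positions of the records that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for position, record in enumerate(records):
        for field in fields:
            for token in tokenize(str(record.get(field, ""))):
                index[token].add(position)
    return index


def search(index: dict[str, set[int]], records: list[dict], query: str) -> list[dict]:
    """Return the records that contain every token in the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return [records[i] for i in sorted(matches)]


# Hypothetical records; the field names are assumed, not taken from JKAN.
records = [
    {"title": "Flood hazard map", "notes": "River flood depth for Nepal"},
    {"title": "Earthquake exposure", "notes": "Building inventory for Nepal"},
    {"title": "Flood losses", "notes": "Historical losses in Peru"},
]
index = build_index(records)

for hit in search(index, records, "flood nepal"):
    print(hit["title"])  # Flood hazard map
```

A hosted or self-hosted engine would do the same job server-side, so the browser would only receive one page of matching results rather than the full datasets.json file.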
