feat: Move mirror creation to the cloud #123
Interesting! @galargh any idea how this cloud setup would behave/cost when a bigger ZIM is used?
The biggest one is wikipedia_en_all_maxi_2021-12.zim – 87G compressed, ~360GB unpacked on IPFS.
I was able to build it on a 1TB SSD; everything else was too slow. I wonder if we could get an emergency fast-build capability with this setup, but I am not sure what the cost would be (and how it compares to buying a physical 1TB SSD).
Any guesstimates / back-of-the-napkin calculations?
Sidenote: Be mindful that the more we invest in the unpacking process, the harder it will be to avoid the sunk cost fallacy. On principle, I am worried about subsidizing AWS with all this unnecessary unpacking and building, and would rather donate the money to https://www.kiwix.org/en/support/ or fund a devgrant to remove the need for unpacking ZIMs (https://github.com/ipfs/devgrants/blob/devgrant/kiwix-js/targeted-grants/kiwix-js.md, #42 (comment)).
Good point, I didn't think of bigger ZIMs at all. Bumping the SSD to 1TB on EC2 in the current setup would put the total yearly cost at around $1500 (see https://calculator.aws/#/estimate?id=bd698299f8d4943c9d41130fbaac0fa4d28bf24a). That assumes the machine runs all year round and we stick with the default volume properties. Bigger ZIMs also pose a problem for GitHub Actions, which by default only gets ~30GB of disk space. My main goal with this setup was to support contributors (such as myself) who cannot complete the mirror creation on their own machines. I think in the meantime I might have gone a bit overboard, so let me take a step back and propose some simplifications. Let's:
I'll share the updated code shortly. Thank you for your comments, I found them really helpful :)
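To make the disk-space constraint from this thread concrete (roughly 360GB unpacked for the largest ZIM, against ~30GB on a default GitHub Actions setup), a pre-flight check could look something like the sketch below. The function names `available_gb` and `has_room_for` are hypothetical and not part of this repository; the 360GB figure is simply the number quoted above.

```shell
#!/bin/sh
# Hypothetical pre-flight check; function names are illustrative assumptions,
# not part of this repository.

# Print the free space, in whole GB, of the filesystem that holds the given path.
available_gb() {
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if at least <required_gb> is free at <path>.
has_room_for() {
  required_gb=$1
  path=$2
  [ "$(available_gb "$path")" -ge "$required_gb" ]
}

# Example: the largest ZIM mentioned above unpacks to roughly 360GB.
if has_room_for 360 .; then
  echo "enough space for the unpacked mirror"
else
  echo "not enough space: $(available_gb .)GB free, 360GB needed"
fi
```

Running such a check before the unpacking step would let a build fail early instead of running out of disk hours into the process.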
On the zimdump side, we still have the potential to improve speed with openzim/zim-tools#69. We made this multithreading implementation for zimcheck with success (but I don't remember the benchmarking results).
I'm closing this PR in favor of one that only updates the
I've seen #120 and got inspired to try building a Wikipedia mirror myself to see how it gets done. To make the exercise more useful, I decided to try to hunt down the issue that affected Belarusian Wikipedia mentioned in the original issue. Since the steps were taking quite a long time on my machine, I decided to go beyond the original scope and take as many parts of the process off my machine as possible.
In this PR:
- I update `Dockerfile` so that it is on par with `README.md`, and I make it fully prepared to execute `mirrorzim.sh`.

Original Proposal (outdated)
- `mirrorzim.sh`, runs `mirrorzim.sh` in a container created from the newly updated `Dockerfile`, packages the outputs as `tar.gz`, and finally uploads them to S3 (the S3 creation is automated in the `terraform` directory added in this PR too)

Testing
- `terraform apply` in `terraform/ecr` to create a public ECR repository.
- `docker build . --platform=linux/amd64 -f Dockerfile -t public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror -t public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror:$(date -u +%F) -t public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror:$(date -u +%s)` to create a docker image.
- `docker push --all-tags public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror` to push the image to public ECR.
- `terraform apply` in `terraform/ec2` to create an EC2 instance for myself.
- `ssh -i <private_key> ec2-user@<public_dns>` to ssh into the machine.
- `docker run --name wikipedia-on-ipfs --ulimit nofile=65536:65536 -d -p 4001:4001/tcp -p 4001:4001/udp public.ecr.aws/c4h1q7d1/distributed-wikipedia-mirror:latest --languagecode=be --wikitype=wikipedia` to create and publish a Belarusian Wikipedia mirror.
- `docker logs` to show the `CID` - `bafybeihs2ql4lnd5v7oscxwbblsqnp6krlvbvak4k3wmqvnb32cg73cpiq`.
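For the last step, the CID can be pulled out of the container output instead of being copied by hand. The sketch below is only an illustration: `extract_cid` is a hypothetical helper, the sample log line is made up, and the pattern assumes a 59-character base32 CIDv1 starting with `bafy`, like the one shown above. The actual log format of the container is not documented in this PR.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Reads text on stdin and prints the last token that looks like a
# 59-character base32 CIDv1 ("bafy" followed by 55 characters from a-z, 2-7).
extract_cid() {
  grep -oE 'bafy[a-z2-7]{55}' | tail -n 1
}

# Assumed usage against the container started above:
#   docker logs wikipedia-on-ipfs 2>&1 | extract_cid

# Self-contained example with a made-up log line:
printf 'published: %s\n' 'bafybeihs2ql4lnd5v7oscxwbblsqnp6krlvbvak4k3wmqvnb32cg73cpiq' | extract_cid
```

Taking the last match is a design choice for logs that may mention intermediate CIDs before the final one; if the container prints only one, `tail -n 1` is harmless.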

Original Proposal (outdated)
- `packer build wikipedia-on-ipfs.pkr.hcl`, which successfully created `ami-02ff7a8cff61c5d41`.
- `terraform apply`, which successfully created S3 and EC2 (I think we should split the configs for the two). It spit out the following notice:
- `> publish.out publish_website_from_s3.sh 'wikipedia_be_all_maxi_2022-03' &`
- `publish_website_from_s3.sh` is not running anymore and copied the CID from `publish.out`
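The "wait until the script is no longer running, then read `publish.out`" step could be automated along these lines. This is a minimal sketch under stated assumptions: `wait_for_exit` is a hypothetical helper, and a short `sleep` plus `echo` stands in for the real publish script, whose output format is not shown in this PR.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Block until the process with the given PID is gone, polling every <interval> seconds.
wait_for_exit() {
  pid=$1
  interval=${2:-5}
  # kill -0 sends no signal; it only reports whether the process still exists.
  while kill -0 "$pid" 2>/dev/null; do
    sleep "$interval"
  done
}

# Stand-in for the long-running publish step: writes a fake result after a delay.
out=$(mktemp)
( sleep 1; echo "example-cid" ) > "$out" &

wait_for_exit "$!" 1
# The last line of the output file is assumed to hold the result.
tail -n 1 "$out"
```

Polling with `kill -0` also works when the job was started from another session, which the shell builtin `wait` cannot handle.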
TODO