Support non-utf8 encoding for zip files #3389
Conversation
It has been a week and there has been no response at all. Is anything wrong with this PR? Is Stash accepting PRs from passerby devs?
Reviewing PRs takes some time. I think at the moment there are some open PRs that need to be merged first for the next stable version, so it will probably take a little while after that. Regarding your PR, there is an extra dependency on another open PR you made against the upstream chardet library. Given that it may be accepted as is or modified, I think it is better to use your fork directly (instead of redirecting via replace in the mod file) and not
This PR currently uses my fork via a replace directive. I'm using the upstream package name; otherwise I'd have to update the package name in my fork (at least in go.mod) for dependency resolution to work. If that's acceptable, I'll make the change later today.
With the latest PR everything looks OK.
I have a bunch of zips with unusual encoding, but they're all too large and NSFW ¯\_(ツ)_/¯ You can create one by
Here are two samples: gbk.zip, shift-jis.zip. As you can see, the detection is not 100% accurate, because Shift_JIS text containing only kana can also be decoded as valid EUC-JP. For GBK, the detector needs more text or a proper hint to get an accurate result. A wrong detection doesn't affect the scan, and the images can still be loaded properly :) My Stash has 836 zip files containing 57,472 images in total (after deduplicating identical files), all scanned without any error. Even misdetection is rare, since those zip files contain many images and concatenating all filenames gives the detector a sufficient sample to work with.
@xWTF thanks for the samples
Problem
We all know that filenames in zip files should, in theory, only be encoded with `CP437` or `UTF-8` (EFS flag set). But for historical reasons there are many zip files floating around that use an OEM encoding: `Shift_JIS` from Japan, `GBK` from China, etc.
When Stash tries to scan a zip file containing such non-UTF-8 names, taking Shift_JIS as an example:
- `ノエル` in Shift_JIS is presented as `�m�G��` in UTF-8
- `Reader.Open` is used instead of `zip.File.Open`:
- `Reader.Open` calls `fs.ValidPath` (ref)
- `fs.ValidPath` then calls `utf8.ValidString` immediately (ref)
- The result is an `invalid argument` error
Solution?
The only way to fix this without modifying the code of `archive/zip/reader.go` itself seems to be to copy and paste the related path-resolving and file-open logic into our own file. I know this is hacky, but I can't really find a better solution.
Not sure if this violates Go's BSD-3 license; I'm not familiar with it, and we might need another solution if it does 😢
Forget the old copy-paste approach at xWTF@718d37a; it works, but the implementation is ugly.
New Solution!
The `zip.Reader` builds its index list lazily, on the first call to `Open`, so we can decode the names before that, and it works!
Changes summary:
This dependency could be replaced once "Enforce order for results with same confidence" (gogs/chardet#10) is merged
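For reviewers who want to see the idea in isolation, here is a minimal standard-library sketch of both the failure and the workaround described under New Solution. It is not the code from this PR: `buildZip`, `openAndRead`, `rawName`, and `toyDecode` are hypothetical names, and `toyDecode` is a three-character lookup table standing in for a real Shift_JIS decoder plus charset detection. The sketch also uses `utf8.ValidString` as a simplified "needs decoding" check.

```go
package main

import (
	"archive/zip"
	"bytes"
	"errors"
	"fmt"
	"io"
	"io/fs"
	"strings"
	"unicode/utf8"
)

// rawName is "ノエル.txt" encoded as Shift_JIS; it is not valid UTF-8.
const rawName = "\x83\x6d\x83\x47\x83\x8b.txt"

// toyTable is a hypothetical stand-in for a real Shift_JIS decoder.
// It only knows the three katakana used in this example.
var toyTable = map[string]string{
	"\x83\x6d": "ノ",
	"\x83\x47": "エ",
	"\x83\x8b": "ル",
}

// toyDecode maps known two-byte sequences and passes other bytes through.
func toyDecode(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		if i+1 < len(s) {
			if r, ok := toyTable[s[i:i+2]]; ok {
				b.WriteString(r)
				i += 2
				continue
			}
		}
		b.WriteByte(s[i])
		i++
	}
	return b.String()
}

// buildZip creates an in-memory zip archive with a single entry.
func buildZip(name, content string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	fw, err := w.Create(name)
	if err != nil {
		return nil, err
	}
	if _, err := fw.Write([]byte(content)); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// openAndRead opens name through zip.Reader.Open and returns its content.
// If decode is non-nil, entry names that are not valid UTF-8 are rewritten
// BEFORE the first Open call, i.e. before the reader builds its index.
func openAndRead(data []byte, name string, decode func(string) string) (string, error) {
	r, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	if decode != nil {
		for _, f := range r.File {
			if !utf8.ValidString(f.Name) {
				f.Name = decode(f.Name)
			}
		}
	}
	f, err := r.Open(name)
	if err != nil {
		return "", err
	}
	defer f.Close()
	b, err := io.ReadAll(f)
	return string(b), err
}

func main() {
	data, err := buildZip(rawName, "hello")
	if err != nil {
		panic(err)
	}
	// Without decoding, the raw name is rejected by fs.ValidPath.
	_, err = openAndRead(data, rawName, nil)
	fmt.Println("raw open rejected as invalid:", errors.Is(err, fs.ErrInvalid))

	// With names decoded up front, the entry opens under its UTF-8 name.
	content, err := openAndRead(data, "ノエル.txt", toyDecode)
	fmt.Println(content, err)
}
```

In real use the toy table would be replaced by a proper decoder chosen from the detected charset; the only point of the sketch is the ordering, with names rewritten before the first `Open`.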
BTW, I've read the contributing guide and completed the checklist before opening this PR.