Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-utf8 encoding for zip files #3389

Merged
merged 2 commits into from
Feb 15, 2023
Merged

Conversation

xWTF
Copy link
Contributor

@xWTF xWTF commented Jan 30, 2023

Problem

We all know that filename in zip files should be only encoded with CP437 or UTF-8 (EFS flag set), theoretically. But for historical reasons there're many zip files dangling around with their OEM encoding: Shift_JIS from Japan, GBK from China, etc.

When Stash tries to scan a zip file containing these non-utf8 encoding, let's take Shift_JIS as an example:

  • The file names retrieved are encoded in OEM encoding
  • Theoritically, the names are just unreadable, but still recognizable by underlaying code
    • For example, ノエル in Shift_JIS is presented as �m�G�� in UTF-8
    • But the underlaying byte sequence is the same, we can still fetch the correct file from zip file
  • However, when stash tries to read a file, it calls Reader.Open instead of zip.File.Open:
    • Reader.Open calls fs.ValidPath ref
    • fs.ValidPath then calls utf8.ValidString immediately ref
  • Obviously, the check above will fail, and we'll see the invalid argument error
  • And the user won't be able to scan & view the images in that zip file 😞

Solution?

The only way to fix this without modifying the code of `archive/zip/reader.go` itself seems to be copy & pasta the related path resolving & file open logic to our own file, I know this is hacky but can't really find a better solution.

Not sure if this violates the BSD-3 of golang, I'm not familiar with that, might need another solution if it does 😢

Forget the old copy pasta thing at xWTF@718d37a, it works but the implementaion is ugly.

New Solution!

The zip.Reader initializes the index list AFTER the first call to Open, so we can decode names before that, and it works!

Changes summary:

  • The main changes here concats all the filename and comments, detect its encoding, and if it's not UTF8, decode them 😃
  • The original chardet library does not return a stable result due to goroutine problems, so I'm currently using my fork which ensures the result consistency, to prevent stash from creating duplicate images.
    This dependency could be replaced after Enforce order for results with same confidence gogs/chardet#10 merged

BTW, I've read the contributing guide and finished the checklist before open this PR

@xWTF xWTF changed the title [Feature] Support non-utf8 encoding for zip files Support non-utf8 encoding for zip files Jan 30, 2023
@xWTF xWTF changed the title Support non-utf8 encoding for zip files Support non-utf8 encoding for zip files (updated, no more copy pasta from stdlib) Jan 30, 2023
@xWTF
Copy link
Contributor Author

xWTF commented Feb 7, 2023

It has been a week and there's no response at all, anything wrong with this PR? Is stash accepting PRs from passersby devs?
@WithoutPants

@bnkai
Copy link
Collaborator

bnkai commented Feb 7, 2023

Reviewing PRs takes some time. I think atm there are some open PRs that need to merged first for the next stable version so it will probably take a little bit after that.

Regarding your PR there is an extra dependency on another open PR you made against the upstream chardet library. Given that it may be accepted as is or modified i think it is better to directly use (instead of redirecting via replace in the mod file) your fork and not "github.com/gogs/chardet". When there is an update on that PR we can then change back the library.

@xWTF
Copy link
Contributor Author

xWTF commented Feb 8, 2023

This PR currently uses my fork via the replace way. I'm using the upstream package name otherwise I'll have to update the package name in my fork (at least the go.mod) for dependency resolving to work. If that's acceptable, will make the change later today.

@WithoutPants WithoutPants added this to the Version 0.20.0 milestone Feb 8, 2023
@WithoutPants WithoutPants added the bug Something isn't working label Feb 8, 2023
@xWTF xWTF changed the title Support non-utf8 encoding for zip files (updated, no more copy pasta from stdlib) Support non-utf8 encoding for zip files Feb 8, 2023
@bnkai
Copy link
Collaborator

bnkai commented Feb 10, 2023

With latest PR everything looks ok.
It doesnt seem to break existing behaviour, i couldnt test the exact usecase though since mentioned zip files are difficult to create ?

@xWTF
Copy link
Contributor Author

xWTF commented Feb 10, 2023

I have a bunch of zips with unusual encoding, but they're all too large and NSFW ¯\_(ツ)_/¯

You can create one by chcp to the codepage and use -mcl with 7z on Windows:

  • 932: Shift_JIS
  • 936: GBK
  • 950: Big5

Here're two samples: gbk.zip, shift-jis.zip

The original version:
image

The fixed version:
image

As you can see, the detection is not 100% accurate, because Shift_JIS with only kana can be decoded by EUC-JP properly. For GBK, it needs more text or a proper hint to get accurate result. Getting wrong detection doesn't affect the scan and those image can be loaded properly :)

My stash have 836 zip files containing 57,472 images (after dedup the exactly same files) in total without any error, even the misdetection is rare since those zip files contain many images and there're sufficient sample (by concating all filenames) for the detector to work.

@bnkai
Copy link
Collaborator

bnkai commented Feb 10, 2023

@xWTF thanks for the samples
Tested and confirmed working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants