Support non-utf8 encoding for zip files #3389

xWTF · 2023-01-30T10:37:22Z

Problem

We all know that filename in zip files should be only encoded with CP437 or UTF-8 (EFS flag set), theoretically. But for historical reasons there're many zip files dangling around with their OEM encoding: Shift_JIS from Japan, GBK from China, etc.

When Stash tries to scan a zip file containing these non-utf8 encoding, let's take Shift_JIS as an example:

The file names retrieved are encoded in OEM encoding
Theoritically, the names are just unreadable, but still recognizable by underlaying code
- For example, ノエル in Shift_JIS is presented as �m�G�� in UTF-8
- But the underlaying byte sequence is the same, we can still fetch the correct file from zip file
However, when stash tries to read a file, it calls Reader.Open instead of zip.File.Open:
- Reader.Open calls fs.ValidPath ref
- fs.ValidPath then calls utf8.ValidString immediately ref
Obviously, the check above will fail, and we'll see the invalid argument error
And the user won't be able to scan & view the images in that zip file 😞

Solution?

The only way to fix this without modifying the code of `archive/zip/reader.go` itself seems to be copy & pasta the related path resolving & file open logic to our own file, I know this is hacky but can't really find a better solution.

~~Not sure if this violates the BSD-3 of golang, I'm not familiar with that, might need another solution if it does 😢~~

Forget the old copy pasta thing at xWTF@718d37a, it works but the implementaion is ugly.

New Solution!

The zip.Reader initializes the index list AFTER the first call to Open, so we can decode names before that, and it works!

Changes summary:

The main changes here concats all the filename and comments, detect its encoding, and if it's not UTF8, decode them 😃
The original chardet library does not return a stable result due to goroutine problems, so I'm currently using my fork which ensures the result consistency, to prevent stash from creating duplicate images.
This dependency could be replaced after Enforce order for results with same confidence gogs/chardet#10 merged

BTW, I've read the contributing guide and finished the checklist before open this PR

xWTF · 2023-02-07T10:12:48Z

It has been a week and there's no response at all, anything wrong with this PR? Is stash accepting PRs from passersby devs?
@WithoutPants

bnkai · 2023-02-07T18:08:34Z

Reviewing PRs takes some time. I think atm there are some open PRs that need to merged first for the next stable version so it will probably take a little bit after that.

Regarding your PR there is an extra dependency on another open PR you made against the upstream chardet library. Given that it may be accepted as is or modified i think it is better to directly use (instead of redirecting via replace in the mod file) your fork and not "github.com/gogs/chardet". When there is an update on that PR we can then change back the library.

xWTF · 2023-02-08T01:02:23Z

This PR currently uses my fork via the replace way. I'm using the upstream package name otherwise I'll have to update the package name in my fork (at least the go.mod) for dependency resolving to work. If that's acceptable, will make the change later today.

bnkai · 2023-02-10T15:21:21Z

With latest PR everything looks ok.
It doesnt seem to break existing behaviour, i couldnt test the exact usecase though since mentioned zip files are difficult to create ?

xWTF · 2023-02-10T15:47:21Z

I have a bunch of zips with unusual encoding, but they're all too large and NSFW ¯\_(ツ)_/¯

You can create one by chcp to the codepage and use -mcl with 7z on Windows:

932: Shift_JIS
936: GBK
950: Big5

Here're two samples: gbk.zip, shift-jis.zip

The original version:

The fixed version:

As you can see, the detection is not 100% accurate, because Shift_JIS with only kana can be decoded by EUC-JP properly. For GBK, it needs more text or a proper hint to get accurate result. Getting wrong detection doesn't affect the scan and those image can be loaded properly :)

My stash have 836 zip files containing 57,472 images (after dedup the exactly same files) in total without any error, even the misdetection is rare since those zip files contain many images and there're sufficient sample (by concating all filenames) for the detector to work.

bnkai · 2023-02-10T16:10:08Z

@xWTF thanks for the samples
Tested and confirmed working.

xWTF changed the title ~~[Feature] Support non-utf8 encoding for zip files~~ Support non-utf8 encoding for zip files Jan 30, 2023

xWTF force-pushed the develop branch from 718d37a to 49fd74f Compare January 30, 2023 13:23

xWTF changed the title ~~Support non-utf8 encoding for zip files~~ Support non-utf8 encoding for zip files (updated, no more copy pasta from stdlib) Jan 30, 2023

WithoutPants added this to the Version 0.20.0 milestone Feb 8, 2023

WithoutPants added the bug Something isn't working label Feb 8, 2023

xWTF changed the title ~~Support non-utf8 encoding for zip files (updated, no more copy pasta from stdlib)~~ Support non-utf8 encoding for zip files Feb 8, 2023

xWTF added 2 commits February 8, 2023 18:18

detect & decode zip file names in newZipFS

5e10d4b

use xWTF/chardet instead of gogs/chardet

1c9a71f

xWTF force-pushed the develop branch from aa09947 to 1c9a71f Compare February 8, 2023 10:21

bnkai approved these changes Feb 10, 2023

View reviewed changes

WithoutPants merged commit d00966c into stashapp:develop Feb 15, 2023

DampToast mentioned this pull request Mar 20, 2023

[Bug Report] Charencoding issue when scanning a zip gallery #3566

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support non-utf8 encoding for zip files #3389

Support non-utf8 encoding for zip files #3389

xWTF commented Jan 30, 2023 •

edited

Loading

xWTF commented Feb 7, 2023

bnkai commented Feb 7, 2023

xWTF commented Feb 8, 2023 •

edited

Loading

bnkai commented Feb 10, 2023

xWTF commented Feb 10, 2023 •

edited

Loading

bnkai commented Feb 10, 2023

Support non-utf8 encoding for zip files #3389

Support non-utf8 encoding for zip files #3389

Conversation

xWTF commented Jan 30, 2023 • edited Loading

Problem

Solution?

New Solution!

Changes summary:

xWTF commented Feb 7, 2023

bnkai commented Feb 7, 2023

xWTF commented Feb 8, 2023 • edited Loading

bnkai commented Feb 10, 2023

xWTF commented Feb 10, 2023 • edited Loading

bnkai commented Feb 10, 2023

xWTF commented Jan 30, 2023 •

edited

Loading

xWTF commented Feb 8, 2023 •

edited

Loading

xWTF commented Feb 10, 2023 •

edited

Loading