-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Zstandard as a patching engine
Zstandard CLI is introducing a new command line option --patch-from
, which leverages existing compressors, dictionaries and long range match finder to deliver a high speed engine for producing and applying patches to files.
--patch-from
is based on dictionary compression. It will consider a previous version of a file as a dictionary, to better compress a new version of same file. This operation preserves fast zstd
speeds at lower compression levels. To this ends, it also increases the previous maximum limit for dictionaries from 32 MB to 2 GB, and automatically uses the long range match finder when needed (though it can also be manually overruled).
--patch-from
can also be combined with multi-threading mode at a very minimal compression ratio loss.
Example usage:
# create the patch
zstd --patch-from=<oldfile> <newfile> -o <patchfile>
# apply the patch
zstd -d --patch-from=<oldfile> <patchfile> -o <newfile>
Benchmarks:
We compared zstd
to bsdiff
, a popular industry grade diff engine. Our test corpus were tarballs of different versions of source code from popular repositories. Specifically:
repos = {
# ~31mb (small file)
"zstd": {"url": "https://github.com/facebook/zstd", "dict-branch": "refs/tags/v1.4.2", "src-branch": "refs/tags/v1.4.3"},
# ~273mb (medium file)
"wordpress": {"url": "https://github.com/WordPress/WordPress", "dict-branch": "refs/tags/5.3.1", "src-branch": "refs/tags/5.3.2"},
# ~1.66gb (large file)
"llvm": {"url": "https://github.com/llvm/llvm-project", "dict-branch": "refs/tags/llvmorg-9.0.0", "src-branch": "refs/tags/llvmorg-9.0.1"},
# ~1gb (large file)
"linux kernel tarball": {"url": "https://www.kernel.org/", "dict-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v5.9-rc7", "src-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v5.8"},
"linux kernel tarball (larger patch)": {"url": "https://www.kernel.org/", "dict-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v5.9-rc7", "src-branch": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/?h=v4.4.238"},
}
Starting with improvements introduced in zstd 1.4.7
, --patch-from
on level 19 is comparable with bsdiff
In terms of patch size, level 19 --patch-from
and bsdiff
are comparable, though --patch-from
wins out particularly on larger patches.
In terms of speed, level 19 --patch-from
greatly outperforms bsdiff
.
In terms of memory footprint, level 19 --patch-from
also significantly outperforms bsdiff
.
The remainder of the benchmarks (primarily focusing on faster compression levels) were conducted on zstd 1.4.6
--patch-from
at level 1 and 3 is significantly faster (>200X faster on level 1 and >100X faster on level 3) vs bsdiff
And of course, there is no change to the fast zstd decompression speed.
After releasing --patch-from
, we were made aware of two other popular diff engines by the community: SmartVersion and Xdelta. SmartVersion is a file versioning tool that produces patches with the help of Zstd compression. Xdelta is a binary diff engine that is optimized for patch creation speed.
In our new benchmarks comparing the three, we looked at three metrics: patch size, patch creation speed, and patch extraction speed. And we used two main data sets: binary data (eg: clang) and text data (eg: clang source code). Measurements were made using the default parameters of each program and the parameters that optimize for patch creation speed. Note zstd-default is the same as Zstd on level 3.
The full benchmark with reproduction details can be found here and the highlights are visualized below. Our general takeaway is this: all three tools are excellent diff engines with clear advantages (especially in speed) over the popular bsdiff. Patch sizes for both binary and text data produced by all three are pretty comparable with Xdelta underperforming Zstd and SmartVersion only slightly [1]. For patch creation speed, Xdelta is the clear winner for text data and Zstd is the clear winner for binary data [2]. And for Patch Extraction Speed (ie. decompression), Zstd is fastest in all scenarios [3].
[1]
[2]
[3]
Improvements of --patch-from
compression ratio from v1.4.5
to v1.4.7
: