- Support for filenames with:
  - Latin characters with diacritics.
  - East Asian characters.
  - Emojis.
  - Symbols.
- Support for symlinks, stored either as symlinks or as the files they point to.
- Set the output file modified time to be the same as the modified time of the most recent file being compressed.
- Time to compress a series of large text files at maximum compression.
- Compression ratio of a series of large text files at maximum compression.
Test files
I used the following files as the base for this test:
- Four highly compressible .gpx files totaling 11MB.
- A filename with Latin characters with diacritics: “éñð”.
- A filename with East Asian characters: “你好”.
- A filename with Emojis: “🌐⛰️”.
- A filename with symbols: “#@^![]”.
- A relative symlink.
- An absolute symlink.
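A test directory like the one listed above could be set up with a short script. This is only a sketch: the file names follow the list, but the contents, the link names (`relative-link`, `absolute-link`) and the `FIXTURES` scratch directory are assumptions, and the .gpx files are left out.

```shell
# Hypothetical fixture script; file contents are placeholders
set -e
FIXTURES="$(mktemp -d)"   # scratch directory, assumed for this sketch
cd "$FIXTURES"

# One filename per character class under test
printf 'diacritics\n' > 'éñð'
printf 'east asian\n' > '你好'
printf 'emoji\n'      > '🌐⛰️'
printf 'symbols\n'    > '#@^![]'

# One relative and one absolute symlink, both pointing at the same file
ln -s 'éñð' relative-link
ln -s "$FIXTURES/éñð" absolute-link

ls -l
```

Quoting every name with single quotes keeps the shell from interpreting characters such as `#`, `!` and `[]`.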
Zip
zip is the most popular compression format on Windows and is mostly used to share files on the internet.
PKWARE created PKZIP in 1989 for MS-DOS, and it rapidly gained popularity.
In 1990, the Info-ZIP project, a free and open-source effort, implemented the zip and unzip commands.
Zip Examples
# compresses all the files in the directory.
# stores the symlinks as the file it follows, not as a symlink
zip -j files.zip *
# compresses all the files in the directory and it includes the listing
# of all the subfolder names but not their content
zip files-and-sub-folders.zip *
# compresses all the files in the directory and all the files in subdirectories
zip -r all-files-and-child-files.zip *
# compresses files using the maximum compression option
zip -9 maximum.zip *
# compresses files and stores the symlinks as such
zip -y symlinks-as-such.zip *
# compresses files and sets the modified time to the modified time
# of the most recent compressed file.
zip --latest-time most-recent.zip *
# extracts the files of the zip file
unzip my-zip-file.zip
# list the files of zip file in short format
unzip -l my-zip-file.zip
# list the files of zip file in detailed format
unzip -v my-zip-file.zip
Zip Results
The results below apply only to the Linux zip command:
- ⛔️ It can’t store filenames with non-Latin characters (due to a bug in the current version).
- ⛔️ It has one of the lowest compression ratios compared with the alternatives.
- ✅ It can store filenames with symbols.
- ✅ It can store absolute symlinks and relative symlinks.
- ✅ It can store symlinks content that it follows.
- ✅ It was one of the fastest commands at maximum compression.
- ✅ It supports setting the archive's modified time to that of the most recent compressed file.
7z
7z is a popular alternative to zip, mainly on Windows, due to the fact that it is open-source, provides high compression ratios and AES-256 encryption.
It was created by Igor Pavlov in 1999. The Linux version was created in the early 2000s.
7z Examples
# compresses all the files in the directory and all the files in subdirectories
# stores the symlinks as the file it follows, not as a symlink
7z a all-files-and-child-files.7z *
# compresses all the files in the directory and excludes
# subfolders and their files
7z a '-x!*/' files.7z *
# compresses files using the maximum compression option
7z a -mx9 maximum.7z *
# compresses files and the symlinks as such
7z a -snl symlinks-as-such.7z *
# compresses files and sets the modified time to the modified time
# of the most recent compressed file.
7z a -stl most-recent.7z *
# extracts the files of the 7z file
7z x my-7z-file.7z
# list the files of 7z file in detailed format
7z l my-7z-file.7z
7z Results
The results below apply only to the Linux 7z command:
- ✅ It can store filenames with Unicode characters.
- ✅ It can store filenames with symbols.
- ⛔️ It can store absolute and relative symlinks, but it can’t extract absolute symlinks. It reports: “ERROR: Dangerous link path was ignored”.
- ✅ It can store the content that symlinks point to.
- ✅ It produced one of the smallest files at maximum compression, but not the smallest.
Tar
tar isn't a compression command. Instead, it joins the files into a single archive with a directory listing, and it can internally call an external compression command, all in a single command call.
This approach has a significant advantage: each tool focuses on a specific task, with tar joining the files and the external command doing the compression.
It was created in 1979 at AT&T Bell Labs for the Unix system, and it was designed for archiving files to tape drives. It’s commonly used with gzip compression.
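The split between archiving and compressing can be seen by building the same archive in two ways. The sketch below assumes tar and gzip are installed; the file names and the `WORK` scratch directory are made up for illustration.

```shell
set -e
WORK="$(mktemp -d)"   # scratch directory, assumed for this sketch
cd "$WORK"
printf 'hello\n' > a.txt
printf 'world\n' > b.txt

# One call: tar joins the files and asks gzip to compress the result
tar -czf one-step.tar.gz a.txt b.txt

# Two steps: tar writes the archive to stdout ("-f -"),
# then gzip compresses the stream
tar -cf - a.txt b.txt | gzip > two-step.tar.gz

# Both archives list the same members
tar -tzf one-step.tar.gz
tar -tzf two-step.tar.gz
```

The second form is the one to reach for when the compressor needs its own options, such as a compression level.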
Tar Results
The results below apply to the tar command:
- ✅ It can store filenames with Unicode characters.
- ✅ It can store filenames with symbols.
- ✅ It can store absolute symlinks and relative symlinks.
- ✅ It can store the content that symlinks point to.
- ⛔️ It doesn’t support setting the archive's modified time to that of the most recent file.
Tar Examples
# stores all the files in the directory and all the files in subdirectories
# stores the symlinks as such
# "-c" parameter means create
# "-f" parameter means filename
tar -cf all-files-and-child-files.tar *
# same as before, but it displays more information while creating the tar file
tar -cvf all-files-and-child-files.tar *
# stores all the files in the directory and excludes the files
# of subdirectories, but it doesn't exclude the first level of subdirectories
tar --exclude='*/*' -cf files-and-folders.tar *
# stores only the files and symlinks in the directory (a single tar call, safe for any filename)
find . -maxdepth 1 \( -type f -o -type l \) -print0 | tar -cf files.tar --null -T -
# stores files and, for each symlink, the file it points to instead of the symlink itself
# "-h" parameter means follow symlinks
tar -chf symlinks-contents.tar *
# extracts the files of the tar file
tar -xf my-tar-file.tar
# list the files of tar file in short format
# "-t" parameter means list
tar -tf my-tar-file.tar
# list the files of tar file in detailed format
# "-v" parameter means verbose
tar -tvf my-tar-file.tar
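# Set the modified time of a tar file to the modified time
# of the most recent file.
# This script supports a large number of files, including names with spaces.
# Assumption: the file list is built with find; adapt it to your own selection.
BASE_FILE="${TMPDIR:-/tmp}/filelist-$(date +"%Y-%m-%d-%H_%M_%S")"
TMP_FILE="$BASE_FILE.lst"
STAT_FILE="$BASE_FILE-stat.lst"
# list the files to archive, one per line, and create the tar file from that list
find . -type f ! -name 'my-tar-file.tar' >"$TMP_FILE"
tar -cf my-tar-file.tar -T "$TMP_FILE"
# record "<modified time in epoch seconds> <name>" for each file
while IFS= read -r file; do
  stat -c '%Y %n' "$file" >>"$STAT_FILE"
done <"$TMP_FILE"
# the last line after a numeric sort belongs to the most recent file
MOST_RECENT_FILE="$(sort -n "$STAT_FILE" | tail -1 | cut -d' ' -f2-)"
touch -r "$MOST_RECENT_FILE" "my-tar-file.tar"
rm -f "$TMP_FILE" "$STAT_FILE"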
Gzip
gzip is the most popular compression format used along with tar, generating files with the extension tar.gz.
gzip stands for GNU zip. It uses the DEFLATE algorithm for compression, which is also one of the algorithms available in the zip format described above.
It was created in 1992 by Jean-loup Gailly and Mark Adler for the GNU Project.
Gzip Examples
# stores all the files and it compresses them using gzip
# "-z" parameter means use gzip compressor
tar -czf all-files-and-child-files.tar.gz *
# compresses files using the maximum compression option
tar -c * | gzip -9 > all-files-and-child-files.tar.gz
# list the files of tar.gz file in detailed format
tar -tvf all-files-and-child-files.tar.gz
# extracts the files of the tar.gz file
tar -xf my-tar-file.tar.gz
Bzip2
bzip2 offers better compression ratios than gzip, but it’s slower.
The bzip2 command replaces the original file when compressing. To prevent this, you can use it with tar or with cat.
It was created in 1996 by Julian Seward, and it uses the Burrows-Wheeler transform and Huffman coding.
Bzip2 Examples
# stores all the files and it compresses them using bzip2
# "-j" parameter means use bzip2 compressor
tar -cjf files-and-child-files.tar.bz2 *
# compresses files using the maximum compression option
# this has almost no effect in this format
tar -c * | bzip2 -9 > files-and-child-files.tar.bz2
# compresses a file and replaces it with the compressed version (my-file.tar.bz2)
bzip2 my-file.tar
# list the files of tar.bz2 file in detailed format
tar -tvf files-and-child-files.tar.bz2
# extracts the files of the tar.bz2 file
tar -xf files-and-child-files.tar.bz2
Xz
xz has an excellent compression ratio, often better than both gzip and bzip2, and it’s more efficient than bzip2, offering a good balance between compression ratio and performance.
xz was created in 2009 by Lasse Collin, and it uses the LZMA2 algorithm, an evolution of the LZMA algorithm that originated in 7-Zip.
The xz format became infamous because of the XZ Utils backdoor, a significant security incident discovered in March 2024. Had it gone undetected, it would have allowed unauthorized access to systems running the compromised version of XZ Utils. The attack failed, and the compromised version never made it into a publicly released version of Ubuntu.
Xz Examples
# stores all the files and it compresses them using xz
# "-J" parameter means use xz compressor
tar -cJf files-and-child-files.tar.xz *
# compresses files using the maximum compression option
tar -c * | xz -9 > files-and-child-files.tar.xz
# list the files of tar.xz file in detailed format
tar -tvf files-and-child-files.tar.xz
# extracts the files of the tar.xz file
tar -xf files-and-child-files.tar.xz
Zstd
zstd is increasingly popular in various applications due to its versatility and efficiency.
It’s known for very fast compression and decompression speeds while still achieving strong compression ratios.
It was created in 2015 at Facebook.
Zstd Examples
# stores all the files and it compresses them using zstd
tar --zstd -cf files-and-child-files.tar.zst *
# compresses files using the maximum compression option
tar -c * | zstd -19 > files-and-child-files.tar.zst
# list the files of tar.zst file in detailed format
tar -tvf files-and-child-files.tar.zst
# extracts the files of the tar.zst file
tar -xf files-and-child-files.tar.zst
Timeline
- 1979: tar (AT&T Bell Labs)
- 1989: PKZIP (PKWARE)
- 1990: zip (Info-ZIP)
- 1992: gzip (Jean-loup Gailly, Mark Adler for GNU)
- 1996: bzip2 (Julian Seward)
- 1999: 7z (Igor Pavlov)
- 2009: xz (Lasse Collin)
- 2015: zstd (Yann Collet at Facebook)
Comparison
To compare the time and compressed size of each format, I used four highly compressible .gpx files totaling 11 MB.
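The exact measurement procedure isn't shown here, so the following is only one possible way to collect both numbers for a single command. The generated `sample.gpx` file is a stand-in for the real test files, and `date +%s` gives whole seconds only. For sub-second timings, bash's `time` keyword in front of the pipeline is an alternative.

```shell
set -e
WORK="$(mktemp -d)"   # scratch directory, assumed for this sketch
cd "$WORK"

# Placeholder input: repetitive text standing in for the .gpx files
i=0
while [ "$i" -lt 20000 ]; do
  printf '<trkpt lat="40.%05d" lon="-3.%05d"></trkpt>\n' "$i" "$i"
  i=$((i + 1))
done > sample.gpx

# Time the compression step (whole-second resolution)
START=$(date +%s)
tar -cf - sample.gpx | gzip -9 > sample.tar.gz
END=$(date +%s)

# Compare the sizes in bytes
ORIGINAL_BYTES=$(wc -c < sample.gpx)
COMPRESSED_BYTES=$(wc -c < sample.tar.gz)
echo "seconds: $((END - START))"
echo "bytes: $ORIGINAL_BYTES -> $COMPRESSED_BYTES"
```

Swapping `gzip -9` for another compressor in the pipeline keeps the rest of the measurement unchanged.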
Timings
Command at default compression option:
Command | Seconds |
---|---|
zstd | 0.04 |
gzip | 0.12 |
zip | 0.12 |
bzip2 | 0.93 |
7z | 1.34 |
xz | 2.95 |
Command at maximum compression option:
Command | Seconds |
---|---|
gzip | 0.34 |
zip | 0.34 |
bzip2 | 0.99 |
7z | 2.31 |
xz | 2.85 |
zstd | 5.76 |
Notes:
- bzip2 takes practically the same time at the default and maximum compression options.
- zstd shows the biggest difference in time between the default and maximum compression options.
- zip and gzip are the fastest commands at maximum compression and are beaten only by zstd at the default setting, for highly compressible text.
- xz is the slowest command at the default setting and the second slowest at maximum compression, where only zstd takes longer.
Sizes
Command at default compression option:
Command | Bytes |
---|---|
bzip2 | 668848 |
xz | 786164 |
7z | 877195 |
gzip | 1013562 |
zip | 1013687 |
zstd | 1104299 |
Command at maximum compression option:
Command | Bytes |
---|---|
bzip2 | 668848 |
zstd | 701164 |
xz | 786308 |
7z | 795175 |
gzip | 945239 |
zip | 945722 |
Notes:
- In these tests, bzip2 always produced the smallest output, at average speed compared with the other commands.
- zstd shows the biggest difference in size between the default and maximum compression options.
- zip and gzip had the lowest compression ratios at maximum compression; at the default setting only zstd produced larger output.
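To put a number on how much each command gains from its maximum setting, the byte counts from the two tables can be compared directly. The `reduction` helper below is a hypothetical name introduced for this sketch, and it relies on awk for the floating-point arithmetic.

```shell
# Relative size reduction (in percent) when moving from the default
# to the maximum compression option, using byte counts from the tables
reduction() {
  # $1 = bytes at default, $2 = bytes at maximum
  awk -v d="$1" -v m="$2" 'BEGIN { printf "%.1f", (d - m) / d * 100 }'
}

echo "zstd:  $(reduction 1104299 701164)%"
echo "7z:    $(reduction 877195 795175)%"
echo "gzip:  $(reduction 1013562 945239)%"
echo "bzip2: $(reduction 668848 668848)%"
```

With these figures, zstd shrinks by roughly a third, while bzip2 stays at zero because both runs produced the same size.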
Conclusion
There is no perfect compression command; each one has its own benefits.
If you want to distribute files over the internet, zip is still the king. Hopefully its bug with Unicode characters will be fixed soon.
7z is popular on Windows platforms for those looking for an open-source format with higher compression ratios than zip.
Within the Linux community, tar.gz is the most popular format.
Bzip2 generated the smallest compressed files in these tests, making it a great fit for text file compression.