Pcompress is a utility that performs compression, decompression, and deduplication in parallel by splitting input data into chunks. It has a modular structure and supports multiple algorithms, including LZMA, Bzip2, PPMD, and LZ4, with KECCAK, BLAKE2, or SHA-256/512 chunk checksums. SSE optimizations for the bundled LZMA are included. It also implements chunk-level Content-Aware Deduplication and Delta Compression based on a Rabin Fingerprinting scheme. Metadata overhead is low, and I/O is overlapped with compression to maximize parallelism. It supports AES encryption and uses Scrypt from Tarsnap to generate unique per-session keys from passwords. It can work in pipe mode, reading from stdin and writing to stdout. It also provides adaptive compression modes in which a suitable algorithm is chosen for each chunk based on heuristics.
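
The chunk-parallel design described above can be sketched roughly as follows. This is a minimal Python illustration, not Pcompress's actual C implementation: the 1 MiB default chunk size, the function names (`split_chunks`, `pack_chunk`, `unpack_chunk`, `compress_parallel`, `decompress_parallel`), the thread pool, and the use of LZMA with SHA-256 from the Python standard library are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import lzma
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # hypothetical 1 MiB chunk size, chosen for illustration


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the input into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def pack_chunk(chunk: bytes) -> tuple[bytes, bytes]:
    """Return (SHA-256 digest of the plain chunk, LZMA-compressed chunk)."""
    return hashlib.sha256(chunk).digest(), lzma.compress(chunk)


def unpack_chunk(record: tuple[bytes, bytes]) -> bytes:
    """Decompress one chunk and verify it against its stored digest."""
    digest, payload = record
    chunk = lzma.decompress(payload)
    if hashlib.sha256(chunk).digest() != digest:
        raise ValueError("chunk checksum mismatch")
    return chunk


def compress_parallel(data: bytes, chunk_size: int = CHUNK_SIZE,
                      workers: int = 4) -> list[tuple[bytes, bytes]]:
    """Compress all chunks concurrently; map() preserves chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(pack_chunk, split_chunks(data, chunk_size)))


def decompress_parallel(records: list[tuple[bytes, bytes]],
                        workers: int = 4) -> bytes:
    """Decompress and verify all chunks concurrently, then reassemble them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(unpack_chunk, records))
```

Because each chunk carries its own checksum and is compressed independently, chunks can be processed in any order by any worker, and a corrupted chunk is detected without touching the others.
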
| Field | Value |
|---|---|
| Tags | Data Compression, Deduplication |
| Licenses | LGPL v3 |
| Operating Systems | UNIX / Linux |
| Implementation | C |
Recent releases


Release Notes: This release adds many bugfixes and performance improvements. Accuracy in finding duplicates in Global Dedupe has been improved. SHA256 is now the default block hash algorithm for dedupe, and it can be changed separately from the chunk verification hash. Performance gains come from better parallelism, more SSE vectorization, faster sorting, and improved handling of the segment hash list file, which results in less I/O and fewer random accesses. Bugs in calculating the in-memory index size have been fixed to avoid exceeding free RAM and swapping to disk.


Release Notes: This release introduces Global Deduplication, which deduplicates across the entire dataset using an in-memory index rather than only within segments. One of two kinds of index is used, depending on the dataset size. A full chunk hash index is used for small datasets. A special segmented, similarity-based index is used when the dataset is very large. The latter index is just 0.002% of the dataset size, yet achieves >90% of the efficiency of exact dedupe based on a full chunk index with 4KB chunks. Streaming support allows optimization of network transfer of large data.
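
The full chunk hash index used for small datasets can be illustrated with a short Python sketch. It is an assumption-laden toy, not the Pcompress code: a simple polynomial rolling hash stands in for the Rabin fingerprint, SHA-256 is used as the block hash, and the window size, the minimum and maximum chunk sizes, and the names `content_chunks`, `dedupe`, and `restore` are hypothetical. The segmented similarity-based index is not covered.

```python
from __future__ import annotations

import hashlib
from typing import Iterator

WINDOW = 16                  # rolling-hash window in bytes (illustrative)
BASE = 257                   # polynomial base (illustrative)
MOD = 1 << 32
AVG_MASK = (1 << 12) - 1     # cut when the low 12 bits are zero: ~4 KB average
MIN_CHUNK, MAX_CHUNK = 1024, 16384
POW_OUT = pow(BASE, WINDOW, MOD)  # weight of the byte leaving the window


def content_chunks(data: bytes) -> Iterator[bytes]:
    """Yield variable-size chunks whose boundaries depend on the content."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i - start >= WINDOW:
            # Remove the contribution of the byte that slid out of the window.
            h = (h - data[i - WINDOW] * POW_OUT) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & AVG_MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]   # trailing partial chunk


def dedupe(data: bytes) -> tuple[dict[bytes, bytes], list[bytes]]:
    """Build an in-memory index of unique chunks plus a rebuild recipe."""
    index: dict[bytes, bytes] = {}   # chunk hash -> chunk bytes
    recipe: list[bytes] = []         # ordered chunk hashes
    for chunk in content_chunks(data):
        key = hashlib.sha256(chunk).digest()
        index.setdefault(key, chunk)  # store each distinct chunk only once
        recipe.append(key)
    return index, recipe


def restore(index: dict[bytes, bytes], recipe: list[bytes]) -> bytes:
    """Reassemble the original data from the index and the recipe."""
    return b"".join(index[key] for key in recipe)
```

Since the index covers the whole input rather than one segment, a chunk repeated anywhere in the dataset is stored once and referenced by hash everywhere else.
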


Release Notes: This update release adds several performance and security enhancements. AES code now includes AES-NI and VPAES optimizations. The fast XSalsa20 encryption algorithm has been added. Encryption key length can now be set at runtime (128/256 bits). Nonce, salt, etc. are now HMACed, and nonce generation is randomized. Merkle Tree hashing via OpenMP is now used for all hashes when compressing an entire file in a single chunk (solid archive mode). Deduplication performance is improved by 95%. XML detection in adaptive modes is improved. The file format has been updated, but backward compatibility is retained.


Release Notes: This is a performance-focused release with improvements across the board. Extensive x86 SSE2/3/4/AVX vectorization has been done with runtime CPU feature detection. Deduplication performance is increased 3X. The Delta Compression algorithm has been tweaked for better performance and effectiveness with reduced memory usage. xxHash has been vectorized. Support for the BLAKE2 checksum has been added. AES CTR mode has been vectorized. Intel's optimized SHA-512 and SHA-512/256 implementations are included for leading-edge SHA-2 performance, in addition to parallel modes. LZMA performance is slightly improved.


Release Notes: This release adds many improvements and fixes, including ones for performance and stability. It adds the KECCAK (SHA-3) message digest. A new fast Delta variant detects embedded tables of binary numeric data and RLE-encodes them. A matrix transform allows better compression of a Dedupe index. LZ4 and xxHash have been updated. The test suite has been expanded. Pcompress now builds without warnings with strict compiler flags. Alternate locations for external libraries are handled properly, and older OpenSSL versions, as old as 0.9.8e, now work. The debug statistics mode now prints additional throughput statistics.
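
The release notes do not describe how the fast Delta variant works internally. One plausible reading, sketched here purely as an assumption, is that a table of regularly spaced integers becomes a long run of identical differences, which run-length encoding then collapses. The little-endian 32-bit layout and the names `read_u32_table`, `delta_rle_encode`, and `delta_rle_decode` are hypothetical.

```python
from __future__ import annotations

import struct


def read_u32_table(raw: bytes) -> list[int]:
    """Interpret raw bytes as little-endian unsigned 32-bit integers."""
    count = len(raw) // 4
    return list(struct.unpack(f"<{count}I", raw[:count * 4]))


def delta_rle_encode(values: list[int]) -> list[tuple[int, int]]:
    """Encode integers as (delta, run_length) pairs.

    The first delta is taken against an implicit starting value of 0.
    """
    runs: list[tuple[int, int]] = []
    prev = 0
    for value in values:
        delta = value - prev
        prev = value
        if runs and runs[-1][0] == delta:
            runs[-1] = (delta, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((delta, 1))              # start a new run
    return runs


def delta_rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Invert delta_rle_encode by replaying each delta run_length times."""
    out: list[int] = []
    current = 0
    for delta, length in runs:
        for _ in range(length):
            current += delta
            out.append(current)
    return out
```

For example, the six values 100, 104, 108, 112, 116, 120 encode to the two pairs `(100, 1)` and `(4, 5)`, and decoding those pairs returns the original six values.
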