ddn.compressor.bzip2

ddn.compressor.bzip2

High-performance BZIP2 compression provider for ddn.api.compressor. This is a pure D implementation of the BZIP2 algorithm designed to match the performance of the system bzip2 tool.

The BZIP2 algorithm consists of the following stages:

  • Run-Length Encoding (RLE1): Initial pass limiting runs to 255.
  • Burrows-Wheeler Transform (BWT): Reversible block sort.
  • Move-To-Front (MTF): Exploits locality in BWT output.
  • Zero Run-Length Encoding (RLE2): Encodes zeros as RUNA/RUNB.
  • Huffman Coding: Multiple tables with selectors.

Features:

  • Supports both streaming and block compression.
  • Compression levels map to bzip2's blockSize100k (1-9).
  • Fully DDoc-documented.
  • All public functions and types have unittests.

Module Initializers 1

shared static this()

Types 13

structBwtResult

BWT transform result.

Fields
ubyte[] output
uint origPtr
structMtfResult

MTF transform result.

Fields
ubyte[] output
bool[256] inUse
int nInUse

RLE2 encode result.

Fields
ushort[] symbols
int nInUse

Huffman encoding table entry.

Fields
uint code
ubyte len

Huffman decoding table.

Fields
uint[] limit
uint[] base
uint[] perm
int minLen
int maxLen
structBitWriter

Bit writer for MSB-first bit packing.

Fields
ubyte[] buffer
size_t pos
uint bsBuff
int bsLive
Methods
void init(size_t initialSize = 4096) pure @safeInitialize the writer.
void ensureCapacity(size_t extra) pure @safeEnsure buffer has space.
void writeBits(int n, uint value) pure nothrow @nogc @safeWrite n bits from value (MSB-first).
void writeBit(uint bit) pure nothrow @nogc @safeWrite a single bit.
void writeByte(ubyte b) pure nothrow @nogc @safeWrite a byte (8 bits).
void writeUInt24(uint value) pure nothrow @nogc @safeWrite a 24-bit value.
void writeUInt32(uint value) pure nothrow @nogc @safeWrite a 32-bit value.
void writeUnary(int count) pure nothrow @nogc @safeWrite unary encoded value.
void writeBitmap16(ushort bitmap) pure nothrow @nogc @safeWrite bitmap for 16 symbols.
void flush() pure nothrow @nogc @safeFlush remaining bits to output.
const(ubyte)[] getOutput() pure nothrow @nogc @safe returnGet output bytes.
const(ubyte)[] consumeOutput() nothrow @nogc @safeFlush complete bytes and return them, keeping partial bits in buffer.
void reset() pure nothrow @nogc @safeReset to initial state.
structBitReader

Bit reader for MSB-first bit unpacking.

Fields
const(ubyte)[] data
size_t bytePos
int bitPos
uint bsBuff
int bsLive
Methods
void setInput(const(ubyte)[] input) pure nothrow @nogc @safeSet input data.
uint readBits(int n) pure nothrow @nogc @safeRead n bits.
uint readBit() pure nothrow @nogc @safeRead a single bit.
ubyte readByte() pure nothrow @nogc @safeRead a byte.
uint readUInt24() pure nothrow @nogc @safeRead a 24-bit value.
uint readUInt32() pure nothrow @nogc @safeRead a 32-bit value.
bool hasBits(int n) pure nothrow @nogc @safeCheck if we have n bits available.
size_t bytesConsumed() pure nothrow @nogc @safeGet bytes consumed.

Compressed block result.

Fields
ubyte[] data
uint blockCrc
private structHuffmanTableSelection

Determine number of Huffman tables based on data size.

Parameters

nMTFNumber of MTF symbols.

Returns

Number of tables (2-6).

Result of multi-table Huffman selection.

Fields
ubyte[][] lengths
uint[][] codes
int[] minLens
int[] maxLens
ubyte[] selectors
ubyte[] selectorsMtf
int nGroups
int nSelectors
private structCanonicalCodes

Assign canonical Huffman codes from code lengths.

Matches bzip2's BZ2_hbAssignCodes: codes assigned in order of increasing length, and within the same length, in symbol index order.

Fields
uint[] codes
int minLen
int maxLen

Decompressed block result.

Fields
ubyte[] data
uint blockCrc
bool isEos

BZIP2 streaming compressor.

Fields
ubyte[] _blockBuffer
size_t _blockPos
ulong _bytesIn
ulong _bytesOut
uint _combinedCrc
bool _headerWritten
bool _finished
int _blockSize100k
uint _blockSize
BitWriter _bitWriter
Methods
CompressionOptions options() const pure nothrow
void write(const(ubyte)[] data)
void flush(FlushMode mode = FlushMode.SYNC)
void finish()
void reset()
ulong bytesInTotal() const pure nothrow
ulong bytesOutTotal() const pure nothrow
bool setDictionary(const(ubyte)[] dict) pure nothrow
bool isFinished() const pure nothrow
private void writeHeader()
private void flushBlock()
private void writeTrailer()
Constructors
this(CompressionOptions opts)Create a new BZIP2 compressor.

BZIP2 streaming decompressor.

Fields
ubyte[] _inBuffer
size_t _inPos
ulong _bytesIn
ulong _bytesOut
uint _combinedCrc
uint _expectedCombinedCrc
int _blockSize100k
bool _headerParsed
bool _finished
BitReader _bitReader
bool _readerInited
Methods
DecompressionOptions options() const pure nothrow
void write(const(ubyte)[] data)
void finish()
void reset()
ulong bytesInTotal() const pure nothrow
ulong bytesOutTotal() const pure nothrow
bool setDictionary(const(ubyte)[] dict) pure nothrow
bool isFinished() const pure nothrow
private void processInput()
private bool parseHeader()
private bool processBlock()
Constructors
this(DecompressionOptions opts)Create a new BZIP2 decompressor.

Functions 44

fnuint bz2Crc32Update(uint crc, ubyte b) pure nothrow @nogc @safeCompute BZIP2 CRC32 for a single byte.
fnuint bz2Crc32(const(ubyte)[] data) pure nothrow @nogc @safeCompute BZIP2 CRC32 for a buffer.
fnsize_t rle1EncodeCalcSize(const(ubyte)[] input) pure nothrow @nogc @safeCalculate the size of RLE1 encoded output.
fnsize_t rle1Encode(const(ubyte)[] input, ubyte[] output) pure nothrow @nogc @safeEncode data using RLE1.
fnubyte[] rle1Decode(const(ubyte)[] input) pure @safeDecode RLE1 encoded data.
private fnint compareRotationsFrom(const(ubyte)[] block, size_t i, size_t j, size_t startOffset) pure nothrow @nogcCompare two rotations starting from a given offset.
private fnint compareRotations(const(ubyte)[] block, size_t i, size_t j) pure nothrow @nogcCompare two rotations of the input block lexicographically.
private fnsize_t bwtLog2(size_t n) pure nothrow @nogcCompute floor(log2(n)) for depth limiting.
private fnvoid bwtIntrosort(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi, size_t depthLimit) pure nothrowIntrosort implementation for BWT sorting.
private fnvoid bwtIntrosortFrom(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi, size_t depthLimit, size_t compareOffset) pure nothrowIntrosort implementation that skips already-compared prefix bytes.
private fnsize_t bwtPartition(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi) pure nothrowPartition for quicksort using median-of-three pivot.
private fnsize_t bwtPartitionFrom(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi, size_t compareOffset) pure nothrowPartition for quicksort with prefix skip optimization.
private fnvoid bwtSwap(uint[] indices, size_t i, size_t j) pure nothrow @nogcSwap two elements in the indices array.
private fnvoid bwtHeapsort(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi) pure nothrowHeapsort for BWT indices (fallback for pathological cases).
private fnvoid bwtHeapsortFrom(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi, size_t compareOffset) pure nothrowHeapsort with prefix skip optimization.
private fnvoid bwtSiftDown(uint[] indices, const(ubyte)[] block, size_t lo, size_t start, size_t n) pure nothrowSift down operation for heapsort.
private fnvoid bwtSiftDownFrom(uint[] indices, const(ubyte)[] block, size_t lo, size_t start, size_t n, size_t compareOffset) pure nothrowSift down with prefix skip optimization.
private fnvoid bwtInsertionSort(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi) pure nothrowInsertion sort for small ranges.
private fnvoid bwtInsertionSortFrom(uint[] indices, const(ubyte)[] block, size_t lo, size_t hi, size_t compareOffset) pure nothrowInsertion sort with prefix skip optimization.
private fnvoid bwtRadixSortDeep(uint[] indices, const(ubyte)[] block, uint lo, uint hi, int depth, uint[] temp) pure nothrowDeeper radix sort for large buckets with repetitive data. Uses single-byte radix at each level with efficient batch skipping.
private fnvoid bwtTernaryRadixSort(uint[] indices, const(ubyte)[] block, uint lo, uint hi, int depth, uint[] temp) pure nothrowTernary radix quicksort for BWT indices.
private fnvoid bwtSort(uint[] indices, const(ubyte)[] block) pure nothrowSort an array of rotation indices based on the rotations they represent.
fnBwtResult bwtTransform(const(ubyte)[] input) purePerform the Burrows-Wheeler Transform on the input data.
fnubyte[] bwtInverseTransform(const(ubyte)[] input, uint origPtr) pure @safeInverse BWT.
fnMtfResult mtfTransform(const(ubyte)[] input) pure @safePerform MTF transform.
fnubyte[] mtfInverseTransform(const(ubyte)[] indices, const bool[256] inUse, int nInUse) pure @safeInverse MTF transform.
fnRle2Result rle2Encode(const(ubyte)[] mtfIndices, int nInUse) pure @safeEncode MTF output using RLE2 (zero run encoding).
fnubyte[] rle2Decode(const(ushort)[] symbols, int nInUse) pure @safeDecode RLE2 to MTF indices.
fnubyte[] generateHuffmanLengths(const(uint)[] frequencies, int alphaSize, int maxLen = BZ_MAX_CODE_LEN) pureGenerate Huffman code lengths from frequencies.
private fnvoid heapPush(ref uint[] heapFreq, ref int[] heapNode, int heapSize, uint freq, int node) purePush a node onto the min-heap.
private fnvoid heapPop(ref uint[] heapFreq, ref int[] heapNode, int heapSize, ref uint freq, ref int node) purePop the minimum node from the min-heap.
private fnvoid limitCodeLengths(ref ubyte[] lengths, int alphaSize, int maxLength) pureLimit Huffman code lengths using a simple heuristic. Ensures no code exceeds maxLength by redistributing.
fnHuffEncEntry[] buildHuffmanEncodeTable(const(ubyte)[] lengths) pure @safeBuild Huffman encoding table.
fnHuffDecTable buildHuffmanDecodeTable(const(ubyte)[] lengths) pure @safeBuild Huffman decoding table.
private fnCanonicalCodes assignCanonicalCodes(const(ubyte)[] lengths) pure
private fnvoid initializeHuffmanTables(ref ubyte[][] lengths, const(uint)[] mtfFreqs, int nGroups, int alphaSize, int nMTF) pure nothrowInitialize Huffman table code lengths based on symbol frequency partitioning.
private fnubyte[] mtfEncodeSelectors(const(ubyte)[] selectors, int nGroups) pure nothrowMTF-encode the selector sequence.
private fnHuffmanTableSelection selectHuffmanTables(const(ushort)[] symbols, const(uint)[] mtfFreqs, int alphaSize) pureSelect Huffman tables for MTF-encoded symbols using iterative optimization.
fnint determineNumGroups(int nMTF) pure nothrow @nogc
fnCompressedBlock compressBlock(const(ubyte)[] input, int blockSize100k, uint originalDataCrc = 0)Compress a block of RLE1-encoded data into a BZIP2 block.
private fnCompressedBlock compressBlockToWriter(const(ubyte)[] input, int blockSize100k, uint originalDataCrc, BitWriter * bw)Compress a block into a provided BitWriter without flushing.
fnDecompressedBlock decompressBlock(ref BitReader reader)Decompress a single block.
fnCompressor makeBzip2Compressor(CompressionOptions opts)Create a BZIP2 compressor.
fnDecompressor makeBzip2Decompressor(DecompressionOptions opts)Create a BZIP2 decompressor.

Variables 15

private varubyte[3] BZ_HEADER_MAGIC

BZIP2 stream header magic bytes ("BZh")

private varubyte[6] BZ_BLOCK_MAGIC

BZIP2 block header magic (48 bits: digits of pi 3.14159265359)

private varubyte[6] BZ_EOS_MAGIC

BZIP2 end-of-stream magic (48 bits: sqrt(pi) 1.77245385090)

private enumvarBZ_MAX_ALPHA_SIZE = 258

Maximum alphabet size for Huffman coding

private enumvarBZ_MAX_GROUPS = 6

Number of Huffman tables (2-6)

private enumvarBZ_MAX_CODE_LEN = 23

Number of Huffman code lengths

private enumvarBZ_G_SIZE = 50

Symbols per group for Huffman table selection

private enumvarBZ_N_ITERS = 4

Number of iterations for Huffman table optimization

private enumvarBZ_N_OVERSHOOT = 34

Overshoot for safe BWT comparison past block end

private enumvarBZ_RUNA = 0

RUNA symbol for zero run encoding

private enumvarBZ_RUNB = 1

RUNB symbol for zero run encoding

private enumvarBZ_EOB = 0

End of block symbol

private enumvarBZ_LESSER_ICOST = 0

Initial cost for symbols within a table's assigned range

private enumvarBZ_GREATER_ICOST = 15

Initial cost for symbols outside a table's assigned range

private varuint[256] BZ_CRC32_TABLE