LZW
This calculator compresses/decompresses a string using Lempel-Ziv-Welch (LZW) algorithm.
The calculators in this article are used to compress and decompress a string using the LZW algorithm. The LZW method is simple and reliable, and it does not require storing a dictionary - the dictionary is dynamically generated during compression and decompression.
To study the algorithm's logic, check the "Show details" box - the calculators will display the work log and generated phrase dictionaries.
The binary dump or binary file generated by the previous calculator can be passed to the following calculator, which will restore the original string. It is necessary to set the method of forming the source dictionary in the same way as it is set for the compression calculator. (choose the same encoding or specify the same dictionary explicitly).
- Drag files here
Lempel-Ziv-Welch algorithm
The lossless compression algorithm LZ78 was published in 1978 by Abraham Lempel and Jacob Ziv and then modified by Terry Welch in 1984. After Welch's publication, the algorithm was named LZW after the authors' surnames (Lempel, Ziv, Welch). The algorithm has been patented, but all patents have expired by now, which gives us a great opportunity to publish our implementation here.
LZW algorithm description
- Create the initial dictionary containing all possible characters.
- Put the first character to the input phrase W.
- Read next character K.
- If EOF returns W, else:
If WK phrase is already in the dictionary, W ⟵ WK, go to 3,
Else return the code of W, add WK to the dictionary, W ⟵ K. Go to 3.
To decode the data generated in this way, you do not need to store the dictionary. It is recreated by itself in the process of decompression:
- Create the initial dictionary containing all possible characters.
- Put the first code to the input phrase W.
- Read the next code K.
4.If EOF, return the character having the code W, else:
If the phrase with the WK code is not in the dictionary, return the phrase with the W code, and add the phrase with the WK code to the dictionary.
Else, assign the WK code to the input phrase and go to 3.
Welch's original article intended to encode a phrase in a dictionary with a fixed-size 12-bit code. Still, this approach may even increase the length of the encoded message for small messages compared to the original text. Therefore, it is quite common to use dynamic code length, which changes every time the dictionary limit is reached.
Dictionary size growth handling
In the compression algorithm described above, the size of the dictionary is not limited. In practice, this can lead to resource constraints when packing large amounts of data.
There are known modifications to the algorithm trying to address this problem:
- LZC - the implementation of the algorithm in the compress utility limits the maximum dictionary size to 16 bits. When the maximum size is reached, the dictionary stops changing. The algorithm monitors the compression ratio and, if it degrades significantly, resets the dictionary and forms it anew.
- LZT - on overflow, removes from the dictionary a phrase that has not been used for the longest time
Our implementation features
Initial dictionary
Our calculators implement an endlessly growing vocabulary, which can be too expensive for huge data. The size of phrase codes starts at 8 bits for the initial dictionary specified by standard encoding, or less for a dictionary entered by hand. In the latter case, the phrase codes' size is the minimum number of bits required to express the number of characters in the dictionary. Bit length.
Multibyte encodings
Some encodings can take more than 1 byte per character, for example, UTF-8 encoding contains characters of variable length from 1 to 4 bytes. In any case, the initial dictionary will have 256 items with codes from 0 to 255 for whatever encoding you choose.
For multibyte encodings, a dynamically generated dictionary will look quite exotic - its phrases will inevitably be composed of different parts of adjacent characters. In this case, for clarity in the output tables, we use a missing byte marker (by default, this is the '~' character). For example, when compressing the phrase "9 €", represented in UTF-8 encoding, the dynamic dictionary can be filled with combinations of the following characters:
9 - nine,
' ' - space,
€~~ - first byte of euro symbol,
~€~ - second byte of euro symbol,
~~€ - third byte of euro symbol,
€~ - the first two bytes of the euro symbol,
~€ - the last two bytes of the euro symbol,
€ - all three bytes of the euro symbol.
We have introduced this character description for convenience only. Since the constituent parts of many multibyte characters are just bytes of 8 bits, they may be the same for different characters.
for example:
€ ~~ - the first byte of the euro symbol, its code in utf-8 = 226
₽ ~~ - the first byte of the ruble symbol, its code in utf-8 = 226.
Trailing with zeros
The compression algorithm can produce a bit array with a size that is not a multiple of 8. In this case, the compression calculator fills the end with zero bits. Since our implementation uses a variable size of the phrase code, this approach can cause problems if the manually set initial dictionary is tiny.
In this example the final message will be only 18 bits and 6 bits would have to be padded with zeros to store the entire message in a binary file.
Since with the specified parameters, the phrase code's size is only 4 bits, then the extra word will be read during unpacking, and if we were to create the original dictionary with the code of the first character = 0, then the extra character would be unpacked.
We solve this problem by adding a null character to the beginning of the dictionary, which is commonly interpreted as the end of a string.
When using the encoding as the initial dictionary, this problem will never arise since the phrase code's size is always at least 8 bits.
Comentários