09-27-08

On LZ and ACB

LZ77 coders are hugely inefficient. Everybody knows the basic idea of LZ77. You compress data using back references to previous data. You either write a match, consisting of an offset & length, or a literal, consisting of one raw character.
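
For concreteness, here's a minimal sketch of the decode side. The token format (tuples of ('lit', byte) or ('match', offset, length)) is invented purely for illustration, not any real coder's bitstream :

# toy LZ77 decoder: literals append a byte, matches copy 'length' bytes
# starting 'offset' back in the already-decoded output
def lz77_decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, offset, length = tok
            for _ in range(length):          # byte-by-byte so overlapping copies work
                out.append(out[-offset])
    return bytes(out)

# "abcabcabcd" = three literals, one overlapping match, one literal
tokens = [('lit', ord(c)) for c in "abc"] + [('match', 3, 6), ('lit', ord('d'))]
assert lz77_decode(tokens) == b"abcabcabcd"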

The inefficiencies are intuitively obvious. If the coding were perfect, then from the decoder's point of view the encoded stream would be totally random; you shouldn't be able to predict the code stream, but of course you can. One way is with context modeling. If you see a context like "what the ", then it's highly likely the next sequence is "fuck", so offsets which point to that are far more likely. Similarly, certain offsets point at certain words, and there are natural likely match lengths for those as well.

Someday soon I'm going to write about how I'm doing "Optimal Parsing". But the very fact that you can do optimal parsing is a sign of the flaw of LZ77. The encoder has many choices of different ways to write the code stream, which means there is massive redundancy in the code space. There are many different LZ77 code streams which all correspond to the same input. The optimal parser tries to pick the best one of these, but it doesn't change the fact that all the ways to encode that you did not choose are just wasted probability space in the coded bits.
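
To make that redundancy concrete, here's a toy count of how many distinct LZ77 token streams decode to one short string (brute force and exponential, only meant for tiny inputs; matches here are any (offset, length) with length >= 2) :

def count_parses(data, pos=0):
    if pos == len(data):
        return 1                              # one complete parse
    total = count_parses(data, pos + 1)       # option 1: emit a literal
    # option 2: emit any valid match of length >= 2
    for offset in range(1, pos + 1):
        length = 0
        while pos + length < len(data) and data[pos + length] == data[pos - offset + length]:
            length += 1
        for l in range(2, length + 1):        # each (offset, l) pair is a distinct code
            total += count_parses(data, pos + l)
    return total

print(count_parses("abcabcabc"))   # many different code streams, all for the same 9 bytes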

Let's look at how you might reduce the redundancy in LZ77 if you for some reason had to do LZ coding and wanted to go as far with it as you can :

First of all, whether or not a match will be found is highly predictable. Most high compression LZ coders do a lot of context modeling for the match flag, because you can tell by looking at the local neighborhood whether it's an area that's likely to match. There are also finite-state patterns that arise in many types of data (particularly binary data), where it may be very likely that the data goes match-literal-match-literal , or mat-lit-lit-mat-lit-lit-mat , or whatever. Apparently LZMA does this.
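
I don't know exactly what LZMA's state machine looks like, but a toy version of the idea might be something like this (the two-bit history and the adaptation rate are made up for illustration) :

# adaptive P(match) conditioned on the last two match/literal flags,
# so it can learn patterns like mat-lit-lit-mat-lit-lit
class MatchFlagModel:
    def __init__(self):
        self.p_match = [0.5] * 4    # one probability per 2-bit history state
        self.history = 0            # last two flags packed into 2 bits

    def prob_match(self):
        return self.p_match[self.history]

    def update(self, was_match):
        p = self.p_match[self.history]
        target = 1.0 if was_match else 0.0
        self.p_match[self.history] = p + (target - p) / 16.0    # simple geometric adaptation
        self.history = ((self.history << 1) | int(was_match)) & 3

m = MatchFlagModel()
for flag in [1, 0, 0] * 4:          # a mat-lit-lit pattern
    m.update(flag)
print(m.p_match)   # match is now likely after two literals, unlikely in the states after a match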

Modeling the redundancy in the offset and the length is more interesting and difficult. The first thing we see is that they are not really two separate parameters; they are highly correlated, but not in a trivial way. For example, say we simply find the longest possible match and send the {offset,length} in the normal LZ77 way. There were very similar shorter matches that we passed up. When we send just the {offset}, the decoder knows what the match string we want looks like, and it knows that we passed up all the other matches, so this offset must be the longest one. That means it can mostly infer the match length. The decoder gets the offset, which specifies the match string; then the decoder finds the other string in the window that has the longest common prefix with the match string. The actual match must be at least that prefix length + 1. You can then either just output that match, or you could send the amount by which the actual length exceeds it.

Let me show an example for clarity :

T A B C X X A B C D T T * A B C D ...

current code pointer is at (*)

encoder sends match offset "-6"

decoder sees the match string "ABCDTT"

it looks for the longest other matching prefix and finds "ABCXX" with shared length 3

therefore the match must be of at least length 4

decoder outputs "ABCD"
(then possibly also an additional match length starting at 0)
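
Here's a brute-force sketch of that decoder-side inference, using the window from the example; the helper names are just for illustration :

def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def implied_min_length(window, offset):
    match_pos = len(window) - offset
    match_str = window[match_pos:]            # the string the offset points at
    best = 0
    for pos in range(len(window)):            # every other candidate the encoder passed up
        if pos != match_pos:
            best = max(best, common_prefix_len(window[pos:], match_str))
    return best + 1                           # the chosen match must beat all of them

window = "TABCXXABCDTT"
print(implied_min_length(window, 6))          # -> 4 : the decoder can output "ABCD" for free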

Now, we're still sending the offsets raw, but of course the offset is highly predictable too. Given the current context (or whatever type of model you want to use), you're much more likely to match certain offsets than others. You shouldn't just code the offset as a backward index into the window, you should assign a probability to each. The "ROLZ" and "PMR" LZ coders of today do this to some extent.

(aside : ROLZ = reduced offset LZ; basically means only code matches that share some amount of context, perhaps also MTF the offsets within the contexts; you code many fewer matches, but that's a good thing because ROLZ coders generally use a strong context model for literals; in fact if you use a powerful PPM for literals then the fewer matches you code the better your compression is. PMR = previous match reference; it just codes references to previous matches in fewer bits because they are more likely, and the encoder also favors choosing PMRs over other offsets).

We can code an arbitrary offset with some modeling by reordering the offsets: make index 0 the most likely, up to index N the least likely. One way to do this is to order by context match length. Say the longest context match is L symbols; then all the offsets that match with a context of length L come first, then all the offsets that match at length (L-1), down to length (0). Within each context length, order so that more recently seen & used offsets are lower. So to code an offset, you still find the longest match, then you see where that offset is in the list ordered by likelihood, and output that index. (Alternatively you could just context-code the offset in the normal way.)
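
A brute-force sketch of that ordering (the context length cap and the naming are my own, purely illustrative) :

def context_len(data, a, b, max_len=8):
    # length of the context shared by positions a and b, looking backwards
    n = 0
    while n < max_len and a - 1 - n >= 0 and b - 1 - n >= 0 and data[a - 1 - n] == data[b - 1 - n]:
        n += 1
    return n

def ranked_offsets(data, cur):
    cands = list(range(cur - 1, -1, -1))                    # most recent first
    cands.sort(key=lambda p: -context_len(data, cur, p))    # stable sort keeps recency within ties
    return [cur - p for p in cands]

def code_offset(data, cur, offset):
    return ranked_offsets(data, cur).index(offset)          # the index is what we'd entropy code

data = "the cat sat on the mat and the "
cur = len(data)
print(ranked_offsets(data, cur)[:5])    # offsets 12 and 27, both preceded by "the ", rank first
print(code_offset(data, cur, 12))       # -> 0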

This is very closely related to LZP, Block Sorting, and Fenwick's Symbol Ranking stuff.

We can do something similar in a slightly different way by sorting. Consider all previous possible match spots. Sort them by their context string, backwards, like for PPM. Now to match your current string, find the longest match with your forward string. This is the offset you want to send. To send it, find your current context in the sorted list - you'll find the longest context match (actually you'll land between two strings in the list). Send the difference between your position in the list and the position of the spot with the longest forward match. Hopefully this delta is usually small, because the best match usually comes from a context similar to yours. The match length works like before - the offset implicitly sends a minimum length, and you send the additional length beyond that.
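
And a rough sketch of that sorted-list scheme (brute force, re-sorting from scratch each time; a real coder would maintain the list incrementally, and the forward match here only references already-seen data for simplicity) :

import bisect

def back_ctx(data, pos, max_len=16):
    return data[max(0, pos - max_len):pos][::-1]        # context string, reversed

def fwd_match_len(data, pos, cur):
    n = 0
    while cur + n < len(data) and pos + n < cur and data[pos + n] == data[cur + n]:
        n += 1
    return n

def acb_style_code(data, cur):
    order = sorted(range(cur), key=lambda p: back_ctx(data, p))    # all previous spots, context-sorted
    keys = [back_ctx(data, p) for p in order]
    my_rank = bisect.bisect_left(keys, back_ctx(data, cur))        # where our context slots in
    best_rank = max(range(cur), key=lambda i: fwd_match_len(data, order[i], cur))
    extra_len = fwd_match_len(data, order[best_rank], cur)
    return best_rank - my_rank, extra_len       # a small delta plus a length is what we'd code

data = "the cat sat on the mat and the cat sat"
print(acb_style_code(data, data.rindex("cat")))   # -> (-1, 7) : the best match sits right next to us in the sorted list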

Obviously our goal has been to make the coded offset index and the extra match length both very close to zero. I mentioned block sort and symbol ranking before; the statistics of the codes we're generating here should look very much like theirs.

Well, holy crap, this is exactly ACB !! (ACB = Associative Coding by Buyanovsky). I found a paper on ACB by Tomas Skopal or you can read this or this on my site.

ACB was never given much attention for a few reasons. 1) George's English is awful and he could never write a decent paper on it, so it never got into academia; his terminology is all non-standard and very confusing. 2) It's way the hell off the optimal curve in terms of speed vs. compression; it's slower than PPM and can't compress as well. It is, however, interesting to me in a theoretical sense, because it's sort of as good as you can do with dictionary compression, and as we've seen it's extremely similar to LZP, block sorting, and symbol ranking. In fact all of these coders wind up sucking in the same way, because by simply turning your correlation knowledge into an index with low mean, you're giving away a lot of good local information that a plain context modeller could do more with.
