cbloom rants: 05-21-15

LZ-Sub decoder :


delta_literal = get_sub_literal();

if ( delta_literal != 0 )
{
    // literal : byte = prediction + delta (byte arithmetic, wraps mod 256)
    // (read ptr[-lastOffset] before advancing ptr; doing both in one
    //  expression is unsequenced in C)
    *ptr = delta_literal + ptr[-lastOffset]; ptr++;
}
else // delta_literal == 0
{
    if ( ! get_offset_flag() )
    {
        // plain zero delta, offset unchanged
        *ptr = ptr[-lastOffset]; ptr++;
    }
    else if ( get_lastoffset_flag() )
    {
        int lo_index = get_lo_index();
        lastOffset = last_offsets[lo_index];
        // do MTF or whatever using lo_index
        
        *ptr = ptr[-lastOffset]; ptr++;
        // extra 0 delta literal implied :
        *ptr = ptr[-lastOffset]; ptr++;
    }
    else
    {
        lastOffset = get_offset();
        // put offset in last_offsets set
        
        *ptr = ptr[-lastOffset]; ptr++;
        *ptr = ptr[-lastOffset]; ptr++;
        // some automatic zero deltas follow for larger offsets
        if ( lastOffset > 128 )
        {
            *ptr = ptr[-lastOffset]; ptr++;
            if ( lastOffset > 16384 )
            {
                *ptr = ptr[-lastOffset]; ptr++;
            }
        }
    }

    // each single zero is followed by a zero runlen
    //  (this is just a speed optimization)
    int zrl = get_zero_runlen();
    while ( zrl-- > 0 )
    {
        *ptr = ptr[-lastOffset]; ptr++;
    }
}

This is basically LZMA: sub literals instead of bitwise-LAM, but structurally the same. (I've also reversed the implied structure here: zero delta -> offset flag, whereas in normal LZ you do offset flag -> zero delta.)

This is what a modern LZ is. You're sending deltas from the prediction. The prediction is the source of the match. In the "match" range, the delta is zero.

The thing about modern LZs (LZMA, etc.) is that the literals-after-match (LAMs) are very important too. These are the deltas after the zero-run range. You can't really think of the match as just applying to the zero-run range. It applies until you send the next offset.
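To make the "deltas from the prediction" view concrete, here is a minimal encoder-side sketch in C. The helper name `sub_deltas` and its signature are made up for illustration (they are not from any real codec); it just computes the sub literal of each byte against the byte `offset` back, with byte arithmetic wrapping mod 256.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (illustration only): for each position in
// [start, start+count), compute the sub literal against the byte
// `offset` positions back. Requires 1 <= offset <= start.
static void sub_deltas(const uint8_t *buf, size_t start, size_t count,
                       size_t offset, uint8_t *deltas_out)
{
    for (size_t i = 0; i < count; i++)
    {
        uint8_t predicted = buf[start + i - offset];
        // subtraction wraps mod 256; the decoder adds it back the same way
        deltas_out[i] = (uint8_t)(buf[start + i] - predicted);
    }
}
```

For the buffer "abcdeabcXe" with offset 5 and start 5, this gives deltas 0, 0, 0, 244, 0. The three zeros are the "match"; the 244 ('X' minus 'd', mod 256) is a literal-after-match still coded against the same offset, and the final zero shows the prediction is still useful after the match nominally ended.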

You can also of course do a simpler & more general variant :

Generalized-LZ-Sub decoder :


if ( get_offset_flag() )
{
    // also lastoffset LRU and so on not shown here
    lastOffset = get_offset();
}

delta_literal = get_sub_literal();

// read the prediction before advancing ptr (unsequenced in C otherwise)
*ptr = delta_literal + ptr[-lastOffset]; ptr++;

Generalized-LZ-Sub just sends deltas from the prediction. Matches are a bunch of zeros. I've removed the acceleration of sending zeros as a runlen for simplicity, but you could still do that.
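Here is a rough, self-contained C sketch of the generalized decoder running over an array of already-parsed tokens instead of a bitstream. The `GenLzSubToken` struct, the initial offset of 1, and predicting 0 for positions before the start of the buffer are all assumptions made so the example can run; a real coder would entropy-code these fields and keep a last-offset set.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical token layout for illustration only.
typedef struct
{
    int      has_offset;  // stands in for get_offset_flag()
    uint32_t offset;      // stands in for get_offset(), used if has_offset
    uint8_t  delta;       // stands in for get_sub_literal()
} GenLzSubToken;

// Decodes one byte per token into out (which must hold ntokens bytes).
// Returns the number of bytes written, or 0 if an offset of 0 is seen.
static size_t gen_lz_sub_decode(const GenLzSubToken *tokens, size_t ntokens,
                                uint8_t *out)
{
    size_t last_offset = 1; // assumed initial offset
    for (size_t pos = 0; pos < ntokens; pos++)
    {
        if (tokens[pos].has_offset)
        {
            if (tokens[pos].offset == 0) return 0;
            last_offset = tokens[pos].offset;
        }
        // assumption: bytes before the start of the buffer predict as 0
        uint8_t predicted = (pos >= last_offset) ? out[pos - last_offset] : 0;
        // byte = prediction + delta, wrapping mod 256
        out[pos] = (uint8_t)(tokens[pos].delta + predicted);
    }
    return ntokens;
}
```

Note that nothing here distinguishes "literals" from "matches": a token with delta 0 copies the predicted byte, and an offset can arrive on any token, whether or not a run of zeros follows.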

The main difference is that you can send offsets anywhere, not just at certain spots where there are a bunch of zero deltas generated (aka "min match lengths").

This could be useful. For example, when coding images/video/sound, there is often not an exact match that gives you a bunch of exact zero deltas, but there might be a very good match that gives you a bunch of small deltas. It would be worth sending that offset to get the small deltas, but normal LZ can't do it.
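As a sketch of what an encoder could do with that freedom, the C below brute-forces the offset whose deltas are smallest over a short window. Everything here is hypothetical: the function names are invented, and "distance of the wrapped delta from zero" is only a crude stand-in for what a real entropy coder would charge.

```c
#include <stddef.h>
#include <stdint.h>

// Crude cost of a sub literal: distance of the wrapped delta from zero.
// (An assumption; a real coder would use its actual statistics.)
static unsigned delta_cost(uint8_t delta)
{
    return (delta < 128) ? delta : 256u - delta;
}

// Sum of delta costs for coding buf[start .. start+count) against the
// bytes `offset` back. Requires 1 <= offset <= start.
static unsigned long offset_cost(const uint8_t *buf, size_t start,
                                 size_t count, size_t offset)
{
    unsigned long total = 0;
    for (size_t i = 0; i < count; i++)
        total += delta_cost((uint8_t)(buf[start + i] - buf[start + i - offset]));
    return total;
}

// Try every offset in [1, min(max_offset, start)] and return the cheapest
// (the lowest offset wins ties). Requires start >= 1.
static size_t best_offset(const uint8_t *buf, size_t start,
                          size_t count, size_t max_offset)
{
    size_t best = 1;
    unsigned long best_cost = offset_cost(buf, start, count, 1);
    for (size_t off = 2; off <= max_offset && off <= start; off++)
    {
        unsigned long c = offset_cost(buf, start, count, off);
        if (c < best_cost) { best_cost = c; best = off; }
    }
    return best;
}
```

With the samples {10, 50, 200, 90, 11, 49, 201, 91}, coding the last four against offset 4 gives deltas of +1, -1, +1, +1. There is not a single zero delta, so a normal LZ parse finds no match at all, yet offset 4 is clearly the one worth sending.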

Generalized-LZ-Sub could also give you literal-before-match. That is, instead of sending the offset at the run of zero deltas, you could send it slightly *before* that, where the deltas are not zero but are small.

(When compressing text, "sub" should be replaced with some kind of smart lexicographical distance: for each character, precompute a list of its most likely substitution characters in order of probability.)
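One way to read that idea is to send the rank of the actual character in the predicted character's substitution list instead of an arithmetic difference. The C sketch below shows the table mechanics only. The ordering it builds (same character first, then the other-case letter, then everything else in byte order) is a toy assumption; a real table would be built from measured substitution statistics, and all names here are invented.

```c
#include <ctype.h>
#include <stdint.h>

// rank_to_char[p][r] = r-th most likely actual character when p is predicted.
// char_to_rank[p][c] = inverse mapping, used by the encoder.
static uint8_t rank_to_char[256][256];
static uint8_t char_to_rank[256][256];

// Toy ordering (assumption): predicted char itself is rank 0, its
// other-case letter (if it has one) is rank 1, then all remaining bytes.
static void build_subst_tables(void)
{
    for (int p = 0; p < 256; p++)
    {
        int swapped = p;
        if (islower(p)) swapped = toupper(p);
        else if (isupper(p)) swapped = tolower(p);

        int r = 0;
        rank_to_char[p][r++] = (uint8_t)p;
        if (swapped != p) rank_to_char[p][r++] = (uint8_t)swapped;
        for (int c = 0; c < 256; c++)
            if (c != p && c != swapped) rank_to_char[p][r++] = (uint8_t)c;

        for (int k = 0; k < 256; k++)
            char_to_rank[p][rank_to_char[p][k]] = (uint8_t)k;
    }
}

// Encoder side: replaces (actual - predicted).
static uint8_t text_sub_encode(uint8_t actual, uint8_t predicted)
{
    return char_to_rank[predicted][actual];
}

// Decoder side: replaces (delta + predicted).
static uint8_t text_sub_decode(uint8_t rank, uint8_t predicted)
{
    return rank_to_char[predicted][rank];
}
```

Rank 0 still means "exact match", so the zero-run machinery is unchanged; only the meaning of the nonzero symbols differs.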

LZ is a bit like a BWT, but instead of the contexts being inferred by the prefix sort, you transmit them explicitly by sending offsets to prior strings. Weird.
