
09-30-11 - String Match Results Part 2b

Still on optimal parsing, exact matching, large window :

Chart of clocks per byte, on each file of a test set :

On my "lztest" data set :


0 = almost_incompressable
1 = bigship1.x
2 = Dolphin1.x
3 = GrannyRocks_wot.gr2
4 = Gryphon_stripped.gr2
5 = hetero50k
6 = movie_headers.bin
7 = orange_purple.BMP
8 = particles.max
9 = pixo_run_animonly_stripped.gr2
10 = poker.bin
11 = predsave.bin
12 = quick.exe
13 = RemoteControl_stripped.gr2
14 = ScriptVolumeMgr.cpp

"lztest" is not a stress test set, it's stuff I've gathered that I think is roughly reflective of what games actually compress. It's interesting that this data set causes lots of DNF's (did not finish) for MMC and LzFind.

Suffix5 (the real suffix trie) is generally slightly faster than the suffix array. It should be, of course (assuming I didn't do a bonehead trie implementation), since the suffix array method basically builds a trie in the sort, reads it out to sorted indexes, and then converts the sorted indexes back to match lengths.
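
To be concrete about the "converts the sorted indexes back to match lengths" step, here's a minimal sketch of the adjacent-LCP part of that conversion using Kasai's algorithm (illustrative code, not my actual implementation; the function name and signature are made up) :

// Sketch : given the sorted suffix indexes, compute the longest common
// prefix of each adjacent pair in sorted order. This is the raw
// match-length information that the trie hands you directly as it's built.
#include <vector>

std::vector<int> lcp_from_suffix_array(const unsigned char * buf, int size,
                                       const std::vector<int> & sortIndex)
{
    std::vector<int> rank(size);     // rank[i] = where suffix i landed in sorted order
    for (int i = 0; i < size; i++)
        rank[sortIndex[i]] = i;

    std::vector<int> lcp(size, 0);   // lcp[r] = match len of sorted suffix r vs r-1
    int h = 0;
    for (int i = 0; i < size; i++)
    {
        if (rank[i] > 0)
        {
            int j = sortIndex[rank[i] - 1];  // suffix adjacent to i in sorted order
            while (i + h < size && j + h < size && buf[i + h] == buf[j + h])
                h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;  // the lcp can drop by at most 1 as i advances
        }
        else
        {
            h = 0;
        }
    }
    return lcp;
}

Roughly speaking, the matcher then still has to walk outward from each position in sorted order, taking the running min of these adjacent LCPs, to find occurrences at earlier file positions; that extra pass is work the trie never does.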

Good old CCC (Calgary Compression Corpus) :


0 = BIB
1 = BOOK1
2 = BOOK2
3 = GEO
4 = NEWS
5 = OBJ1
6 = OBJ2
7 = PAPER1
8 = PAPER2
9 = PAPER3
10 = PAPER4
11 = PAPER5
12 = PAPER6
13 = PIC
14 = PROGC
15 = PROGL
16 = PROGP
17 = TRANS

I won't be showing results on CCC for the most part because it's not very reflective of real-world modern data, but I wanted to run on a set where MMC and LzFind don't DNF too much, to compare their speed when they do succeed. Suffix Trie is almost always very close to the fastest, except on paper4 & paper5, which are very small files.


0 = BIB
1 = BOOK1
2 = BOOK2
3 = GEO
4 = NEWS
5 = OBJ1
6 = OBJ2
7 = PAPER1
8 = PAPER2
9 = PROGC
10 = PROGL
11 = PROGP
12 = TRANS

Two new tests in the mix.

Test_Hash1 : traditional "Zip" style fixed-size hash -> linked list. In this run there's no chain limit, so matching is exact.
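
To make that concrete, here's a minimal sketch of the kind of matcher I mean (hypothetical code, not the actual Test_Hash1; the names are made up for illustration) :

// Sketch of a "Zip" style matcher : fixed-size hash of the first 4 bytes
// -> head of a linked list of positions, chained through a parallel array.
#include <vector>

struct HashChainMatcher
{
    enum { HASH_BITS = 16, HASH_SIZE = 1 << HASH_BITS };

    std::vector<int> head;   // head[h] = most recent position with hash h, or -1
    std::vector<int> next;   // next[pos] = previous position with the same hash, or -1

    explicit HashChainMatcher(int bufSize)
        : head(HASH_SIZE, -1), next(bufSize, -1) { }

    static unsigned hash4(const unsigned char * p)   // assumes 4 readable bytes at p
    {
        unsigned x = p[0] | (p[1]<<8) | (p[2]<<16) | (p[3]<<24);
        return (x * 2654435761u) >> (32 - HASH_BITS);
    }

    // find the longest match for buf[pos..) , then insert pos into the chain
    int find_and_insert(const unsigned char * buf, int pos, int size, int * pMatchPos)
    {
        int bestLen = 0;
        unsigned h = hash4(buf + pos);
        for (int m = head[h]; m >= 0; m = next[m])   // no chain limit -> exact
        {
            int len = 0;
            while (pos + len < size && buf[m + len] == buf[pos + len])
                len++;
            if (len > bestLen) { bestLen = len; *pMatchPos = m; }
        }
        next[pos] = head[h];
        head[h] = pos;
        return bestLen;
    }
};

With no chain limit the walk visits every prior occurrence of the hash value, which is what makes it exact (and what makes it slow on degenerate data).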

Test_Hash3 : uses cblib::hash_table (a reprobing hash table - aka "open addressing" or "closed hashing", but I prefer "reprobing") to hash the first 4 bytes, then a linked list from each entry. I was surprised to find that this is almost the same speed as Hash1, and sometimes faster, even though it's a totally generic template hash table (not particularly well suited to this usage).
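
A sketch of the Hash3 idea, using std::unordered_map as a stand-in for cblib::hash_table (hypothetical code; the real table is a reprobing one and the names here are made up) :

// Sketch : the table maps the first 4 bytes to the head of a position list;
// the list itself is chained through a parallel next[] array, as before.
#include <unordered_map>
#include <vector>

struct GenericHashMatcher
{
    std::unordered_map<unsigned, int> head;  // 4-byte key -> most recent position
    std::vector<int> next;                   // next[pos] = previous position with same key

    explicit GenericHashMatcher(int bufSize) : next(bufSize, -1) { }

    // assumes pos+4 <= size so the 4-byte key can be read
    int find_and_insert(const unsigned char * buf, int pos, int size, int * pMatchPos)
    {
        unsigned key = buf[pos] | (buf[pos+1]<<8) | (buf[pos+2]<<16) | (buf[pos+3]<<24);
        int bestLen = 0;
        std::unordered_map<unsigned,int>::iterator it = head.find(key);
        int chain = (it == head.end()) ? -1 : it->second;
        for (int m = chain; m >= 0; m = next[m])
        {
            int len = 0;
            while (pos + len < size && buf[m + len] == buf[pos + len])
                len++;
            if (len > bestLen) { bestLen = len; *pMatchPos = m; }
        }
        next[pos] = chain;
        head[key] = pos;
        return bestLen;
    }
};

In this sketch the only structural difference from Hash1 is that the table is keyed on the full first 4 bytes, so positions with different 4-byte prefixes never share a chain; the chain walk itself is the same.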
