5/27/2010

05-27-10 - Loop Branch Inversion

A major optimization paradigm I'm really missing from C++ is something I will call "loop branch inversion". The problem is for code sharing and cleanliness you often wind up with cases where you have a lot of logic in some outer loops that find all the things you should work on, and then in the inner loop you have to do a conditional to pick what operation to do. eg :

LoopAndDoWork(query,worktype)
{
    Make bounding area
    Do Kd-tree descent .. 
    loop ( tree nodes )
    {
        bounding intersection, etc.
        found an object
        DoPerObjectWork(object);
    }
}

The problem is that DoPerObjectWork then is some conditional, maybe something like :

DoPerObjectWork(object)
{
    switch(workType)
    {
    ...
    }
}

or even worse - it's a function pointer that you call back.

Instead you would like the switch on workType to be on the outside. WorkType is a constant all the way through the code, so I can just propagate that branch up through the loops, but there's way to express it neatly in C.

The only real option is with templates. You make DoPerObjectWork a functor and you make LoopAndDoWork a template. The other option is to make an outer loop dispatcher to constants. That is, make workType a template parameter instead of an integer :


template < int workType >
void t_LoopAndDoWork(query)
{
    ...
}

and then provide a dispatcher which does the branch outside :

LoopAndDoWork(query,worktype)
{
    switch(workType)
    {
    case 0 : t_LoopAndDoWork<0>(query); break;
    case 1 : t_LoopAndDoWork<1>(query); break;
    ...
    }
}

this is an okay solution, but it means you have to reproduce the branch on workType in the outer loop and inner loop. This is not a speed penalty becaus the inner loop is a branch on constant which goes away, it's just ugly for code maintenance purposes because they have to be kept in sync and can be far apart in the code.

This is a general pattern - use templates to turn a variable parameter into a constant and then use an outer dispatcher to turn a variable into the right template call. But it's ugly.

BTW when doing this kind of thing you are often wind up with loops on constants. The compiler often can't figure out that a loop on a constant can be unrolled. It's better to rearrange the loop on constant into branches. For example I'm often doing all this on pixels where the pixel can have between 1 and 4 channels. Instead of this :


for(int c=0;c<channels;c++)
{
    DoStuff(c);
}

where channels is a constant (template parameter), it's better to do :

DoStuff(0);
if ( channels > 1 ) DoStuff(1);
if ( channels > 2 ) DoStuff(2);
if ( channels > 3 ) DoStuff(3);

because those ifs reliably go away.

4 comments:

won3d said...

http://en.wikipedia.org/wiki/Loop_unswitching

I guess MSVC doesn't do this, or maybe the switch bodies are too big for it to want to duplicate. The whole "switch in a loop" pattern is common enough (e.g. indirect threaded interpreters) that it is usually pretty optimized. Like, when the dispatch predicate is NOT constant, some compilers (like GCC) will actually replicate the predicate code and jump directly to a case, rather than separately going through the loop and branch. Among other things, this improves branch prediction of such a loop.

But yeah, MSVC is bad at unrolling loops. Sad you have to manually unroll, or is it because the body of the loop is a function call?

cbloom said...

"But yeah, MSVC is bad at unrolling loops. Sad you have to manually unroll, or is it because the body of the loop is a function call?"

Usually when I'm doing this it's very complicated code and many function calls.

It's also often complicated enough that I don't want to expose the whole thing out to the world, or the template/functor method would work. The only real disadvantages to the functor method are that you have publish your whole routine in a header, and that functors are a pain in the ass to make (fixed in C++0x).

Brian said...

What type of performance wins do you actually see by doing the unrolling? I wonder if the processor branch prediction does well enough that this doesn't really matter.

Anonymous said...

Brian: PPC don't have branch prediction and are very bad at it. This include X360's Xenon and PS3's PPU and SPU.

old rants