Friday, June 12, 2009

Current Project:: Wait-Free Cache Allocator (pt 1)

About a week ago, I started looking into obstruction-free, lock-free and wait-free algorithms. I've been having an absolute blast with them and thought I'd post what I've learned and done so far.

(This first post is the English version. I'll get into WAY more detail in later posts.)

Imagine, for a moment, that you need a series of boxes for packing. Every time you need a new box, you have to make it. Work with me here... imagine you've got the skills/facilities to do this...

Method A is to create a new box every time you need one. You stop packing, make a new box, then fill it. It takes you about 1 minute to make each box this way because you have to gather the materials, do the work, then clean up afterwards.

Method B is to create a bunch of boxes whenever you run out, fill them all up, then create a bunch more boxes. This method avoids a lot of the startup and cleanup, and you can cut boxes in groups; all in all, this is way more efficient. This method costs about 0.04 minutes, or 2.4 seconds, per box. Very significant improvement! The only problem with this approach is that you must have only one person packing and creating boxes (read on and you'll understand why).
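In allocator terms, Method B is a simple pool: grab one big chunk up front, hand out slots until it's empty, then grab another. Here's a rough single-threaded sketch (the class and all the names are just mine for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Method B as a single-threaded pool: make a whole "set of boxes"
// (one big chunk) at once, hand out slots until the set is empty,
// then stop and make another set.
class BatchPool {
public:
    BatchPool(std::size_t slotSize, std::size_t slotsPerBatch)
        : slotSize_(slotSize), slotsPerBatch_(slotsPerBatch),
          next_(slotsPerBatch) {}          // start with no boxes made

    void* allocate() {
        if (next_ == slotsPerBatch_) {     // ran out of boxes:
            makeBatch();                   // stop packing, make a new set
        }
        return &batches_.back()[next_++ * slotSize_];
    }

private:
    void makeBatch() {
        batches_.emplace_back(slotSize_ * slotsPerBatch_);
        next_ = 0;
    }

    std::size_t slotSize_, slotsPerBatch_;
    std::size_t next_;                     // next unused slot in the batch
    std::vector<std::vector<unsigned char>> batches_;
};
```

The per-box cost drops because the chunk acquisition is amortized over the whole batch; only the occasional `makeBatch()` call pays the full price.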

Method C is to create a bunch of boxes up front, then every time you use a box, do a little work towards making the next set of boxes. Nothing significant, mind you... put a piece of cardboard near the machine. Make one cut in the group of boxes. Maybe pick up a scrap or two off the floor and throw it away. We also avoid the downtime of making the entire group wait while one person makes boxes. Doing it this way takes 0.08 minutes, or 4.8 seconds, per box. While that's still way more efficient than making boxes one at a time, it's less efficient than Method B. It's also much more complicated.
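Method C, in allocator terms: every allocation also does one small step of building the next batch, so the next batch is finished right when the current one runs out. A single-threaded sketch to show the amortization (the names and the "prepare" step are mine; the real version would do actual allocation work in that step):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Method C, single-threaded for clarity: each allocation performs one
// small step toward building the next batch, so by the time the current
// batch runs out, the next one is already finished and waiting.
class IncrementalPool {
public:
    IncrementalPool(std::size_t slotSize, std::size_t slotsPerBatch)
        : slotSize_(slotSize), slots_(slotsPerBatch),
          current_(slotSize * slotsPerBatch),
          next_(slotSize * slotsPerBatch) {}

    void* allocate() {
        prepareOneSlot();                  // the "little work" per box
        if (used_ == slots_) {             // current batch exhausted:
            current_.swap(next_);          // next batch is already done
            used_ = 0;
            prepared_ = 0;
        }
        return &current_[used_++ * slotSize_];
    }

private:
    void prepareOneSlot() {
        // One discrete step of making the next "set of boxes"; here it
        // just zeroes one slot of the next batch as stand-in work.
        if (prepared_ < slots_) {
            unsigned char* slot = &next_[prepared_ * slotSize_];
            for (std::size_t i = 0; i < slotSize_; ++i) slot[i] = 0;
            ++prepared_;
        }
    }

    std::size_t slotSize_, slots_;
    std::size_t used_ = 0;      // slots handed out from the current batch
    std::size_t prepared_ = 0;  // steps done toward the next batch
    std::vector<unsigned char> current_, next_;
};
```

One prepare step per allocation means a batch of N boxes accumulates exactly the N steps needed to finish the next batch, which is where the "next set is finished and waiting" property comes from.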

The single worker approach:

With only one person doing all of the work, any of these approaches works. Method B is obviously the best, but they all at least work.

The multiple worker approach:

With multiple workers, the problems multiply. In the single worker approach, all we had to worry about was the one person making and packing boxes: work as fast as you can and that's that. With multiple workers, anything that stalls a worker quickly puts a huge dent in our productivity. Another problem is that we only have one box-making machine, so only one person can use it at a time.

The problem with Method A is that it inevitably causes a backup at the machine. The more workers you add, the larger the backup grows, until throughput is capped at one box per box-making time. Definitely not ideal.
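In code, the "one box machine" is whatever serializes allocation: a global lock around the allocator, say. A toy sketch of Method A under contention (`std::mutex` here is my stand-in, not necessarily what the real system uses):

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>

// Method A with multiple workers: every allocation queues up at the one
// "box machine". The lock serializes everyone, so total throughput is
// capped at one box per box-making time no matter how many workers.
static std::mutex g_machine;

void* make_box(std::size_t size) {
    std::lock_guard<std::mutex> hold(g_machine);  // wait for the machine
    return ::operator new(size);                  // make exactly one box
}

void recycle_box(void* box) {
    std::lock_guard<std::mutex> hold(g_machine);
    ::operator delete(box);
}
```

Adding workers only lengthens the queue at `g_machine`; the lock is the backup at the box machine made literal.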

Method B has an issue in that you waste a lot of man-hours. When the group runs out of boxes, only one person can make new ones and everyone else has to sit around waiting. That wastes a lot of time, and it only gets worse as you add workers: with 10 workers and an average of 0.04 minutes to make a set of boxes, every stall costs you an extra 10 × 0.04 = 0.4 worker-minutes, or 24 seconds. In other words, the more workers you have, the worse your efficiency will be.

Method C looks pretty good. None of the workers ever get stalled, and they all contribute a bit towards making the next set of boxes. This method also has the nice little feature that every time you run out of boxes, the next set is finished and waiting. As an added bonus, this version doesn't care whether there is one worker or a thousand; it'll work just as well either way.

As you can probably guess, the method I'm working on is C. More info on how, exactly, this works later on.


  1. Any initially empty boxes you leave unattended will immediately be occupied by N cats where N is approximately 1.

    I'm going to guess that a worker with an empty input queue and an empty workbench switches to making boxes until his input queue is occupied by I >= IThresh units, or by I > 0 units for more than ITimeThresh time units, at which point he switches back to his workbench. If there is a worker already making boxes (the box machine is a constrained resource), the worker idles.

    If the number of boxes B > BThresh, then every worker idles (since you have only BThresh cats to guard empty boxes).
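That guess can be written down as a little decision function. The thresholds (IThresh, ITimeThresh, BThresh) are the comment's names; the structure is just my reading of it:

```cpp
#include <cassert>
#include <cstddef>

// One worker's next action, per the comment's guess above. All the
// thresholds are the comment's names, not anything from the real code.
enum class Action { Pack, MakeBoxes, Idle };

struct WorkerView {
    std::size_t I;            // units in this worker's input queue
    double      ITime;        // how long I has been > 0
    std::size_t B;            // finished empty boxes on hand
    bool        machineBusy;  // is another worker at the box machine?
};

Action nextAction(const WorkerView& w, std::size_t IThresh,
                  double ITimeThresh, std::size_t BThresh) {
    if (w.B > BThresh) return Action::Idle;           // enough boxes already
    if (w.I >= IThresh) return Action::Pack;          // queue full enough
    if (w.I > 0 && w.ITime > ITimeThresh)
        return Action::Pack;                          // waited long enough
    if (w.machineBusy) return Action::Idle;           // machine is taken
    return Action::MakeBoxes;                         // nothing to pack
}
```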

  2. hehe, I had to read the first sentence three times before I got it. :)

    You're right: Methods A and B assume that the machine can only be used by one person at a time. If someone else tries to use it, they go into a wait of some sort.

    Method C is a bit trickier to explain... every step of making the boxes (except one) is totally discrete and parallelizable. When workers pack a box, they look for any available task and try to accomplish it.

    The packing analogy has issues here in that all but one of the steps can be done at the same time. Obviously, this isn't quite accurate with boxes... you can't clean up a mess you haven't made yet.

    The only step that forces everyone to wait is the part where the first person moves the new cardboard to the cutter (allocation). Unfortunately, this is also the slowest part... Luckily, I fill this gap time using no-operation calls if needed.
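One way to picture that: the serializing step is claimed with a single compare-and-swap, and whoever loses the race burns the gap time in no-ops. A sketch using C++11 atomics (the post predates them, so this is my stand-in rather than the author's actual code):

```cpp
#include <atomic>
#include <cassert>

static std::atomic<bool> allocating(false);
static std::atomic<int>  batchesMade(0);

// The slow, serializing part: the post's "move the cardboard to the
// cutter". Counting calls stands in for the real allocation work.
void allocateNextBatch() {
    batchesMade.fetch_add(1);
}

// Only the worker who wins the compare-and-swap performs the allocation;
// everyone else fills the gap with no-operation spins until it's done.
void serializedStep() {
    bool expected = false;
    if (allocating.compare_exchange_strong(expected, true)) {
        allocateNextBatch();            // we won the race: do the slow part
        allocating.store(false);
    } else {
        while (allocating.load()) {
            // no-operation filler: the "gap time" from the post
        }
    }
}
```

The compare-exchange guarantees exactly one winner per round, which is what makes this one step the only point where anyone has to wait.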

  3. Don't forget locality....

    In the analogy, if the worker has to put down his tools and pick up others (predictive caching has failed and new instructions have to be loaded) or walk over to the box machine (data not in the cache) then switching tasks picks up a lot more cost than if everything is local.

  4. Yeah, my worry with locality is that since the memory is allocated elsewhere, I'll be getting a lot of cache misses. Not on the data, as the data is never used, just the pointers. These are scattered around (somewhat... depending on what the new operator does), so that part is fairly unpredictable, thus "bad."

    The way I'm selecting tasks at the moment is pretty horrid too; I'm working on this now, actually. Essentially, it boils down to a bunch of if/else's, but I'm going to switch it to an atomic increment... though that might slow it down even more, what with the underlying OS stuff to interrupt threads.

    The next post I'm putting up has the code in it, feel free to rip it apart. :)
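For what it's worth, the if/else-versus-atomic-increment idea from that last comment might look like this: a fetch-and-add hands each caller the next task index, round-robin, with no branching (C++11 atomics again as my stand-in):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Task selection by atomic increment: each caller atomically grabs the
// next counter value and takes task (value mod taskCount). There is no
// if/else chain, and no two callers can draw the same counter value.
static std::atomic<std::size_t> taskCounter(0);

std::size_t pickTask(std::size_t taskCount) {
    return taskCounter.fetch_add(1) % taskCount;
}
```

A single `fetch_add` is the whole selection policy: under contention it stays wait-free (every caller finishes in one step), at the cost of all threads hammering one cache line.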