Monday, October 5, 2009

Wait-Free Cache Allocator, Part 3!

Yeah, it's been a while since I posted on this so I wanted to give a quick status on how it's working out...

There are four different types of tests:
  • Raw (r): Simple new/delete allocation scheme where each request for memory causes an allocation.
  • Blocked (b): Allocates in blocks, handing out a chunk at a time; enforces thread safety using mutexes (see the sketch after this list).
  • Unblocked (ub): Just like Blocked except it is totally NOT thread safe...
  • Atomic (a): The pseudo-wait-free allocator I've been working on. (It's not actually wait-free... making it truly wait-free would probably make it less efficient and take too long to write.)
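To make the "blocked" strategy concrete, here's a minimal sketch of the idea: grab a whole block from the system up front, then hand out chunks one at a time under a mutex. This is illustrative only (and uses C++11 primitives for brevity, which postdate this post), not the actual test code; the atomic allocator replaces the lock with the compare-and-swap tricks from part 2.

#include <cstddef>
#include <mutex>
#include <vector>

class BlockedAllocator
{
public:
    BlockedAllocator(std::size_t chunkSize, std::size_t chunksPerBlock)
        : mChunkSize(chunkSize), mChunksPerBlock(chunksPerBlock) {}

    ~BlockedAllocator()
    {
        for (char* block : mBlocks)
            delete[] block;
    }

    void* Allocate()
    {
        std::lock_guard<std::mutex> lock(mMutex);   // the cost we're measuring
        if (mFree.empty())
            GrowBlock();                            // one big hit, then cheap
        void* chunk = mFree.back();
        mFree.pop_back();
        return chunk;
    }

    void Deallocate(void* chunk)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mFree.push_back(chunk);
    }

private:
    void GrowBlock()
    {
        // Carve one big allocation into chunk-sized pieces.
        char* block = new char[mChunkSize * mChunksPerBlock];
        mBlocks.push_back(block);
        for (std::size_t i = 0; i < mChunksPerBlock; ++i)
            mFree.push_back(block + i * mChunkSize);
    }

    std::size_t mChunkSize;
    std::size_t mChunksPerBlock;
    std::mutex mMutex;
    std::vector<void*> mFree;
    std::vector<char*> mBlocks;   // kept so the destructor can free everything
};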

The raw allocator is my baseline for the tests, but I try to compare all of the results against each other. I'm still working on the test framework, but here are the results from a single thread running all four tests at various iteration counts; each test allocates in bulk, then deallocates randomly. Yes, yes, I know: what's the point of a single-threaded test of something that's multithreaded? It's just a baseline to see how it performs in non-threaded situations (where it SHOULDN'T be used, but still...)

ANYWAYS, here's the results:

000: Iters[    382]: r[ 0.0002822] b[ 0.0003256] ub[ 0.0001308] a[ 0.0001301]
001: Iters[    573]: r[ 0.0003111] b[ 0.0004795] ub[ 0.0001159] a[ 0.0001418]
002: Iters[    859]: r[ 0.0004065] b[ 0.0005000] ub[ 0.0001171] a[ 0.0001703]
003: Iters[   1288]: r[ 0.0005679] b[ 0.0007953] ub[ 0.0001349] a[ 0.0004345]
004: Iters[   1932]: r[ 0.0007991] b[ 0.0010567] ub[ 0.0001510] a[ 0.0002469]
005: Iters[   2898]: r[ 0.0011374] b[ 0.0014791] ub[ 0.0002087] a[ 0.0003129]
006: Iters[   4347]: r[ 0.0016636] b[ 0.0022002] ub[ 0.0001988] a[ 0.0004183]
007: Iters[   6520]: r[ 0.0024785] b[ 0.0040991] ub[ 0.0002672] a[ 0.0005800]
008: Iters[   9780]: r[ 0.0035872] b[ 0.0047101] ub[ 0.0003520] a[ 0.0008599]
009: Iters[  14670]: r[ 0.0053226] b[ 0.0074197] ub[ 0.0004619] a[ 0.0011902]
010: Iters[  22005]: r[ 0.0080567] b[ 0.0107315] ub[ 0.0008942] a[ 0.0019779]
011: Iters[  33007]: r[ 0.0122470] b[ 0.0168040] ub[ 0.0011022] a[ 0.0027471]
012: Iters[  49510]: r[ 0.0179474] b[ 0.0267609] ub[ 0.0018665] a[ 0.0045312]
013: Iters[  74265]: r[ 0.0275046] b[ 0.0363858] ub[ 0.0027583] a[ 0.0066835]
014: Iters[ 111397]: r[ 0.0411089] b[ 0.0552259] ub[ 0.0036272] a[ 0.0094744]
015: Iters[ 167095]: r[ 0.0611580] b[ 0.0819954] ub[ 0.0060764] a[ 0.0145781]
016: Iters[ 250642]: r[ 0.0958014] b[ 0.1236177] ub[ 0.0086538] a[ 0.0212276]
017: Iters[ 375963]: r[ 0.1448224] b[ 0.1847220] ub[ 0.0142410] a[ 0.0322124]
018: Iters[ 563944]: r[ 0.2254061] b[ 0.3297470] ub[ 0.0198752] a[ 0.0479318]
019: Iters[ 845916]: r[ 0.3327258] b[ 0.4203867] ub[ 0.0283150] a[ 0.0709337]
TimePer: raw[0.000388] blocked[0.000516] unblocked[0.000035] atomic[0.000085]
Percent of raw: blocked[1.331635] unblocked[0.091066] atomic[0.220457]
Average: raw[0.098333] blocked[0.130944] unblocked[0.008955] atomic[0.021678]

Friday, September 18, 2009

Thread pools suck

So I was at the Guildhall mixer last night at the Austin GDC and I met Colt McAnlis, a Guildhall professor and programmer. We had a very short but very intense discussion on threading where he advised me that, in short, thread pools suck...

Which got me to thinking, why?

For those who have no idea what a thread pool is, here's the short of it... Most computers now have multiple cores, each of which can do its own thing independently of the rest of the system. A thread pool creates a number of threads (independent workers), usually in line with the number of cores available, then farms out tasks to whichever thread is free as the tasks come in. Of course, I'm leaving out a lot of details like hyperthreading, pseudo-threads that spend most of their time waiting, etc.

Ok, the non-geek version... You live in a house with 4 people and you've just had a party. You've accumulated a ton of dishes and you need to wash them all. "Single threaded" is one person washes, rinses, dries, and puts away all the dishes while the other three stand around and watch. "Multi-threaded" is one person washes, one rinses, one dries, and one puts them all away all at the same time. Pretty much a major improvement as the washer can keep washing while the other dishes magically make it into the cupboard.

But there's a huge gaping hole in this plan. What if the person drying takes too long and everyone else ends up waiting for him? The rinser can't rinse any more dishes because the counter is full of wet ones, the washer can't wash any more because the rinse spot is full, and the person putting them away has gone off to watch television.

What if we reassign priorities among the people? Sometimes the person washing helps to dry dishes. This works out a bit better, right? Well, what if there's only one drying towel or, even worse, the person washing is doing so because he absolutely sucks at drying? Our improvement may give us a slight advantage, but it's not all-that-and-a-bag-of-chips... so how do we make it better?

Oh, there is so much of a better way...

Back to the geekified version because the explanation relies a lot on hardware.

A bit of background first.

Hard drives: Hard drives are s.l.o.w. DVD drives are WAY worse. Trying to read data from either of these is time consuming and essentially wastes useful time while you sit there waiting for the data to come in.

System resources and shared hardware: Some system resources do not react well to being told to suddenly change tasks. The math pipeline is one that comes to mind. Let's say you queue up a bunch of math operations on a processor and it's over there grinding away on them. In the middle of the work, you tell it to stop and do this other thing instead. It has to flush every single job it has queued, cache in the new work and fill the pipeline all over again. Blech, this sucks.

So how do we avoid these scenarios?

File I/O: Ideally, we'll want to serialize all of the file I/O operations per device. If we allow two threads to read/write the same device at the same time (even if they're on different cores!), it will cause disk thrash. This happens when multiple threads ask for data in different areas: the disk will spin to one area, read a bit, then spin to a different area, read some more, spin back to the original area and read a bit more, etc. You know this is happening when you hear a grinding sound or feel your computer vibrating when you touch it. This is ugly bad. If we restrict all of the I/O for a specific device to a particular thread, it will essentially serialize it all for us, optimizing the way the data is read (assuming you're not asking for data in different areas, which I can't help you with.)
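As a sketch of what "one thread per device" might look like (written with C++11 primitives, which postdate this post; the class and names are made up for illustration), every read/write request gets funneled through a single thread, so the disk services one request at a time by construction:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class DeviceIoThread
{
public:
    DeviceIoThread() : mQuit(false), mThread(&DeviceIoThread::Run, this) {}

    ~DeviceIoThread()
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mQuit = true;
        }
        mWake.notify_one();
        mThread.join();
    }

    // Any thread may post work; only mThread ever touches the device.
    void Post(std::function<void()> ioTask)
    {
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mTasks.push(std::move(ioTask));
        }
        mWake.notify_one();
    }

private:
    void Run()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mMutex);
                mWake.wait(lock, [this] { return mQuit || !mTasks.empty(); });
                if (mQuit && mTasks.empty())
                    return;
                task = std::move(mTasks.front());
                mTasks.pop();
            }
            task();   // requests run back to back: serialized by construction
        }
    }

    bool mQuit;
    std::mutex mMutex;
    std::condition_variable mWake;
    std::queue<std::function<void()>> mTasks;
    std::thread mThread;   // declared last so it starts after everything else
};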

System resources and shared hardware: Using the math pipeline as an example again, the best possible scenario here is to spread the workload across as many independent pieces of hardware that can do the math as possible. For each task, let it start, work and complete end to end so the pipeline doesn't get flushed.

But Wait, There's More!

A lot of these tasks will interfere with each other, as with the file I/O. Some of them will coexist perfectly with other types, such as mixing file I/O and math operations, since they use different hardware. But here's the kicker... some tasks we can start in the background and then go off and do something else.

Let's mix the use of file reads and math. Traditionally, we read a file, wait for all of the data to show up, then return to the caller and proceed to the next task. That is a HUGE waste of time. If we instead use asynchronous file operations, we can tell the system to queue the read, then go off and do something else while we wait.
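Here's roughly what that looks like with Win32 overlapped I/O. This is a trimmed-down sketch: error handling is minimal, and DoOtherQueuedWork() is a made-up placeholder for "pull another task from the queue", not a real API.

#include <windows.h>

void DoOtherQueuedWork();   // hypothetical: pops and runs one queued task

void ReadWhileWorking(HANDLE file, void* buffer, DWORD bytes)
{
    OVERLAPPED ov = {};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);

    // Kick off the read; if the file was opened with FILE_FLAG_OVERLAPPED,
    // this returns immediately (FALSE + ERROR_IO_PENDING) while the disk works.
    if (!ReadFile(file, buffer, bytes, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
    {
        CloseHandle(ov.hEvent);
        return;   // real code would report the failure
    }

    // Instead of blocking, keep pulling other tasks until the data lands.
    while (WaitForSingleObject(ov.hEvent, 0) == WAIT_TIMEOUT)
        DoOtherQueuedWork();

    DWORD bytesRead = 0;
    GetOverlappedResult(file, &ov, &bytesRead, FALSE);
    CloseHandle(ov.hEvent);
}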

This, of course, takes some intelligence in the thread scheduler. I'm already assuming that we tag each threaded operation with the type of work it's going to do (file I/O, math, etc.). When we schedule a file I/O operation, we know it will go on a particular thread because we've chosen to serialize all of those operations. If we use asynchronous operations, we can start the task and look through our queue for another operation we can do while we wait for the I/O to finish.

But Wait, There's More!

Yeah yeah, I know I'm pretty much rambling at this point, but... c'est la vie!

What if we give each of the threads a tiny bit of intelligence so it can schedule tasks itself, grabbing them from the main pool? For example, thread 1 knows that it handles all of the file I/O for a particular device, so it pulls those tasks first. It queues the operation and immediately puts it on the back burner, requesting another task that it knows will coexist happily with the first. When that second operation finishes, it checks whether the I/O task is done with the disk. If it is, it finalizes the task and moves on. If it's not, it goes out and grabs another task to do.

Essentially, it puts all the scheduling logic in the threads, not the main app. The main app then just throws all of the tasks into a series of queues which can be done in linear time.
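A rough sketch of that shape (every name here is hypothetical; a real version would need completion polling, priorities, and smarter queueing): the main app just pushes tagged tasks, and each worker pulls the kinds of work it knows how to interleave.

#include <atomic>
#include <deque>
#include <mutex>

enum class TaskTag { FileIo, Math };

struct Task
{
    TaskTag tag;
    void (*run)(Task&);
};

class TaskPool
{
public:
    void Push(const Task& t)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mTasks.push_back(t);
    }

    // Pull the first task matching 'tag', if any.
    bool Pull(TaskTag tag, Task& out)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        for (auto it = mTasks.begin(); it != mTasks.end(); ++it)
        {
            if (it->tag == tag)
            {
                out = *it;
                mTasks.erase(it);
                return true;
            }
        }
        return false;
    }

private:
    std::mutex mMutex;
    std::deque<Task> mTasks;
};

// The I/O worker owns one disk: it grabs its own kind of work first,
// then fills the wait with something that coexists happily (math here).
void IoWorker(TaskPool& pool, std::atomic<bool>& quit)
{
    Task task;
    while (!quit)
    {
        if (pool.Pull(TaskTag::FileIo, task))
            task.run(task);                 // e.g. kick off an async read
        if (pool.Pull(TaskTag::Math, task))
            task.run(task);                 // useful work while the disk spins
    }
}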

This turns traditional thread pooling on its head. Normally the main app schedules tasks to threads and the threads sit and wait for a semaphore to fire. Doing it this way, the main app just categorizes the tasks and the threads actively seek work. Yeah, the threads are doing more work, but if you look at VTune or thread timings there were most likely huge gaps there anyways. The threads become more fully utilized, the main thread becomes less utilized, and the tasks happen in parallel with far fewer collisions.

Friday, August 7, 2009

Long delays, but...

I'm back! Well, mostly. My new job keeps me very busy most of the time, so free time has been... well... limited.

During compiles, though, I've been working with CPUID, atomic operations (YES! STILL!!) and re-learning X86 assembly. It's been YEARS since I've written assembly, so, needless to say, I'm a bit rusty.

While I can't (and won't, so don't ask) go into what I'm working on, I've been developing for the Xbox, PS3 and Win32 simultaneously. Hopefully I can reveal what I'm doing soon, but until then... you'll just have to wait in agony.

Well, for all one or two of you that actually read my blog, that is. :)

Wednesday, June 24, 2009

Oops

Well, a couple things have come up recently, so I haven't had much time to work on the cache algorithm or do much coding...

I've accepted a job at Edge of Reality in Austin, TX. Not quite sure what I'll be doing yet, but the crew seems awesome and the work looks very interesting. I start in early July and just found an apartment (yay!) so I'm sort of freaking out trying to get everything together to do the move.

Hopefully I'll have some time to do some coding for fun (don't laugh! it's fun for me!) in the next couple of days. :)

Friday, June 12, 2009

Current Project:: Wait-Free Cache Allocator (pt 2)

So this post is a little... no, wait, a lot more directed at programmers. If you're not a programmer... you might want to stop now 'cause the next bit gets fairly scary.

Traditional multithreaded programming uses mutual exclusion (a mutex) or critical sections to ensure that only one thread at a time is using a piece of memory. A mutex works by saying "I want to be the only one using this piece of memory" and waiting until the calling thread is the exclusive owner. When done, the thread releases the mutex, allowing other threads to do something with that memory.

Another way to do this is through something called a "spin lock." A mutex uses extremely efficient "no CPU" waits where the entire thread is put to sleep until the mutex can be obtained. Spin locks, on the other hand, burn CPU time constantly checking a variable.
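For illustration, a minimal spin lock might look like this (sketched with C++11 atomics, which didn't exist when this was written; the same idea works with platform intrinsics):

#include <atomic>

class SpinLock
{
public:
    void Lock()
    {
        // test_and_set returns the previous value, so we loop until we're
        // the thread that flipped the flag from false to true.
        while (mFlag.test_and_set(std::memory_order_acquire))
        {
            // spin: burn CPU and try again
        }
    }

    void Unlock()
    {
        mFlag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};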

Lock-free algorithms typically use atomic operations as a replacement for mutexes. Atomic operations allow a program to modify a single piece of memory with a guarantee that no other thread is halfway through reading or writing it. Under Windows, one of the functions to do this is InterlockedCompareExchange. These atomic operations can fail if some other thread is using the data or if the data used during the compare and exchange is stale, so some sort of wait (a spin lock, for example) is used to pass the time until the data can be changed.
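Here's the classic shape of that, sketched around InterlockedCompareExchange: read the value, compute the new one, and swap it in only if nobody changed it in between. The retry loop is exactly where the "stale data" wait sneaks in.

#include <windows.h>

// A lock-free increment: no mutex, just compare-and-swap in a loop.
LONG LockFreeIncrement(volatile LONG* value)
{
    for (;;)
    {
        LONG oldValue = *value;          // snapshot the current value
        LONG newValue = oldValue + 1;    // compute the desired result
        // Swap only if *value still holds our snapshot; the return
        // value is whatever *value held just before the call.
        if (InterlockedCompareExchange(value, newValue, oldValue) == oldValue)
            return newValue;             // success
        // Otherwise another thread got there first: the data is stale,
        // so loop around and try again.
    }
}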

Wait-free algorithms are lock-free algorithms, but they're written in such a way that if a wait is ever needed, they can do something else instead. Essentially, they never sit around and wait. Instead, they perform some other operation, then come back and try again later. These types of algorithms are extremely complicated to set up. You've got to guarantee that you always have something else to do just in case an atomic operation can't happen. In short, it's loosely organized chaos.

Another problem with wait-free algorithms is that they're rarely cheap, either to develop or to execute. So much time and effort is put into making the algorithm "truly" wait-free that it gets very complicated. Not to mention that it takes an incredible amount of time to write one of these correctly.

The cache algorithm I created straddles a fine line between wait-free and lock-free. It is a lock-free algorithm in that it guarantees that some progress can be made if all other threads are suspended (don't ask me, this is someone else's definition.) It is sort of a wait-free algorithm in that the steps used in the cache are broken up to be parallelized as much as possible. It is not, however, a true wait-free algorithm because it uses spin locks to set data and to wait if no work is available. The reason I did it this way is for performance, both in execution and development. The algorithm is compact and very efficient and causes very few hits on the spin locks.

In the next post, I'll go back to the box analogy and explain how atomic operations can be used in a cached memory allocation scheme.

Current Project:: Wait-Free Cache Allocator (pt 1)

About a week ago, I started looking into block-free, lock-free and wait-free algorithms. I've been having an absolute blast with them and thought I'd post what I've learned and done so far.

(This first post is the English version. I'll get into WAY more detail in later posts.)

Imagine, for a moment, that you need a series of boxes for packing. Every time you need a new box, you have to make it. Work with me here... imagine you've got the skills/facilities to do this...

Method A is to create a new box every time you need one. You stop packing, make a new box, then fill it. It takes you about 1 minute to make each box this way because you have to gather the materials, do the work, then clean up afterwards.

Method B is to create a bunch of boxes whenever you run out, fill them all up, then create a bunch more boxes. This method avoids a lot of the startup and cleanup and you can cut boxes in groups; all in all, this is way more efficient. This method costs about 0.04 minutes, or 2.4 seconds per box. Very significant improvement! The only problem with this approach is that you must have only one person packing and creating boxes (read on and you'll understand why.)

Method C is to create a bunch of boxes up front, then every time you use a box, do a little work towards making the next set of boxes. Nothing significant, mind you... put a piece of cardboard near the machine. Make one cut in the group of boxes. Maybe pick up a scrap or two off of the floor and throw it away. We also have the advantage of avoiding the downtime of making the entire group wait while one person makes boxes. Doing it this way takes 0.08 minutes, or 4.8 seconds, per box. While it's still way more efficient than creating boxes one at a time, it's less efficient than Method B. It's also much more complicated.

The single worker approach:

With only one person doing all of the work, any of these approaches works. Method B is obviously the best, but they all at least work.

The multiple worker approach:

With multiple workers, we expand our problems. The single worker approach means that all we have to worry about is the one person making and packing boxes. We work as fast as we can and that's all there is to it. With multiple workers, if anything stalls a worker, we quickly get huge dents in our productivity. Another problem is that we only have one box making machine, so only one person can use it at a time.

The problem with Method A is that it will inevitably create a backup at the machine. The more workers you have, the larger the backup at the box machine grows, until your throughput is capped by the time it takes to create a single box. Definitely not ideal.

Method B has an issue in that you waste a lot of man-hours. When the group runs out of boxes, only one person can make new ones and everyone else has to sit around waiting. This wastes a lot of time and only gets worse the more workers you have. With 10 workers and an average of 0.04 minutes to make a set of boxes, every stall costs you an additional 0.4 minutes, or 24 seconds, of combined waiting. In other words, the more workers you have, the worse your efficiency will be.

Method C looks pretty good. None of the workers ever get stalled and they all contribute a bit towards making the next set of boxes. This method also has the nice little feature that every time you run out of boxes, the next set is finished and waiting. As an added bonus, this version doesn't care if there is one worker or a thousand; it'll work just as well either way.
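Translating Method C back into allocator terms, here's a minimal single-threaded sketch (all names made up; the real allocator spreads these slices across threads with atomic operations): every allocation does one small piece of work toward building the next block, so the next set of chunks is finished right as the current set runs out.

#include <cstddef>
#include <vector>

class IncrementalAllocator
{
public:
    IncrementalAllocator() { BuildWholeBlock(mCurrent); }   // Method B, once, up front

    void* Allocate()
    {
        PrepOneChunk();                   // Method C: the "one cut" per request
        void* chunk = mCurrent.back();
        mCurrent.pop_back();
        if (mCurrent.empty())             // ran out? the next set is waiting
            mCurrent.swap(mNext);
        return chunk;
    }

private:
    static const std::size_t kChunkSize = 64;
    static const std::size_t kChunksPerBlock = 256;

    void BuildWholeBlock(std::vector<void*>& block)
    {
        for (std::size_t i = 0; i < kChunksPerBlock; ++i)
            block.push_back(new char[kChunkSize]);
    }

    void PrepOneChunk()
    {
        // One chunk is consumed per allocation and one is prepped per
        // allocation, so mNext fills up exactly as mCurrent empties.
        if (mNext.size() < kChunksPerBlock)
            mNext.push_back(new char[kChunkSize]);
    }

    // Sketch only: chunks are never freed or recycled here.
    std::vector<void*> mCurrent;
    std::vector<void*> mNext;
};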

As you can probably guess, the method I'm working on is C. More info on how, exactly, this works later on.

Thursday, June 11, 2009

Committing

Well, the time has come for me to finally accept a job offer. I've got to admit, I've got quite a bit of trepidation in choosing... mostly because I barely know the companies I'm choosing between. The only way for me to really get to know them, however, is to go work there. If it turns out to be a Bad Thing then I'll need to continue the search or grind it out until my first title.

On the other hand, I've met a lot of really great people. Most everyone I've met is really into making a great product and doing the best job they can do. That's a major bonus for me as it reflects on the industry as a whole.

When it comes right down to it, though, I have to make a decision. My criteria are fairly simple: Is the company credible? Do I get along with them and they with me? Do they have smart people I can learn from? Are there other studios around them in case things go south?

Some of my criteria are a bit odd... I am not ranking based on pay (although that obviously has something to do with it...) and I'm not limiting myself much by genre. I know that this is my first game development job, so I can't be too picky. Especially in this economy; I'm just happy to get offers.

I'll post by this weekend who I've decided on... until then, all of us will just wonder.

Tuesday, June 9, 2009

My Specialization

I've been on a lot... no, a LOT of job interviews recently and one topic comes up a lot: what is my specialty?

It took me a while to figure out what this was. Not because I don't know, but because I like to do so many things. I sort of stressed over this for a while because I didn't want to pigeonhole myself. To find one, I went back to my old projects and looked at what I focused on time after time...

Parallel Processing

Using my gremlin analogy, imagine that your computer contains all sorts of different kinds of gremlins. Some are really small and only work in groups of four. Some are medium sized and can work well with others. Others are really big but don't work well with others.

My specialty is to get all of the different types of gremlins to work together and communicate very efficiently all at the same time.

What is a computer?

I guess before I can get started in deep coding stuffs I need to create an analogy for what a computer is and how it works. Don't worry, this is for the uninitiated; you don't need to know anything about computers to understand this.

Your computer is a box. Yeah, yeah, I know... difficult stuff here. In fact, it is a "magic" box to most people. It just plows along doing what it does. Most people don't pay any attention to how it works or what it does, they just click the buttons and do what they need to do.

But how does it work?

That's actually a huge topic that very few people can truly answer... so I'll create an analogy that I'll likely reuse a lot in the future.

Imagine that your computer is actually a home for mathematically inclined gremlins. Perhaps smart chupacabras, but we'll go with gremlins because it's easier to type. These gremlins know how to add, subtract, multiply and divide and they're very good at it. Incredibly good; in fact, they can do billions of these simple math operations per second.

The gremlins have other talents too. They can work in groups or on their own and, most importantly, they can read and follow instructions.

Their drawback, however, is that they will only do what they're told. Nothing more, nothing less. They don't think for themselves and they don't understand the concept of common sense. If we tell them to lick their finger and jam it into a light socket they'll do it without question or hesitation. Even more odd, they'll keep their finger in the light socket until we tell them to take it out! They're cute, but not too smart.

Writing a computer program (programming/coding) is essentially telling the gremlins what to do. Add this, multiply it with this, divide it by this, etc. The gremlins will follow the instructions exactly as the programmer told them to.

Some instructions can cause the gremlins to get confused. If the programmer tells the gremlins to do something and put the result in a box but they fail to give the gremlins a box, bad things can happen. In fact, the gremlins will freak out because they have to put the result in a box but they don't have a box but they have to have a box but, but, but...

This is your traditional "crash." The gremlins have to do something but they can't. The net effect is that they do the only thing they can do: pull the fire alarm.

A programmer's job is to tell the gremlins what to do in such a way that the gremlins can never be confused. The job of the user (You!) is to never have to worry about how the gremlins do what they do.

Computers are STUPID!

Computers are stupid creatures. They do exactly what we tell them to do, even if it's not what we intended, and nothing more. A programmer's job is to take a generalized statement and break it down into individual actions.

Writing code isn't hard. In fact, it's really easy. It's like learning a foreign language that uses just a small subset of English. Admittedly this is much easier if you already speak English...

The hard part is wrapping your brain around every nook and cranny of what's going on. You've got to have both a 10,000 foot view and a 1 inch view of what's going on in order to write good code. What's the overall purpose of the piece you're writing? What's the purpose of this one single line? What's the purpose of this one single character on this line?

A pondering: how to do this...

So I want to post a bunch of stuff about coding (programming), but the majority of people out there, let's face it, don't know anything about it. It's "magic" to them and something they consider way above their heads. In that respect, when I post, I'll try to:

1) Write in plain English. As Albert Einstein said, "If you can't explain it to a 6 year old, you don't understand it yourself."
2) Give the in-depth code analysis for those that want it.
3) Ask Questions! I love to learn and I like to think I take criticism well so I'll be asking for feedback. This, of course, doesn't mean that you have to comment... I'm just hoping you will. :)

Wow, I'm blogging...

Yep, I never thought this day would come... when I actually start my own blog.

So why am I doing it? No, it's not because I think I have the ultimate view on anything. I just want to share things I've learned and (hopefully) get feedback. :)

We'll see how it goes. It'll probably mostly revolve around coding with the occasional random thought thrown in.