Friday, September 18, 2009

Thread pools suck

So I was at the Guildhall mixer last night at the Austin GDC and I met Colt McAnlis, a Guildhall professor and programmer. We had a very short but very intense discussion on threading where he advised me that, in short, thread pools suck...

Which got me to thinking, why?

For those who have no idea what a thread pool is, here's the short of it... Most computers now have multiple cores, each one of which can do their own thing independently of the rest of the system. A thread pool creates a number of threads (independent tasks) usually in line with the number of cores that are available and then farms out tasks to an available pool as they become available. Of course, I'm leaving out a lot of details like hyperthreading, pseudo-threads that spend most of their time waiting, etc.

Ok, the non-geek version... You live in a house with 4 people and you've just had a party. You've accumulated a ton of dishes and you need to wash them all. "Single threaded" is one person washes, rinses, dries, and puts away all the dishes while the other three stand around and watch. "Multi-threaded" is one person washes, one rinses, one dries, and one puts them all away all at the same time. Pretty much a major improvement as the washer can keep washing while the other dishes magically make it into the cupboard.

But there's a huge gaping hole in this problem. What if the person drying takes too long and everyone else ends up waiting for him? The rinser can't rinse any more dishes because the counter is full of wet ones, the washer can't wash any more because the rinse spot is full, and the person putting them away has gone off to watch television.

What if we reassign priorities between the people. Sometimes the person washing helps to dry dishes. This works out a bit better, right? Well, what if there's only one drying towel or, even worse, the person washing is doing so because he absolutely sucks at drying them? Our improvement may give us a slight advantage, but it's not all-that-and-a-bag-of-chips... so how do we make it better?

Oh, there is so much of a better way...

Back to the geekified version because the explanation relies a lot on hardware.

A bit of background first.

Hard drives: Hard drives essentially are s.l.o.w. DVD drives are WAY worse, but trying to read data from either of these is time consuming and essentially wastes useful time while you sit there waiting for the data to come in.

System resources and shared hardware: Some system resources do not react well to being told to suddenly change tasks. The math pipeline is one that comes to mind. Lets say you queue up a bunch of math operations on a processor and it's over there grinding away on them. In the middle of the work, you tell it to stop and do this other thing instead. It has to flush every single job is has queued, cache in the new work and fill the pipeline all over again. Blech, this sucks.

So how do we avoid these scenarios?

File IO: Ideally, we'll want to serialize all of the file I/O operations per device. If we allow two threads to try to read/write to the same device at the same time (even if they're on different cores!) it will cause disk thrash. This will happen if you have multiple threads asking for data in different areas. The disk will spin to one area, read a bit, then spin to a different area, read some more, spin back to the original area and read a bit more, etc. You know this is happening when you hear a grinding sound or feel your computer vibrating when you touch it. This is ugly bad. If we restrict all of the I/O happening on a specific device to a particular thread, it will essentially serialize it all for us, optimizing the way it reads the data (assuming you're not asking for data in different areas, which I can't help you with.)

System resources and shared hardware: Using the math coprocessor as an example again, the best possible scenario here is to share the workload around to as many independent hardware pieces as possible that can do the math. For each one of the tasks, let it start, work and complete end to end so the pipeline doesn't get flushed.

But Wait, There's More!

A lot of these tasks will interfere with each other, as with the file I/O. Some of them will coexist perfectly with other types, such as mixing file I/O and math operations since they use different hardware. But here's the kicker... for some tasks, we can start it in the background and go off and do something else.

Let's mix the use of file reads and math. Traditionally, we read a file waiting for all of the data to show up then return to the caller and proceed to the next task. That is a HUGE waste of time. If we instead use asynchronous file operations, we can tell the system to queue the read then go off and do something else while we wait.

This, of course, takes some intelligence by the thread scheduler. I'm already assuming that we tag each threaded operation as the type that it's going to do (file I/O, math, etc.) When we schedule a file I/O operation, we know that it will go on a particular thread because we've chosen to serialize all of those operations. If we use asynchronous operations, we can start the task and look through our queue to find another operation that we can do while we wait for the I/O to finish.

But Wait, There's More!

Yeah yeah, I know I'm pretty much rambling at this point, but... cest la vie!

What if we give each of the threads a tiny bit of intelligence so it can schedule tasks itself grabbing them from the main pool. For example, thread 1 knows that it handles all of the file I/O for a particular device, so it pulls those tasks first. It queues the operation and immediately puts it on the back burner, requesting another task that it knows will coexist happily with the first. When that second operation finishes, it looks at the I/O task to see if it's finished with the disk operation. If it is, it finalizes the task and moves on. If it's not, it goes out and grabs another task to do.

Essentially, it puts all the scheduling logic in the threads, not the main app. The main app then just throws all of the tasks into a series of queues which can be done in linear time.

This puts traditional thread pooling on its head. Normally we have the main app schedule tasks to threads and the threads sit and wait for a semaphore to fire. Doing it this way, the main app just categorizes the tasks and the threads actively seek work. Yeah, the threads are doing more work, but if you look at Vtune or thread timings there were most likely huge gaps there anyways. The threads become more fully utilized, the main thread becomes less utilized and the tasks happen in parallel with much reduced collisions.