Wednesday, February 15, 2012

Dangling Pointer Detection

"tl;dr and I know what I'm doing": Create a slab allocator that hands out page size blocks of memory to a requester using VirtualAlloc and VirtualProtect to control page access. Don't allow pages to be reused within some amount of time.

So you've got an application. Great. But now it crashes because of data corruption. Crap.

Most of the time, this type of overwrite occurs because of a dangling pointer. Dangling pointers are pointers to memory where the pointer thinks that it is pointing to, for example, a vertex buffer, but the block of memory it points to was freed, reallocated to someone else, and is now actually an Actor object.

Dangling pointers are historically one of the most difficult types of bugs to track down. This is because the problem occurred in the past; the memory was freed and reallocated to someone else, then someone, somewhere in the application, wrote to the data and corrupted it. When the actual owner went to read the data, it read badness and thus badness occurred.

So there are a few options available for debugging. First, if you are intimately familiar with the data and the application, you may be able to figure out who's doing it through sheer brute force. This isn't usually feasible and is definitely not the most efficient use of your time. Second, you can use data breakpoints if the object is always the same pointer and is always corrupted the same way. But, this only works if the corruption is repeatable; if it moves around (which it usually does) then this may work but it'll take forever to finally catch it. Third, you immediately halt the application when and where the corruption occurs. This method doesn't rely on how repeatable the crash is, how familiar you are with the code, or anything else; when a corrupting write occurs, the application halts on the line that caused the corruption. The debugger actually screams at you "Dis Bad Pionter! No do dis!"

Awesome, so option 3 is easily the best, but how can it be done?

If you've got your own heap (you do have your own heap, right?) then you can substitute a debugging heap that the allocations are forwarded to.

Your platform will usually allow you to control memory access on a per page basis. (This works on the Xbox 360, the Playstation3, and in Windows, but I can't say for other platforms.) I'll use Windows as an example here; you can do the rest of the platform work on your own. :)

And the obligatory word of warning here... This is Not an optimal solution. It uses a ton of memory and is fairly slow. This solution also only supports allocations up to the page size of the platform. I use a much more elegant method myself, but it's too complicated for a first pass explanation. For hints at my changes, see the "Advanced" notations. They're all an exercise for the reader, however.

In Windows, there are a few functions you'll need: VirtualAlloc, VirtualFree, and VirtualProtect. VirtualAlloc allows you to reserve ("I want this address range") and commit ("back this address range up with physical memory") large chunks of memory. VirtualFree releases the memory. VirtualProtect allows us to change the protection state of the memory.

You'll also need to know your system's page size. Contrary to normal operation, you actually want to go with the smallest page size available for the platform. Under Windows, this is typically 4k, but check for yourself using the GetSystemInfo function. The heap will only operate on full pages of memory, regardless what size the allocations are.

So now you need to build the special heap. There are a couple of variations you can make to detect bounds overwrites and underwrites, but I'll go into that a bit later; let's get it working first.

When the heap is created, reserve and commit a large block of memory (1-2 GB) via VirtualAlloc, passing PAGE_NOACCESS to prevent any reads or writes. (Advanced: minimize memory use by reserving the address space but not committing it until it's needed.)

Next, you'll need to add a bitfield array that specifies whether the page is in use or not. You only need to track full pages, though, as every allocation is going to get its own page of memory. I use bitfields in a 32 bit integer for this; getting a free bit is fast and searching can be done fairly quickly. Just create an array of your maximum number of pages divided by 32 and memset the whole thing to 0.

You'll also need to create another list of integers that will be used for timing, one per page. Initially memset these to zero as well. You'll need some sort of define/constant that defines the "hang time" of a page. When a page is freed, we don't want to immediately reallocate it to someone else, but instead stay unusable and inaccessible for some amount of time. A second is usually plenty of time, but you'll need to tune it.

When an allocation is requested, your heap finds a free page (Advanced: or pages, depending on the size of the allocation) in the list. When one is found, check the timing list to see if can be used (e.g., GetTickCount() > timing_list[page].) Once a usable page is found, mark it as used, call VirtualProtect to change the access rights of the page to PAGE_READWRITE and return the address. (Advanced: I've ordered the operations here in such a way as to make this heap lock-free.)

When an allocation is freed, use VirtualProtect to change the access rights of those page(s) to PAGE_NOACCESS, set the timer value to GetTickCount() + my_page_hang_time, and clear the allocation markers in the bitfield array . This page will now be protected from reads and writes for as long as your timer says. Any accesses to this block of memory will cause the platform to halt on the offending line.

That's pretty much it. To substitute this in really depends on how you have your memory manager set up but typically it's just forwarding an allocation to the heap.

Oh, and as an added bonus, if you create safety pages (PAGE_NOACCESS) prior to and after the page containing the memory, you can trap underwrites and overwrites. Just move the memory to the end of the page for overwrites and the start of the page for underwrites (but watch your alignment!)

"Dude! Where's the kodez?!"


I'm not posting any. The simple reason is that my implementation is vastly different than this and would serve only to confuse. The implementation is simple and straightforward so if you're advanced enough to replace the allocator, you're advanced enough to write the code I described. :)

Wednesday, February 8, 2012

The difference between knowing your algorithms and knowing your data.

I've been working on a telemetry tool for analyzing applications. Specifically, this is for game engines, but realistically it could be applied to any software. One of the plugins I'm working on analyzes memory usage patterns in real time. It does so by getting basic heap and memory allocation information from the client and rebuilding the data on the server (the telemetry tool.)

To store the pointer information, I initially used a non-recursive AVL tree as a test. AVL trees are O(log(n)) because they're always balanced, so inserting, removing, and finding the data will be fast, right?

After running the application with a heavy client (the game) memory allocation count, I found that the AVL trees performance was pretty bad. In fact, it was unusable. As the count of allocations increased, the time taken to rebalance the tree grew and grew. In short, the telemetry tool simply couldn't keep up with the game.

So I replaced the AVL tree with a simple hash map. It's a non-optimal implementation; it has a statically set block count, it's not lock-free, and is a simple first pass quicky algorithm that I implemented for something else. It's definitely not built for speed or efficiency.

But it was faster. Not only was it faster, it literally blew away the AVL implementation. My CPU usage went from 100% to 27% and the telemetry tool was back to waiting for data to arrive, not trying to process it. I could visually see a massive difference by just watching the application, not even doing timing tests on it.

The reason that the hash map, even in a non-optimal implementation, is faster is the way that memory allocations typically occur. A standard memory allocator creates a large block of memory and then hands out pieces of it. This typically happens linearly, handing out allocations from the start to the end of the block. AVL trees perform best when having random key values inserted into them, not organized and linear values. When the AVL tree inserts the data based on the addresses, it rebalances the tree very often, shuffling entire branches of the tree around.

The hash map, on the other hand, doesn't care about allocations being in order, out or order, or randomly chosen. It masks out portions of the key, jumps directly to the correct insertion list, and pushes the data into the list. Searches and removals are then slow linear searches through the keyed lists.

So the moral of the story is, don't assume that because your algorithm is efficient that your use of it is also efficient. Know your data; Applying the right algorithm depends on what your data is doing, not how awesome your O notation is.