Wednesday, February 15, 2012

Dangling Pointer Detection

"tl;dr and I know what I'm doing": Create a slab allocator that hands out page-sized blocks of memory to a requester, using VirtualAlloc and VirtualProtect to control page access. Don't allow pages to be reused within some amount of time.

So you've got an application. Great. But now it crashes because of data corruption. Crap.

Most of the time, this type of overwrite occurs because of a dangling pointer. A dangling pointer thinks it still points to, for example, a vertex buffer, but the block of memory it points to was freed, reallocated to someone else, and is now actually an Actor object.

Dangling pointers are historically one of the most difficult types of bugs to track down. This is because the problem occurred in the past; the memory was freed and reallocated to someone else, then someone, somewhere in the application, wrote to the data and corrupted it. When the actual owner went to read the data, it read badness and thus badness occurred.

So there are a few options available for debugging. First, if you are intimately familiar with the data and the application, you may be able to figure out who's doing it through sheer brute force. This isn't usually feasible and is definitely not the most efficient use of your time. Second, you can use data breakpoints if the object is always at the same address and is always corrupted the same way. But this only works if the corruption is repeatable; if it moves around (which it usually does), this may work but it'll take forever to finally catch it. Third, you can immediately halt the application when and where the corruption occurs. This method doesn't rely on how repeatable the crash is, how familiar you are with the code, or anything else; when a corrupting write occurs, the application halts on the line that caused the corruption. The debugger actually screams at you "Dis Bad Pionter! No do dis!"

Awesome, so option 3 is easily the best, but how can it be done?

If you've got your own heap (you do have your own heap, right?) then you can substitute a debugging heap that the allocations are forwarded to.

Your platform will usually allow you to control memory access on a per page basis. (This works on the Xbox 360, the Playstation3, and in Windows, but I can't say for other platforms.) I'll use Windows as an example here; you can do the rest of the platform work on your own. :)

And the obligatory word of warning here... This is Not an optimal solution. It uses a ton of memory and is fairly slow. This solution also only supports allocations up to the page size of the platform. I use a much more elegant method myself, but it's too complicated for a first pass explanation. For hints at my changes, see the "Advanced" notations. They're all an exercise for the reader, however.

In Windows, there are a few functions you'll need: VirtualAlloc, VirtualFree, and VirtualProtect. VirtualAlloc allows you to reserve ("I want this address range") and commit ("back this address range up with physical memory") large chunks of memory. VirtualFree releases the memory. VirtualProtect allows us to change the protection state of the memory.

You'll also need to know your system's page size. Contrary to normal operation, you actually want the smallest page size available for the platform. Under Windows, this is typically 4k, but check for yourself using the GetSystemInfo function. The heap will only operate on full pages of memory, regardless of what size the allocations are.

So now you need to build the special heap. There are a couple of variations you can make to detect bounds overwrites and underwrites, but I'll go into that a bit later; let's get it working first.

When the heap is created, reserve and commit a large block of memory (1-2 GB) via VirtualAlloc, passing PAGE_NOACCESS to prevent any reads or writes. (Advanced: minimize memory use by reserving the address space but not committing it until it's needed.)

Next, you'll need to add a bitfield array that specifies whether each page is in use or not. You only need to track full pages, though, as every allocation is going to get its own page of memory. I use bitfields in a 32 bit integer for this; getting a free bit is fast and searching can be done fairly quickly. Just create an array sized to your maximum number of pages divided by 32 and memset the whole thing to 0.

You'll also need to create another list of integers that will be used for timing, one per page. Initially memset these to zero as well. You'll need some sort of define/constant for the "hang time" of a page. When a page is freed, we don't want to immediately reallocate it to someone else; instead it stays unusable and inaccessible for some amount of time. A second is usually plenty of time, but you'll need to tune it.

When an allocation is requested, your heap finds a free page (Advanced: or pages, depending on the size of the allocation) in the list. When one is found, check the timing list to see if it can be used (e.g., GetTickCount() > timing_list[page]). Once a usable page is found, mark it as used, call VirtualProtect to change the access rights of the page to PAGE_READWRITE, and return the address. (Advanced: I've ordered the operations here in such a way as to make this heap lock-free.)

When an allocation is freed, use VirtualProtect to change the access rights of the page(s) to PAGE_NOACCESS, set the timer value to GetTickCount() + my_page_hang_time, and clear the allocation markers in the bitfield array. The page will now be protected from reads and writes for as long as your timer says. Any access to this block of memory will cause the platform to halt on the offending line.

That's pretty much it. How you substitute this in depends on how your memory manager is set up, but typically it's just a matter of forwarding allocations to the heap.

Oh, and as an added bonus, if you create safety pages (PAGE_NOACCESS) prior to and after the page containing the memory, you can trap underwrites and overwrites. Just move the memory to the end of the page for overwrites and the start of the page for underwrites (but watch your alignment!)

"Dude! Where's the kodez?!"


I'm not posting any. The simple reason is that my implementation is vastly different from this and would serve only to confuse. The implementation is simple and straightforward, so if you're advanced enough to replace the allocator, you're advanced enough to write the code I described. :)

Wednesday, February 8, 2012

The difference between knowing your algorithms and knowing your data.

I've been working on a telemetry tool for analyzing applications. Specifically, this is for game engines, but realistically it could be applied to any software. One of the plugins I'm working on analyzes memory usage patterns in real time. It does so by getting basic heap and memory allocation information from the client and rebuilding the data on the server (the telemetry tool.)

To store the pointer information, I initially used a non-recursive AVL tree as a test. AVL tree operations are O(log n) because the tree is always balanced, so inserting, removing, and finding the data will be fast, right?

After running the application against a client (the game) with a heavy memory allocation count, I found that the AVL tree's performance was pretty bad. In fact, it was unusable. As the count of allocations increased, the time taken to rebalance the tree grew and grew. In short, the telemetry tool simply couldn't keep up with the game.

So I replaced the AVL tree with a simple hash map. It's a non-optimal implementation; it has a statically set bucket count, it's not lock-free, and it's a simple first-pass algorithm that I implemented for something else. It's definitely not built for speed or efficiency.

But it was faster. Not only was it faster, it literally blew away the AVL implementation. My CPU usage went from 100% to 27% and the telemetry tool was back to waiting for data to arrive, not trying to process it. I could visually see a massive difference by just watching the application, not even doing timing tests on it.

The reason that the hash map, even in a non-optimal implementation, is faster lies in the way memory allocations typically occur. A standard memory allocator creates a large block of memory and then hands out pieces of it. This typically happens linearly, handing out allocations from the start to the end of the block. AVL trees perform best when random key values are inserted into them, not organized, linear ones. When the AVL tree inserts the data keyed by address, it rebalances very often, shuffling entire branches of the tree around.

The hash map, on the other hand, doesn't care about allocations being in order, out of order, or randomly chosen. It masks out portions of the key, jumps directly to the correct insertion list, and pushes the data into the list. Searches and removals are then slow linear searches through the keyed lists.

So the moral of the story is: don't assume that because your algorithm is efficient, your use of it is also efficient. Know your data; applying the right algorithm depends on what your data is doing, not on how awesome your O notation is.

Monday, October 25, 2010

Creating a Super-Fast Math Library, Part 4: Vectors with SSE

Here's my Windows specific vector implementation. I had to do some formatting changes to get it to show up decently, hopefully I didn't mess anything up in the process.

Almost everything is just a straight wrapper to the SSE intrinsics. The shuffle and shuffle mask had to be defines because the intrinsics require immediate (hard coded) values.

There are a couple of things about this that are not so good, but I'll leave it to you to improve them (I'm already doing it myself, but I can't just give you everything now, can I?) Most notably, the length and dot products break out of the vector registers, returning only a float.

#include <xmmintrin.h> // SSE intrinsics
#include <cmath>       // sqrt

namespace math
{
typedef __m128 vec4f_t;

inline vec4f_t vec4f_zero()
{
return _mm_setzero_ps();
}

inline bool vec4f_cmpeq(const vec4f_t &a,
const vec4f_t &b)
{
return (0xf==_mm_movemask_ps(_mm_cmpeq_ps(a,b)));
}

inline bool vec4f_le(const vec4f_t &a,
const vec4f_t &b)
{
__m128 value = _mm_cmple_ps(a, b);
return (0xf == _mm_movemask_ps(value));
}

inline vec4f_t vec4f_splat(const float v)
{
return _mm_set1_ps(v);
}

inline vec4f_t vec4f_splat(const vec4f_t &v,
const uint32_t i)
{
switch (i)
{
case 0: return _mm_shuffle_ps(v, v, 0x00);
case 1: return _mm_shuffle_ps(v, v, 0x55);
case 2: return _mm_shuffle_ps(v, v, 0xaa);
case 3: return _mm_shuffle_ps(v, v, 0xff);
default: REQUIRE(0, "Invalid index");
}
return _mm_setzero_ps();
}

inline vec4f_t vec4f_load(const f32_t a,
const f32_t b,
const f32_t c,
const f32_t d)
{
return _mm_setr_ps(a, b, c, d);
}

inline vec4f_t vec4f_load(const f32_t * const p)
{
REQUIRE(((uintptr_t)p & 0xf) == 0, "ALIGNMENT");
return _mm_load_ps(p);
}

inline void vec4f_store(const vec4f_t &a,
f32_t * const p)
{
REQUIRE(((uintptr_t)p & 0xf) == 0, "ALIGNMENT");
_mm_store_ps(p, a);
}

inline vec4f_t vec4f_movelh(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_movelh_ps(a, b);
}

inline vec4f_t vec4f_movehl(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_movehl_ps(b, a);
}

inline vec4f_t vec4f_add(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_add_ps(a, b);
}

inline vec4f_t vec4f_sub(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_sub_ps(a, b);
}

inline vec4f_t vec4f_mul(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_mul_ps(a, b);
}

inline vec4f_t vec4f_div(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_div_ps(a, b);
}

inline vec4f_t vec4f_sqrt(const vec4f_t &a)
{
return _mm_sqrt_ps(a);
}

inline vec4f_t vec4f_rcp(const vec4f_t &a)
{
return _mm_rcp_ps(a);
}

inline vec4f_t vec4f_rsqrt(const vec4f_t &a)
{
return _mm_rsqrt_ps(a);
}

inline vec4f_t vec4f_min(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_min_ps(a, b);
}

inline vec4f_t vec4f_max(const vec4f_t &a,
const vec4f_t &b)
{
return _mm_max_ps(a, b);
}

inline f32_t vec4f_dot(const vec4f_t &a,
const vec4f_t &b)
{
__m128 mult = _mm_mul_ps(a, b);
__m128 shf1 = _mm_shuffle_ps(mult, mult, 0x4e);
__m128 add1 = _mm_add_ps(shf1, mult);
__m128 shf2 = _mm_shuffle_ps(add1, add1, 0x1b);
__m128 add2 = _mm_add_ps(add1, shf2);
f32_t result;
_mm_store_ss(&result, add2);
return result;
}

inline f32_t vec4f_length_sq(const vec4f_t &v)
{
return vec4f_dot(v, v);
}

inline f32_t vec4f_length(const vec4f_t &v)
{
return sqrt(vec4f_length_sq(v));
}

inline vec4f_t vec4f_normalize(const vec4f_t &v)
{
return vec4f_div(v, vec4f_splat(vec4f_length(v)));
}

#define vec4f_shuffle_mask(a, b, c, d)\
_MM_SHUFFLE(d, c, b, a)

#define vec4f_shuffle(a, b, mask)\
_mm_shuffle_ps(a, b, mask)
} // namespace math

Sunday, October 24, 2010

Creating a Super-Fast Math Library, Part 3: Vectors

We'll be using vectors for the basis of most of the math classes, so this is the best place to start.

Before we get into it, though, there are a few different definitions of what a "vector" is, so I'd like to start by clarifying. Indeed, simply doing a Google or Wikipedia search for "vector" will likely turn up a dozen different meanings. But this isn't meant to be an introduction to math or anything, so I'll keep it very brief.

We're using the mathematical/game development version: "a collection of four numbers." As it turns out (you'll see later), a vector, as we use it, is actually any 128 bit set of stuff; we just happen to use it as a set of four floats. To add to the possibilities, there's also a 256 bit version coming out. For our purposes, though, we're only using it as a set of four floats.

So our vector object will be comprised of four floats. You can name them anything you want, be it { w, x, y, z }, { x, y, z, w }, { three, peas, inna, pod } I don't care, just as long as we know them as "those four floats that comprise a vector." Scientific, eh?

It's useful to implement a regular scalar vector as well as using any intrinsics provided by the hardware we're running on. The most important reason is for unit testing. If we create the regular scalar vector using an array of floats and can verify that these operations are correct in functionality, we can generate tests for the intrinsic versions very easily. Another reason is to have a fallback, so all of the code doesn't suddenly stop working if you start to develop for a platform that doesn't have intrinsics. Honestly though, the likelihood of that is slim to none.

It just so happens that most hardware vendors somewhat agree on what a vector is, how it operates, and have even given them a type. For the PC, we're using Intel's version (if that link doesn't work for you, let me know!) which is an __m128. The Playstation3 uses the CELL version, which is sometimes called AltiVec or VMX (even though the hardware implementation is entirely different.)

So the most difficult part is to come up with an interface for the vector. We want it to be simple, support common functionality, be very flexible, and hopefully very fast. It also has to support all of the platforms we'd like to use or we have to simulate the functionality.

Before we define the interface, there are a few gotchas with hardware vectors. First, both Intel and the CELL really like 16 byte alignment. It is possible to use unaligned data with vectors, but I avoid them like the plague and we're going to ignore them here. Second, their types are different. To get around this, I don't publicize what a 'vector' type is, I just define an opaque type (vec4f_t) and give the caller methods to access the data. Third, hardware vectors really don't like to be split or accessed individually once they're loaded. My method for enforcing this is that the caller loses easy access to the individual elements. You can still get access to the data if you really need it, but you'll have to create storage and call a store function to get the data.

So here's the interface that I currently use:
namespace math
{
// assume that the vec4f_t type has been defined and
// that it is an opaque type. access is only granted
// via these functions:

// create and initialize a vector as zero
vec4f_t vec4f_zero();

bool vec4f_cmpeq(const vec4f_t&, const vec4f_t&);
bool vec4f_le(const vec4f_t&, const vec4f_t&);

// create and initialize a vector with the value in
// all four elements
vec4f_t vec4f_splat(const float);

// create and initialize a vector with an index of
// another vector in all four elements
vec4f_t vec4f_splat(const vec4f_t&, const uint32_t);

// create and load a vector where components are
// stored as:
// v.a = a
// v.b = b
// v.c = c
// v.d = d
vec4f_t vec4f_load(const f32_t a,
const f32_t b,
const f32_t c,
const f32_t d);

// create and load a vector where components are
// stored as:
// v.a = p[0]
// v.b = p[1]
// v.c = p[2]
// v.d = p[3]
// the array must be 16 byte aligned
vec4f_t vec4f_load(const f32_t * const);

// retrieve the components as:
// p[0] = v.a
// p[1] = v.b
// p[2] = v.c
// p[3] = v.d
// the array must be 16 byte aligned
void vec4f_store(const vec4f_t&, f32_t * const);

// creates a vector from the first two components of
// the first vector and the first two components of
// the second vector.
// v.a = a.a
// v.b = a.b
// v.c = b.a
// v.d = b.b
vec4f_t vec4f_movelh(const vec4f_t&, const vec4f_t&);

// creates a vector from the second two components of
// the first vector and the second two components of
// the second vector.
// v.a = a.c
// v.b = a.d
// v.c = b.c
// v.d = b.d
vec4f_t vec4f_movehl(const vec4f_t&, const vec4f_t&);

vec4f_t vec4f_add(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_sub(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_mul(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_div(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_sqrt(const vec4f_t&);
vec4f_t vec4f_rcp(const vec4f_t&);
vec4f_t vec4f_rsqrt(const vec4f_t&);
vec4f_t vec4f_min(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_max(const vec4f_t&, const vec4f_t&);

f32_t vec4f_dot(const vec4f_t&, const vec4f_t&);

f32_t vec4f_length_sq(const vec4f_t&);
f32_t vec4f_length(const vec4f_t&);

vec4f_t vec4f_normalize(const vec4f_t&);

// you can assume that these are also available,
// although the exact implementation is left up to
// the platform.
//
// shuffling order is determined by passing a series
// of indices to the shuffle mask function. the
// resulting shuffle will be ordered as:
// { ax, ay, bz, bw }
//
// example 1:
// - shuffle mask of: 2, 3, 0, 1
// - vector A set to: 0, 1, 2, 3
// - vector B set to: 4, 5, 6, 7
// - final vector as: 2, 3, 4, 5
//
// example 2:
// - shuffle mask of: 3, 2, 1, 0
// - vector A set to: 0, 1, 2, 3
// - vector B set to: 4, 5, 6, 7
// - final vector as: 3, 2, 5, 4
//uint32_t vec4f_shuffle_mask(const uint32_t,
// const uint32_t,
// const uint32_t,
// const uint32_t);
//vec4f_t vec4f_shuffle(const vec4f_t&,
// const vec4f_t&,
// const uint32_t);
} // namespace math

Those of you already familiar with vector and SIMD operations probably notice that this is missing some stuff, like projection. This isn't the full blown interface, just what is needed to get to the fun stuff. ;)

For a baseline, we'll create a simple scalar version that is used for unit testing as well as an illusion of portability. This version does it the old fashioned way, with a structure of floats.

So here it is (I haven't strongly verified that everything is present here... if you see something I missed, let me know and I'll fix it!)
namespace math
{
PRE_ALIGN(16) typedef union _vec4f_t
{
float p[4];
struct
{
f32_t x;
f32_t y;
f32_t z;
f32_t w;
} v;
} vec4f_t POST_ALIGN(16);

inline vec4f_t vec4f_zero()
{
vec4f_t v = { 0, 0, 0, 0 };
return v;
}

inline bool vec4f_cmpeq(const vec4f_t &a,
const vec4f_t &b)
{
return (a.v.x == b.v.x &&
a.v.y == b.v.y &&
a.v.z == b.v.z &&
a.v.w == b.v.w);
}

inline bool vec4f_le(const vec4f_t &a,
const vec4f_t &b)
{
return (a.v.x <= b.v.x &&
a.v.y <= b.v.y &&
a.v.z <= b.v.z &&
a.v.w <= b.v.w);
}

inline vec4f_t vec4f_splat(const float v)
{
vec4f_t vec = { v, v, v, v };
return vec;
}

inline vec4f_t vec4f_splat(const vec4f_t &v,
const uint32_t i)
{
vec4f_t vec = { v.p[i], v.p[i], v.p[i], v.p[i] };
return vec;
}

inline vec4f_t vec4f_load(const f32_t a,
const f32_t b,
const f32_t c,
const f32_t d)
{
vec4f_t vec = { a, b, c, d };
return vec;
}

inline vec4f_t vec4f_load(const f32_t * const p)
{
REQUIRE(((uintptr_t)p & 0xf) == 0, "ALIGNMENT");
vec4f_t vec = { p[0], p[1], p[2], p[3] };
return vec;
}

inline void vec4f_store(const vec4f_t &a,
f32_t * const p)
{
REQUIRE(((uintptr_t)p & 0xf) == 0, "ALIGNMENT");
p[0] = a.v.x;
p[1] = a.v.y;
p[2] = a.v.z;
p[3] = a.v.w;
}

inline vec4f_t vec4f_movelh(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x,
a.v.y,
b.v.x,
b.v.y
};
return vec;
}

inline vec4f_t vec4f_movehl(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.z,
a.v.w,
b.v.z,
b.v.w
};
return vec;
}

inline vec4f_t vec4f_add(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x + b.v.x,
a.v.y + b.v.y,
a.v.z + b.v.z,
a.v.w + b.v.w
};
return vec;
}

inline vec4f_t vec4f_sub(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x - b.v.x,
a.v.y - b.v.y,
a.v.z - b.v.z,
a.v.w - b.v.w
};
return vec;
}

inline vec4f_t vec4f_mul(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x * b.v.x,
a.v.y * b.v.y,
a.v.z * b.v.z,
a.v.w * b.v.w
};
return vec;
}

inline vec4f_t vec4f_div(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x / b.v.x,
a.v.y / b.v.y,
a.v.z / b.v.z,
a.v.w / b.v.w
};
return vec;
}

inline vec4f_t vec4f_sqrt(const vec4f_t &a)
{
vec4f_t vec =
{
sqrt(a.v.x),
sqrt(a.v.y),
sqrt(a.v.z),
sqrt(a.v.w)
};
return vec;
}

inline vec4f_t vec4f_rcp(const vec4f_t &a)
{
vec4f_t vec =
{
1.0f / a.v.x,
1.0f / a.v.y,
1.0f / a.v.z,
1.0f / a.v.w
};
return vec;
}

inline vec4f_t vec4f_rsqrt(const vec4f_t &a)
{
vec4f_t vec =
{
1.0f / sqrt(a.v.x),
1.0f / sqrt(a.v.y),
1.0f / sqrt(a.v.z),
1.0f / sqrt(a.v.w)
};
return vec;
}

inline vec4f_t vec4f_min(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
min(a.v.x, b.v.x),
min(a.v.y, b.v.y),
min(a.v.z, b.v.z),
min(a.v.w, b.v.w)
};
return vec;
}

inline vec4f_t vec4f_max(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
max(a.v.x, b.v.x),
max(a.v.y, b.v.y),
max(a.v.z, b.v.z),
max(a.v.w, b.v.w)
};
return vec;
}

inline f32_t vec4f_dot(const vec4f_t &a,
const vec4f_t &b)
{
return (a.v.x * b.v.x) +
(a.v.y * b.v.y) +
(a.v.z * b.v.z) +
(a.v.w * b.v.w);
}

inline f32_t vec4f_length_sq(const vec4f_t &v)
{
return vec4f_dot(v, v);
}
inline f32_t vec4f_length(const vec4f_t &v)
{
return sqrt(vec4f_length_sq(v));
}

inline vec4f_t vec4f_normalize(const vec4f_t &v)
{
return vec4f_div(v, vec4f_splat(vec4f_length(v)));
}

inline uint32_t vec4f_shuffle_mask(const uint32_t a,
const uint32_t b,
const uint32_t c,
const uint32_t d)
{
return ((a&0x3)<<0) |
((b&0x3)<<2) |
((c&0x3)<<4) |
((d&0x3)<<6);
}

inline vec4f_t vec4f_shuffle(const vec4f_t &a,
const vec4f_t &b,
const uint32_t mask)
{
return vec4f_load(a.p[(mask>>0)&0x3],
a.p[(mask>>2)&0x3],
b.p[(mask>>4)&0x3],
b.p[(mask>>6)&0x3]);
}
} // namespace math

To include the code above, I put it in an inline (.inl) file and switch on platform defines in the interface header, which then includes the right .inl file directly. The caller just includes "vec4f.h" and the right implementation is built with the code.

Oh, and if you're curious (or a lawyer...), this code is free to use by anyone for anything and I'm not responsible for any misuse or damage it may cause to any person, thing, or being, whether they be alien, native to this planet, or some other classification we haven't thought of yet.

Creating a Super-Fast Math Library, Part 2: Basics

Before we get to digging into the fun information, I thought it'd be helpful if I described what types I use and a little about my coding style.

A good portion of what I write is in C, but I use C++ namespaces, templates, and classes where appropriate. I always prefer clarity over brevity. Saving space by putting a lot of code on one line makes it unreadable by everyone but the author (actually, it's unreadable to them as well, they just don't admit it.) Also, there is no such thing as "self documenting code." If I have to mentally trace through the code to understand what it's doing, it needs documentation.

In my engine I have various typedefs set up for standard types. This is one thing that allows me to easily switch platforms and not have to worry about, for instance, the size of an int versus the size of a long.

It's a pretty standard practice, but here's the basic types that I use:

typedef unsigned char      uint8_t;
typedef unsigned short     uint16_t;
typedef unsigned int       uint32_t;
typedef unsigned long long uint64_t;

typedef signed char        sint8_t;
typedef signed short       sint16_t;
typedef signed int         sint32_t;
typedef signed long long   sint64_t;

typedef float  f32_t;
typedef double f64_t;


There are actually more than these and they are different for different platforms, but this gives you an idea of what they are when I use them later.

After that, I have a couple of macros that align data. Because the Playstation3 uses alignment attributes that follow the data declaration, we have to either double up with pre and post macros or declare a special DECLARE_MY_ALIGNED_VARIABLE(int thing) which I'm not overly fond of. Who knows, maybe one day I'll change my mind.

# define PRE_ALIGN(x) __declspec(align(x))
# define POST_ALIGN(x)

Another thing that you may (or may not...) see is branch prediction hints. MSVC++ optimizes code under the assumption that it will always enter the first case of a conditional. For example:


if (a) // this is predicted as always being true
{
do_a_stuff(); // which (hopefully...) loads this code in the cache
}
else
{
do_not_a_stuff(); // but doesn't load this code.
}


When a branch prediction is wrong, the CPU has to fetch the correct code and load it into the cache, which takes time. Data and code cache misses kill a program's speed like an army of ants descending on a picnic.

My branch prediction hints are simply defined as this for Windows:

# define LIKELY(x) x
# define UNLIKELY(x) x

In other words... they don't do anything. But, if you see them, you know what they're for.

On a final note about coding conventions, style, and so forth... I tend to adapt to what I need, not pick a particular book style and strictly adhere to it. If I see something I like that suits my needs I'll probably end up using it. I'm always open to new ideas.

Creating a Super-Fast Math Library, Part 1

This is part one of an open ended series on writing a super fast math library using SIMD. The goal of the project was to write a math library that is both cross platform and very, very fast.

I've had a lot of fun doing it and have learned quite a bit and thought I'd share what I've done so far.

Let's skip the small talk and get down to business. The goal is to write vector, matrix and quaternion types that are 1) correct, 2) cross platform, and 3) fast. All of the development has, thus far, been done using Microsoft Visual Studio 5 under Windows.

To get timing results, I'm using Agner Fog's Test programs for measuring clock cycles and performance monitoring. It uses the rdtsc instruction to determine clock counts, cache misses, branch mispredictions and a host of other stuff.

For intrinsics, I am strictly limiting myself (at the moment) to SSE version 1. It is supported on Intel, AMD, and VIA chipsets going back quite a bit.

The platforms I'm targeting are the Xbox, Playstation3, and, of course, the PC but all of the information I'm putting here is PC oriented. The platforms are vastly different and, while I'd love to cover all three in deep meaningful investigation... I really don't have the time. :)

Monday, August 30, 2010

Personal project: in-place list and tree management

Time is a valuable resource. Time spent cannot be regained no matter how hard you try. The only thing you can try to do is be more efficient with your time. Spend it wisely.

Since I started at Obsidian (wow, that was like 6 months ago, time flies!) I've been working non-stop. As a result, some of my personal projects have, shall we say, lagged behind the times. Interestingly enough, it's given me time to gather my thoughts on my own stuff and come up with some interesting new ideas.

One little project I've been toying around with is in-place list and tree management. This is a mundane little project that can actually have some significant impact.

Take a doubly linked list, for example. The difficult and time consuming part of doubly linked lists is getting the insert/remove operations correct. Even if you're an expert, it still takes brain power to trace the code and ensure that it's correct. After that, it's pretty mundane as it's just iteration and allocation of node space. The in-place functions (actually just inline templates) assume that you're providing the memory for the storage and do direct management of the data.

So why, you might ask, is this important or even need addressing? I mean seriously, you could either:
  1. Just write the doubly linked list when you need it and be-done-wif-it!
  2. Use a pre-canned doubly linked list and be-done-wif-it!
Say you go route number 2 and use the pre-canned doubly linked list. It allocates a node, manages next/prev pointers, and is nice and clean and happy from the front end. And then you use it in a memory allocator. Oops! The memory manager that is supposed to be handing out memory is now itself allocating. This causes the memory manager to call itself again, which causes a list operation, which calls the memory manager... which makes you pull your hair out as you try to resolve the circular reference.

Ok, screw that. Go with option number 1 so we can avoid the recursion. I need to write a doubly linked list using space that I've provided. Let's see, I'll create a node with next and previous pointers. Now I'll figure out the pointers to manipulate the nodes... ah that's right, three special cases, head, tail and middle... handle those... and crap now it's broke because I missed something stupid.

Ok, no, really, screw all of that. That's a recipe for disaster both from a maintenance perspective as well as from initially writing the code. K.I.S.S. How much time do you want to spend debugging something that isn't the core of your work?

Let's break it down just a touch... Option number 2 is nice in that you've got a reusable doubly linked list that allocates nodes and manages the list for you. Option number 1 is necessary because there are lots of situations when you don't want to allocate the node. The similarity between the two is that the list is managed in some known block of memory.

Easy, write this and have both Option 1 and Option 2 call it with the data to manage!

template <typename NodeType>
inline void insert_after(NodeType *target, NodeType **list)
{
NodeType *position = *list;

if (position)
{
target->next(position->next());
target->prev(position);

if (position->next())
{
position->next()->prev(target);
}

position->next(target);
}
else
{
*list = target;
target->next(0);
target->prev(0);
}
}


Not a groundbreaking idea, I know, but there are so many possibilities here... you can specialize the function for different arguments. For instance, if you know that the node that you're inserting into is not null and that the next node in the list is not null, you can do a branchless version:

template <typename NodeType>
inline void insert_after_ln(NodeType *target, NodeType *list)
{
target->next(list->next());
target->prev(list);
list->next()->prev(target);
list->next(target);
}


But wait, order now and I'll throw in this fancy little tidbit:

You can use this functionality to make a BRANCHLESS doubly linked list. Imagine, a doubly linked list that never hits a conditional. Ever! Ok, so maybe this is only really interesting to me, but on platforms like the PS3 SPU which absolutely *hate* branch prediction misses, this could be a significant gain in performance.

As to how, I'll leave that as an exercise to you... or maybe post it later today, we'll see. :)