Sunday, October 24, 2010

Creating a Super-Fast Math Library, Part 3: Vectors

We'll be using vectors for the basis of most of the math classes, so this is the best place to start.

Before we get into it, though, there's a few different definitions of what a "vector" is so I'd like to start by clarifying. Indeed, simply doing a google or wikipedia search for "vector" will likely turn up a dozen different meanings. But this isn't meant to be an introduction to math or anything, so I'll keep it very brief.

We're using the mathematical/game development version of "a collection of four numbers." As it turns out (you'll see later) a vector, as we use it, is actually any 128 bit set of stuff; we just happen to use it as a set of four floats. To add to possibilities, there's also a 256 bit set coming out. For our purposes though, we're only using it as a set of four floats.

So our vector object will be comprised of four floats. You can name them anything you want, be it { w, x, y, z }, { x, y, z, w }, { three, peas, inna, pod } I don't care, just as long as we know them as "those four floats that comprise a vector." Scientific, eh?

It's useful to implement a regular scalar vector as well as using any intrinsics provided by the hardware we're running on. The most important reason is for unit testing. If we create the regular scalar vector using an array of floats and can verify that these operations are correct in functionality, we can generate tests for the intrinsic versions very easily. Another reason is to have a fallback, so all of the code doesn't suddenly stop working if you start to develop for a platform that doesn't have intrinsics. Honestly though, the likelihood of that is slim to none.

It just so happens that most hardware vendors somewhat agree on what a vector is, how it operates, and have even given them a type. For the PC, we're using Intel's version (if that link doesn't work for you, let me know!) which is an __m128. The Playstation3 uses the CELL version, which is sometimes called AltiVec or VMX (even though the hardware implementation is entirely different.)

So the most difficult part is to come up with an interface for the vector. We want it to be simple, support common functionality, be very flexible, and hopefully very fast. It also has to support all of the platforms we'd like to use or we have to simulate the functionality.

Before we define the interface, there are a few gotchas with hardware vectors. First, both Intel and the CELL really like 16 byte alignment. It is possible to use unaligned data with vectors, but I avoid them like the plague and we're going to ignore them here. Second, their types are different. To get around this, I don't publicize what a 'vector' type is, I just define an opaque type (vec4f_t) and give the caller methods to access the data. Third, hardware vectors really don't like to be split or accessed individually once they're loaded. My method for enforcing this is that the caller loses easy access to the individual elements. You can still get access to the data if you really need it, but you'll have to create storage and call a store function to get the data.

So here's the interface that I currently use:
namespace math
{
// assume that the vec4f_t type has been defined and
// that it is an opaque type. access is only granted
// via these functions:

// create and initialize a vector as zero
vec4f_t vec4f_zero();

bool vec4f_cmpeq(const vec4f_t&, const vec4f_t&);
bool vec4f_le(const vec4f_t&, const vec4f_t&);

// create and initialize a vector with the value in
// all four elements
vec4f_t vec4f_splat(const float);

// create and initialize a vector with an index of
// another vector in all four elements
vec4f_t vec4f_splat(const vec4f_t&, const uint32_t);

// create and load a vector where components are
// stored as:
// v.a = a
// v.b = b
// v.c = c
// v.d = d
vec4f_t vec4f_load(const f32_t a,
const f32_t b,
const f32_t c,
const f32_t d);

// create and load a vector where components are
// stored as:
// v.a = p[0]
// v.b = p[1]
// v.c = p[2]
// v.d = p[3]
// the array must be 16 byte aligned
vec4f_t vec4f_load(const f32_t * const);

// retrieve the components as:
// p[0] = v.a
// p[1] = v.b
// p[2] = v.c
// p[3] = v.d
// the array must be 16 byte aligned
void vec4f_store(const vec4f_t&, f32_t * const);

// creates a vector from the first two components of
// the first vector and the first two components of
// the second vector.
// v.a = a.a
// v.b = a.b
// v.c = b.a
// v.d = b.b
vec4f_t vec4f_movelh(const vec4f_t&, const vec4f_t&);

// creates a vector from the second two components of
// the first vector and the first two components of
// the second vector.
// v.a = a.c
// v.b = a.d
// v.c = b.c
// v.d = b.d
vec4f_t vec4f_movehl(const vec4f_t&, const vec4f_t&);

vec4f_t vec4f_add(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_sub(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_mul(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_div(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_sqrt(const vec4f_t&);
vec4f_t vec4f_rcp(const vec4f_t&);
vec4f_t vec4f_rsqrt(const vec4f_t&);
vec4f_t vec4f_min(const vec4f_t&, const vec4f_t&);
vec4f_t vec4f_max(const vec4f_t&, const vec4f_t&);

f32_t vec4f_dot(const vec4f_t&, const vec4f_t&);

f32_t vec4f_length_sq(const vec4f_t&);
f32_t vec4f_length(const vec4f_t&);

vec4f_t vec4f_normalize(const vec4f_t&);

// you can assume that these are also available,
// although the exact implementation is left up to
// the platform.
//
// shuffling order is determined by passing a series
// of indices to the shuffle mask function. the
// resulting shuffle will be ordered as:
// { ax, ay, bz, bw }
//
// example 1:
// - shuffle mask of: 2, 3, 0, 1
// - vector A set to: 0, 1, 2, 3
// - vector B set to: 4, 5, 6, 7
// - final vector as: 2, 3, 4, 5
//
// example 2:
// - shuffle mask of: 3, 2, 1, 0
// - vector A set to: 0, 1, 2, 3
// - vector B set to: 4, 5, 6, 7
// - final vector as: 3, 2, 5, 4
//uint32_t vec4f_shuffle_mask(const uint32_t,
// const uint32_t,
// const uint32_t,
// const uint32_t);
//vec4f_t vec4f_shuffle(const vec4f_t&,
// const vec4f_t&,
// const uint32_t);
} // namespace math

Those of you already familiar with vector and SIMD operations probably notice that this is missing some stuff, like projection. This isn't the full blown interface, just what is needed to get to the fun stuff. ;)

For a baseline, we'll create a simple scalar version that is used for unit testing as well as an illusion of portability. This version does it the old fashioned way, with a structure of floats.

So here it is (I haven't strongly verified that everything is present here... if you see something I missed, let me know and I'll fix it!)
namespace math
{
PRE_ALIGN(16) typedef union _vec4f_t
{
float p[4];
struct
{
f32_t x;
f32_t y;
f32_t z;
f32_t w;
} v;
} vec4f_t POST_ALIGN(16);

inline vec4f_t vec4f_zero()
{
vec4f_t v = { 0, 0, 0, 0 };
return v;
}

inline bool vec4f_cmpeq(const vec4f_t &a,
const vec4f_t &b)
{
return (a.v.x == b.v.x &&
a.v.y == b.v.y &&
a.v.z == b.v.z &&
a.v.w == b.v.w);
}

inline bool vec4f_le(const vec4f_t &a,
const vec4f_t &b)
{
return (a.v.x <= b.v.x &&
a.v.y <= b.v.y &&
a.v.z <= b.v.y &&
a.v.w <= b.v.w);
}

inline vec4f_t vec4f_splat(const float v)
{
vec4f_t vec = { v, v, v, v };
return vec;
}

inline vec4f_t vec4f_splat(const vec4f_t &v,
const uint32_t i)
{
vec4f_t vec = { v.p[i], v.p[i], v.p[i], v.p[i] };
return vec;
}

inline vec4f_t vec4f_load(const f32_t a,
const f32_t b,
const f32_t c,
const f32_t d)
{
vec4f_t vec = { a, b, c, d };
return vec;
}

inline vec4f_t vec4f_load(const f32_t * const p)
{
REQUIRE(((uintptr_t)p & 0xf) == 0, "ALIGNMENT");
vec4f_t vec = { p[0], p[1], p[2], p[3] };
return vec;
}

inline void vec4f_store(const vec4f_t &a,
f32_t * const p)
{
REQUIRE(((uintptr_t)p & 0xf) == 0, "ALIGNMENT");
p[0] = a.v.x;
p[1] = a.v.y;
p[2] = a.v.z;
p[3] = a.v.w;
}

inline vec4f_t vec4f_movelh(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x,
a.v.y,
b.v.x,
b.v.y
};
return vec;
}

inline vec4f_t vec4f_movehl(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.z,
a.v.w,
b.v.z,
b.v.w
};
return vec;
}

inline vec4f_t vec4f_add(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x + b.v.x,
a.v.y + b.v.y,
a.v.z + b.v.z,
a.v.w + b.v.w
};
return vec;
}

inline vec4f_t vec4f_sub(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x - b.v.x,
a.v.y - b.v.y,
a.v.z - b.v.z,
a.v.w - b.v.w
};
return vec;
}

inline vec4f_t vec4f_mul(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x * b.v.x,
a.v.y * b.v.y,
a.v.z * b.v.z,
a.v.w * b.v.w
};
return vec;
}

inline vec4f_t vec4f_div(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
a.v.x / b.v.x,
a.v.y / b.v.y,
a.v.z / b.v.z,
a.v.w / b.v.w
};
return vec;
}

inline vec4f_t vec4f_sqrt(const vec4f_t &a)
{
vec4f_t vec =
{
sqrt(a.v.x),
sqrt(a.v.y),
sqrt(a.v.z),
sqrt(a.v.w)
};
return vec;
}

inline vec4f_t vec4f_rcp(const vec4f_t &a)
{
vec4f_t vec =
{
1.0f / a.v.x,
1.0f / a.v.y,
1.0f / a.v.z,
1.0f / a.v.w
};
return vec;
}

inline vec4f_t vec4f_rsqrt(const vec4f_t &a)
{
vec4f_t vec =
{
1.0f / sqrt(a.v.x),
1.0f / sqrt(a.v.y),
1.0f / sqrt(a.v.z),
1.0f / sqrt(a.v.w)
};
return vec;
}

inline vec4f_t vec4f_min(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
min(a.v.x, b.v.x),
min(a.v.y, b.v.y),
min(a.v.z, b.v.z),
min(a.v.w, b.v.w)
};
return vec;
}

inline vec4f_t vec4f_max(const vec4f_t &a,
const vec4f_t &b)
{
vec4f_t vec =
{
max(a.v.x, b.v.x),
max(a.v.y, b.v.y),
max(a.v.z, b.v.z),
max(a.v.w, b.v.w)
};
return vec;
}

inline f32_t vec4f_dot(const vec4f_t &a,
const vec4f_t &b)
{
return (a.v.x * b.v.x) +
(a.v.y * b.v.y) +
(a.v.z * b.v.z) +
(a.v.w * b.v.w);
}

inline f32_t vec4f_length_sq(const vec4f_t &v)
{
return vec4f_dot(v, v);
}
inline f32_t vec4f_length(const vec4f_t &v)
{
return sqrt(vec4f_length_sq(v));
}

inline vec4f_t vec4f_normalize(const vec4f_t &v)
{
return vec4f_div(v, vec4f_splat(vec4f_length(v)));
}

inline uint32_t vec4f_shuffle_mask(const uint32_t a,
const uint32_t b,
const uint32_t c,
const uint32_t d)
{
return ((a&0x3)<<0) |
((b&0x3)<<2) |
((c&0x3)<<4) |
((d&0x3)<<6);
}

inline vec4f_t vec4f_shuffle(const vec4f_t &a,
const vec4f_t &b,
const uint32_t mask)
{
return vec4f_load(a.p[(mask>>0)&0x3],
a.p[(mask>>2)&0x3],
b.p[(mask>>4)&0x3],
b.p[(mask>>6)&0x3]);
}
} // namespace math

To include the code above, I put it in an inline (.inl) file and actually switch off of platform defines interface header. I then include the .inl file directly from there. The caller just includes "vec4f.h" and the right one is built with the code.

Oh, and if you're curious (or a lawyer...), this code is free to use by anyone for anything and I'm not responsible for any misuse or damage it may cause to any person, thing, or being, whether they be alien, native to this planet, or some other classification we haven't thought of yet.

No comments:

Post a Comment