A few of us at work have been having a friendly path-tracing competition (greets to Tom & Dom). It’s been a lot of fun and comparing images in the office each Monday morning is a great motivation to get features in and renders out. I thought I’d write a post about it to record my progress and gather links to some reference material.

Here’s a list of features I’ve implemented so far and some pics below:

  • Monte-Carlo path tracing with explicit area light sampling at each step
  • Stratified image sampling
  • Importance sampled Lambert and Blinn BRDFs
  • Sphere, Plane, Disc, Metaball and Distance Field primitives (no triangles yet)
  • Multi-threaded tile renderer
  • Cross-compiles for PS3 on Linux (runs on SPUs)
  • Quite general shade-trees with Perlin noise etc




Sphere-tracing the distance fields produced some cool effects (the blobby sphere above). I first heard about the technique from Inigo Quilez who used it to generate an amazing image in his slisesix demo, he has some good descriptions on his page but for the details I would check out these papers:


And for global illumination and path-tracing in general:


Also, this is what happens when you push Perlin too far:

I’ve always had an interest in computer generated plants so I was pleased to read Self-organising tree models for image synthesis from Siggraph this year.

The paper basically pulls together a bunch of techniques that have been around for a while and uses them to generate some really good looking tree models.

Seeing as I’ve had a bit of time on my hands between Batman and before I start at LucasArts I decided to put together an implementation in OpenGL (being a games programmer I want realtime feedback).

Some screenshots below and a Win32 executable available – Plant.zip

tree_1
tree_2_bare
tree_2_leaves

Some Notes:

I implemented both the space colonisation and shadow propagation methods. The space colonisation is nice in that you can draw where the plant should grow by placing space samples with the mouse, this allows some pretty funky topiary but I found it difficult to grow convincing real-world plants with this method. The demo only uses the shadow propagation method.

Creating the branch geometry from generalised cylinders requires generating a continuous coordinate frame along a curve without any twists or knots. I used a parallel transport frame for this which worked out really nicely, these two papers describe the technique and the problem:

Parallel Transport Approach to Curve Framing
Quaternion Gauss Maps and Optimal Framings of Curves and Surfaces (1998)

Getting the lighting and leaf materials to look vaguely realistic took quite a lot of tweaking and I’m not totally happy with it. Until I implemented self-shadowing on the trunk and leaves it looked very weird. Also you need to account for the transmission you get through the leaves when looking toward the light:

tree_inside

There is a nice article in GPU Gems 3 on how SpeedTree do this.

The leaves are normal mapped with a simple Phong specular, I messed about with various modified diffuse models like half-Lambert but eventually just went with standard Lambert. It would be interesting to use a more sophisticated ambient term.

Still a lot of scope for performance optimisation, the leaves are alpha-tested right now so it’s doing loads of redundant fragment shader work (something like Emil Persson’s particle trimmer would be useful here).

If you want to take a look at the source code drop me an email.

Known issues:

On my NVIDIA card when the vert count is > 10^6 it runs like a dog, I need to break it up into smaller vertex buffers.

Some ATI mobile drivers don’t like the variable number of shadow mapping samples. If that’s your card then I recommend hacking the shaders to disable it.

No hardware I know of has atomic floating point operations but here’s a handy little code snippet from Matt Pharr over on the PBRT mailing list which emulates the same functionality using an atomic compare and swap:

inline float AtomicAdd(volatile float *val, float delta) {
     union bits { float f; int32_t i; };
     bits oldVal, newVal;
     do {
         oldVal.f = *val;
         newVal.f = oldVal.f + delta;
     } while (AtomicCompareAndSwap(*((AtomicInt32 *)val),
         newVal.i, oldVal.i) != oldVal.i);
     return newVal.f;
}

In unrelated news, I’ve taken a job at LucasArts which I’ll be starting soon, sad to say goodbye to Rocksteady they’re a great company to work for and I’ll miss the team there.

Looking forward to San Francisco though, 12 hours closer to my home town (Auckland, New Zealand) and maybe now I can finally get along to Siggraph or GDC. If anyone has some advice on where to live there please let me know!

Also a few weeks in between jobs so hopefully time to write some code and finish off all the tourist activities we never got around to in London.

Batman: Arkham Asylum is finished and the demo is up on PSN and Xbox Live. I was pretty much responsible for the PS3 version on the code side so anything wrong with it is ultimately my fault. As any PS3 engineer working on a cross platform title will tell you there is always the fear of the ’side by side comparison’. This popped up pretty quickly after the demo was released:

http://www.eurogamer.net/articles/digitalfoundry-batman-demo-showdown-blog-entry

The article is quite accurate (unlike some of the comments) and TBH it was as positive as I could have hoped for.

Generally the game has been getting a good reception which is great to see as Batman games have a proud tradition of being complete rubbish.

This link was going round our office, a discussion over at Lambda the Ultimate regarding Tim Sweeney’s HPG talk.

http://lambda-the-ultimate.org/node/3560

Tim chimes in a bit further down in the comments.

Code517E recently reminded me of a site I’ve used before when looking up form factors for various geometric configurations.

One I had missed the first time though is the differential element on ceiling, floor or wall to cow.

http://www.me.utexas.edu/~howell/sectionb/B-68.html

Very handy if you’re writing a farmyard simulator I’m sure.

I put together an implementation of the particle shadowing technique NVIDIA showed off a while ago. My original intention was to do a survey of particle lighting techniques, in the end I just tried out two different methods that I thought sounded promising.

The first was the one ATI used in the Ruby White Out demo, the best take away from it is that they write out the min distance, max distance and density in one pass. You can do this by setting your RGB blend mode to GL_MIN, your alpha blend mode to GL_ADD and writing out r=z, g=1-z, b=0, a=density for each particle (you can reconstruct the max depth from min(1-z), think of it as the minimum distance from an end point). Here’s a screen:

smoke2

The technique needs a bit of fudging to look OK. Blur the depths, add some smoothing functions, it only works for mostly convex objects, good for amorphous blobs (clouds maybe). Performance wise it is probably the best candidate for current-gen consoles.

http://ati.amd.com/developer/gdc/2007/ArtAndTechnologyOfWhiteout(Siggraph07).pdf

IMO the NVIDIA technique is much nicer visually, it gives you fairly accurate self shadowing which looks great but is considerably more expensive. I won’t go into the implementation details too much as the paper does a pretty good job at describing it.
http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/smokeParticles/doc/smokeParticles.pdf

The Nvidia demo uses 32k particles and 32 slices but you can get pretty decent results with much less. Here’s a pic of my implementation, this is running on my trusty 7600 with 1000 particles and 10 slices through the volume:

smoke1

Unfortunately you need quite a lot of quite transparent particles otherwise there are noticeable artifacts as particles change order and end up in different slices. You can improve this by using a non-linear distribution of slices so that you use more slices up front (which works nicely because the extinction for light in participating media is exponential).

Looking forward to tackling some surface shaders next.

A friend just sent me this:

http://playpower.org/

It’s a non-profit organisation with the goal of developing educational games for developing countries that run on 8bit NES hardware. The old Nintedo chips are now patent-free and clones are very common:

nes

They’re trying to recruit programmers with a social conscience, I’m not old-school enough to know 8bit assembly but then I wouldn’t mind learning.. who needs GPUs anyway!

A small update on my global illumination renderer, I’ve ported the radiance transfer to the GPU. It was fairly straight forward as my CPU tree structure was already set up for easy GPU traversal, basically just a matter of converting array offsets into texture coordinates and packing into an indices texture.

The hardest part is of course wrangling OpenGL to do what you want and give you a proper error message. This site is easily the best starting point I found for GPGPU stuff:

http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html

So here’s an image, there are 7850 surfels, it runs about 20ms on my old school NVidia 7600, so it’s still at least an order of magnitude or two slower than you would need for typical game scenes. But besides that it’s fun to pull area lights around in real time.

balls2

Not as much colour bleeding as you might expect, there is some but it is subtle.

I changed my surfel renderer over to use a pre-order traversal layout for the nodes, this generally gives better cache utilisation and I did see a small speed up from using it. The layout is quite nice because to traverse your tree you just linearly iterate over your array and whenever you find a subtree you want to skip you just increment your node pointer by the size of that subtree (which is precomputed, see Real Time Collision Detection 6.6.2).

The best optimisation though comes from compacting the size of the surfel data, which again improves the cache performance. As some parts of the traversal don’t need all of the surfel data it seems to make sense to split things out, for instance to store the hierarchy information and the area seperately from the colour/irradiance information.

In fact it seems like when generalised, this idea leads you to the structure of arrays (SOA) layout, which essentially provides the finest grained breakdown where you only pull into the cache what you use and for all the nodes that you skip over there is no added cost.

I haven’t done any timings to see how much of a win this would actually be, mainly because dealing with SOA data is so damn cumbersome.

It definitely seems like something you should do after you’ve done all your hierarchy building and node shuffling which is just so much more intuitive with structures. Then you can just ‘bake’ it down to SOA format and throw it at the GPU/SIMD.

I just bought a copy of Physically Based Rendering, I’ve been meaning to get one for ages as it’s often recommended and I thought it might be useful given my recent interest in global illumination. I’m also hoping to get a more formal background in rendering rather than the hacktastic world of real time.

The subjects covered are broad and it’s very readable. It’s my first exposure to literate programming, where essentially the book describes and contains the full implementation of a program. In fact the source code for their renderer is generated (tangled) from the definition of the book before compilation.

The only problem is the amount it weighs, I like to read on the tube but 1000 page hard backs aren’t exactly light reading.

After my initial implementation of surfel based illumination I’ve extended it to do hierarchical clustering of surfels based on a similar idea to the one presented in GPU Gems 2.

A few differences:

I’m using a k-means clustering to build the approximation hierarchy bottom up. A couple of iterations of Lloyd’s algorithm provides pretty good results. Really you could get away with one iteration.

To seed the clustering I’m simply selecting every n’th surfel from the source input. At first I thought I should be choosing a random spatial distribution but it turns out clustering based on regular intervals in the mesh works well. This is because you will end up with more detail in highly tesselated places, which is what you want (assuming your input has been cleaned).

For example, a two tri wall will be clustered into one larger surfel where as a highly tesselated sphere will be broken into more clusters.

The error metric I used to do the clustering is this:

Error = (1.0f + (1.0f-Dot(surfel.normal, cluster.normal)) * Length(surfel.pos-cluster.pos)

So it’s a combination of how aligned the surfel and cluster are and how far away they are from each other. You can experiment with weighting each of those metrics individually but just summing seems to give good results.

You could also be more selective and search for the pair of surfels with the lowest error metric, cluster and repeat, I believe this would give better results than Lloyd’s algorithm but would be a lot slower to build.

When summing up the surfels to form your representative cluster surfel you want to:

a) sum the area
b) average the position, normal, emission and colour weighted by area

Weighting by area is quite important and necessary for the emissive value or you’ll basically be adding energy into the simulation.

Then you get to the traversal, Bunnel recommended skipping a sub-tree of surfels if the distance to the query point is > 4*radius of the cluster surfel. That gives results practically identical to the brute force algorithm and I think you can be more agressive without losing much quality at all.

I get between a 4-6x speed up using the hierarchy over brute force. Not quite realtime yet but I haven’t optimised the tree structure at all, I’m also not running it on the GPU :)

It seems like every post I make here has to reference Christer Ericson somehow but I really recommend his book for ideas about optimising bounding volumes. Loads of good stuff in there that I have yet to implement.

Refs:

Real-time Soft Shadows in Dynamic Scenes using Spherical Harmonic Exponentiation
Lightcuts: a scalable approach to illumination

Wow, there is some seriously cool research happening in global illumination at the moment. I was trying to come up with a nice GPU assisted realtime method and decided rendering direct illumination to light maps on the fly and then gathering using pre-computed texture lookups per vertex would be the way to go.

This method has a few drawbacks though:

- Precomputing the lookups is slow, it’s basically a final gather pass with lots ray casting
- You need a unique parameterisation for your geometry
- It supports dynamic local light sources but not dynamic geometry
- You need to store the lookup coordinates for each vert

Turns out that this the same basic idea Dreamworks used on Shrek2, and sure enough their gather pass was the bottleneck (An Approximate Global Illumination System for Computer Generated Films).

The current state of the art seems to be the ’surfel’ approach that (I think) was invented by Michael Bunnell with his GPU Gems 2 article Dynamic Ambient Occlusion and Indirect Lighting, he approximates the mesh as a set of oriented discs and computes the radiance transfer dynamically on the GPU.

Turns out Pixar took this idea and now use it on all their movies, Pirates of Carribean, Wall-E, etc. The technique is described here in Point Based Approximate Color Bleeding. Bunnell’s original algorithm was O(N^2) in the number of surfels but he used an approximation hierarchy to get that down to O(N.log(N)), Pixar use an Octree which also stores a spherical harmonic approximation at each node.

What’s really interesting is how far Bunnell has pushed this idea, if you read through Fantasy Lab’s recent patent (August 2008), there are some really nice ideas in there that I haven’t seen published anywhere.

Here’s a summary of what’s new since the GPU Gems 2 article:

- Fixed the slightly cumbersome multiple shadow passes by summing ‘negative illumination’ from back-facing surfels
- Takes advantage of temporal coherence by simply iterating the illumination map each frame
- Added some directional information by subdividing the hemisphere into quadrants
- Chucked in some nice subdivision surfaces stuff in at the end

Anyway, I knocked up a prototype of the indirect illumination technique and it seems to work quite well. The patent leaves out loads of information (and spends two pages describing what a GPU is), but it’s not too difficult to work out the details (note the form factor calculation is particularly simplified).

Here are the results from a *very* low resolution mesh, in reality you would prime your surfels with direct illumination calculated in the traditional way with shadow mapping / shaders then let the sim bounce it round but in this case I’ve done the direct lighting using his method as well.

gi_shot1

gi_shot2

Disclaimer: some artifacts are visible here due to the way illumination is baked back to the mesh and there aren’t really enough surfels to capture all the fine detail but it’s quite promising.

This seems like a perfect job for CUDA or Larrabee as the whole algorithm can be run in parallel. You can do it purely through DirectX or OpenGL but it’s kind’ve nasty.

One of my work mates had some code with a lot of floating point clamps in it the other day so I wrote this little branch free version using the PS3’s floating point select intrinsic:

float Clamp(float x, float lower, float upper)
{
	float t = __fsels(x-lower, x, lower);
	return __fsels(t-upper, upper, t);
}

__fsels basically does this:

float __fsels(float x, float a, float b)
{
	return (x >= 0.0f) ? a : b
}

I measured it to be 8% faster than a standard implementation, not a whole lot but quite fun to write. The SPUs have quite general selection functionality which is more useful, some stuff about it here:

http://realtimecollisiondetection.net/blog/?p=90

(Not sure about this free WordPress code formatting, I may have to move it to my own host soon)

An interesting thread going around the GDA mailing list at the moment about multithreaded programming reminded me of a little test app I wrote a while back to measure the cost of two threads accessing memory on the same cache line.

The program basically creates two threads which increment a variable a large number of times measuring the time it takes to complete with different distances between the write addresses. Something like this:

__declspec (align (512)) volatile int B[128]; 

DWORD WINAPI ThreadProc(LPVOID param)
{
	// read/write the address a whole lot
	for (int i=0; i < 10000000; ++i)
	{
		(*(volatile int*)param)++;
	}

	return 0;
}

int main()
{
	volatile int* d1 = &B[0];
	volatile int* d2 = &B[127];

	while (d1 != d2)
	{
		HANDLE threads[2];

		// QPC wrapper
		double start = GetSeconds();

		threads[0] = CreateThread(NULL, 0, ThreadProc, (void*)d1, 0, NULL);
		threads[1] = CreateThread(NULL, 0, ThreadProc, (void*)d2, 0, NULL);

		WaitForMultipleObjects(2, threads, TRUE, INFINITE);

		double end = GetSeconds();

		--d2;

		cout << (d2-d1) * sizeof(int) << "bytes apart: " << (end-start)*1000.0f << "ms" << endl;
	}

	int i;
	cin >> i;
	return 0;
}

On my old P4 with a 64 byte cache line these are the results:

128bytes apart: 17.4153ms
124bytes apart: 17.878ms
120bytes apart: 17.4028ms
116bytes apart: 17.3625ms
112bytes apart: 17.959ms
108bytes apart: 18.0241ms
104bytes apart: 17.2938ms
100bytes apart: 17.6643ms
96bytes apart: 17.5377ms
92bytes apart: 19.3156ms
88bytes apart: 17.2013ms
84bytes apart: 17.9361ms
80bytes apart: 17.1321ms
76bytes apart: 17.5997ms
72bytes apart: 17.4634ms
68bytes apart: 17.6562ms
64bytes apart: 17.4704ms
60bytes apart: 17.9947ms
56bytes apart: 149.759ms ***
52bytes apart: 151.64ms
48bytes apart: 150.132ms
44bytes apart: 125.318ms
40bytes apart: 160.33ms
36bytes apart: 147.889ms
32bytes apart: 152.42ms
28bytes apart: 157.003ms
24bytes apart: 149.552ms
20bytes apart: 142.372ms
16bytes apart: 136.908ms
12bytes apart: 145.691ms
8bytes apart: 146.768ms
4bytes apart: 128.408ms
0bytes apart: 125.655ms

You can see when it gets to 56 bytes there is a large penalty (9x slower!) as it brings the cache-coherency protocol into play which forces the processor to reload from main memory. I’m not sure why it is triggered at 56 bytes and not 64 bytes, maybe someone out there knows..

Actually it turns out this is called “false sharing” and it’s quite well known, the general solution is to pad your shared data to be at least one cache line apart.

Refs:

http://software.intel.com/en-us/articles/reduce-false-sharing-in-net/
http://www.ddj.com/embedded/196902836

So after running my Metaballs demo on my girlfriends laptop it appears to be GPU limited due to fillrate. This is mainly due to the overkill number of particles in my test setup and the fact they hang round for so long, but the technique is fillrate heavy so it might be a problem.

It’d be nice to do a multithreaded CPU implementation to see how that compares, but the advantage of the current method is that it keeps the CPU free to do other things.

You could probably get more performance in some cases by uploading all the metaballs as shader parameters (or as a texture) and evaluating them directly in the pixel shader. Also I just realised I could probably render the ‘density’ texture at a lower resolution for a cheap speedup.

I’ve been meaning to implement this idea for ages, it’s a GPU implementation of 2D metaballs. It’s very simple, very fast and doesn’t even require hardware shader support. Seems like the kind of effect that could be useful in some little game..

metaballs

It’s quite fun to play with (use left drag to rotate the emitter), so I might try and fancy it up with some better graphics and collision.

The demo is below, I’ll probably release source at some stage but at the moment it’s all tied up in my dev framework which needs to be cleaned up (the exe is also larger than it should be as I have all sorts of crap linked in).

Demo

The simulation is just done using my simple particle system but it would be fun to implement smoothed particle hydrodynamics..

Information on how to correctly make sprite textures is just not on the web anywhere. There are so many subtle problems with making transparent textures for use in a 3D engine, here are some of the things I learned recently from my hobby project:

Generally the easiest way is to paint on a transparent background, that is, don’t try and paint the alpha channel yourself. That’s because it’s a complete nightmare trying to match up your rgb and alpha channels correctly. This is one of the reasons why you see loads of games with dark ‘halo’s around their particle textures (although it’s not the only reason). Unless you want to get really tricky and try and paint your own pre-multiplied texture it’s not worth it, just create a new image in Photoshop with a transparent background and export as PNG.

Note: the Photoshop Targa exporter won’t export the alpha channel properly from a transparent image, it only supports alpha if you explicitly add the alpha channel to an RGB image and paint it yourself. This is a slight pain because I’ve always favoured Targa as an image file format that’s easy to read / write but in this case Photoshop drops the ball once again.

OK now you have your image exported, the image is currently NOT pre-multiplied, it may look like it in Photoshop and most image viewers but on disc it is not pre-multiplied.

Read the file into ram using libPNG or your PNG library of choice.

Generate mip maps… this is where it gets interesting. If you just call something like gluBuild2DMipmaps() your texture will be wrong, the lower level mipmaps will have a dark halo around the edges. Now what most artists do in this case is convert it to a non-transparent image and start painting in some bright colour around the edges of their sprite in the RGB channels, this is the equivalent of a really nasty filthy hack from which no good can come. You can never paint just the right colour in there and can never get it in just the right place, if you’re having this problem go talk to your coders and get them to implement a better texture import pipeline (see below):

The fix is really quite simple and you should probably have already guessed that it is to use pre-multiplied alpha. If you haven’t heard of it before read here:

Tom Forsyth’s blog

http://keithp.com/~keithp/porterduff/

Although it’s not mentioned on those sites pre-multiplication is also the solution to building mipmaps correctly. Let’s imagine you don’t pre-multiply before you downscale to build your mipmap, you have four source RGBA texels from the higher level:

t1, t2, t3, t4

In some kind of box filter arrangement, then your destination pixel in the next lower mip is computed like this:

lowermip.rgb = (t1.rgb + t2.rgb + t3.rgb + t4.rgb) / 4
lowermip.a = (t1.a + t2.a + t3.a + t4.a) /4

And when you come to render with your blend function set to GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA you get:

framebuffer.rgb = lowermip.rgb * lowermip.a + framebuffer.rgb * (1.0-lowermip.a)

It should be farily obvious why this doesn’t work, the filtered RGB channel will contain information from neighbouring pixels even if those pixels have a zero alpha. As the background in most exported source textures is black it’s the equivalent of ‘pulling in’ black to your lower mip rgb channel.

The very simple solution is to use pre-multiplication which makes your lower mip calculation like this:

lowermip.rgb = (t1.rgb*t1.a + t2.rgb*t2.a + t3.rgb*t3.a + t4.rgb*t4.a) / 4;
lowermip.a = (t1.a + t2.a + t3.a + t4.a) / 4;

So now you’re weighting each pixels contribution to your lower mip by the appropriate alpha value. And when you render you set your blend modes to GL_ONE, GL_INV_SRC_ALPHA. Not only do your mips work correctly you can now ‘mix’ additive and subtractive blending which is great for fire turning into smoke for instance (see references below).

You don’t need to export your image pre-multiplied, I fix mine up at import / asset cooking time which is probably preferable so you can keep working on the PNG source asset in a convenient way.

This is really just the basics and mip-map filtering has had a lot of other good stuff written about it:

http://number-none.com/product/Mipmapping,%20Part%201/index.html
http://number-none.com/product/Mipmapping,%20Part%202/index.html

Also see http://code.google.com/p/nvidia-texture-tools/ for some image manipulation source code (the Nvidia Photoshop DDS plugin also has an option to modulate your image by alpha when generating mip-maps).

This is a great little paper about soft-edged particles and pre-multiplied alpha (even though it’s not referred too as such)
Soft Edges and Burning Things: Enhanced Real-Time Rendering of Particle Systems

Profiling is one of those things that requires a lot of discipline. It’s basically the scientfic method, recently I’ve been optimizing our PS3 build and I got burnt by all my sloppy habits.

Basically you always need to have a benchmark to verify that what you’re doing has actually made things faster. Lot’s of people will just recommend doing a profile before you start making changes, well of course you should profile first, how else would you know what’s wrong? What you need to do though is to make a copy of you executable at the start of every day (along with saves of any profiling data).

This is because your initial profile has skewed the data so much that you can’t be sure what you’re looking at is execution time or instrumentation time. If you suddenly want to profile some new section of code and all you have is your old capture data you’re screwed. Hence keep the executable around and you can profile the same section and doing a meaningful comparison.

My friend James related this to the Heisenberg uncertainty principle which I quite liked, you can’t observe performance without altering it (except at a very macro scale).

Whenever an artist (user) comes over to report a bug the typical response from a programmer is “that hasnt changed in ages”, “nothings different”. Of course this is a completely ridciulous response unless you happen to be the only programmer on the project. The user just wants you to acknowledge the fact that something’s wrong so it can be fixed it but of course the programmer doesn’t want anymore work on his plate so he feebly attempts to pass it off.

The other great one is, “it works on my computer”.. brilliant fobbery.

I’m definitely guilty of this every now and then but I do at least try not to give such a lame response.

Dogfight is coming along pretty well, I’ve started writing a particle engine, which is tricky as it was one of my main responsibilities at Sony and I had a pretty complex and full featured system. The tricky part is stopping myself from over engineering it and instead write the most bare bones system possible. I always doubted whether the artists needed / used all the features I added anyway.

In general with particle systems there are two types, the uber-emitter which is one monolithic class with a thousand options, tick boxes and modes, this type is easy to start with but then quickly becomes rather messy as you add options.

The second type is module based where you essentially hook modules together in an “update chain”, this is what I had at Sony running on the SPUs, it’s nice but has runtime overhead in that you usually will need to do multiple passes over the data allowing each module to modify the particle data at spawn and update time. An optimization of this type of system is to link the modules together at compile time so that you literally compile an update loop specific to a set of modules and options. This is pretty much the best of both worlds except that your iteration time now includes a C++ compilation step which isn’t great for artists.

In Dogfight I’m going the uber-emitter approach, I don’t think you can get away with multiple passes over the data on something like the iPhone, although I might add update modules for things like vortex fields and more expensive updates.

This also leads me to a rant on point sprites, basically point sprites are a complete pile of shit. They can’t do anything interesting, you can’t rotate them, you can’t animate the uvs, you can’t billboard along a vector for trail like effects. Also the GL_OES_point_sprite extension culls the sprite based on the point position so as it goes off screen the entire sprite will disappear even though half of it is still on screen!

If you have vertex shaders this is all a non issue, using two vertex streams, one for your particle data and one for your uv data you set the divisor on the particle stream to 4 and the modulo on the uv data to 4 and you’ve quartered the amount of data you need to send to card.

Inside the vertex shader you can then do rotation or interesting billboarding. Nice and fast, minimal API calls. If only embedded devices had them.

So I’ve been trying to find out a bit more about the iPhone’s capabilities, this site has some good details. No shaders, roughly enough power for 15000 lit triangles @ 30hz. Not too shabby, I wonder about the power consumption and if certain hardware paths drain the battery faster than others. It would be nice to make something that people can play for more than 3 hours.

So the word on the street is that O2 are bringing out a pre-pay iPhone in December, I’m currently debating whether to wait for that or just get an iPod touch which you can still develop for using pretty much the same feature set.

Small update, well a whole extra dimension really. Finally got my Collada importer working, it’s actually pretty easy compared to the process of writing and maintaining your own one but there’s always the pain of learning a new, relatively undocumented api.

So I chucked a free plane model in there, the materials aren’t imported yet but it seems Collada supports that (and everything else).

Right now I’m integrating Squirrel as I really want to be able to develop when I’m away from home without having to compile new exes. Plus scripting languages make all sorts of things easier really. Yet another API to learn but it’s fairly straight forward. I’m looking forward to finishing all this middleware integration and actually getting on with making the game.

I woke up Saturday and decided to just do some mindless ‘aesthetic’ work on Dogfight.. I added a simple background and also some sound effects. I also hooked up a wav (which is actually from a sea plane) and messed with the frequency based on airspeed. Works great and made it much more fun to play with, flying in silence is no fun.

Turns out my issues with the airplane control were really due to me using an incomplete flight model. I wasn’t incorporating the angle of attack and the subsequent lift coefficient and increase in dynamic pressure.

I didn’t think it would be such a big deal but it makes a huge difference and solves the problem that unless I gave the plane a very large thrust to weight ratio it had almost no responsiveness. The trouble with a high thrust to weight ratio is that anything greater than 1.0 means your plane can completely oppose gravity and climb up to space unhindered (disregarding the change in air pressure at high altitudes). This ruins one of the key game play elements I wanted to incorporate which is the aerodynamic stall.

For example generally only really modern jet planes have a TTW ratio > 1.0f. Most commercial planes are well below that. Concord for instance has a ratio of 0.373 (although that is a passenger craft so weighs a lot more also).

So, back to the angle of attack, for most aerofoils there is a sweet spot at an angle of attack of about 20 degrees where the lift coefficient is maximal. Once I had this approximated using something like:

lift coefficient = 1.0f + sin(angle of attack)

Things started getting a lot better, artificially multiply this by 2 and you’ve got a plane which picks up nicely when you pull up the nose. It feels a little more like a glider rather than a jet in that it’s more about lift than engine thrust but it feels pretty realistic and flighty. Try to climb too fast and you’ll lose lift and find the plane quickly returning to earth, align the plane with it’s direction of travel, pull up the nose and you can save it just before the crash which is exactly what I originally wanted.

So that’s all running nicely, it needs a bit more tweaking, I’m not actually using the drag coefficient at the moment which I’ll try out soon.

This is an awesome reference that does a much better job describing these concepts than Wikipedia (although that’s still useful):

http://www.av8n.com/how/htm/airfoils.html

One of the best two player games of my childhood was Dogfight. It’s a true classic, and like many other people I decided I wanted to do a remake, networked and fancied up a bit.

So I start coding up a flight model calculating lift The Right Way from dynamic pressure, proper drag etc and my little triangle is gliding along beautifully, but is it fun? Hell no, it’s horrible to play, flying a plane isn’t simple, it’s a pain in the ass to just keep your triangle in the air.

The original had totally arcade controls, you could do endless loop-da-loops on the spot, you press left and you get an instantaneous change of direction. This is great for gameplay and is satisfying because you feel in control of your triangle.

But it’s horrible to code, your flight model goes from being an elegant relationship between variables to some horrible hacked thing full of conditionals which enables physics at some points like when you stall (aerodynamic stall not engine stall). Maybe I’m being precious but this is about when I start losing motivation to continue, you realise all the horrible little things you’re going to have to write to actually finish a game.

I should just suck it up and get on with it, I want to do the networking side of things as I’ve never touched socket programming and this is like a perfect test case.

I am no expert when it comes to physics programming (for that see Christer Ericson’s blog) but like lots of games programmers I have written a few physics demos using Verlet integration and point masses.

So that’s nothing new, but a while back I came across a really nice approach for iterative constraint solving (which is the common approach in Verlet engines). Usually you would formulate your constraint function and solve it directly, for example the classic distance constraint would look like:

error = restlength - |A-B|

In the case of the distance constraint, fixing this error is straight forward, you simply adjust A and B along their axis by error/2 and -error/2 respectively.

For more complicated constraints things can become difficult, because although it can be easy to calculate the error it is not always obvious how to correct for it.

Generally what you want to do is move along the derivative of your constraint function, ie: minimize the error function using Newton’s method. This works pretty fine and provided you do enough iterations will converge quite quickly. However finding the derivative function itself can be problematic, let’s take the example of an angular constraint function:

Given three points and a rest angle (the angle between the two line segments) this function will the return the current error:

error = restangle - acos(dot(B-A/|B-A|, C-B/|C-B|))

Now you have the error, how do you move the three points to correct for it? Well you need to calculate the partial derivative of the error function for each component of A, B and C. To do this symbolically is a pain in the ass and quickly gets messy, I tried plugging it into Mathematica and got a huge mess before eventually doing it by hand which was very tedious given my rusty calculus.

But here’s another way, if you can’t calculate the derivative directly you can at least approximate it using finite differences. This turns out to be so sweet because all of a sudden you only have to write your error function and you’re done. If that’s not clear here’s an example to hopefully elucidate things:

First of all, calculate the current error, I’ll stick with the angular constraint example which takes three points.

currentError = Error(A, B, C)

Now calculate the finite difference for each component of the three points by evaluating the error at some small step away from the current position, here h is our step size and A,B and C are two-vectors but the same thing applies to other dimensions:

X = {h, 0}
Y = {0, h}


derivA.x = (Error(A + X, B, C) - currentError)/h
derivA.y = (Error(A + Y, B, C) - currentError)/h


derivB.x = (Error(A, B + X, C) - currentError)/h
derivB.y = (Error(A, B + Y, C) - currentError)/h


derivC.x = (Error(A, B, C + X) - currentError)/h
derivC.y = (Error(A, B, C + Y) - currentError)/h

(Note the short hand notation for the differentiation of a scalar function with respect to a vector is the del operator)

Now you just need to scale according to the current error and relative masses of the points and apply each derivative to each point respectively.

So there you have it, you’ve solved the constraint with no more information than a way of measuring the current error. Because this is less direct it is also generally more costly but can be a great for prototyping new or more complex constraints.

For more information see the following:

http://www.beosil.com/download/PositionBasedDynamics_VRIPHYS06.pdf
http://www.bulletphysics.com/Bullet/wordpress/bullet
http://www.gamasutra.com/resource_guide/20030121/jacobson_01.shtml

It’s ridiculous not being able to post URLs into youtube video comments.  Perhaps I can understand restricting everyone else but if you posted the video you should be able to post what ever you like in the comments for it.

Instead, people have come up with some way of encoding the URL to bypass their filters which is rubbish.

It would also be nice if they actually gave you an error message telling you why your post hasn’t shown up instead of silently failing.

While I was at Sony I spent a lot of time thinking about making tasks run concurrently and how they should be scheduled to maximise performance.

Recently this kind of analysis has been spilling over into my everyday life and you start seeing ways you could parallelise all sorts of tasks.  Of course this is just common sense and something we all do naturally do different degrees.

A simple example is making a cup of tea, you don’t want to get the tea bags and the cups first, no, that would be a waste of precious milliseconds.  First you switch on the jug and then get to work on the rest of the process in parallel.  I imagine anyone who has worked in a kitchen before would be experts at this kind of scheduling and could probably do my job better than I can.

So I took a new job today, working for Rocksteady in London.  Quite exciting, they use Unreal3 tech which I’ve always respected (and stolen ideas from).  It’s a big game title and technically I don’t know what it is.. but it’s a big franchise and should do well.

Plus working with friends in Kentish Town, s’all good.

I want this blog to be mainly code focused and finally I have something worth writing about.  I made a little visualizer for the recently released Radiohead House of Cards data set.

Check it out on YouTube:

http://www.youtube.com/watch?v=YDSHGppDNPg

And / or download the executables & source (Win32 + OpenGL only at this stage):

http://code.google.com/p/hocgl/downloads/list

Oh, and by Programmer Bum I refer to the fact that I have had the last month off work doing whatever I please and it’s been glorious.