This is an update to my previous post about multi-threaded rendering, and some thoughts about leveraging the A5 processor in the iPad 2.
I’ve finally managed to track down the issue that broke the multi-threaded rendering. It turned out that, for reasons unknown to me, the [context renderbufferStorage:fromDrawable:] call has to be performed on the main thread. If it is called from any other thread, it simply does not work.
After I found this out, I was able to get all my methods to work, and I added a new method. Here is a little summary:
- The single-threaded method does everything on the main thread and is just for reference.
- The GCD method uses a display link on the main thread to kick off the rendering on a serial GCD queue that runs on another thread. Display link events may get dropped if the main thread is busy.
- The threaded method uses a display link on a separate thread that kicks off the rendering on the same thread. Display link events may get dropped when the rendering takes too long.
- The threaded GCD method combines the GCD and threaded methods. It runs a display link on a separate thread and kicks off the rendering into a serial GCD queue that runs on yet another thread. It is completely decoupled from the main thread, and the rendering doesn’t block the display link either. Hence, the display link should be very reliable.
I didn’t conduct any real performance measurements to see which method is better. However, I personally like the last approach. It should minimize blocking, and one nice benefit is that it makes frame drops very easy to count: a frame is dropped whenever the display link fires again while the GCD queue is still busy with the previous frame.
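To make the frame-drop counting concrete, here is a minimal sketch in plain C++ rather than the Objective-C/GCD code from the repository. The class `FrameGate` and its method names are made up for this illustration. The assumption is that the display link callback calls `TryBeginFrame()` before enqueueing a render task, and that the render task calls `EndFrame()` when it finishes.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper (not part of LimbicGL): tracks whether a render task is
// still in flight and counts display link ticks that arrive while it is busy.
class FrameGate {
 public:
  // Call from the display link callback. Returns true if a render task should
  // be enqueued. Returns false, and counts a dropped frame, if the previous
  // render task has not finished yet.
  bool TryBeginFrame() {
    bool expected = false;
    if (busy_.compare_exchange_strong(expected, true)) {
      return true;
    }
    dropped_.fetch_add(1);
    return false;
  }

  // Call at the very end of the render task, on the rendering queue.
  void EndFrame() {
    assert(busy_.load());  // EndFrame must pair with a successful TryBeginFrame
    busy_.store(false);
  }

  int dropped() const { return dropped_.load(); }

 private:
  std::atomic<bool> busy_{false};
  std::atomic<int> dropped_{0};
};
```

The atomics matter because the display link and the rendering run on different threads in the threaded GCD method. A tick that finds the gate busy is simply skipped, so render tasks never pile up in the queue.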
In addition to getting it to work, I’ve also added a very simple asynchronous .pvr texture loader.
The code is available at https://github.com/Volcore/LimbicGL .
Based on the above results, I’ve been thinking about how to write a renderer that properly utilizes the A5 chip.
Before the A5, we had to balance three principal systems: the CPU, the tiler (which transforms the geometry and distributes it to the rendering tiles), and the renderer (which renders the pixels for each tile).
Balancing between tiler and renderer is app-dependent and somewhat straightforward: if the tiler usage is low, we can use higher-poly models “for free”. And if the renderer usage is low, we can do more pixel shader magic. If both are low and the game still runs slowly, it’s probably CPU-bound.
Now, with the A5, there is an additional component in the mix: a second CPU core. The golden question is: how can we use it effectively in a game?
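The rule of thumb above can be written down as a tiny decision function. This is only a sketch: the `Advice` struct, the `Diagnose` function, and the `low_threshold` cutoff of 0.5 are my own assumptions for illustration, and the utilization values are expected to be fractions in [0, 1] read off a profiler by hand.

```cpp
#include <cassert>

// Hypothetical result type for the balancing heuristic.
struct Advice {
  bool room_for_geometry;  // tiler usage is low: higher-poly models are affordable
  bool room_for_shading;   // renderer usage is low: more pixel shader work is affordable
  bool likely_cpu_bound;   // both are low but the frame is still slow
};

// tiler_usage and renderer_usage are fractions in [0, 1].
// low_threshold is an arbitrary illustrative cutoff, not a measured value.
Advice Diagnose(double tiler_usage, double renderer_usage, bool frame_is_slow,
                double low_threshold = 0.5) {
  assert(tiler_usage >= 0.0 && tiler_usage <= 1.0);
  assert(renderer_usage >= 0.0 && renderer_usage <= 1.0);
  const bool tiler_low = tiler_usage < low_threshold;
  const bool renderer_low = renderer_usage < low_threshold;
  return Advice{tiler_low, renderer_low, tiler_low && renderer_low && frame_is_slow};
}
```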
Here are some of my ideas:
- Run the game update and the rendering in parallel. This requires double buffering of the game data, either by flip-flopping, or by copying the data before every frame. Interestingly, this works well with the threaded GCD approach from above. We can just kick off a game update task for the next frame into a separate serial GCD queue at the same time we render the current frame, and they both run in parallel.
- After the game update is done (this should only take a fraction of a frame unless you do some fancy physics), we can pre-compute some rendering data:
  - View frustum culling, occlusion culling, etc.
  - Precompute skinning matrices and transformations
- CPU skinning. Instead of handling the forward kinematics skinning in the tiler on the GPU, we could run it on the CPU. This is more flexible, since we’re not bound by the limits of the vertex shaders (the limit on the number of matrices comes to mind). I’m uncertain about the performance benefits here. It’s a trade-off between CPU and DMA memory bandwidth on one side and tiler usage on the other. I think this may pay off very well in situations where one mesh is rendered several times per frame (shadow mapping, deferred shading without multiple render targets, multi-pass algorithms in general). One of the biggest drawbacks is that the memory usage becomes (number of mesh instances * size of mesh), versus just one shared copy of the mesh.
- Precompute lighting with methods such as spherical harmonic lighting, where the results can be baked into the vertex colors. This could even run over several frames and then only be updated at a certain rate (e.g. every 10 frames).
- Procedural meshes and textures. This is interesting, and mostly depends on high memory bandwidth, which the A5 should provide.
- Asynchronous loading of data (textures, meshes). This is mostly limited by I/O, but some interesting applications (such as re-encoding or compression) come to mind.
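To illustrate the first idea, here is a minimal sketch of the flip-flop variant of double buffering, again in plain C++. `GameState`, `DoubleBufferedState`, and `RunFrame` are hypothetical names, and the two `std::thread` objects only stand in for the two serial GCD queues (game update and rendering). The assumption is that the renderer only reads the front buffer while the update only writes the back buffer, so the two tasks need no locking within a frame.

```cpp
#include <array>
#include <cassert>
#include <thread>

// Hypothetical game data; a real game would hold transforms, animation state, etc.
struct GameState {
  int frame = 0;
  float player_x = 0.0f;
};

// Two copies of the game data: the renderer reads `front`, the update writes `back`.
class DoubleBufferedState {
 public:
  const GameState& front() const { return buffers_[front_]; }
  GameState& back() { return buffers_[1 - front_]; }
  void Flip() { front_ = 1 - front_; }  // only call when no task is running

 private:
  std::array<GameState, 2> buffers_{};
  int front_ = 0;
};

// Runs one frame: the update for the next frame and the rendering of the
// current frame execute in parallel, then the buffers are flipped.
// Returns the frame number that was "rendered".
int RunFrame(DoubleBufferedState& state) {
  const GameState& current = state.front();
  GameState& next = state.back();
  assert(&current != &next);

  int rendered_frame = -1;
  std::thread update([&] {
    // Game update: derive the next state from the current one.
    next.frame = current.frame + 1;
    next.player_x = current.player_x + 1.0f;
  });
  std::thread render([&] {
    // Rendering placeholder: only reads the current state.
    rendered_frame = current.frame;
  });
  update.join();
  render.join();

  state.Flip();  // both tasks are done, so swapping is safe
  return rendered_frame;
}
```

Flipping avoids the per-frame copy, at the price that the update has to rewrite every field of the back buffer, because that buffer still holds the state from two frames ago.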
I’m going to try a few of these over the next month, and I hope to have some nice results and insights 🙂
As my closing words: We live in exciting times for mobile GPU programming! <3