Performance Optimizations

March 7, 2014

Normally the last few weeks of development before launch are really stressful because everything has to come together in a nice clean package. Usually this means frantically trying to fix the remaining bugs, making sure the game runs great on all supported hardware while at the same time polishing the game. The last weeks are very critical to achieving that certain high quality feel. That’s why this time I’ve decided to ease the pain of the last few weeks with a head start. In this blog post I’ll talk about the most important performance optimizations that I’ve been working on for the last two weeks.

Performance Profiler

Some might think that optimizing is boring and tedious because the work is very time consuming and nothing seems to happen to the game on the surface. To help with this I’ve turned optimizing into a sort of game for myself. Whatever I’m optimizing, I first come up with some metric, usually a number whose value I can track and try to make as small as possible. I set up a challenge for myself to see how low I can push that number. Some things are easy to measure, like the memory consumption of the process with Windows Task Manager or the time needed to process a single frame. But to really dig deep into performance issues it’s necessary to break down the measurements into smaller bits to get a better idea of what to optimize. Therefore I built a performance profiler directly into the game that I can summon with a press of a button.

The profiler shows the milliseconds and percentage of frame time spent in each subsystem of the game, plus various other statistics. To run at 60 frames per second the computer can spend up to about 16 milliseconds processing each frame, including updating the game world and rendering the view. The profiler also shows the amount of temporary memory allocated during the frame (the Malloc column); more about that later.
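To make the idea concrete, here is a minimal sketch of a per-subsystem frame profiler. It is written in Python purely for illustration and is not the game's actual profiler; the class name `FrameProfiler`, the `section` helper and the subsystem names are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

FRAME_BUDGET_MS = 1000.0 / 60.0  # about 16.7 ms per frame at 60 fps


class FrameProfiler:
    """Accumulates the time spent in each named subsystem during one frame."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def section(self, name):
        # Time everything executed inside the "with" block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0

    def report(self):
        """Return (name, milliseconds, percent of frame) rows, slowest first."""
        frame_ms = sum(self.totals_ms.values())
        rows = []
        for name, ms in sorted(self.totals_ms.items(), key=lambda kv: -kv[1]):
            percent = 100.0 * ms / frame_ms if frame_ms > 0 else 0.0
            rows.append((name, ms, percent))
        return rows
```

A frame loop would wrap each subsystem in `with profiler.section("render"):` and compare the summed time against `FRAME_BUDGET_MS`.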

Memory optimizations

Legend of Grimrock 2 is a 32-bit application on Windows so that means that the application can use up to 2GB of memory without hacks. To make matters worse in my experience the real limit is closer to 1.5GB presumably because DirectX resources eat up virtual address space. Textures and other assets eat a lot of memory so 1.5GB is not that much today. Optimizing memory usage will also help with load times, and can potentially increase overall performance too. It’s important to get the memory usage as low as possible without sacrificing quality of assets.

After cleaning up unused assets we determined that we still needed to shave off some more memory, so I began looking into what could be done on the code side. One easy optimization, which was already planned for Grimrock 1 but which I never had time to work on, was a simple animation compression technique. In Grimrock 1 animations are stored as an “array of structs”, where each struct contains position, rotation and scale. This was not optimal because, for example, scaling is very rarely used. In fact most skeletal animation nodes have only rotation movement. A simple optimization is to store keyframes as a “struct of arrays”, meaning that position, rotation and scale keys are optional. For many animation nodes we just need to store constant position and scale values and varying rotation keyframes. This optimization cut the memory usage of animations by 20 MB.
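A rough sketch of the struct-of-arrays layout with optional channels follows. It is Python for illustration only, not the engine's real data format; `NodeAnimation`, `compress` and the exact-equality test for "constant" channels are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]


@dataclass
class NodeAnimation:
    # Constant values, used whenever the matching key array is absent.
    rest_position: Vec3
    rest_rotation: Quat
    rest_scale: Vec3
    # Optional per-frame key arrays; None means the channel never changes.
    position_keys: Optional[List[Vec3]] = None
    rotation_keys: Optional[List[Quat]] = None
    scale_keys: Optional[List[Vec3]] = None

    def sample(self, frame: int):
        """Return (position, rotation, scale) for the given frame index."""
        pos = self.rest_position if self.position_keys is None else self.position_keys[frame]
        rot = self.rest_rotation if self.rotation_keys is None else self.rotation_keys[frame]
        scl = self.rest_scale if self.scale_keys is None else self.scale_keys[frame]
        return pos, rot, scl

    def stored_floats(self) -> int:
        """Number of floats kept, for comparison with 10 floats per frame in AoS."""
        count = 3 + 4 + 3
        count += 3 * len(self.position_keys or [])
        count += 4 * len(self.rotation_keys or [])
        count += 3 * len(self.scale_keys or [])
        return count


def compress(frames):
    """Convert a non-empty list of (position, rotation, scale) structs."""
    def keys_or_none(values):
        # Assumption: a channel is "constant" if every key is exactly equal.
        return None if all(v == values[0] for v in values) else list(values)

    positions = [f[0] for f in frames]
    rotations = [f[1] for f in frames]
    scales = [f[2] for f in frames]
    return NodeAnimation(positions[0], rotations[0], scales[0],
                         keys_or_none(positions),
                         keys_or_none(rotations),
                         keys_or_none(scales))
```

For a node that only rotates, only the rotation array survives, so the per-frame cost drops from ten floats to four.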

Another big optimization was compression of the vertex format used by models. Previously all model vertices had normal, tangent, bitangent and texcoord vectors stored as 32-bit floating point values. Floats have a very big range and high precision, more than we need, so I compressed those into 16-bit integers. Another common trick is to leave out the bitangent needed for normal mapping, because it can be reconstructed in the shader by computing the cross product of the normal and tangent vectors (TBN handedness still needs to be stored, but it fits nicely into the tangent vector’s fourth component). Vertex format optimization yielded about 75 MB in savings.
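The two tricks can be sketched as follows. This is an illustrative Python version, not the engine's vertex format: the signed-normalized mapping to the range -32767..32767 and all function names are assumptions, and in the game the bitangent reconstruction would run in a shader.

```python
import struct


def quantize_snorm16(x):
    """Map a float in [-1, 1] to a signed 16-bit integer."""
    x = max(-1.0, min(1.0, x))
    return int(round(x * 32767.0))


def dequantize_snorm16(q):
    """Inverse of quantize_snorm16 (up to rounding error)."""
    return q / 32767.0


def pack_tangent(tangent, handedness):
    """Pack tangent xyz plus handedness (+1 or -1) into 8 bytes."""
    values = (tangent[0], tangent[1], tangent[2], handedness)
    return struct.pack("<4h", *(quantize_snorm16(v) for v in values))


def unpack_tangent(data):
    """Return ((x, y, z), handedness) from 8 packed bytes."""
    x, y, z, w = (dequantize_snorm16(q) for q in struct.unpack("<4h", data))
    return (x, y, z), (1.0 if w >= 0 else -1.0)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def reconstruct_bitangent(normal, tangent, handedness):
    """Bitangent = cross(normal, tangent), flipped by the stored handedness."""
    bx, by, bz = cross(normal, tangent)
    return (bx * handedness, by * handedness, bz * handedness)
```

Four 16-bit integers take 8 bytes where four 32-bit floats take 16, and dropping the bitangent removes a whole vector per vertex.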

Rendering optimizations

Grimrock 1 didn’t need a geometry level of detail system, but Grimrock 2 has much longer view distances and many more models on the screen, so the number of triangles drawn can get quite large. I had implemented a very simple level of detail (LOD) rendering system some months ago where we simply swap between high detail and low detail meshes based on their distance from the player. While this increased the frame rate, it resulted in an ugly snap when the LOD level changed, which limited the usability of the system. When doing the performance optimizations I revisited the old LOD system. After doing some initial tests with the artists we figured that a crossfade between the LODs would be the ideal solution. Unfortunately alpha-blending the models is out of the question with a light prepass deferred renderer. A pretty common technique is to use alphatest dissolving instead. I used the technique successfully in Alan Wake’s rendering engine and it still works great today. The end result is quite good and it’s hard to see the LOD transition even if you know what’s going on. We also use the same dissolving technique to fade out small objects like grass.
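One common way to do an alpha-test dissolve is to compare a fade factor against a per-pixel threshold pattern, so that the outgoing and incoming meshes cover complementary sets of pixels. The sketch below simulates that decision on the CPU in Python. The 4x4 ordered-dither pattern, the function names and the linear fade band are assumptions; the post does not say which pattern or fade curve the game uses.

```python
# Standard 4x4 ordered-dither (Bayer) matrix with values 0..15.
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def dither_threshold(x, y):
    """Threshold strictly between 0 and 1 for the pixel at (x, y)."""
    return (BAYER_4X4[y % 4][x % 4] + 0.5) / 16.0


def keeps_pixel(x, y, fade, fading_in):
    """Alpha-test style decision: True if the mesh writes this pixel.

    fade runs from 0.0 (transition start) to 1.0 (transition finished).
    The mesh fading in passes where fade exceeds the threshold; the mesh
    fading out passes everywhere else, so no pixel is drawn twice or missed.
    """
    passes = fade > dither_threshold(x, y)
    return passes if fading_in else not passes


def lod_fade(distance, switch_distance, band):
    """Fade factor for the low detail mesh across a transition band (band > 0)."""
    t = (distance - switch_distance) / band
    return max(0.0, min(1.0, t))
```

Because every pixel is either fully written or fully discarded, this works with a deferred renderer where true alpha blending is not available.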

We are using distorted planar reflections for our water rendering and this requires rendering the scene twice, once for the main view and once mirrored upside down. This can get really heavy on frame rate. Fortunately it’s not necessary to draw the reflected scene with full detail. In fact many objects don’t need reflections at all, usually those that are far away from the reflective surface (except if they are really tall, like towers). Making an automatic solution that handles all cases nicely is pretty hard, so we gave a few hints to the renderer to help pick the objects to reflect. Objects in Grimrock 2 have three reflection modes: “never” means that the object is never reflected. It’s used for small objects like most items. “always” means the object is always reflected, like the sky and very large structures. “cell” is the default option and is used by almost everything. With this option we take advantage of our grid structure. The level designer can paint in Dungeon Editor which cells in the level have reflections enabled. Static objects with “cell” reflection mode will then skip reflection rendering if their cell is not reflective. For dynamic objects we currently use either “never” or “always” mode so that we don’t have to constantly check where they are and update their reflection enable flag.
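The three modes boil down to a small decision function. Here is a sketch in Python; the mode names come from the text above, while `ReflectionMode`, `should_reflect` and the representation of painted cells as a set of (x, y) pairs are assumptions.

```python
from enum import Enum


class ReflectionMode(Enum):
    NEVER = "never"    # small objects such as most items
    ALWAYS = "always"  # sky, very large structures
    CELL = "cell"      # default: follow the painted grid cells


def should_reflect(mode, cell, reflective_cells):
    """Decide whether an object is drawn in the mirrored reflection pass.

    cell is the (x, y) grid cell of the object; reflective_cells is the set
    of cells the level designer has marked as reflective.
    """
    if mode is ReflectionMode.NEVER:
        return False
    if mode is ReflectionMode.ALWAYS:
        return True
    return cell in reflective_cells
```

For static objects the result never changes, so it can be computed once when the level is loaded rather than every frame.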

Game world update

Years of object oriented programming tend to produce bad habits. One good example is the game object update logic that was in place two weeks ago. In Grimrock 2 game objects such as monsters, doors and teleporters are made from components such as lights, models, clickable zones and particle effects. In the object-oriented way each game object had an update() method which called update() for all its components. Can you see what’s wrong with this? There are at least two big problems (and a few other missed optimization opportunities). First, the code has to iterate through all components regardless of whether they actually need to be updated. For example, models do not need to be updated at all because there is nothing dynamically changing about them. Secondly, the code has to “megamorphically dispatch” to the component update routine, meaning the code is jumping between different component types all the time. This code branching is very slow. A much better approach is to update all components of a given type in one go, i.e. update all particle systems in one pass, update all animations in one pass and so on, something like this:
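A rough sketch of such a per-type update pass, written in Python rather than the game's own code and using hypothetical component classes (`MonsterBrain`, `Animation`, `Model`) and a hypothetical `World` container:

```python
from collections import defaultdict


class MonsterBrain:
    def __init__(self, log):
        self.log = log

    def update(self, dt):
        self.log.append("brain")


class Animation:
    def __init__(self, log):
        self.log = log

    def update(self, dt):
        self.log.append("animation")


class Model:
    """Static component: it has no update() and is never visited."""


class World:
    # Explicit, easily reordered list: brains run before animations.
    UPDATE_ORDER = [MonsterBrain, Animation]

    def __init__(self):
        self.by_type = defaultdict(list)
        self.enabled = {t: True for t in self.UPDATE_ORDER}

    def add(self, component):
        # Components are bucketed by type when they are created.
        self.by_type[type(component)].append(component)

    def update(self, dt):
        for component_type in self.UPDATE_ORDER:
            if not self.enabled[component_type]:
                continue  # whole types can be toggled off for profiling
            for component in self.by_type[component_type]:
                component.update(dt)
```

Each inner loop calls the same update routine repeatedly, and component types that never change simply do not appear in `UPDATE_ORDER`.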


This restructuring of the update alone saves several milliseconds per frame. It also has other very nice properties. The code is easier to profile and it’s easy to toggle updating of components by type. It’s also trivial to change the update order of component types. For example, animation components should be updated after monsters so that animations start playing immediately, not one frame after the monster’s brain has decided what to do.

With all of these optimizations in place the average frame rate seemed decent (we haven’t tested on low end setups yet though, so we may have to go back to optimizations later). I have a frame rate number displayed on the screen all the time and I’ve set it up so that it turns bright red if the frame rate dips below 60 fps. While testing I noticed that every now and then, apparently for no reason, the frame time spiked above 16ms. I immediately began suspecting Lua’s garbage collector and added Lua memory statistics to the profiler. It turned out we were allocating about 40 KB per frame; at 60 fps that’s about 2.5 MB per second! After a few seconds of this Lua decided that enough is enough and collected garbage, which dipped the frame rate. We were very lucky that this problem had not surfaced with Grimrock 1. Lucky because garbage collection issues are really hard to fix. I suspect that the working set and the amount of garbage generated were much smaller in Grimrock 1, so the problem did not exist.

I began hunting down the source of garbage. Thanks to the update restructuring it was easy to add per component type memory allocation statistics and a few culprits were quickly found. Some cases were easy to optimize away, like the creation of temporary tables here and there. Much more problematic was vector math code that created a lot of temporaries. Lua’s garbage collector is not particularly good at short lived temporaries like these. I decided to try an experimental technique: make separate vector and matrix classes that would be allocated from a pool. At the end of the frame the temp vectors and matrices would be returned to the pool. The only problem was how to handle “boxing”. Temporaries could not be stored permanently in objects’ fields because their values would become corrupt at the end of the current frame. A simple solution is to use boxed vectors as member variables and copy the values explicitly from the temporary to the boxed version. It’s a bit of a chore to do it but it seems to work okay.
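The pooling and boxing idea can be sketched like this. It is a Python illustration of the general technique, not the game's Lua implementation; `VecPool`, `TempVec`, `BoxedVec` and their methods are hypothetical names.

```python
class TempVec:
    """Pooled 3D vector; only valid until the end of the current frame."""
    __slots__ = ("x", "y", "z")

    def __init__(self):
        self.x = self.y = self.z = 0.0


class BoxedVec:
    """Ordinary long-lived vector that is safe to keep in object fields."""
    __slots__ = ("x", "y", "z")

    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x, self.y, self.z = x, y, z

    def assign(self, v):
        # Explicit copy from a temporary into permanent storage.
        self.x, self.y, self.z = v.x, v.y, v.z


class VecPool:
    def __init__(self):
        self._items = []
        self._used = 0

    def get(self, x, y, z):
        if self._used == len(self._items):
            # Allocates only until the pool has grown to the peak frame usage.
            self._items.append(TempVec())
        v = self._items[self._used]
        self._used += 1
        v.x, v.y, v.z = x, y, z
        return v

    def add(self, a, b):
        """Vector sum whose result is a pooled temporary, not a new object."""
        return self.get(a.x + b.x, a.y + b.y, a.z + b.z)

    def end_frame(self):
        # Hand every temporary back; outstanding TempVec references are now stale.
        self._used = 0
```

Once the pool has warmed up, vector math inside a frame allocates nothing, at the cost of having to remember to `assign` into a boxed vector before the frame ends.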

I still haven’t gone through all the places, but the worst-behaving temporary allocating routines have now been optimized. As a result temporary memory allocations have gone down from over 40KB to about 4KB per frame. My goal is to keep optimizing until it is below 1 KB. Garbage collection is already pretty harmless but I want to beat it so there’s no doubt about it.

That pretty much sums up the work of the past two weeks. Together with content optimizations the game now uses about 25% less memory and runs 25% faster. Not bad for two weeks of work! Hopefully this was an interesting read. If not, then prepare yourself for some more artsy blog update coming next! 🙂

28 responses to “Performance Optimizations”

  1. mrgo0se says:

    Fantastic and interesting update, cheers guys – (Although a few more screenshots to whet our appetite wouldn’t hurt 🙂

    Also now intrigued as to what those 4 redacted performance metrics are!? 🙂

    Really looking forward to the release, Keep up the good work..


  2. Earlyflash says:

    I’m curious – why not just make the executable 64-bit. There can’t be that many 32-bit platforms left around these days, can there?

    • petri says:

      According to Steam Hardware Survey about 20% of Windows 7, Windows XP and Windows Vista users have a 32-bit operating system…

      • evilskillit says:

        That’s too bad as it sounds like 32 bit is getting pretty restrictive for modern gaming. When RAM was cheap I picked up 16gb for $70 and Grimrock 2 can only use about 10% of it 🙁 But eliminating at least 20% of your potential user base is probably a marketing move you can’t afford to make and developing both 32 and 64 bit versions would probably be just as costly if not more so I suspect.

        • petri says:

          Yeah, it’s a bummer. Targeting the game for 64 bits and backporting to 32 bits would not be wise I think. Lots of work + the 32 bit version would probably feel too stripped down / unpolished.

          But the next game engine I write is hopefully 64 bit only 🙂

  3. Rorrik says:

    Glad I’m not the only one with optimization issues. Very impressive work. Getting to this early should make the high quality feel better than ever!

  4. Diarmuid says:

    Wow Petri, this was such an impressive and interesting read! We are incredibly lucky to have you guys share such low-level technical details without being “afraid” of boring readers…

    This profiler thing… if I only could have this while finishing the tests for the ORRR2 mod! (sigh)

    Can’t wait to play this, take your time to do it right.

  5. LostBoy says:

    Very interesting read, thanks for this 🙂

    You mentioned last few weeks before launch, is there a rough estimate of when this is going to be released?

    Haven’t seen any kind of dates mentioned in any of the Blogs, so I’m wondering when to look for this.

  6. j.wordsworth says:

    Thanks for the detailed update. It’s great to read a bit more about what’s going on under the hood – both as an interested reader and a modder.

    The performance profiler looks great and it’s interesting to read about some of the optimisations that you’ve implemented, especially when trying to keep under the 2GB-limit on process memory on Win32. The revised component system also sounds nice. I’ve always liked the idea of taking a more ‘data-driven’ approach of game objects / components, and it’s nice to see that it has a real impact on performance.

    With regards to Lua garbage collection, I’ve never used it – but theoretically you can ‘step’ the Lua garbage collector with collectgarbage("step"). I don’t know what the overall additional overhead is for doing just a step of the garbage collection instead of doing it all in one go, but if you were to step the GC just a little bit every frame – you could at least measure the impact on a frame-by-frame basis instead of having it kick in and disrupt rendering of every Nth frame?

    Anyway, thanks again for the great update. I look forward to seeing the benefits of all of this in LOG2 and, more excitingly, from the perspective of the Dungeon Editor!

    • petri says:

      Thanks for the feedback! The GC issue is now fixed — amount of garbage generated is now < 1KB per frame and default Lua GC settings seem to have no problems with it anymore.

  7. cvesperc says:

    I’ve just finished reading the article and I have to admit, I understood only half of the words there…! 😀 Nevertheless, I’m surprised and happy to know that people (or at least one of them) who created Grimrock created also another of my favourite games, Alan Wake.
    And that “weeks before release” made me pumped as I was expecting the game to come out in winter or so!

  8. vlzvl says:

    As a WebGl programmer myself, i noted you’re using quite a few triangles for your skybox 🙂
    Why’s that great number vs normal skyboxes?
    About the profiler, you’re the man 🙂
    You’re showing some recent games how this critical phase (performance) should be handled
    Keep it going

  9. Stebe says:

    Good stuff. If you can achieve smooth LOD transitions, that will be killer. Seems like I can spot it most games. 🙁

    I’d rather not see any screenshots of the game at all before it’s released – it’s like watching a movie trailer, and learning and seeing all about the major plot points and characters, before you even buy tickets. 🙁 That is not the optimal experience, surely! =p

  10. sepikotona says:

    Petri, might you be aka Aivo/Mellow Chips? If so, I was on the jury and an organizer at the Demolition parties back in the day here in Joensuu. Either way, I stopped at level 4 in Grimrock, where the so-called mushrooms and your volume shadow routine really came into their own. I just had to stop, make some coffee and watch that dance of lights and shadows. I said to myself then, “holy shit, awesome” 😀


  11. sepikotona says:

    Pete, the dude can code 68x and x86. Props for that. I’ll buy the game itself of course, just to see the water effect and how the volume shadow routine works 🙂 Any chance of implementing an easter egg? Something like a sine scroller or a starfield? The demoscene would surely appreciate it 😀


  12. sepikotona says:

    Oh I forgot, English here. What I said to Petri: his volume shadow routine beats MM Legacy hands down.
    And on other hand, phonged donuts rules 🙂


  13. Hokkou says:

    Really interesting update, thank you! While I’m definitely no expert in this area it’s still interesting to get a basic understanding about the various tasks you have to do to create this game =) Keep it coming!

  14. Makuta says:

    I want to thank you SO MUCH for keeping this Windows XP playable ^_^, I have a strangely good graphics card (free laptop from a friend), and the only thing that stops me from playing some games is the DX10+ requirements. Can’t wait to try the new game! ^_^

  15. Montis says:

    Thanks for the update.

    One thing I noticed though: all items are set to never reflect… what about if someone wants to do a mirror-like floor tile, then the items would not be displayed in the mirrored version. Could this be circumvented somehow for user created content?

  16. Telcontar says:

    I’m surprised that some readers have not realized that the 1.5 GB memory limit does not only apply to ancient Windows versions, but also to some iDevices where Grimrock 2 may be ported to… indeed, they currently have only 1 GB of RAM so even with smaller textures, these optimizations are necessary.
    I take the fact that Grimrock 2 fits within 1.5 GB as a good sign of also fitting within 1 GB (after some more optimizations) 🙂

  17. PauliP says:

    Perhaps you’ll add AMD Mantle support to improve performance even further ;).

  18. Dejay Clayton says:

    It’s interesting to see the performance statistics that you’ve chosen to blur out in the screenshot above. I’m going to keep my mouth shut, but those statistic names were pretty easy to reverse engineer in Photoshop 🙂
