Fibre Fuelled Nathan Drake - Technobabble
Video game development is a complicated business and can be shrouded in mystery at times, but it is also a friendly community willing to share knowledge and make some of that mystery much less fuzzy. Courtesy of GDC 2015, that is the case here.
Like my last article on DX12 – one I will soon be updating with a more "hands on" set of benchmarks and information, so stay tuned for that – which looked at the Xbox One and PC development cycle, here we are looking at the Sony camp, and specifically the PS4.
As I have so much to cover, I do not want to pack too much into my articles and dilute the content, so I will try to keep them as tight and on-point as possible. Naughty Dog was present via lead programmer Christian Gyrling, who was in charge of moving the largely single-threaded PS3 engine over to the PS4. As you may imagine, migrating such a huge code base from something as unique as the Cell and getting it up and running on an x86-64 CPU was no mean feat, but it is one he and the team managed more than admirably with the remastered classic TLOU, my detailed analysis of which you can check out from its launch.
The talk was full of very interesting and insightful details about life on both the PS3 and the PS4, which I will go through here. It is common knowledge that the PS3 engine was designed around 30Hz (i.e. 30fps) from its inception on the original Uncharted: Drake's Fortune (which Christian worked on as one of his first projects). This was the same engine that delivered all of its sequels along with the swan song, TLOU, on the PS3. My earlier software article explains the PS3 and X360 in more detail; to summarise here, the machine had a weaker GPU from Nvidia (the RSX), aided greatly by the much stronger Cell CPU, which sported 1 PPU and 7 SPUs – 1 locked to the OS and the other 6 devoted to game code – which developers like ND and others had to use to make up for the deficit in the GPU and VRAM (256MB each available to the CPU and GPU respectively). The SPUs were used for handling pretty much all the engine system threads, with very little actual game code running on them at all, as that was problematic; most game code ran on the PPU. It was a completely serialised engine, with game logic preceding render calls via the command buffer generation stage, mirroring how DX11 and most game engines work.
This engine and configuration, in Christian's own words, was complicated and difficult for the game programmers to work with. Jobs could not yield once started; the user had to allocate memory and free it when done; and job sync indices needed to be flushed when filled, which caused big stalls – evident in Uncharted 3 as a long hang while the context switch happens. Job list arrays covered many jobs running per frame, so the engine was big, bad and busy, as well as serial. The specific aim for the PS4 engine was to improve and simplify the engine for the game developers, NOT to improve performance with the fibre job system; it was a secondary by-product and bonus that fibres also bring large performance gains when used, though it was not all plain sailing, as he explains. Jobifying the entire engine means breaking it down into smaller, manageable chunks that allow jobs to start, wait and kick other jobs mid-execution rather than waiting until the end, which is costly and wasteful, and leaving all memory management hidden from the user so they do not need to manage it in each game's code.
Now a fibre is, in essence, a small packet holding the entire entity of the job: what is used to fill the stack, the stack pointer, the registers and so on. The register switch happens via a system call (sceFiberSwitch) which yields the running task. This saves the state of the job, allowing it to be seamlessly moved with little to no penalty. Each thread – the task that runs each job/fibre – is locked to a given CPU core and does not move. All jobs, without exception, are executed within these fibres so that control is manageable, with sync points where data race conditions may occur – where a variable is being read from and written to by separate entities, which can cause stale pointers or cache errors – managed through atomic counters that determine when a variable or data set is safe to release from a temporary lock.
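The atomic-counter sync point described above can be sketched in a few lines. This is my own conceptual Python stand-in, not ND's code: the engine uses lock-free atomics in C++, whereas this toy `AtomicCounter` (a hypothetical name) uses a lock and an event, but the pattern is the same – children decrement a counter, and the parent only touches the shared data once it reads zero.

```python
import threading

# Toy model of an atomic job counter (assumed names; the real engine's
# counters are lock-free, this Python version uses a lock and an event).
class AtomicCounter:
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()
        self._is_zero = threading.Event()
        if value == 0:
            self._is_zero.set()

    def decrement(self):
        with self._lock:
            self._value -= 1
            if self._value == 0:
                self._is_zero.set()  # release anyone waiting on this counter

    def wait_for_zero(self):
        self._is_zero.wait()  # the sync point: shared data is safe past here

results = []

def child_job(i, counter):
    results.append(i * i)  # each child writes its share of the work
    counter.decrement()    # signal completion

counter = AtomicCounter(4)
workers = [threading.Thread(target=child_job, args=(i, counter)) for i in range(4)]
for w in workers:
    w.start()
counter.wait_for_zero()  # parent blocks until all four children have run
total = sum(results)     # safe to read: the counter guarantees all writes landed
```

The key property is that the counter, not a mutex around the data itself, is what tells the parent when the race window has closed.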
The new engine supports a total of 160 fibres (jobs) in concurrent flight: 128 with small stacks of 64KiB and 32 with larger stacks of 512KiB, for jobs whose stack needs cannot be known in advance, such as those calling into middleware. Jobs carry a low, medium or high priority flag when pushed into the queue, allowing a higher-priority job to "jump the queue" and become the next job picked up by a thread. As demonstrated by the image here showing the flow, each job is pulled in and can itself add more jobs as a result of what it is executing. In addition, if the atomic counter shows that the pulled job needs to wait on another piece of data still in use, the job is pushed into a wait list. The entire fibre content – all addresses and pointers along with the counter – is copied into a wait pool so it can sleep until the data it needs is free and it can be resumed. Once the blocking job completes, it flags the atomic counter as 0 (free to go) and the waiting job can be pulled back onto a thread to be completed. As the entire stack is saved, this can even happen on a core other than the one that started the job, seamlessly and without contention – though contention does come up later, and is an interesting section of the talk.
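The flow of priority queues, yielding fibres and the wait pool can be modelled compactly. Below is my own illustrative sketch, not the PS4 API: Python generators stand in for fibres (suspending a generator "saves" its state much as a fibre save does), and a one-slot list stands in for an atomic counter; all names are hypothetical.

```python
from collections import deque

# Conceptual model of the fibre scheduler: three priority queues, a wait
# pool of sleeping jobs, and generator-based "fibres" that yield a counter
# when they need to wait on other work.
class Scheduler:
    HIGH, MED, LOW = 0, 1, 2

    def __init__(self):
        self.queues = [deque(), deque(), deque()]  # one queue per priority
        self.wait_pool = []  # (job, counter) pairs asleep until counter == 0
        self.log = []

    def push(self, job, prio=MED):
        self.queues[prio].append(job)

    def run(self):
        while any(self.queues) or self.wait_pool:
            # wake any sleeping fibre whose counter has hit zero
            for pair in [p for p in self.wait_pool if p[1][0] == 0]:
                self.wait_pool.remove(pair)
                self.queues[self.MED].append(pair[0])
            ready = next((q for q in self.queues if q), None)
            if ready is None:
                break  # nothing runnable: deadlock guard for the sketch
            job = ready.popleft()
            try:
                counter = next(job)  # run the fibre until it yields or ends
                if counter is not None and counter[0] > 0:
                    # "copy out" to the wait pool: sleep until counter clears
                    self.wait_pool.append((job, counter))
                else:
                    self.queues[self.MED].append(job)  # yielded but runnable
            except StopIteration:
                pass  # fibre finished

sched = Scheduler()
done_counter = [2]  # stands in for an atomic counter initialised to 2

def child(name):
    sched.log.append("child-" + name)
    done_counter[0] -= 1  # signal completion, as the real jobs do atomically
    return
    yield  # unreachable; makes this function a generator

def parent():
    sched.log.append("parent-start")
    sched.push(child("a"))
    sched.push(child("b"))
    yield done_counter  # sleep until both children have decremented it
    sched.log.append("parent-end")

sched.push(parent(), Scheduler.HIGH)
sched.run()
```

Note how the parent kicks two children, sleeps in the wait pool, and is resumed only after both have flagged the counter – and, as in the talk, nothing in this design cares which "core" resumes it.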
The engine locks its 6 cores dedicated to game logic because, when a system OS call happens, you basically have to claim the core; otherwise the system kernel can take control of it for its own job (or any other thread request that is needed) and evict the currently running task until it completes – similar to the situation I explained in my console article regarding the percentage of the Xbox One CPU's 7th core that can be lost dynamically to a system call. Without the lock this can cause a ripple effect, as all 6 cores could context switch while effectively trying to offer up their spot so the stopped job can run elsewhere. This will, as you may have guessed, cause stalls and hangs; by locking or claiming the cores at the start, only 1 core is affected and the disruption is minimised. The interesting part is that CPU core allocation will always be dynamic like this, so as not to waste cycles on idling cores; the main kernel can always take control when needed. This could explain some of the hangs you can get in games when loading, saving or during notifications from the OS V-shell: where explicit allocation of cores does not happen, the context switch becomes more of a performance hindrance, by a factor of 6. But I stress I am only going on the info shown here, along with my own experience and understanding of how operating systems work. Nevertheless, it is an interesting nugget of information that gives some insight into the PS4 architecture/OS in practice.
This new and very clever dissection of the engine allowed it to run up to 1,000 jobs per frame in TLOU Remastered – that is, every 16.66ms – so it was busy. It means you no longer have a main worker thread managing the entire show; instead the frame is started and ended by one small task and everything else is managed at the granular level we have gone through, so cores are not idling while waiting on other jobs to finish, which is both incredibly useful and efficient. The exceptions are a small amount of standard I/O from disk, socket activity from the network and some other small tasks, which are all handled with standard system threads and interrupts but sleep most of the time, only waking to kick another job and then sleeping again to avoid hogging a thread. Here is the ND profiling tool, which shows on the left Y axis the 6 cores available to the game – your threads. Along the X axis for each core are the separate jobs; the work they process fills up the track, or leaves gaps where time is wasted idling.
This fibre system removes the need for (and the restrictions of) a mutex locking a shared resource between 2 or more threads, a big issue when you get into multi-threaded work. Due to the lightweight nature of the system they cannot use mutexes or semaphore flags within jobs, but the entire process has far more benefits than negatives. One thing that has restricted fibres from wider use so far is that many debugging tools do not support them or handle them well. But the PS4 SDK debugging toolbox most certainly does support them, treating them as threads, so their use will greatly expand within both 1st-party and 3rd-party games, with other debuggers likely adding (or having recently added) support. It still has some issues with a TLS (Thread Local Storage) error that can cause a switched fibre to wake with the wrong pointer (location of data), as it resides in another core's local storage. Again, a flag can be set in some compilers to lock this – another demonstration of how detailed and fiddly software development is, and how easy it is to miss settings, features or functions. The talk has more information that I will leave out here, but another system feature was highlighted to deal with priority inversion: a lower-priority job holding data that a higher-priority job needs is allowed to leapfrog the queue and complete, freeing the resource. This important ingredient in the new system solved most of the remaining locks on the jobs, combined with trying the job system's atomics first in case that frees up the resource before getting heavy-handed. All of this engine work makes it very manageable from a user's point of view and efficient in use, with a design similar to how an operating system works.
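The priority-boosting idea above can be shown as a tiny pass over the job list. This is a hypothetical illustration of the general priority-inheritance technique, not ND's implementation; all job and resource names are made up, and lower numbers mean higher priority here.

```python
# Toy priority-inheritance pass: any job holding a resource that a
# higher-priority job is waiting on inherits that priority, so it can
# "leapfrog" the queue and release the resource sooner.
def apply_priority_inheritance(jobs, waits_on, holds):
    boosted = dict(jobs)  # job name -> priority (lower number = higher)
    for waiter, resource in waits_on.items():
        holder = holds.get(resource)
        if holder is not None and boosted[waiter] < boosted[holder]:
            boosted[holder] = boosted[waiter]  # inherit the waiter's priority
    return boosted

jobs = {"render": 0, "audio": 1, "streamer": 2}  # hypothetical job names
waits_on = {"render": "mesh_data"}               # render waits on mesh_data
holds = {"mesh_data": "streamer"}                # held by low-priority streamer
boosted_prios = apply_priority_inheritance(jobs, waits_on, holds)
# the streamer job is boosted from priority 2 to priority 0
```

Without the boost, the high-priority render job could sit blocked behind arbitrary medium-priority work while the streamer never gets scheduled – the classic priority-inversion trap.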
But from all this information and hard work you would assume that by now the game was up, the engine converted and running seamlessly at 60fps... and you would be wrong. With the initial conversion done, TLOU on PS4 was running at less than 10fps – 132ms per frame. The entire rendering engine had been changed, but none of it had been jobified like the engine code; once this was done, a more-than-100% improvement followed, with the frame time dropping to 55ms – still terrible, at around 18fps. The aim for the team was clear, with the profiler split into red, blue and green as a clear marker for the game and graphics programmers: if you are within the green then great, you are at 60fps; blue is good for 30; and red is Assassin's Creed Unity territory... or worse (joke!). Even at this point, with the game engine and rendering successfully jobified and all 6 CPU cores running in parallel and multi-threaded with the extra fibre content, the game was still woefully over budget and under target, with even 30 a way off. Anyone in doubt about how tight and close to the knuckle a game (or any software) can come together need look no further than this: in April, 3 months from going gold, it was struggling to run at 30fps, and I am sure at this point panic was setting in.
It will come as little surprise that the GPU was not the issue here – in fact the GPU regularly ran too fast, so they had to stall and wait for the engine to get back into lock step – no, the engine was CPU bound. With the critical path of the frame taking 25ms per core to complete, the 16ms needed for 60fps was at this point not only off the table but off the entire menu. They had to work out a better way, and did not want to give up and just ship at 30fps, which is what many would have done. Be under no illusion: with modern games in modern engines, 60fps is not easy, is very expensive, and has to be a design decision made at the start of a project. So the next step was to look at the gaps. With the earlier wasteful locks reduced as far as possible and each task already broken down to a low atomic level, there did not seem to be much left to look at. On the graph you can see the render logic in red and the game logic in green, but notice the big gaps: the CPU was not being maximised at all, which is a problem. Dropping those gaps, the calculation said that 100ms of work spread over the 6 cores was within target – 100/6 = 16.66, which is the magic number – so the goal was still possible. But how? At this point they were 2 months from ship and still at 30fps.
And this is where Christian and crew got very clever. With game logic followed by command buffer calls to the GPU being the standard, serial structure, the quick win is simply to run them in parallel, with game and render running in separate threads. They did this, but added a most certainly dangerous and ingenious twist: not only running the logic split, but running it on different frames. The engine starts frame 1 with standard game logic on the CPU, then passes it on to the render calls on another thread; meanwhile the first thread begins the game logic for frame 2. Once the render stage has generated the command buffer work, it is passed to the GPU, which gets to work rendering the frame. Now the first stage is on frame 3's logic, the second on frame 2, and the GPU is rendering frame 1. No syncing is needed, as the GPU is not involved in any game-logic work like physics, light calculations or ray casts (in TLOU:R anyway) – all of that is done on the CPU – so each stage can be processed, closed and passed on. But having multiple frames in flight gives you big memory issues: you need buffers, you need to rotate them, and ultimately you need to clear them when done – but ONLY when done, because although very efficient, this is also very dangerous and could easily turn into a system crash or the dreaded memory stomp, so it needs some care.
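The staggered frames above form a classic software pipeline, and the staggering can be made concrete with a tick-by-tick model. This is my own sketch of the scheduling shape, not ND's code: each "tick" is one frame interval, and at steady state three consecutive frames are in flight at once.

```python
# Toy tick-by-tick model of the three-stage pipeline: at steady state,
# game logic runs for frame N+2, command-buffer generation for frame N+1,
# and the GPU renders frame N, all in the same tick.
def run_pipeline(num_frames):
    timeline = []                      # (tick, [(stage, frame), ...]) records
    logic_done, cmds_done = set(), set()
    for tick in range(num_frames + 2):  # +2 extra ticks to drain the pipeline
        work = []
        f_logic, f_cmds, f_gpu = tick, tick - 1, tick - 2
        if f_logic < num_frames:                        # stage 1: game logic
            work.append(("logic", f_logic))
            logic_done.add(f_logic)
        if 0 <= f_cmds < num_frames and f_cmds in logic_done:
            work.append(("cmds", f_cmds))               # stage 2: command buffers
            cmds_done.add(f_cmds)
        if 0 <= f_gpu < num_frames and f_gpu in cmds_done:
            work.append(("gpu", f_gpu))                 # stage 3: GPU renders
        timeline.append((tick, work))
    return timeline

timeline = run_pipeline(5)
```

Tick 2 is the first fully loaded tick – logic for frame 2, command buffers for frame 1, GPU on frame 0 – which is exactly why three frames' worth of buffers must live in memory simultaneously.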
They achieved this by creating FrameParams, a handler passed along as a unique instance with each frame, living with it and its data. Everything is encompassed here, giving a free, uncontended resource that can be used, processed and ultimately destroyed when done. It contains all the relevant data pointers in memory and the mesh data that will appear in the frame, making debugging simple by design. 16 of these can be viewed in a trace, with the last 15 tracked, but only 3 live in memory at once – memory would not allow 16 frames to exist simultaneously – and the params are rotated through as the game engine runs. As always, the safe option had been taken at first, which inflated the allocation, with over 200MiB wasted. To resolve this they used what they call tagged heaps: simple 2MB blocks of system RAM from the Onion or Garlic memory bus (CPU/GPU) for the frame section in flight. Each block is tagged with a 64-bit integer referencing what it is being used for; when the frame section completes, a request is sent to release every block carrying the relevant tag. If a job needs more than 2MB, another block is requested and added under the same tag; in these cases a lock is required, but only a small one, per 2MB block. This is very rare, though, with 99.9% of allocations fitting in the one-block allocation, meaning at worst you waste less than 2MB in the case of a memory leak or spike – giving back the 200MiB lost with the first approach. Not only that, but as the locks are mostly not needed, contention is non-existent, with pointer addresses stored in the local storage of the executing core so the entire job runs completely uninterrupted.
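The tagged-heap behaviour described above is simple enough to model directly. This is a toy Python sketch of the general technique under the numbers given in the talk (2MB blocks, one 64-bit tag per frame section); the class and tag names are my own, not ND's.

```python
class TaggedHeap:
    """Toy model of a tagged heap: fixed-size blocks handed out against a
    64-bit tag, then all freed together in one call when the tag is released."""
    BLOCK_SIZE = 2 * 1024 * 1024  # 2MB blocks, as in the talk

    def __init__(self, total_blocks):
        self.free_blocks = list(range(total_blocks))
        self.owned = {}  # tag -> list of block ids handed out under that tag

    def allocate(self, tag, size):
        # hand out as many whole blocks as the request needs (usually one)
        blocks_needed = -(-size // self.BLOCK_SIZE)  # ceiling division
        if blocks_needed > len(self.free_blocks):
            raise MemoryError("tagged heap exhausted")
        grabbed = [self.free_blocks.pop() for _ in range(blocks_needed)]
        self.owned.setdefault(tag, []).extend(grabbed)
        return grabbed

    def free_tag(self, tag):
        # one call returns every block carrying this tag: no per-pointer frees
        self.free_blocks.extend(self.owned.pop(tag, []))

heap = TaggedHeap(total_blocks=8)
GAME_FRAME_1 = 0x0001  # hypothetical 64-bit tag for frame 1's game-logic data
blocks_a = heap.allocate(GAME_FRAME_1, 1_500_000)  # fits in one 2MB block
blocks_b = heap.allocate(GAME_FRAME_1, 3_000_000)  # rare case: spans two blocks
in_flight = len(heap.owned[GAME_FRAME_1])
heap.free_tag(GAME_FRAME_1)  # frame section done: everything back at once
```

The pay-off is that per-allocation bookkeeping disappears: a job never frees individual pointers, it just lets the whole tag go when its frame section completes, which is why the worst-case waste is bounded at under one block.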
Reading and seeing what Christian and Naughty Dog have done here – the work they put in to get the engine moved over to PS4, again making huge strides with the help of the ICE team, SN Systems and other Sony departments – the progress they made so late in the schedule, albeit exacerbated by the migration from PS3, shows how close to the wire a game can be before it ships (and this is not as uncommon as you would think). Other games are also using fibres, the new Mortal Kombat among them, so I am confident many studios will at least try them. To be clear, this is not an ND-invented process; they simply seem to be the team that has embraced it most thoroughly, using fibres exhaustively across the entire engine, and I think the clear benefits have been shown, with far better CPU core utilisation evident.
With the examples given, and the now-delayed-into-2016 release of the first game made from scratch in this new engine, the 60fps target is certainly still the aim. With the engine now designed for 60, and improvements still being made to their own engine along with the SDK and the hardware itself, many things can and will still change. If Sony manages to allocate back the 7th core in time for launch, then going by the example ND gives here of 25ms needed per frame, the GPU is most likely not going to be the obstacle to hitting this target: with the demo shown last year reported by Edge as running at ~40fps, along with the tessellation work on the sea line, objects and characters as per my demo dissection at the time, the frame time is around the 25ms mark. A further core allocation that could handle another ~20ms of work would be enough to push this to the 60fps target all on its own, so long as they are still CPU bound within the engine.
Whether or not they achieve the 60 goal throughout – and I can certainly see that happening, possibly with a 30 target for the real-time cinematics if they really want to push the fidelity higher than the reveal showing, which would not be a bad result – Uncharted 4 will be a stunning, exhilarating and impressive game from Naughty Dog, showing the potential of the team and the engine on PS4. With such hard work and dedication, the wait will be worth it for the bookend of Nathan Drake, a story that started 8 years ago on the PS3 and will conclude on the PS4.