It may be a bit strange for a performance addict like myself to publish something on the upcoming AMD mobile architectures. Quite frankly, I don’t really care that much for the mobile space. Most if not all of the devices are too constrained in terms of power and thermals, which means the tuning options are limited. I am, however, still very interested in new architectures and, more specifically, in finding out their limitations.
AMD’s Beema and Mullins architectures are the third generation of AMD’s mobile offerings, following Brazos and Kabini/Temash. A few weeks ago Timothée and I attended a conference call in Taipei, and here’s what we learned.
AMD 2014 Goal #1 – PSP with ARM Cortex A5
AMD breaks down their 2014 plan in three parts. The first part is the so-called Platform Security Processor, which is an additional ARM-based Cortex A5 processor on-die. From a performance point of view there is not much to say about the PSP. It’s mainly AMD’s approach to the industry’s request for hardware-based security solutions in response to the many-devices world we live in today. As a prime example, AMD reminds us how a lot of people access mail and other work-related documents not only from their workstation, but also from their notebook, tablet and smartphone. Especially for security-sensitive data this poses a problem.
AMD takes care of all that by integrating a licensed ARM Cortex A5.
The main reason why I want to emphasize this is that AMD is the first to integrate an ARM core on an x86 processor. Whether that’s a definite win is something I will leave open, but the fact remains that AMD is continuing its story of integration. On a Beema or Mullins processor you now have access to x86, ARM and GCN technology, which makes the architecture seem very versatile. Of course we will have to see how much developers take advantage of the various technologies.
AMD 2014 Goal #2 – Power optimization and frequency up
The second part of AMD’s plan to conquer 2014 is, not so surprisingly, to deliver vastly more performance per watt. This can be achieved in two ways: either you reduce the power at a fixed performance point, or you increase the performance at a fixed power consumption point. AMD tries to do both at the same time.
When we compare the second and third generation of AMD’s mobile offerings, we can see that the frequency has indeed increased. Kabini and Temash, two products featuring the Jaguar core, operate at a maximum frequency of 2.0GHz and 1.4GHz respectively. For Beema and Mullins this increases to 2.4GHz and 2.2GHz. An additional 800MHz on the most low-power spec is quite significant. Comparing the integrated GPUs of both generations, we find AMD bumping up the frequency by 200MHz on Beema and 100MHz on Mullins. Also, AMD has upped the memory support from DDR3L-1600 to DDR3L-1866, claiming a 5% performance increase in 3DMark Cloud Gate.
As for performance measurements, I’m afraid I can only present you the numbers and figures from the AMD presentation slide. Grains of salt to be added, sized subjectively, in other words. Overall, AMD positions the new architectures at 15-35% higher system performance than Intel’s products in the same market space. In terms of compute, that performance difference goes up to 3-7x in, for example, BasemarkCL. Of course this is mostly a numbers game and I advise anyone to read performance comparisons from other well-respected media to form a proper conclusion.
On the power consumption side of the story, AMD has set the TDP ratings for Beema and Mullins at 15/10W and 4.5/3.95W respectively, compared to 25/9W for Kabini and 8/3.9W for Temash. A respectable improvement, we could say. Additionally, AMD claims to achieve a 500mW power reduction on the system memory side, benchmarked at DDR3L-1333, and a 200mW improvement for high-resolution displays through the use of voltage-mode logic.
Comparing the performance per watt to their own products, AMD claims double the system and graphics performance per watt for Mullins over Temash. For Beema, that’s a 10% performance increase using 40% less power. All great stuff.
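A quick sanity check on that Beema claim: taking the slide’s numbers at face value (these are AMD’s figures, not my own measurements, and the baseline of 1.0 is an invented normalization), the perf/watt gain works out to roughly 1.8x.

```python
# Perf/watt from AMD's slide claims: Beema delivers ~10% more performance
# at ~40% less power than Kabini. Baseline normalized to 1.0 for illustration.
baseline_perf, baseline_power = 1.0, 1.0
beema_perf = baseline_perf * 1.10    # +10% performance
beema_power = baseline_power * 0.60  # -40% power

perf_per_watt_gain = (beema_perf / beema_power) / (baseline_perf / baseline_power)
print(f"Beema perf/watt vs Kabini: {perf_per_watt_gain:.2f}x")  # 1.83x
```

So “10% faster at 40% less power” translates to about an 83% perf/watt improvement, not the 2x claimed for Mullins, which matches AMD presenting the two claims separately.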
AMD 2014 Goal #3 – Intelligent power management
The million dollar question is of course how AMD manages all this. Well, to make things really simple, it’s all part of the learning experience. Much of what AMD presents as optimizations comes across as fairly logical and straightforward. That is partially because we’ve already seen some of these approaches from competitors. There are four parts to AMD’s approach:
- Architectural power improvements
- Skin Temperature Aware Power Management (“STAPM”)
- Intelligent Boost Control
- Energy Aware / Battery Boost
Architectural power improvements
This is the easiest to cover as I will need no more than one sentence to explain. Wait no, two sentences. Damn, three. Okay, n+1 sentences.
In a single slide AMD shows how system power efficiency can be improved by architectural design optimizations such as dynamic power management (e.g. Turbo Boost), integration of system components (e.g. the north bridge), power circuit optimizations and of course silicon process optimizations. Over the past six years, AMD has been able to bring down its idle silicon power consumption from 4W to close to 0.5W in ULP notebooks.
Skin Temperature Aware Power Management (“STAPM”)
One of the things on the list of obvious items is the STAPM approach. In short, AMD realized that the critical operating temperature of a device is much more limited by the perception of the user than by the silicon temperature limit. In proper industry terms that reads: “Tablet power is limited by steady-state maximum skin temperature, i.e. TSP – defined as the power that, if consumed indefinitely, will cause skin temperature to hit the user sensitivity limit.” Essentially, it boils down to this: as a user, you don’t care about the temperature of anything inside your device as long as you don’t “feel” the temperature is too high.
Depending on the design of the device chassis, the increasing temperature of an SoC under load will be perceived faster or slower. Typically, AMD estimates the time between the start of an application and reaching Tskin,max (“maximum skin temperature”) to be about 25 minutes. That timeframe is called the Potential Boost Opportunity. Again simply put, within this timeframe there is enough thermal headroom to boost the performance (and thus temperature) of the SoC without affecting the user.
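The Potential Boost Opportunity can be sketched with a toy first-order thermal model: skin temperature climbs exponentially toward a steady state, and the boost window is however long it takes to reach Tskin,max. All constants below are invented for illustration; real STAPM uses calibrated per-chassis parameters.

```python
import math

# Toy first-order skin-temperature model. All constants are invented;
# they are NOT AMD's numbers.
T_AMBIENT = 25.0   # starting skin temperature, degrees C
T_SKIN_MAX = 45.0  # user-perceived comfort limit, degrees C
T_STEADY = 50.0    # steady-state skin temp at boosted power (exceeds the limit)
TAU_MIN = 12.0     # chassis thermal time constant, minutes

def skin_temp(t_min):
    """Skin temperature after t_min minutes of sustained boosted load."""
    return T_STEADY - (T_STEADY - T_AMBIENT) * math.exp(-t_min / TAU_MIN)

# Potential Boost Opportunity: minutes until skin temp reaches the limit.
boost_window = -TAU_MIN * math.log((T_STEADY - T_SKIN_MAX) / (T_STEADY - T_AMBIENT))
print(f"boost window: {boost_window:.1f} minutes")  # ~19.3 with these made-up constants
```

With a slower chassis (larger time constant) or a steady-state temperature closer to the limit, the window stretches toward the ~25 minutes AMD quotes.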
The rather unfortunately abbreviated STAPM (it brings the STAPH-meme humor to mind) exploits that timeframe to boost the operating frequency aggressively. The argumentation is two-fold. One, most use-cases for mobile devices are of relatively short duration, which means the device can be kept in a high-performance state for most of the application time. Two, because of race-to-idle behavior the device will switch to low-power mode sooner, as the task is finished faster. This second point is the same argument Intel made when demoing the QuickSync technology for video encoding. To make it simple to understand: because you can run higher frequencies, the task will complete faster and thus the device can go to low-power mode faster.
This ties in quite well with the Energy Aware Boost technology, which allows faster shutting down of chip and platform components. Both STAPM and Energy Aware Boost technology allow for higher performance and longer battery life within the same TDP.
Intelligent Boost Control
The last technology AMD mentions is the so-called Intelligent Boost Control. Again, a fancy name for something that is in essence quite simple. IBC is designed to avoid wasting power on applications that don’t really benefit from it. As you know, the performance boost of STAPM is achieved mainly by increasing the operating frequency at the right time. But not all applications benefit equally from an increase in frequency. IBC identifies the applications which will benefit from a frequency bump and provides juice accordingly. For those which don’t benefit, no additional power is wasted.
I asked AMD how they identify the applications and this is their answer: “The primary indicator of frequency sensitivity that is used is related to the number of instructions retired per clock cycle for each core independently. This inversely correlates to the level of memory and IO boundedness of the workload and therefore to how responsive it will be to increases in CPU frequency.” In other words, the more instructions a core retires per clock cycle, the more likely a boost will occur.
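AMD’s answer can be paraphrased as a simple per-core decision rule. The sketch below is hypothetical: the threshold value and the boost/hold mechanism are invented for illustration, as AMD does not publish the actual heuristic.

```python
# Hypothetical sketch of IBC-style boost selection: cores retiring many
# instructions per cycle (compute-bound) get a frequency bump; memory- or
# IO-bound cores (low IPC) are left alone. Threshold is invented.
IPC_THRESHOLD = 1.0

def pick_boost(cores):
    """cores: dict of core id -> measured instructions retired per cycle."""
    return {cid: ("boost" if ipc >= IPC_THRESHOLD else "hold")
            for cid, ipc in cores.items()}

sample = {0: 1.6, 1: 0.4, 2: 1.1, 3: 0.7}  # per-core IPC samples
print(pick_boost(sample))  # {0: 'boost', 1: 'hold', 2: 'boost', 3: 'hold'}
```

The inverse correlation AMD mentions is visible here: cores 1 and 3 stall on memory or IO, so raising their frequency would burn power without finishing work faster.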
… But nothing for the performance tuning enthusiasts
Sadly enough, none of these technologies are currently available to performance tuning enthusiasts. As I mentioned in the opening lines, my main interest is finding out the performance limitations of all types of hardware. This includes x86 high-end desktop, but also the ARM technologies of the Raspberry Pi and Odroid. I would have loved to give the Puma+ cores a spin to see how much performance could be squeezed out of the design if we remove the power and thermal restrictions. I asked, but AMD replied “There currently are no plans to expose the parameters for our power management at the application level for these products,” which is too bad. As the crypto-currency story has shown us, there are plenty of use-case scenarios companies cannot foresee. I’d say open up those controls and enable enthusiasts to explore the limits of the architecture, both at the upper end of the performance spectrum and the lower end of the power consumption spectrum.
All in all I am fairly positive about AMD’s evolution. There are still tons of things to improve, but all in all the progression seems okay. Let’s just hope they give a little more to the enthusiasts and then we can all be happy. In 2015 there will be a couple of improvements I’m looking forward to seeing in action. This includes integrated voltage regulation similar to Haswell’s, but also per-part adaptive voltage scaling.