ARM announces new v9 CPU designs: the Cortex-X2, Cortex-A710, and Cortex-A510, along with new Mali GPUs

ARMv8, the first 64-bit version of the ARM architecture, has been around for a long time, and now it’s time to move on to ARMv9. ARM is not jumping to 128-bit, but ARMv9 does mark the beginning of the end for 32-bit designs. ARM plans to make its Cortex-A cores 64-bit only by 2023, and it is working with its partners to ensure that the software ecosystem is ready to wave goodbye to 32-bit apps (a process that ARM hopes will be completed by the end of this year).

Arm’s CPU designs are used in the vast majority of Android smartphones we see today, with everyone from Google and OnePlus to Samsung and Huawei using the company’s CPUs in some form. These companies license Arm’s CPU cores and use them together with a GPU, NPU, ISP, DSP, etc., to make a system-on-a-chip (SoC). For example, the Snapdragon 888 uses a Cortex-X1, three Cortex-A78 cores, and four Cortex-A55 cores.

Bits aside, ARM unveiled new CPU and GPU designs that will be featured in future chipsets as well as new support hardware to tie them all together in varying configurations that will be used in everything from laptops, through phones to smart TVs and other multimedia appliances. There are some exciting high-end designs, but it’s the entry-level stuff that may prove to be a game-changer.

Cortex-X2: The performance core gets more performance and even uses less juice

The Cortex-X1 was the first CPU core from Arm’s Cortex-X Custom (CXC) program. This focuses on performance over efficiency, even more so than Arm’s traditional big cores. The Cortex-X1 has found its way into the Exynos 2100 and Snapdragon 888 chipsets, serving as the new prime core in these SoCs. Because it is tweaked for performance, there is normally only one X core on a mobile device. However, there is always the potential for multiple Cortex-X cores in an SoC designed for Chromebooks or other laptops.

Now, Arm has revealed the Cortex-X2. It is a 64-bit only (no 32-bit mode) Armv9-based CPU with the potential of a 16% performance improvement over the X1 (if built using the same manufacturing process and clock frequencies). A node shrink could add further performance or efficiency gains on top of that, although the 16% figure does not include them.

The company expects the processors using the Cortex-X2 to offer up to a 30% performance boost over 2021’s flagship phones (which use the X1) when other improvements like more cache are taken into account. Arm also says you can expect a 2x boost to machine learning performance over the X1.
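The 16% and 30% figures are easier to reconcile if gains are treated as multiplicative. The following is a minimal sketch under that assumption (Arm does not state how the numbers combine, and the function name is purely illustrative):

```python
def residual_gain(total_gain: float, known_gain: float) -> float:
    """Return the extra gain needed on top of known_gain to reach total_gain.

    Gains are fractions (0.16 means +16%) and are assumed to multiply.
    """
    return (1.0 + total_gain) / (1.0 + known_gain) - 1.0


# Figures from the text: +16% at the same process and clocks, up to +30% overall.
other = residual_gain(0.30, 0.16)
print(f"Gain left for cache, clocks, and process: {other:.1%}")
```

Under that assumption, roughly 12% of the uplift would have to come from factors other than the core design itself.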

To find the extra performance, the X2 designers have decoupled branch prediction from the fetch. This means the branch predictor can run ahead of the fetch and smooth out any gaps that may appear in the pipeline due to branching. The predictor itself has also been improved and now includes an alternative path predictor. This results in fewer branch misses, which in turn increases performance.

The graph below shows the reduction in branch mispredictions per 1,000 instructions (MPKI) of the X2 compared to the X1.
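MPKI itself is a simple ratio. As a small illustration, the sketch below computes it from two counters; the counter values are made up to show the arithmetic and are not Arm's measurements:

```python
def mpki(mispredictions: int, instructions: int) -> float:
    """Branch mispredictions per 1,000 executed instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return mispredictions * 1000 / instructions


# Hypothetical counter values, chosen only to illustrate the calculation.
old = mpki(mispredictions=5_000, instructions=1_000_000)
new = mpki(mispredictions=4_000, instructions=1_000_000)
print(f"MPKI {old:.1f} -> {new:.1f} ({(old - new) / old:.0%} fewer misses)")
```

A lower MPKI means the pipeline is flushed less often for the same amount of work.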

The X2 uses a 10-stage pipeline with an increased out-of-order window. Since it is an Armv9 CPU, it implements SVE2, this time at 128 bits. The X2 also improves instruction-level parallelism by increasing the load-store window/structure sizes.

The improved performance can also partially be attributed to increases in cache size. More specifically, while the L2 cache still tops out at 1MB, the L3 cache has been doubled from a maximum of 8MB in the Cortex-X1 and can now support up to 16MB.

Cortex-A710: The big core but not that big

Arm has also issued a successor to the Cortex-A78, and the company is going with an all-new name in the Cortex-A710.

The Cortex-A710 doesn’t have the same peak performance as the X2, but you still see a respectable 10% performance boost over a Cortex-A78 on the same manufacturing process. But a far bigger improvement is to be had when it comes to machine learning and battery life, as Arm touts a 2x performance gain and 30% efficiency gain, respectively.

Arm has increased the performance by improving the branch predictor accuracy at the front-end of the processor and doubling the capacity of key branch prediction structures, namely the Branch Target Buffer (BTB) and the Global History Buffer (GHB).

For improved efficiency, the A710 is a five-wide core (versus six-wide on the A78) and switches to a 10-stage pipeline (much like the Cortex-X2). In addition, there are changes in the data-prefetcher that yield improved coverage and accuracy.

Unlike the X2, the Cortex-A710 also supports AArch32 (i.e., 32-bit apps), a feature that will soon disappear. Arm has announced that by 2023 all its new CPU cores for mobile will be 64-bit only. As on the Cortex-X2, the SVE2 engine is 128 bits wide.
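Whether an app is affected by the loss of AArch32 largely depends on its native code. Native libraries on Android are ELF files, and the ELF header records the target architecture. The sketch below is one possible way to check a library by reading that header; the function names are illustrative, and only the field offsets and machine codes come from the ELF specification:

```python
import struct

ELF_MAGIC = b"\x7fELF"
EM_ARM, EM_AARCH64 = 40, 183  # e_machine values from the ELF specification


def classify_elf(header: bytes) -> str:
    """Classify an ELF file from its first 20 bytes."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return "not ELF"
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    if machine == EM_AARCH64:
        return "AArch64 (64-bit Arm)"
    if machine == EM_ARM:
        return "AArch32 (32-bit Arm)"
    return "other architecture"


def classify_file(path: str) -> str:
    """Read just enough of a file to classify it."""
    with open(path, "rb") as f:
        return classify_elf(f.read(20))
```

A library reported as AArch32 would not run on a 64-bit only core such as the Cortex-X2.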

Cortex-A510: At last, the new little core we’ve been waiting for

Arm hasn’t released a new little core in four years, which is an eternity in smartphone years. Thankfully, the wait is over as the company has launched the Armv9-based Cortex-A510 to pick up where the Cortex-A55 left off.

As you’d expect from a long-overdue upgrade, Arm says the Cortex-A510 brings a 35% performance improvement, a 20% efficiency gain, and a 3x boost to machine learning compared to a Cortex-A55 on the same process.

The company says a combination of a three-wide in-order design (compared to two-wide in the A55), along with branch prediction and data prefetching tech from the Cortex-X project, has contributed to the A510’s improved performance and efficiency. It also uses a three-wide decode, a three-wide issue, features three integer ALU pipelines, and dual load/store pipelines. The load/store pipelines can work as 2x load or 1x load plus 1x store.

The most interesting feature of the Cortex-A510 is its merged-core microarchitecture. Two Cortex-A510 cores can be grouped in a complex. When in a complex, the Cortex-A510 cores share some resources, most notably the L2 cache, the L2 Translation Lookaside Buffer (TLB), and the SIMD engine (meaning floating-point, NEON, and SVE2).

This is a similar idea to simultaneous multithreading (SMT), which you may know as Hyper-Threading, in that parts of the CPU core are shared. However, the Cortex-A510 merged-core microarchitecture is much less drastic. The main parts of the core are still independent, and everything except floating-point and SIMD operations remains on each core. When a core needs to do some vector math, it uses a NEON/SVE2 engine that is shared with the other core in the complex. Some clever fine-grained scheduling between the cores means there is minimal overhead even when both cores are using the vector unit. Under some floating-point heavy benchmarks, Arm is seeing only a 1% dip in math performance.

The advantages of the merged-core microarchitecture setup aren’t so much about performance or energy efficiency, but area. The more transistors in a processor, the more money it costs. This isn’t normally a problem at the high-end. However, price-sensitive phones need to save money wherever possible, including down to how many mm² the CPU core occupies.

Speaking of vector math, since the Cortex-A510 is an Armv9 processor, it implements SVE2. However, unlike the X2 and the A710, the A510 can be built using a 64-bit implementation of SVE2 or a 128-bit one. This gives chip makers the flexibility between area and performance.

Since the Cortex-A510 will also be used in flagship processors, it is possible to create one-core complexes, meaning there are no shared resources. So, to get the best performance from the A510, it needs to use one-core complexes and 128-bit SVE2. An area-conscious version would use two cores per complex and 64-bit SVE2.
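The two extremes described above can be written down as a small configuration model. This is only a sketch: the class and field names are hypothetical, and the only constraints encoded are the ones stated in the text (one or two cores per complex, 64-bit or 128-bit SVE2):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class A510Cluster:
    """Hypothetical model of a group of Cortex-A510 cores."""
    cores: int
    cores_per_complex: int  # 1 or 2, per the text
    sve2_bits: int          # 64 or 128, per the text

    def __post_init__(self) -> None:
        if self.cores_per_complex not in (1, 2):
            raise ValueError("a complex holds one or two cores")
        if self.sve2_bits not in (64, 128):
            raise ValueError("SVE2 width must be 64 or 128 bits")
        if self.cores % self.cores_per_complex:
            raise ValueError("cores must fill whole complexes")

    @property
    def complexes(self) -> int:
        # Each complex has one L2 cache, one L2 TLB, and one vector engine.
        return self.cores // self.cores_per_complex


performance = A510Cluster(cores=4, cores_per_complex=1, sve2_bits=128)
area_saving = A510Cluster(cores=4, cores_per_complex=2, sve2_bits=64)
for name, c in (("performance", performance), ("area-saving", area_saving)):
    print(f"{name}: {c.complexes} vector engines at {c.sve2_bits} bits")
```

For four cores, the area-saving layout halves the number of L2 caches and vector engines compared to the performance layout, which is where the silicon savings come from.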

There was lots of internal discussion at Arm about the architecture for the Cortex-A510: should it remain an in-order CPU like the Cortex-A53 and Cortex-A55, or should it move to an out-of-order design? In-order designs are very efficient, but the question was, can the desired performance be obtained? The answer is yes; the in-order design was the right way to go for maintaining power efficiency while boosting performance.

To highlight this, Arm makes a comparison to the 2016/2017 Cortex-A73. That CPU design was found in processors like the Qualcomm Snapdragon 835 and phones like the Google Pixel 2. The Cortex-A73 is an 11-stage, out-of-order processor based on Armv8. A smartphone processor that uses just the Cortex-A510 in 2022 will offer 90% of the performance of a Cortex-A73-based smartphone but consume 35% less power. That also means the Cortex-A510 is faster than the Cortex-A57 and the Cortex-A72! In other words, today’s power-efficiency cores (the little cores) are closing in on the performance levels of past big core CPU designs.
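Those two figures can be combined into a single performance-per-watt ratio. The sketch below assumes both numbers were measured on the same workload, which the source does not confirm; the function name is illustrative:

```python
def relative_perf_per_watt(relative_perf: float, relative_power: float) -> float:
    """Performance per watt relative to a baseline that is 1.0 on both axes."""
    return relative_perf / relative_power


# From the text: 90% of the A73's performance at 35% less power.
ratio = relative_perf_per_watt(relative_perf=0.90, relative_power=1.0 - 0.35)
print(f"A510 vs A73 performance per watt: {ratio:.2f}x")
```

On those assumptions, the little core does roughly 1.4 times as much work per unit of energy as the older big core.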

New configs

Arm has deliberately left the door open for maxed-out configurations of the Cortex-X2 if that is what its partners want to build. There is no technical reason stopping someone from building an octa-core Cortex-X2 processor with up to 16MB L3 cache and 32MB of system-level cache. It would be designed for laptops or even small desktop units. Will someone build such a processor? We can only hope! A potentially more realistic option would be a quad-core Cortex-X2 plus quad-core Cortex-A710 setup, again aimed at Chromebooks or laptops.

We will likely see a repeat of the common 1+3+4 format in the mobile space, but this time with one X2, three A710 cores, and four Cortex-A510 cores. Could this be the setup of Samsung’s mobile processor for the Galaxy S22? Such a processor would theoretically offer a 30% jump in single-core peak performance (thanks to the X2), a 30% increase in sustained efficiency (thanks to the Cortex-A710), and a 35% uplift in little core performance (thanks to the Cortex-A510).

We can expect to see the Cortex-A710 coupled with the Cortex-A510 in either a 4+4 or 2+6 setup for chipmakers who aren’t part of the Cortex-X Custom program. There is also the potential for an octa-core A510 processor or even a quad-core variant. Octa-core Cortex-A53 processors were quite popular, but we didn’t see the same enthusiasm for octa-core Cortex-A55 chips. The Cortex-A510 has the potential to rekindle enthusiasm for such processors, especially considering the area-saving benefits of the merged-core microarchitecture. However, since the Cortex-A510 is 64-bit only, its appeal might be limited in markets that don’t use Google’s services (i.e., haven’t transitioned to 64-bit only apps yet).

When will the new CPUs be out?

Designing modern CPU cores can take years. In fact, the first discussions about the Cortex-A510 took place as early as 2016, and the ideas around the merged-core microarchitecture were being touted even as far back as the design of the Cortex-A53. The public announcement of these new cores is one of the final steps. However, long before we heard about these designs, Arm’s key partners — including Qualcomm, Samsung, and MediaTek — will have already been working with Arm.

This means we can expect to see Armv9 processors announced, using some or all of these cores, towards the end of 2021. Actual phones using these processors might launch as early as the first quarter of 2022.

Who will be using the new chips?

Qualcomm is likely to be among the first to use the new ARMv9 cores, in its upcoming Snapdragon SoC. Whenever a new architecture is released, Qualcomm typically includes it in its latest flagship 800-series processors.

As for Samsung, reports suggest it is getting ready to launch its new Exynos SoC with an AMD GPU, expected to be named the Exynos 2200. Samsung could also give the new SoC a different name to make it stand out with the new ARMv9 architecture.

Last but not least, MediaTek could also make use of the new cores. MediaTek has been flexing its muscles with the Dimensity 1000, 1100, and 1200, which use top-of-the-line ARMv8 cores. MediaTek could even bump the numbering up to a Dimensity 2000 series this time around.

Other manufacturers like Unisoc have been using A78 and A55 combinations, but little is known about their plans. They could adopt the new cores after two years or more.

Not just CPU cores, but GPUs too: Mali-G710, G610, G510 and G310 GPUs

Arm also introduced its refreshed Mali GPU lineup, which the company says represents the broadest range of performance it’s ever released for its graphics cores.

Did you know that Mali is the #1 GPU in terms of shipments? Over 1 billion Mali GPUs were shipped in 2020. They power about half of the smartphones out there and around 80% of smart TVs. And today ARM is bringing out its widest range of GPU designs, meant to fit every niche of the market.

The Mali-G710 slots in as the flagship with a claimed 20% performance improvement and 35% improvement in machine learning over the previous-gen Mali-G78. Meanwhile, the G510 slots in for applications like TV and augmented reality with 22% higher efficiency and a doubling of machine learning performance, while the lowest-end Mali-G310 slots in for low-cost devices with a claimed 6X performance increase in texturing.
