With silicon clock scaling largely dead thanks to the laws of physics, computer scientists and chip designers have had to search for performance improvements in other areas, typically by improving architectural efficiency and reducing power consumption. Multi-core scaling may have paid big dividends with dual and quad-core chips, but the gains largely plateaued thereafter. There are a number of reasons why adding new processor cores yields diminishing marginal returns. One of the critical causes is the steep overhead associated with maintaining cache coherency between cores that share data sets. Now, a team of researchers working with Intel thinks it may have found a solution. If their work proves useful, it could offer a significant performance boost in certain applications.
Before we discuss the solution, we need to spend a bit of time talking about the problem. Imagine two separate CPU cores, each of which is working on part of a common computation. Each CPU will have its own L2 cache, where data related to the problem is stored. In a coherent cache, CPU 0 completes part of its calculations, writes a new value to a block of memory, and then communicates that it has done so. CPU 1 now knows that its own data is out of sync with CPU 0 and can update its own L2 cache accordingly. There are several methods of implementing coherence, but at the simplest level, it’s a method for ensuring that all of the CPUs are “on the same page,” as it were. Cache coherence is essential to multi-core scaling, but it also represents a substantial bottleneck as core counts increase. The more CPUs in a system, the more CPU time must be spent enforcing whatever coherence strategy has been chosen, and the less bandwidth is available for actually solving the compute problem in question.
Cache coherency — image from Wikipedia
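To make the scaling problem concrete, here is a minimal sketch of the write-then-notify exchange described above. It is a toy write-invalidate model, not any real processor's protocol: the class `ToyCoherentSystem`, the helper `invalidations_for`, and the single shared block are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoherentSystem:
    """Toy model: one shared memory block and one private cache per core."""
    num_cores: int
    memory: int = 0
    messages: int = 0  # invalidation messages sent between cores
    caches: dict = field(default_factory=dict)  # core id -> valid cached value

    def read(self, core: int) -> int:
        # A miss fetches the current value; a hit needs no coherence traffic.
        if core not in self.caches:
            self.caches[core] = self.memory
        return self.caches[core]

    def write(self, core: int, value: int) -> None:
        # The writer tells every other core holding a copy to discard it.
        for other in list(self.caches):
            if other != core:
                del self.caches[other]
                self.messages += 1
        self.caches[core] = value
        self.memory = value  # write-through, to keep the toy simple


def invalidations_for(num_cores: int, rounds: int = 10) -> int:
    """Each round, every core reads the shared block, then one core updates it."""
    system = ToyCoherentSystem(num_cores)
    for r in range(rounds):
        for core in range(num_cores):
            system.read(core)
        system.write(r % num_cores, r + 1)
    return system.messages


if __name__ == "__main__":
    for n in (2, 4, 16):
        print(n, "cores:", invalidations_for(n), "invalidations")
```

In this toy, every write costs one message per other core holding a copy, so the traffic for the same amount of useful work grows with the core count. Real designs use snooping or directory schemes that are far more sophisticated, but the underlying pressure is the same.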
Researchers at North Carolina State University and Intel have jointly proposed a combined software-hardware solution they call a core-to-core Communication Acceleration Framework (CAF). The CAF would include a queue management device (QMD) implemented in hardware. The researchers describe its benefits as follows:
QMD achieves several significant benefits. First, it makes queue operations fast. Instead of executing hundreds of instructions at the core to manage a software queue, a core can execute an enqueue or dequeue instruction, with QMD handling the rest. Consequently, QMD frees up the core to work on more useful jobs. Second, QMD can handle multiple producers and consumers without requiring locks or synchronizations. Third, QMD removes most coherence-related communication incurred in software queue implementations, both in the control plane and in the data plane. The last two benefits increase the scalability ceiling vs. software queues. Furthermore, the scalability ceiling of QMD can be further lifted by making QMD distributed. Our results show up to 2–12× throughput improvement compared to a fully optimized software queue structure.
The proposed hardware queue manager
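For a sense of what the QMD is meant to replace, here is a hedged sketch of the kind of lock-based software queue the researchers compare against. It is not their implementation: `LockedQueue`, `run_demo`, and the `STOP` sentinel are hypothetical names, and Python threads stand in for cores only to show where the locking happens, not to measure performance.

```python
import threading
from collections import deque

STOP = object()  # sentinel telling a consumer to exit (illustrative only)


class LockedQueue:
    """Software multi-producer/multi-consumer queue guarded by a single lock."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()
        self.lock_acquisitions = 0  # counted while holding the lock

    def enqueue(self, item):
        with self._lock:  # every producer serializes here
            self.lock_acquisitions += 1
            self._items.append(item)

    def dequeue(self):
        with self._lock:  # every consumer serializes here too
            self.lock_acquisitions += 1
            return self._items.popleft() if self._items else None


def run_demo(producers=4, consumers=4, per_producer=500):
    q = LockedQueue()
    results = [[] for _ in range(consumers)]

    def produce(pid):
        for i in range(per_producer):
            q.enqueue((pid, i))

    def consume(cid):
        while True:
            item = q.dequeue()
            if item is None:
                continue  # queue empty: poll again, paying for the lock each time
            if item is STOP:
                return
            results[cid].append(item)

    workers = [threading.Thread(target=consume, args=(c,)) for c in range(consumers)]
    feeders = [threading.Thread(target=produce, args=(p,)) for p in range(producers)]
    for t in workers + feeders:
        t.start()
    for t in feeders:
        t.join()
    for _ in range(consumers):
        q.enqueue(STOP)  # one sentinel per consumer, queued after all real items
    for t in workers:
        t.join()
    received = [item for per_consumer in results for item in per_consumer]
    return received, q.lock_acquisitions


if __name__ == "__main__":
    items, locks = run_demo()
    print(len(items), "items moved using", locks, "lock acquisitions")
```

Every enqueue and dequeue here passes through the same lock, and on real hardware that lock and the queue's head and tail bounce between cores' caches. The proposal quoted above is to collapse each of those operations into a single instruction handled by dedicated hardware.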
The QMD proved capable of delivering up to a 20-fold performance improvement in test simulations, and Intel is said to be keenly interested in the results. It’s important to note that results like these won’t solve all the problems of multi-core scaling, even if they prove valuable. The same forces pushing Intel and other companies toward cloud computing would keep pushing in that direction, especially since multi-core communication doesn’t really bottleneck modern CPUs running desktop applications. Finding solutions to many of these problems is difficult; finding solutions that justify being built into every processor is even harder.
“We have to improve performance by improving energy efficiency,” Yan Solihin, lead author on the study and a professor of electrical and computer engineering, told IEEE Spectrum. “The only way to do that is to move some software to hardware. The challenge is to figure out which software is used frequently enough that we could justify implementing it in hardware. There is a sweet spot.”