EE Times-Asia > Memory/Storage

Overcoming embedded memory bottleneck (Part 1)

Posted: 17 Aug 2012

Keywords: on-chip memory, latency, MOPS, algorithmic memory

In the era of broadband Internet, 4G smartphones, and untethered tablet computing, there is an insatiable demand for ever-increasing computing performance. Over the years, processing performance has progressed rapidly, initially via increasing clock speeds and later courtesy of architectural innovations such as instruction-level parallelism, pipelining, and the issuing of multiple instructions per cycle. Memory performance, on the other hand, has not kept pace, thus creating the traditional processor-memory gap.

Despite attempts to temper that gap with huge increases in on-chip memory capacity and the advent of multicore architectures (once again increasing the effective processing performance), system on chip (SoC) architects and designers continue to struggle to meet the performance requirements of today's data-hungry applications. Memory technology is long overdue for an innovation that can increase performance by an order of magnitude. One promising technology, algorithmic memory, combines existing embedded memories with the capabilities of algorithms to increase embedded memory performance by a factor of 10. While not a panacea, it offers a new and innovative approach to alleviating the disparity between processor and memory performance in SoCs.

Traditionally, the processor-memory performance gap referred to the difference between the performance of processors and that of external memories, which took hundreds of cycles or more to access. The obvious solution to closing this gap was to alleviate off-chip memory delay by integrating the processors with the memory and other components on the same chip, thus leading to the advent of the SoC approach. SoCs have emerged as the architecture of choice for delivering higher and higher levels of computing performance. Have SoCs really solved the processor-memory performance gap, though, or have they just pushed it to a lower level and recreated it within the microcosm of the chip?

Figure 1: Over the years, processing performance (red line) has rapidly progressed. Memory performance, on the other hand, has not kept pace (green and blue lines), thus creating a processor-memory gap.

SoCs are typically designed with their processors primarily accessing the embedded memory, and accessing external memory only when absolutely required. SoC architects embed cache memory for frequently requested data, for example, or implement dedicated on-chip memories where possible. Memory used for these purposes can be accessed within a few clock cycles, and is typically placed immediately next to the processing cores to minimize latency. However, while latency remains a major concern, these memories must also respond to back-to-back sustained access requests issued by the processor(s), and in many applications the rate of these requests has been increasing dramatically. Once more, systems architects are up against a processor-memory gap, this time with embedded memory (figure 1).

Measuring memory
Before tackling the problem of how to increase memory performance, we need a way to measure memory performance that accurately reflects real-life requirements. Note that, colloquially, memory bandwidth has often been used to describe memory performance. Memory bandwidth is the rate at which data can be read from or stored into a memory. It is a measure of the rate of data transfer to or from memory, and can easily be increased by expanding the data bus width of the embedded memory. An increase in the data bus width does not allow more unique accesses to memory, however.

Consider a processor, or a set of multiprocessor cores, that makes an aggregate of 500 million unique accesses to memory in a second. Suppose that there is a single-port memory, supporting one memory access per clock cycle, that runs at a frequency of 250MHz. This memory supports exactly 250 million unique accesses per second. Doubling the memory bandwidth of this memory by widening the data bus would only help in giving more data for each of the 250 million unique accesses; it would not support the processor's 500 million unique requests. A more inclusive measure of memory performance, then, would be the memory operations per second (MOPS) metric.

MOPS refers to the rate at which unique accesses can be performed to a memory system. The relation between the bandwidth and MOPS is:

Memory Bandwidth = MOPS × Data Bus Width

In other words, doubling the MOPS of a memory while keeping everything else the same doubles the total memory bandwidth. The use of MOPS for measuring memory performance mirrors the trend of using input/output operations per second (IOPS) for measuring the performance of computer storage devices.
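The distinction between bandwidth and MOPS can be checked with a few lines of arithmetic. The sketch below uses the article's own numbers (a 250MHz single-port memory versus a processor demanding 500 million unique accesses per second); the function and variable names are hypothetical, chosen only for illustration.

```python
def bandwidth_bits_per_sec(mops, databus_width_bits):
    """Memory Bandwidth = MOPS x data bus width."""
    return mops * databus_width_bits

clock_hz = 250e6          # single-port memory, one access per cycle
mops = clock_hz           # so: 250 million unique accesses per second
required_mops = 500e6     # the processor's demand

# Doubling the bus from 64 to 128 bits doubles bandwidth...
bw_64 = bandwidth_bits_per_sec(mops, 64)     # 16 Gbit/s
bw_128 = bandwidth_bits_per_sec(mops, 128)   # 32 Gbit/s

# ...but the unique-access rate is unchanged and still falls short.
print(bw_64, bw_128, mops < required_mops)
```

The print confirms that widening the bus raises bandwidth while leaving the MOPS shortfall untouched, which is exactly why MOPS is the more inclusive metric.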

Current solutions to the MOPS problem
SoC architects and designers are well aware of the MOPS bottleneck of embedded memory. Unfortunately, today's embedded memories (built using circuit techniques alone) that offer more MOPS require a large amount of die area and can be extremely impractical. Achieving a 4X MOPS increase for a memory built using circuit techniques alone, for example, typically takes 400% to 800% more physical memory area than a corresponding memory providing 1X MOPS. As a result, architects and designers must use a variety of other techniques to achieve the necessary performance.

A common approach is to break up memory into multiple banks. Each memory bank can be accessed independently, and if two accesses in the same clock cycle go to different banks, then both can be serviced in parallel, effectively doubling the MOPS supported by the memory as a whole. What happens when multiple accesses go to the same bank, however? We refer to this as a bank conflict, and when it occurs, the memory stalls. Subsequent memory accesses must be queued in FIFOs, which increases memory latency and, because accesses are no longer guaranteed to be read or written in a fixed time, complicates coherency management of the memory. The combination leads to processor stalls that propagate as backpressure to earlier stages of the system pipeline. As a result, system performance can no longer be guaranteed.
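How often bank conflicts erode the theoretical 2X gain can be estimated with a toy simulation. This is a hedged sketch, not a model of any particular SoC: it assumes two single-port banks, a simple address-modulo bank mapping, and a uniformly random access stream, all of which are illustrative assumptions rather than details from the article.

```python
import random

NUM_BANKS = 2  # two single-port banks, two access requests per cycle

def count_stalls(access_pairs):
    """Count cycles in which both accesses map to the same bank
    (a bank conflict), forcing one access to stall."""
    stalls = 0
    for a, b in access_pairs:
        if a % NUM_BANKS == b % NUM_BANKS:
            stalls += 1
    return stalls

random.seed(0)
pairs = [(random.randrange(1024), random.randrange(1024))
         for _ in range(10000)]
stalls = count_stalls(pairs)
conflict_rate = stalls / len(pairs)
print(conflict_rate)
```

With uniformly random addresses and two banks, roughly half of all cycles see a conflict, so the effective MOPS gain of banking is well below 2X unless the access pattern can be controlled, which is precisely the non-determinism problem the article describes.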

Multi-banked solutions are relatively inexpensive to implement in terms of memory area and power. However, the technique increases design complexity, because additional logic is required to manage non-deterministic memory output. The added design verification effort also significantly lengthens SoC development time. And in the end, system performance still suffers whenever bank conflicts occur. An ideal memory solution would guarantee the required MOPS 100% of the time, avoiding non-deterministic output altogether.

Rethinking memory performance
It is time to take a fresh perspective on how to increase memory performance. Today, a single-port embedded memory can perform one memory operation per clock cycle. Embedded memory performance has traditionally been closely tied to memory clock speed, and is therefore ultimately limited by it. The question to consider is whether it is possible to increase memory performance without increasing memory clock speeds.
