Cadence's embedded processors go mano-a-mano
Keywords: Cadence Design Systems; embedded processor; SoC; extensible architecture
Cadence Design Systems' Tensilica and Synopsys' ARC are facing off in the market for embedded processors in SoC designs. According to Cadence, a single SoC can have as many as 30 controller cores alongside the main CPU, handling data movement and signal processing at higher clock rates and with higher memory bandwidth.
"There's been a very substantial shift in the market to people who want a lot more programmability in their data operations," said Chris Rowen, founder of Tensilica and now Fellow at Cadence Design Systems following the acquisition in May 2013. "Down on the factory floor where the real work gets done, there's an increasing shift to a smarter data plane, to more programmable engines that are adapting more often under software control, so you can choose your algorithms after the design of the chip. The data rates and energy budgets keep that out of the reach of the execution CPUs."
"Process technologies are so dense that the small premium to make a block programmable is negligible, but it means people want to design it once, tape it out, and not have to go back to it if they change the algorithm," he said. "These kinds of processors come much closer to reconciling the gap."
The latest Tensilica Xtensa LX5 is the tenth generation of the extensible architecture, but the first new core since the acquisition. "The acquisition was an important step forward in technology for extensible processors," said Rowen, "and a big validation of everything that we are working for as it reinforces that this is one of the key technologies."
The move has gone well, he said. "The whole team came across and Tensilica maintains its identity under the IP group of Cadence run by Martin Lund," he said. "We do a major release every 18 to 24 months, so [the definition of] this processor goes back a couple of years and is pushing on data plane processing and efficiency. It brings leadership in the IP space, especially in the need for better data plane processors, and Tensilica is engaging with customers early in the design cycle and in product definition as a result."
"We have made improvements in data cache performance, particularly to reduce the latency of cache misses and improve the cache pre-fetch," he said. "We have also done some innovation in memory banking to provide much higher bandwidth. In order to sustain the bandwidth you often need multiple banks, for example 512bits wide, and you need two ports for these wide memories plus a DMA channel, so you may have three 512bit operations per cycle. We have also introduced more independent arbitration for the banks, coalescing the reads: if an operation requires several locations in the same bank, it does a single read and returns the bits from a single memory width, which makes the effective memory bandwidth higher," he said.
One new element is a semantic engine. "The vector processor operates on certain elements of a data word," said Rowen. "In the past you needed to read the whole word, and even if you updated one bit you had to re-write the whole word, so we have added this feature to enable and disable individual bit writes. That's part of the semantic engine. We have an operation that computes which bits you write and don't write, so you can combine two operations for the same latency and the same power."
Synopsys has also enhanced its 32bit ARC extensible embedded processor architecture to target high memory bandwidth applications such as storage and digital TV, and has added support for ARM infrastructure.
The ARC HS34 and HS36 processors are the highest performance ARC processor cores to date, delivering 1.9DMIPS/MHz at speeds up to 2.2GHz. In a typical 28nm process the cores consume as little as 0.025mW/MHz in an area as small as 0.15mm².
The family uses the ARCv2 instruction-set architecture (ISA) coupled with a new ten-stage pipeline that supports out-of-order instruction retirement, minimizing idle processor cycles and maximizing instruction throughput. Sophisticated branch prediction and a late-stage ALU improve the efficiency of instruction processing and allow a deterministic response for real-time performance, said Mike Thompson, senior product marketing manager for ARC Processors and Subsystems. SoC peripherals can be directly accessed by the CPU in a single cycle using native ARM AMBA AXI and AHB standard interfaces that are configurable for 32bit or 64bit transactions to optimize system throughput.
To speed the execution of math functions, the HS cores give designers the option to implement a hardware integer divider, instructions for 64bit multiply, multiply-accumulate (MAC), vector addition and vector subtraction, and a configurable IEEE 754-compliant floating point unit (single or double precision or both).
The ARCv2-based cores provide an 18 percent improvement in code density compared to previous-generation ARC cores, reducing memory requirements. They support closely coupled memory as well as instruction and data caches (HS36 only), with new 64bit load-double/store-double instructions and unaligned memory access capabilities that accelerate data transfers.
- Nick Flaherty
EE Times Europe