EE Times-Asia

Self-healing chips prevent system failure

Posted: 02 Oct 2006

Keywords: Semiconductor Research Corp., National Science Foundation, DFM, self-healing chips, Richard Goering

You wouldn't want a chip in a car, airplane or medical device to suddenly fail, but with reliability challenges that worsen at 45nm and below, that's a real possibility. This is why Semiconductor Research Corp. and the National Science Foundation are funding a groundbreaking research effort into "self-healing" chips that can detect and repair defects in the field.

The three-year research project will fund work by two principal investigators from the University of Michigan: Todd Austin, associate professor of electrical engineering, and Valeria Bertacco, assistant professor of electrical engineering. The two have already published research about defect-tolerant architectures that involve minimal area and performance trade-offs, in contrast to the large sacrifices required by today's modular redundancy approaches. During the research project, the investigators will work to boost defect coverage and to extend the new architectures to a wide variety of chips.

While techniques such as design-for-manufacturing (DFM) and restricted design rules will help maintain nanometer yields, there are still some chips that will fail days, months or years after they've been deployed. Culprits include electromigration, hot-carrier degradation, undetected manufacturing defects, unpredicted process variations, and thin and vulnerable gate oxides.

"With much larger chips and much smaller geometries, we're going to have chips in which not all the transistors are going to work," said Bill Joyner, director of CAD and test at the Semiconductor Research Corp. "We're looking for research that will give us chips and systems that are going to work, in spite of the fact that components are going to fail."

Self-healing chips, said Austin, may extend Moore's Law for another process generation or two. "There are so few atoms forming the transistors that any amount of variation can cause them to be too weak or too slow," he said. "By building self-healing into the system, you can tolerate these types of things, and give yourself the opportunity to extend the life span of CMOS silicon a little further than it would otherwise."

Self-healing chips are several years away, but they represent "a big issue for the future of technology," said Mary Olsson, research VP for design and engineering at Gartner Dataquest. Like restricted design rules, she noted, self-healing chips may potentially reduce the need for certain types of DFM or IC layout tools. If self-healing chips really take off, perhaps there will be less need for restricted design rules as well, she said.

New approach
Fault-tolerant architectures are nothing new, but thus far, they've been restricted to high-end computing systems, said the University of Michigan's Bertacco. The main approach, she said, is triple modular redundancy (TMR), where there are three copies of the system. "This is very expensive technology because it will require a 200 percent overhead in area," she said. "In contrast, the solution we're trying to propose is much lower-cost and can thus be applied to a much broader range of systems."
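
As a rough illustration of where the 200 percent figure comes from, the Python sketch below models TMR as a bitwise majority vote over three copies of the same output. The function names are hypothetical and chosen for illustration; the overhead calculation ignores the area of the voter itself.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant outputs: each result bit is
    whatever value at least two of the three copies agree on."""
    return (a & b) | (a & c) | (b & c)


def area_overhead_percent(copies: int) -> float:
    """Extra area relative to a single copy (voter logic not counted)."""
    return (copies - 1) * 100.0


if __name__ == "__main__":
    good = 0b10110010
    faulty = good ^ 0b00001000          # one copy has a single flipped bit
    print(tmr_vote(good, faulty, good) == good)
    print(area_overhead_percent(3))
```

The vote masks any fault confined to one copy, but the price is two extra copies of the logic, which is the cost the Michigan researchers say they want to avoid.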

The initial University of Michigan work is with microprocessors, but investigators plan to extend the research to a broad range of chips, Bertacco said. The three-year project, funded to the tune of $100,000 per year, will also involve the creation of high-level defect models, she said. System designers and architects can use such models to evaluate a system's need for resiliency.

Self-healing chips will detect and repair defects in the field.

Part of the work, noted Austin, is the development of a "simulation infrastructure" that can model potential silicon failures. He said the investigators have taken tools from Cadence Design Systems and Synopsys, and added a capability to "inject" faults into a system model. The model can then be used to evaluate the integrity of a design.
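
The idea of injecting faults into a model can be sketched in a few lines of Python. The example below is only a toy: it is not based on the Cadence or Synopsys tools, and the one-bit full adder, the net names and the `coverage` helper are all assumptions made for illustration. Each internal net can be forced to a stuck-at value, and a set of test vectors is scored by the fraction of injected faults it exposes.

```python
from itertools import product
from typing import Iterable, Optional, Tuple

Fault = Tuple[str, int]  # (net name, stuck-at value)


def full_adder(a: int, b: int, cin: int,
               fault: Optional[Fault] = None) -> Tuple[int, int]:
    """One-bit full adder. If `fault` is given, that net is forced to the
    stuck-at value regardless of its inputs."""
    def net(name: str, value: int) -> int:
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    x1 = net("x1", a ^ b)
    a1 = net("a1", a & b)
    a2 = net("a2", x1 & cin)
    s = net("sum", x1 ^ cin)
    cout = net("cout", a1 | a2)
    return s, cout


NETS = ["x1", "a1", "a2", "sum", "cout"]
FAULTS = [(name, value) for name in NETS for value in (0, 1)]


def coverage(vectors: Iterable[Tuple[int, int, int]]) -> float:
    """Fraction of injected stuck-at faults that change the output for at
    least one of the given input vectors."""
    vectors = list(vectors)
    detected = sum(
        any(full_adder(*v, fault=f) != full_adder(*v) for v in vectors)
        for f in FAULTS
    )
    return detected / len(FAULTS)


if __name__ == "__main__":
    print(coverage([(0, 0, 0)]))                    # a single test vector
    print(coverage(product((0, 1), repeat=3)))      # exhaustive vectors
```

A single all-zero vector can only expose faults that force a net high, so it misses every stuck-at-0 fault; the exhaustive set catches all ten faults in this toy model.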

Austin and Bertacco are co-authors of two papers that describe some initial research into self-healing chips. The first of these, given at the International Symposium on High-Performance Computer Architecture in February, discusses a defect-tolerant chip multiprocessor (CMP) switch architecture.

That paper contributes a high-level modeling approach for silicon failures and describes a CMP switch router architecture that incorporates system-level checking and recovery, component-level fault diagnosis and spare-part reconfiguration. This "Bulletproof" switch design claims to be more robust and less costly than existing approaches, including TMR and error-correction codes.

The defect-tolerant switch design, aimed at multicore ICs, detects data-corrupting errors through cyclic redundancy checkers at the switch's output channels. Recovery logic is added to the input buffers. To detect errors that cause functional incorrectness, the design uses buffer checker units, extra routing-logic units and an extra switch arbiter. The area overhead is only 10 percent, according to the paper.
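
The detect-and-recover loop can be sketched as follows. This is a simplified Python illustration, not the paper's design: it uses the standard library's CRC-32 (the article does not say which CRC the switch uses), models the switch datapath as a hypothetical callable, and treats recovery as a plain resend from a copy held in the input buffer.

```python
import zlib
from typing import Callable, Tuple

CRC_BYTES = 4


def attach_crc(payload: bytes) -> bytes:
    """Append a CRC-32 tag to the payload before it enters the switch."""
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def check_crc(packet: bytes) -> bool:
    """Output-channel check: recompute the CRC and compare with the tag."""
    payload, tag = packet[:-CRC_BYTES], packet[-CRC_BYTES:]
    return zlib.crc32(payload).to_bytes(CRC_BYTES, "big") == tag


def send_through_switch(payload: bytes,
                        datapath: Callable[[bytes], bytes],
                        max_retries: int = 3) -> Tuple[bytes, int]:
    """Keep a copy in the 'input buffer' and resend until the output
    check passes. Returns the delivered payload and the attempt count."""
    buffered = attach_crc(payload)
    for attempt in range(1, max_retries + 1):
        out = datapath(buffered)
        if check_crc(out):
            return out[:-CRC_BYTES], attempt
    raise RuntimeError("data still corrupted after retries")
```

In the switch described by the paper, a persistent failure would instead lead to fault diagnosis and reconfiguration around the defective component; the retry loop here only stands in for the recovery step.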

A second paper, to be given at the Architectural Support for Programming Languages and Operating Systems conference in October, discusses more recent work. It outlines a solution specific to very long instruction word (VLIW) architectures that uses their natural redundancy to facilitate repair.

The paper introduces the Bulletproof pipeline, described as "the first ultralow-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects." This goal is achieved through online, built-in self-test (BIST) techniques combined with system-level checkpointing. For a four-wide VLIW processor with 32 Kbytes of instruction and data cache, the approach claims to achieve 89 percent silicon defect coverage with only a 5.8 percent cost in area, along with a 4 percent to 18 percent performance degradation after a defect is found.

The approach uses a microarchitectural checkpointing technique to create "epochs" of execution during which BIST is used to validate the integrity of the underlying hardware. If a defect is found, Austin noted, this approach makes it possible to "roll back" time to the last point where there were no defects. Recovery to a correct state is accomplished by flushing the pipeline and copying the architectural registers from a backup register file.
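
The epoch mechanism can be pictured with the small Python model below. It is a sketch under stated assumptions, not the Bulletproof implementation: the `EpochMachine` class and its fields are hypothetical, the BIST result is supplied by the caller, and only register state is covered (memory state is left out).

```python
from typing import Callable, List


class EpochMachine:
    """Toy model of epoch-based checkpointing with a backup register file."""

    def __init__(self, num_regs: int = 4) -> None:
        self.regs: List[int] = [0] * num_regs      # architectural registers
        self.backup: List[int] = list(self.regs)   # backup register file
        self.pipeline: List[str] = []              # stand-in for in-flight work

    def run_epoch(self,
                  work: Callable[[List[int]], None],
                  bist_passes: Callable[[], bool]) -> bool:
        """Run one epoch of work, then commit or roll back on the BIST result."""
        self.pipeline.append("epoch in flight")
        work(self.regs)                    # results are speculative until BIST runs
        if bist_passes():
            self.backup = list(self.regs)  # hardware validated: take a checkpoint
            self.pipeline.clear()
            return True
        self.pipeline.clear()              # defect found: flush the pipeline
        self.regs = list(self.backup)      # restore the last known-good state
        return False


if __name__ == "__main__":
    def add_five(regs: List[int]) -> None:
        regs[0] += 5

    def corrupt(regs: List[int]) -> None:
        regs[0] = 999                      # result produced on defective hardware

    m = EpochMachine()
    m.run_epoch(add_five, lambda: True)    # committed
    m.run_epoch(corrupt, lambda: False)    # rolled back
    print(m.regs[0])
```

Because nothing from an epoch is treated as final until BIST has passed, a result computed on defective hardware never becomes part of the checkpointed state.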

Perhaps the main challenge in this approach is increasing the defect coverage well beyond 89 percent. Austin said he'd like to see it rise to "two-, three- or four-nines of coverage," meaning 99 percent, 99.9 percent or 99.99 percent. And this must be done while holding area overhead to 5 percent to 10 percent, he noted.
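
The size of that jump is easier to see as the share of defects that escape coverage. The short Python sketch below works it out with exact fractions; the helper names are illustrative only.

```python
from fractions import Fraction


def nines(n: int) -> Fraction:
    """Coverage in percent for n nines: 2 gives 99, 3 gives 99.9, 4 gives 99.99."""
    return 100 - Fraction(100, 10 ** n)


def escape_rate(coverage_percent: Fraction) -> Fraction:
    """Fraction of defects that are NOT covered."""
    return 1 - Fraction(coverage_percent) / 100


if __name__ == "__main__":
    today = escape_rate(Fraction(89))          # 11 in 100 defects escape
    for n in (2, 3, 4):
        target = escape_rate(nines(n))
        print(n, float(nines(n)), today / target)
```

At 89 percent coverage, 11 of every 100 defects go uncovered; at four nines it is 1 in 10,000, an 1,100-fold reduction.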

There's also an educational angle. "A future challenge in resilient systems is understanding the effects of physical phenomena on abstract descriptions of your machine," Austin said. "That's really an open question. I hope that once we build some physical models, we can allow architects and designers to better understand how to address these problems."

- Richard Goering
EE Times
