Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery

Server Kasap, Eduardo Weber Wachter, Xiaojun Zhai, Shoaib Ehsan, Klaus McDonald-Maier

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)
111 Downloads (Pure)

Abstract

All-Programmable System-on-Chips (APSoCs) are a compelling option for deploying applications in radiation environments thanks to their high computing performance and power efficiency. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs therefore have to be adequately hardened against ionizing radiation to make them a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique uses checkpointing along with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy), as well as processor replication and checker circuits at the hardware level (i.e. hardware redundancy). Results of fault injection experiments show that the proposed approach achieves high levels of protection against soft errors, mitigating around 98% of bit-flips injected into the register files of both ARM cores while keeping the timing performance overhead as low as 25% if block and application sizes are adjusted appropriately. Furthermore, adding the roll-forward recovery operation to the roll-back operation improves the Mean Workload Between Failures (MWBF) of the system by up to ≈19%, depending on the nature of the running application: when a fault occurs, the application can proceed faster if it is treated with roll-forward rather than roll-back. Thus, relatively more data can be processed before the next error occurs in the system.
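
The difference between the two recovery modes named in the abstract can be illustrated with a minimal Python sketch. It is a toy model, not the paper's implementation: the one-word "register file", the fault model (a single bit-flip on one core), the cost unit (one block execution) and all function names are assumptions made for illustration. In particular, the sketch assumes the checker already knows which core is faulty when rolling forward; how the actual checker and MicroBlaze TMR subsystem establish this is not described in the abstract.

```python
from typing import Dict, List, Optional, Tuple

MASK = 0xFFFFFFFF  # keep the toy state within 32 bits


def run_block(state: int, block: List[int], flip_after: Optional[int] = None) -> int:
    """Fold one block of data into a toy one-word 'register file'.

    If flip_after is given, bit 0 of the state is flipped after that item,
    imitating a single injected bit-flip (hypothetical fault model).
    """
    for i, item in enumerate(block):
        state = (state * 31 + item) & MASK
        if i == flip_after:
            state ^= 1
    return state


def lockstep_run(
    blocks: List[List[int]],
    faults: Dict[int, Tuple[int, int]],
    roll_forward: bool,
) -> Tuple[int, int]:
    """Run every block on two redundant cores and compare after each block.

    faults maps a block index to (core_id, item_index) for a one-shot flip.
    Returns (final_state, block_time_units_spent).
    """
    checkpoint = 0  # last state on which both cores agreed
    time_units = 0
    for b, block in enumerate(blocks):
        core, pos = faults.get(b, (None, None))
        out = [run_block(checkpoint, block, pos if core == c else None) for c in (0, 1)]
        time_units += 1  # both cores execute the block in parallel
        if out[0] == out[1]:
            good = out[0]
        elif roll_forward:
            # ASSUMPTION: the checker knows which core is faulty, so the
            # fault-free state is copied over at negligible cost.
            good = out[1 - core]
        else:
            # Roll-back: restore the checkpoint and re-execute the block on
            # both cores; the transient fault does not recur.
            out = [run_block(checkpoint, block) for _ in (0, 1)]
            time_units += 1
            good = out[0]
        checkpoint = good  # new checkpoint after a verified block
    return checkpoint, time_units


if __name__ == "__main__":
    blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    faults = {1: (0, 1)}  # flip on core 0 during the second block
    print(lockstep_run(blocks, faults, roll_forward=False))
    print(lockstep_run(blocks, faults, roll_forward=True))
```

In this toy model both modes reach the fault-free result, but roll-back spends one extra block execution per detected fault while roll-forward spends none, which is the intuition behind processing more data between failures.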
Original language: English
Article number: 114297
Journal: Microelectronics Reliability
Volume: 124
Early online date: 5 Aug 2021
Publication status: Published - Sept 2021

Bibliographical note

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Funder

UK Engineering and Physical Sciences Research Council through grants EP/P017487/1, EP/R02572X/1 and EP/V000462/1

Keywords

  • ARM Cortex-A processor
  • Fault tolerance
  • Lockstep
  • MicroBlaze processor
  • Reliability
  • Soft error mitigation
  • Zynq APSoC

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Atomic and Molecular Physics, and Optics
  • Condensed Matter Physics
  • Safety, Risk, Reliability and Quality
  • Surfaces, Coatings and Films
  • Electrical and Electronic Engineering
