Monolithic 3D technology is emerging as a promising solution that can bring massive opportunities, but the gains can be hindered due to the reliability issues exaggerated by high temperature. Conventional reliability solutions focus on one specific feature and assume that the other required features would be provided by different solutions. Hence, this assumption has resulted in solutions that are proposed in isolation of each other and fail to consider the overall compatibility and the implied overheads of multiple isolated solutions for one system.
This article proposes a holistic reliability management engine, R2D3, for post-Moore’s M3D parallel systems that have low yield and high failure rate. The proposed engine, comprising a controller, reconfigurable crossbars, and detection circuitry, provides concurrent single-replay detection and diagnosis, fault-mitigating repair, and aging-aware lifetime management at runtime. This holistic view enables us to create a solution that is highly effective while achieving a low overhead. Our solution achieves 96% coverage of defect; reduces
V
th
degradation by 53%, leading to a 78% performance improvement on average over 8 years for an eight-core system; and ultimately yields a 2.16× longer mean-time-to-failure (MTTF) while incurring an overhead of 7.4% in area, 6.5% in power, and an 8.2% decrease in frequency.