transient errors
Recently Published Documents


TOTAL DOCUMENTS

71
(FIVE YEARS 7)

H-INDEX

13
(FIVE YEARS 1)

Author(s):  
Bahman Arasteh ◽  
Reza Solhi

Software plays remarkable roles in many critical applications. On the other hand, owing to shrinking transistor sizes and reduced supply voltages, radiation-induced transient errors (soft errors) have become an important source of computer-system failures. As the rate of transient hardware faults increases, researchers have investigated software techniques to control these faults. Performance overhead is the main drawback of software-implemented methods, such as recovery blocks, that rely on redundancy. Enhancing software reliability against soft errors by exploiting inherently error-masking (invulnerable) programming structures is the main goal of this study. During the programming phase, at the source-code level, programmers can select different storage classes such as automatic, global, static, and register for the data in their programs without paying attention to their inherent reliability. In this study, the inherent effects of these storage classes on program reliability are investigated. An extensive series of profiling and fault-injection experiments was performed on a set of benchmark programs implemented with different storage classes. The results show that programs implemented with the automatic storage class have inherently higher reliability than programs using the static or register storage classes, without performance overhead. This finding enables programmers to develop highly reliable programs without redundancy or performance overhead.


Author(s):  
Lianhua Yu ◽  
Ming Diao ◽  
Xiaobo Chen

It is necessary to study fault-tolerant techniques for nanotechnology, since nanometer-scale devices are very sensitive to system and environmental influences. In this paper, we present a novel fault-tolerant technique for nanocomputers, namely XOR multiplexing based on redundancy-modified NAND gates. The error distributions and fault tolerance of the proposed architecture are analyzed and compared with von Neumann's multiplexing. Experimental results show that, compared with the conventional multiplexing technique based on NAND gates, the new system has a much higher fault-tolerance capability. According to the evaluation, by using multiple redundant components, the proposed architecture can tolerate device error rates of up to 10⁻¹. In systems built from unreliable nanometer-scale devices, this architecture is potentially effective against the increasing rate of transient errors.
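The baseline that the paper compares against, von Neumann's NAND multiplexing, can be sketched with a small Monte Carlo simulation: each logical NAND is replaced by a bundle of N faulty gates whose inputs are paired by a random permutation, followed by restorative stages, and the logical value is read out by majority vote. The bundle width, gate error probability, and stage count below are assumed for illustration, not taken from the paper.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 100          /* bundle width: redundant copies per logical signal */
#define EPS 0.05       /* assumed per-gate output-flip probability          */
#define TRIALS 10000

/* One faulty NAND gate: the correct output is flipped with probability EPS. */
static int faulty_nand(int a, int b)
{
    int out = !(a && b);
    if ((double)rand() / RAND_MAX < EPS)
        out = !out;
    return out;
}

/* One multiplexed NAND stage: wires of the two input bundles are paired
 * by a random permutation and each pair drives its own faulty gate. */
static void nand_stage(const int *x, const int *y, int *z)
{
    int perm[N];
    for (int i = 0; i < N; i++) perm[i] = i;
    for (int i = N - 1; i > 0; i--) {              /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (int i = 0; i < N; i++)
        z[i] = faulty_nand(x[i], y[perm[i]]);
}

/* Fraction of trials in which the majority vote over the bundle is wrong,
 * after an executive stage and two restorative stages (a NAND of a bundle
 * with itself acts as an inverter, pushing wires back toward consensus). */
double bundle_error_rate(void)
{
    int wrong = 0;
    for (int t = 0; t < TRIALS; t++) {
        int x[N], y[N], z[N], r1[N], r2[N];
        for (int i = 0; i < N; i++) { x[i] = 1; y[i] = 1; }
        nand_stage(x, y, z);      /* executive: NAND(1,1) = 0, logically */
        nand_stage(z, z, r1);     /* restorative: invert back to 1       */
        nand_stage(r1, r1, r2);   /* restorative: invert back to 0       */
        int ones = 0;
        for (int i = 0; i < N; i++) ones += r2[i];
        if (ones * 2 > N) wrong++;   /* majority reads 1, expected 0     */
    }
    return (double)wrong / TRIALS;
}
```

With these parameters the majority-voted bundle fails far less often than a single gate does, which is the effect both the conventional scheme and the proposed XOR-multiplexing variant exploit; the paper's contribution is achieving usable reliability at much higher device error rates.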


2020 ◽  
Vol 245 ◽  
pp. 04009
Author(s):  
Georgios Bitzes ◽  
Fabio Luchetti ◽  
Andrea Manzi ◽  
Mihai Patrascoiu ◽  
Andreas Joachim Peters ◽  
...  

EOS [1] is the main storage system at CERN, providing hundreds of petabytes of capacity to both the physics experiments and regular users of the CERN infrastructure. Since its first deployment in 2010, EOS has evolved and adapted to the challenges posed by ever-increasing requirements for storage capacity, a user-friendly POSIX-like interactive experience, and new paradigms such as collaborative applications along with sync-and-share capabilities. Overcoming these challenges at various levels of the software stack meant devising a new architecture for the namespace subsystem, completely redesigning the EOS FUSE module, and adapting the remaining components, such as draining, the LRU engine, and the file-system consistency check, to ensure stable and predictable performance. In this paper we detail the issues that triggered these changes, along with the software design choices we made. In the last part of the paper, we focus on the areas that need immediate improvement to ensure a seamless experience for the end user and increased overall availability of the service. Some of these changes have far-reaching effects and are aimed at simplifying the deployment model and, more importantly, the operational load when dealing with transient and non-transient errors in a system managing thousands of disks.


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 140182-140189 ◽  
Author(s):  
Corrado De Sio ◽  
Sarah Azimi ◽  
Luca Sterpone ◽  
Boyang Du

2019 ◽  
Vol 214 ◽  
pp. 03020
Author(s):  
Michal Svatos ◽  
Alessandro De Salvo ◽  
Alastair Dewhurst ◽  
Emmanouil Vamvakopoulos ◽  
Julio Lozano Bahilo ◽  
...  

The ATLAS Distributed Computing system uses the Frontier system to access the Conditions, Trigger, and Geometry database data stored in the Oracle offline database at CERN by means of the HTTP protocol. All ATLAS computing sites use Squid web proxies to cache the data, greatly reducing the load on the Frontier servers and the databases. One feature of the Frontier client is that, in the event of a failure, it retries with different services. While this allows transient errors and scheduled maintenance to be handled transparently, it opens the system up to cascading failures if the load is high enough. Throughout LHC Run 2 there has been an ever-increasing demand on the Frontier service, and there have been multiple incidents where parts of the service failed under high load. A significant improvement in the monitoring of the Frontier service was therefore required. The monitoring was needed both to identify problematic tasks, which could then be killed or throttled, and to identify failing site services, since the consequences of a cascading failure are much more severe. This presentation describes the implementation and features of the monitoring system.

