Fault-tolerant strategy and workspace of the subreflector parallel adjusting mechanism

Parallel mechanism has been widely used in large-scale and heavy-duty attitude adjustment equipment. In order to improve the work reliability of the subreflector parallel adjusting mechanism, a fault-tolerant strategy based on the redundant degree-of-freedom and a workspace boundary identification method are proposed in this paper, which can realize the proper function of the subreflector parallel adjusting mechanism when it has a driven fault. The configuration and parameters of the parallel adjusting mechanism are introduced firstly, then the degrees-of-freedom of the parallel adjusting mechanism is calculated when it has a driven fault, and the principle of the fault-tolerant strategy based on the redundant degree-of-freedom is deduced in detail. Next, the method to solve the workspace boundary identification problem for the parallel adjusting mechanism in fault tolerance conditions is proposed, the maximum and minimum workspaces of the parallel adjusting mechanism at fault tolerance conditions in different frequency bands are analyzed. The results showed that the workspace calculated by the fault-tolerant strategy in the fault condition can completely meet the needs of the subreflector, where this method can also be applied to other parallel mechanisms. Lastly, an experiment is conducted to verify the correctness and effectiveness of the fault-tolerant strategy, in which the results showed that the fault-tolerant strategy can effectively improve the work reliability of the parallel adjusting mechanism. The fault-tolerant strategy and workspace boundary identification method can make the subreflector parallel adjusting mechanism work normally when it has a driven fault, which can significantly improve the work reliability and work efficiency and the maintenance cost can also be reduced. The fault-tolerant strategy and workspace boundary identification method can also be well applied to the research and development for this kind of parallel mechanical equipment.

Download Full-text

Dynamic Fault Tolerant Topology Control for Wireless Sensor Network Based on Node Cascading Failure

International Journal of Online Engineering (iJOE) ◽

10.3991/ijoe.v14i05.8644 ◽

2018 ◽

Vol 14 (05) ◽

pp. 118 ◽

Cited By ~ 1

Author(s):

Yang Xiao

Keyword(s):

Fault Tolerance ◽

Degree Distribution ◽

Large Scale ◽

Fault Tolerant ◽

Wireless Sensor ◽

Cascading Failure ◽

Network Node ◽

Simulation Test ◽

Node Failure ◽

Scale Free

To address the node cascading failure (CF) of the wireless sensor networks (WSNs), considering such factors as node load and maximum capacity in scale-free topology, this paper establishes the WSN dynamic fault tolerant topology model based on node cascading failure, analyses the relationships between node load, topology and dynamic fault tolerance, and demonstrates the proposed model through simulation test. It studies the effects of topology parameter and load in case of random node failure in the network node cascading failure, and utilizes the theoretical derivation method to derive the structural feature of scale-free topology and the capacity limit for the WSNs large-scale cascading failure, effectively enhancing the cascading fault tolerance of traditional WSNs. The simulation test results show that, with the degree distribution parameter C increasing, the minimum network node degree will increase accordingly, and in highly intensive topology, the dynamic fault tolerance will be better; with the parameter λ increasing, the maximum degree of the network node will gradually decrease, and the degree distribution of topology structure tends to be uniform, so that the large-scale cascading failure caused by node failure will have less influence on the WSN, and further improve the dynamic fault tolerance performance of the system.

Download Full-text

A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid

Cloud, Grid and High Performance Computing ◽

10.4018/978-1-60960-603-9.ch007 ◽

2011 ◽

pp. 101-116

Author(s):

Zahid Raza ◽

Deo P. Vidyarthi

Keyword(s):

Fault Tolerance ◽

Performance Optimization ◽

Large Scale ◽

Heterogeneous Computing ◽

Fault Tolerant ◽

Turnaround Time ◽

Computational Grid ◽

Successful Execution ◽

Computational Resources ◽

The Cost

Grid is a parallel and distributed computing network system comprising of heterogeneous computing resources spread over multiple administrative domains that offers high throughput computing. Since the Grid operates at a large scale, there is always a possibility of failure ranging from hardware to software. The penalty paid of these failures may be on a very large scale. System needs to be tolerant to various possible failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance in the system to ensure successful execution of the job, even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between the performance and the cost. This chapter proposes a co-scheduler that can be integrated with main scheduler for the execution of the jobs submitted to computational Grid. The main scheduler may have any performance optimization criteria; the integration of co-scheduler will be an added advantage towards fault tolerance. The chapter evaluates the performance of the co-scheduler with the main scheduler designed to minimize the turnaround time of a modular job by introducing module replication to counter the effects of node failures in a Grid. Simulation study reveals that the model works well under various conditions resulting in a graceful degradation of the scheduler’s performance with improving the overall reliability offered to the job.

Download Full-text

Improved Checkpoint Using the Effective Management of I/O in a Cloud Environment

Advances in Computer and Electrical Engineering - Advanced Methodologies and Technologies in Network Architecture, Mobile Computing, and Data Analytics ◽

10.4018/978-1-5225-7598-6.ch016 ◽

2019 ◽

pp. 219-233

Author(s):

Bakhta Meroufel ◽

Ghalem Belalem

Keyword(s):

Fault Tolerance ◽

Execution Time ◽

Large Scale ◽

Fault Tolerant ◽

Experimental Results ◽

Cloud Environment ◽

Large Scale Systems ◽

Major Disadvantage ◽

Reliability And Robustness ◽

Effective Use

One of the most important points for more effective use in the environment of cloud is undoubtedly the study of reliability and robustness of services related to this environment. In this case, fault tolerance is necessary to ensure that reliability and reduce the SLA violation. Checkpointing is a popular fault tolerance technique in large-scale systems. However, its major disadvantage is the overhead caused by the storage time of checkpointing files, which increases the execution time and minimizes the possibility to meet the desired deadlines. In this chapter, the authors propose a checkpointing strategy with lightweight storage. The storage is provided by creating a virtual topology VRbIO and the use of an intelligent and fault tolerant I/O technique CSDS (collective and selective data sieving). The proposal is executed by active and reactive agents and it solves many problems of checkpointing with standard I/O. To evaluate the approach, the authors compare it with a checkpointing with ROMIO as I/O strategy. Experimental results show the effectiveness and reliability of the proposed approach.

Download Full-text

Transparent fault tolerance for scalable functional computation

Journal of Functional Programming ◽

10.1017/s095679681600006x ◽

2016 ◽

Vol 26 ◽

Cited By ~ 2

Author(s):

ROBERT STEWART ◽

PATRICK MAIER ◽

PHIL TRINDER

Keyword(s):

Fault Tolerance ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Programming Model ◽

Fault Tolerant ◽

Fault Recovery ◽

Actor Model ◽

Work Stealing ◽

Performance Computing

AbstractReliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and an High Performance Computing platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well e.g. up to 1,400 cores on the High Performance Computing; reliability and recovery overheads are consistently low even at scale.

Download Full-text

SCHEDULING WITH JOB CHECKPOINT IN COMPUTATIONAL GRID ENVIRONMENT

International Journal of Modeling Simulation and Scientific Computing ◽

10.1142/s1793962311000517 ◽

2011 ◽

Vol 02 (03) ◽

pp. 299-316

Author(s):

MALARVIZHI NANDAGOPAL ◽

S. GAJALAKSHMI ◽

V. RHYMEND UTHARIARAJ

Keyword(s):

Fault Tolerance ◽

Large Scale ◽

Job Scheduling ◽

Fault Tolerant ◽

Scheduling Algorithm ◽

Computational Grids ◽

Tolerance Mechanism ◽

Grid Resource ◽

Distributed Resources ◽

Grid Environment

Computational grids have the potential for solving large-scale scientific applications using heterogeneous and geographically distributed resources. In addition to the challenges of managing and scheduling these applications, reliability challenges arise because of the unreliable nature of grid infrastructure. Two major problems that are critical to the effective utilization of computational resources are efficient scheduling of jobs and providing fault tolerance in a reliable manner. This paper addresses these problems by combining the checkpoint replication based fault tolerance mechanism with minimum total time to release (MTTR) job scheduling algorithm. TTR includes the service time of the job, waiting time in the queue, transfer of input and output data to and from the resource. The MTTR algorithm minimizes the response time by selecting a computational resource based on job requirements, job characteristics, and hardware features of the resources. The fault tolerance mechanism used here sets the job checkpoints based on the resource failure rate. If resource failure occurs, the job is restarted from its last successful state using a checkpoint file from another grid resource. Globus ToolKit is used as the grid middleware to set up a grid environment and evaluate the performance of the proposed approach. The monitoring tools Ganglia and Network Weather Service are used to gather hardware and network details, respectively. The experimental results demonstrate that, the proposed approach effectively schedule the grid jobs with fault-tolerant way thereby reduces TTR of the jobs submitted in the grid. Also, it increases the percentage of jobs completed within specified deadline and making the grid trustworthy.

Download Full-text

Fault Tolerance

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Large-Scale Distributed Computing and Applications ◽

10.4018/978-1-61520-703-9.ch008 ◽

2010 ◽

pp. 168-193

Author(s):

Valentin Cristea ◽

Ciprian Dobre ◽

Corina Stratan ◽

Florin Pop

Keyword(s):

Distributed Systems ◽

Fault Tolerance ◽

Large Scale ◽

Fault Tolerant ◽

Distributed Applications ◽

Future Trends ◽

Distributed Architectures ◽

P2p Systems ◽

Geographically Distributed ◽

Commercial Applications

The domains of usage of large scale distributed systems have been extending during the past years from scientific to commercial applications. Together with the extension of the application domains, new requirements have emerged for large scale distributed systems. Among these requirements, fault tolerance is needed by more and more modern distributed applications, not only by the critical ones. In this chapter we analyze current existing work in enabling fault tolerance in case of large scale distributed systems, presenting specific problem, existing solution, as well as several future trends. The characteristics of these systems pose problems to ensuring fault tolerance especially because of their complexity, involving many resources and users geographically distributed, because of the volatility of resources that are available only for limited amounts of time, and because of the constraints imposed by the applications and resource owners. A general fault tolerant architecture should, at a minimum, be comprised of at least a mechanism to detect failures and a component capable to recover and handle the detected failures, usually using some form of a replication mechanism. In this chapter we analyzed existing fault tolerance implementations, as well as solutions adopted in real world large scale distributed systems. We analyzed the fault tolerance architectures being proposed for particular distributed architectures, such as Grid or P2P systems.

Download Full-text

On-Line Fault-Tolerant Fuzzy-Based Path Planning and Obstacles Avoidance Approach for Manipulator Robots

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488518500368 ◽

2018 ◽

Vol 26 (05) ◽

pp. 809-838 ◽

Cited By ~ 1

Author(s):

Abderraouf Maoudj ◽

Abdelfetah Hentout ◽

Brahim Bouzouia ◽

Redouane Toumi

Keyword(s):

Fault Tolerance ◽

Path Planning ◽

Fault Tolerant ◽

Degree Of Freedom ◽

Planning Approach ◽

Industrial Manipulators ◽

On Line ◽

Point To Point ◽

Six Degree Of Freedom ◽

High Degree

Manipulator robots are widely used in many fields to replace humans in complex and risky environments. However, in some particular environments the robot is prone to failure, resulting in decreased performance. In such environments, it is extremely difficult to repair the robot which interrupts the execution process. Therefore, fault tolerance plays an important role in industrial manipulators applications. In this paper, the key problems related to fault-tolerance and path planning of manipulator robots under joints failures are handled within an on-line fault-tolerant fuzzy-logic based path planning approach for high degree-of-freedom robots. This approach provides an alternative to using mathematical models to control such robots, and improves tolerance to certain faults and mechanical failures. The controller consists of two fuzzy units (i) the first unit, Fuzzy_Path_Planner, is responsible of path planning; (ii) the second unit, Fuzzy_Obstacle_Avoidance, is conceived for obstacles avoidance. Moreover, the proposed approach is capable of repelling the manipulator away from both local minima and limit cycle problems. Finally, to validate the proposed approach and show its performances and effectiveness, different tests are carried out on two six degree-of-freedom manipulator robots (ULM and PUMA560 robots), accomplishing point-to-point tasks, with and without considering some joints failures.

Download Full-text

Exploring Parallel MPI Fault Tolerance Mechanisms for Phylogenetic Inference with RAxML-NG

10.1101/2021.01.15.426773 ◽

2021 ◽

Author(s):

Lukas Hübner ◽

Alexey M. Kozlov ◽

Demian Hespe ◽

Peter Sanders ◽

Alexandros Stamatakis

Keyword(s):

Fault Tolerance ◽

Phylogenetic Trees ◽

Large Scale ◽

Fault Tolerant ◽

Phylogenetic Inference ◽

Molecular Data ◽

Supplementary Information ◽

Tolerance Mechanisms ◽

Recovery Mechanisms ◽

Mpi Implementation

Phylogenetic trees are now routinely inferred on large scale HPC systems with thousands of cores as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required, and the performance penalties induced via enabling parallel fault tolerance by example of RAxML-NG, the successor of the widely used RAxML tool for maximum likelihood based phylogenetic tree inference. We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 2%. The overall slowdown by using these recovery mechanisms in conjunction with a fault tolerant MPI implementation amounts to 8% on average for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery, and failures during checkpointing. Recoveries are automatic and transparent to the user. The modified fault tolerant RAxML-NG code is available under GNU GPL at https://github.com/lukashuebner/ft-raxml-ng Contact: lukas.huebner@{kit.edu,h-its.org};, [email protected], [email protected], [email protected], [email protected] Supplementary information: Supplementary data are available at bioRχiv.

Download Full-text

Survey on Performance and Energy consumption of Fault Tolerance in Network on Chip

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v5.i1.pp71-76 ◽

2016 ◽

Vol 5 (1) ◽

pp. 71

Author(s):

B. Naresh Kumar Reddy ◽

Vasantha M.H ◽

Nithin Kumar Y.B.

Keyword(s):

Fault Tolerance ◽

Large Scale ◽

Fault Tolerant ◽

Network On Chip ◽

Fault Analysis ◽

Communication Links ◽

Analysis Strategy ◽

Large Scale Integration ◽

On Chip ◽

Scale Integration

Network on Chip (NoC) is a communication subsystem, which has the logic for sending and receiving the data from different sources in a single IC, is adopting the technology of VLSI making it to be as compact as possible. However, the increasing probability of failures in NoC’s has been raising concern among the researchers due to large scale integration of components. In specific the issues of fault-tolerance, increase in length of global wires of NoC has to be addressed for on chip and multi core architectures. This survey presents a perspective on existing NoC Fault-tolerant algorithm and a Corresponding distributed fault analysis strategy that encourages in observing the fault status of individual NoC components and their adjacent communication links. The analysis of the Fault-tolerant Network subjected to dynamic workloads for large scale applications is also equally important. This research paper mainly emphasizes on Fault tolerant NoC strategies summarizing over thirty research papers.

Download Full-text

PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS

Parallel Processing Letters ◽

10.1142/s0129626411000126 ◽

2011 ◽

Vol 21 (02) ◽

pp. 111-132 ◽

Cited By ~ 15

Author(s):

FRANCK CAPPELLO ◽

HENRI CASANOVA ◽

YVES ROBERT

Keyword(s):

Fault Tolerance ◽

Large Scale ◽

Fault Tolerant ◽

Preventive Measure ◽

Mean Time Between Failures ◽

Future Technology ◽

Good Utilization ◽

Time Between Failures ◽

Failure Avoidance ◽

Extreme Scale

An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance for fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the nature of the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our result show that each approach is likely to achieve poor utilization for large-scale platforms (e.g., 220 nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization in future large-scale platforms will require a combination of techniques.

Download Full-text