State of Art Survey for Fault Tolerance Feasibility in Distributed Systems

The use of technology has grown dramatically, and computer systems are now interconnected through a variety of communication media. Distributed systems (DS) have become increasingly useful in daily activities as data distribution has improved. Distributed systems let nodes organize and share their resources across linked systems or devices, giving people access to geographically dispersed computing capacity. Because failures can occur at multiple points, a distributed system may suffer a loss of service availability. Fault tolerance (FT) techniques address this by providing replication, high redundancy, and high availability of distributed services. This paper describes fault-tolerant systems and their requirements, explains distributed systems and their architecture, and presents the fault tolerance techniques in use. It also reviews recent literature on fault tolerance in distributed systems and, finally, discusses and compares that literature.

Neighbor-Replica Distribution Technique Model for Availability Prediction in Distributed Interdependent Environment

International Journal of Cloud Applications and Computing ◽

10.4018/ijcac.2012070105 ◽

2012 ◽

Vol 2 (3) ◽

pp. 98-109

Author(s):

Ahmad Shukri Mohd Noor ◽

Tutut Herawan ◽

Mustafa Mat Deris

Keyword(s):

Distributed Systems ◽

Fault Tolerance ◽

Prediction Model ◽

Large Scale ◽

Extended Period ◽

High Availability ◽

Performance Data ◽

Online System ◽

Future Expectation ◽

Replication Technique

High availability is important for large-scale distributed systems. Replication provides an effective way to enhance performance, availability and fault tolerance in distributed systems, and an efficient replication technique is key to improving availability. Data and processes can be replicated for failure recovery. Several projects have successfully implemented the two-replica distribution technique (TRDT), also known as the primary-backup technique. However, these projects suffer from increased cost overhead and inherit irrecoverable scenarios from TRDT, such as double faults in which both copies of a replicated component are damaged. The authors propose the Neighbor Replica Distributed Technique (NRDT) availability prediction model, which focuses on improving high availability by predicting the expected future availability of interdependent servers in a distributed online system over an extended period of time. The results and discussion are explored further in the article.
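To make the double-fault limitation of a two-replica scheme concrete, the following is a minimal sketch of the textbook availability formula for replicated components. It assumes that replicas fail independently, which is a simplification: the abstract is specifically about interdependent servers, and this is not the NRDT prediction model. The function name `replicated_availability` is hypothetical.

```python
def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one replica is up.

    Assumes every replica has the same availability and that replicas
    fail independently (an assumption, not taken from the article).
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at once.
    all_down = (1.0 - node_availability) ** replicas
    return 1.0 - all_down


# Two replicas of a 90%-available node leave a 1% chance of a double fault.
pair = replicated_availability(0.9, 2)
```

Under this assumption, the residual unavailability of a pair is exactly the probability of the double fault the abstract mentions.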

Transient-Snapshot based Minimum-process Synchronized Check pointing Etiquette for Mobile Distributed Systems

International Journal of Advanced Trends in Computer Science and Engineering ◽

10.30534/ijatcse/2021/321042021 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2861-2866

Keyword(s):

Distributed Systems ◽

Fault Tolerance ◽

Distributed System ◽

Mobile Systems ◽

Message Complexity ◽

In The Beginning ◽

Bare Minimum

Minimum-process coordinated checkpointing is considered an attractive approach to introducing fault tolerance in mobile systems transparently. We design a minimum-process synchronous checkpointing algorithm for mobile distributed systems. We try to minimize the blocking of processes during checkpointing. We collect the transitive dependencies at the beginning, so the blocking time of processes is kept to a bare minimum. During the blocking period, processes can carry out their normal computations, send messages, and process selected messages. If a failure occurs during checkpointing, all relevant processes need only abandon their transient snapshots. In this way, we try to reduce the loss of checkpointing effort when any process fails to take its checkpoint in coordination with others. We also try to minimize the synchronization message complexity during checkpointing.
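The two ideas in this abstract, checkpointing only the processes that the initiator transitively depends on and discarding only transient snapshots on failure, can be sketched as follows. This is a simplified single-machine illustration and not the authors' algorithm: the dependency map, the `failing` parameter and both function names are hypothetical, and message passing is not modeled.

```python
from collections import deque


def minimum_checkpoint_set(initiator, depends_on):
    """Return the initiator plus every process it transitively depends on."""
    seen = {initiator}
    queue = deque([initiator])
    while queue:
        process = queue.popleft()
        for dependency in depends_on.get(process, ()):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


def coordinated_checkpoint(initiator, depends_on, state, permanent,
                           failing=frozenset()):
    """Take transient snapshots, then commit them all or discard them all.

    `failing` names processes that cannot take a snapshot (hypothetical
    failure injection for this sketch).
    """
    transient = {}
    for process in sorted(minimum_checkpoint_set(initiator, depends_on)):
        if process in failing:
            # Abort: transient snapshots are dropped, and previously
            # committed (permanent) checkpoints are left untouched.
            return False
        transient[process] = state[process]
    permanent.update(transient)  # commit every transient snapshot together
    return True


depends_on = {"P1": {"P2"}, "P2": {"P3"}, "P4": {"P1"}}
participants = minimum_checkpoint_set("P1", depends_on)
```

Here `P4` depends on `P1` but not the other way round, so it is not asked to checkpoint when `P1` initiates.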

Survey on replication techniques for distributed system

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v9i2.pp1298-1303 ◽

2019 ◽

Vol 9 (2) ◽

pp. 1298

Author(s):

Ahmad Shukri Mohd Noor ◽

Nur Farhah Mat Zian ◽

Fatin Nurhanani M. Shaiful Bahri

Keyword(s):

Distributed Systems ◽

Fault Tolerance ◽

Distributed System ◽

Triangular Grid ◽

Critical Element ◽

Dynamic Nature ◽

Wide Range ◽

The Subject ◽

Quorum Consensus ◽

Computational Resources

Distributed systems mainly provide access to large amounts of data and computational resources through a wide range of interfaces. Besides their dynamic nature, which means that resources may enter and leave the environment at any time, many distributed system applications run in environments where faults are more likely to occur because of their ever-increasing scale and complexity. Given the diversity of fault and failure conditions, fault tolerance has become a critical element of distributed computing, allowing a system to perform its function correctly even in the presence of faults. Replication techniques concentrate primarily on two forms of fault tolerance: masking failures, and reconfiguring the system in response to them. This paper presents a brief survey of different replication techniques such as Read One Write All (ROWA), Quorum Consensus (QC), Tree Quorum (TQ) Protocol, Grid Configuration (GC) Protocol, Two-Replica Distribution Techniques (TRDT), Neighbour Replica Triangular Grid (NRTG) and Neighbour Replication Distributed Techniques (NRDT). These techniques each have their own strengths and shortcomings, which form the subject matter of this survey.
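As an illustration of two of the techniques named above, the following is a minimal in-memory sketch of quorum consensus with version numbers. It is not code from the survey: the `Replica` and `QuorumStore` classes are hypothetical, replicas are plain objects rather than networked nodes, and failures are simulated with an `up` flag. ROWA appears as the special case with a read quorum of 1 and a write quorum equal to the number of replicas.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    value: object = None
    version: int = 0
    up: bool = True  # simulated liveness flag (assumption of this sketch)


class QuorumStore:
    def __init__(self, n: int, read_quorum: int, write_quorum: int):
        # Read and write quorums must overlap, and so must any two write quorums.
        if read_quorum + write_quorum <= n:
            raise ValueError("need read_quorum + write_quorum > n")
        if 2 * write_quorum <= n:
            raise ValueError("need 2 * write_quorum > n")
        self.replicas = [Replica() for _ in range(n)]
        self.r = read_quorum
        self.w = write_quorum

    def _live(self):
        return [rep for rep in self.replicas if rep.up]

    def read(self):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        # Any read quorum overlaps the latest write quorum, so the
        # highest version among r replicas is the latest written value.
        return max(live[: self.r], key=lambda rep: rep.version).value

    def write(self, value):
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum unavailable")
        quorum = live[: self.w]
        new_version = max(rep.version for rep in quorum) + 1
        for rep in quorum:
            rep.value = value
            rep.version = new_version


store = QuorumStore(n=5, read_quorum=2, write_quorum=4)
store.write("a")
store.replicas[0].up = False  # one replica fails
latest = store.read()         # the read still sees the written value
```

The trade-off is visible in the parameters: ROWA makes reads cheap but blocks writes as soon as any single replica is down, whereas a majority-style quorum tolerates some failures for both operations.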

Theory and Practice of Fault Tolerance in Distributed Systems.

10.21236/ada187559 ◽

1987 ◽

Author(s):

K. M. Chandy ◽

J. Misra

Keyword(s):

Distributed Systems ◽

Fault Tolerance ◽

Theory And Practice

MobileRE: A replicas prioritized hybrid fault tolerance strategy for mobile distributed system

Journal of Systems Architecture ◽

10.1016/j.sysarc.2021.102217 ◽

2021 ◽

pp. 102217

Author(s):

Yu Wu ◽

Duo Liu ◽

Xianzhang Chen ◽

Jinting Ren ◽

Renping Liu ◽

...

Keyword(s):

Fault Tolerance ◽

Distributed System ◽

Tolerance Strategy

The robust middleware approach for transparent and systematic fault tolerance in parallel and distributed systems

2003 International Conference on Parallel Processing, 2003. Proceedings. ◽

10.1109/icpp.2003.1240566 ◽

2003 ◽

Author(s):

Chi-Hsiang Yeh

Keyword(s):

Distributed Systems ◽

Fault Tolerance

XERIS/APEX

ACM SIGAda Ada Letters ◽

10.1145/3463478.3463484 ◽

2021 ◽

Vol 40 (2) ◽

pp. 65-69

Author(s):

Richard Wai

Keyword(s):

Distributed Systems ◽

Real Time ◽

Distributed System ◽

Dynamic Scaling ◽

Distributed Application ◽

Heavy Weight ◽

System Models ◽

In The Wild ◽

On Line ◽

Language Technologies

Modern-day cloud-native applications have become broadly representative of distributed systems in the wild. However, unlike traditional distributed system models with conceptually static designs, cloud-native systems emphasize dynamic scaling and on-line iteration (CI/CD). Cloud-native systems tend to be architected around a networked collection of distinct programs ("microservices") that can be added, removed, and updated in real time. Typically, distinct containerized programs constitute individual microservices that then communicate within the larger distributed application through heavy-weight protocols. Common communication stacks exchange JSON or XML objects over HTTP, via TCP/TLS, and incur significant overhead, particularly with small message sizes. Additionally, interpreted/JIT/VM-based languages such as JavaScript (NodeJS/Deno), Java, and Python are dominant in modern microservice programs. These language technologies, along with the high-overhead messaging, can impose superlinear cost increases (hardware demands) on scale-out, particularly towards hyperscale and/or with latency-sensitive workloads.
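The point about small messages can be illustrated by comparing the encoded size of one tiny message as JSON and as a fixed binary layout. This is only a sketch of payload encoding overhead: the message fields (`sensor_id`, `reading`) and both function names are hypothetical, HTTP headers and TLS framing are not counted, and the binary layout is an assumption for illustration rather than the format used by XERIS/APEX.

```python
import json
import struct


def json_size(sensor_id: int, reading: float) -> int:
    """Size in bytes of the message encoded as a UTF-8 JSON object."""
    text = json.dumps({"sensor_id": sensor_id, "reading": reading})
    return len(text.encode("utf-8"))


def binary_size(sensor_id: int, reading: float) -> int:
    """Size in bytes as a little-endian unsigned 32-bit int plus a 64-bit float."""
    return len(struct.pack("<Id", sensor_id, reading))


json_bytes = json_size(7, 21.5)
binary_bytes = binary_size(7, 21.5)
```

With the fixed layout the message always occupies 12 bytes, while the JSON form also carries the field names and punctuation in every message, which matters most when the payload itself is small.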

Cloud-Niagara: A high availability and low overhead fault tolerance middleware for the cloud

16th Int'l Conf. Computer and Information Technology ◽

10.1109/iccitechn.2014.6997344 ◽

2014 ◽

Cited By ~ 3

Author(s):

Asif Imran ◽

Alim Ul Gias ◽

Rayhanur Rahman ◽

Amit Seal ◽

Tajkia Rahman ◽

...

Keyword(s):

Fault Tolerance ◽

High Availability

THREE DIMENSIONAL GRID STRUCTURE FOR EFFICIENT ACCESS OF REPLICATED DATA

Journal of Interconnection Networks ◽

10.1142/s0219265901000415 ◽

2001 ◽

Vol 02 (03) ◽

pp. 317-329 ◽

Cited By ~ 5

Author(s):

MUSTAFA MAT DERIS ◽

ALI MAMAT ◽

PUA CHAI SENG ◽

MOHD YAZID SAMAN

Keyword(s):

Distributed System ◽

Three Dimensional ◽

Data Replication ◽

High Availability ◽

Communication Cost ◽

Data Availability ◽

Grid Structure ◽

Communication Costs ◽

Replicated Data ◽

Efficient Access

This article addresses the performance of data replication protocols in terms of data availability and communication costs. Specifically, we present a new protocol, called the Three Dimensional Grid Structure (TDGS) protocol, to manage data replication in distributed systems. The protocol provides high availability for read and write operations with limited fault tolerance at low communication cost. With the TDGS protocol, a read operation is limited to two data copies, while a write operation requires only a minimal number of copies. In comparison to other protocols, TDGS requires lower communication cost per operation while providing higher data availability.

Increasing the resiliency of highly loaded systems

Informacionno-technologicheskij vestnik ◽

10.21499/2409-1650-2020-25-3-118-123 ◽

2020 ◽

pp. 118-123

Author(s):

В.А. Рудометкин

Keyword(s):

Fault Tolerance ◽

High Load ◽

High Availability ◽

Distributed Transactions ◽

Microservice Architecture ◽

To Receive ◽

Processor Cores ◽

Highly Loaded

Nowadays, most services are moving online, which allows users to receive a service at any time. High service availability leads to growth in the number of users, which in turn increases the load on the system, so special attention must be paid to the system's fault tolerance before development begins. The article considers the main problems of high-load systems and a way to optimize an application by parallelizing tasks across processor cores. It describes the need to migrate to a microservice architecture, that architecture's weaknesses, and ways to address them. In solving scaling problems, it also touches on distributed transactions and long server response times.
