Fault Tolerant Architecture to Cloud Computing Using Adaptive Checkpoint

Author(s):  
Ghalem Belalem
Said Limam

Cloud computing refers both to the applications delivered as services over the Internet and to the hardware and systems software in the datacenters that provide those services. Failures of all types are common in current datacenters, partly because of the sheer number of nodes. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources; the most fundamental user expectation is, of course, that an application finishes correctly regardless of faults in the nodes on which it runs. This paper proposes a fault-tolerant architecture for Cloud Computing that uses an adaptive checkpoint mechanism to ensure that a running task finishes correctly despite faults in the nodes on which it is running. The proposed fault-tolerant architecture is simultaneously transparent and scalable.

2011
Vol 1 (4)
pp. 60-69
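As an illustrative sketch only (not the authors' implementation), an adaptive checkpoint mechanism of the kind the abstract describes can be modelled as a loop that checkpoints more often after a failure and less often during stable progress. The function name, the halve/increment adaptation rule, and all parameters below are our own assumptions:

```python
import random

def run_with_adaptive_checkpoints(total_work, fail_prob, seed=0):
    """Run a task of `total_work` steps under random node faults,
    adapting the checkpoint interval to the observed failure rate."""
    rng = random.Random(seed)
    interval = 8        # steps between checkpoints (adapted below)
    checkpoint = 0      # last saved position
    done = 0
    failures = 0
    while done < total_work:
        if rng.random() < fail_prob:
            failures += 1
            done = checkpoint                  # roll back to last checkpoint
            interval = max(1, interval // 2)   # faults seen: checkpoint more often
            continue
        done += 1
        if done - checkpoint >= interval or done == total_work:
            checkpoint = done                  # save state
            interval = min(64, interval + 1)   # stable: relax the interval
    return done, failures
```

Despite rollbacks, the task always reaches `total_work`, which is the correctness property the paper's mechanism is meant to guarantee.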


Author(s):  
Said Limam
Ghalem Belalem

Cloud computing has become a significant technology and a great solution for providing a flexible, on-demand, and dynamically scalable computing infrastructure for many applications; it also represents a significant technology trend. With cloud computing, users employ a variety of devices to access programs, storage, and application-development platforms over the Internet, via services offered by cloud providers. The probability that a failure occurs during execution grows as the number of nodes increases; since failures cannot be fully prevented, one solution is to implement fault tolerance mechanisms. Fault tolerance has become a major task for computer engineers and software developers because the occurrence of faults increases the cost of using resources. In this paper, the authors propose an approach that combines migration and checkpoint mechanisms. The checkpoint mechanism minimizes the time lost and reduces the effect of failures on application execution, while the migration mechanism guarantees the continuity of application execution and avoids any loss due to hardware failure in a transparent and efficient way. Simulation results show the effectiveness of the approach in terms of execution time and in masking the effects of failures.
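A minimal sketch of the combined idea, under our own simplified reading (the function, the node-liveness table, and the fixed checkpoint interval are illustrative assumptions, not the authors' API): checkpoint periodically, and on a node failure migrate the task to a healthy node, resuming from the last checkpoint.

```python
def execute_with_migration(task_steps, node_up, interval=3):
    """Run `task_steps` steps; `node_up[n][t]` says whether node n is
    alive at step t (patterns repeat cyclically). On failure, migrate
    and resume from the last checkpoint."""
    node, done, checkpoint, lost, t = 0, 0, 0, 0, 0
    while done < task_steps:
        if not node_up[node][t % len(node_up[node])]:
            # migrate to the first node that is alive at this instant
            node = next((n for n in range(len(node_up))
                         if node_up[n][t % len(node_up[n])]), None)
            if node is None:
                raise RuntimeError("no healthy node to migrate to")
            lost += done - checkpoint
            done = checkpoint          # resume from the last checkpoint
        else:
            done += 1
            if done % interval == 0:   # periodic checkpoint
                checkpoint = done
        t += 1
    return done, node, lost
```

The `lost` counter makes visible how checkpointing bounds the work redone after each failure, which is the effect the paper measures.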


Cloud computing, as a delivery model, is moving ahead swiftly, being adopted by small and large organizations alike. This new model opens up many research challenges. As cloud computing services are offered over the Internet on a pay-per-use basis, it is essential to provide fault-tolerant services to users. To ensure high availability, data centers are replicated. Replication is costly, but in terms of reliability the benefit outweighs the cost. A vast amount of work has been undertaken on fault tolerance in other computing environments, but it cannot be applied directly to the cloud, which creates an opportunity for new, effective solutions. In this paper, we propose policies for delivering fault-tolerant services in a private cloud computing environment, focused on virtual machine allocation. The experimental test results and the derived policies are described with respect to virtual machine provisioning.
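One toy allocation policy in the spirit the abstract describes (the greedy first-fit rule and all names here are our illustrative assumptions, not the paper's policies): give each VM a primary and a replica on two different hosts, so a single host failure never takes out both copies.

```python
def place_with_replica(vm_demand, hosts):
    """Place each VM (name -> capacity needed) twice, on distinct
    hosts (name -> free capacity), using first-fit for both copies."""
    placement = {}
    for vm, need in vm_demand.items():
        primary = next((h for h, free in hosts.items() if free >= need), None)
        if primary is None:
            raise RuntimeError(f"no capacity for {vm}")
        hosts[primary] -= need
        # the replica must land on a *different* host than the primary
        replica = next((h for h, free in hosts.items()
                        if h != primary and free >= need), None)
        if replica is None:
            raise RuntimeError(f"no replica host for {vm}")
        hosts[replica] -= need
        placement[vm] = (primary, replica)
    return placement
```

The anti-affinity constraint (`h != primary`) is the essential fault-tolerance ingredient; everything else is ordinary bin packing.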


Author(s):  
Guru Prasad Bhandari
Ratneshwer Gupta

Cyber-physical systems (CPSs) are co-engineered networks integrating physical and computational components. A CPS is a mechanism controlled or monitored by computer-based algorithms, tightly interacting with the Internet and its users. This chapter presents definitions relating to the dependability, safety-criticality, and fault tolerance of CPSs, supplemented by related definitions such as reliability, availability, safety, maintainability, and integrity. Threats to dependability and security, such as faults, errors, and failures, are also discussed, and a taxonomy of the different faults and attacks on CPSs is presented. The main objective of this chapter is to give learners general information about secure CPSs as a foundation for further work in the field.


Author(s):  
Zahid Raza
Deo P. Vidyarthi

Grid is a parallel and distributed computing network system comprising heterogeneous computing resources spread over multiple administrative domains that offers high-throughput computing. Since the Grid operates at a large scale, there is always a possibility of failures, ranging from hardware to software, and the penalty paid for these failures may be very large. The system needs to tolerate the various failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance into the system and ensure successful execution of a job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may have any performance-optimization criterion; integrating the co-scheduler adds fault tolerance on top of it. The chapter evaluates the performance of the co-scheduler with a main scheduler designed to minimize the turnaround time of a modular job, introducing module replication to counter the effects of node failures in a Grid. A simulation study reveals that the model works well under various conditions, resulting in graceful degradation of the scheduler's performance while improving the overall reliability offered to the job.
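The reliability benefit of module replication can be made concrete with a small sketch, under our own reading of the chapter's idea (the function and the independence assumption are ours, not the chapter's model): a module survives if any of its replicas' nodes survives, and the job survives only if every module does.

```python
def job_reliability(module_nodes, node_reliability):
    """Reliability of a modular job. `module_nodes` lists, per module,
    the nodes hosting its replicas; `node_reliability` maps node -> the
    probability that the node survives (failures assumed independent)."""
    rel = 1.0
    for nodes in module_nodes:
        # the module fails only if every replica's node fails
        p_all_fail = 1.0
        for n in nodes:
            p_all_fail *= 1.0 - node_reliability[n]
        rel *= 1.0 - p_all_fail
    return rel
```

For example, replicating a module from one 0.9-reliable node onto a second raises its survival probability from 0.9 to 0.99, which illustrates why a selective degree of replication can pay for its cost.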


Author(s):  
Reema Abdulraziq
Muneer Bani Yassein
Shadi Aljawarneh

Big data refers to the huge amounts of data being used in commercial, industrial, and economic environments. There are three types of big data: structured, unstructured, and semi-structured. In discussions of big data, the three major aspects considered its main dimensions are the volume, velocity, and variety of the data. This data is collected, analysed, and checked for use by end users. Cloud computing and the Internet of Things (IoT) enable this huge amount of collected data to be stored and connected to the Internet. These technologies reduce time and cost and, in addition, can accommodate this large amount of data regardless of its size. This chapter focuses on how big data, with the emergence of cloud computing and the IoT, can be used via several applications and technologies.


Author(s):  
N.L. Udaya Kumar
M. Siddappa

Cloud computing is a way of computing in which all computing resources are available as services over the Internet, based on the requirements of the users. Virtualization is a concept that plays a very important role in reducing the cost of investment, increasing utilization, and allowing multi-tenancy; it helps create virtual resources out of existing physical resources. Once created, virtual resources may encounter problems for various reasons and stop working properly, so protecting these virtual resources and keeping them working is important. Here we introduce an approach called the duplication method, which allows users to create multiple copies of the same virtual resource, so that if one copy fails for some reason, users have other identical resources with which to continue without disturbing their work, with security provided at different levels to keep the virtual resources secure.
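A minimal sketch of the duplication idea, under stated assumptions (the class, its method names, and the use of callables as stand-ins for virtual resources are all our own illustration, not the chapter's design): keep several identical copies and transparently fall back to a healthy one when the active copy fails.

```python
class DuplicatedResource:
    """Wrap several identical virtual resources; requests go to the
    first copy that answers, so a single failure stays invisible."""

    def __init__(self, replicas):
        self.replicas = list(replicas)   # callables standing in for VMs

    def call(self, *args):
        last_error = None
        for replica in self.replicas:
            try:
                return replica(*args)    # first healthy copy answers
            except Exception as err:     # a failed copy is skipped
                last_error = err
        raise RuntimeError("all replicas failed") from last_error
```

From the user's point of view the failover is transparent: the call succeeds as long as any duplicate survives.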


By nature, a construction company needs to have data and information readily available at any stage and at any location. Given the often unstructured and fluid nature of construction activities, construction personnel face the challenge of collecting and collating data while on the move and at work under field conditions. With the advent of cloud computing and the rapidly increasing reach of the Internet through wireless and mobile network carriers, construction companies are increasingly beginning to explore the power of information technology through cloud and IoT platforms. While at one stage the cost of maintaining IT infrastructure was prohibitive for construction companies and engineering firms of all sizes, the rise of cloud computing has enabled them to rationalize costs and investments while retaining high levels of productivity and service. This chapter discusses the various options that construction companies now have to choose from to streamline operations and increase productivity.


Author(s):  
Sam Goundar
Akashdeep Bhardwaj

With mission-critical web applications and resources hosted in cloud environments, and cloud services growing fast, the need for a greater level of service assurance regarding fault tolerance, for both availability and reliability, has increased. The high priority now is ensuring a fault-tolerant environment that can keep systems up and running. To minimize the impact of downtime or accessibility failures due to systems, network devices, or hardware, such failures need to be anticipated and handled proactively in a fast, intelligent way. This article discusses a fault tolerance system for cloud computing environments and analyzes whether it is effective for cloud environments.
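One generic building block for the proactive handling the article motivates is a heartbeat-based failure detector; the sketch below is our own illustration (not the article's system), and the function, threshold rule, and node names are assumptions: a node whose last heartbeat is older than a timeout is flagged as suspect so its load can be moved before users notice downtime.

```python
def detect_failures(last_heartbeat, now, timeout):
    """Return the sorted names of nodes whose last heartbeat timestamp
    is older than `timeout` seconds relative to `now`."""
    return sorted(node for node, t in last_heartbeat.items()
                  if now - t > timeout)
```

In a real deployment the suspect list would trigger failover or VM restart; here it simply shows the anticipation step that turns reactive recovery into proactive handling.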


Author(s):  
V. V. Narozhny
A. S. Nazarov
T. G. Degtyareva

The past decade has been characterized by the accelerating development of the Internet of Things (IoT). Currently, the European Research Cluster on the Internet of Things (IERC) defines the IoT as a dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols. The development of the Internet and of microprocessor technology caused the rise of the IoT, and the growing popularity of cloud computing and wireless networks also contributed to its rapid development. As a result, the widespread use of the IoT has required an increase in the reliability of its devices.

In many areas of modern technological processes and physical research, temperature is a significant physical characteristic. The paper describes a hardware and software complex connecting a DS18B20 temperature sensor, designed to study the fault tolerance of temperature measurements in the IoT. The Wi-Fi module NodeMCU V3, based on the ESP8266, is used as the control unit of the complex.

The appearance of the IoT has brought the development of fault-tolerant solutions, an important segment of technical research, to a new level. One of the important subsystems of such an application is the real-time detection of the physical parameters of various devices. Since the operation of the IoT depends mainly on the information provided by its sensors, monitoring sensor performance is critically important. The autonomous system architecture of the IoT includes such tasks as perception, localization, planning, management, and control over systems exchanging information with each other. A single sensor failure can therefore lead to dangerous behavior of the IoT system, which is why the reliability of the sensors is of high concern. IoT fault tolerance is thus an important direction in modern systems design, and research into ensuring fault-tolerant IoT operation is an urgent task; hardware and software complexes such as the one described here are developed for such studies.
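As a hedged sketch of the kind of fault-tolerant measurement such a complex can study (not the paper's firmware; `read_raw` stands in for the real NodeMCU/DS18B20 driver call, and the sample count and filtering rule are our assumptions): take several raw samples, discard values outside the DS18B20's -55..+125 °C range or equal to its 85.0 °C power-on reset value, and return the median of the survivors.

```python
import statistics

POWER_ON_RESET_C = 85.0   # a raw DS18B20 reads 85.0 °C when a
                          # conversion never ran; treat it as suspect

def robust_temperature(read_raw, samples=5):
    """Fault-tolerant temperature read: sample `read_raw()` several
    times, drop out-of-range or reset-value samples, return the median."""
    good = []
    for _ in range(samples):
        t = read_raw()
        if t is None or t == POWER_ON_RESET_C or not -55.0 <= t <= 125.0:
            continue                      # discard the faulty sample
        good.append(t)
    if not good:
        raise RuntimeError("sensor gave no valid sample")
    return statistics.median(good)
```

The median makes a single glitched sample harmless, which is exactly the per-sensor fault tolerance the measurement study targets.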

