Availability Analysis of Software Systems with Rejuvenation and Checkpointing

Junjun Zheng; Hiroyuki Okamura; Tadashi Dohi

doi:10.3390/math9080846

Availability Analysis of Software Systems with Rejuvenation and Checkpointing

Mathematics ◽

10.3390/math9080846 ◽

2021 ◽

Vol 9 (8) ◽

pp. 846

Author(s):

Junjun Zheng ◽

Hiroyuki Okamura ◽

Tadashi Dohi

Keyword(s):

Steady State ◽

Human Error ◽

System Modeling ◽

Regenerative Process ◽

State System ◽

Software Systems ◽

System Availability ◽

Software Rejuvenation ◽

Availability Analysis ◽

Availability Model

In software reliability engineering, software-rejuvenation and checkpointing techniques are widely used to enhance system reliability and strengthen data protection. This paper presents a stochastic framework, composed of a composite stochastic Petri reward net and its resulting non-Markovian availability model, to capture the dynamic behavior of an operational software system in which time-based software rejuvenation and checkpointing are both conducted aperiodically. In particular, apart from the software-aging problem that may cause the system to fail, human-error factors (i.e., a system operator’s misoperations) during checkpointing are also considered. The non-Markovian availability model is derived from the reachability graph of the stochastic Petri reward net and is not one of the trivial stochastic models such as the semi-Markov process or the Markov regenerative process, so the phase-expansion approach is used to obtain its stationary solution. In numerical experiments, we illustrate steady-state system availability and find optimal software-rejuvenation policies that maximize it. The effects of human-error factors on both steady-state system availability and the optimal software-rejuvenation trigger timing are also evaluated. Numerical results show that human errors during checkpointing both decrease system availability and significantly affect the optimal rejuvenation-trigger timing, so they should not be overlooked during system modeling.
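The idea of an optimal rejuvenation-trigger time can be illustrated with a much simpler model than the one in the abstract. The sketch below is not the paper's Petri-net model and ignores checkpointing and human error: it assumes a classic renewal-type setting in which the software fails after a Weibull-distributed time, rejuvenation is triggered at a fixed age t0 if no failure has occurred, and each cycle ends with either a repair or a rejuvenation. All parameter values (`ETA`, `SHAPE`, `MEAN_REPAIR`, `MEAN_REJUV`) are hypothetical.

```python
import math

# Hypothetical parameters (assumptions, not taken from the paper).
ETA, SHAPE = 100.0, 2.0   # Weibull scale and shape; shape > 1 models aging
MEAN_REPAIR = 10.0        # mean downtime after a failure
MEAN_REJUV = 1.0          # mean downtime of a planned rejuvenation


def survival(t):
    """Probability that no failure has occurred by time t."""
    return math.exp(-((t / ETA) ** SHAPE))


def expected_uptime(t0, steps=1000):
    """Expected uptime per cycle: integral of survival(t) over [0, t0] (trapezoid rule)."""
    h = t0 / steps
    total = 0.5 * (survival(0.0) + survival(t0))
    for i in range(1, steps):
        total += survival(i * h)
    return total * h


def availability(t0):
    """Steady-state availability = expected uptime / expected cycle length."""
    up = expected_uptime(t0)
    p_fail = 1.0 - survival(t0)  # cycle ends in failure before t0
    down = MEAN_REPAIR * p_fail + MEAN_REJUV * (1.0 - p_fail)
    return up / (up + down)


# Grid search over candidate trigger times.
grid = [float(t) for t in range(1, 301)]
best_t0 = max(grid, key=availability)
best_a = availability(best_t0)
no_rejuv_a = availability(300.0)  # trigger so late that rejuvenation almost never happens

print(f"best trigger time ~ {best_t0:.0f}, availability ~ {best_a:.4f}")
print(f"availability with (almost) no rejuvenation ~ {no_rejuv_a:.4f}")
```

Under these assumed numbers the availability curve has an interior maximum: triggering too early wastes time on rejuvenation, while triggering too late lets costly failures dominate. A grid search is used only for transparency; any one-dimensional optimizer would serve.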

Degrading systems availability analysis: analytical semi-Markov approach

Eksploatacja i Niezawodnosc - Maintenance and Reliability ◽

10.17531/ein.2021.1.20 ◽

2021 ◽

Vol 23 (1) ◽

pp. 195-208

Author(s):

Varun Kumar ◽

Girish Kumar ◽

Rajesh Kumar Singh ◽

Umang Soni

Keyword(s):

Steady State ◽

Degradation Mechanism ◽

Continuous Process ◽

Mechanical Systems ◽

Embedded Markov Chain ◽

Steady State Probability ◽

Opportunistic Maintenance ◽

System Availability ◽

Availability Analysis ◽

Complex Mechanical Systems

This paper deals with the modeling and analysis of complex mechanical systems that deteriorate with age. As systems age, questions about their availability and reliability start to surface. The system is assumed to suffer from an internal stochastic degradation mechanism, described as a gradual and continuous process of performance deterioration, which makes it difficult for a maintenance engineer to model. A semi-Markov approach is proposed to analyze the degradation of complex mechanical systems. It involves constructing states corresponding to the system's functionality status and constructing the kernel matrix between the states. The construction of the transition matrix takes the failure rate and repair rate into account. Once the steady-state probability of the embedded Markov chain is computed, one can compute the steady-state solution and, finally, the system availability. System models based on perfect repair, with and without opportunistic maintenance, have been developed, and the benefits of opportunistic maintenance are quantified in terms of increased system availability. The proposed methodology is demonstrated for a two-stage reciprocating air compressor with an intercooler in between, with the system in series configuration.
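The computation chain described above (embedded Markov chain, then steady-state solution, then availability) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' compressor model: the three states, the embedded transition matrix `P`, and the mean sojourn times `M` are hypothetical. It uses the standard semi-Markov result that the long-run fraction of time in state i is proportional to pi_i * m_i, where pi is the stationary distribution of the embedded chain and m_i is the mean sojourn time in state i.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def stationary(p):
    """Stationary distribution of an embedded Markov chain: pi = pi P, sum(pi) = 1."""
    n = len(p)
    # Rows of (P^T - I); the last balance equation is replaced by the normalization.
    a = [[p[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    a[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)


# Hypothetical states: 0 = operational, 1 = degraded, 2 = failed.
P = [
    [0.0, 1.0, 0.0],   # operational always degrades next
    [0.3, 0.0, 0.7],   # degraded: maintained (0.3) or fails (0.7)
    [1.0, 0.0, 0.0],   # failed: repaired back to operational
]
M = [100.0, 50.0, 10.0]   # assumed mean sojourn times per state
UP_STATES = [0, 1]        # the degraded state still counts as available

pi = stationary(P)
weights = [pi_i * m_i for pi_i, m_i in zip(pi, M)]
steady = [w / sum(weights) for w in weights]   # time-stationary probabilities
availability = sum(steady[i] for i in UP_STATES)

print(f"embedded-chain distribution: {[round(x, 4) for x in pi]}")
print(f"steady-state availability:   {availability:.4f}")
```

For these assumed inputs the embedded chain gives pi proportional to (1, 1, 0.7), so the availability works out to 150/157. Changing `P` or `M` to reflect an added maintenance action is how the benefit of a policy would be compared in this kind of sketch.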

Statistical software fault management based on bootstrap confidence intervals

International Journal of Quality & Reliability Management ◽

10.1108/ijqrm-10-2019-0326 ◽

2020 ◽

Vol 37 (6/7) ◽

pp. 905-923

Author(s):

Tadashi Dohi ◽

Hiroyuki Okamura ◽

Cun Hua Qian

Keyword(s):

Confidence Interval ◽

Confidence Intervals ◽

Fault Management ◽

Parametric Bootstrap ◽

Software Systems ◽

Statistical Software ◽

System Availability ◽

Content Type ◽

Software Rejuvenation ◽

Software Fault

Purpose: In this paper, the authors propose two construction methods to estimate confidence intervals of the time-based optimal software rejuvenation policy and its associated maximum system availability via a parametric bootstrap method. Through simulation experiments, the authors investigate their asymptotic behaviors and statistical properties. Design/methodology/approach: The present paper is the first attempt to derive confidence intervals of the optimal software rejuvenation schedule, which maximizes the system availability in the long run. In other words, the authors address statistical software fault management by employing an idea of process control from quality engineering together with a parametric bootstrap. Findings: As a remarkable difference from existing work, the authors carefully take account of a special case where the two-sided confidence interval of the optimal software rejuvenation time does not exist due to the fact that the estimator distribution of the optimal software rejuvenation time is defective. Here the authors propose two useful construction methods for the two-sided confidence interval: the conditional confidence interval and the heuristic confidence interval. Research limitations/implications: Although the authors applied a simulation-based bootstrap confidence method in this paper, another resampling-based approach can also be applied to the same problem. In addition, the authors focused only on a parametric bootstrap, but a non-parametric bootstrap method can also be applied to the confidence interval estimation of the optimal software rejuvenation time interval when complete knowledge of the distribution form is not available. Practical implications: The statistical software fault management techniques proposed in this paper are useful for controlling the system availability of operational software systems by means of a control chart. Social implications: Through online monitoring of operational software systems, it would be possible to estimate the optimal software rejuvenation time and its associated system availability without applying any approximation. By implementing this function in an application programming interface (API), it is possible to realize low-cost fault tolerance for software systems with aging. Originality/value: In the past literature, almost all authors employed parametric and non-parametric inference techniques to estimate the optimal software rejuvenation time but focused only on point estimation. This may often lead to misjudgment based on over-estimation or under-estimation under uncertainty. The authors overcome the problem by introducing the two-sided confidence interval approach.
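The parametric bootstrap mentioned in this abstract can be illustrated on a deliberately simpler quantity. The sketch below does not reproduce the authors' procedure for the optimal rejuvenation time; it assumes exponentially distributed failure times and builds a percentile confidence interval for the mean time to failure. The data in `failure_times` and the function name `bootstrap_ci_mean_ttf` are hypothetical.

```python
import random
import statistics


def bootstrap_ci_mean_ttf(times, n_boot=2000, level=0.95, seed=0):
    """Percentile parametric-bootstrap CI for the mean time to failure.

    Assumes (for illustration only) that failure times are exponential.
    """
    rng = random.Random(seed)
    n = len(times)
    mean_hat = statistics.fmean(times)  # MLE of the exponential mean

    boot_means = []
    for _ in range(n_boot):
        # Resample from the FITTED model, not from the data: this is what
        # makes the bootstrap parametric.
        sample = [rng.expovariate(1.0 / mean_hat) for _ in range(n)]
        boot_means.append(statistics.fmean(sample))

    boot_means.sort()
    alpha = 1.0 - level
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean_hat, lo, hi


# Hypothetical observed failure times (e.g., hours).
failure_times = [12.0, 35.5, 7.2, 50.1, 22.8, 41.0, 15.3, 60.7, 28.4, 33.9]
mean_hat, lo, hi = bootstrap_ci_mean_ttf(failure_times)
print(f"point estimate {mean_hat:.2f}, {95}% interval ({lo:.2f}, {hi:.2f})")
```

To move toward the setting of the abstract, the statistic computed on each bootstrap sample would be replaced by the estimated optimal rejuvenation time; the defective-distribution case the authors discuss arises when that optimum does not exist for some resamples, which this simplified example does not encounter.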

APERIODIC OPTIMAL CHECKPOINT SEQUENCE UNDER STEADY-STATE SYSTEM AVAILABILITY CRITERION

Advanced Reliability Modeling II ◽

10.1142/9789812773760_0030 ◽

2006 ◽

Cited By ~ 2

Author(s):

K. IWAMOTO ◽

T. MARUO ◽

H. OKAMURA ◽

T. DOHI

Keyword(s):

Steady State ◽

State System ◽

System Availability

Markovian Availability Measurement and Assessment for Hardware-Software System

International Journal of Reliability Quality and Safety Engineering ◽

10.1142/s0218539397000187 ◽

1997 ◽

Vol 04 (03) ◽

pp. 257-268 ◽

Cited By ~ 1

Author(s):

Koichi Tokuno ◽

Shigeru Yamada

Keyword(s):

System Reliability ◽

Software Systems ◽

Software System ◽

Reliability Growth ◽

System Availability ◽

Software Failures ◽

Software Failure ◽

Dependent Behavior ◽

Reliability Performance ◽

Availability Model

It is important to take into account the trade-off between hardware and software systems when total computer-system reliability and performance are evaluated and assessed. We develop an availability model for a hardware-software system. The system treated here consists of one hardware subsystem and one software subsystem, and it is assumed that the system goes down and is restored whenever a hardware or a software failure occurs. In particular, for the software subsystem, it is supposed that (i) the restoration actions are not always performed perfectly, (ii) the restoration times for later software failures become longer, and (iii) reliability growth occurs when a restoration action is perfect. The hardware and software failure-occurrence phenomena are described by constant and geometrically decreasing hazard rates, respectively. The time-dependent behavior of the system, which alternates between the operational state, in which the system operates without failures, and the restoration state, in which the system is inoperable and being restored, is described by a Markov process. Useful expressions for several quantitative measures of system performance are derived from this model. Finally, numerical examples are presented to illustrate system availability measurement and assessment.

Multi-state System Availability Model of Electricity Generation for a Cogeneration District Cooling Plant

Asian Journal of Applied Sciences ◽

10.3923/ajaps.2011.431.438 ◽

2011 ◽

Vol 4 (4) ◽

pp. 431-438 ◽

Cited By ~ 3

Author(s):

Mohd Amin Abd Majid ◽

Meseret Nasir

Keyword(s):

Electricity Generation ◽

State System ◽

System Availability ◽

Availability Model

A model of a growing steady state system

Journal of Theoretical Biology ◽

10.1016/0022-5193(66)90135-4 ◽

1966 ◽

Vol 10 (3) ◽

pp. 387-398 ◽

Cited By ~ 7

Author(s):

J.N.R. Grainger ◽

L. Bass

Keyword(s):

Steady State ◽

State System

Methodology for the Assessment of Imprecise Multi-State System Availability

Mathematics ◽

10.3390/math10010150 ◽

2022 ◽

Vol 10 (1) ◽

pp. 150

Author(s):

Joanna Akrouche ◽

Mohamed Sallak ◽

Eric Châtelet ◽

Fahed Abdallah ◽

Hiba Hajj Chehade

Keyword(s):

Markov Chains ◽

Constraint Propagation ◽

State System ◽

Combined Approach ◽

System Availability ◽

Contraction Method ◽

Numerical Examples ◽

Complex Architectures ◽

Backward Propagation ◽

Interval Constraint

Most existing studies of a system’s availability in the presence of epistemic uncertainties assume that the system is binary. In this paper, a new methodology for the estimation of the availability of multi-state systems is developed, taking into consideration epistemic uncertainties. This paper formulates a combined approach, based on continuous Markov chains and interval contraction methods, to address the problem of computing the availability of multi-state systems with imprecise failure and repair rates. The interval constraint propagation method, which we refer to as the forward–backward propagation (FBP) contraction method, allows us to contract the probability intervals, keeping all the values that may be consistent with the set of constraints. This methodology is guaranteed, and several numerical examples of systems with complex architectures are studied.

An analysis of software aging in cloud environment

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v10i6.pp5985-5991 ◽

2020 ◽

Vol 10 (6) ◽

pp. 5985

Author(s):

Shruthi P. ◽

Nagaraj G. Cholli

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Virtual Machines ◽

Cloud Service ◽

Software Systems ◽

System Failure ◽

Dynamic Nature ◽

Software Aging ◽

Software Rejuvenation ◽

The Impact

Cloud computing is an environment in which several virtual machines (VMs) run concurrently on physical machines. The cloud computing infrastructure hosts multiple cloud service segments that communicate with each other through interfaces, which creates a distributed computing environment. During operation, software systems accumulate errors or garbage that lead to system failure and other hazardous consequences. This condition is called software aging. Software aging happens because of memory fragmentation, large-scale resource consumption, and the accumulation of numerical errors. It degrades performance and may result in system failure through premature resource exhaustion. The issue cannot be detected during the software testing phase because of the dynamic nature of operation. The errors that cause software aging are of a special type: they do not disturb the software's functionality but affect its response time and its environment. The issue can therefore be resolved only at run time. To alleviate the impact of software aging, the software rejuvenation technique is used. The rejuvenation process reboots the system or re-initiates the software, which avoids faults or failures. Software rejuvenation removes accumulated error conditions, frees up deadlocks, and defragments operating system resources such as memory. Hence, it avoids future system failures that may happen due to software aging. As service availability is crucial, software rejuvenation is to be carried out at defined schedules without disrupting the service. The presence of software rejuvenation techniques can make software systems more trustworthy, and software designers are using this concept to improve the quality and reliability of software. Software aging and rejuvenation have generated a lot of research interest in recent years. This work reviews some of the research related to the detection of software aging and identifies research gaps.

Determination of metabolic fluxes in a non-steady-state system

Phytochemistry ◽

10.1016/j.phytochem.2007.04.026 ◽

2007 ◽

Vol 68 (16-18) ◽

pp. 2313-2319 ◽

Cited By ~ 28

Author(s):

C.J. Baxter ◽

J.L. Liu ◽

A.R. Fernie ◽

L.J. Sweetlove

Keyword(s):

Steady State ◽

State System ◽

Metabolic Fluxes

Analysis of Two Phases Queue With Vacations and Breakdowns Under T-Policy

Advances in Marketing, Customer Relationship Management, and E-Services - Advanced Methodologies and Technologies in Digital Marketing and Entrepreneurship ◽

10.4018/978-1-5225-7766-9.ch002 ◽

2019 ◽

pp. 13-31

Author(s):

Khalid Alnowibet ◽

Lotfi Tadj

Keyword(s):

Markov Chain ◽

Steady State ◽

Performance Measures ◽

Threshold Level ◽

System Size ◽

State System ◽

Markov Chain Approach ◽

Optimal Value ◽

Two Phases ◽

Random Breakdowns

The service system considered in this chapter is characterized by an unreliable server. Random breakdowns occur on the server, and the repair may not be immediate. The authors allow for the possibility that the server takes a vacation at the end of a given service completion. The server resumes operation according to a T-policy, checking whether enough customers have arrived while it was away. The actual service of any arrival takes place in two consecutive phases, which are independent of each other. A Markov chain approach is used to obtain the steady-state system size probabilities and different performance measures. The optimal value of the threshold level is obtained analytically.
