Weibull mixture regression for marginal inference in zero-heavy continuous outcomes

Continuous outcomes with preponderance of zero values are ubiquitous in data that arise from biomedical studies, for example studies of addictive disorders. This is known to lead to violation of standard assumptions in parametric inference and enhances the risk of misleading conclusions unless managed properly. Two-part models are commonly used to deal with this problem. However, standard two-part models have limitations with respect to obtaining parameter estimates that have marginal interpretation of covariate effects which are important in many biomedical applications. Recently marginalized two-part models are proposed but their development is limited to log-normal and log-skew-normal distributions. Thus, in this paper, we propose a finite mixture approach, with Weibull mixture regression as a special case, to deal with the problem. We use extensive simulation study to assess the performance of the proposed model in finite samples and to make comparisons with other family of models via statistical information and mean squared error criteria. We demonstrate its application on real data from a randomized controlled trial of addictive disorders. Our results show that a two-component Weibull mixture model is preferred for modeling zero-heavy continuous data when the non-zero part are simulated from Weibull or similar distributions such as Gamma or truncated Gauss.

Download Full-text

Multilevel Zero-inflated Censored Beta Regression Modeling for Proportions and Rate Data with Extra-zeros

10.21203/rs.2.16731/v1 ◽

2019 ◽

Author(s):

Leili Tapak ◽

Omid Hamidi ◽

Majid Sadeghifar ◽

Hassan Doosti ◽

Ghobad Moradi

Keyword(s):

Regression Model ◽

Simulation Study ◽

Real Data ◽

P Value ◽

Parameter Estimates ◽

Beta Regression ◽

Rate Data ◽

Data Set ◽

Proposed Model ◽

Beta Regression Model

Abstract Objectives Zero-inflated proportion or rate data nested in clusters due to the sampling structure can be found in many disciplines. Sometimes, the rate response may not be observed for some study units because of some limitations (false negative) like failure in recording data and the zeros are observed instead of the actual value of the rate/proportions (low incidence). In this study, we proposed a multilevel zero-inflated censored Beta regression model that can address zero-inflation rate data with low incidence.Methods We assumed that the random effects are independent and normally distributed. The performance of the proposed approach was evaluated by application on a three level real data set and a simulation study. We applied the proposed model to analyze brucellosis diagnosis rate data and investigate the effects of climatic and geographical position. For comparison, we also applied the standard zero-inflated censored Beta regression model that does not account for correlation.Results Results showed the proposed model performed better than zero-inflated censored Beta based on AIC criterion. Height (p-value <0.0001), temperature (p-value <0.0001) and precipitation (p-value = 0.0006) significantly affected brucellosis rates. While, precipitation in ZICBETA model was not statistically significant (p-value =0.385). Simulation study also showed that the estimations obtained by maximum likelihood approach had reasonable in terms of mean square error.Conclusions The results showed that the proposed method can capture the correlations in the real data set and yields accurate parameter estimates.

Download Full-text

Erlang mixture distribution with application on COVID-19 cases in egypt

International Journal of Biomathematics ◽

10.1142/s1793524521500157 ◽

2021 ◽

pp. 2150015

Author(s):

M. M. E. Abd El-Monsef

Keyword(s):

Hazard Function ◽

Mixture Distribution ◽

Real Data ◽

Finite Mixture ◽

Erlang Distribution ◽

Parameter Estimates ◽

Proposed Model ◽

Shape Characteristics ◽

Erlang Distributions ◽

Special Case

In this paper, a finite mixture of m-Erlang distributions is proposed. Different moments, shape characteristics and parameter estimates of the proposed model are also provided. The proposed mixture has the property that it has a bounded hazard function. A special case of the mixed Erlang distribution is introduced and discussed. In addition, a predictive technique is introduced to estimate the needed number of mixture components to fit a certain data. A real data concerning the confirmed COVID-19 cases in Egypt is introduced to utilize the predictive estimation technique. Two more real datasets are used to examine the flexibility of the proposed model.

Download Full-text

Imputation of single-cell gene expression with an autoencoder neural network

10.1101/504977 ◽

2018 ◽

Cited By ~ 1

Author(s):

Md. Bahadur Badsha ◽

Rui Li ◽

Boxiang Liu ◽

Yang I. Li ◽

Min Xian ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Missing Values ◽

Mean Squared Error ◽

Real Data ◽

Cell Types ◽

Parameter Estimates ◽

Single Experiment ◽

Data Set ◽

Nonparametric Approach

ABSTRACTBackgroundSingle-cell RNA-sequencing (scRNA-seq) is a rapidly evolving technology that enables measurement of gene expression levels at an unprecedented resolution. Despite the explosive growth in the number of cells that can be assayed by a single experiment, scRNA-seq still has several limitations, including high rates of dropouts, which result in a large number of genes having zero read count in the scRNA-seq data, and complicate downstream analyses.MethodsTo overcome this problem, we treat zeros as missing values and develop nonparametric deep learning methods for imputation. Specifically, our LATE (Learning with AuToEncoder) method trains an autoencoder with random initial values of the parameters, whereas our TRANSLATE (TRANSfer learning with LATE) method further allows for the use of a reference gene expression data set to provide LATE with an initial set of parameter estimates.ResultsOn both simulated and real data, LATE and TRANSLATE outperform existing scRNA-seq imputation methods, achieving lower mean squared error in most cases, recovering nonlinear gene-gene relationships, and better separating cell types. They are also highly scalable and can efficiently process over 1 million cells in just a few hours on a GPU.ConclusionsWe demonstrate that our nonparametric approach to imputation based on autoencoders is powerful and highly efficient.

Download Full-text

THE DEVELOPMENT OF THE MATHEMATICAL MODEL OF PREDICTING THE INDICATORS OF ACCREDITATION OF TECHNICAL UNIVERSITIES IN THE RUSSIAN FEDERATION

VESTNIK OF ASTRAKHAN STATE TECHNICAL UNIVERSITY SERIES MANAGEMENT COMPUTER SCIENCE AND INFORMATICS ◽

10.24143/2072-9502-2017-2-27-38 ◽

2017 ◽

pp. 27-38

Author(s):

Olga Mikhaylovna Tikhonova ◽

Alexander Fedorovich Rezchikov ◽

Vladimir Andreevich Ivashchenko ◽

Vadim Alekseevich Kushnikov

Keyword(s):

Mathematical Model ◽

System Dynamics ◽

Regression Model ◽

Oriented Graph ◽

Real Data ◽

Educational Process ◽

Proposed Model ◽

Linear Differential ◽

The University ◽

The Mathematical Model

The paper presents the system of predicting the indicators of accreditation of technical universities based on J. Forrester mechanism of system dynamics. According to analysis of cause-and-effect relationships between selected variables of the system (indicators of accreditation of the university) there was built the oriented graph. The complex of mathematical models developed to control the quality of training engineers in Russian higher educational institutions is based on this graph. The article presents an algorithm for constructing a model using one of the simulated variables as an example. The model is a system of non-linear differential equations, the modelling characteristics of the educational process being determined according to the solution of this system. The proposed algorithm for calculating these indicators is based on the system dynamics model and the regression model. The mathematical model is constructed on the basis of the model of system dynamics, which is further tested for compliance with real data using the regression model. The regression model is built on the available statistical data accumulated during the period of the university's work. The proposed approach is aimed at solving complex problems of managing the educational process in universities. The structure of the proposed model repeats the structure of cause-effect relationships in the system, and also provides the person responsible for managing quality control with the ability to quickly and adequately assess the performance of the system.

Download Full-text

EXPONENTIATED HALF-LOGISTIC LOMAX DISTRIBUTION WITH PROPERTIES AND APPLICATION

NED University Journal of Research ◽

10.35453/nedjr-ascn-2018-0033 ◽

2019 ◽

Vol XVI (2) ◽

pp. 1-11

Author(s):

Farrukh Jamal ◽

Hesham Mohammed Reyad ◽

Soha Othman Ahmed ◽

Muhammad Akbar Ali Shah ◽

Emrah Altun

Keyword(s):

Real Data ◽

Continuous Model ◽

Model Parameters ◽

Data Set ◽

Lomax Distribution ◽

Mathematical Properties ◽

Proposed Model ◽

Probability Weighted Moment ◽

Record Statistics ◽

Maximum Likelihood Criterion

A new three-parameter continuous model called the exponentiated half-logistic Lomax distribution is introduced in this paper. Basic mathematical properties for the proposed model were investigated which include raw and incomplete moments, skewness, kurtosis, generating functions, Rényi entropy, Lorenz, Bonferroni and Zenga curves, probability weighted moment, stress strength model, order statistics, and record statistics. The model parameters were estimated by using the maximum likelihood criterion and the behaviours of these estimates were examined by conducting a simulation study. The applicability of the new model is illustrated by applying it on a real data set.

Download Full-text

Evaluation for estimating of the PDF and the CDF of Generalized Inverted Exponential Distribution with Application in Industry

Advances in Mathematics: Scientific Journal ◽

10.37418/amsj.9.1.39 ◽

2020 ◽

pp. 507-522

Author(s):

Parisa Torkaman

Keyword(s):

Least Squares ◽

Exponential Distribution ◽

Mean Squared Error ◽

Weighted Least Squares ◽

Real Data ◽

Minimum Variance ◽

Cumulative Distribution ◽

Estimation Methods ◽

Data Set ◽

Better Than

The generalized inverted exponential distribution is introduced as a lifetime model with good statistical properties. This paper, the estimation of the probability density function and the cumulative distribution function of with five different estimation methods: uniformly minimum variance unbiased(UMVU), maximum likelihood(ML), least squares(LS), weighted least squares (WLS) and percentile(PC) estimators are considered. The performance of these estimation procedures, based on the mean squared error (MSE) by numerical simulations are compared. Simulation studies express that the UMVU estimator performs better than others and when the sample size is large enough the ML and UMVU estimators are almost equivalent and efficient than LS, WLS and PC. Finally, the result using a real data set are analyzed.

Download Full-text

Modeling Population Spatial-Temporal Distribution Using Taxis Origin and Destination Data

Sustainability ◽

10.3390/su13073727 ◽

2021 ◽

Vol 13 (7) ◽

pp. 3727

Author(s):

Fatema Rahimi ◽

Abolghasem Sadeghi-Niaraki ◽

Mostafa Ghodousi ◽

Soo-Mi Choi

Keyword(s):

Regression Analysis ◽

Mean Squared Error ◽

Population Distribution ◽

Temporal Distribution ◽

Coefficient Of Determination ◽

Temporal Modeling ◽

Location Data ◽

Time Period ◽

Proposed Model

During dangerous circumstances, knowledge about population distribution is essential for urban infrastructure architecture, policy-making, and urban planning with the best Spatial-temporal resolution. The spatial-temporal modeling of the population distribution of the case study was investigated in the present study. In this regard, the number of generated trips and absorbed trips using the taxis pick-up and drop-off location data was calculated first, and the census population was then allocated to each neighborhood. Finally, the Spatial-temporal distribution of the population was calculated using the developed model. In order to evaluate the model, a regression analysis between the census population and the predicted population for the time period between 21:00 to 23:00 was used. Based on the calculation of the number of generated and the absorbed trips, it showed a different spatial distribution for different hours in one day. The spatial pattern of the population distribution during the day was different from the population distribution during the night. The coefficient of determination of the regression analysis for the model (R2) was 0.9998, and the mean squared error was 10.78. The regression analysis showed that the model works well for the nighttime population at the neighborhood level, so the proposed model will be suitable for the day time population.

Download Full-text

An adaptive social distancing SIR model for COVID-19 disease spreading and forecasting

Epidemiologic Methods ◽

10.1515/em-2020-0044 ◽

2021 ◽

Vol 10 (s1) ◽

Author(s):

Said Gounane ◽

Yassir Barkouch ◽

Abdelghafour Atlas ◽

Mostafa Bendahmane ◽

Fahd Karami ◽

...

Keyword(s):

Mathematical Theory ◽

Real Data ◽

Sir Model ◽

Epidemic Threshold ◽

Social Distancing ◽

Sir Epidemic ◽

Proposed Model ◽

Disease Spreading ◽

The Government ◽

The Impact

Abstract Recently, various mathematical models have been proposed to model COVID-19 outbreak. These models are an effective tool to study the mechanisms of coronavirus spreading and to predict the future course of COVID-19 disease. They are also used to evaluate strategies to control this pandemic. Generally, SIR compartmental models are appropriate for understanding and predicting the dynamics of infectious diseases like COVID-19. The classical SIR model is initially introduced by Kermack and McKendrick (cf. (Anderson, R. M. 1991. “Discussion: the Kermack–McKendrick Epidemic Threshold Theorem.” Bulletin of Mathematical Biology 53 (1): 3–32; Kermack, W. O., and A. G. McKendrick. 1927. “A Contribution to the Mathematical Theory of Epidemics.” Proceedings of the Royal Society 115 (772): 700–21)) to describe the evolution of the susceptible, infected and recovered compartment. Focused on the impact of public policies designed to contain this pandemic, we develop a new nonlinear SIR epidemic problem modeling the spreading of coronavirus under the effect of a social distancing induced by the government measures to stop coronavirus spreading. To find the parameters adopted for each country (for e.g. Germany, Spain, Italy, France, Algeria and Morocco) we fit the proposed model with respect to the actual real data. We also evaluate the government measures in each country with respect to the evolution of the pandemic. Our numerical simulations can be used to provide an effective tool for predicting the spread of the disease.

Download Full-text

Kumaraswamy Generalized Power Lomax Distributionand Its Applications

Stats ◽

10.3390/stats4010003 ◽

2021 ◽

Vol 4 (1) ◽

pp. 28-45

Author(s):

Vasili B.V. Nagarjuna ◽

R. Vishnu Vardhan ◽

Christophe Chesneau

Keyword(s):

Hazard Rate ◽

Real Data ◽

Rate Function ◽

Maximum Likelihood Estimates ◽

Parameter Estimates ◽

Parameter Distribution ◽

Data Sets ◽

Lomax Distribution ◽

Entropy Measures ◽

Modeling Behavior

In this paper, a new five-parameter distribution is proposed using the functionalities of the Kumaraswamy generalized family of distributions and the features of the power Lomax distribution. It is named as Kumaraswamy generalized power Lomax distribution. In a first approach, we derive its main probability and reliability functions, with a visualization of its modeling behavior by considering different parameter combinations. As prime quality, the corresponding hazard rate function is very flexible; it possesses decreasing, increasing and inverted (upside-down) bathtub shapes. Also, decreasing-increasing-decreasing shapes are nicely observed. Some important characteristics of the Kumaraswamy generalized power Lomax distribution are derived, including moments, entropy measures and order statistics. The second approach is statistical. The maximum likelihood estimates of the parameters are described and a brief simulation study shows their effectiveness. Two real data sets are taken to show how the proposed distribution can be applied concretely; parameter estimates are obtained and fitting comparisons are performed with other well-established Lomax based distributions. The Kumaraswamy generalized power Lomax distribution turns out to be best by capturing fine details in the structure of the data considered.

Download Full-text

Assessing the Performance of Random Forests for Modeling Claim Severity in Collision Car Insurance

Risks ◽

10.3390/risks9030053 ◽

2021 ◽

Vol 9 (3) ◽

pp. 53

Author(s):

Yves Staudt ◽

Joël Wagner

Keyword(s):

Random Forests ◽

Goodness Of Fit ◽

Generalized Additive Model ◽

Mean Squared Error ◽

Additive Model ◽

Continuous Variables ◽

Insurance Portfolio ◽

Normal Transformation ◽

Car Insurance ◽

Log Normal

For calculating non-life insurance premiums, actuaries traditionally rely on separate severity and frequency models using covariates to explain the claims loss exposure. In this paper, we focus on the claim severity. First, we build two reference models, a generalized linear model and a generalized additive model, relying on a log-normal distribution of the severity and including the most significant factors. Thereby, we relate the continuous variables to the response in a nonlinear way. In the second step, we tune two random forest models, one for the claim severity and one for the log-transformed claim severity, where the latter requires a transformation of the predicted results. We compare the prediction performance of the different models using the relative error, the root mean squared error and the goodness-of-lift statistics in combination with goodness-of-fit statistics. In our application, we rely on a dataset of a Swiss collision insurance portfolio covering the loss exposure of the period from 2011 to 2015, and including observations from 81 309 settled claims with a total amount of CHF 184 mio. In the analysis, we use the data from 2011 to 2014 for training and from 2015 for testing. Our results indicate that the use of a log-normal transformation of the severity is not leading to performance gains with random forests. However, random forests with a log-normal transformation are the favorite choice for explaining right-skewed claims. Finally, when considering all indicators, we conclude that the generalized additive model has the best overall performance.

Download Full-text