Performance appraisal of validation techniques in R

Author(s):  
M. Iqbal Jeelani Bhat ◽  
Manish Kr. Sharma ◽  
Khalid-ul Islam ◽  
Rizwan Yousuf ◽  
Zakir Hussain

In this article, various statistical models were fitted using simulated symmetric and asymmetric data. Model fitting was carried out with the help of libraries such as minpack.lm, matrices and nlme in RStudio (version 3.5.1, 2018), and selection criteria such as RMSE, MAE, AIC and BIC were used to compare the fitted models. To evaluate the different validation techniques, the simulated data were divided into training and testing data sets, and various functions were developed in R for the purpose of validation. The coefficient summary revealed that all statistical models were statistically significant across both the symmetric and asymmetric distributions. In the preliminary analysis, the TFEM (Type First Exponential Model) was found to be the best linear model across the distributions, with lower values of RMSE, MAE, bias, AIC and BIC. Among the non-linear models, the Huang model was found to be the best across both distributions, as it had lower values of RMSE, MAE, etc. Different validation techniques, namely half-splitting, LOOCV and 5-fold cross-validation, were used in the present study. Based on the evaluation results, 5-fold cross-validation performed best, as it resulted in lower prediction error rates than its counterparts.
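
As a rough illustration of the validation schemes compared above, the following is a minimal sketch in base R; the data, model formula and error metric are placeholder assumptions, not the authors' actual models or functions.

```r
set.seed(42)
# placeholder simulated data: a simple linear relationship with noise
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- 2 + 0.8 * dat$x + rnorm(200)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# half-splitting: fit on one half, test on the other
half <- sample(nrow(dat), nrow(dat) / 2)
fit  <- lm(y ~ x, data = dat[half, ])
rmse_half <- rmse(dat$y[-half], predict(fit, dat[-half, ]))

# k-fold cross-validation (k = 5 below; k = nrow(dat) gives LOOCV)
kfold_rmse <- function(data, k) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  errs <- sapply(seq_len(k), function(i) {
    fit_i <- lm(y ~ x, data = data[folds != i, ])
    rmse(data$y[folds == i], predict(fit_i, data[folds == i, ]))
  })
  mean(errs)
}

c(half  = rmse_half,
  fold5 = kfold_rmse(dat, k = 5),
  loocv = kfold_rmse(dat, k = nrow(dat)))
```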

Author(s):  
Benjamin C. Fortune ◽  
Lachlan R. McKenzie ◽  
Logan T. Chatfield ◽  
Christopher G. Pretty

Abstract: This paper presents a method to estimate the individual component values of a bipolar electrode-skin interface, with the future intent of applying compensatory electrode-skin impedance balancing prior to recording bio-signals with electromyography. The electrode-skin interface was stimulated by a step input, and the output behaviour was characterised using a single exponential model per electrode. The method was applied to simulated circuitry, passive component circuitry and a human subject. The accuracy of the method was determined using the known values that comprised the simulated and passive component circuitry. Nine of ten simulated data sets resulted in accurate estimations, with a maximum error of 0.763% and a mean error of 0.076% per component. The method also produced successful estimates for nine of the ten physical circuitry data sets, with a maximum error of 10.2% and a mean error of 3.49% per component. The method was unsuccessful in estimating the individual electrode-skin impedance components for the human subject; this was because the system failed to reach steady state during the stimulation period. The authors suspect that a DC offset caused by the half-cell potentials associated with the electrode-skin interface was the cause of the unexpected behaviour.
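
To make the characterisation step concrete, here is a minimal sketch of a single-exponential step-response fit in R with nls(); the simulated voltages, noise level and time constant are assumptions for illustration, not the authors' circuit values.

```r
set.seed(1)
# assumed first-order step response: v(t) = V_inf + (V_0 - V_inf) * exp(-t / tau)
t <- seq(0, 0.5, by = 0.001)                    # seconds
v_true <- 1.0 + (0.2 - 1.0) * exp(-t / 0.05)    # assumed V_inf = 1 V, V_0 = 0.2 V, tau = 50 ms
v_obs  <- v_true + rnorm(length(t), sd = 0.01)  # measurement noise

# fit the single exponential model for one electrode with nonlinear least squares
fit <- nls(v_obs ~ Vinf + (V0 - Vinf) * exp(-t / tau),
           start = list(Vinf = max(v_obs), V0 = min(v_obs), tau = 0.02))
coef(fit)   # estimated steady-state level, initial level and time constant

# with an assumed known stimulation current, tau and the two voltage levels map
# onto the resistive and capacitive components of the electrode-skin model
```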


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze the growth of a trait latent variable measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data with time points freely specified over subjects. Furthermore, features useful for longitudinal analyses, a first-order ("degree one") autoregressive error structure for the trait residuals and estimated time scores, were included. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.
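
To make the three-level structure concrete, the following base-R sketch simulates ordinal items nested within time points nested within subjects; all loadings, thresholds and variances are invented for illustration and this is not the paper's WinBUGS model.

```r
set.seed(1)
n_subj <- 50; n_time <- 4; n_item <- 5

subj_int   <- rnorm(n_subj, 0, 1)        # subject-level random intercepts
subj_slope <- rnorm(n_subj, 0.5, 0.2)    # subject-level growth rates

dat <- expand.grid(item = 1:n_item, time = 0:(n_time - 1), subj = 1:n_subj)

# latent trait per subject and time point (linear growth)
trait <- subj_int[dat$subj] + subj_slope[dat$subj] * dat$time

# factor-analytic part: item loadings on the trait, plus item-level noise
loading <- seq(0.6, 1.0, length.out = n_item)
eta <- loading[dat$item] * trait + rnorm(nrow(dat), 0, 0.5)

# ordinal responses obtained by cutting the latent response at assumed thresholds
dat$y <- cut(eta, breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE)
head(dat)
```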


2009 ◽  
Vol 91 (6) ◽  
pp. 427-436 ◽  
Author(s):  
M. GRAZIANO USAI ◽  
MIKE E. GODDARD ◽  
BEN J. HAYES

Summary: We used a least absolute shrinkage and selection operator (LASSO) approach to estimate marker effects for genomic selection. The least angle regression (LARS) algorithm and cross-validation were used to define the best subset of markers to include in the model. The LASSO–LARS approach was tested on two data sets: a simulated data set with 5865 individuals and 6000 single nucleotide polymorphisms (SNPs), and a mouse data set with 1885 individuals genotyped for 10 656 SNPs and phenotyped for a number of quantitative traits. In the simulated data, three approaches were used to split the reference population into training and validation subsets for cross-validation: random splitting across the whole population, and random sampling of the validation set from the last generation only, either within or across families. The highest accuracy was obtained by random splitting across the whole population. The accuracy of genomic estimated breeding values (GEBVs) in the candidate population obtained by LASSO–LARS was 0.89 with 156 explanatory SNPs. This value was higher than those obtained by best linear unbiased prediction (BLUP) and a Bayesian method (BayesA), which were 0.75 and 0.84, respectively. In the mouse data, 1600 individuals were randomly allocated to the reference population. The GEBVs for the remaining 285 individuals estimated by LASSO–LARS were more accurate than those obtained by BLUP and BayesA for weight at six weeks, and slightly less accurate for growth rate and body length. It was concluded that the LASSO–LARS approach is a good alternative method for estimating marker effects for genomic selection, particularly when the cost of genotyping can be reduced by using a limited subset of markers.
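
A minimal sketch of LASSO marker-effect estimation with cross-validation is given below, using the glmnet package as a stand-in for the LARS implementation used in the paper; the marker matrix, phenotypes and dimensions are simulated assumptions.

```r
library(glmnet)

set.seed(7)
n <- 500; p <- 2000                               # assumed numbers of individuals and SNPs
X <- matrix(rbinom(n * p, 2, 0.3), n, p)          # SNP genotypes coded 0/1/2
qtl <- sample(p, 20)                              # 20 assumed causal markers
y <- as.numeric(X[, qtl] %*% rnorm(20, 0, 0.5) + rnorm(n))

# 5-fold cross-validation to choose the LASSO penalty (the "best subset" step)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 5)
beta  <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # estimated marker effects
sum(beta != 0)                                           # explanatory SNPs retained

# genomic estimated breeding values (GEBVs) for new candidate individuals
X_new <- matrix(rbinom(100 * p, 2, 0.3), 100, p)
gebv  <- predict(cvfit, newx = X_new, s = "lambda.min")
```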


2021 ◽  
Author(s):  
Simon Grund ◽  
Oliver Lüdtke ◽  
Alexander Robitzsch

Likelihood ratio tests (LRTs) are a popular tool for comparing statistical models. Missing data, however, are common in empirical research, and multiple imputation (MI) is often used to deal with them. In multiply imputed data, there are multiple options for conducting LRTs, and new methods are still being proposed. In this article, we compare all available methods in multiple simulations covering applications in linear regression, generalized linear models, and structural equation modeling (SEM). In addition, we implemented these methods in an R package, and we illustrate its application in an example analysis concerned with the investigation of measurement invariance.
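
As one concrete illustration of a pooled LRT after multiple imputation, the sketch below uses the likelihood-ratio-based D3 pooling method as implemented in the mice package; the data and models are examples only, and this is not necessarily the package referred to in the article.

```r
library(mice)

imp <- mice(nhanes, m = 20, seed = 1, printFlag = FALSE)  # impute the built-in nhanes data

fit1 <- with(imp, lm(chl ~ age + bmi))   # full model, fitted in each imputed data set
fit0 <- with(imp, lm(chl ~ age))         # restricted model

D3(fit1, fit0)                           # pooled likelihood-ratio test of the bmi effect
```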


Author(s):  
Kyungkoo Jun

Background & Objective: This paper proposes a Fourier-transform-inspired method to classify human activities from time-series sensor data. Methods: Our method begins by decomposing a 1D input signal into 2D patterns, a step motivated by the Fourier transform. The decomposition is performed with the help of a Long Short-Term Memory (LSTM) network, which captures the temporal dependencies in the signal and produces encoded sequences. The sequences, once arranged into a 2D array, represent fingerprints of the signals. The benefit of this transformation is that we can exploit recent advances in deep learning models for image classification, such as Convolutional Neural Networks (CNNs). Results: The proposed model is therefore a combination of an LSTM and a CNN. We evaluate the model on two data sets. For the first data set, which is more standardized than the other, our model outperforms or at least equals previous works. For the second data set, we devise schemes to generate training and testing data by varying the window size, the sliding size and the labeling scheme. Conclusion: The evaluation results show that the accuracy exceeds 95% in some cases. We also analyze the effect of these parameters on performance.
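
A minimal sketch of the LSTM-to-2D-fingerprint-to-CNN idea using the keras package for R is shown below; the window length, layer sizes and number of activity classes are assumptions, not the paper's architecture.

```r
library(keras)

len <- 128; n_classes <- 6                     # assumed window length and activity classes

input <- layer_input(shape = c(len, 1))        # 1D sensor signal
enc <- input %>%
  layer_lstm(units = 64, return_sequences = TRUE)   # encoded sequence: len x 64
fingerprint <- enc %>%
  layer_reshape(target_shape = c(len, 64, 1))       # arrange as a 2D "image"
output <- fingerprint %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = n_classes, activation = "softmax")

model <- keras_model(input, output)
model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
summary(model)
```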


2019 ◽  
Vol 45 (9) ◽  
pp. 1183-1198
Author(s):  
Gaurav S. Chauhan ◽  
Pradip Banerjee

Purpose: Recent papers on target capital structure show that debt ratios seem to vary widely across firms and over time, implying that functional specifications of target debt ratios are of little empirical use. Further, target behavior cannot be judged correctly using debt ratios, as they can revert for purely mechanical reasons. The purpose of this paper is to develop an alternative testing strategy for the target capital structure. Design/methodology/approach: The authors use a major "shock" to the debt ratios as an event and treat a subsequent reversion as a movement toward a mean or target debt ratio. By doing this, the authors no longer need to specify target debt ratios as a function of firm-specific variables or any other rigid functional form. Findings: Similar to the broad empirical evidence in developed economies, there is no perceptible and systematic mean reversion by Indian firms. However, unlike in developed countries, proportionate usage of debt to finance firms' marginal financing deficits is extensive; equity is used rather sparingly. Research limitations/implications: The trade-off theory could be convincingly refuted, at least for the emerging market of India. The paper should also stimulate further research into the reasons for the specific financing behavior of emerging-market firms. Practical implications: The results show that firms' financing choices depend not only on firm-specific variables but also on the financial markets in which they operate. Originality/value: This study attempts to assess mean reversion in debt ratios in a unique but reassuring manner. The results are confirmed by extensive calibration of the testing strategy using simulated data sets.
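
A toy base-R illustration of the shock-and-reversion idea on simulated data follows; the target ratio, shock size and adjustment speed are arbitrary assumptions and not estimates from the paper.

```r
set.seed(3)
n_years <- 20
target  <- 0.30          # assumed target debt ratio
lambda  <- 0.25          # assumed speed of adjustment per year (0 = no reversion)

debt <- numeric(n_years)
debt[1] <- target + 0.20 # a major positive "shock" to the debt ratio in year 1
for (t in 2:n_years) {
  # partial adjustment toward the target plus noise; with lambda = 0 the series
  # would wander and show no systematic mean reversion
  debt[t] <- debt[t - 1] + lambda * (target - debt[t - 1]) + rnorm(1, 0, 0.02)
}
round(debt, 3)           # drift back toward 0.30 is visible within a few years
```

Tracking whether shocked firms drift back toward a level in this way, rather than estimating a rigid functional target, is the essence of the testing strategy described above.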


2021 ◽  
pp. 1-36
Author(s):  
Henry Prakken ◽  
Rosa Ratsma

This paper proposes a formal top-level model for explaining the outputs of machine-learning-based decision-making applications and evaluates it experimentally with three data sets. The model draws on AI & law research on argumentation with cases, which models how lawyers draw analogies to past cases and discuss their relevant similarities and differences in terms of relevant factors and dimensions in the problem domain. A case-based approach is natural since the input data of machine-learning applications can be seen as cases. While the approach is motivated by legal decision making, it also applies to other kinds of decision making, such as commercial decisions about loan applications or employee hiring, as long as the outcome is binary and the input conforms to this paper's factor or dimension format. The model is top-level in that it can be extended with more refined accounts of similarities and differences between cases. It is shown to overcome several limitations of similar argumentation-based explanation models, which handle only binary features and do not represent the tendency of features towards particular outcomes. The results of the experimental evaluation studies indicate that the model may be feasible in practice, but that further development and experimentation are needed to confirm its usefulness as an explanation model. The main challenges here are selecting from a large number of possible explanations, reducing the number of features in the explanations and adding more meaningful information to them. It also remains to be investigated how suitable our approach is for explaining non-linear models.
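
The sketch below illustrates, in base R, the kind of factor-based comparison between a focus case and a cited precedent that such explanation models build on; the factor names, polarities and cases are invented for illustration and are not taken from the paper's data sets or its precise definitions.

```r
# each case is a set of boolean factors; each factor favours one outcome ("pro"/"con")
factor_side <- c(f1 = "pro", f2 = "pro", f3 = "con", f4 = "con")   # assumed polarities

precedent <- list(factors = c("f1", "f3"), outcome = "pro")        # decided case
focus     <- list(factors = c("f1", "f2", "f3"))                   # new case to explain

pro_factors <- names(factor_side)[factor_side == "pro"]
con_factors <- names(factor_side)[factor_side == "con"]

# the focus case is "at least as good" for the precedent's outcome if it shares all
# of the precedent's pro-outcome factors and adds no con-outcome factors of its own
missing_pro <- setdiff(intersect(precedent$factors, pro_factors), focus$factors)
extra_con   <- setdiff(intersect(focus$factors, con_factors), precedent$factors)

if (length(missing_pro) == 0 && length(extra_con) == 0) {
  cat("Citing the precedent supports outcome:", precedent$outcome, "\n")
} else {
  cat("Relevant differences:", c(missing_pro, extra_con), "\n")
}
```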


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 474
Author(s):  
Abdulhakim A. Al-Babtain ◽  
Ibrahim Elbatal ◽  
Hazem Al-Mofleh ◽  
Ahmed M. Gemeay ◽  
Ahmed Z. Afify ◽  
...  

In this paper, we introduce a new flexible generator of continuous distributions called the transmuted Burr X-G (TBX-G) family to extend and increase the flexibility of the Burr X generator. The general statistical properties of the TBX-G family are calculated. One special sub-model, the TBX-exponential distribution, is studied in detail. We discuss eight approaches for estimating the TBX-exponential parameters, and numerical simulations are conducted to compare the suggested approaches based on partial and overall ranks. Based on our study, the Anderson–Darling estimators are recommended for estimating the TBX-exponential parameters. Using two skewed real data sets from the engineering sciences, we illustrate the importance and flexibility of the TBX-exponential model compared with other existing competing distributions.
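
As an illustration of Anderson–Darling (minimum-distance) estimation, the sketch below fits a plain exponential distribution in base R; the actual TBX-exponential CDF and the real data sets are not reproduced here, so the exponential CDF is a stand-in.

```r
set.seed(5)
x <- sort(rexp(100, rate = 1.5))     # assumed data; stand-in for a real, skewed data set
n <- length(x)

# Anderson-Darling statistic for a candidate CDF F(.; theta), here F = pexp(., rate)
ad_stat <- function(rate) {
  Fx <- pexp(x, rate = rate)
  Fx <- pmin(pmax(Fx, 1e-10), 1 - 1e-10)           # guard the logarithms
  i  <- seq_len(n)
  -n - mean((2 * i - 1) * (log(Fx) + log(1 - rev(Fx))))
}

# Anderson-Darling estimator: the parameter value minimising the statistic
fit <- optimize(ad_stat, interval = c(0.01, 10))
fit$minimum        # estimated rate; the same idea applies with the TBX-exponential CDF
```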


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Menelaos Pavlou ◽  
Gareth Ambler ◽  
Rumana Z. Omar

Abstract Background: Clustered data arise in research when patients are clustered within larger units. Generalised Estimating Equations (GEE) and Generalised Linear Mixed Models (GLMM) can be used to provide marginal and cluster-specific inference and predictions, respectively. Methods: Confounding by Cluster (CBC) and Informative Cluster Size (ICS) are two complications that may arise when modelling clustered data. CBC can arise when the distribution of a predictor variable (termed the 'exposure') varies between clusters, causing confounding of the exposure-outcome relationship. ICS means that the cluster size conditional on covariates is not independent of the outcome. In both situations, standard GEE and GLMM may provide biased or misleading inference, and modifications have been proposed. However, both CBC and ICS are routinely overlooked in the context of risk prediction, and their impact on the predictive ability of models has been little explored. We study the effect of CBC and ICS on the predictive ability of risk models for binary outcomes when GEE and GLMM are used. We examine whether two simple approaches to handle CBC and ICS, which involve adjusting for the cluster mean of the exposure and the cluster size, respectively, can improve the accuracy of predictions. Results: Both CBC and ICS can be viewed as violations of the assumptions of the standard GLMM: the random effects are correlated with the exposure under CBC and with the cluster size under ICS. Based on these principles, we simulated data subject to CBC/ICS. The simulation studies suggested that the predictive ability of models derived using standard GLMM and GEE while ignoring CBC/ICS was affected; marginal predictions were found to be miscalibrated. Adjusting for the cluster mean of the exposure or the cluster size improved the calibration, discrimination and overall predictive accuracy of marginal predictions by explaining part of the between-cluster variability. The presence of CBC/ICS did not affect the accuracy of conditional predictions. We illustrate these concepts using real data from a multicentre study with potential CBC. Conclusion: Ignoring CBC and ICS when developing prediction models for clustered data can affect the accuracy of marginal predictions. Adjusting for the cluster mean of the exposure or the cluster size can improve the predictive accuracy of marginal predictions.
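
A minimal sketch of the cluster-mean adjustment for a binary outcome follows, using lme4 for the GLMM and geepack for the GEE as assumed implementation choices; the simulated data are constructed so that the exposure is correlated with the cluster effect (CBC).

```r
library(lme4)
library(geepack)

set.seed(11)
n_clust <- 50; m <- 20
cluster <- rep(seq_len(n_clust), each = m)
u <- rnorm(n_clust, 0, 1)                         # cluster random effects
x <- rnorm(n_clust * m, mean = 0.5 * u[cluster])  # exposure correlated with the
                                                  # random effect: confounding by cluster
p <- plogis(-1 + x + u[cluster])
y <- rbinom(n_clust * m, 1, p)
dat <- data.frame(y, x, cluster, x_bar = ave(x, cluster))  # cluster mean of the exposure

# standard cluster-specific (GLMM) and marginal (GEE) models ignoring CBC
glmm0 <- glmer(y ~ x + (1 | cluster), data = dat, family = binomial)
gee0  <- geeglm(y ~ x, id = cluster, data = dat, family = binomial,
                corstr = "exchangeable")

# the simple fix discussed above: adjust for the cluster mean of the exposure
glmm1 <- glmer(y ~ x + x_bar + (1 | cluster), data = dat, family = binomial)
gee1  <- geeglm(y ~ x + x_bar, id = cluster, data = dat, family = binomial,
                corstr = "exchangeable")
```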


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 949
Author(s):  
Jiangyi Wang ◽  
Min Liu ◽  
Xinwu Zeng ◽  
Xiaoqiang Hua

Convolutional neural networks perform strongly in many visual tasks because of their hierarchical structures and powerful feature-extraction capabilities. SPD (symmetric positive definite) matrices have attracted attention in visual classification because of their excellent ability to learn proper statistical representations and to distinguish samples carrying different information. In this paper, a deep neural network signal detection method based on spectral convolution features is proposed. In this method, local features extracted from a convolutional neural network are used to construct the SPD matrix, and a deep learning algorithm for SPD matrices is used to detect target signals. Feature maps extracted by two kinds of convolutional neural network models are applied in this study. With this method, signal detection becomes a binary classification problem on the samples. To demonstrate the effectiveness and superiority of the method, simulated and semi-physical simulated data sets are used. The results show that, under a low signal-to-clutter ratio (SCR), the method obtains a gain of 0.5–2 dB over the deep-neural-network-based spectral signal detection method on both simulated and semi-physical simulated data sets.
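
A minimal sketch of constructing an SPD descriptor from local features is given below; the "feature maps" are random placeholders rather than outputs of an actual convolutional network, and the log-Euclidean mapping at the end is one common choice, not necessarily the paper's algorithm.

```r
set.seed(9)
# placeholder for CNN output: d feature maps observed at n spatial locations
n_loc <- 256; d <- 32
feats <- matrix(rnorm(n_loc * d), n_loc, d)

# SPD descriptor: sample covariance of the local features plus a small ridge
# to guarantee strict positive definiteness
spd <- cov(feats) + diag(1e-6, d)

all(eigen(spd, symmetric = TRUE, only.values = TRUE)$values > 0)   # TRUE: SPD

# a common next step is to map the SPD matrix to its matrix logarithm so that
# standard (Euclidean) classifiers can operate on the flattened descriptor
e <- eigen(spd, symmetric = TRUE)
log_spd <- e$vectors %*% diag(log(e$values)) %*% t(e$vectors)
```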

