Approximate confidence intervals for the difference in proportions for partially observed binary data

2021 ◽  
pp. 096228022110605
Author(s):  
Ujjwal Das ◽  
Ranojoy Basu

We consider partially observed binary matched-pair data and assume that the incompletely observed subjects are missing at random. Within this missing-data framework, we propose an EM-algorithm-based approach to construct an interval estimator of the proportion difference that incorporates all the subjects. In conjunction with the proposed method, we also present two improvements to the interval estimator through correction factors. The performance of the three competing methods is then evaluated through extensive simulation. A recommendation among the methods is given based on their ability to preserve the type I error rate for various sample sizes. Finally, the methods are illustrated on two real-world data sets. An R function is developed to implement the three proposed methods.
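For orientation, the sketch below shows the quantity being estimated: a Wald-type confidence interval for the difference in marginal proportions, computed from complete pairs only. It ignores the missing-data problem the paper addresses and is not the authors' EM-based estimator; the counts are hypothetical.

```python
# Complete-case Wald interval for the paired difference in proportions.
# NOT the EM-based estimator of the paper; a minimal reference sketch that
# simply drops subjects with a missing response.
import numpy as np
from scipy.stats import norm

def paired_diff_ci(n11, n10, n01, n00, level=0.95):
    """CI for p1 - p2 from a 2x2 matched-pair table of complete cases."""
    n = n11 + n10 + n01 + n00
    d = (n10 - n01) / n                                 # estimated difference
    var = (n10 + n01 - (n10 - n01) ** 2 / n) / n ** 2   # Wald variance
    z = norm.ppf(0.5 + level / 2)
    return d - z * np.sqrt(var), d + z * np.sqrt(var)

print(paired_diff_ci(40, 15, 5, 40))   # hypothetical counts
```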

2021 ◽  
pp. 121-142
Author(s):  
Charles Auerbach

This chapter covers tests of statistical significance that can be used to compare data across phases. These tests are used to determine whether observed outcomes are likely the result of an intervention or more likely the result of chance. The purpose of a statistical test is to determine how likely it is that the analyst is making an incorrect decision by rejecting the null hypothesis and accepting the alternative. A number of tests of significance are presented in this chapter: statistical process control (SPC) charts, proportion/frequency, chi-square, the conservative dual criteria (CDC), the robust conservative dual criteria (RCDC), the t test, and analysis of variance (ANOVA). How and when to use each of these tests is also discussed, along with methods for transforming autocorrelated data and for merging data sets. Once new data sets are created using the Append() function, they can be tested for Type I error using the techniques discussed in the chapter.
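As a minimal illustration of a phase comparison, the sketch below applies a two-sample t test to hypothetical baseline and intervention scores. It does not reproduce the chapter's own tools (such as the Append() function mentioned above) and assumes independent observations; as the chapter notes, autocorrelated data should be transformed first.

```python
# Minimal sketch (not from the chapter): comparing two phases with a t test.
# Assumes the observations within each phase are independent.
from scipy import stats

baseline     = [7, 8, 6, 9, 7, 8]     # hypothetical phase A scores
intervention = [5, 4, 6, 3, 4, 5]     # hypothetical phase B scores

t, p = stats.ttest_ind(baseline, intervention, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")    # reject H0 at alpha = 0.05 if p < 0.05
```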


2018 ◽  
Vol 20 (6) ◽  
pp. 2055-2065 ◽  
Author(s):  
Johannes Brägelmann ◽  
Justo Lorenzo Bermejo

Abstract Technological advances and reduced costs of high-density methylation arrays have led to an increasing number of association studies on the possible relationship between human disease and epigenetic variability. DNA samples from peripheral blood or other tissue types are analyzed in epigenome-wide association studies (EWAS) to detect methylation differences related to a particular phenotype. Since information on the cell-type composition of the sample is generally not available and methylation profiles are cell-type specific, statistical methods have been developed for adjustment of cell-type heterogeneity in EWAS. In this study we systematically compared five popular adjustment methods: the factored spectrally transformed linear mixed model (FaST-LMM-EWASher), the sparse principal component analysis algorithm ReFACTor, surrogate variable analysis (SVA), independent SVA (ISVA) and an optimized version of SVA (SmartSVA). We used real data and applied a multilayered simulation framework to assess the type I error rate, the statistical power and the quality of estimated methylation differences according to major study characteristics. While all five adjustment methods improved false-positive rates compared with unadjusted analyses, FaST-LMM-EWASher resulted in the lowest type I error rate at the expense of low statistical power. SVA efficiently corrected for cell-type heterogeneity in EWAS up to 200 cases and 200 controls, but did not control type I error rates in larger studies. Results based on real data sets confirmed simulation findings with the strongest control of type I error rates by FaST-LMM-EWASher and SmartSVA. Overall, ReFACTor, ISVA and SmartSVA showed mutually comparable results and the best statistical power, quality of estimated methylation differences and runtime.
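The type I error assessment rests on a simple principle: when the phenotype is generated independently of the methylation data, every association is a false positive, so the rejection fraction at the nominal level estimates the type I error rate. The sketch below illustrates this with unadjusted per-CpG t tests on simulated data; it is not the paper's multilayered simulation framework and includes no cell-type adjustment.

```python
# Schematic null simulation: with a phenotype generated independently of the
# methylation matrix, the fraction of CpGs with p < 0.05 estimates the
# per-test type I error rate of the (here unadjusted) analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_samples, n_cpgs = 200, 5000
methylation = rng.beta(2, 5, size=(n_samples, n_cpgs))   # hypothetical beta values
phenotype = rng.integers(0, 2, n_samples)                 # case/control, unrelated

pvals = np.array([
    stats.ttest_ind(cpg[phenotype == 1], cpg[phenotype == 0]).pvalue
    for cpg in methylation.T
])
print("estimated type I error:", np.mean(pvals < 0.05))   # close to 0.05 under the null
```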


2020 ◽  
Author(s):  
Guosheng Yin ◽  
Chenyang Zhang ◽  
Huaqing Jin

BACKGROUND Recently, three randomized clinical trials on coronavirus disease (COVID-19) treatments were completed: one for lopinavir-ritonavir and two for remdesivir. One trial reported that remdesivir was superior to placebo in shortening the time to recovery, while the other two showed no benefit of the treatment under investigation. OBJECTIVE The aim of this paper is to identify, from a statistical perspective, several key issues in the design and analysis of the three COVID-19 trials and to reanalyze the data from the cumulative incidence curves in the three trials using more appropriate statistical methods. METHODS The lopinavir-ritonavir trial enrolled 39 additional patients after the planned sample size had been reached because the results were not significant, which led to inflation of the type I error rate. The remdesivir trial of Wang et al failed to reach the planned sample size due to a lack of eligible patients, and the bootstrap method was used to predict the quantity of clinical interest, conditionally and unconditionally, had the trial continued to the originally planned sample size. Moreover, we used a terminal (or cure) rate model and a model-free metric known as the restricted mean survival time, or restricted mean time to improvement (RMTI), to analyze the reconstructed data. The remdesivir trial of Beigel et al reported the median recovery time of the remdesivir and placebo groups and the rate ratio for recovery, but both quantities depend on a particular time point and thus represent only local information. We used the restricted mean time to recovery (RMTR) as a global and robust measure of efficacy. RESULTS For the lopinavir-ritonavir trial, with the increase of the sample size from 160 to 199, the type I error rate was inflated from 0.05 to 0.071. The difference of RMTIs between the two groups evaluated at day 28 was –1.67 days (95% CI –3.62 to 0.28; P=.09) in favor of lopinavir-ritonavir but not statistically significant. For the remdesivir trial of Wang et al, the difference of RMTIs at day 28 was –0.89 days (95% CI –2.84 to 1.06; P=.37). The planned sample size was 453, yet only 236 patients were enrolled. The conditional prediction shows that the hazard ratio estimates would have reached statistical significance if the target sample size had been maintained. For the remdesivir trial of Beigel et al, the difference of RMTRs between the remdesivir and placebo groups at day 30 was –2.7 days (95% CI –4.0 to –1.2; P<.001), confirming the superiority of remdesivir. The difference in the recovery time at the 25th percentile (95% CI –3 to 0; P=.65) was not statistically significant, while the differences became more statistically significant at larger percentiles. CONCLUSIONS Based on the statistical issues and lessons learned from these three recent clinical trials on COVID-19 treatments, we suggest more appropriate approaches for the design and analysis of ongoing and future COVID-19 trials.
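The restricted mean time to improvement or recovery at a horizon tau is the area under the corresponding survival-type curve up to tau. The sketch below computes this area from the empirical curve of fully observed times; the trials involve censoring and reconstructed data, so this only illustrates the estimand, not the reanalysis, and the times shown are hypothetical.

```python
# Sketch of the restricted mean survival time, RMST(tau) = integral_0^tau S(t) dt,
# computed from the empirical curve of uncensored times only (an illustration,
# not the paper's reanalysis of reconstructed, censored data).
import numpy as np

def rmst(times, tau):
    """Area under the empirical 'not yet recovered' curve up to tau (no censoring)."""
    t = np.sort(np.asarray(times, dtype=float))
    n = len(t)
    events = t[t < tau]                                  # times observed before tau
    grid = np.concatenate(([0.0], events, [tau]))        # step-function breakpoints
    surv = 1.0 - np.arange(len(events) + 1) / n          # S(t) on each interval
    return float(np.sum(surv * np.diff(grid)))           # piecewise-constant integral

remdesivir = [6, 8, 9, 11, 12, 15, 20, 31]               # hypothetical recovery times (days)
placebo    = [9, 11, 13, 15, 18, 22, 28, 31]
print(rmst(remdesivir, 30) - rmst(placebo, 30))          # RMTR difference at day 30
```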


Author(s):  
Rand R. Wilcox

Hypothesis testing is an approach to statistical inference that is routinely taught and used. It is based on a simple idea: develop some relevant speculation about the population of individuals or things under study and determine whether data provide reasonably strong empirical evidence that the hypothesis is wrong. Consider, for example, two approaches to advertising a product. A study might be conducted to determine whether it is reasonable to assume that both approaches are equally effective. A Type I error is rejecting this speculation when in fact it is true. A Type II error is failing to reject when the speculation is false. A common practice is to test hypotheses with the type I error probability set to 0.05 and to declare that there is a statistically significant result if the hypothesis is rejected. There are various concerns about, limitations to, and criticisms of this approach. One criticism is the use of the term significant. Consider the goal of comparing the means of two populations of individuals. Saying that a result is significant suggests that the difference between the means is large and important. But in the context of hypothesis testing it merely means that there is empirical evidence that the means are not equal. Situations can and do arise where a result is declared significant, but the difference between the means is trivial and unimportant. Indeed, the goal of testing the hypothesis that two means are equal has been criticized based on the argument that surely the means differ at some decimal place. A simple way of dealing with this issue is to reformulate the goal. Rather than testing for equality, determine whether it is reasonable to make a decision about which group has the larger mean. The components of hypothesis-testing techniques can be used to address this issue with the understanding that the goal of testing some hypothesis has been replaced by the goal of determining whether a decision can be made about which group has the larger mean. Another aspect of hypothesis testing that has seen considerable criticism is the notion of a p-value. Suppose some hypothesis is rejected with the Type I error probability set to 0.05. This leaves open the issue of whether the hypothesis would be rejected with Type I error probability set to 0.025 or 0.01. A p-value is the smallest Type I error probability for which the hypothesis is rejected. When comparing means, a p-value reflects the strength of the empirical evidence that a decision can be made about which has the larger mean. A concern about p-values is that they are often misinterpreted. For example, a small p-value does not necessarily mean that a large or important difference exists. Another common mistake is to conclude that if the p-value is close to zero, there is a high probability of rejecting the hypothesis again if the study is replicated. The probability of rejecting again is a function of the extent that the hypothesis is not true, among other things. Because a p-value does not directly reflect the extent the hypothesis is false, it does not provide a good indication of whether a second study will provide evidence to reject it. Confidence intervals are closely related to hypothesis-testing methods. Basically, they are intervals that contain unknown quantities with some specified probability. For example, a goal might be to compute an interval that contains the difference between two population means with probability 0.95. 
Confidence intervals can be used to determine whether some hypothesis should be rejected. Clearly, confidence intervals provide useful information not provided by testing hypotheses and computing a p-value. But an argument for a p-value is that it provides a perspective on the strength of the empirical evidence that a decision can be made about the relative magnitude of the parameters of interest. For example, to what extent is it reasonable to decide whether the first of two groups has the larger mean? Even if a compelling argument can be made that p-values should be completely abandoned in favor of confidence intervals, there are situations where p-values provide a convenient way of developing reasonably accurate confidence intervals. Another argument against p-values is that because they are misinterpreted by some, they should not be used. But if this argument is accepted, it follows that confidence intervals should be abandoned because they are often misinterpreted as well. Classic hypothesis-testing methods for comparing means and studying associations assume sampling is from a normal distribution. A fundamental issue is whether nonnormality can be a source of practical concern. Based on hundreds of papers published during the last 50 years, the answer is an unequivocal Yes. Granted, there are situations where nonnormality is not a practical concern, but nonnormality can have a substantial negative impact on both Type I and Type II errors. Fortunately, there is a vast literature describing how to deal with known concerns. Results derived for hypothesis-testing approaches have clear implications for methods aimed at computing confidence intervals. Nonnormal distributions that tend to generate outliers are one source of concern. There are effective methods for dealing with outliers, but technically sound techniques are not obvious based on standard training. Skewed distributions are another concern. The combination of what are called bootstrap methods and robust estimators provides techniques that are particularly effective for dealing with nonnormality and outliers. Classic methods for comparing means and studying associations also assume homoscedasticity; when comparing means, the groups are assumed to have the same amount of variance even when their means differ. Violating this assumption can have serious negative consequences in terms of both Type I and Type II errors, particularly when the normality assumption is violated as well. There is a vast literature describing how to deal with this issue in a technically sound manner.
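The "bootstrap plus robust estimator" strategy mentioned above can be made concrete with a percentile-bootstrap confidence interval for the difference between two 20% trimmed means. The sketch below is a minimal generic version under those choices, not any specific published routine, and the data in the usage comment are assumed to be supplied by the user.

```python
# Percentile-bootstrap CI for the difference between two 20% trimmed means:
# a minimal sketch of combining a robust estimator with the bootstrap.
import numpy as np
from scipy.stats import trim_mean

def trimmed_diff_ci(x, y, n_boot=2000, trim=0.2, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    diffs = np.array([
        trim_mean(rng.choice(x, len(x)), trim) - trim_mean(rng.choice(y, len(y)), trim)
        for _ in range(n_boot)                      # resample each group with replacement
    ])
    lo, hi = np.quantile(diffs, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi                                   # decide the groups differ if 0 is outside

x = np.array([1.2, 3.4, 2.2, 8.9, 2.5, 3.1, 2.8, 40.0, 2.9, 3.3])  # hypothetical, with an outlier
y = np.array([2.0, 2.4, 1.9, 2.7, 2.1, 2.6, 2.3, 2.2, 2.8, 2.5])
print(trimmed_diff_ci(x, y))
```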


Author(s):  
YI-CHUNG HU

Flow-based methods grounded in outranking relation theory are extensively used in multiple criteria classification problems. These methods usually employ an overall preference index, representing the flow, to measure the intensity of preference for one pattern over another. A traditional flow obtained from a pairwise comparison may be incomplete because it considers the differences on each criterion only locally, between the two patterns being compared, rather than globally, between the second pattern and all the other patterns. In contrast with traditional flows, a relationship-based flow is proposed that employs grey relational analysis to assess the flow from one pattern to another by considering the differences on each criterion between the latter pattern and all the other patterns. A genetic-algorithm-based learning algorithm is designed to determine the relative weights of the respective criteria and derive the overall relationship index of a pattern. Our method is tested on several real-world data sets, and its performance is comparable to that of other well-known classifiers and flow-based methods.
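The grey relational building block can be sketched as follows: for a reference pattern, per-criterion absolute differences to the other patterns are turned into relational coefficients and combined with criterion weights into a grade. This illustrates only that building block; the GA-based weight learning and the relationship-based flow itself are not reproduced, and the data and weights are hypothetical.

```python
# Minimal grey relational analysis sketch: grade of relation between a
# reference pattern and each other pattern, given per-criterion weights.
import numpy as np

def grey_relational_grade(reference, patterns, weights, zeta=0.5):
    """patterns: (n, m) array; reference: (m,); weights sum to 1; zeta in (0, 1]."""
    delta = np.abs(patterns - reference)                      # per-criterion differences
    d_min, d_max = delta.min(), delta.max()
    coeff = (d_min + zeta * d_max) / (delta + zeta * d_max)   # relational coefficients
    return coeff @ np.asarray(weights)                        # weighted grade per pattern

X = np.array([[0.2, 0.8, 0.5],
              [0.9, 0.1, 0.4],
              [0.3, 0.7, 0.6]])                               # hypothetical normalized patterns
print(grey_relational_grade(X[0], X[1:], weights=[0.5, 0.3, 0.2]))
```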


2017 ◽  
Vol 13 (1) ◽  
Author(s):  
Asanao Shimokawa ◽  
Etsuo Miyaoka

Abstract To estimate or test the treatment effect in randomized clinical trials, it is important to adjust for the potential influence of covariates that are likely to affect the association between the treatment or control group and the response. If these covariates are known at the start of the trial, random assignment of the treatment within each stratum would be considered. On the other hand, if these covariates are not clear at the start of the trial, or if it is difficult to allocate the treatment within each stratum, completely randomized assignment of the treatment would be performed. In both sampling structures, the use of a stratified adjusted test is a useful way to evaluate the significance of the overall treatment effect by reducing the variance and/or bias of the result. If the trial has a binary endpoint, the Cochran and Mantel-Haenszel tests are generally used. These tests are constructed based on the assumption that the number of patients within a stratum is fixed. However, in practice, the stratum sizes are often not fixed at the start of the trial and are instead allowed to vary. Therefore, there is a risk that using these tests in such situations would misestimate the variance of the test statistics. To handle this problem, we propose new test statistics, based on multinomial distributions, under both sampling structures. Our proposed approach is based on the Cochran test, and the two tests tend to yield similar values when the number of patients is large. When the total number of patients is small, our approach yields a more conservative result. Through simulation studies, we show that the new approach maintains the type I error rate better than the traditional approach.
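For reference, the classical fixed-stratum-size statistic that this proposal generalizes can be computed as below. The sketch implements the Mantel-Haenszel form without continuity correction (the Cochran form differs only in the variance denominator); it is not the authors' multinomial-based statistic, and the strata are hypothetical.

```python
# Classical Mantel-Haenszel test for a common treatment effect across 2x2 strata
# (no continuity correction); the fixed-stratum-size baseline, not the proposal.
import numpy as np
from scipy.stats import chi2

def mantel_haenszel(tables):
    """tables: iterable of 2x2 arrays [[a, b], [c, d]], one per stratum."""
    a = e = v = 0.0
    for t in map(np.asarray, tables):
        n = t.sum()
        a += t[0, 0]
        e += t[0].sum() * t[:, 0].sum() / n                 # E[a] under H0
        v += t[0].sum() * t[1].sum() * t[:, 0].sum() * t[:, 1].sum() / (n ** 2 * (n - 1))
    stat = (a - e) ** 2 / v
    return stat, chi2.sf(stat, df=1)                        # statistic and p-value

strata = [[[12, 18], [8, 22]], [[15, 10], [9, 16]]]         # hypothetical strata
print(mantel_haenszel(strata))
```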


2020 ◽  
Vol 28 (3) ◽  
pp. 227-236
Author(s):  
Logan Opperman ◽  
Wei Ning

Abstract In this paper, we propose a goodness-of-fit test based on the energy statistic for skew normality. Simulations indicate that the Type-I error of the proposed test can be controlled reasonably well for given nominal levels. Power comparisons to other existing methods under different settings show the advantage of the proposed test. Such a test is applied to two real data sets to illustrate the testing procedure.
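To give a flavor of an energy-statistic test, the sketch below compares a sample with its fitted skew-normal law via the energy distance and calibrates the statistic by Monte Carlo. The published test statistic and its calibration differ in detail; this is only an illustration of the idea, and the helper energy_gof_pvalue is hypothetical.

```python
# Monte Carlo goodness-of-fit sketch for skew normality using the energy distance
# between the sample and its fitted skew-normal distribution. A full parametric
# bootstrap would also re-fit the parameters in each replicate.
import numpy as np
from scipy.stats import skewnorm, energy_distance

def energy_gof_pvalue(x, n_mc=200, ref_size=2000, seed=0):
    rng = np.random.default_rng(seed)
    a, loc, scale = skewnorm.fit(x)                       # fitted skew-normal parameters
    ref = skewnorm.rvs(a, loc=loc, scale=scale, size=ref_size, random_state=rng)
    observed = energy_distance(x, ref)                    # distance of data to fitted law
    null = [
        energy_distance(
            skewnorm.rvs(a, loc=loc, scale=scale, size=len(x), random_state=rng),
            skewnorm.rvs(a, loc=loc, scale=scale, size=ref_size, random_state=rng),
        )
        for _ in range(n_mc)                              # distances under the fitted law
    ]
    return np.mean(np.asarray(null) >= observed)          # Monte Carlo p-value

x = skewnorm.rvs(4, size=150, random_state=1)             # hypothetical data
print(energy_gof_pvalue(x))
```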


2021 ◽  
Author(s):  
Ping Zhang ◽  
Jiyao Sheng ◽  
Wanfu Gao ◽  
Juncheng Hu ◽  
Yonghao Li

Abstract Multi-label feature selection attracts considerable attention in multi-label learning. Information-theory-based multi-label feature selection methods aim to select the most informative features and reduce the uncertainty of the labels. Previous methods treat the amount of label uncertainty as constant. In fact, as the classification information of the label set is captured by features, the remaining uncertainty of each label changes dynamically. In this paper, we divide the labels into two groups: one contains labels with little remaining uncertainty, meaning that most of their classification information has already been captured by the selected features; the other contains labels with substantial remaining uncertainty, meaning that their classification information has been neglected by the selected features. Feature selection should therefore favor new features that are highly relevant to the labels in the second group. Existing methods do not distinguish between the two label groups and ignore the dynamically changing amount of label information. To this end, a Relevancy Ratio is designed to quantify the dynamically changing amount of information of each label given the already-selected features. A Weighted Feature Relevancy is then defined to evaluate the candidate features. Finally, a new multi-label Feature Selection method based on Weighted Feature Relevancy (WFRFS) is proposed. Experiments on thirteen real-world data sets show encouraging results for WFRFS in comparison with six multi-label feature selection methods.
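As a loose illustration of the weighting idea, the sketch below gives labels whose uncertainty remains largely unexplained by the already-selected features a larger weight when scoring a candidate feature. All quantities are plain discrete mutual information estimates; the helper weighted_relevancy and the data are hypothetical, and the exact Relevancy Ratio and Weighted Feature Relevancy definitions in the paper differ.

```python
# Simplified weighted-relevancy sketch: weight each label by the share of its
# entropy not yet explained by the selected features, then score a candidate
# feature by the weighted sum of its mutual information with the labels.
import numpy as np
from sklearn.metrics import mutual_info_score

def weighted_relevancy(candidate, selected, labels):
    """candidate: (n,) discrete feature; selected: list of (n,) features;
    labels: list of (n,) discrete label vectors."""
    score = 0.0
    for y in labels:
        h_y = mutual_info_score(y, y)                     # I(Y;Y) = H(Y), in nats
        explained = max((mutual_info_score(f, y) for f in selected), default=0.0)
        ratio = max(h_y - explained, 0.0) / h_y if h_y > 0 else 0.0   # remaining share
        score += ratio * mutual_info_score(candidate, y)
    return score

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 4))                     # hypothetical discretized features
Y = [rng.integers(0, 2, 500) for _ in range(2)]           # hypothetical labels
print(weighted_relevancy(X[:, 3], [X[:, 0], X[:, 1]], Y))
```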


2021 ◽  
Vol 23 (3) ◽  
Author(s):  
Estelle Chasseloup ◽  
Adrien Tessier ◽  
Mats O. Karlsson

Abstract Longitudinal pharmacometric models offer many advantages in the analysis of clinical trial data, but potentially inflated type I error and biased drug effect estimates, as a consequence of model misspecification and multiple testing, are their main drawbacks. In this work, we used real data to compare these aspects for a standard approach (STD) and a new one using mixture models, called individual model averaging (IMA). Placebo arm data sets were obtained from three clinical studies assessing ADAS-Cog scores, Likert pain scores, and seizure frequency. By randomly (1:1) assigning patients in the above data sets to “treatment” or “placebo,” we created data sets where any significant drug effect was known to be a false positive. By repeating the process of random assignment and analysis for a significant drug effect many times (N = 1000) for each of the 40 to 66 placebo-drug model combinations, we obtained statistics of the type I error and drug effect bias. Across all models and the three data types, the type I error (5th, 25th, 50th, 75th, and 95th percentiles) was 4.1%, 11.4%, 40.6%, 100.0%, and 100.0% for STD, and 1.6%, 3.5%, 4.3%, 5.0%, and 6.0% for IMA. IMA showed no bias in the drug effect estimates, whereas in STD bias was frequently present. In conclusion, STD is associated with inflated type I error and a risk of biased drug effect estimates. IMA demonstrated controlled type I error and no bias.
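A schematic version of the randomization exercise described above is sketched below: because all patients received placebo, any significant "drug effect" is a false positive, and the rejection fraction over many random 1:1 splits estimates the type I error. A plain t test stands in for the longitudinal pharmacometric models, and the data are simulated rather than the study's placebo arms.

```python
# Type I error check by repeated random 1:1 assignment within placebo data:
# the fraction of significant results estimates the type I error rate.
import numpy as np
from scipy import stats

def type_one_error(scores, n_rep=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    rejections = 0
    for _ in range(n_rep):
        arm = rng.permutation(len(scores)) % 2            # random 1:1 "treatment"/"placebo" split
        p = stats.ttest_ind(scores[arm == 0], scores[arm == 1]).pvalue
        rejections += p < alpha
    return rejections / n_rep                             # should be close to alpha

print(type_one_error(np.random.default_rng(1).normal(size=120)))  # simulated placebo scores
```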

