Cumulative deviation of a subpopulation from the full population

2021, Vol 8 (1). Author(s): Mark Tygert

Abstract: Assessing equity in treatment of a subpopulation often involves assigning numerical “scores” to all individuals in the full population such that similar individuals get similar scores; matching via propensity scores or appropriate covariates is common, for example. Given such scores, individuals with similar scores may or may not attain similar outcomes independent of the individuals’ memberships in the subpopulation. The traditional graphical methods for visualizing inequities are known as “reliability diagrams” or “calibration plots”; these bin the scores into a partition of all possible values and, for each bin, plot both the average outcome for individuals in the subpopulation and the average outcome for all individuals. Comparing the graph for the subpopulation with that for the full population gives some sense of how the averages for the subpopulation deviate from the averages for the full population. Unfortunately, real data sets contain only finitely many observations, limiting the usable resolution of the bins, so the conventional methods can obscure important variations due to the binning. Fortunately, plotting the cumulative deviation of the subpopulation from the full population, as proposed in this paper, sidesteps the problematic coarse binning. The cumulative plots encode subpopulation deviation directly as the slopes of secant lines for the graphs; slope is easy to perceive even when the constant offsets of the secant lines are irrelevant. The cumulative approach avoids the binning that smooths over deviations of the subpopulation from the full population. Such cumulative aggregation furnishes both high-resolution graphical methods and simple scalar summary statistics (analogous to those of Kuiper and of Kolmogorov and Smirnov used in statistical significance testing for comparing probability distributions).
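
To make the idea concrete, here is a minimal Python sketch of such a cumulative plot, assuming the full-population average outcome at a given score can be approximated by averaging over the nearest full-population scores; the function name, the nearest-neighbour matching, and the parameter k are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def cumulative_deviation(sub_scores, sub_outcomes, full_scores, full_outcomes, k=20):
    """Cumulative deviation of a subpopulation from the full population.

    For each subpopulation member (ordered by score), subtract an estimate of
    the full-population average outcome at a similar score (here the mean
    outcome of the k nearest full-population scores) and accumulate. Over any
    score range, the slope of the secant line of the returned curve reflects
    how much the subpopulation deviates from the full population there.
    """
    sub_scores = np.asarray(sub_scores, dtype=float)
    sub_outcomes = np.asarray(sub_outcomes, dtype=float)
    full_scores = np.asarray(full_scores, dtype=float)
    full_outcomes = np.asarray(full_outcomes, dtype=float)

    order = np.argsort(sub_scores)
    scores, outcomes = sub_scores[order], sub_outcomes[order]

    # crude stand-in for "the full-population average outcome at this score"
    expected = np.empty_like(outcomes)
    for i, s in enumerate(scores):
        nearest = np.argsort(np.abs(full_scores - s))[:k]
        expected[i] = full_outcomes[nearest].mean()

    x = np.arange(1, len(scores) + 1) / len(scores)   # fraction of subpopulation
    return x, np.cumsum(outcomes - expected) / len(scores)
```

Plotting the second returned array against the first makes deviations over any score range visible as secant slopes, without choosing any bins.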

2008, Vol 22 (1), pp. 523-534. Author(s): Kirsten E. Kramer, Robert E. Morris, Susan L. Rose-Pehrsson, Jeffrey Cramer, Kevin J. Johnson

2021, Vol 8 (1). Author(s): Mark Tygert

Abstract: Comparing the differences in outcomes (that is, in “dependent variables”) between two subpopulations is often most informative when comparing outcomes only for individuals from the subpopulations who are similar according to “independent variables.” The independent variables are generally known as “scores,” as in propensity scores for matching or the probabilities predicted by statistical or machine-learned models, for example. If the outcomes are discrete, then some averaging is necessary to reduce the noise arising from the outcomes varying randomly over those discrete values in the observed data. The traditional method of averaging is to bin the data according to the scores and plot the average outcome in each bin against the average score in the bin. However, such binning can be rather arbitrary and yet greatly impacts both the interpretation of the displayed deviation between the subpopulations and the assessment of its statistical significance. Fortunately, such binning is entirely unnecessary in plots of cumulative differences and in the associated scalar summary metrics that are analogous to the workhorse statistics for comparing probability distributions, namely those due to Kolmogorov and Smirnov and their refinements due to Kuiper. The present paper develops such cumulative methods for the common case in which no score of any member of the subpopulations being compared is exactly equal to the score of any other member of either subpopulation.
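
As a rough illustration of those scalar summaries, the sketch below computes Kolmogorov–Smirnov-style and Kuiper-style statistics from an already-computed cumulative-difference curve; the interface and the treatment of the starting point are assumptions for illustration, not the paper's exact estimators.

```python
import numpy as np

def ks_and_kuiper(cumulative_differences):
    """Scalar summaries of a cumulative-difference curve.

    The input is the sequence of cumulative differences between the two
    subpopulations, evaluated at successive scores. The Kolmogorov-Smirnov
    style statistic is the largest absolute excursion of the curve; the
    Kuiper-style statistic adds the largest rise to the largest fall
    (the curve implicitly starts at zero, so zero is included).
    """
    c = np.append(0.0, np.asarray(cumulative_differences, dtype=float))
    ks = np.max(np.abs(c))
    kuiper = np.max(c) - np.min(c)
    return ks, kuiper
```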


Author(s): Wahid A. M. Shehata, Haitham Yousof, Mohamed Aboraya

This paper presents a novel two-parameter G family of distributions. Relevant statistical properties such as the ordinary moments, incomplete moments and moment generating function are derived. Using common copulas, some new bivariate-type G families are derived. Special attention is devoted to the standard exponential baseline model. The density of the new exponential extension can have an “asymmetric and right skewed” shape with no peak, an “asymmetric right skewed” shape with one peak, a “symmetric” shape, or an “asymmetric left skewed” shape with one peak. The hazard rate of the new exponential distribution can be “increasing”, “U-shaped”, “decreasing” or “J-shaped”. The usefulness and flexibility of the new family are illustrated by means of two applications to real data sets. The new family is compared with many common G families in modeling relief-times and survival-times data sets.
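
The family's generator is not reproduced in this abstract, so the following Python sketch only illustrates how a generic G family wraps a baseline CDF, using the textbook exponentiated-G construction with a standard exponential baseline and a numerical hazard rate for inspecting shapes; none of this is the authors' two-parameter family.

```python
import numpy as np

def exponentiated_g_cdf(x, baseline_cdf, a):
    """Textbook exponentiated-G construction: F(x) = G(x)**a.

    Shown only to illustrate how a G family wraps a baseline CDF; the
    authors' two-parameter generator is different and not reproduced here.
    """
    return baseline_cdf(x) ** a

def exponential_cdf(x, rate=1.0):
    # standard exponential baseline, the case highlighted in the abstract
    return 1.0 - np.exp(-rate * np.asarray(x, dtype=float))

def hazard_rate(x, cdf, eps=1e-6):
    # numerical hazard rate f(x) / (1 - F(x)), useful for inspecting the
    # increasing / decreasing / U / J shapes mentioned in the abstract
    f = (cdf(x + eps) - cdf(x - eps)) / (2.0 * eps)
    return f / (1.0 - cdf(x))

# example: hazard of the exponentiated exponential with power a = 2
# h = hazard_rate(np.linspace(0.1, 5.0, 50),
#                 lambda t: exponentiated_g_cdf(t, exponential_cdf, 2.0))
```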


PLoS ONE, 2021, Vol 16 (1), e0245253. Author(s): Muhammad Ali, Alamgir Khalil, Muhammad Ijaz, Noor Saeed

The main goal of the current paper is to contribute to the existing literature on probability distributions. In this paper, a new probability distribution is generated using the Alpha Power Family of distributions, with the aim of modeling data with non-monotonic failure rates and providing a better fit. The proposed distribution is called the Alpha Power Exponentiated Inverse Rayleigh, or APEIR, distribution. Various statistical properties are investigated, including the order statistics, moments, residual life function, mean waiting time, quantiles, entropy, and the stress-strength parameter. To estimate the parameters of the proposed distribution, the maximum likelihood method is employed. It is proved theoretically that the proposed distribution provides a better fit to data with monotonic as well as non-monotonic hazard rate shapes. Moreover, two real data sets are used to evaluate the significance and flexibility of the proposed distribution in comparison with other probability distributions.
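
A minimal sketch of how such a CDF can be assembled, assuming the usual alpha-power transform applied to an exponentiated inverse Rayleigh baseline with CDF exp(-theta/x^2); the parameter names and the exact parameterization are assumptions for illustration, not necessarily the authors' form.

```python
import numpy as np

def inverse_rayleigh_cdf(x, theta):
    # assumed baseline inverse Rayleigh CDF: G(x) = exp(-theta / x**2), x > 0
    x = np.asarray(x, dtype=float)
    return np.exp(-theta / x**2)

def apeir_cdf(x, alpha, a, theta):
    """Sketch of an Alpha Power Exponentiated Inverse Rayleigh CDF.

    The exponentiated inverse Rayleigh baseline G(x)**a is pushed through the
    alpha-power transform (alpha**u - 1) / (alpha - 1); this parameterization
    is an illustrative assumption.
    """
    u = inverse_rayleigh_cdf(x, theta) ** a
    if np.isclose(alpha, 1.0):
        return u                  # the limit alpha -> 1 recovers the baseline
    return (alpha ** u - 1.0) / (alpha - 1.0)
```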


2019, Vol 8 (2), pp. 70. Author(s): Mustafa C. Korkmaz, Emrah Altun, Haitham M. Yousof, G.G. Hamedani

In this study, a new flexible family of distributions is proposed, together with its statistical properties and some useful characterizations. The maximum likelihood method is used to estimate the unknown model parameters, and its performance is assessed by means of two simulation studies. A new regression model is proposed based on a special member of the proposed family, called the log odd power Lindley Weibull distribution. Residual analysis is conducted to evaluate the model assumptions. Four applications to real data sets are given to demonstrate the usefulness of the proposed model.
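
The likelihood of the log odd power Lindley Weibull model is not given in this abstract, so the sketch below only shows the general shape of such a simulation study, with an ordinary Weibull fit from scipy standing in for the proposed model.

```python
import numpy as np
from scipy import stats

def mle_simulation_study(true_shape=1.5, true_scale=2.0, n=200, reps=500, seed=0):
    """Generic simulation study for maximum likelihood estimates.

    Repeatedly simulate samples from a known model, refit by maximum
    likelihood, and report the empirical bias and mean squared error of the
    estimates. An ordinary Weibull model stands in for the proposed family.
    """
    rng = np.random.default_rng(seed)
    shape_hats, scale_hats = [], []
    for _ in range(reps):
        sample = true_scale * rng.weibull(true_shape, size=n)
        shape_hat, _, scale_hat = stats.weibull_min.fit(sample, floc=0)
        shape_hats.append(shape_hat)
        scale_hats.append(scale_hat)
    shape_hats, scale_hats = np.array(shape_hats), np.array(scale_hats)
    return {
        "bias": (shape_hats.mean() - true_shape, scale_hats.mean() - true_scale),
        "mse": (np.mean((shape_hats - true_shape) ** 2),
                np.mean((scale_hats - true_scale) ** 2)),
    }
```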


2021, Vol 2021, pp. 1-11. Author(s): Muhammad Farooq, Qamruz zaman, Muhammad Ijaz, Said Farooq Shah, Mutua Kilai

In practice, data sets with extreme values arise in many fields such as engineering, lifetime analysis, business, and economics. Many probability distributions have been derived and presented to increase model flexibility in the presence of such values. The current study likewise derives a new probability model, the New Flexible Family (NFF) of distributions. The significance of NFF is demonstrated using the Weibull distribution, yielding the New Flexible Weibull distribution, or NFW for short. Various mathematical properties of NFW are discussed, including the estimation of parameters and entropy measures. Two real data sets with extreme values and a simulation study are used to delineate the importance of NFW. Furthermore, NFW is compared numerically with other existing probability distributions; the new mechanism for producing lifetime probability distributions is observed to yield better predictions about the population than the alternatives on data sets with extreme values.


2017, Vol 22 (2), pp. 186-201. Author(s): Pedro Jodra, Hector Wladimir Gomez, Maria Dolores Jimenez-Gamero, Maria Virtudes Alba-Fernandez

Muth introduced a probability distribution with applications in reliability theory. We propose a new model derived from the Muth law. This paper studies its statistical properties, such as the computation of the moments, the computer generation of pseudo-random data, and the behavior of the failure rate function, among others. The estimation of parameters is carried out by the method of maximum likelihood, and a Monte Carlo simulation study assesses the performance of this method. The practical usefulness of the new model is illustrated by means of two real data sets, showing that it may provide a better fit than other probability distributions.
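
Since the abstract mentions computer generation of pseudo-random data, here is a hedged sketch of inverse-transform sampling from the base Muth law by numerically inverting its CDF, taken here as F(t) = 1 - exp(alpha*t - (exp(alpha*t) - 1)/alpha) for 0 < alpha <= 1; the paper's new model derived from the Muth law is not reproduced.

```python
import numpy as np
from scipy.optimize import brentq

def muth_cdf(t, alpha):
    # assumed Muth CDF: F(t) = 1 - exp(alpha*t - (exp(alpha*t) - 1) / alpha), t > 0
    return 1.0 - np.exp(alpha * t - (np.exp(alpha * t) - 1.0) / alpha)

def muth_rvs(alpha, size, seed=None):
    """Pseudo-random Muth variates by numerically inverting the assumed CDF."""
    rng = np.random.default_rng(seed)
    samples = []
    for u in rng.uniform(size=size):
        hi = 1.0
        while muth_cdf(hi, alpha) < u:   # expand the bracket until F(hi) >= u
            hi *= 2.0
        samples.append(brentq(lambda t: muth_cdf(t, alpha) - u, 0.0, hi))
    return np.array(samples)

# example: 1000 draws with alpha = 0.5
# x = muth_rvs(0.5, size=1000, seed=1)
```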


2016. Author(s): Reuben Thomas, Sean Thomas, Alisha K Holloway, Katherine S Pollard

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an important tool for studying gene regulatory proteins, such as transcription factors and histones. Peak calling is one of the first steps in the analysis of these data and consists of two sub-problems: identifying candidate peaks and testing candidate peaks for statistical significance. We surveyed 30 methods and identified 12 features of the two sub-problems that distinguish methods from each other. We picked six methods (GEM, MACS2, MUSIC, BCP, TM and ZINBA) that span this feature space and used a combination of 300 simulated ChIP-seq data sets, 3 real data sets and mathematical analyses to identify features of methods that allow some to perform better than others. We prove that methods that explicitly combine the signals from ChIP and input samples are less powerful than methods that do not. Methods that use windows of different sizes are more powerful than ones that do not. For statistical testing of candidate peaks, methods that use a Poisson test to rank their candidate peaks are more powerful than those that use a Binomial test. BCP and MACS2 have the best operating characteristics on simulated transcription factor binding data. GEM has the highest fraction of the top 500 peaks containing the binding motif of the immunoprecipitated factor, with 50% of its peaks within 10 base pairs (bp) of a motif. BCP and MUSIC perform best on histone data. These findings provide guidance and rationale for selecting the best peak caller for a given application.
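
As a toy illustration of the candidate-peak testing step discussed above, the snippet below scores a window by a one-sided Poisson test of its ChIP read count against an expected background count; the interface is a simplification and does not follow any particular peak caller's implementation.

```python
from scipy import stats

def poisson_peak_pvalue(chip_count, expected_background):
    """One-sided Poisson test for a candidate peak.

    Returns P(X >= chip_count) for X ~ Poisson(expected_background), where the
    expected background count for the window might be scaled from an input
    (control) sample; that scaling is left out here as a simplification.
    """
    # sf(k) = P(X > k), so shift by one to get P(X >= chip_count)
    return stats.poisson.sf(chip_count - 1, expected_background)

# example: 42 ChIP reads in a window where roughly 10 are expected by chance
# p = poisson_peak_pvalue(42, 10.0)
```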


2021. Author(s): Jakob Raymaekers, Peter J. Rousseeuw

Abstract: Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so that the transformed data can be approximately normal in the center while a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
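
For context, here is a brief sketch of the classical Yeo–Johnson transform with its parameter fitted by ordinary maximum likelihood via scipy; this is exactly the non-robust estimator the paper improves upon, and the proposed robust estimator is not reproduced here.

```python
import numpy as np
from scipy import stats

# Classical (non-robust) Yeo-Johnson: scipy fits the transformation parameter
# by ordinary maximum likelihood, the estimator shown to be outlier-sensitive.
rng = np.random.default_rng(0)
x = np.concatenate([rng.lognormal(size=200), [50.0, 80.0]])  # skewed data plus two outliers

x_transformed, lam = stats.yeojohnson(x)   # lam chosen by maximum likelihood

print(f"ML-fitted lambda: {lam:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(x_transformed):.2f}")
```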

