Predicting with Proxies: Transfer Learning in High Dimension

2020 ◽  
Author(s):  
Hamsa Bastani

Predictive analytics is increasingly used to guide decision making in many applications. However, in practice, we often have limited data on the true predictive task of interest and must instead rely on more abundant data on a closely related proxy predictive task. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. Yet, not accounting for the bias in the proxy can lead to suboptimal decisions. Using real data sets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on several heuristics used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features d). Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare data sets; in both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data. This paper was accepted by George Shanthikumar, big data and analytics.
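A minimal sketch of the two-step idea behind this estimator, assuming linear models: fit the proxy task on the abundant proxy data, then estimate a sparse (lasso) correction of the proxy model's bias from the small sample of true outcomes. The Ridge/Lasso choices and all names here are illustrative, not the paper's exact specification.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

def two_step_proxy_estimator(X_proxy, y_proxy, X_true, y_true, alpha_bias=0.1):
    # Step 1: fit the proxy task on the abundant proxy data.
    proxy_model = Ridge(alpha=1.0).fit(X_proxy, y_proxy)
    # Step 2: the bias of the proxy is assumed sparse in the features,
    # so fit a lasso to the residuals on the small true sample.
    residuals = y_true - proxy_model.predict(X_true)
    bias_model = Lasso(alpha=alpha_bias).fit(X_true, residuals)
    # Final predictor: proxy prediction plus the sparse bias correction.
    return lambda X: proxy_model.predict(X) + bias_model.predict(X)
```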

2018 ◽  
Author(s):  
Xinghao Yu ◽  
Lishun Xiao ◽  
Ping Zeng ◽  
Shuiping Huang

Motivation: In the past few years, many novel prediction approaches have been proposed and widely employed on high-dimensional genetic data for disease risk evaluation. However, those approaches typically ignore, in model fitting, the important group structures or functional classifications that naturally exist in genetic data.

Methods: In the present study, we applied a novel model averaging approach, called Jackknife Model Averaging Prediction (JMAP), to high-dimensional genetic risk prediction while incorporating KEGG pathway information into the model specification. JMAP selects the optimal weights across candidate models by minimizing a cross-validation criterion in a jackknife way. Compared with previous approaches, one of the primary features of JMAP is that it allows the model weights to vary from 0 to 1 without requiring that they sum to one. We evaluated the performance of JMAP using extensive simulation studies, compared it with existing methods, and finally applied it to five real cancer datasets that are publicly available from TCGA.

Results: The simulations showed that, compared with other existing approaches, JMAP performed best or was among the best methods across a range of scenarios. For example, in 14 out of 16 simulation settings with PVE = 0.3, JMAP achieved an average of 0.075 higher prediction accuracy than gsslasso. We further found that, in the simulations, the model weights for the true candidate models were much less likely to be zero than those for the null candidate models and were substantially greater in magnitude. In the real data applications, JMAP also performed comparably to or better than the other methods for both continuous and binary phenotypes. For example, for the COAD, CRC and PAAD data sets, the average gains in predictive accuracy of JMAP were 0.019, 0.064 and 0.052 compared with gsslasso.

Conclusion: JMAP is a novel method that provides more accurate phenotypic prediction while incorporating useful external group information.
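A rough sketch of the jackknife model averaging step, under stated assumptions: one linear candidate model per pathway-defined feature group, leave-one-out (jackknife) predictions per candidate, and weights chosen in [0, 1] without a sum-to-one constraint. All names are illustrative; the paper's criterion may differ in detail.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut

def jmap_weights(X, y, groups):
    """groups: list of column-index arrays, one candidate model per
    feature group (e.g. one per KEGG pathway)."""
    # Jackknife (leave-one-out) predictions for each candidate model.
    P = np.column_stack([
        cross_val_predict(LinearRegression(), X[:, g], y, cv=LeaveOneOut())
        for g in groups
    ])
    # Minimize the jackknife criterion over weights in [0, 1];
    # note there is no constraint that the weights sum to one.
    loss = lambda w: np.mean((y - P @ w) ** 2)
    res = minimize(loss, x0=np.full(len(groups), 0.5),
                   bounds=[(0.0, 1.0)] * len(groups))
    return res.x  # combine candidates fitted on all data with these weights
```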


2021 ◽  
Author(s):  
Jakob Raymaekers ◽  
Peter J. Rousseeuw

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian); instead, their distribution is often skewed. In order to handle such data, it is customary to preprocess the variables to make them more normal. The Box–Cox and Yeo–Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so that the bulk of the transformed data is approximately normal while a few outliers may deviate from it. The method compares favorably to existing techniques in an extensive simulation study and on real data.
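As a hedged illustration of the robustness issue, the sketch below grid-searches the Yeo–Johnson parameter by maximizing a trimmed normal log-likelihood with median/MAD location and scale, so that the worst-fitting points (potential outliers) cannot drive the choice of the parameter. This is a simple stand-in, not the authors' estimator.

```python
import numpy as np
from scipy import stats

def robust_yeojohnson_lambda(x, grid=np.linspace(-2, 2, 81), trim=0.15):
    x = np.asarray(x, dtype=float)
    best_lam, best_crit = None, -np.inf
    for lam in grid:
        z = stats.yeojohnson(x, lmbda=lam)
        mu = np.median(z)                                      # robust location
        sigma = stats.median_abs_deviation(z, scale="normal")  # robust scale
        # Per-observation log-likelihood, including the Yeo-Johnson
        # Jacobian term (lam - 1) * sign(x) * log(|x| + 1).
        ll = (stats.norm.logpdf(z, mu, sigma)
              + (lam - 1.0) * np.sign(x) * np.log1p(np.abs(x)))
        k = int(trim * len(x))
        crit = np.sort(ll)[k:].sum()  # drop the k worst-fitting points
        if crit > best_crit:
            best_lam, best_crit = lam, crit
    return best_lam
```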


Entropy ◽  
2020 ◽  
Vol 23 (1) ◽  
pp. 62
Author(s):  
Zhengwei Liu ◽  
Fukang Zhu

Thinning operators play an important role in the analysis of integer-valued autoregressive models, and the most widely used is binomial thinning. Inspired by the theory of extended Pascal triangles, we introduce a new thinning operator, called extended binomial thinning, which generalizes binomial thinning. Compared to the binomial thinning operator, the extended binomial thinning operator has two parameters and is more flexible in modeling. Based on the proposed operator, a new integer-valued autoregressive model is introduced that can accurately and flexibly capture the dispersion features of count time series. Two-step conditional least squares (CLS) estimation is investigated for the innovation-free case, and conditional maximum likelihood estimation is also discussed. We also establish the asymptotic properties of the two-step CLS estimator. Finally, three overdispersed or underdispersed real data sets are considered to illustrate the superior performance of the proposed model.
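For orientation, here is a sketch of the classical case the paper generalizes: an INAR(1) model with binomial thinning and Poisson innovations, estimated by conditional least squares. Since E[X_t | X_{t-1}] = alpha*X_{t-1} + lam, CLS reduces to a linear regression. The two-parameter extended binomial thinning operator would replace the rng.binomial step; its definition is in the paper and not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_inar1(n, alpha, lam, x0=0):
    """INAR(1) with classical binomial thinning:
    X_t = alpha o X_{t-1} + eps_t, eps_t ~ Poisson(lam)."""
    x = np.empty(n, dtype=int)
    x[0] = x0
    for t in range(1, n):
        # binomial thinning: each of the X_{t-1} units survives w.p. alpha
        x[t] = rng.binomial(x[t - 1], alpha) + rng.poisson(lam)
    return x

def cls_inar1(x):
    """Conditional least squares: regress X_t on X_{t-1} and a constant."""
    A = np.column_stack([x[:-1], np.ones(len(x) - 1)])
    coef, *_ = np.linalg.lstsq(A, x[1:], rcond=None)
    return coef  # (alpha_hat, lam_hat)

print(cls_inar1(simulate_inar1(2000, alpha=0.5, lam=1.0)))
```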


Econometrics ◽  
2021 ◽  
Vol 9 (1) ◽  
pp. 10
Author(s):  
Šárka Hudecová ◽  
Marie Hušková ◽  
Simos G. Meintanis

This article considers goodness-of-fit tests for bivariate INAR and bivariate Poisson autoregression models. The test statistics are based on an L2-type distance between two estimators of the probability generating function of the observations: one entirely nonparametric, and the other semiparametric, computed under the corresponding null hypothesis. The asymptotic distribution of the proposed test statistics is derived both under the null hypotheses and under alternatives, and consistency is proved. The case of testing bivariate generalized Poisson autoregression and the extension of the methods to dimensions higher than two are also discussed. The finite-sample performance of a parametric bootstrap version of the tests is illustrated via a series of Monte Carlo experiments. The article concludes with applications to real data sets and a discussion.
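A hedged sketch of the core statistic: the empirical joint probability generating function of the data is compared, over a grid of (u, v) in [0, 1]^2, with the PGF implied by the fitted null model, here approximated by the empirical PGF of a long series simulated from that fitted model. The weighting and the exact semiparametric PGF estimator in the paper differ in detail.

```python
import numpy as np

def empirical_pgf(x, y, U, V):
    """Empirical joint PGF g(u, v) = mean(u^X * v^Y) on a grid U x V."""
    x, y = np.asarray(x), np.asarray(y)
    return np.array([[np.mean(u ** x * v ** y) for v in V] for u in U])

def l2_pgf_statistic(x, y, x_null, y_null, grid=np.linspace(0.0, 1.0, 21)):
    """L2-type distance between the nonparametric PGF of the observed
    series and the PGF under the fitted null (approximated here by the
    empirical PGF of (x_null, y_null) simulated from the fitted model)."""
    d = empirical_pgf(x, y, grid, grid) - empirical_pgf(x_null, y_null, grid, grid)
    return len(x) * np.mean(d ** 2)

# Critical values: parametric bootstrap -- re-simulate from the fitted
# null B times, refit, recompute the statistic, take the 1 - alpha quantile.
```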


Information ◽  
2021 ◽  
Vol 12 (5) ◽  
pp. 202
Author(s):  
Louai Alarabi ◽  
Saleh Basalamah ◽  
Abdeltawab Hendawi ◽  
Mohammed Abdalla

The rapid spread of infectious diseases is a major public health problem. Recent developments in fighting these diseases have heightened the need for a contact tracing process. Contact tracing can be considered an ideal method for controlling the transmission of infectious diseases. The contact tracing process leads to diagnostic testing, treatment or self-isolation for suspected cases, and treatment for infected persons, which ultimately limits the spread of disease. This paper proposes a technique named TraceAll that traces all contacts exposed to an infected patient and produces a list of these contacts to be considered potentially infected. Initially, it treats the infected patient as the querying user and fetches the contacts exposed to them. Secondly, it obtains all the trajectories of objects that moved near the querying user. Next, it inspects these trajectories, using social distance and exposure period, to identify whether these objects may have become infected. Experimental evaluation of the proposed technique on real data sets illustrates the effectiveness of this solution. Comparative experiments confirm that TraceAll outperforms baseline methods by 40% in the efficiency of answering contact-tracing queries.
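A toy version of such a contact-tracing query, under stated assumptions: trajectories are synchronized samples taken every step_s seconds, and an object counts as a contact if it stays within dist_m of the querying user for at least exposure_s seconds in total. TraceAll itself relies on indexed trajectory processing; this only illustrates the query semantics.

```python
import numpy as np

def trace_contacts(query_traj, trajs, dist_m=2.0, exposure_s=900, step_s=60):
    """query_traj and each traj in trajs map timestamp -> (x, y).
    Returns ids of objects whose total time within dist_m of the
    querying user reaches exposure_s."""
    contacts = []
    for obj_id, traj in trajs.items():
        close_time = 0
        for t, (qx, qy) in query_traj.items():
            if t in traj:  # both objects observed at the same timestamp
                ox, oy = traj[t]
                if np.hypot(ox - qx, oy - qy) <= dist_m:
                    close_time += step_s
        if close_time >= exposure_s:
            contacts.append(obj_id)
    return contacts
```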


Symmetry ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 474
Author(s):  
Abdulhakim A. Al-Babtain ◽  
Ibrahim Elbatal ◽  
Hazem Al-Mofleh ◽  
Ahmed M. Gemeay ◽  
Ahmed Z. Afify ◽  
...  

In this paper, we introduce a new flexible generator of continuous distributions called the transmuted Burr X-G (TBX-G) family, which extends and increases the flexibility of the Burr X generator. The general statistical properties of the TBX-G family are derived. One special sub-model, the TBX-exponential distribution, is studied in detail. We discuss eight approaches to estimating the TBX-exponential parameters, and numerical simulations are conducted to compare the suggested approaches based on partial and overall ranks. Based on our study, the Anderson–Darling estimators are recommended for estimating the TBX-exponential parameters. Using two skewed real data sets from the engineering sciences, we illustrate the importance and flexibility of the TBX-exponential model compared with existing competing distributions.
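For concreteness, a sketch of one plausible composition of this generator, assuming the standard Burr X-G construction F = (1 - exp(-(G/(1-G))^2))^theta and the usual quadratic rank transmutation F_T = (1 + lam)*F - lam*F^2 with |lam| <= 1, applied to an exponential baseline; the paper's exact parametrization may differ.

```python
import numpy as np

def transmuted(F, lam):
    # Quadratic rank transmutation map, |lam| <= 1.
    return (1.0 + lam) * F - lam * F ** 2

def burr_x_g(G, theta):
    # Burr X generator applied to a baseline CDF G (assumed standard form).
    r = G / (1.0 - G)
    return (1.0 - np.exp(-r ** 2)) ** theta

def tbx_exponential_cdf(x, rate, theta, lam):
    # Exponential baseline pushed through Burr X-G, then transmuted.
    G = 1.0 - np.exp(-rate * np.asarray(x, dtype=float))
    return transmuted(burr_x_g(G, theta), lam)
```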


Stats ◽  
2021 ◽  
Vol 4 (1) ◽  
pp. 28-45
Author(s):  
Vasili B.V. Nagarjuna ◽  
R. Vishnu Vardhan ◽  
Christophe Chesneau

In this paper, a new five-parameter distribution is proposed using the functionalities of the Kumaraswamy generalized family of distributions and the features of the power Lomax distribution. It is called the Kumaraswamy generalized power Lomax distribution. In a first approach, we derive its main probability and reliability functions, with a visualization of its modeling behavior for different parameter combinations. As a prime quality, the corresponding hazard rate function is very flexible: it possesses decreasing, increasing and inverted (upside-down) bathtub shapes, and decreasing-increasing-decreasing shapes are also observed. Some important characteristics of the Kumaraswamy generalized power Lomax distribution are derived, including moments, entropy measures and order statistics. The second approach is statistical. The maximum likelihood estimates of the parameters are described, and a brief simulation study shows their effectiveness. Two real data sets are used to show how the proposed distribution can be applied concretely; parameter estimates are obtained and fitting comparisons are performed with other well-established Lomax-based distributions. The Kumaraswamy generalized power Lomax distribution turns out to perform best, capturing fine details in the structure of the data considered.
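A sketch of how the five parameters compose, assuming the usual Kumaraswamy generator F = 1 - (1 - G^a)^b and the power Lomax CDF G(x) = 1 - lam^alpha * (lam + x^beta)^(-alpha); both forms are assumptions taken from the standard literature, not quoted from the paper.

```python
import numpy as np

def power_lomax_cdf(x, alpha, lam, beta):
    # Power Lomax CDF (assumed form).
    x = np.asarray(x, dtype=float)
    return 1.0 - lam ** alpha * (lam + x ** beta) ** (-alpha)

def kumaraswamy_g(G, a, b):
    # Kumaraswamy generator applied to a baseline CDF G.
    return 1.0 - (1.0 - G ** a) ** b

def kw_power_lomax_cdf(x, a, b, alpha, lam, beta):
    # Five-parameter Kumaraswamy generalized power Lomax CDF (sketch).
    return kumaraswamy_g(power_lomax_cdf(x, alpha, lam, beta), a, b)
```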


2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 286-286
Author(s):  
Anatoliy Yashin ◽  
Dequing Wu ◽  
Konstantin Arbeev ◽  
Arseniy Yashkin ◽  
Galina Gorbunova ◽  
...  

Persistent stress of external or internal origin accelerates aging, increases the risk of aging-related health disorders, and shortens lifespan. Stressors activate stress response genes, and their products collectively influence traits. The variability of stressors and of responses to them contributes to trait heterogeneity, which may cause the failure of clinical trials for drug candidates. The objectives of this paper are: to address the heterogeneity issue; to evaluate the collective interaction effects of genetic factors on Alzheimer's disease (AD) and longevity using HRS data; to identify differences and similarities in patterns of genetic interactions between the two genders; and to compare AD-related genetic interaction patterns in the HRS and LOADFS data. To reach these objectives, we: selected candidate genes from stress-related pathways affecting AD/longevity; implemented a logistic regression model with an interaction term to evaluate the effects of SNP pairs on these traits for males and females; constructed novel interaction polygenic risk scores from SNPs that showed strong interaction potential and evaluated the effects of these scores on AD/longevity; and compared patterns of genetic interactions between the two genders and between the two datasets. We found many genes involved in highly significant interactions, some shared between the two genders and some gender-specific. The effects of the interaction polygenic risk scores on AD were strong and highly statistically significant. These conclusions were confirmed in analyses of interaction effects on the longevity trait using HRS data. Comparison of HRS with LOADFS data showed that many genes had strong interaction effects on AD in both data sets.
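A minimal sketch of the per-pair interaction model described above: logit P(AD) = b0 + b1*SNP1 + b2*SNP2 + b3*SNP1*SNP2 (plus covariates), with the interaction coefficient b3 feeding an interaction polygenic risk score. Variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def snp_pair_interaction(snp1, snp2, outcome, covars=None):
    """Logistic regression with a SNP-by-SNP interaction term.
    Returns the interaction coefficient and its p-value."""
    X = np.column_stack([snp1, snp2, snp1 * snp2])
    if covars is not None:
        X = np.column_stack([X, covars])
    fit = sm.Logit(outcome, sm.add_constant(X)).fit(disp=0)
    return fit.params[3], fit.pvalues[3]  # index 3 = interaction term

# An interaction PRS can then sum b3 * SNP1 * SNP2 over all pairs whose
# interaction term passes a chosen significance threshold.
```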


Symmetry ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1114
Author(s):  
Guillermo Martínez-Flórez ◽  
Roger Tovar-Falón ◽  
María Martínez-Guerra

This paper introduces a new family of distributions for modelling censored multimodal data. The model extends the widely known tobit model by introducing two parameters that control the shape and the asymmetry of the distribution. Basic properties of this new family of distributions are studied in detail and a model for censored positive data is also studied. The problem of estimating parameters is addressed by considering the maximum likelihood method. The score functions and the elements of the observed information matrix are given. Finally, three applications to real data sets are reported to illustrate the developed methodology.
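For reference, a sketch of the log-likelihood of the standard left-censored tobit model that this family extends: uncensored points contribute the normal density, censored points the normal CDF. The paper's shape and asymmetry parameters would enter through a non-normal error law; they are not reproduced here.

```python
import numpy as np
from scipy import stats, optimize

def tobit_negloglik(params, X, y, cens=0.0):
    """Standard left-censored tobit: y = max(X @ beta + sigma*eps, cens)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)          # log-parametrized to stay positive
    mu = X @ beta
    obs = y > cens
    ll = stats.norm.logpdf(y[obs], mu[obs], sigma).sum()       # observed
    ll += stats.norm.logcdf((cens - mu[~obs]) / sigma).sum()   # censored
    return -ll

# Fit by maximum likelihood, e.g.:
# res = optimize.minimize(tobit_negloglik, x0, args=(X, y), method="BFGS")
```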


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1850
Author(s):  
Rashad A. R. Bantan ◽  
Farrukh Jamal ◽  
Christophe Chesneau ◽  
Mohammed Elgarhy

Unit distributions are commonly used in probability and statistics to describe quantities with values between 0 and 1, such as proportions, probabilities, and percentages. Some unit distributions are defined in a natural analytical manner, while others are derived by transforming an existing distribution defined on a larger domain. In this article, we introduce the unit gamma/Gompertz distribution, founded on the inverse-exponential scheme and the gamma/Gompertz distribution. The gamma/Gompertz distribution is known to be a very flexible three-parameter lifetime distribution, and we aim to transpose this flexibility to the unit interval. First, we verify this through the analytical behavior of the primary functions. It is shown that the probability density function can be increasing, decreasing, "increasing-decreasing" or "decreasing-increasing", with flexible asymmetry properties. On the other hand, the hazard rate function has monotonically increasing, decreasing, or constant shapes. We complete the theoretical part with propositions on stochastic ordering, moments, quantiles, and the reliability coefficient. Practically, the maximum likelihood method is used to estimate the model parameters from unit data, and simulation results are presented to evaluate this method. Two applications using real data sets, one on trade shares and the other on flood levels, demonstrate the importance of the new model compared to other unit models.
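A sketch of the inverse-exponential scheme named above: if Y follows the gamma/Gompertz law, then U = exp(-Y) lives on (0, 1] with CDF F_U(u) = S_Y(-log u). The gamma/Gompertz survival form used below, S(t) = beta^s / (beta - 1 + exp(b*t))^s, is an assumption taken from the standard literature.

```python
import numpy as np

def gamma_gompertz_sf(t, b, s, beta):
    # Gamma/Gompertz survival function (assumed standard form).
    return beta ** s / (beta - 1.0 + np.exp(b * t)) ** s

def unit_gamma_gompertz_cdf(u, b, s, beta):
    """U = exp(-Y) with Y gamma/Gompertz:
    F_U(u) = P(exp(-Y) <= u) = P(Y >= -log u) = S_Y(-log u)."""
    u = np.asarray(u, dtype=float)
    return gamma_gompertz_sf(-np.log(u), b, s, beta)
```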

