A prototype knockoff filter for group selection with FDR control

Abstract In many applications, we need to study a linear regression model that consists of a response variable and a large number of potential explanatory variables, and determine which variables are truly associated with the response. In Foygel Barber & Candès (2015, Ann. Statist., 43, 2055–2085), the authors introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. In this paper, we propose a prototype knockoff filter for group selection by extending the Reid–Tibshirani (2016, Biostatistics, 17, 364–376) prototype method. Our prototype knockoff filter improves the computational efficiency and statistical power of the Reid–Tibshirani prototype method when it is applied for group selection. In some cases when the group features are spanned by one or a few hidden factors, we demonstrate that the Principal Component Analysis (PCA) prototype knockoff filter outperforms the Dai–Foygel Barber (2016, 33rd International Conference on Machine Learning (ICML 2016)) group knockoff filter. We present several numerical experiments to compare our prototype knockoff filter with the Reid–Tibshirani prototype method and the group knockoff filter. We have also conducted some analysis of the knockoff filter. Our analysis reveals that some knockoff path method statistics, including the Lasso path statistic, may lead to loss of power for certain design matrices and a specially designed response even if their signal strengths are still relatively strong.

Download Full-text

A pseudo knockoff filter for correlated features

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iay012 ◽

2018 ◽

Vol 8 (2) ◽

pp. 313-341

Author(s):

Jiajie Chen ◽

Anthony Hou ◽

Thomas Y Hou

Keyword(s):

Variable Selection ◽

False Discovery Rate ◽

Numerical Experiments ◽

Selection Procedure ◽

Numerical Examples ◽

False Discovery ◽

Variable Selection Procedure ◽

Partial Analysis ◽

False Discovery Proportion

Abstract In Barber & Candès (2015, Ann. Statist., 43, 2055–2085), the authors introduced a new variable selection procedure called the knockoff filter to control the false discovery rate (FDR) and proved that this method achieves exact FDR control. Inspired by the work by Barber & Candès (2015, Ann. Statist., 43, 2055–2085), we propose a pseudo knockoff filter that inherits some advantages of the original knockoff filter and has more flexibility in constructing its knockoff matrix. Moreover, we perform a number of numerical experiments that seem to suggest that the pseudo knockoff filter with the half Lasso statistic has FDR control and offers more power than the original knockoff filter with the Lasso Path or the half Lasso statistic for the numerical examples that we consider in this paper. Although we cannot establish rigourous FDR control for the pseudo knockoff filter, we provide some partial analysis of the pseudo knockoff filter with the half Lasso statistic and establish a uniform false discovery proportion bound and an expectation inequality.

Download Full-text

Sparse principal component regression via singular value decomposition approach

Advances in Data Analysis and Classification ◽

10.1007/s11634-020-00435-2 ◽

2021 ◽

Author(s):

Shuichi Kawano

Keyword(s):

Singular Value Decomposition ◽

Principal Components ◽

Principal Component Regression ◽

Principal Component ◽

Estimation Algorithm ◽

Singular Value ◽

Decomposition Approach ◽

Response Variable ◽

Explanatory Variables ◽

Value Decomposition

AbstractPrincipal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA) and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only explanatory variables, the principal components have no information about the response variable. To address this problem, we present a one-stage procedure for PCR based on a singular value decomposition approach. Our approach is based upon two loss functions, which are a regression loss and a PCA loss from the singular value decomposition, with sparse regularization. The proposed method enables us to obtain principal component loadings that include information about both explanatory variables and a response variable. An estimation algorithm is developed by using the alternating direction method of multipliers. We conduct numerical studies to show the effectiveness of the proposed method.

Download Full-text

Classification of Observations through Combination of the Dimension Reduction and the Cluster Analysis

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i8.13 ◽

2017 ◽

Vol 7 (8) ◽

pp. 30

Author(s):

Hyeuk Kim

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Cluster Analysis ◽

Unsupervised Learning ◽

Principal Component ◽

Component Analysis ◽

Baseball Players ◽

Partitioning Around Medoids ◽

Different Characteristics

Unsupervised learning in machine learning divides data into several groups. The observations in the same group have similar characteristics and the observations in the different groups have the different characteristics. In the paper, we classify data by partitioning around medoids which have some advantages over the k-means clustering. We apply it to baseball players in Korea Baseball League. We also apply the principal component analysis to data and draw the graph using two components for axis. We interpret the meaning of the clustering graphically through the procedure. The combination of the partitioning around medoids and the principal component analysis can be used to any other data and the approach makes us to figure out the characteristics easily.

Download Full-text

Analysis of the Bath Motion in the MM-SQC Dynamics Using Unsupervised Machine Learning Dimensionality Reduction Approaches: Principal Component Analysis

10.26434/chemrxiv.13332530 ◽

2020 ◽

Author(s):

Jiawei Peng ◽

Yu Xie ◽

Deping Hu ◽

Zhenggang Lan

Keyword(s):

Machine Learning ◽

Principal Component Analysis ◽

Collective Motion ◽

Principal Component ◽

Component Analysis ◽

Nonadiabatic Dynamics ◽

Trajectory Data ◽

Unsupervised Machine Learning ◽

Physical Knowledge ◽

Vibronic Couplings

The system-plus-bath model is an important tool to understand nonadiabatic dynamics for large molecular systems. The understanding of the collective motion of a huge number of bath modes is essential to reveal their key roles in the overall dynamics. We apply the principal component analysis (PCA) to investigate the bath motion based on the massive data generated from the MM-SQC (symmetrical quasi-classical dynamics method based on the Meyer-Miller mapping Hamiltonian) nonadiabatic dynamics of the excited-state energy transfer dynamics of Frenkel-exciton model. The PCA method clearly clarifies that two types of bath modes, which either display the strong vibronic couplings or have the frequencies close to electronic transition, are very important to the nonadiabatic dynamics. These observations are fully consistent with the physical insights. This conclusion is obtained purely based on the PCA understanding of the trajectory data, without the large involvement of pre-defined physical knowledge. The results show that the PCA approach, one of the simplest unsupervised machine learning methods, is very powerful to analyze the complicated nonadiabatic dynamics in condensed phase involving many degrees of freedom.

Download Full-text

Application of Machine Learning in Animal Disease Analysis and Prediction

Current Bioinformatics ◽

10.2174/1574893615999200728195613 ◽

2020 ◽

Vol 15 ◽

Author(s):

Shuwen Zhang ◽

Qiang Su ◽

Qin Chen

Keyword(s):

Machine Learning ◽

Unsupervised Learning ◽

Supervised Learning ◽

Clustering Algorithm ◽

Principal Component ◽

Support Vector ◽

Animal Disease ◽

Human Beings ◽

Animal Diseases ◽

Disease Analysis

Abstract: Major animal diseases pose a great threat to animal husbandry and human beings. With the deepening of globalization and the abundance of data resources, the prediction and analysis of animal diseases by using big data are becoming more and more important. The focus of machine learning is to make computers learn how to learn from data and use the learned experience to analyze and predict. Firstly, this paper introduces the animal epidemic situation and machine learning. Then it briefly introduces the application of machine learning in animal disease analysis and prediction. Machine learning is mainly divided into supervised learning and unsupervised learning. Supervised learning includes support vector machines, naive bayes, decision trees, random forests, logistic regression, artificial neural networks, deep learning, and AdaBoost. Unsupervised learning has maximum expectation algorithm, principal component analysis hierarchical clustering algorithm and maxent. Through the discussion of this paper, people have a clearer concept of machine learning and understand its application prospect in animal diseases.

Download Full-text

Transformer Oil Quality Assessment Using Random Forest with Feature Engineering

Energies ◽

10.3390/en14071809 ◽

2021 ◽

Vol 14 (7) ◽

pp. 1809

Author(s):

Mohammed El Amine Senoussaoui ◽

Mostefa Brahami ◽

Issouf Fofana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Oil Quality ◽

Principal Component ◽

Condition Assessment ◽

Classification Performance ◽

Transformer Oil ◽

Classification Model ◽

Insulation Degradation ◽

Transformer Oils

Machine learning is widely used as a panacea in many engineering applications including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: the oils that can be maintained in service, the oils that should be reconditioned or filtered, the oils that should be reclaimed, and the oils that must be discarded. From the two algorithms, random forest exhibited a better performance and high accuracy with only a small amount of data. Good performance was achieved through not only the application of the proposed algorithm but also the approach of data preprocessing. Before feeding the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were again retransformed by conducting the principal component analysis and were passed through the CFsSubset filter. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the decrease in the number of the datasets required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.

Download Full-text

Discretization and machine learning approximation of BSDEs with a constraint on the Gains-process

Monte Carlo Methods and Applications ◽

10.1515/mcma-2020-2080 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Idris Kharroubi ◽

Thomas Lim ◽

Xavier Warin

Keyword(s):

Neural Network ◽

Machine Learning ◽

Neural Networks ◽

Differential Equations ◽

Numerical Experiments ◽

Optimization Problem ◽

Learning Approach ◽

The Neural Network ◽

Machine Learning Approach ◽

Mesh Grid

AbstractWe study the approximation of backward stochastic differential equations (BSDEs for short) with a constraint on the gains process. We first discretize the constraint by applying a so-called facelift operator at times of a grid. We show that this discretely constrained BSDE converges to the continuously constrained one as the mesh grid converges to zero. We then focus on the approximation of the discretely constrained BSDE. For that we adopt a machine learning approach. We show that the facelift can be approximated by an optimization problem over a class of neural networks under constraints on the neural network and its derivative. We then derive an algorithm converging to the discretely constrained BSDE as the number of neurons goes to infinity. We end by numerical experiments.

Download Full-text

Principal component regression in GAMLSS applied to Greek–German government bond yield spreads

Statistical Modelling ◽

10.1177/1471082x211022980 ◽

2021 ◽

pp. 1471082X2110229

Author(s):

D. Stasinopoulos Mikis ◽

A. Rigby Robert ◽

Georgikopoulos Nikolaos ◽

De Bastiani Fernanda

Keyword(s):

Additive Model ◽

Principal Component Regression ◽

Principal Component ◽

Complex Nature ◽

Government Bond ◽

German Government ◽

Yield Spreads ◽

Explanatory Variables ◽

Interaction Terms ◽

Bond Yield

A solution to the problem of having to deal with a large number of interrelated explanatory variables within a generalized additive model for location, scale and shape (GAMLSS) is given here using as an example the Greek–German government bond yield spreads from 25 April 2005 to 31 March 2010. Those were turbulent financial years, and in order to capture the spreads behaviour, a model has to be able to deal with the complex nature of the financial indicators used to predict the spreads. Fitting a model, using principal components regression of both main and first order interaction terms, for all the parameters of the assumed distribution of the response variable seems to produce promising results.

Download Full-text

Comparison of Capability of SAR and Optical Data in Mapping Forest above Ground Biomass Based on Machine Learning

Environmental Sciences Proceedings ◽

10.3390/iecg2020-07916 ◽

2020 ◽

Vol 5 (1) ◽

pp. 13

Author(s):

Negar Tavasoli ◽

Hossein Arefi

Keyword(s):

Machine Learning ◽

Carbon Stock ◽

Vegetation Indices ◽

Forest Biomass ◽

Principal Component ◽

Optical Data ◽

Above Ground Biomass ◽

Ground Biomass ◽

Texture Characteristics ◽

Sentinel 2

Assessment of forest above ground biomass (AGB) is critical for managing forest and understanding the role of forest as source of carbon fluxes. Recently, satellite remote sensing products offer the chance to map forest biomass and carbon stock. The present study focuses on comparing the potential use of combination of ALOSPALSAR and Sentinel-1 SAR data, with Sentinel-2 optical data to estimate above ground biomass and carbon stock using Genetic-Random forest machine learning (GA-RF) algorithm. Polarimetric decompositions, texture characteristics and backscatter coefficients of ALOSPALSAR and Sentinel-1, and vegetation indices, tasseled cap, texture parameters and principal component analysis (PCA) of Sentinel-2 based on measured AGB samples were used to estimate biomass. The overall coefficient (R2) of AGB modelling using combination of ALOSPALSAR and Sentinel-1 data, and Sentinel-2 data were respectively 0.70 and 0.62. The result showed that Combining ALOSPALSAR and Sentinel-1 data to predict AGB by using GA-RF model performed better than Sentinel-2 data.

Download Full-text

Full spectrum and genetic algorithm-selected spectrum-based chemometric methods for simultaneous determination of azilsartan medoxomil, chlorthalidone, and azilsartan: Development, validation, and application on commercial dosage form

Open Chemistry ◽

10.1515/chem-2021-0022 ◽

2021 ◽

Vol 19 (1) ◽

pp. 205-213

Author(s):

Hany W. Darwish ◽

Abdulrahman A. Al Majed ◽

Ibrahim A. Al-Suwaidan ◽

Ibrahim A. Darwish ◽

Ahmed H. Bakheit ◽

...

Keyword(s):

Genetic Algorithm ◽

Simultaneous Determination ◽

Predictive Power ◽

Principal Component Regression ◽

Selection Procedure ◽

Principal Component ◽

Chemometric Methods ◽

Full Spectrum ◽

Azilsartan Medoxomil

Abstract Five various chemometric methods were established for the simultaneous determination of azilsartan medoxomil (AZM) and chlorthalidone in the presence of azilsartan which is the core impurity of AZM. The full spectrum-based chemometric techniques, namely partial least squares (PLS), principal component regression, and artificial neural networks (ANN), were among the applied methods. Besides, the ANN and PLS were the other two methods that were extended by genetic algorithm procedure (GA-PLS and GA-ANN) as a wavelength selection procedure. The models were developed by applying a multilevel multifactor experimental design. The predictive power of the suggested models was evaluated through a validation set containing nine mixtures with different ratios of the three analytes. For the analysis of Edarbyclor® tablets, all the proposed procedures were applied and the best results were achieved in the case of ANN, GA-ANN, and GA-PLS methods. The findings of the three methods were revealed as the quantitative tool for the analysis of the three components without any intrusion from the co-formulated excipient and without prior separation procedures. Moreover, the GA impact on strengthening the predictive power of ANN- and PLS-based models was also highlighted.

Download Full-text