VEF: a Variant Filtering tool based on Ensemble methods

2019 ◽  
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract
Motivation: Variant discovery is crucial in medical and clinical research, especially in the setting of personalized medicine. As such, precision in variant identification is paramount. However, variants identified by current genomic analysis pipelines contain many false positives (i.e., incorrectly called variants). These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF), both proposed by GATK. However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem. This is possible by training on variant call data for which the set of “true” variants is known, i.e., for which a gold standard exists. Hence, we can classify each variant in the training VCF file as true or false using the gold standard, and further use the annotations of each variant as features for the classification problem. Once trained, VEF can be directly applied to filter the variants contained in a given VCF file. Analysis of several ensemble methods revealed random forest as offering the best performance, and hence VEF uses a random forest for the classification task.
Results: After training VEF on a whole genome sequencing (WGS) Human dataset of sample NA12878, we tested its performance on a WGS Human dataset of sample NA24385. For these two samples, sets of high-confidence variants have been produced and made available. Results show that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, and when the training and testing datasets differ either in coverage or in the sequencing machine used to generate the data. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared to VQSR (approximately 4 minutes for VEF versus 50 minutes for VQSR when filtering the SNPs of WGS Human sample NA24385). Code and scripts available at: github.com/ChuanyiZ/vef.
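The labeling step described above reduces to set membership: each call in the training VCF is marked true or false by its presence in the gold-standard call set. A minimal sketch, assuming plain tab-delimited VCF files and matching on (chromosome, position, ref, alt) keys; this is an illustration of the idea, not VEF's actual code.

```python
def load_variant_keys(vcf_path):
    """Collect (chrom, pos, ref, alt) keys from a plain-text VCF file."""
    keys = set()
    with open(vcf_path) as fh:
        for line in fh:
            if line.startswith("#"):  # skip header and meta lines
                continue
            chrom, pos, _vid, ref, alt = line.rstrip("\n").split("\t")[:5]
            keys.add((chrom, pos, ref, alt))
    return keys

def label_training_variants(train_vcf, gold_vcf):
    """Yield (variant_key, label); label is 1 if the call is in the gold standard."""
    gold = load_variant_keys(gold_vcf)
    for key in load_variant_keys(train_vcf):
        yield key, int(key in gold)
```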


2019 ◽  
Vol 36 (8) ◽  
pp. 2328-2336
Author(s):  
Chuanyi Zhang ◽  
Idoia Ochoa

Abstract
Motivation: Variants identified by current genomic analysis pipelines contain many incorrectly called variants. These can potentially be eliminated by applying state-of-the-art filtering tools, such as Variant Quality Score Recalibration (VQSR) or Hard Filtering (HF). However, these methods are very user-dependent and fail to run in some cases. We propose VEF, a variant filtering tool based on decision tree ensemble methods that overcomes the main drawbacks of VQSR and HF. Contrary to these methods, we treat filtering as a supervised learning problem, using variant call data with known ‘true’ variants, i.e., a gold standard, for training. Once trained, VEF can be directly applied to filter the variants contained in a given Variant Call Format (VCF) file (we consider training and testing VCF files generated with the same tools, as we assume they will share feature characteristics).
Results: For the analysis, we used whole genome sequencing (WGS) Human datasets for which gold standards are available. We show on these data that the proposed filtering tool VEF consistently outperforms VQSR and HF. In addition, we show that VEF generalizes well even when some features have missing values, when the training and testing datasets differ in coverage, and when sequencing pipelines other than GATK are used. Finally, since the training needs to be performed only once, there is a significant saving in running time when compared with VQSR (approximately 4 versus 50 min for filtering the single nucleotide polymorphisms of a WGS Human sample).
Availability and Implementation: Code and scripts available at: github.com/ChuanyiZ/vef.
Supplementary information: Supplementary data are available at Bioinformatics online.
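Once variants are labeled, the filtering itself amounts to fitting a classifier on per-variant annotations. A minimal sketch of that idea with scikit-learn's random forest; the annotation choice (e.g. QD, FS, MQ from the INFO field), the toy feature values, and the 0.5 keep-threshold are illustrative assumptions, and the actual implementation lives at github.com/ChuanyiZ/vef.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: one row per training variant, columns = VCF annotations
# (e.g. QD, FS, MQ); y: 1 if the call appears in the gold standard.
X_train = np.array([[12.3,  1.2, 60.0],
                    [ 2.1, 18.5, 31.0],
                    [14.8,  0.9, 59.2],
                    [ 3.4, 22.0, 35.1]])  # toy annotation values
y_train = np.array([1, 0, 1, 0])

forest = RandomForestClassifier(n_estimators=150, random_state=0)
forest.fit(X_train, y_train)

# Filtering a new call set: keep variants scored as likely true.
X_new = np.array([[10.8, 2.0, 58.5]])
keep = forest.predict_proba(X_new)[:, 1] >= 0.5
```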



Author(s):  
Ahmad R. Alsaber ◽  
Jiazhu Pan ◽  
Adeeba Al-Hurban 

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. Missing data are introduced under the MCAR, MAR, and NMAR mechanisms, at five missingness levels: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is an iterative imputation method, missForest, which is based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach achieves a high level of accuracy in estimating missing values: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and can therefore be considered appropriate for analyzing air quality data.
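missForest itself is an R package, but its iterative random-forest imputation can be approximated in Python with scikit-learn's IterativeImputer. A minimal sketch, with illustrative log-transformed pollutant values rather than the study's data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Log-transformed pollutant readings with missing entries (np.nan).
X = np.array([[1.2, 0.4, np.nan],
              [1.1, np.nan, 2.3],
              [np.nan, 0.5, 2.1],
              [1.3, 0.6, 2.2],
              [1.0, 0.5, 2.4]])

# Each column is modelled from the others with a random forest and the
# predictions refill the gaps over several rounds, as missForest does.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```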



Webology ◽  
2021 ◽  
Vol 18 (Special Issue 01) ◽  
pp. 183-195
Author(s):  
Thingbaijam Lenin ◽  
N. Chandrasekaran

A student’s academic performance is one of the most important parameters for evaluating the standard of any institute. It has become of paramount importance for any institute to identify students at risk of underperforming, failing, or even dropping out of a course. Machine learning techniques may be used to develop a model for predicting a student’s performance as early as at the time of admission. The task, however, is challenging, as the educational data available for modelling are usually imbalanced. We explore ensemble machine learning techniques, namely a bagging algorithm, random forest (rf), and boosting algorithms, adaptive boosting (adaboost), stochastic gradient boosting (gbm), and extreme gradient boosting (xgbTree), in an attempt to develop a model for predicting student performance at a private university in Meghalaya using three categories of data: demographic, prior academic record, and personality. The collected data are highly imbalanced and also contain missing values. We employ the k-nearest neighbour (knn) data imputation technique to tackle the missing values. The models are developed on the imputed data with 10-fold cross-validation and are evaluated using precision, specificity, recall, and kappa metrics. As the data are imbalanced, we avoid using accuracy as the metric for evaluating the models and instead use balanced accuracy and F-score. We compare the ensemble techniques with the single classifier C4.5. The best results are provided by random forest and adaboost, with an F-score of 66.67%, balanced accuracy of 75%, and accuracy of 96.94%.
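A minimal sketch of the pipeline this abstract outlines: kNN imputation of missing entries, then ensemble classifiers scored with balanced accuracy and F-score under 10-fold cross-validation. Synthetic imbalanced data stands in for the (non-public) student records, and all settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic imbalanced data with ~5% of entries knocked out at random.
X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

for name, clf in [("rf", RandomForestClassifier(random_state=0)),
                  ("adaboost", AdaBoostClassifier(random_state=0))]:
    pipe = make_pipeline(KNNImputer(n_neighbors=5), clf)  # impute, then fit
    scores = cross_validate(pipe, X, y, cv=10,
                            scoring=["balanced_accuracy", "f1"])
    print(name,
          round(scores["test_balanced_accuracy"].mean(), 3),
          round(scores["test_f1"].mean(), 3))
```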



2020 ◽  
pp. 661-672
Author(s):  
George Stavropoulos ◽  
Robert van Voorstenbosch ◽  
Frederik-Jan van Schooten ◽  
Agnieszka Smolinska


Proceedings ◽  
2019 ◽  
Vol 19 (1) ◽  
pp. 20
Author(s):  
Diego Pacheco Prado ◽  
Luis Ángel Ruiz

GEOBIA is an alternative for creating and updating land cover maps. In this work, we assessed combinations of geographic datasets of the Cajas National Park (Ecuador) to determine the appropriate dataset-algorithm combination for classification tasks in the Ecuadorian Andean region. The datasets included high-resolution data such as a photogrammetric orthomosaic, a DEM, and the derived slope. These data were compared with free Sentinel imagery for classifying natural land covers. We evaluated two aspects of the classification problem: the appropriate algorithm and the dataset combination. We evaluated the SMO, C4.5, and Random Forest algorithms for the selection of attributes and the classification of objects. The best kappa values in the comparison of classification algorithms were obtained with SMO (0.8182) and Random Forest (0.8117). In the evaluation of datasets, the kappa values of the photogrammetric orthomosaic and of the combination of Sentinel-1 and Sentinel-2 were similar using the C4.5 algorithm.
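The kappa values quoted above measure chance-corrected agreement between predicted and reference land covers. A minimal sketch of computing them with scikit-learn, using invented labels purely to show the calculation:

```python
from sklearn.metrics import cohen_kappa_score

# Reference land-cover labels versus two classifiers' predictions
# (labels invented purely for illustration).
reference     = ["forest", "grass", "water", "forest", "grass", "water"]
predicted_smo = ["forest", "grass", "water", "grass",  "grass", "water"]
predicted_rf  = ["forest", "grass", "forest", "forest", "grass", "water"]

print("SMO kappa:", cohen_kappa_score(reference, predicted_smo))
print("RF  kappa:", cohen_kappa_score(reference, predicted_rf))
```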



ADMET & DMPK ◽  
2020 ◽  
Author(s):  
John Mitchell

<p class="ADMETabstracttext">We describe three machine learning models submitted to the 2019 Solubility Challenge. All are founded on tree-like classifiers, with one model being based on Random Forest and another on the related Extra Trees algorithm. The third model is a consensus predictor combining the former two with a Bagging classifier. We call this consensus classifier Vox Machinarum, and here discuss how it benefits from the Wisdom of Crowds. On the first 2019 Solubility Challenge test set of 100 low-variance intrinsic aqueous solubilities, Extra Trees is our best classifier. One the other, a high-variance set of 32 molecules, we find that Vox Machinarum and Random Forest both perform a little better than Extra Trees, and almost equally to one another. We also compare the gold standard solubilities from the 2019 Solubility Challenge with a set of literature-based solubilities for most of the same compounds.</p>



2018 ◽  
Vol 7 (4.30) ◽  
pp. 170 ◽  
Author(s):  
Oyebayo Ridwan Olaniran ◽  
Mohd Asrul Affendi Bin Abdullah ◽  
Khuneswari A/P Gopal Pillay ◽  
Saidat Fehintola Olaniran

In this paper, we present a new method called Empirical Bayesian Random Forest (EBRF) for binary classification problems. The prior ingredient for the method is obtained using the bootstrap prior technique. EBRF explicitly addresses the low-accuracy problem of the Random Forest (RF) classifier that arises when the number of relevant input variables is low relative to the total number of input variables. The improvement is achieved by replacing the arbitrary subsample variable size with an empirical Bayesian estimate. The proposed and existing methods are illustrated on five high-dimensional microarray datasets derived from colon, breast, lymphoma, and Central Nervous System (CNS) cancer tumours. Results from the data analysis reveal that EBRF provides reasonably higher accuracy, sensitivity, specificity, and Area Under the Receiver Operating Characteristic Curve (AUC) than RF on most of the datasets used.
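A heavily hedged sketch of the core idea in scikit-learn terms: instead of the default per-split feature subsample of sqrt(p), estimate how many inputs are actually relevant and use that count as max_features. The importance-screening rule below is an invented stand-in for the paper's empirical Bayesian estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def fit_ebrf_like(X, y):
    """Fit an RF whose max_features is estimated from the data."""
    pilot = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    p = X.shape[1]
    # Count features whose importance exceeds the uniform baseline 1/p;
    # this screen is a stand-in for the paper's empirical Bayesian estimate.
    n_relevant = max(1, int(np.sum(pilot.feature_importances_ > 1.0 / p)))
    return RandomForestClassifier(n_estimators=500, max_features=n_relevant,
                                  random_state=0).fit(X, y)

# High-dimensional toy data: few informative features among many.
X, y = make_classification(n_samples=100, n_features=200, n_informative=5,
                           random_state=0)
model = fit_ebrf_like(X, y)
```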



2010 ◽  
Vol 44-47 ◽  
pp. 3538-3542
Author(s):  
Ai Guo Li ◽  
Xin Zhou ◽  
Jiu Long Zhang

Most inverse classification algorithms handle only discrete attributes and cannot deal with quantitative attributes. To overcome this disadvantage, discretization algorithms are applied within the inverse classification framework. The main idea is as follows: first, a group of feature attributes is selected using a feature selection algorithm; then, the quantitative attributes are discretized using discretization algorithms and the inverted statistics are constructed on the training samples; finally, the test samples are analyzed. Experimental results on the IRIS and Ecoli datasets show that this method finds the class label effectively, estimates missing values accurately, and performs no worse than ISGNN and kNN.
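A minimal sketch of that three-step pipeline: feature selection, discretization of quantitative attributes, and per-class "inverted" frequency statistics over the resulting bins. The counting scheme is an assumed reading of the abstract, not the paper's exact construction.

```python
from collections import Counter, defaultdict
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Step 1: feature selection; Step 2: discretize quantitative attributes.
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)
X_disc = KBinsDiscretizer(n_bins=4, encode="ordinal") \
    .fit_transform(X_sel).astype(int)

# Step 3: inverted statistics: for each (feature, bin) pair, count how
# often each class label occurs among the training samples.
inverted = defaultdict(Counter)
for row, label in zip(X_disc, y):
    for feat, bin_id in enumerate(row):
        inverted[(feat, bin_id)][label] += 1
```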


