Differential Compositional Variation Feature Selection: A Machine Learning Framework with Log Ratios for Compositional Metagenomic Data

Learning Framework ◽

The demand for tight integration of compositional data analysis and machine learning methodologies for predictive modeling in high-dimensional settings has increased dramatically with the increasing availability of metagenomics data. We develop the differential compositional variation machine learning framework (DiCoVarML) with robust multi-level log ratio bio-marker discovery for metagenomic datasets. Our framework makes use of the full set of pairwise log ratios, scoring ratios according to their variation between classes and then selecting out a small subset of log ratios to accurately predict classes. Importantly, DiCoVarML supports a targeted feature selection mode enabling researchers to define the number of predictors used to develop models. We demonstrate the performance of our framework for binary classification tasks using both synthetic and real datasets. Selecting from all pairwise log ratios within the DiCoVarML framework provides greater flexibility that can in demonstrated cases lead to higher accuracy and enhanced biological insight.

To use or not to use: Feature selection for sentiment analysis of highly imbalanced data

Natural Language Engineering ◽

10.1017/s1351324917000298 ◽

2017 ◽

Vol 24 (1) ◽

pp. 3-37 ◽

Cited By ~ 5

Author(s):

SANDRA KÜBLER ◽

CAN LIU ◽

ZEESHAN ALI SAYYED

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Sentiment Analysis ◽

Information Gain ◽

Binary Classification ◽

Small Subset ◽

Large Set ◽

Learning Approaches ◽

Selection Methods ◽

Data Set

AbstractWe investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is a common approach to use word or part-of-speech n-grams. This results in a large set of features, out of which only a small subset may be good indicators for the sentiment. One of the questions we investigate concerns the extension of feature selection methods from a binary classification setting to a multi-class problem. We show that an inherently multi-class approach, multi-class information gain, outperforms ensembles of binary methods. We also investigate how to mitigate the effects of extreme skewing in our data set by making our features more robust and by using review and recipe sampling. We show that over-sampling is the best method for boosting performance on the minority classes, but it also results in a severe drop in overall accuracy of at least 6 per cent points.

A Machine Learning Framework for Intrusion Detection System in IoT Networks Using an Ensemble Feature Selection Method

10.1109/iemcon53756.2021.9623082 ◽

2021 ◽

Author(s):

Ge Guo

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Intrusion Detection ◽

Intrusion Detection System ◽

Detection System ◽

Feature Selection Method ◽

Selection Method ◽

Learning Framework

A machine-learning framework for predicting multiple air pollutants' concentrations via multi-target regression and feature selection

The Science of The Total Environment ◽

10.1016/j.scitotenv.2020.136991 ◽

2020 ◽

Vol 715 ◽

pp. 136991 ◽

Cited By ~ 2

Author(s):

Sahar Masmoudi ◽

Haytham Elghazel ◽

Dalila Taieb ◽

Orhan Yazar ◽

Amjad Kallel

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Air Pollutants ◽

Learning Framework

A machine learning framework to determine geolocations from metagenomic profiling

Biology Direct ◽

10.1186/s13062-020-00278-z ◽

2020 ◽

Vol 15 (1) ◽

Cited By ~ 1

Author(s):

Lihong Huang ◽

Canqiang Xu ◽

Wenxian Yang ◽

Rongshan Yu

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Geographic Origin ◽

Training Data ◽

Metagenomic Data ◽

Training Dataset ◽

Kriging Interpolation ◽

Learning Framework ◽

Testing Data ◽

Microbial Samples

Abstract Background Studies on metagenomic data of environmental microbial samples found that microbial communities seem to be geolocation-specific, and the microbiome abundance profile can be a differentiating feature to identify samples’ geolocations. In this paper, we present a machine learning framework to determine the geolocations from metagenomics profiling of microbial samples. Results Our method was applied to the multi-source microbiome data from MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. By using logistic regression with L2 normalization, the prediction accuracy of the model reaches 86%, averaged over 100 random splits of training and validation datasets.The testing data consists of samples from cities that do not occur in the training data. To predict the “mystery” cities that are not sampled before for the testing data, we first defined biological coordinates for sampled cities based on the similarity of microbial samples from them. Then we performed affine transform on the map such that the distance between cities measures their biological difference rather than geographical distance. After that, we derived the probabilities of a given testing sample from unsampled cities based on its predicted probabilities on sampled cities using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities-of-origin of testing samples. Conclusion Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.

Variable selection in microbiome compositional data analysis

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa029 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 2

Author(s):

Antoni Susin ◽

Yiwen Wang ◽

Kim-Anh Lê Cao ◽

M Luz Calle

Keyword(s):

Data Analysis ◽

Variable Selection ◽

Compositional Data ◽

Penalized Regression ◽

Forward Selection ◽

Computationally Efficient ◽

Parsimonious Model ◽

Microbiome Data ◽

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the availability of software and the difficulty in interpreting their results. This work is focused on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the link between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent makes the microbial signatures obtained from clr-lasso not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.

Counts: an outstanding challenge for log-ratio analysis of compositional data in the molecular biosciences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa040 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 1

Author(s):

David R Lovell ◽

Xin-Yi Chua ◽

Annette McGrath

Keyword(s):

Count Data ◽

Compositional Data ◽

Ratio Analysis ◽

Sequencing Technology ◽

Scale Invariant ◽

Measurement And Analysis ◽

Discrete Nature ◽

The Impact ◽

Abstract Thanks to sequencing technology, modern molecular bioscience datasets are often compositions of counts, e.g. counts of amplicons, mRNAs, etc. While there is growing appreciation that compositional data need special analysis and interpretation, less well understood is the discrete nature of these count compositions (or, as we call them, lattice compositions) and the impact this has on statistical analysis, particularly log-ratio analysis (LRA) of pairwise association. While LRA methods are scale-invariant, count compositional data are not; consequently, the conclusions we draw from LRA of lattice compositions depend on the scale of counts involved. We know that additive variation affects the relative abundance of small counts more than large counts; here we show that additive (quantization) variation comes from the discrete nature of count data itself, as well as (biological) variation in the system under study and (technical) variation from measurement and analysis processes. Variation due to quantization is inevitable, but its impact on conclusions depends on the underlying scale and distribution of counts. We illustrate the different distributions of real molecular bioscience data from different experimental settings to show why it is vital to understand the distributional characteristics of count data before applying and drawing conclusions from compositional data analysis methods.

Prediction of Antidepressant Treatment Response and Remission Using an Ensemble Machine Learning Framework

Pharmaceuticals ◽

10.3390/ph13100305 ◽

2020 ◽

Vol 13 (10) ◽

pp. 305

Author(s):

Eugene Lin ◽

Po-Hsiu Kuo ◽

Yu-Li Liu ◽

Younger W.-Y. Yu ◽

Albert C. Yang ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Feature Selection ◽

Treatment Response ◽

Antidepressant Treatment ◽

Learning Framework ◽

Antidepressant Treatments ◽

Ensemble Machine Learning ◽

Predictive Algorithms ◽

The Status

In the wake of recent advances in machine learning research, the study of pharmacogenomics using predictive algorithms serves as a new paradigmatic application. In this work, our goal was to explore an ensemble machine learning approach which aims to predict probable antidepressant treatment response and remission in major depressive disorder (MDD). To discover the status of antidepressant treatments, we established an ensemble predictive model with a feature selection algorithm resulting from the analysis of genetic variants and clinical variables of 421 patients who were treated with selective serotonin reuptake inhibitors. We also compared our ensemble machine learning framework with other state-of-the-art models including multi-layer feedforward neural networks (MFNNs), logistic regression, support vector machine, C4.5 decision tree, naïve Bayes, and random forests. Our data revealed that the ensemble predictive algorithm with feature selection (using fewer biomarkers) performed comparably to other predictive algorithms (such as MFNNs and logistic regression) to derive the perplexing relationship between biomarkers and the status of antidepressant treatments. Our study demonstrates that the ensemble machine learning framework may present a useful technique to create bioinformatics tools for discriminating non-responders from responders prior to antidepressant treatments.

Multiple similarly effective solutions exist for biomedical feature selection and classification problems

Scientific Reports ◽

10.1038/s41598-017-13184-8 ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 9

Author(s):

Jiamei Liu ◽

Cheng Xu ◽

Weifeng Yang ◽

Yayun Shu ◽

Weiwei Zheng ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Association Studies ◽

Binary Classification ◽

Learning Algorithms ◽

Optimal Solution ◽

Machine Learning Algorithms ◽

Disease Classification ◽

Genome Wide Association Studies ◽

Classification Problems

Abstract Binary classification is a widely employed problem to facilitate the decisions on various biomedical big data questions, such as clinical drug trials between treated participants and controls, and genome-wide association studies (GWASs) between participants with or without a phenotype. A machine learning model is trained for this purpose by optimizing the power of discriminating samples from two groups. However, most of the classification algorithms tend to generate one locally optimal solution according to the input dataset and the mathematical presumptions of the dataset. Here we demonstrated from the aspects of both disease classification and feature selection that multiple different solutions may have similar classification performances. So the existing machine learning algorithms may have ignored a horde of fishes by catching only a good one. Since most of the existing machine learning algorithms generate a solution by optimizing a mathematical goal, it may be essential for understanding the biological mechanisms for the investigated classification question, by considering both the generated solution and the ignored ones.

Performance Assessment in Water Polo Using Compositional Data Analysis

Journal of Human Kinetics ◽

10.1515/hukin-2016-0043 ◽

2016 ◽

Vol 54 (1) ◽

pp. 143-151 ◽

Cited By ~ 4

Author(s):

Enrique García Ordóñez ◽

María del Carmen Iglesias Pérez ◽

Carlos Touriño González

Keyword(s):

Performance Indicators ◽

Cross Validation ◽

Compositional Data ◽

Original Sample ◽

Water Polo ◽

Discriminant Analyses ◽

Multivariate Discriminant ◽

Log Ratio ◽

Match Score

AbstractThe aim of the present study was to identify groups of offensive performance indicators which best discriminated between a match score (favourable, balanced or unfavourable) in water polo. The sample comprised 88 regular season games (2011-2014) from the Spanish Professional Water Polo League. The offensive performance indicators were clustered in five groups: Attacks in relation to the different playing situations; Shots in relation to the different playing situations; Attacks outcome; Origin of shots; Technical execution of shots. The variables of each group had a constant sum which equalled 100%. The data were compositional data, therefore the variables were changed by means of the additive log-ratio (alr) transformation. Multivariate discriminant analyses to compare the match scores were calculated using the transformed variables. With regard to the percentage of right classification, the results showed the group that discriminated the most between the match scores was “Attacks outcome” (60.4% for the original sample and 52.2% for cross-validation). The performance indicators that discriminated the most between the match scores in games with penalties were goals (structure coefficient (SC) = .761), counterattack shots (SC = .541) and counterattacks (SC = .481). In matches without penalties, goals were the primary discriminating factor (SC = .576). This approach provides a new tool to compare the importance of the offensive performance groups and their effect on the match score discrimination.

Assessing Global Covid-19 Cases Data through Compositional Data Analysis(CoDa)

10.1101/2020.12.17.20248424 ◽

2020 ◽

Author(s):

Luis P.V. Braga ◽

Dina Feigenbaum

Keyword(s):

Data Analysis ◽

Compositional Data ◽

Discrete Groups ◽

Data Sets ◽

Cumulative Number ◽

Governmental Agencies ◽

Global Pandemic ◽

Number Of Patients ◽

AbstractBackgroundCovid-19 cases data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications.MethodsThis work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach as Covid-19 cases data is compositional in nature. Under this methodology, for each country three attributes were selected: cumulative number of deaths (D); cumulative number of recovered patients(R); present number of patients (A).ResultsAfter the operation called closure, with c=1, a ternary diagram and Log-Ratio plots, as well as, compositional statistics are presented. Cluster analysis is then applied, splitting the countries into discrete groups.ConclusionsThis methodology can also be applied to other data sets such as countries, cities, provinces or districts in order to help authorities and governmental agencies to improve their actions to fight against a pandemic.