Consensus Machine Learning for Gene Target Selection in Pediatric AML Risk

Abstract: Acute myeloid leukemia (AML) is a cancer of hematopoietic systems that poses high population burden, especially among pediatric populations. AML presents with high molecular heterogeneity, complicating patient risk stratification and treatment planning. While molecular and cytogenetic subtypes of AML are well described, significance of subtype-specific gene expression patterns is poorly understood and effective modeling of these patterns with individual algorithms is challenging. Using a novel consensus machine learning approach, we analyzed public RNA-seq and clinical data from pediatric AML patients (N = 137 patients) enrolled in the TARGET project. We used a binary risk classifier (Low vs. Not-Low Risk) to study risk-specific expression patterns in pediatric AML. We applied the following workflow to identify important gene targets from RNA-seq data: (1) Reduce data dimensionality by identification of differentially expressed genes for AML risk (N = 1984 loci); (2) Optimize algorithm hyperparameters for each of 4 algorithm types (lasso, XGBoost, random forest, and SVM); (3) Study ablation test results for penalized methods (lasso and XGBoost); (4) Bootstrap Boruta permutations with a novel consensus importance metric. We observed recurrently selected features across hyperparameter optimizations, ablation tests, and Boruta permutation bootstrap iterations, including HOXA9 and putative cofactors including MEIS1. Consensus feature selection from Boruta bootstraps identified a larger gene set than single penalized algorithm runs (lasso or XGBoost), while also including correlated and predictive genes from ablation tests. We present a consensus machine learning approach to identify gene targets of likely importance for pediatric AML risk. The approach identified a moderately sized set of recurrent important genes from across 4 algorithm types, including genes identified across ablation tests with penalized algorithms (HOXA9 and MEIS1). Our approach mitigates exclusion biases of penalized algorithms (lasso and XGBoost) while obviating arbitrary importance cutoffs for other types (SVM and random forest). This approach is readily generalizable for research of other heterogeneous diseases, single-assay experiments, and high-dimensional data. Resources and code to recreate our findings are available online.
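The idea of recurrent selection across bootstrap iterations can be illustrated with a simple selection-frequency tally. This is only a minimal sketch: the abstract does not define the consensus importance metric, so the frequency rule, the `min_fraction` parameter, and the `GENE_A`/`GENE_B` placeholders below are assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def consensus_features(selections, min_fraction=0.5):
    """Keep features selected in at least `min_fraction` of the runs.

    `selections` is a list of per-run feature lists. Returns a dict mapping
    each retained feature to its selection frequency, most frequent first.
    """
    n_runs = len(selections)
    # Count each feature at most once per run.
    counts = Counter(gene for run in selections for gene in set(run))
    freq = {gene: count / n_runs for gene, count in counts.items()}
    kept = {g: f for g, f in freq.items() if f >= min_fraction}
    return dict(sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])))


# Hypothetical selections from four bootstrap iterations. Only the names
# HOXA9 and MEIS1 come from the abstract; the selections are invented.
runs = [
    ["HOXA9", "MEIS1", "GENE_A"],
    ["HOXA9", "GENE_B"],
    ["HOXA9", "MEIS1"],
    ["MEIS1", "HOXA9", "GENE_A"],
]
consensus = consensus_features(runs, min_fraction=0.75)
print(consensus)
```

A frequency rule like this still needs a cutoff, which is one reason to treat it as a simplification of whatever metric the paper actually uses.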

SMO-RF: A machine learning approach by random forest for predicting class imbalancing followed by SMOTE

Materials Today Proceedings, 2021. DOI: 10.1016/j.matpr.2020.12.891

Author(s): Ankur Goyal, Likhita Rathore, Avinash Sharma

Keyword(s): Machine Learning, Random Forest, Learning Approach, Machine Learning Approach

A machine learning approach using random forest and LASSO to predict wine quality

International Journal of Sustainable Agricultural Management and Informatics, 2021, Vol 7 (3), pp. 1. DOI: 10.1504/ijsami.2021.10040429

Author(s): Dimitris Ioannidis, Ioannis Athanasiadis

Keyword(s): Machine Learning, Random Forest, Learning Approach, Wine Quality, Machine Learning Approach

Machine-learning and statistical methods for DDoS attack detection and defense system in software defined networks

2021. DOI: 10.32920/ryerson.14657556

Author(s): Merlin James Rukshan Dennis

Keyword(s): Machine Learning, Random Forest, Statistical Approach, Denial Of Service, Attack Detection, Learning Approach, DDoS Attack, Machine Learning Approach, DDoS Detection, DDoS Attack Detection

A Distributed Denial of Service (DDoS) attack is a serious threat to today’s Internet. As the traffic across the Internet increases day by day, it is a challenge to distinguish between legitimate and malicious traffic. This thesis proposes two different approaches to build an efficient DDoS attack detection system in the Software Defined Networking (SDN) environment. SDN is a recent networking approach that implements a programmable, centralized controller. The central control and the programming capability of the controller are used in this thesis to implement the detection and mitigation mechanisms. Two approaches, a statistical approach and a machine-learning approach, are proposed for DDoS detection. The statistical approach implements entropy computation and flow statistics analysis. It uses the mean and standard deviation of destination entropy, new flow arrival rate, packets per flow and flow duration to compute various thresholds. These thresholds are then used to distinguish normal and attack traffic. The machine-learning approach uses a Random Forest classifier to detect the DDoS attack. We fine-tune the Random Forest algorithm to make it more accurate in DDoS detection. In particular, we introduce weighted voting instead of the standard majority voting to improve the accuracy. Our results show that the proposed machine-learning approach outperforms the statistical approach. Furthermore, it also outperforms other machine-learning approaches found in the literature.
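The statistical idea of thresholding destination entropy can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not give the threshold formula, so the lower bound `mean - k * stdev`, the value of `k`, the window contents, and the function names are all hypothetical. The sketch relies on the common observation that traffic concentrated on a single destination has low destination entropy.

```python
import math
from collections import Counter
from statistics import mean, stdev


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lower_threshold(baseline_entropies, k=3.0):
    """Assumed rule: flag windows whose entropy is below mean - k * stdev."""
    return mean(baseline_entropies) - k * stdev(baseline_entropies)


# Hypothetical destination IPs seen in three windows of normal traffic.
normal_windows = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
    ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.3"],
    ["10.0.0.2", "10.0.0.2", "10.0.0.3", "10.0.0.4"],
]
baseline = [shannon_entropy(w) for w in normal_windows]
threshold = lower_threshold(baseline, k=3.0)

# Hypothetical attack window: every packet targets the same destination.
attack_window = ["10.0.0.9"] * 4
is_attack = shannon_entropy(attack_window) < threshold
print(f"threshold={threshold:.3f}, attack flagged={is_attack}")
```

In practice the same mean-and-deviation rule would be applied to the other statistics named in the abstract (new flow arrival rate, packets per flow, flow duration), each with its own direction of deviation.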

Modeling of apartment prices in a Colombian context from a machine learning approach with stable-important attributes

DYNA, 2020, Vol 87 (212), pp. 63-72. DOI: 10.15446/dyna.v87n212.80202

Author(s): Jorge Iván Pérez Rave, Favián González Echavarría, Juan Carlos Correa Morales

Keyword(s): Machine Learning, Random Forest, Learning Approach, Predictive Capability, Predictive Capacity, Machine Learning Model, Machine Learning Approach, Property Price, Object Of Study, Online Pricing

The objective of this work is to develop a machine learning model for online pricing of apartments in a Colombian context. This article addresses three aspects: i) it compares the predictive capacity of linear regression, regression trees, random forest and bagging; ii) it studies the effect of a group of text attributes on the predictive capability of the models; and iii) it identifies the more stable-important attributes and interprets them from an inferential perspective to better understand the object of study. The sample consists of 15,177 observations of real estate. The ensemble methods (random forest and bagging) show predictive superiority with respect to the others. The attributes derived from the text had a significant relationship with the property price (on a log scale). However, their contribution to the predictive capacity was almost nil, since four different attributes achieved highly accurate predictions and remained stable when the sample changed.

Metabolomic Approach for Discrimination of Cultivation Age and Ripening Stage in Ginseng Berry Using Gas Chromatography-Mass Spectrometry

Molecules, 2019, Vol 24 (21), pp. 3837. DOI: 10.3390/molecules24213837. Cited by ~1

Author(s): Seong-Eun Park, Seung-Ho Seo, Eun-Ju Kim, Dae-Hun Park, Kyung-Mok Park, ...

Keyword(s): Machine Learning, Mass Spectrometry, Gas Chromatography, Random Forest, Gas Chromatography Mass Spectrometry, Learning Approach, Ripening Stage, Ripening Stages, Machine Learning Approach, Metabolomic Approach

The purpose of this study was to analyze metabolic differences of ginseng berries according to cultivation age and ripening stage using a gas chromatography-mass spectrometry (GC-MS)-based metabolomics method. Ginseng berries were harvested every week during five different ripening stages of three-year-old and four-year-old ginseng. Using identified metabolites, a random forest machine learning approach was applied to obtain predictive models for the classification of cultivation age or ripening stage. The principal component analysis (PCA) score plot showed a clear separation by ripening stage, indicating that continuous metabolic changes occurred until the fifth ripening stage. Three-year-old ginseng berries had higher levels of valine, glutamic acid, and tryptophan, but lower levels of lactic acid and galactose, than four-year-old ginseng berries at the fully ripened stage. Metabolic pathways affected by different cultivation ages were involved in amino acid metabolism. The random forest machine learning approach extracted some important metabolites for predicting cultivation age or ripening stage with a low error rate. This study demonstrates that different cultivation ages or ripening stages of ginseng berry can be successfully discriminated using a GC-MS-based metabolomic approach together with random forest analysis.

Cholera Risk: A Machine Learning Approach Applied to Essential Climate Variables

International Journal of Environmental Research and Public Health, 2020, Vol 17 (24), pp. 9378. DOI: 10.3390/ijerph17249378

Author(s): Amy Marie Campbell, Marie-Fanny Racault, Stephen Goult, Angus Laurenson

Keyword(s): Machine Learning, Random Forest, Land Surface, Environmental Changes, Random Forest Classifier, Sea Surface Salinity, Learning Approach, Climate Variables, Surface Salinity, Machine Learning Approach

Oceanic and coastal ecosystems have undergone complex environmental changes in recent years, amid a context of climate change. These changes are also reflected in the dynamics of water-borne diseases as some of the causative agents of these illnesses are ubiquitous in the aquatic environment and their survival rates are impacted by changes in climatic conditions. Previous studies have established strong relationships between essential climate variables and the coastal distribution and seasonal dynamics of the bacteria Vibrio cholerae, pathogenic types of which are responsible for human cholera disease. In this study we provide a novel exploration of the potential of a machine learning approach to forecast environmental cholera risk in coastal India, home to more than 200 million inhabitants, utilising atmospheric, terrestrial and oceanic satellite-derived essential climate variables. A Random Forest classifier model is developed, trained and tested on a cholera outbreak dataset over the period 2010–2018 for districts along coastal India. The random forest classifier model has an Accuracy of 0.99, an F1 Score of 0.942 and a Sensitivity score of 0.895, meaning that 89.5% of outbreaks are correctly identified. Spatio-temporal patterns emerged in terms of the model’s performance based on seasons and coastal locations. Further analysis of the specific contribution of each Essential Climate Variable to the model outputs shows that chlorophyll-a concentration, sea surface salinity and land surface temperature are the strongest predictors of the cholera outbreaks in the dataset used. The study reveals promising potential of the use of random forest classifiers and remotely-sensed essential climate variables for the development of environmental cholera-risk applications. 
Further exploration of the present random forest model and associated essential climate variables is encouraged on cholera surveillance datasets in other coastal areas affected by the disease to determine the model’s transferability potential and applicative value for cholera forecasting systems.
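The relationship between accuracy, sensitivity, and F1 score quoted above can be made concrete with a small calculation from a confusion matrix. This is a sketch only: the counts below are hypothetical and are not taken from the study; they merely show how accuracy can sit well above sensitivity when outbreaks are the rare class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)  # also called recall or true positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 19 outbreak cases and 181 non-outbreak cases.
metrics = classification_metrics(tp=17, fp=1, fn=2, tn=180)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity is the fraction of true outbreaks that the classifier detects, a value of 0.895 corresponds to the abstract's statement that 89.5% of outbreaks are correctly identified.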

An Analytical Model for Prediction of Heart Disease using Machine Learning Classifiers

2021. DOI: 10.36227/techrxiv.14867175

Author(s): Diti Roy, Md. Ashiq Mahmood, Tamal Joyti Roy

Keyword(s): Machine Learning, Heart Disease, Random Forest, Learning Algorithm, Modern Technology, Learning Approach, Data Sets, Machine Learning Classifiers, Machine Learning Approach, Day By Day

Heart disease is a leading cause of death, claiming a large number of lives every year. A 2016 WHO report stated that at least 17 million people die of heart disease every year. This number is gradually increasing, and WHO estimated that the death toll will reach 75 million by 2030. Despite modern technology and health care systems, predicting heart disease remains difficult. Because machine learning algorithms can make predictions from available data sets, we have used a machine learning approach to predict heart disease. We collected data from the UCI repository. In our study, we used the Random Forest, Zero R, Voted Perceptron, and K star classifiers. We obtained the best result with the Random Forest classifier, with an accuracy of 97.69.
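Zero R, one of the classifiers listed above, is a baseline that ignores all input features and always predicts the most frequent training class. The following sketch, with hypothetical labels and helper names, shows how such a baseline is computed; it is useful context for reading any reported accuracy, since a model should clearly beat it.

```python
from collections import Counter


def zero_r_fit(train_labels):
    """Return the majority class of the training labels."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = heart disease present, 0 = absent.
train_labels = [1, 1, 1, 0, 0]
test_labels = [1, 0, 1, 1]

majority_class = zero_r_fit(train_labels)
baseline_predictions = [majority_class] * len(test_labels)
baseline_accuracy = accuracy(test_labels, baseline_predictions)
print(majority_class, baseline_accuracy)
```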

A multi-step machine learning approach to assess the impact of COVID-19 lockdown on NO2 attributable deaths in Milan and Rome, Italy

Environmental Health, 2022, Vol 21 (1). DOI: 10.1186/s12940-021-00825-9

Author(s): Luca Boniardi, Federica Nobile, Massimo Stafoggia, Paola Michelozzi, Carla Ancona

Keyword(s): Machine Learning, Land Use, Air Pollution, Random Forest, Citizen Science, Learning Approach, Machine Learning Approach, Forest Models, Random Forest Models, The Impact

Background: Air pollution is one of the main concerns for the health of European citizens, and cities are currently striving to accomplish EU air pollution regulation. The 2020 COVID-19 lockdown measures can be seen as an unintended but effective experiment to assess the impact of traffic restriction policies on air pollution. Our objective was to estimate the impact of the lockdown measures on NO2 concentrations and health in the two largest Italian cities. Methods: NO2 concentration datasets were built using data deriving from a 1-month citizen science monitoring campaign that took place in Milan and Rome just before the Italian lockdown period. Annual mean NO2 concentrations were estimated for a lockdown scenario (Scenario 1) and a scenario without lockdown (Scenario 2), by applying city-specific annual adjustment factors to the 1-month data. The latter were estimated deriving data from Air Quality Network stations and by applying a machine learning approach. NO2 spatial distribution was estimated at a neighbourhood scale by applying Land Use Random Forest models for the two scenarios. Finally, the impact of lockdown on health was estimated by subtracting attributable deaths for Scenario 1 and those for Scenario 2, both estimated by applying literature-based dose–response function on the counterfactual concentrations of 10 μg/m3. Results: The Land Use Random Forest models were able to capture 41–42% of the total NO2 variability. Passing from Scenario 2 (annual NO2 without lockdown) to Scenario 1 (annual NO2 with lockdown), the population-weighted exposure to NO2 for Milan and Rome decreased by 15.1% and 15.3% on an annual basis. Considering the 10 μg/m3 counterfactual, prevented deaths were respectively 213 and 604. Conclusions: Our results show that the lockdown had a beneficial impact on air quality and human health. However, compliance with the current EU legal limit is not enough to avoid a high number of NO2 attributable deaths. This contribution reaffirms the potentiality of the citizen science approach and calls for more ambitious traffic calming policies and a re-evaluation of the legal annual limit value for NO2 for the protection of human health.
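Population-weighted exposure, the quantity reported as decreasing by about 15%, is the average concentration weighted by how many people live in each area. The sketch below illustrates the calculation with invented neighbourhood populations and NO2 values; none of the numbers come from the study, and the function names are hypothetical.

```python
def population_weighted_mean(concentrations, populations):
    """Average concentration weighted by the population of each area."""
    total_population = sum(populations)
    weighted_sum = sum(c * p for c, p in zip(concentrations, populations))
    return weighted_sum / total_population


def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical neighbourhoods: residents and annual mean NO2 (μg/m3).
populations = [1000, 3000, 6000]
no2_without_lockdown = [40.0, 30.0, 20.0]  # analogous to Scenario 2
no2_with_lockdown = [34.0, 25.5, 17.0]     # analogous to Scenario 1

exposure_without = population_weighted_mean(no2_without_lockdown, populations)
exposure_with = population_weighted_mean(no2_with_lockdown, populations)
decrease = percent_decrease(exposure_without, exposure_with)
print(exposure_without, exposure_with, decrease)
```

Weighting by population means that a reduction in a densely inhabited neighbourhood moves the exposure estimate more than the same reduction in a sparsely inhabited one.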

Investigation of gut microbiome association with inflammatory bowel disease and depression: a machine learning approach

F1000Research, 2019, Vol 7, pp. 702. DOI: 10.12688/f1000research.15091.2

Author(s): Pedro Morell Miranda, Francesca Bertolini, Haja N. Kadarmideen

Keyword(s): Machine Learning, Inflammatory Bowel Disease, Random Forest, Bowel Disease, Gut Microbiome, Predictive Power, Learning Approach, Important Species, Machine Learning Approach, Inflammatory Bowel

Background: Inflammatory bowel disease (IBD) is a group of chronic diseases related to inflammatory processes in the digestive tract generally associated with an immune response to an altered gut microbiome in genetically predisposed subjects. For years, both researchers and clinicians have been reporting increased rates of anxiety and depression disorders in IBD, and these disorders have also been linked to an altered microbiome. However, the underlying pathophysiological mechanisms of comorbidity are poorly understood at the gut microbiome level. Methods: Metagenomic and metatranscriptomic data were retrieved from the Inflammatory Bowel Disease Multi-Omics Database. Samples from 70 individuals who had answered a self-reported depression and anxiety questionnaire were selected and classified by their IBD diagnosis and their questionnaire results, creating six different groups. The cross-validation random forest algorithm was used in 90% of the individuals (training set) to retain the most important species involved in discriminating the samples without losing predictive power. The validation set that represented the remaining 10% of the samples equally distributed across the six groups was used to train a random forest using only the species selected in order to evaluate their predictive power. Results: A total of 24 species were identified as the most informative in discriminating the 6 groups. Several of these species were frequently described in dysbiosis cases, such as species from the genus Bacteroides and Faecalibacterium prausnitzii. Despite the different compositions among the groups, no common patterns were found between samples classified as depressed. However, distinct taxonomic profiles among patients with IBD depending on their depression status were detected. Conclusions: The machine learning approach is a promising approach for investigating the role of microbiome in IBD and depression. Abundance and functional changes in these species suggest that depression should be considered as a factor in future research on IBD.

Prediction of Chemotherapy Response of Osteosarcoma Using Baseline 18F-FDG Textural Features Machine Learning Approaches with PCA

Contrast Media & Molecular Imaging, 2019, Vol 2019, pp. 1-7. DOI: 10.1155/2019/3515080. Cited by ~3

Author(s): Su Young Jeong, Wook Kim, Byung Hyun Byun, Chang-Bae Kong, Won Seok Song, ...

Keyword(s): Machine Learning, Random Forest, FDG PET, Image Texture, Support Vector, Chemotherapy Response, Learning Approach, Textural Features, Response To Chemotherapy, Machine Learning Approach

Purpose. Patients with high-grade osteosarcoma undergo several chemotherapy cycles before surgical intervention. Response to chemotherapy, however, is affected by intratumor heterogeneity. In this study, we assessed the ability of a machine learning approach using baseline 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) textural features to predict response to chemotherapy in osteosarcoma patients. Materials and Methods. This study included 70 osteosarcoma patients who received neoadjuvant chemotherapy. Quantitative characteristics of the tumors were evaluated by standardized uptake value (SUV), total lesion glycolysis (TLG), and metabolic tumor volume (MTV). Tumor heterogeneity was evaluated using textural analysis of 18F-FDG PET scan images. Assessments were performed at baseline and after chemotherapy using 18F-FDG PET; 18F-FDG textural features were evaluated using the Chang-Gung Image Texture Analysis toolbox. To predict the chemotherapy response, several features were chosen using the principal component analysis (PCA) feature selection method. Machine learning was performed using linear support vector machine (SVM), random forest, and gradient boost methods. The ability to predict chemotherapy response was evaluated using the area under the receiver operating characteristic curve (AUC). Results. AUCs of the baseline 18F-FDG features SUVmax, TLG, MTV, 1st entropy, and gray level co-occurrence matrix entropy were 0.553, 0.538, 0.536, 0.538, and 0.543, respectively. However, AUCs of the machine learning methods linear SVM, random forest, and gradient boost were 0.72, 0.78, and 0.82, respectively. Conclusion. We found that a machine learning approach based on 18F-FDG textural features could predict the chemotherapy response using baseline PET images. This early prediction of the chemotherapy response may aid in determining treatment plans for osteosarcoma patients.
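The AUC values above can be read as the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder, so 0.5 corresponds to chance. A minimal sketch of that pairwise definition follows; the labels and scores are hypothetical and the function name is chosen for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = responder, 0 = non-responder, with model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Read this way, single-feature AUCs near 0.54 are only slightly better than chance, while 0.72 to 0.82 indicates substantially better ranking of responders.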
