TSLRF: Two-Stage Algorithm Based on Least Angle Regression and Random Forest in genome-wide association studies

AbstractOne of the most important tasks in genome-wide association analysis (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) which are related to target traits. With the development of sequencing technology, traditional statistical methods are difficult to analyze the corresponding high-dimensional massive data or SNPs. Recently, machine learning methods have become more popular in high-dimensional genetic data analysis for their fast computation speed. However, most of machine learning methods have several drawbacks, such as poor generalization ability, over-fitting, unsatisfactory classification and low detection accuracy. This study proposed a two-stage algorithm based on least angle regression and random forest (TSLRF), which firstly considered the control of population structure and polygenic effects, then selected the SNPs that were potentially related to target traits by using least angle regression (LARS), furtherly analyzed this variable subset using random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with target traits. The new method has more powerful detection in simulation experiments and real data analyses. The results of simulation experiments showed that, compared with the existing approaches, the new method effectively improved the detection ability of QTNs and model fitting degree, and required less calculation time. In addition, the new method significantly distinguished QTNs and other SNPs. Subsequently, the new method was applied to analyze five flowering-related traits in Arabidopsis. The results showed that, the distinction between QTNs and unrelated SNPs was more significant than the other methods. The new method detected 60 genes confirmed to be related to the target trait, which was significantly higher than the other methods, and simultaneously detected multiple gene clusters associated with the target trait.

Download Full-text

Overview of Machine Learning Methods for Genome-Wide Association Analysis

The Fifth International Conference on Biological Information and Biomedical Engineering ◽

10.1145/3469678.3469682 ◽

2021 ◽

Author(s):

Minzhu Xie ◽

Fang Liu

Keyword(s):

Machine Learning ◽

Association Analysis ◽

Genome Wide Association ◽

Learning Methods ◽

Genome Wide Association Analysis ◽

Machine Learning Methods ◽

Genome Wide

Download Full-text

VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data

10.1101/702902 ◽

2019 ◽

Cited By ~ 1

Author(s):

Arash Bayat ◽

Piotr Szul ◽

Aidan R. O’Brien ◽

Robert Dunne ◽

Oscar J. Luo ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Internet Of Things ◽

Random Forests ◽

Life Sciences ◽

High Dimensional ◽

The Internet ◽

Machine Learning Methods ◽

High Dimensional Datasets ◽

The Internet Of Things

AbstractThe demands on machine learning methods to cater for ultra high dimensional datasets, datasets with millions of features, have been increasing in domains like life sciences and the Internet of Things (IoT). While Random Forests are suitable for “wide” datasets, current implementations such as Google’s PLANET lack the ability to scale to such dimensions. Recent improvements by Yggdrasil begin to address these limitations but do not extend to Random Forest. This paper introduces CursedForest, a novel Random Forest implementation on top of Apache Spark and part of the VariantSpark platform, which parallelises processing of all nodes over the entire forest. CursedForest is 9 and up to 89 times faster than Google’s PLANET and Yggdrasil, respectively, and is the first method capable of scaling to millions of features.

Download Full-text

Genome wide association analysis of the 16th QTL- MAS Workshop dataset using the Random Forest machine learning approach

BMC Proceedings ◽

10.1186/1753-6561-8-s5-s4 ◽

2014 ◽

Vol 8 (S5) ◽

Cited By ~ 4

Author(s):

Giulietta Minozzi ◽

Andrea Pedretti ◽

Stefano Biffani ◽

Ezequiel Luis Nicolazzi ◽

Alessandra Stella

Keyword(s):

Machine Learning ◽

Random Forest ◽

Association Analysis ◽

Genome Wide Association ◽

Learning Approach ◽

Genome Wide Association Analysis ◽

Genome Wide ◽

Machine Learning Approach

Download Full-text

Evaluation of Machine Learning Methods and Deep Learning Method Performance in Different Sample Size in Genome-Wide Association Studies

Turkiye Klinikleri Journal of Biostatistics ◽

10.5336/biostatic.2020-73403 ◽

2020 ◽

Vol 12 (2) ◽

pp. 204-210

Author(s):

Ragıp Onur ÖZTORNACI ◽

Erdal COŞGUN ◽

Bahar TAŞDELEN

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Sample Size ◽

Association Studies ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Learning Method ◽

Method Performance ◽

Machine Learning Methods ◽

Genome Wide

Download Full-text

Integrating Machine/Deep Learning Methods and Filtering Techniques for Reliable Mineral Phase Segmentation of 3D X-ray Computed Tomography Images

Energies ◽

10.3390/en14154595 ◽

2021 ◽

Vol 14 (15) ◽

pp. 4595

Author(s):

Parisa Asadi ◽

Lauren E. Beckingham

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Random Forest ◽

Ct Images ◽

Ct Imaging ◽

Learning Method ◽

Learning Methods ◽

X Ray ◽

Machine Learning Methods ◽

Filtering Techniques

X-ray CT imaging provides a 3D view of a sample and is a powerful tool for investigating the internal features of porous rock. Reliable phase segmentation in these images is highly necessary but, like any other digital rock imaging technique, is time-consuming, labor-intensive, and subjective. Combining 3D X-ray CT imaging with machine learning methods that can simultaneously consider several extracted features in addition to color attenuation, is a promising and powerful method for reliable phase segmentation. Machine learning-based phase segmentation of X-ray CT images enables faster data collection and interpretation than traditional methods. This study investigates the performance of several filtering techniques with three machine learning methods and a deep learning method to assess the potential for reliable feature extraction and pixel-level phase segmentation of X-ray CT images. Features were first extracted from images using well-known filters and from the second convolutional layer of the pre-trained VGG16 architecture. Then, K-means clustering, Random Forest, and Feed Forward Artificial Neural Network methods, as well as the modified U-Net model, were applied to the extracted input features. The models’ performances were then compared and contrasted to determine the influence of the machine learning method and input features on reliable phase segmentation. The results showed considering more dimensionality has promising results and all classification algorithms result in high accuracy ranging from 0.87 to 0.94. Feature-based Random Forest demonstrated the best performance among the machine learning models, with an accuracy of 0.88 for Mancos and 0.94 for Marcellus. The U-Net model with the linear combination of focal and dice loss also performed well with an accuracy of 0.91 and 0.93 for Mancos and Marcellus, respectively. In general, considering more features provided promising and reliable segmentation results that are valuable for analyzing the composition of dense samples, such as shales, which are significant unconventional reservoirs in oil recovery.

Download Full-text

Machine Learning in Aging: An Example of Developing Prediction Models for Serious Fall Injury in Older Adults

Innovation in Aging ◽

10.1093/geroni/igaa057.859 ◽

2020 ◽

Vol 4 (Supplement_1) ◽

pp. 268-269

Author(s):

Jaime Speiser ◽

Kathryn Callahan ◽

Jason Fanning ◽

Thomas Gill ◽

Anne Newman ◽

...

Keyword(s):

Machine Learning ◽

Older Adults ◽

Random Forest ◽

Decision Tree ◽

Prediction Models ◽

Receiver Operating Curve ◽

Learning Methods ◽

Life Study ◽

Fall Injury ◽

Machine Learning Methods

Abstract Advances in computational algorithms and the availability of large datasets with clinically relevant characteristics provide an opportunity to develop machine learning prediction models to aid in diagnosis, prognosis, and treatment of older adults. Some studies have employed machine learning methods for prediction modeling, but skepticism of these methods remains due to lack of reproducibility and difficulty understanding the complex algorithms behind models. We aim to provide an overview of two common machine learning methods: decision tree and random forest. We focus on these methods because they provide a high degree of interpretability. We discuss the underlying algorithms of decision tree and random forest methods and present a tutorial for developing prediction models for serious fall injury using data from the Lifestyle Interventions and Independence for Elders (LIFE) study. Decision tree is a machine learning method that produces a model resembling a flow chart. Random forest consists of a collection of many decision trees whose results are aggregated. In the tutorial example, we discuss evaluation metrics and interpretation for these models. Illustrated in data from the LIFE study, prediction models for serious fall injury were moderate at best (area under the receiver operating curve of 0.54 for decision tree and 0.66 for random forest). Machine learning methods may offer improved performance compared to traditional models for modeling outcomes in aging, but their use should be justified and output should be carefully described. Models should be assessed by clinical experts to ensure compatibility with clinical practice.

Download Full-text

Forest and Trees: Exploring Bacterial Virulence with Genome-wide Association Studies and Machine Learning

Trends in Microbiology ◽

10.1016/j.tim.2020.12.002 ◽

2021 ◽

Author(s):

Jonathan P. Allen ◽

Evan Snitkin ◽

Nathan B. Pincus ◽

Alan R. Hauser

Keyword(s):

Machine Learning ◽

Association Studies ◽

Bacterial Virulence ◽

Genome Wide Association ◽

Genome Wide Association Studies ◽

Genome Wide

Download Full-text

FRI0046 PHARMACOGENOMICS-DRIVEN INDIVIDUALIZED PREDICTION OF TREATMENT RESPONSE TO METHOTREXATE IN PATIENTS WITH RHEUMATOID ARTHRITIS: A MACHINE LEARNING APPROACH

Annals of the Rheumatic Diseases ◽

10.1136/annrheumdis-2020-eular.4993 ◽

2020 ◽

Vol 79 (Suppl 1) ◽

pp. 598.2-598

Author(s):

E. Myasoedova ◽

A. Athreya ◽

C. S. Crowson ◽

R. Weinshilboum ◽

L. Wang ◽

...

Keyword(s):

Rheumatoid Arthritis ◽

Machine Learning ◽

Supervised Machine Learning ◽

Research Support ◽

Eular Response ◽

Learning Methods ◽

Machine Learning Methods ◽

Early Ra ◽

Genome Wide

Background:Methotrexate (MTX) is the most common anchor drug for rheumatoid arthritis (RA), but the risk of missing the opportunity for early effective treatment with alternative medications is substantial given the delayed onset of MTX action and 30-40% inadequate response rate. There is a compelling need to accurately predicting MTX response prior to treatment initiation, which allows for effectively identifying patients at RA onset who are likely to respond to MTX.Objectives:To test the ability of machine learning approaches with clinical and genomic biomarkers to predict MTX response with replications in independent samples.Methods:Age, sex, clinical, serological and genome-wide association study (GWAS) data on patients with early RA of European ancestry from 647 patients (336 recruited in United Kingdom [UK]; 307 recruited across Europe; 70% female; 72% rheumatoid factor [RF] positive; mean age 54 years; mean baseline Disease Activity Score with 28-joint count [DAS28] 5.65) of the PhArmacogenetics of Methotrexate in RA (PAMERA) consortium was used in this study. The genomics data comprised 160 genome-wide significant single nucleotide polymorphisms (SNPs) with p<1×10-5 associated with risk of RA and MTX metabolism. DAS28 score was available at baseline and 3-month follow-up visit. Response to MTX monotherapy at the dose of ≥15 mg/week was defined as good or moderate by the EULAR response criteria at 3 months’ follow up visit. Supervised machine-learning methods were trained with 5-repeats and 10-fold cross-validation using data from PAMERA’s 336 UK patients. Class imbalance (higher % of MTX responders) in training was accounted by using simulated minority oversampling technique. Prediction performance was validated in PAMERA’s 307 European patients (not used in training).Results:Age, sex, RF positivity and baseline DAS28 data predicted MTX response with 58% accuracy of UK and European patients (p = 0.7). However, supervised machine-learning methods that combined demographics, RF positivity, baseline DAS28 and genomic SNPs predicted EULAR response at 3 months with area under the receiver operating curve (AUC) of 0.83 (p = 0.051) in UK patients, and achieved prediction accuracies (fraction of correctly predicted outcomes) of 76.2% (p = 0.054) in the European patients, with sensitivity of 72% and specificity of 77%. The addition of genomic data improved the predictive accuracies of MTX response by 19% and achieved cross-site replication. Baseline DAS28 scores and following SNPs rs12446816, rs13385025, rs113798271, and rs2372536 were among the top predictors of MTX response.Conclusion:Pharmacogenomic biomarkers combined with DAS28 scores predicted MTX response in patients with early RA more reliably than using demographics and DAS28 scores alone. Using pharmacogenomics biomarkers for identification of MTX responders at early stages of RA may help to guide effective RA treatment choices, including timely escalation of RA therapies. Further studies on personalized prediction of response to MTX and other anti-rheumatic treatments are warranted to optimize control of RA disease and improve outcomes in patients with RA.Disclosure of Interests:Elena Myasoedova: None declared, Arjun Athreya: None declared, Cynthia S. Crowson Grant/research support from: Pfizer research grant, Richard Weinshilboum Shareholder of: co-founder and stockholder in OneOme, Liewei Wang: None declared, Eric Matteson Grant/research support from: Pfizer, Consultant of: Boehringer Ingelheim, Gilead, TympoBio, Arena Pharmaceuticals, Speakers bureau: Simply Speaking

Download Full-text

Possibility of Autonomous Estimation of Shiba Goat’s Estrus and Non-Estrus Behavior by Machine Learning Methods

Animals ◽

10.3390/ani10050771 ◽

2020 ◽

Vol 10 (5) ◽

pp. 771

Author(s):

Toshiya Arakawa

Keyword(s):

Neural Network ◽

Machine Learning ◽

Random Forest ◽

Markov Models ◽

Tracking System ◽

Video Tracking ◽

Training Data ◽

Support Vector ◽

Learning Methods ◽

Machine Learning Methods

Mammalian behavior is typically monitored by observation. However, direct observation requires a substantial amount of effort and time, if the number of mammals to be observed is sufficiently large or if the observation is conducted for a prolonged period. In this study, machine learning methods as hidden Markov models (HMMs), random forests, support vector machines (SVMs), and neural networks, were applied to detect and estimate whether a goat is in estrus based on the goat’s behavior; thus, the adequacy of the method was verified. Goat’s tracking data was obtained using a video tracking system and used to estimate whether they, which are in “estrus” or “non-estrus”, were in either states: “approaching the male”, or “standing near the male”. Totally, the PC of random forest seems to be the highest. However, The percentage concordance (PC) value besides the goats whose data were used for training data sets is relatively low. It is suggested that random forest tend to over-fit to training data. Besides random forest, the PC of HMMs and SVMs is high. However, considering the calculation time and HMM’s advantage in that it is a time series model, HMM is better method. The PC of neural network is totally low, however, if the more goat’s data were acquired, neural network would be an adequate method for estimation.

Download Full-text

Machine learning approaches to the determinants of women’s vasomotor symptoms using general hospital data

10.21203/rs.3.rs-74144/v1 ◽

2020 ◽

Author(s):

Ki-Jin Ryu ◽

Kyong Wook Yi ◽

Yong Jin Kim ◽

Jung Ho Shin ◽

Jun Young Hur ◽

...

Keyword(s):

Machine Learning ◽

Insulin Resistance ◽

Random Forest ◽

Variable Importance ◽

Vasomotor Symptoms ◽

Model Assessment ◽

Cancer Antigen ◽

Learning Methods ◽

Machine Learning Methods ◽

Glutamyl Transferase

Abstract Background To analyze the determinants of women’s vasomotor symptoms (VMS) using machine learning. Methods Data came from Korea University Anam Hospital in Seoul, Korea, with 3298 women, aged 40–80 years, who attended their general health check from January 2010 to December 2012. Five machine learning methods were applied and compared for the prediction of VMS, measured by a Menopause Rating Scale. Variable importance, the effect of a variable on model performance, was used for identifying major determinants of VMS. Results In terms of the mean squared error, the random forest (0.9326) was much better than linear regression (12.4856) and artificial neural networks with one, two and three hidden layers (1.5576, 1.5184 and 1.5833, respectively). Based on variable importance from the random forest, the most important determinants of VMS were age, menopause age, thyroid stimulating hormone, monocyte and triglyceride, as well as gamma glutamyl transferase, blood urea nitrogen, cancer antigen 19 − 9, C-reactive protein and low-density-lipoprotein cholesterol. Indeed, the following determinants ranked within the top 20 in terms of variable importance: cancer antigen 125, total cholesterol, insulin, free thyroxine, forced vital capacity, alanine aminotransferase, forced expired volume in one second, height, homeostatic model assessment for insulin resistance and carcinoembryonic antigen. Conclusions Machine learning provides an invaluable decision support system for the prediction of VMS. For preventing VMS, preventive measures would be needed regarding the thyroid function, the lipid profile, the liver function, inflammation markers, insulin resistance, the monocyte, cancer antigens and the lung function.

Download Full-text