Machine Learning Using H2O R Package: An Application in Bioinformatics

mlr3proba: An R Package for Machine Learning in Survival Analysis

Bioinformatics ◽

10.1093/bioinformatics/btab039 ◽

2021 ◽

Author(s):

Raphael Sonabend ◽

Franz J Király ◽

Andreas Bender ◽

Bernd Bischl ◽

Michel Lang

Keyword(s):

Machine Learning ◽

Survival Analysis ◽

General Model ◽

R Package ◽

Survival Modeling ◽

Model Tuning

Abstract. Motivation: As machine learning has become increasingly popular over the last few decades, so too has the number of machine learning interfaces for implementing these models. Whilst many R libraries exist for machine learning, very few offer extended support for survival analysis. This is problematic considering its importance in fields like medicine, bioinformatics, economics, engineering, and more. mlr3proba provides a comprehensive machine learning interface for survival analysis and connects with mlr3’s general model tuning and benchmarking facilities to provide a systematic infrastructure for survival modeling and evaluation. Availability: mlr3proba is available under an LGPL-3 license on CRAN and at https://github.com/mlr-org/mlr3proba, with further documentation at https://mlr3book.mlr-org.com/survival.html.

treeheatr: an R package for interpretable decision tree visualizations

10.1101/2020.07.10.196352 ◽

2020 ◽

Author(s):

Trang T. Le ◽

Jason H. Moore

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Feature Space ◽

R Package ◽

Tree Structure ◽

Decision Tree Model ◽

Teaching Tool ◽

Tree Model ◽

Machine Learning Methods ◽

Link Type

Abstract. Summary: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree’s leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and the importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students’ understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. Availability: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.

chemmodlab: a cheminformatics modeling laboratory R package for fitting and assessing machine learning models

Journal of Cheminformatics ◽

10.1186/s13321-018-0309-4 ◽

2018 ◽

Vol 10 (1) ◽

Cited By ~ 1

Author(s):

Jeremy R. Ash ◽

Jacqueline M. Hughes-Oliver

Keyword(s):

Machine Learning ◽

R Package ◽

Learning Models ◽

Machine Learning Models

Spatio-Temporal Prediction of the Epidemic Spread of Dangerous Pathogens Using Machine Learning Methods

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9010044 ◽

2020 ◽

Vol 9 (1) ◽

pp. 44 ◽

Cited By ~ 4

Author(s):

Wolfgang B. Hamer ◽

Tim Birr ◽

Joseph-Alexander Verreet ◽

Rainer Duttmann ◽

Holger Klink

Keyword(s):

Machine Learning ◽

Powdery Mildew ◽

Regional Scale ◽

R Package ◽

Good Time ◽

Climate Information ◽

Learning Methods ◽

Temporal Prediction ◽

Machine Learning Methods ◽

Weather And Climate

Real-time identification of the occurrence of dangerous pathogens is of crucial importance for the rapid execution of countermeasures. For this purpose, spatial and temporal predictions of the spread of such pathogens are indispensable. The R package papros developed by the authors offers an environment in which both spatial and temporal predictions can be made, based on local data using various deterministic, geostatistical regionalisation, and machine learning methods. The approach is presented using the example of crop infection by fungal pathogens, which can substantially reduce the yield if not treated in good time. The task is complicated by the fact that the behaviour of wind-dispersed pathogens, such as powdery mildew (Blumeria graminis f. sp. tritici), is particularly difficult to predict. To forecast pathogen development and spatial dispersal, a modelling process scheme was developed using the aforementioned R package, which combines regionalisation and machine learning techniques. It enables the prediction of the probability of yield-relevant infestation events for an entire federal state in northern Germany at a daily time scale. To run the models, weather and climate information is required, as is knowledge of the pathogen biology. Once fitted to the pathogen, only weather and climate information is necessary to predict such events, with an overall accuracy of 68% in the case of powdery mildew at a regional scale. The models thereby predict 91% of the observed powdery mildew events.
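The abstract reports two different figures: overall accuracy (68%) and the share of observed events that were predicted (91%), which is commonly called sensitivity or recall. The following minimal Python sketch shows how the two are computed from a confusion matrix and why they can diverge. The function name and the counts are hypothetical and are not taken from the paper or from papros.

```python
def accuracy_and_sensitivity(tp, fp, tn, fn):
    """Return (overall accuracy, sensitivity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total      # share of all cases classified correctly
    sensitivity = tp / (tp + fn)      # share of observed events that were predicted
    return accuracy, sensitivity


# Hypothetical counts for illustration only (not the paper's data):
# events are rare, most are caught, but false alarms lower the accuracy.
acc, sens = accuracy_and_sensitivity(tp=9, fp=30, tn=60, fn=1)
print(f"accuracy={acc:.2f}, sensitivity={sens:.2f}")
```

With rare events, a model can catch nearly all of them while still raising many false alarms, so a high sensitivity and a moderate accuracy are compatible.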

Predictive and interpretable models via the stacked elastic net

Bioinformatics ◽

10.1093/bioinformatics/btaa535 ◽

2020 ◽

Author(s):

Armin Rauschenberger ◽

Enrico Glaab ◽

Mark van de Wiel

Keyword(s):

Machine Learning ◽

R Package ◽

Elastic Net ◽

Machine Learning Techniques ◽

Supplementary Information ◽

Biomedical Sciences ◽

Molecular Features ◽

Learning Techniques ◽

Meta Learning ◽

Interpretable Models

Abstract. Motivation: Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative, and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques. Results: Here we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularisation. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability. Availability and Implementation: The R package starnet is available on GitHub: https://github.com/rauschenberger/starnet. Supplementary information: Supplementary data are available at Bioinformatics online.
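For readers unfamiliar with the mixing parameter, the elastic net can be written as follows. This uses a common parametrisation for a Gaussian outcome (the one used by glmnet), which is an assumption here and may differ from the paper's notation. The second line is a sketch of how stacking linear base learners could keep a single coefficient vector. The weights \omega_k and the grid \alpha_1, ..., \alpha_K are illustrative symbols, not the authors' exact formulation.

```latex
% Elastic net for a Gaussian outcome: \alpha = 1 gives the lasso, \alpha = 0 gives ridge
\hat{\beta}(\alpha, \lambda)
  = \arg\min_{\beta}\;
    \frac{1}{2n} \lVert y - X\beta \rVert_2^2
    + \lambda \left( \alpha \lVert \beta \rVert_1
    + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right)

% Stacking over a grid \alpha_1, \dots, \alpha_K (illustrative):
% a weighted sum of linear predictors is again linear in the features
\hat{y}
  = \hat{\omega}_0 + \sum_{k=1}^{K} \hat{\omega}_k \, X \hat{\beta}(\alpha_k)
  = \hat{\omega}_0 + X \Bigl( \sum_{k=1}^{K} \hat{\omega}_k \, \hat{\beta}(\alpha_k) \Bigr)
```

Because the combined predictor is still linear in the features, the pooled coefficients retain the sign and size interpretation of ordinary regression coefficients.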

Optimal Stratification and Allocation for the June Agricultural Survey

Journal of Official Statistics ◽

10.1515/jos-2018-0007 ◽

2018 ◽

Vol 34 (1) ◽

pp. 121-148

Author(s):

Jonathan Lisic ◽

Hejian Sang ◽

Zhengyuan Zhu ◽

Stephanie Zimmer

Keyword(s):

Machine Learning ◽

Simulated Annealing ◽

Simulated Data ◽

R Package ◽

Computational Approach ◽

Machine Learning Method ◽

Learning Method ◽

Computationally Efficient ◽

Computational Speed ◽

Efficient Machine

Abstract. A computational approach to optimal multivariate designs with respect to stratification and allocation is investigated under the assumptions of fixed total allocation, known number of strata, and the availability of administrative data correlated with the variables of interest under coefficient-of-variation constraints. This approach uses a penalized objective function that is optimized by simulated annealing through exchanging sampling units and sample allocations among strata. Computational speed is improved through the use of a computationally efficient machine learning method such as K-means to create an initial stratification close to the optimal stratification. The numeric stability of the algorithm has been investigated and parallel processing has been employed where appropriate. Results are presented for both simulated data and USDA’s June Agricultural Survey. An R package has also been made available for evaluation.
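The general idea described in this abstract (a K-means starting stratification refined by simulated annealing with exchange moves) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm or their R package: it uses a single auxiliary variable, minimises the variance of the stratified mean instead of the paper's penalized objective, and treats degenerate strata as infeasible. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random
from statistics import variance


def kmeans_1d(x, k, iters=20):
    """Lloyd's algorithm on one variable; returns a stratum label per unit."""
    xs = sorted(x)
    # Start the centres at evenly spaced quantiles of the data.
    centres = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    labels = [0] * len(x)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in x]
        for j in range(k):
            members = [v for v, lab in zip(x, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels


def objective(x, labels, alloc):
    """Variance of the stratified mean (no finite population correction)."""
    total = 0.0
    for h, n_h in enumerate(alloc):
        members = [v for v, lab in zip(x, labels) if lab == h]
        if len(members) < 2 or n_h < 2:
            return math.inf  # simplification: degenerate strata are infeasible
        weight = len(members) / len(x)
        total += weight ** 2 * variance(members) / n_h
    return total


def anneal(x, labels, alloc, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing with two exchange moves; returns the best state seen."""
    rng = random.Random(seed)
    labels, alloc = list(labels), list(alloc)
    k = len(alloc)
    current = objective(x, labels, alloc)
    if not math.isfinite(current):
        raise ValueError("the starting solution must be feasible")
    best = (current, list(labels), list(alloc))
    temp = t0 * current
    for _ in range(steps):
        new_labels, new_alloc = list(labels), list(alloc)
        if rng.random() < 0.5:
            # Move 1: reassign one sampling unit to a different stratum.
            i = rng.randrange(len(x))
            new_labels[i] = rng.choice([h for h in range(k) if h != labels[i]])
        else:
            # Move 2: shift one unit of sample allocation between two strata,
            # which keeps the total allocation fixed.
            src, dst = rng.sample(range(k), 2)
            new_alloc[src] -= 1
            new_alloc[dst] += 1
        candidate = objective(x, new_labels, new_alloc)
        delta = candidate - current
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            labels, alloc, current = new_labels, new_alloc, candidate
            if current < best[0]:
                best = (current, list(labels), list(alloc))
        temp *= cooling
    return best


# Hypothetical skewed auxiliary variable for 30 units, 3 strata, 12 samples.
x = [i * i / 10 for i in range(1, 31)]
labels0 = kmeans_1d(x, k=3)
alloc0 = [4, 4, 4]
start = objective(x, labels0, alloc0)
best_value, best_labels, best_alloc = anneal(x, labels0, alloc0)
print(start, best_value, best_alloc)
```

Starting from the K-means labels rather than a random assignment means the annealer begins near a sensible stratification, which is the speed-up the abstract describes. The sketch returns the best state visited, so its objective value is never worse than the starting one.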

Lilikoi: an R package for personalized pathway-based classification modeling using metabolomics data

10.1101/283408 ◽

2018 ◽

Author(s):

Fadhl M. Al-Akwaa ◽

Sijia Huang ◽

Lana X. Garmire

Keyword(s):

Machine Learning ◽

R Package ◽

Classification Algorithms ◽

Feature Mapping ◽

Metabolomics Data ◽

Machine Learning Classification ◽

Disease Phenotypes ◽

Prediction Module ◽

Significant Pathway ◽

Pathway Deregulation

Abstract. Lilikoi (the Hawaiian word for passion fruit) is a new and comprehensive R package for personalized pathway-based classification modelling using metabolomics data. Four basic modules form the backbone of the package: 1) a feature mapping module, which standardizes the metabolite names provided by users and maps them to pathways; 2) a dimension transformation module, which transforms the metabolomic profiles to personalized pathway-based profiles using pathway deregulation scores (PDS); 3) a feature selection module, which helps to select the significant pathway features related to the disease phenotypes; and 4) a classification and prediction module, which offers various machine-learning classification algorithms. The package is freely available under the GPLv3 license through the GitHub repository at https://github.com/lanagarmire/lilikoi

Benchmarking machine learning models for the analysis of genetic data using FRESA.CAD Binary Classification Benchmarking

10.1101/733675 ◽

2019 ◽

Author(s):

Javier de Velasco Oriol ◽

Antonio Martinez-Torteya ◽

Victor Trevino ◽

Israel Alanis ◽

Edgar E. Vallejo ◽

...

Keyword(s):

Machine Learning ◽

Model Selection ◽

Binary Classification ◽

Genetic Data ◽

R Package ◽

Learning Models ◽

Classification Problems ◽

Machine Learning Methods ◽

Computational Perspective ◽

Machine Learning Models

Abstract. Background: Machine learning models have proven to be useful tools for the analysis of genetic data. However, with the availability of a wide variety of such methods, model selection has become increasingly difficult, both from the human and computational perspective. Results: We present the R package FRESA.CAD Binary Classification Benchmarking that performs systematic comparisons between a collection of representative machine learning methods for solving binary classification problems on genetic datasets. Conclusions: FRESA.CAD Binary Benchmarking proves to be a useful tool across a variety of binary classification problems involving the analysis of genetic data, showing both quantitative and qualitative advantages over similar packages.

How Threats Shape the Politics of Marginalized: Evidence from a Natural Experiment and Machine Learning

10.31235/osf.io/y65sd ◽

2020 ◽

Author(s):

Jae Yeon Kim ◽

Andrew Thompson

Keyword(s):

Machine Learning ◽

Information Seeking ◽

Natural Experiment ◽

Domestic Politics ◽

R Package ◽

Exogenous Shock ◽

Indian American ◽

Ethnic Newspapers ◽

September 11 Attacks ◽

Automated Text Classification

In this study, we used a natural experiment and machine learning to examine how threats prompt information seeking among marginalized populations. We traced how the September 11 attacks, an exogenous shock, increased the interest of Arab and Indian Americans in U.S. domestic politics. We classified 5,684 Arab American and Indian American newspaper articles using machine learning and estimated that three more articles on U.S. domestic politics were published daily in the post-9/11 period than in previous years. While the natural experiment design identifies the causal relationship between the intervention and the outcome variation, an automated text classification creates essential data for such a causal identification. This project also provides an accompanying R package that makes collecting data from the largest database of ethnic newspapers published in the U.S. easier and faster.

rSeqTU – a machine-learning based R package for prediction of bacterial transcription units

10.1101/553057 ◽

2019 ◽

Author(s):

Sheng-Yong Niu ◽

Binqiang Liu ◽

Qin Ma ◽

Wen-Chi Chou

Keyword(s):

Machine Learning ◽

Random Forest ◽

Regulatory Networks ◽

Prediction Models ◽

R Package ◽

Transcription Unit ◽

Support Vector ◽

Rna Seq ◽

Accurate Identification ◽

Prediction Approach

Abstract. A transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, e.g., gene intergenic distance, and transcriptomic features, including continuous and stable RNA-seq read count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need for an improved machine-learning prediction approach and a more comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and to provide more accurate predictions, we developed an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses a support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities: read quality control, read mapping, training set generation, random-forest-based feature selection, TU prediction, and TU visualization.