Machine Learning Using H2O R Package: An Application in Bioinformatics

Author(s):  
Azian Azamimi Abdullah ◽  
Shigehiko Kanaya

Author(s):  
Raphael Sonabend ◽  
Franz J Király ◽  
Andreas Bender ◽  
Bernd Bischl ◽  
Michel Lang

Abstract. Motivation: As machine learning has become increasingly popular over the last few decades, so too has the number of machine learning interfaces for implementing these models. Whilst many R libraries exist for machine learning, very few offer extended support for survival analysis. This is problematic considering its importance in fields like medicine, bioinformatics, economics, engineering, and more. mlr3proba provides a comprehensive machine learning interface for survival analysis and connects with mlr3's general model tuning and benchmarking facilities to provide a systematic infrastructure for survival modeling and evaluation. Availability: mlr3proba is available under an LGPL-3 license on CRAN and at https://github.com/mlr-org/mlr3proba, with further documentation at https://mlr3book.mlr-org.com/survival.html.



2020 ◽  
Author(s):  
Trang T. Le ◽  
Jason H. Moore

Abstract. Summary: treeheatr is an R package for creating interpretable decision tree visualizations with the data represented as a heatmap at the tree's leaf nodes. The integrated presentation of the tree structure along with an overview of the data efficiently illustrates how the tree nodes split up the feature space and how well the tree model performs. This visualization can also be examined in depth to uncover the correlation structure in the data and the importance of each feature in predicting the outcome. Implemented in an easily installed package with a detailed vignette, treeheatr can be a useful teaching tool to enhance students' understanding of a simple decision tree model before diving into more complex tree-based machine learning methods. Availability: The treeheatr package is freely available under the permissive MIT license at https://trang1618.github.io/treeheatr and https://cran.r-project.org/package=treeheatr. It comes with a detailed vignette that is automatically built with GitHub Actions continuous integration.





2020 ◽  
Vol 9 (1) ◽  
pp. 44 ◽  
Author(s):  
Wolfgang B. Hamer ◽  
Tim Birr ◽  
Joseph-Alexander Verreet ◽  
Rainer Duttmann ◽  
Holger Klink

Real-time identification of the occurrence of dangerous pathogens is of crucial importance for the rapid execution of countermeasures. For this purpose, spatial and temporal predictions of the spread of such pathogens are indispensable. The R package papros developed by the authors offers an environment in which both spatial and temporal predictions can be made, based on local data using various deterministic, geostatistical regionalisation, and machine learning methods. The approach is presented using the example of crop infection by fungal pathogens, which can substantially reduce the yield if not treated in good time. Prediction is particularly difficult for wind-dispersed pathogens, such as powdery mildew (Blumeria graminis f. sp. tritici). To forecast pathogen development and spatial dispersal, a modelling process scheme was developed using the aforementioned R package, which combines regionalisation and machine learning techniques. It enables the prediction of the probability of yield-relevant infestation events for an entire federal state in northern Germany at a daily time scale. To fit the models, weather and climate information are required, as is knowledge of the pathogen biology. Once the models are fitted to the pathogen, only weather and climate information are necessary to predict such events, with an overall accuracy of 68% in the case of powdery mildew at a regional scale. At this accuracy, 91% of the observed powdery mildew events are predicted.



Author(s):  
Armin Rauschenberger ◽  
Enrico Glaab ◽  
Mark van de Wiel

Abstract. Motivation: Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative, and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques. Results: Here we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularisation. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability. Availability and Implementation: The R package starnet is available on GitHub: https://github.com/rauschenberger/starnet. Supplementary information: Supplementary data are available at Bioinformatics online.
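
To make the role of the mixing parameter concrete, the following is a sketch of the standard elastic net objective and of why a linear combination of elastic net fits remains a single interpretable linear model. The Gaussian loss, the glmnet-style parameterisation, and the symbols \omega_k, \tilde\beta are assumptions made for illustration; the abstract does not state the exact form used in starnet.

```latex
% Elastic net fit for one mixing weight \alpha \in [0,1]
% (\alpha = 0: ridge, \alpha = 1: lasso); Gaussian loss assumed.
\hat\beta^{(\alpha)} = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^\top\beta\bigr)^2
  + \lambda\Bigl(\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr)

% Stacking K fits with weights \omega_k (hypothetical notation):
% the combined predictor is still linear in x_i.
\hat y_i = \omega_0 + \sum_{k=1}^{K}\omega_k\bigl(\hat\beta_0^{(\alpha_k)} + x_i^\top\hat\beta^{(\alpha_k)}\bigr)
         = \tilde\beta_0 + x_i^\top\tilde\beta,
\qquad
\tilde\beta = \sum_{k=1}^{K}\omega_k\,\hat\beta^{(\alpha_k)},
\quad
\tilde\beta_0 = \omega_0 + \sum_{k=1}^{K}\omega_k\,\hat\beta_0^{(\alpha_k)}
```

Under these assumptions, each feature keeps one pooled coefficient whose sign and size can be read as in ordinary regression.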



2018 ◽  
Vol 34 (1) ◽  
pp. 121-148
Author(s):  
Jonathan Lisic ◽  
Hejian Sang ◽  
Zhengyuan Zhu ◽  
Stephanie Zimmer

Abstract. A computational approach to optimal multivariate designs with respect to stratification and allocation is investigated under the assumptions of fixed total allocation, known number of strata, and the availability of administrative data correlated with the variables of interest under coefficient-of-variation constraints. This approach uses a penalized objective function that is optimized by simulated annealing through exchanging sampling units and sample allocations among strata. Computational speed is improved through the use of a computationally efficient machine learning method such as K-means to create an initial stratification close to the optimal stratification. The numeric stability of the algorithm has been investigated and parallel processing has been employed where appropriate. Results are presented for both simulated data and USDA's June Agricultural Survey. An R package has also been made available for evaluation.
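
The idea of a K-means start followed by simulated annealing can be sketched in a few lines. This is a simplified illustration, not the authors' algorithm or their R package: it assumes a single auxiliary variable, a fixed allocation per stratum, unit moves only (no allocation exchange), and a plain variance objective without the coefficient-of-variation penalty or a finite-population correction. All function names and the data are hypothetical.

```python
import math
import random
import statistics


def kmeans_1d(x, k, iters=50):
    """Lloyd's algorithm on one auxiliary variable; returns a stratum label per unit."""
    xs = sorted(x)
    # Spread the initial centers across the sorted values.
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    labels = [0] * len(x)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in x]
        for j in range(k):
            members = [v for v, lab in zip(x, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels


def objective(x, labels, alloc):
    """Variance of the stratified estimator of the total: sum of N_h^2 * S_h^2 / n_h."""
    total = 0.0
    for h, n_h in enumerate(alloc):
        members = [v for v, lab in zip(x, labels) if lab == h]
        if len(members) < max(2, n_h):
            return float("inf")  # infeasible: stratum smaller than its sample
        total += len(members) ** 2 * statistics.variance(members) / n_h
    return total


def anneal(x, labels, alloc, steps=2000, t0=0.05, seed=0):
    """Move one unit at a time to another stratum; accept worse moves with decaying probability."""
    rng = random.Random(seed)
    k = len(alloc)
    current = list(labels)
    cur_val = objective(x, current, alloc)
    best, best_val = list(current), cur_val
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling schedule
        i = rng.randrange(len(x))
        old = current[i]
        current[i] = rng.choice([h for h in range(k) if h != old])
        new_val = objective(x, current, alloc)
        delta = (new_val - cur_val) / max(cur_val, 1e-12)  # relative change
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_val = new_val
            if new_val < best_val:
                best, best_val = list(current), new_val
        else:
            current[i] = old  # reject the move
    return best, best_val


# Hypothetical skewed auxiliary variable for 60 sampling units, 3 strata, 5 samples each.
x = [float(i * i) for i in range(1, 61)]
alloc = [5, 5, 5]
start = kmeans_1d(x, k=3)
start_val = objective(x, start, alloc)
best, best_val = anneal(x, start, alloc)
print(f"k-means start: {start_val:.0f}, after annealing: {best_val:.0f}")
```

Because the annealer keeps the best stratification seen so far, the returned objective value is never worse than the K-means starting point.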



2018 ◽  
Author(s):  
Fadhl M. Al-Akwaa ◽  
Sijia Huang ◽  
Lana X. Garmire

Abstract. Lilikoi (the Hawaiian word for passion fruit) is a new and comprehensive R package for personalized pathway-based classification modelling using metabolomics data. Four basic modules form the backbone of the package: 1) a feature mapping module, which standardizes the metabolite names provided by users and maps them to pathways; 2) a dimension transformation module, which transforms the metabolomic profiles to personalized pathway-based profiles using pathway deregulation scores (PDS); 3) a feature selection module, which helps to select the significant pathway features related to the disease phenotypes; and 4) a classification and prediction module, which offers various machine-learning classification algorithms. The package is freely available under the GPLv3 license through the GitHub repository at: https://github.com/lanagarmire/lilikoi



2019 ◽  
Author(s):  
Javier de Velasco Oriol ◽  
Antonio Martinez-Torteya ◽  
Victor Trevino ◽  
Israel Alanis ◽  
Edgar E. Vallejo ◽  
...  

Abstract. Background: Machine learning models have proven to be useful tools for the analysis of genetic data. However, with the availability of a wide variety of such methods, model selection has become increasingly difficult, both from the human and computational perspective. Results: We present the R package FRESA.CAD Binary Classification Benchmarking, which performs systematic comparisons between a collection of representative machine learning methods for solving binary classification problems on genetic datasets. Conclusions: FRESA.CAD Binary Benchmarking proves to be a useful tool over a variety of binary classification problems comprising the analysis of genetic data, showing both quantitative and qualitative advantages over similar packages.



2020 ◽  
Author(s):  
Jae Yeon Kim ◽  
Andrew Thompson

In this study, we used a natural experiment and machine learning to examine how threats prompt information seeking among marginalized populations. We traced how the September 11 attacks, an exogenous shock, increased the interest of Arab and Indian Americans in U.S. domestic politics. We classified 5,684 Arab American and Indian American newspaper articles using machine learning and estimated that three more articles on U.S. domestic politics were published daily in the post-9/11 period than in previous years. While the natural experiment design identifies the causal relationship between the intervention and the outcome variation, an automated text classification creates essential data for such a causal identification. This project also provides an accompanying R package that makes collecting data from the largest database of ethnic newspapers published in the U.S. easier and faster.



2019 ◽  
Author(s):  
Sheng-Yong Niu ◽  
Binqiang Liu ◽  
Qin Ma ◽  
Wen-Chi Chou

Abstract. A transcription unit (TU) is composed of one or multiple adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features, e.g., gene intergenic distance, and transcriptomic features, including continuous and stable RNA-seq read count signals, have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers are able to predict TUs based on bacterial RNA-seq data and genome sequences, there is a need for an improved machine-learning prediction approach and a more comprehensive pipeline handling QC, TU prediction, and TU visualization. To enable users to efficiently perform TU identification on their local computers or high-performance clusters and to provide a more accurate prediction, we developed an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses a support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities: read quality control, read mapping, training set generation, random-forest-based feature selection, TU prediction, and TU visualization.


