Semi-supervised oblique predictive clustering trees

2021, Vol. 7, pp. e506
Author(s):  
Tomaž Stepišnik ◽  
Dragi Kocev

Semi-supervised learning combines supervised and unsupervised learning approaches to learn predictive models from both labeled and unlabeled data. It is most appropriate for problems where labeled examples are difficult to obtain but unlabeled examples are readily available (e.g., drug repurposing). Semi-supervised predictive clustering trees (SSL-PCTs) are a prominent method for semi-supervised learning that achieves good performance on various predictive modeling tasks, including structured output prediction tasks. The main issue, however, is that the learning time scales quadratically with the number of features. In contrast to axis-parallel trees, which use only individual features to split the data, oblique predictive clustering trees (SPYCTs) use linear combinations of features. This makes the splits more flexible and expressive and often leads to better predictive performance. With a carefully designed criterion function, we can use efficient optimization techniques to learn oblique splits. In this paper, we propose semi-supervised oblique predictive clustering trees (SSL-SPYCTs). We adjust the split learning to take unlabeled examples into account while remaining efficient. The main advantage over SSL-PCTs is that the proposed method scales linearly with the number of features. The experimental evaluation confirms the theoretical computational advantage and shows that SSL-SPYCTs often outperform SSL-PCTs and supervised PCTs in both single-tree and ensemble settings. We also show that SSL-SPYCTs are better at producing meaningful feature importance scores than supervised SPYCTs when the amount of labeled data is limited.
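The core idea of an oblique split, a test on a linear combination of features rather than a single feature, can be illustrated with a minimal sketch. This is not the SPYCT algorithm itself; it is a hypothetical toy that learns split weights by least squares on a problem no single axis-parallel test can separate:

```python
import numpy as np

# Toy data: two features, labels depending on a diagonal boundary that
# no single axis-parallel split can capture with one test.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # label depends on a feature combination

# Oblique split sketch: learn weights w via least squares so that w.x
# approximates the label, then threshold the projection at 0.5.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)     # linear combination of features
proj = Xb @ w

pred = (proj > 0.5).astype(float)
acc = (pred == y).mean()  # the single oblique test separates the classes
print(acc)
```

A real SPYCT optimizes a carefully designed criterion over labeled (and, in the SSL variant, unlabeled) examples; the point here is only that one weighted test replaces many axis-parallel ones.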

2021, pp. 109352662110016
Author(s):  
John Booth ◽  
Ben Margetts ◽  
Will Bryant ◽  
Richard Issitt ◽  
Ciaran Hutchinson ◽  
...  

Introduction: Sudden unexpected death in infancy (SUDI) represents the commonest presentation of postneonatal death. We explored whether machine learning could be used to derive data-driven insights for prediction of infant autopsy outcome. Methods: A paediatric autopsy database containing >7,000 cases, with >300 variables, was analysed by examination stage, and autopsy outcome was classified as 'explained (medical cause of death identified)' or 'unexplained'. Decision tree, random forest, and gradient boosting models were iteratively trained and evaluated. Results: Data from 3,100 infant and young child (<2 years) autopsies were included. A naïve decision tree using external examination data achieved 68% performance in predicting an explained death. Core data items were identified using model feature importance. The most effective model was XGBoost, with an overall predictive performance of 80%; age at death and cardiovascular and respiratory histological findings were the most important variables associated with determining a medical cause of death. Conclusion: This study demonstrates the feasibility of using machine learning to evaluate the component importance of complex medical procedures (paediatric autopsy) and highlights the value of collecting routine clinical data according to defined standards. This approach can be applied to a range of clinical and operational healthcare scenarios.
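The gradient-boosting-plus-feature-importance workflow described here can be sketched with scikit-learn. The data below are a synthetic stand-in (the real autopsy variables are not reproduced); columns 0 and 1 play the role of the informative variables, an assumption made purely for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 500 "cases" with 6 numeric variables; the
# binary outcome depends mainly on columns 0 and 1.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.8 * X[:, 1] + 0.3 * rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

accuracy = model.score(X_te, y_te)
# Impurity-based importances recover the informative columns, mirroring
# how core data items were identified in the study.
ranking = np.argsort(model.feature_importances_)[::-1]
print(accuracy, ranking[:2])
```

On real clinical data the same pattern applies: train, score on held-out cases, then inspect `feature_importances_` to find the variables driving the prediction.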


2021
Author(s):  
Hyeyoung Koh ◽  
Hannah Beth Blum

This study presents a machine learning-based approach for sensitivity analysis to examine how parameters affect a given structural response while accounting for uncertainty. Reliability-based sensitivity analysis involves repeated evaluations of the performance function incorporating uncertainties to estimate the influence of a model parameter, which can lead to prohibitive computational costs. This challenge is exacerbated for large-scale engineering problems, which often carry a large quantity of uncertain parameters. The proposed approach is based on feature selection algorithms that rank feature importance and remove redundant predictors during model development, which improves model generality and training performance by focusing only on the significant features. The approach allows performing sensitivity analysis of structural systems by providing feature rankings with reduced computational effort. The proposed approach is demonstrated with two designs of a two-bay, two-story planar steel frame with different failure modes: inelastic instability of a single member and progressive yielding. The feature variables in the data are uncertainties including material yield strength, Young's modulus, frame sway imperfection, and residual stress. The Monte Carlo sampling method is utilized to generate random realizations of the frames from published distributions of the feature parameters, and the response variable is the frame ultimate strength obtained from finite element analyses. Decision trees are trained to identify important features. Feature rankings are derived by four feature selection techniques: impurity-based importance, permutation importance, SHAP, and Spearman's correlation. Predictive performance of the model including the important features is discussed using the Matthews correlation coefficient, an evaluation metric suited to imbalanced datasets. Finally, the results are compared with those from reliability-based sensitivity analysis on the same example frames to show the validity of the feature selection approach. As the proposed machine learning-based approach produces the same results as the reliability-based sensitivity analysis with improved computational efficiency and accuracy, it could be extended to other structural systems.
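Three of the four ranking techniques named in the abstract can be sketched with standard libraries (SHAP is omitted here to avoid an extra dependency). The data are a synthetic stand-in, with feature 0 playing the role of the dominant uncertainty (e.g., yield strength); this is an assumed toy response, not the paper's frame model:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import matthews_corrcoef

# Hypothetical stand-in: four "uncertainty" features, binary response
# (e.g., ultimate strength below/above a limit), driven mostly by feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Three ranking techniques; all should agree on the dominant feature.
impurity_rank = int(np.argmax(tree.feature_importances_))
perm = permutation_importance(tree, X, y, random_state=0)
perm_rank = int(np.argmax(perm.importances_mean))
spearman_rank = int(np.argmax([abs(spearmanr(X[:, j], y)[0]) for j in range(4)]))

# MCC: the imbalanced-data metric used in the study.
mcc = matthews_corrcoef(y, tree.predict(X))
print(impurity_rank, perm_rank, spearman_rank, round(mcc, 2))
```

Agreement across independent ranking techniques is what lends the feature-selection-based sensitivity ranking its credibility.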


Author(s):  
Alex Zhavoronkov ◽  
Vladimir Aladinskiy ◽  
Alexander Zhebrak ◽  
Bogdan Zagribelnyy ◽  
Victor Terentiev ◽  
...  

The emergence of the 2019 novel coronavirus (2019-nCoV), for which there is no vaccine or any known effective treatment, created a sense of urgency for novel drug discovery approaches. One of the most important 2019-nCoV protein targets is the 3C-like protease, for which the crystal structure is known. Most of the immediate efforts are focused on repurposing known clinically approved drugs and on virtual screening of molecules available from chemical libraries, which may not work well. For example, the IC50 of lopinavir, an HIV protease inhibitor, against the 3C-like protease is approximately 50 micromolar. In an attempt to address this challenge, on January 28th, 2020, Insilico Medicine decided to utilize part of its generative chemistry pipeline to design novel drug-like inhibitors of 2019-nCoV and started generation on January 30th. It utilized three of its previously validated generative chemistry approaches: a crystal-derived pocket-based generator, homology modelling-based generation, and ligand-based generation. Novel drug-like compounds generated using these approaches are being published at www.insilico.com/ncov-sprint/ and will be continuously updated. Several molecules will be synthesized and tested using internal resources; however, the team is seeking collaborations to synthesize, test, and, if needed, optimize the published molecules.


2021
Author(s):  
Tuomo Hartonen ◽  
Teemu Kivioja ◽  
Jussi Taipale

Deep learning models have in recent years achieved success in various tasks related to understanding information encoded in the DNA sequence. Rapidly developing genome-wide measurement technologies provide large quantities of data ideally suited for modeling using deep learning or other powerful machine learning approaches. Although they offer state-of-the-art predictive performance, the predictions made by deep learning models can be difficult to understand. In virtually all biological research, understanding how a predictive model works is as important as its raw predictive performance. Thus, interpretation of deep learning models is an emerging hot topic, especially in the context of biological research. Here we describe plotMI, a mutual-information-based model interpretation strategy that can intuitively visualize positional preferences and pairwise interactions learned by any machine learning model trained on sequence data with a defined alphabet as input. PlotMI is freely available at https://github.com/hartonen/plotMI.
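The underlying quantity, mutual information between pairs of sequence positions, can be computed directly from empirical frequencies. This is a minimal hypothetical sketch, not the plotMI implementation; `positional_mi` and the toy sequences are assumptions for illustration:

```python
import numpy as np
from collections import Counter

def positional_mi(seqs, i, j):
    """Mutual information (bits) between positions i and j of a sequence set."""
    n = len(seqs)
    pi = Counter(s[i] for s in seqs)           # marginal at position i
    pj = Counter(s[j] for s in seqs)           # marginal at position j
    pij = Counter((s[i], s[j]) for s in seqs)  # joint distribution
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Positions 0 and 1 are perfectly coupled; position 2 is independent.
seqs = ["ACA", "ACG", "GTA", "GTG"] * 25
mi_coupled = positional_mi(seqs, 0, 1)
mi_indep = positional_mi(seqs, 0, 2)
print(round(mi_coupled, 3), round(mi_indep, 3))  # 1 bit vs 0 bits
```

PlotMI applies this idea to sequences filtered by a trained model's predictions, so that high pairwise MI reveals interactions the model has learned.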


2020
Author(s):  
Ada Admin ◽  
Jialing Huang ◽  
Cornelia Huth ◽  
Marcela Covic ◽  
Martina Troll ◽  
...  

Early and precise identification of individuals with pre-diabetes and type 2 diabetes (T2D) at risk of progressing to chronic kidney disease (CKD) is essential to prevent complications of diabetes. Here, we identify and evaluate prospective metabolite biomarkers and the best set of predictors of CKD in the longitudinal, population-based Cooperative Health Research in the Region of Augsburg (KORA) cohort by targeted metabolomics and machine learning approaches. Out of 125 targeted metabolites, sphingomyelin (SM) C18:1 and phosphatidylcholine diacyl (PC aa) C38:0 were identified as candidate metabolite biomarkers of incident CKD specifically in hyperglycemic individuals followed for 6.5 years. Sets of predictors for incident CKD developed from 125 metabolites and 14 clinical variables showed highly stable performance in all three machine learning approaches and outperformed the currently established clinical algorithm for CKD. The two metabolites in combination with five clinical variables were identified as the best set of predictors, and their predictive performance yielded a mean area under the receiver operating characteristic curve of 0.857. The inclusion of metabolite variables in the clinical prediction of future CKD may thus improve risk prediction in persons with pre-diabetes and T2D. The metabolite link with hyperglycemia-related early kidney dysfunction warrants further investigation.
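Evaluating a candidate predictor set by cross-validated area under the ROC curve, as done for the two-metabolite-plus-five-clinical-variable set, can be sketched as follows. The data are random stand-ins (no KORA variables are reproduced), and the simple logistic model is an assumption for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Hypothetical stand-in: 2 "metabolite" + 5 "clinical" variables and a
# binary incident-CKD-style outcome driven by two of the columns.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 7))
y = (0.9 * X[:, 0] + 0.7 * X[:, 2] + rng.normal(size=400) > 0).astype(int)

# Cross-validated probabilities keep the AUC estimate honest: each case
# is scored by a model that never saw it during training.
probs = cross_val_predict(LogisticRegression(), X, y, cv=5,
                          method="predict_proba")[:, 1]
auc = roc_auc_score(y, probs)
print(round(auc, 3))
```

Comparing this AUC against the same pipeline run on clinical variables alone is the standard way to quantify the incremental value of the metabolite markers.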


2021
Author(s):  
Chin Kuan Ho ◽  
Seng Huat Ong ◽  
Kamarul Imran Musa ◽  
Choo Yee Ting ◽  
Chiung Ching Ho ◽  
...  

The latest threat to global health is the ongoing outbreak of the Coronavirus Disease 2019 (COVID-19). There are three main areas of modeling research, namely epidemiology, drug repurposing, and vaccine design. The most important purpose of the models is to inform institutional and nationwide efforts to ensure patient safety. This study aimed to review COVID-19 modeling and prediction tools. Understanding these methods clarifies the strengths and limitations of each, and this understanding is the key to the proper use of specific models to achieve certain goals. We reviewed both traditional models and the more recent models that have flourished during the pandemic. Modeling approaches for COVID-19 can be very broadly categorized into phenomenological models and mechanistic models. Phenomenological approaches treat the modeling problem purely from an empirical perspective. From our survey, there are three major types of approaches under the phenomenological models: time-series analysis and forecasting, fractal-based models, and machine learning approaches. Mechanistic models consider the underlying mechanics of the epidemic. In this survey, compartmental models and agent-based models are categorized as mechanistic models. We studied 46 scientific articles (published between 22 February 2020 and 29 January 2021) that we think are representative of the scientific community's approaches to modeling and prediction. We highlight the challenges and limitations of modeling approaches, such as the need for high-quality data and interpretable models. Finally, we list the desired features for developing robust and reliable modeling and prediction tools.
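The simplest mechanistic compartmental model mentioned in surveys like this one is the SIR model, in which susceptible (S), infectious (I), and recovered (R) fractions evolve under a transmission rate beta and recovery rate gamma. A minimal forward-Euler sketch, with purely illustrative parameter values:

```python
def sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR compartmental model with forward Euler steps."""
    s, i, r = s0, i0, r0
    for _ in range(int(days / dt)):
        n = s + i + r
        new_inf = beta * s * i / n * dt  # S -> I transitions this step
        new_rec = gamma * i * dt         # I -> R transitions this step
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
    return s, i, r

# Illustrative run: basic reproduction number beta/gamma = 3.
s, i, r = sir(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, r0=0.0, days=200)
total = s + i + r
# Population fraction is conserved and the epidemic recedes by day 200.
print(round(total, 6), i < 0.01)
```

Real compartmental models used for COVID-19 add compartments (exposed, hospitalized, etc.) and fit beta and gamma to the high-quality data the review calls for; the conservation check above is the basic sanity test for any such model.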

