Machine-Learning Models for Combinatorial Catalyst Discovery

2003 ◽  
Vol 804 ◽  
Author(s):  
Gregory A. Landrum ◽  
Julie Penzotti ◽  
Santosh Putta

Abstract Standard machine-learning algorithms were used to build models capable of predicting the molecular weights of polymers generated by a homogeneous catalyst. Using descriptors calculated from only the two-dimensional structures of the ligands, the average accuracy of the models on an external validation data set was approximately 70%. Because the models show no bias and perform significantly better than equivalent models built using randomized data, we conclude that they learned useful rules and did not overfit the data.
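
A rough illustration of the workflow this abstract describes (2D ligand descriptors feeding a standard classifier, checked against a randomized-label control): the sketch below assumes RDKit and scikit-learn, and the SMILES strings, labels and descriptor set are placeholders rather than the paper's data or methods.

```python
# Minimal sketch: 2D descriptors -> classifier -> y-randomization control.
# Everything below (ligands, labels, descriptor choice, model) is illustrative.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

ligands = ["CCN(CC)CC", "c1ccccc1N", "CC(C)P(C(C)C)C(C)C",
           "c1ccncc1", "C1CCNCC1", "CC(C)(C)P(C(C)(C)C)C(C)(C)C"]  # placeholder SMILES
mw_class = np.array([0, 0, 1, 0, 1, 1])  # placeholder high/low polymer-MW labels

def descriptors_2d(smiles):
    """Compute a few 2D descriptors from connectivity alone."""
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)]

X = np.array([descriptors_2d(s) for s in ligands])
model = RandomForestClassifier(n_estimators=200, random_state=0)
real_score = cross_val_score(model, X, mw_class, cv=2).mean()

# Control: retrain on shuffled labels. A model that learned genuine rules
# should beat this randomized baseline by a clear margin.
rng = np.random.default_rng(0)
null_score = cross_val_score(model, X, rng.permutation(mw_class), cv=2).mean()
print(f"real: {real_score:.2f}  randomized: {null_score:.2f}")
```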

2020 ◽  
Vol 21 (4) ◽  
pp. 1119-1135 ◽  
Author(s):  
Shutao Mei ◽  
Fuyi Li ◽  
André Leier ◽  
Tatiana T Marquez-Lago ◽  
Kailin Giam ◽  
...  

Abstract Human leukocyte antigen class I (HLA-I) molecules are encoded by major histocompatibility complex (MHC) class I loci in humans. The binding and interaction between HLA-I molecules and intracellular peptides derived from a variety of proteolytic mechanisms play a crucial role in subsequent T-cell recognition of target cells and the specificity of the immune response. In this context, tools that predict the likelihood for a peptide to bind to specific HLA class I allotypes are important for selecting the most promising antigenic targets for immunotherapy. In this article, we comprehensively review a variety of currently available tools for predicting the binding of peptides to a selection of HLA-I allomorphs. Specifically, we compare their calculation methods for the prediction score, employed algorithms, evaluation strategies and software functionalities. In addition, we have evaluated the prediction performance of the reviewed tools based on an independent validation data set, containing 21,101 experimentally verified ligands across 19 HLA-I allotypes. The benchmarking results show that MixMHCpred 2.0.1 achieves the best performance for predicting peptides binding to most of the HLA-I allomorphs studied, while NetMHCpan 4.0 and NetMHCcons 1.1 outperform the other machine learning-based and consensus-based tools, respectively. Importantly, it should be noted that a peptide predicted with a higher binding score for a specific HLA allotype does not necessarily imply it will be immunogenic. That said, peptide-binding predictors are still very useful in that they can help to significantly reduce the large number of epitope candidates that need to be experimentally verified. Several other factors, including susceptibility to proteasome cleavage, peptide transport into the endoplasmic reticulum and T-cell receptor repertoire, also contribute to the immunogenicity of peptide antigens, and some of these factors are already considered by certain predictors. Therefore, integrating features derived from these additional factors together with HLA-binding properties by using machine-learning algorithms may increase the prediction accuracy of immunogenic peptides. As such, we anticipate that this review and benchmarking survey will assist researchers in selecting appropriate prediction tools that best suit their purposes and provide useful guidelines for the development of improved antigen predictors in the future.
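
For readers reproducing this kind of benchmark, the harness below shows one way to compute a per-allotype ROC AUC once each tool's scores for the shared peptide set have been exported to a table; the CSV layout and column names are assumptions, not any tool's actual output format.

```python
# Per-allotype benchmarking of HLA-I binding predictors from exported scores.
# Assumed columns: peptide, allotype, is_ligand (0/1), one score column per tool.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("benchmark_scores.csv")  # hypothetical export, one row per peptide-allotype pair
tools = [c for c in df.columns if c not in ("peptide", "allotype", "is_ligand")]

for allotype, grp in df.groupby("allotype"):
    aucs = {t: roc_auc_score(grp["is_ligand"], grp[t]) for t in tools}
    best = max(aucs, key=aucs.get)
    print(allotype, {t: round(a, 3) for t, a in aucs.items()}, "-> best:", best)
```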


Author(s):  
Jahnavi Yeturu ◽  
Poongothai Elango ◽  
S. P. Raja ◽  
P. Nagendra Kumar

Genetics is the clinical study of heritable mutations, and its principal value lies in exploring, analyzing, interpreting and describing the inherited genetic contribution to diseases such as cancer, diabetes and heart disease. Cancer is among the most complex of these conditions, and the number of cancer patients is growing rapidly. Distinguishing the mutations that contribute to tumor growth from neutral mutations is difficult, because most cancerous tumors harbor large numbers of genetic mutations. Mutations are currently sorted and categorized to classify cancers through medical observation and clinical studies. At present, genetic mutations are annotated either manually or with rudimentary existing algorithms, and the evaluation and classification of each individual mutation relies on evidence documented in the medical literature. Consequently, classifying genetic mutations on the basis of clinical evidence remains a challenging task. In this work, a one-hot encoding technique is used to derive features from genes and their variations, and TF-IDF is used to extract features from the clinical text data. To improve classification accuracy, machine learning algorithms such as support vector machine, logistic regression and Naive Bayes are evaluated, and a stacking model classifier is developed to increase accuracy further. The proposed stacking model classifier obtained log losses of 0.8436 and 0.8572 on the cross-validation and test data sets, respectively. The experiments show that the proposed stacking model classifier outperforms the existing algorithms in terms of log loss; since a lower log loss indicates a more efficient model, reducing the log loss below 1 demonstrates the effectiveness of the approach. The performance of these algorithms is gauged using measures such as multi-class log loss.
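
A condensed sketch of the pipeline the abstract outlines (one-hot gene/variation features, TF-IDF text features, a stacking classifier scored by multi-class log loss); the file name, column names and choice of base learners are assumptions for illustration.

```python
# Stacked classifier over one-hot gene features plus TF-IDF text features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

df = pd.read_csv("variants.csv")  # hypothetical: gene, variation, text, label columns

features = ColumnTransformer([
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["gene", "variation"]),
    ("text", TfidfVectorizer(max_features=5000), "text"),
])
stack = StackingClassifier(
    estimators=[("nb", MultinomialNB()), ("svm", LinearSVC())],
    final_estimator=LogisticRegression(max_iter=1000),
)
model = Pipeline([("features", features), ("stack", stack)])

X_tr, X_te, y_tr, y_te = train_test_split(
    df[["gene", "variation", "text"]], df["label"], random_state=0)
model.fit(X_tr, y_tr)
print("multi-class log loss:", log_loss(y_te, model.predict_proba(X_te)))
```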


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 377-378
Author(s):  
Ghader Manafiazar ◽  
Mohammad Riazi ◽  
John A Basarab ◽  
Changxi Li ◽  
Paul Stothard ◽  
...  

Abstract The objective of this study was to explore the potential of machine learning (ML) algorithms to predict residual feed intake (RFI) classification group (high or low RFI) and individual RFI using performance records and genomic information. A total of 4,145 animals from research and commercial herds with RFI performance records were included in the study, of which 3,899 cattle had genomic information (genotyped using the Illumina Bovine 50K SNP BeadChip). R- and Python-based libraries, including Lazy Predict, Scikit-learn, PyCaret, and H2O Flow, were used to test various ML models. Genomic information was subjected to quality control by removing SNPs with an allele frequency below 0.05 or a call rate below 0.95. A total of 42,689 SNPs remained for further analysis and accounted for 34% of phenotypic variation (heritability of 0.34±0.07) in RFI. Different numbers of SNPs were selected based on their contribution to phenotypic variation (500, 1K, 5K, 10K, and 15K SNPs) and included in the ML models. The GLM Stacked Ensemble model with 15K SNPs performed better than the other models at predicting RFI classification group (R2 = 0.54). Regardless of the number of SNPs included in the model, GLM Stacked Ensemble also performed better than other models at predicting individual RFI, and its performance improved with increasing SNPs (MAE = 0.39 for 500 SNPs; 0.31 for 15K SNPs). In the test data set, increasing the number of SNPs did not change the performance of the model, which maintained an MAE of 0.39. The results demonstrate the potential for ML to improve predictions of feed efficiency in beef cattle, compared with genomic analysis alone, without measuring feed intake.
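
The sketch below mirrors the SNP-subset experiment in scikit-learn (one of the libraries the authors list): rank SNPs, keep the top k, and fit a stacked ensemble with a GLM-style metalearner. The data files, ranking statistic and base learners are stand-ins, not the study's H2O GLM Stacked Ensemble configuration.

```python
# Select top-k SNPs, then fit a stacked ensemble and report MAE per subset size.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X = np.load("genotypes.npy")  # hypothetical: animals x 42,689 SNPs coded 0/1/2
y = np.load("rfi.npy")        # hypothetical individual RFI phenotypes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for k in (500, 1000, 5000, 10000, 15000):
    model = Pipeline([
        ("select", SelectKBest(f_regression, k=k)),  # proxy for ranking SNPs by contribution
        ("stack", StackingRegressor(
            estimators=[("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
                        ("ridge", RidgeCV())],
            final_estimator=RidgeCV())),             # GLM-style metalearner
    ])
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{k} SNPs: test MAE = {mae:.3f}")
```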


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0242028
Author(s):  
Hiroaki Haga ◽  
Hidenori Sato ◽  
Ayumi Koseki ◽  
Takafumi Saito ◽  
Kazuo Okumoto ◽  
...  

In recent years, the development of diagnostics using artificial intelligence (AI) has been remarkable. AI algorithms can go beyond human reasoning and build diagnostic models from a number of complex combinations. Using next-generation sequencing technology, we identified hepatitis C virus (HCV) variants resistant to direct-acting antivirals (DAAs) by whole-genome sequencing of full-length HCV genomes, and applied these variants to various machine-learning algorithms to evaluate a preliminary predictive model. HCV genomic RNA was extracted from the serum of 173 patients (109 with subsequent sustained virological response [SVR] and 64 without) before DAA treatment. HCV genomes from the 109 SVR and 64 non-SVR patients were randomly divided into a training data set (57 SVR and 29 non-SVR) and a validation data set (52 SVR and 35 non-SVR). The training data set was subjected to nine machine-learning algorithms selected to identify the optimized combination of functional variants in relation to SVR status following DAA therapy. Subsequently, the prediction model was tested on the validation data set. The most accurate learning method was the support vector machine (SVM) algorithm (validation accuracy, 0.95; kappa statistic, 0.90; F-value, 0.94). The second most accurate learning algorithm was the multi-layer perceptron. The decision tree and Naive Bayes algorithms could not be fitted to our data set owing to low accuracy (<0.8). In conclusion, with an accuracy rate of 95.4% in the generalization performance evaluation, SVM was identified as the best algorithm. Analytical methods based on genomic analysis and the construction of a predictive model by machine learning may be applicable to the selection of the optimal treatment for other viral infections and cancer.
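
A minimal sketch of the winning SVM step, assuming the resistance-associated variants have already been encoded as a binary presence/absence matrix per genome; the file names and kernel settings are placeholders, not the authors' pipeline.

```python
# Train an SVM on variant presence/absence and report the paper's three metrics.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

X_train = np.load("variants_train.npy")  # hypothetical: genomes x variants (0/1)
y_train = np.load("svr_train.npy")       # 1 = SVR, 0 = non-SVR
X_val = np.load("variants_val.npy")
y_val = np.load("svr_val.npy")

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
pred = svm.predict(X_val)
print("validation accuracy:", round(accuracy_score(y_val, pred), 2))
print("kappa statistic:    ", round(cohen_kappa_score(y_val, pred), 2))
print("F-value:            ", round(f1_score(y_val, pred), 2))
```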


Stroke ◽  
2020 ◽  
Vol 51 (Suppl_1) ◽  
Author(s):  
Pakinam Aboutaleb ◽  
Arko Barman ◽  
Victor Lopez-Rivera ◽  
Songmi Lee ◽  
Farhaan Vahidy ◽  
...  

Introduction: Automated neuroimaging analysis is being used increasingly in acute ischemic stroke (AIS) evaluation. However, current algorithms do not factor an assessment of intracranial hemorrhage (ICH) into the workflow. In this study we present a machine learning (ML) algorithm that uses brain symmetry information to detect ICH. Methods: We prospectively collected non-contrast CT (NCCT) images of patients who presented to the Emergency Department for AIS evaluation between 2017 and 2019. Patients were included if they underwent technically adequate NCCT imaging. Diagnoses of ICH, AIS and non-stroke were confirmed by experienced neuroradiologists as well as review of the clinical record. An ML algorithm that integrates symmetry features as well as standard whole-brain features was trained on 80% of the sample and validated on the remaining images. Training was performed without any prior segmentation. Model performance was evaluated using receiver operating characteristic curve and area under the curve (AUC) analysis. Results are given as median [IQR] and [AUC 95% CI]. Results: Among the 568 patients who met inclusion criteria, median age was 65 [55-76], 47% were female and 34% were white. 128 (23%) patients were determined to have ICH and 440 to be non-ICH (70% AIS and 30% non-stroke). Among ICH patients, 108 (84%) had a supratentorial ICH. The regions of the CT images that contributed most strongly to the algorithm's diagnostic decisions corresponded with the regions of ICH (Fig. 1A). On the external validation data set, the algorithm successfully detected ICH (Fig. 1B) with high accuracy (AUC 0.99 [0.97-1.00]). Conclusion: We have developed a symmetry-sensitive ML method that can identify ICH in an automated fashion with very high fidelity. Without prior training on ICH location, the algorithm was able to learn it autonomously. These results may help contribute to an automated imaging workflow for all stroke evaluations, not just AIS.
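
To make the symmetry idea concrete, the sketch below computes crude left-right asymmetry statistics from an NCCT volume and feeds them to a classifier; midline estimation, registration and the actual feature set are omitted, so this is an illustration of the concept rather than the authors' algorithm.

```python
# Toy brain-symmetry features: compare each slice with its mirror image.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def symmetry_features(volume):
    """volume: (slices, height, width) CT array in Hounsfield-like units."""
    mirrored = volume[:, :, ::-1]                  # left-right flip
    diff = np.abs(volume.astype(float) - mirrored)
    return np.array([diff.mean(), diff.std(), diff.max(),
                     (diff > 40).mean()])          # fraction of strong asymmetry

# Hypothetical stand-in data: random volumes and ICH labels.
rng = np.random.default_rng(0)
volumes = [rng.integers(0, 80, (32, 128, 128)) for _ in range(40)]
labels = rng.integers(0, 2, 40)

X = np.stack([symmetry_features(v) for v in volumes])
clf = LogisticRegression().fit(X, labels)
print("in-sample AUC:", round(roc_auc_score(labels, clf.predict_proba(X)[:, 1]), 2))
```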


2020 ◽  
Vol 14 (2) ◽  
pp. 140-159
Author(s):  
Anthony-Paul Cooper ◽  
Emmanuel Awuni Kolog ◽  
Erkki Sutinen

This article builds on previous research exploring the content of church-related tweets. It does so by examining whether the qualitative thematic coding of such tweets can, in part, be automated through machine learning. It compares three supervised machine learning algorithms to understand how well each performs at a classification task based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values that each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource-intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
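
A compact sketch of the best-performing setup the article reports (Naïve Bayes over human-coded tweets, scored by precision, recall and F-measure); the CSV layout and vectorizer settings are assumptions.

```python
# TF-IDF features plus multinomial Naive Bayes for thematic tweet classification.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("coded_tweets.csv")  # hypothetical: text, theme columns
X_tr, X_te, y_tr, y_te = train_test_split(df["text"], df["theme"], random_state=0)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(X_tr, y_tr)
# Per-theme precision, recall and F-measure, mirroring the article's metrics.
print(classification_report(y_te, model.predict(X_te)))
```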


BMJ Open ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. e040778
Author(s):  
Vineet Kumar Kamal ◽  
Ravindra Mohan Pandey ◽  
Deepak Agrawal

Objective To develop and validate a simple risk score chart to estimate the probability of poor outcomes in patients with severe head injury (HI). Design Retrospective. Setting Level-1, government-funded trauma centre, India. Participants Patients with severe HI admitted to the neurosurgery intensive care unit during 19 May 2010–31 December 2011 (n=946) for model development and, for external validation of the model, data from the same centre with the same inclusion criteria from 1 January 2012 to 31 July 2012 (n=284). Outcome(s) In-hospital mortality and unfavourable outcome at 6 months. Results A total of 39.5% and 70.7% of patients had in-hospital mortality and unfavourable outcome, respectively, in the development data set. Multivariable logistic regression analysis of routinely collected admission characteristics revealed that, for in-hospital mortality, age (51–60, >60 years), motor score (1, 2, 4), pupillary reactivity (none), presence of hypotension, effaced basal cisterns and traumatic subarachnoid haemorrhage/intraventricular haematoma were independent predictors, and, for unfavourable outcome, age (41–50, 51–60, >60 years), motor score (1–4), pupillary reactivity (none, one), unequal limb movement and presence of hypotension were independent predictors, as the 95% confidence intervals (CIs) of their odds ratios (ORs) did not contain one. The discriminative ability (area under the receiver operating characteristic curve (95% CI)) of the score chart for in-hospital mortality and 6-month outcome was excellent in the development data set (0.890 (0.867 to 0.912) and 0.894 (0.869 to 0.918), respectively), the internal validation data set using the bootstrap resampling method (0.889 (0.867 to 0.909) and 0.893 (0.867 to 0.915), respectively) and the external validation data set (0.871 (0.825 to 0.916) and 0.887 (0.842 to 0.932), respectively). Calibration showed good agreement between observed outcome rates and predicted risks in the development and external validation data sets (p>0.05). Conclusion For clinical decision making, these score charts can be used to predict outcomes in new patients with severe HI in India and similar settings.
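
The general technique behind such charts is to convert the coefficients of a fitted logistic model into integer points and read off the predicted risk from the point total. The sketch below uses invented coefficients, not the published model's values.

```python
# Turn logistic-regression coefficients (log odds) into a points-based score chart.
import numpy as np

coefficients = {"age>60": 1.2, "motor_score_1": 1.8,      # hypothetical values
                "pupils_nonreactive": 1.5, "hypotension": 0.9}
intercept = -2.5

POINT = 0.5  # log-odds units per point; a common scaling choice
chart = {name: round(beta / POINT) for name, beta in coefficients.items()}
print("score chart:", chart)

def predicted_risk(present):
    """present: iterable of predictor names observed in the patient."""
    lp = intercept + sum(coefficients[p] for p in present)
    return 1.0 / (1.0 + np.exp(-lp))

print("risk for age>60 + hypotension:", round(predicted_risk(["age>60", "hypotension"]), 3))
```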


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Alan Brnabic ◽  
Lisa M. Hess

Abstract Background Machine learning is a broad term encompassing a number of methods that allow the investigator to learn from the data. These methods may permit large real-world databases to be more rapidly translated into applications that inform patient-provider decision making. Methods This systematic literature review was conducted to identify published observational research that employed machine learning to inform decision making at the patient-provider level. The search strategy was implemented and studies meeting eligibility criteria were evaluated by two independent reviewers. Relevant data related to study design, statistical methods, and strengths and limitations were identified; study quality was assessed using a modified version of the Luo checklist. Results A total of 34 publications from January 2014 to September 2020 were identified and evaluated for this review. Diverse methods, statistical packages and approaches were used across the identified studies. The most common methods included decision tree and random forest approaches. Most studies applied internal validation, but only two conducted external validation. Most studies utilized one algorithm, and only eight studies applied multiple machine learning algorithms to the data. Seven items on the Luo checklist were not met by more than 50% of published studies. Conclusions A wide variety of approaches, algorithms, statistical software, and validation strategies were employed in the application of machine learning methods to inform patient-provider decision making. There is a need to ensure that multiple machine learning approaches are used, that the model selection strategy is clearly defined, and that both internal and external validation are performed, so that decisions for patient care are made with the highest quality evidence. Future work should routinely employ ensemble methods incorporating multiple machine learning algorithms.
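
As a sketch of the validation pattern the review calls for, the snippet below runs internal validation (cross-validation on the development cohort) for several algorithms and then a single evaluation on a held-out external cohort; the data files are placeholders.

```python
# Internal (cross-validated) plus external validation for multiple algorithms.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

X_dev, y_dev = np.load("dev_X.npy"), np.load("dev_y.npy")            # development cohort
X_ext, y_ext = np.load("external_X.npy"), np.load("external_y.npy")  # external cohort

for name, model in [("decision tree", DecisionTreeClassifier(max_depth=4)),
                    ("random forest", RandomForestClassifier(n_estimators=200))]:
    internal = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc").mean()
    model.fit(X_dev, y_dev)
    external = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    print(f"{name}: internal AUC {internal:.3f}, external AUC {external:.3f}")
```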


Sensors ◽  
2021 ◽  
Vol 21 (13) ◽  
pp. 4618
Author(s):  
Francisco Oliveira ◽  
Miguel Luís ◽  
Susana Sargento

Unmanned Aerial Vehicle (UAV) networks are an emerging technology, useful not only for the military but also for public and civil purposes. Their versatility provides advantages in situations where an existing network cannot support all the requirements of its users, either because of an exceptionally large number of users or because of the failure of one or more ground base stations. Networks of UAVs can reinforce these cellular networks where needed, redirecting traffic to available ground stations. Using machine learning algorithms to predict overloaded traffic areas, we propose a UAV positioning algorithm responsible for determining suitable positions for the UAVs, with the objective of a more balanced redistribution of traffic, to avoid saturated base stations and decrease the number of users without a connection. Tests performed with real data of user connections through base stations show that, in less restrictive network conditions, the algorithm that dynamically places the UAVs performs significantly better than in more restrictive conditions, significantly reducing the number of users without a connection. We also conclude that the accuracy of the prediction is a very important factor, not only in the reduction of users without a connection, but also in the number of UAVs deployed.
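
One simple way to realize the placement idea is a greedy loop over predicted per-cell overload; the grid, capacities and demand figures below are invented for illustration and do not reflect the authors' algorithm or data.

```python
# Greedy UAV placement over predicted per-cell traffic overload.
import numpy as np

predicted_demand = np.array([[120, 40], [300, 90]])  # ML-predicted users per cell
bs_capacity = np.array([[100, 100], [100, 100]])     # ground base-station capacity
uav_capacity, n_uavs = 80, 2                         # users one UAV can absorb

overload = np.maximum(predicted_demand - bs_capacity, 0).astype(float)
positions = []
for _ in range(n_uavs):
    cell = np.unravel_index(np.argmax(overload), overload.shape)
    positions.append(cell)
    overload[cell] = max(overload[cell] - uav_capacity, 0)  # UAV absorbs traffic

print("UAV positions (grid cells):", positions)
print("users still without a connection:", int(overload.sum()))
```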


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Martin Saveski ◽  
Edmond Awad ◽  
Iyad Rahwan ◽  
Manuel Cebrian

Abstract As groups increasingly take over from individual experts in many tasks, it is ever more important to understand the determinants of group success. In this paper, we study the patterns of group success in Escape The Room, a physical adventure game in which a group is tasked with escaping a maze by collectively solving a series of puzzles. We investigate (1) the characteristics of successful groups, and (2) how accurately humans and machines can spot them from a group photo. The relationship between these two questions is based on the hypothesis that the characteristics of successful groups are encoded by features that can be spotted in their photo. We analyze >43K group photos (one photo per group) taken after groups have completed the game, from which all explicit performance-signaling information has been removed. First, we find that groups that are larger, older, and more gender-diverse but less age-diverse are significantly more likely to escape. Second, we compare humans and off-the-shelf machine learning algorithms at predicting whether a group escaped based on the completion photo. We find that individual guesses by humans achieve 58.3% accuracy, better than chance but worse than machines, which achieve 71.6% accuracy. When humans are trained to guess by observing only four labeled photos, their accuracy increases to 64%. However, training humans on more labeled examples (eight or twelve) leads to a slight but statistically insignificant improvement in accuracy (67.4%). Humans in the best training condition perform on par with two, but worse than three, of the five machine learning algorithms we evaluated. Our work illustrates the potential and the limitations of machine learning systems in evaluating group performance and identifying success factors based on sparse visual cues.
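
A sketch of the machine-side comparison, assuming the group-level features the paper finds predictive (size, age, gender and age diversity) have already been extracted from the photos; file names and classifier choices are illustrative.

```python
# Compare off-the-shelf classifiers at predicting group escape from photo features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X = np.load("group_features.npy")  # hypothetical: size, mean_age, gender_div, age_div
y = np.load("escaped.npy")         # 1 = escaped, 0 = did not escape

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("random forest", RandomForestClassifier(n_estimators=300)),
                  ("gradient boosting", GradientBoostingClassifier())]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: {acc:.1%} cross-validated accuracy")
```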

