Engaging proactive control: Influences of diverse language experiences using insights from machine learning

Mapping Intimacies ◽ 10.31234/osf.io/7jq3w ◽ 2020

Author(s): Jason William Gullifer ◽ Debra Titone

Keyword(s): Machine Learning ◽ Executive Control ◽ Cross Validation ◽ Predictive Performance ◽ Penalized Regression ◽ Information Criteria ◽ Lasso Regression ◽ Continuous Performance Task ◽ Proactive Control ◽ Language Experience

We used insights from machine learning to address an important but contentious question: is bilingual language experience associated with executive control abilities? Specifically, we assessed proactive executive control in over 400 young adult bilinguals via reaction time on an AX continuous performance task (AX-CPT). We measured bilingual experience as a continuous, multidimensional spectrum (i.e., age of acquisition, language entropy, and sheer second language exposure). Linear mixed effects regression analyses indicated significant associations between bilingual language experience and proactive control, consistent with previous work. Information criteria (e.g., AIC) and cross-validation further suggested that these models are robust in predicting data from novel, unmodeled participants. These results were bolstered by cross-validated LASSO regression, a form of penalized regression. However, the results of both cross-validation procedures also indicated that similar predictive performance could be achieved through simpler models that only included information about the AX-CPT (i.e., trial type). Collectively, these results suggest that the effects of bilingual experience on proactive control, to the extent that they exist in younger adults, are likely small. Thus, future studies will require even larger or qualitatively different samples (e.g., older adults or children) in combination with valid, granular quantifications of language experience to reveal predictive effects on novel participants.
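
Predicting data from "novel, unmodeled participants" implies that cross-validation folds are split by participant rather than by trial. The following is a minimal sketch of such a participant-level split, not the authors' procedure; the function name `group_k_fold` and the participant labels are hypothetical.

```python
def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs for k folds.

    `groups` holds one label per row (e.g. a participant ID per trial).
    All rows that share a label land in the same test fold, so every
    test fold contains only participants unseen during training.
    """
    unique = sorted(set(groups))
    if k < 2 or k > len(unique):
        raise ValueError("k must be between 2 and the number of groups")
    # Assign each participant to a fold in round-robin order.
    fold_of = {g: i % k for i, g in enumerate(unique)}
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train, test


# Hypothetical example: seven trials from four participants, two folds.
trial_participants = ["p1", "p1", "p2", "p2", "p3", "p3", "p4"]
splits = list(group_k_fold(trial_participants, k=2))
```

Splitting by trial instead would let the same participant appear on both sides of a split, which tends to overstate how well a model generalizes to new people.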


Identifying the Main Risk Factors for CVD Prediction Using Machine Learning Algorithms

10.20944/preprints202108.0471.v1 ◽ 2021

Author(s): Luis Rolando Guarneros-Nolasco ◽ Nancy Aracely Cruz-Ramos ◽ Giner Alor-Hernández ◽ Lisbeth Rodríguez-Mazahua ◽ José Luis Sánchez-Cervantes

Keyword(s): Machine Learning ◽ Cross Validation ◽ Performance Metrics ◽ Learning Algorithms ◽ Predictive Performance ◽ Machine Learning Algorithms ◽ Algorithm Performance ◽ Body Regions ◽ Risks Factors ◽ Fold Cross Validation

Cardiovascular diseases (CVDs) are a leading cause of death globally. In CVDs, the heart is unable to deliver enough blood to other body regions. Since effective and accurate diagnosis of CVDs is essential for CVD prevention and treatment, machine learning (ML) techniques can be effectively and reliably used to distinguish patients suffering from a CVD from those who do not suffer from any heart condition. Namely, machine learning algorithms (MLAs) play a key role in the diagnosis of CVDs through predictive models that allow us to identify the main risk factors influencing CVD development. In this study, we analyze the performance of ten MLAs on two datasets for CVD prediction and two for CVD diagnosis. Algorithm performance is analyzed on the top-two and top-four dataset attributes/features with respect to five performance metrics (accuracy, precision, recall, F1-score, and ROC-AUC) using the train-test split technique and k-fold cross-validation. Our study identifies the top two and four attributes from each CVD diagnosis/prediction dataset. As our main finding, the ten MLAs exhibited appropriate diagnostic and predictive performance; hence, they can be successfully implemented to improve current CVD diagnosis efforts and help patients around the world, especially in regions where medical staff are lacking.
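
Four of the five metrics named above follow directly from the counts of a binary confusion matrix. The sketch below shows those standard definitions; it is an illustration with made-up labels, not the study's code, and the function name `classification_metrics` is hypothetical. ROC-AUC is omitted because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only.
example = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                 [1, 1, 0, 0, 0, 1, 0, 1])
```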


Bayesian Predictive Performance Assessment of Rate-Time Models for Unconventional Production Forecasting

10.2118/205151-ms ◽ 2021

Author(s): Leopoldo M. Ruiz Maraggi ◽ Larry W. Lake ◽ Mark P. Walsh

Keyword(s): Cross Validation ◽ Predictive Accuracy ◽ Predictive Performance ◽ Information Criteria ◽ Logistic Growth ◽ Tight Oil ◽ Two Phase ◽ Production Forecasting ◽ Point Estimates ◽ Future Production

Abstract A common industry practice is to select a particular model from a set of models to history match oil production and estimate reserves by extrapolation. Future production forecasting is usually done in this deterministic way. However, this approach neglects (a) model uncertainty and (b) quantification of the uncertainty of future production forecasts. The current study evaluates the predictive accuracy of rate-time models to forecast production over a set of tight oil wells in West Texas. We present the application of an accuracy metric that evaluates the uncertainty of our models' estimates: the expected log predictive density (elpd). This work assesses the predictive performance of two empirical models (the Arps hyperbolic and the logistic growth models) and two physics-based models: the scaled slightly compressible single-phase and scaled two-phase (oil and gas) solutions of the diffusivity equation. These models are arbitrarily selected for the purpose of illustrating the statistical procedure shown in this paper. First, we perform classical regression with the models and evaluate their predictive performance using frequentist (point-estimate) metrics such as R², the Akaike information criterion (AIC), and hindcasting. Second, we generate probabilistic production forecasts using Bayesian inference for each model. Third, we evaluate the predictive accuracy of the models using the elpd accuracy metric, which measures out-of-sample predictive performance. We apply both adjusted within-sample and cross-validation techniques. The adjusted within-sample method is the widely applicable information criterion (WAIC). The cross-validation techniques are hindcasting and the leave-one-out (LOO-CV) method. The results of this research are the following. First, we illustrate that the assessment of a model's predictive accuracy depends on whether we use frequentist or Bayesian approaches. This is an important finding of this work. The frequentist approach relies on point estimates, while the Bayesian approach considers the uncertainty of our models' estimates. From a frequentist or classical standpoint, all of the models under study yielded very similar results, which made it difficult to determine which model yielded the best predictive performance. From a Bayesian standpoint, however, we determined that the logistic growth model yielded the best match in 81 of 130 wells in our sample play and the two-phase physics-based model yielded the best match in 39 of the wells. In addition, we show that WAIC and LOO-CV present similar results for each model, as expected given their asymptotic equivalence. Finally, our observations regarding the different models are subject to the dataset under study, wherein a majority of the wells are in transient flow. The present study provides tools to evaluate the predictive accuracy of models used to forecast (extrapolate) production of tight oil wells. The elpd is an accuracy metric useful for evaluating the uncertainty of our models' estimates and comparing their predictive performance, since it assesses distributions instead of point estimates. To our knowledge, the proposed approach is a novel and appropriate technique to evaluate the predictive accuracy of models to forecast hydrocarbon production.
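
For the frequentist step, the AIC of a least-squares fit can be computed from the residual sum of squares. The sketch below assumes independent Gaussian errors and drops the additive constant shared by all models fitted to the same data; the function name `gaussian_aic` and the numbers are hypothetical, and conventions differ on whether the noise variance counts toward `k`.

```python
import math


def gaussian_aic(rss, n, k):
    """AIC of a least-squares fit under Gaussian errors, up to a constant.

    rss: residual sum of squares of the fitted model
    n:   number of observations
    k:   number of fitted parameters (assumption: counted the same way
         for every model being compared)
    """
    if rss <= 0 or n <= 0:
        raise ValueError("rss and n must be positive")
    return n * math.log(rss / n) + 2 * k


# Hypothetical comparison of two models fitted to the same 100 points.
aic_simple = gaussian_aic(rss=12.0, n=100, k=3)
aic_complex = gaussian_aic(rss=11.8, n=100, k=6)
preferred = "simple" if aic_simple < aic_complex else "complex"
```

Lower AIC is preferred. Because AIC is built from point estimates, near-identical values across models are consistent with the difficulty the abstract reports in ranking models from a frequentist standpoint.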


Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides

Briefings in Bioinformatics ◽ 10.1093/bib/bbab083 ◽ 2021

Author(s): Jing Xu ◽ Fuyi Li ◽ André Leier ◽ Dongxu Xiang ◽ Hsin-Hui Shen ◽ ...

Keyword(s): Machine Learning ◽ Antimicrobial Peptides ◽ Computational Methods ◽ Cross Validation ◽ Predictive Performance ◽ Support Vector ◽ Data Sets ◽ Learning Methods ◽ Data Set ◽ Machine Learning Methods

Abstract Antimicrobial peptides (AMPs) are a unique and diverse group of molecules that play a crucial role in a myriad of biological processes and cellular functions. AMP-related studies have become increasingly popular in recent years due to antimicrobial resistance, which is becoming an emerging global concern. Systematic experimental identification of AMPs faces many difficulties due to the limitations of current methods. Given its significance, more than 30 computational methods have been developed for accurate prediction of AMPs. These approaches show high diversity in their data set size, data quality, core algorithms, feature extraction, feature selection techniques and evaluation strategies. Here, we provide a comprehensive survey of a variety of current approaches for AMP identification and point out the differences between these methods. In addition, we evaluate the predictive performance of the surveyed tools based on an independent test data set containing 1536 AMPs and 1536 non-AMPs. Furthermore, we construct six validation data sets based on six different common AMP databases and compare different computational methods based on these data sets. The results indicate that amPEPpy achieves the best predictive performance and outperforms the other compared methods. As the predictive performances are affected by the different data sets used by different methods, we additionally perform a 5-fold cross-validation test to benchmark different traditional machine learning methods on the same data set. These cross-validation results indicate that random forest, support vector machine and eXtreme Gradient Boosting achieve comparatively better performance than other machine learning methods and are often the algorithms of choice for multiple AMP prediction tools.


Randomized boosting with multivariable base-learners for high-dimensional variable selection and prediction

BMC Bioinformatics ◽ 10.1186/s12859-021-04340-z ◽ 2021 ◽ Vol 22 (1)

Author(s): Christian Staerk ◽ Andreas Mayr

Keyword(s): Variable Selection ◽ Prediction Models ◽ Predictor Variable ◽ Predictive Performance ◽ Penalized Regression ◽ Information Criteria ◽ High Dimensional ◽ Gradient Boosting ◽ Biomedical Data ◽ Selection Of

Abstract Background: Statistical boosting is a computational approach to select and estimate interpretable prediction models for high-dimensional biomedical data, leading to implicit regularization and variable selection when combined with early stopping. Traditionally, the set of base-learners is fixed for all iterations and consists of simple regression learners including only one predictor variable at a time. Furthermore, the number of iterations is typically tuned by optimizing the predictive performance, leading to models which often include unnecessarily large numbers of noise variables. Results: We propose three consecutive extensions of classical component-wise gradient boosting. In the first extension, called Subspace Boosting (SubBoost), base-learners can consist of several variables, allowing for multivariable updates in a single iteration. To compensate for the larger flexibility, the ultimate selection of base-learners is based on information criteria leading to an automatic stopping of the algorithm. As the second extension, Random Subspace Boosting (RSubBoost) additionally includes a random preselection of base-learners in each iteration, enabling the scalability to high-dimensional data. In a third extension, called Adaptive Subspace Boosting (AdaSubBoost), an adaptive random preselection of base-learners is considered, focusing on base-learners which have proven to be predictive in previous iterations. Simulation results show that the multivariable updates in the three subspace algorithms are particularly beneficial in cases of high correlations among signal covariates. In several biomedical applications the proposed algorithms tend to yield sparser models than classical statistical boosting, while showing a very competitive predictive performance also compared to penalized regression approaches like the (relaxed) lasso and the elastic net. Conclusions: The proposed randomized boosting approaches with multivariable base-learners are promising extensions of statistical boosting, particularly suited for highly-correlated and sparse high-dimensional settings. The incorporated selection of base-learners via information criteria induces automatic stopping of the algorithms, promoting sparser and more interpretable prediction models.


Machine Learning-Based Scoring Functions. Development and Applications with SAnDReS.

Current Medicinal Chemistry ◽ 10.2174/0929867327666200515101820 ◽ 2020 ◽ Vol 27

Author(s): Gabriela Bitencourt-Ferreira ◽ Camila Rizzotto ◽ Walter Filgueira de Azevedo Junior

Keyword(s): Machine Learning ◽ Binding Affinity ◽ Drug Targets ◽ Computational Models ◽ Factor Xa ◽ Coagulation Factor ◽ Predictive Performance ◽ Machine Learning Techniques ◽ Scoring Functions ◽ Molegro Virtual Docker

Background: Analysis of atomic coordinates of protein-ligand complexes can provide three-dimensional data to generate computational models to evaluate binding affinity and thermodynamic state functions. Application of machine learning techniques can create models to assess protein-ligand potential energy and binding affinity. These methods show superior predictive performance when compared with classical scoring functions available in docking programs. Objective: Our purpose here is to review the development and application of the program SAnDReS. We describe the creation of machine learning models to assess the binding affinity of protein-ligand complexes. Method: SAnDReS implements machine learning methods available in the scikit-learn library. This program is available for download at https://github.com/azevedolab/sandres. SAnDReS uses crystallographic structures, binding, and thermodynamic data to create targeted scoring functions. Results: Recent applications of the program SAnDReS to drug targets such as coagulation factor Xa, cyclin-dependent kinases, and HIV-1 protease produced targeted scoring functions to predict inhibition of these proteins. These targeted models outperform classical scoring functions. Conclusion: Here, we reviewed the development of machine learning scoring functions to predict binding affinity through the application of the program SAnDReS. Our studies show the superior predictive performance of the SAnDReS-developed models when compared with classical scoring functions available in programs such as AutoDock4, Molegro Virtual Docker, and AutoDock Vina.


Prediction of K562 Cells Functional Inhibitors Based on Machine Learning Approaches

Current Pharmaceutical Design ◽ 10.2174/1381612825666191107092214 ◽ 2020 ◽ Vol 25 (40) ◽ pp. 4296-4302 ◽ Cited By ~ 2

Author(s): Yuan Zhang ◽ Zhenyan Han ◽ Qian Gao ◽ Xiaoyi Bai ◽ Chi Zhang ◽ ...

Keyword(s): Machine Learning ◽ Inclusion Bodies ◽ Cross Validation ◽ Independent Set ◽ K562 Cells ◽ Machine Learning Algorithms ◽ Learning Approaches ◽ Validation Test ◽ Excess Number ◽ Fold Cross Validation

Background: β thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain, resulting in a relative excess of α-chains. The formation of inclusion bodies deposited on the cell membrane decreases the ability of red blood cells to deform, leading to their massive destruction in the spleen and to a group of hereditary haemolytic diseases. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 cells based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracies (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model for the prediction of inhibitors against K562 cells.
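
With 117 inhibitors and 190 non-inhibitors, the classes are unbalanced, so a 10-fold split is often stratified to keep the class ratio similar in every fold. The abstract does not say whether stratification was used; the sketch below is one way it could be done, and the function name `stratified_k_fold` and the seed are hypothetical.

```python
import random


def stratified_k_fold(labels, k, seed=0):
    """Partition row indices into k folds with near-equal class proportions.

    Each class is shuffled separately and dealt round-robin across the
    folds, so no fold differs from another by more than one row per class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return [sorted(fold) for fold in folds]


# Class sizes taken from the abstract: 1 = inhibitor, 0 = non-inhibitor.
compound_labels = [1] * 117 + [0] * 190
cv_folds = stratified_k_fold(compound_labels, k=10)
```

Each fold serves once as the held-out set while the other nine are used for training; the reported cross-validation accuracy would then be computed over the held-out predictions.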


Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2

Current Medicinal Chemistry ◽ 10.2174/2213275912666191102162959 ◽ 2020 ◽ Vol 28 (2) ◽ pp. 253-265 ◽ Cited By ~ 3

Author(s): Gabriela Bitencourt-Ferreira ◽ Amauri Duarte da Silva ◽ Walter Filgueira de Azevedo

Keyword(s): Machine Learning ◽ Binding Affinity ◽ Predictive Performance ◽ Supervised Machine Learning ◽ Machine Learning Techniques ◽ Scoring Functions ◽ Cyclin Dependent Kinase ◽ Learning Models ◽ Learning Techniques ◽ Machine Learning Models

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine-learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.


The landscape of gene co-expression modules correlating with prognostic genetic abnormalities in AML

Journal of Translational Medicine ◽ 10.1186/s12967-021-02914-2 ◽ 2021 ◽ Vol 19 (1)

Author(s): Chao Guo ◽ Ya-yue Gao ◽ Qian-qian Ju ◽ Chun-xia Zhang ◽ Ming Gong ◽ ...

Keyword(s): Regression Analysis ◽ Prediction Model ◽ Hox Genes ◽ Expression Profiles ◽ Penalized Regression ◽ Diagnostic Utility ◽ Lasso Regression ◽ Hub Genes ◽ Npm1 Mutation ◽ Genetic Abnormalities

Abstract Background: AML patients harbor heterogeneous cytogenetic and molecular variations, some of which are related to AML pathogenesis and clinical outcomes. We aimed to uncover the intrinsic expression profiles correlating with prognostic genetic abnormalities by WGCNA. Methods: We downloaded the clinical and expression datasets from the BeatAML, TCGA and GEO databases. Using R (version 4.0.2) and the ‘WGCNA’ package, the co-expression modules correlating with the ELN2017 prognostic markers were identified (R2 ≥ 0.4, p < 0.01). ORA detected the enriched pathways for the key co-expression modules. The patients in the TCGA cohort were randomly assigned to the training set (50%) and testing set (50%). LASSO penalized regression analysis was employed to build the prediction model, fitting OS to the expression level of hub genes with the ‘glmnet’ package. Then the testing set and 2 independent validation sets (GSE12417 and GSE37642) were used to validate the diagnostic utility and accuracy of the model. Results: A total of 37 gene co-expression modules and 973 hub genes were identified for the BeatAML cohort. We found that 3 modules were significantly correlated with genetic markers (the ‘lightyellow’ module for NPM1 mutation, the ‘saddlebrown’ module for RUNX1 mutation, the ‘lightgreen’ module for TP53 mutation). ORA revealed that the ‘lightyellow’ module was mainly enriched in DNA-binding transcription factor activity and activation of HOX genes. The ‘saddlebrown’ module was enriched in the immune response process, and the ‘lightgreen’ module was predominantly enriched in the mitotic cell cycle process. The LASSO regression analysis identified 6 genes (NFKB2, NEK9, HOXA7, APRC5L, FAM30A and LOC105371592) with non-zero coefficients. The risk score generated from the 6-gene model was associated with ELN2017 risk stratification, relapsed disease, and prior MDS history. The 5-year AUC for the model was 0.822 and 0.824 in the training and testing sets, respectively. Moreover, the diagnostic utility of the model was robust when it was employed in 2 validation sets (5-year AUC 0.743–0.79). Conclusions: We established the co-expression network signature correlated with the ELN2017 recommended prognostic genetic abnormalities in AML. The 6-gene prediction model for AML survival was developed and validated on multiple datasets.
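
LASSO selects a small gene set because its L1 penalty drives weak coefficients exactly to zero. The sketch below illustrates that mechanism with cyclic coordinate descent on a plain squared-error loss; this is an assumption made for illustration, since the abstract fits OS with glmnet and does not state the model family here. All names and data are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of rows and y a list of responses. Both are assumed to be
    centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that excludes column j.
            rho = 0.0
            for i in range(n):
                partial = y[i] - sum(X[i][m] * beta[m] for m in range(p) if m != j)
                rho += X[i][j] * partial
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Toy data: the first predictor is strongly related to y, the second weakly.
X_toy = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y_toy = [2.0, -2.0, 0.1, -0.1]
coef = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
```

On this toy data the weak predictor's coefficient is set to exactly zero while the strong one is only shrunk, which is the behaviour behind reporting genes "with non-zero coefficients".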


Development of Machine Learning Models to Predict Probabilities and Types of Stroke at Prehospital Stage: the Japan Urgent Stroke Triage Score Using Machine Learning (JUST-ML)

Translational Stroke Research ◽ 10.1007/s12975-021-00937-x ◽ 2021

Author(s): Kazutaka Uchida ◽ Junichi Kouno ◽ Shinichi Yoshimura ◽ Norito Kinjo ◽ Fumihiro Sakakibara ◽ ...

Keyword(s): Machine Learning ◽ Logistic Regression ◽ Random Forests ◽ Prediction Models ◽ Characteristic Curve ◽ Predictive Performance ◽ Vessel Occlusion ◽ Predictive Values ◽ Training Cohort ◽ Sensitivity Specificity

Abstract In conjunction with recent advancements in machine learning (ML), such technologies have been applied in various fields owing to their high predictive performance. We sought to develop a prehospital stroke scale with ML. We conducted a multi-center retrospective and prospective cohort study. The training cohort comprised eight centers in Japan from June 2015 to March 2018, and the test cohort comprised 13 centers from April 2019 to March 2020. We used three different ML algorithms (logistic regression, random forests, XGBoost) to develop models. Main outcomes were large vessel occlusion (LVO), intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), and cerebral infarction (CI) other than LVO. The predictive abilities were validated in the test cohort with accuracy, positive predictive value, sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and F score. The training cohort included 3178 patients with 337 LVO, 487 ICH, 131 SAH, and 676 CI cases, and the test cohort included 3127 patients with 183 LVO, 372 ICH, 90 SAH, and 577 CI cases. The overall accuracies were 0.65, and the positive predictive values, sensitivities, specificities, AUCs, and F scores were stable in the test cohort. The classification abilities were also fair for all ML models. The AUCs for LVO of logistic regression, random forests, and XGBoost were 0.89, 0.89, and 0.88, respectively, in the test cohort, and these values were higher than those of previously reported prediction models for LVO. The ML models developed to predict the probability and types of stroke at the prehospital stage had superior predictive abilities.


Machine Learning for Predicting Risk of Drug-Induced Autoimmune Diseases by Structural Alerts and Daily Dose

International Journal of Environmental Research and Public Health ◽ 10.3390/ijerph18137139 ◽ 2021 ◽ Vol 18 (13) ◽ pp. 7139

Author(s): Yue Wu ◽ Jieqiang Zhu ◽ Peter Fu ◽ Weida Tong ◽ Huixiao Hong ◽ ...

Keyword(s): Machine Learning ◽ Autoimmune Diseases ◽ Odds Ratio ◽ Area Under Curve ◽ Predictive Performance ◽ Drug Induced ◽ Drug Candidates ◽ Daily Dose ◽ Structural Alerts ◽ Underlying Mechanisms

An effective approach for assessing a drug’s potential to induce autoimmune diseases (ADs) is needed in drug development. Here, we aim to develop a workflow to examine the association between structural alerts and drug-induced ADs to improve toxicological prescreening tools. Considering reactive metabolite (RM) formation as a well-documented mechanism for drug-induced ADs, we investigated whether the presence of certain RM-related structural alerts was predictive of the risk of drug-induced AD. We constructed a database containing 171 RM-related structural alerts, generated a dataset of 407 AD- and non-AD-associated drugs, and performed statistical analysis. The nitrogen-containing benzene substituent alerts were found to be significantly associated with the risk of drug-induced ADs (odds ratio = 2.95, p = 0.0036). Furthermore, we developed a machine-learning-based predictive model using daily dose and nitrogen-containing benzene substituent alerts as the top inputs and achieved a predictive performance, measured as area under the curve (AUC), of 70%. Additionally, we confirmed the reactivity of the nitrogen-containing benzene substituent aniline and related metabolites using quantum chemistry analysis and explored the underlying mechanisms. These identified structural alerts could be helpful in identifying drug candidates that carry a potential risk of drug-induced ADs to improve their safety profiles.
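
An odds ratio such as the one reported for the structural alerts comes from a 2x2 table of alert presence against AD association. The sketch below shows the standard calculation with a Wald-type 95% interval on the log scale; the cell counts are hypothetical and are not the study's data, and the function names are illustrative.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

                      AD-associated   not AD-associated
    alert present           a                 b
    alert absent            c                 d
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    return (a * d) / (b * c)


def wald_interval(a, b, c, d, z=1.96):
    """Approximate confidence interval computed on the log-odds scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts for illustration only.
ratio = odds_ratio(20, 10, 30, 45)
lower, upper = wald_interval(20, 10, 30, 45)
```

An odds ratio above 1 whose interval excludes 1 indicates that drugs carrying the alert are more often AD-associated than drugs without it.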
