iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization

2021 ◽  
Author(s):  
Zhen Chen ◽  
Pei Zhao ◽  
Chen Li ◽  
Fuyi Li ◽  
Dongxu Xiang ◽  
...  

Abstract Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
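As an illustration of what "feature sets which encode information from the input sequences" can look like, the sketch below computes normalized k-mer frequencies for a nucleotide sequence. This is a generic encoding chosen for illustration; the function name `kmer_frequencies`, its parameters, and the example sequence are assumptions, and the code is not taken from iLearnPlus.

```python
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return the normalized frequency of every possible k-mer, in lexicographic order.

    Hypothetical helper for illustration only; not part of any published tool.
    """
    # Enumerate all len(alphabet)**k possible k-mers so the vector has a fixed length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k across the sequence and count recognized k-mers.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # windows with unexpected symbols (e.g. N) are skipped
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


# Example: a 6-nt sequence yields five overlapping 2-mers and a 16-dimensional vector.
features = kmer_frequencies("ACGTAC", k=2)
```

A fixed-length vector like this can be passed to any standard classifier, which is the general idea behind sequence-based feature extraction.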

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Khushnood Abbas ◽  
Alireza Abbasi ◽  
Shi Dong ◽  
Ling Niu ◽  
Laihang Yu ◽  
...  

Abstract Background Technological and research advances have produced large volumes of biomedical data. When represented as a network (graph), these data become useful for modeling entities and interactions in biological and similar complex systems. In the field of network biology and network medicine, there is a particular interest in predicting results from drug–drug, drug–disease, and protein–protein interactions to advance the speed of drug discovery. Existing data and modern computational methods make it possible to identify potentially beneficial and harmful interactions and therefore to narrow down drug trials ahead of actual clinical trials. Such automated data-driven investigation relies on machine learning techniques. However, traditional machine learning approaches require extensive preprocessing of the data, which makes them impractical for large datasets. This study presents a wide range of machine learning methods for predicting outcomes from biomedical interactions and compares the performance of traditional methods with that of more recent network-based approaches. Results We applied 32 different network-based machine learning models to five commonly available biomedical datasets and evaluated their performance using three important evaluation metrics, namely AUROC, AUPR, and F1-score. We achieved this by converting the link prediction problem into a binary classification problem: existing links were treated as positive examples, and negative examples were randomly sampled from the set of non-existent links. After experimental evaluation we found that Prone, ACT and $$LRW_5$$ are the top three performers on all five datasets. Conclusions This work presents a comparative evaluation of network-based machine learning algorithms for predicting network links, with applications in the prediction of drug–target and drug–drug interactions, applying well-known network-based machine learning methods.
Our work is helpful in guiding researchers in the appropriate selection of machine learning methods for pharmaceutical tasks.
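The conversion of link prediction into binary classification described in the abstract can be sketched as follows. This is a minimal illustration, assuming an undirected graph and a 1:1 ratio of positive to negative examples; the function name, the toy node and edge names, and the sampling ratio are assumptions rather than details taken from the paper.

```python
import random
from itertools import combinations


def build_link_prediction_dataset(nodes, edges, seed=0):
    """Label existing links 1 and an equal number of sampled non-links 0.

    Hypothetical helper; the 1:1 sampling ratio is an assumption.
    """
    # Treat the graph as undirected: store each edge as a sorted tuple.
    positives = sorted({tuple(sorted(e)) for e in edges})
    positive_set = set(positives)
    # Every unordered node pair that is not an existing link is a candidate negative.
    candidates = [p for p in combinations(sorted(nodes), 2) if p not in positive_set]
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    negatives = rng.sample(candidates, min(len(positives), len(candidates)))
    return [(p, 1) for p in positives] + [(n, 0) for n in negatives]


# Toy network with made-up entities.
nodes = ["drugA", "drugB", "drugC", "protX", "protY"]
edges = [("drugA", "protX"), ("drugB", "protX"), ("drugC", "protY")]
dataset = build_link_prediction_dataset(nodes, edges)
```

Each labeled pair would then be turned into a feature vector (for example from node embeddings) and passed to a classifier, whose scores can be evaluated with AUROC, AUPR, and F1-score.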


2019 ◽  
Author(s):  
Mahsa Torkamanian-Afshar ◽  
Hossein Lanjanian ◽  
Sajjad Nematzadeh ◽  
Maryam Tabarzad ◽  
Ali Najafi ◽  
...  

Abstract Background RNA-protein interactions play crucial roles in biological processes. Recent progress in clarifying RNA and protein structural features has created an urgent need for databases covering the specificity and the mechanism of the underlying interactions between a protein and an RNA molecule. The majority of existing databases have focused on RNA or protein macromolecules independently, and they do not have the capability to run integrated queries on the RNA-protein complex. These existing databases have a linear query structure. Furthermore, they only focus on interacting (positive) samples and they do not contain non-interacting (negative) samples. Results We developed a Database for RNA-Protein Interaction Network Analysis and Aptamer Design (RPINaptaBASE). RPINaptaBASE has a nested query approach that enables users to apply nonlinear query analysis. The query engine module contains a wide range of features related to RNA and protein sequences and secondary structure elements of these macromolecules, which are helpful for generating custom datasets, especially for machine learning approaches. In this version, more than 175 features were calculated and made available to users. It provides a web interface with download management services allowing users to generate desired datasets of unique RNA or protein sequences in independent lists. Furthermore, the web service empowers users to create artificial datasets of positive and negative samples from RNA-protein complexes. To produce negative samples, the idea of distinguishing protein sequences by their clans and families was employed to efficiently generate non-interacting pairs. Conclusion This database provides a user-friendly platform to study RNA-protein interactions. It also contributes to simplifying the oligonucleotide-aptamer design process using machine learning algorithms. RPINaptaBASE is freely available at http://rpinbase.com


Author(s):  
Lucia Alessi ◽  
Roberto Savona

Abstract What we learned from the global financial crisis is that to get information about the underlying financial risk dynamics, we need to fully understand the complex, nonlinear, time-varying, and multidimensional nature of the data. A strand of literature has shown that machine learning approaches can make more accurate data-driven predictions than standard empirical models, thus providing more, and more timely, information about the build-up of financial risks. Advanced machine learning techniques provide several advantages over empirical models traditionally used to monitor and predict financial developments. First, they are able to deal with high-dimensional datasets. Second, machine learning algorithms can handle unbalanced datasets and retain all of the information available. Third, these methods are purely data driven. All of these characteristics contribute to their often better predictive performance. However, as “black box” models, they are still much underutilized in financial stability, a field where interpretability and accountability are crucial.


2020 ◽  
Vol 25 (40) ◽  
pp. 4296-4302 ◽  
Author(s):  
Yuan Zhang ◽  
Zhenyan Han ◽  
Qian Gao ◽  
Xiaoyi Bai ◽  
Chi Zhang ◽  
...  

Background: β-thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain, resulting in a relative excess of α-chains. Inclusion bodies deposited on the cell membrane reduce the ability of red blood cells to deform, leading to a group of hereditary haemolytic diseases caused by massive destruction of these cells in the spleen. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracies (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model for the prediction of inhibitors against K562 cells.


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1370
Author(s):  
Igor Vuković ◽  
Kristijan Kuk ◽  
Petar Čisar ◽  
Miloš Banđur ◽  
Đoko Banđur ◽  
...  

Moodle is a widely deployed distance learning platform that provides numerous opportunities to enhance the learning process. Moodle’s importance in maintaining the continuity of education in states of emergency and other circumstances has been particularly demonstrated in the context of the COVID-19 virus’ rapid spread. However, there is a problem with personalizing the learning and monitoring of students’ work. There is room for upgrading the system by applying data mining and different machine-learning methods. The multi-agent Observer system proposed in our paper supports students engaged in learning by monitoring their work and making suggestions based on the prediction of their final course success, using indicators of engagement and machine-learning algorithms. A novelty is that Observer collects data independently of the Moodle database, autonomously creates a training set, and learns from gathered data. Since the data are anonymized, researchers and lecturers can freely use them for purposes broader than that specified for Observer. The paper shows how the methodology, technologies, and techniques used in Observer provide an autonomous system of personalized assistance for students within Moodle platforms.


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Carl E. Belle ◽  
Vural Aksakalli ◽  
Salvy P. Russo

Abstract For photovoltaic materials, properties such as the band gap $$E_{g}$$ are critical indicators of the material’s suitability to perform a desired function. Calculating $$E_{g}$$ is often performed using Density Functional Theory (DFT) methods, although more accurate calculations are performed using methods such as the GW approximation. DFT software often used to compute electronic properties includes applications such as VASP, CRYSTAL, CASTEP or Quantum Espresso. Depending on the unit cell size and symmetry of the material, these calculations can be computationally expensive. In this study, we present a new machine learning platform for the accurate prediction of properties such as $$E_{g}$$ of a wide range of materials.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Imogen Schofield ◽  
David C. Brodbelt ◽  
Noel Kennedy ◽  
Stijn J. M. Niessen ◽  
David B. Church ◽  
...  

Abstract Cushing’s syndrome is an endocrine disease in dogs that negatively impacts the quality of life of affected animals. Cushing’s syndrome can be a challenging diagnosis to confirm; therefore, new methods to aid diagnosis are warranted. Four machine-learning algorithms were applied to predict a future diagnosis of Cushing’s syndrome, using structured clinical data from the VetCompass programme in the UK. Dogs suspected of having Cushing’s syndrome were included in the analysis and classified based on their final reported diagnosis within their clinical records. Demographic and clinical features available at the point of first suspicion by the attending veterinarian were included within the models. The machine-learning methods were able to classify the recorded Cushing’s syndrome diagnoses with good predictive performance. The LASSO penalised regression model showed the best overall performance when applied to the test set, with an AUROC = 0.85 (95% CI 0.80–0.89), sensitivity = 0.71, specificity = 0.82, PPV = 0.75 and NPV = 0.78. The findings of our study indicate that machine-learning methods could predict the future diagnosis of a practicing veterinarian. New approaches using these methods could support clinical decision-making and contribute to improved diagnosis of Cushing’s syndrome in dogs.
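The sensitivity, specificity, PPV and NPV reported above are all derived from a confusion matrix. The sketch below shows the standard definitions; the function name and the counts are made up for illustration and are not the study’s data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases that were flagged
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # share of positive predictions that were right
        "npv": tn / (tn + fn),          # share of negative predictions that were right
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = diagnostic_metrics(tp=40, fp=20, tn=30, fn=10)
```

Unlike sensitivity and specificity, PPV and NPV depend on how common the condition is in the evaluated population, so they do not transfer directly to populations with a different prevalence.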


Author(s):  
Magdalena Kukla-Bartoszek ◽  
Paweł Teisseyre ◽  
Ewelina Pośpiech ◽  
Joanna Karłowska-Pik ◽  
Piotr Zieliński ◽  
...  

Abstract Increasing understanding of human genome variability allows for better use of the predictive potential of DNA. An obvious direct application is the prediction of physical phenotypes. Significant success has been achieved, especially in predicting pigmentation characteristics, but the inference of some phenotypes is still challenging. In search of further improvements in predicting human eye colour, we conducted whole-exome (enriched in regulome) sequencing of 150 Polish samples to discover new markers. For this, we adopted quantitative characterization of eye colour phenotypes using high-resolution photographic images of the iris in combination with DIAT software analysis. An independent set of 849 samples was used for subsequent predictive modelling. Newly identified candidates, 114 additional SNPs selected from the literature as previously associated with pigmentation, and advanced machine learning algorithms were used. Whole-exome sequencing analysis found 27 previously unreported candidate SNP markers for eye colour. The highest overall prediction accuracies were achieved with LASSO-regularized and BIC-based selected regression models. A new candidate variant, rs2253104, located in the ARFIP2 gene and identified with the HyperLasso method, revealed predictive potential and was included in the best-performing regression models. Advanced machine learning approaches showed a significant increase in sensitivity of intermediate eye colour prediction (up to 39%) compared to 0% obtained for the original IrisPlex model. We identified a new potential predictor of eye colour and evaluated several widely used advanced machine learning algorithms in predictive analysis of this trait. Our results provide useful hints for developing future predictive models for eye colour in forensic and anthropological studies.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Chengmao Zhou ◽  
Junhong Hu ◽  
Ying Wang ◽  
Mu-Huo Ji ◽  
Jianhua Tong ◽  
...  

Abstract This study explores the predictive performance of machine learning for postoperative recurrence in patients with gastric cancer. The available data are divided into two parts: the first part is used as a training set (e.g. 80% of the original data), and the second part is used as a test set (the remaining 20% of the data). Fivefold cross-validation is also used. Ranked by weight, the top four recurrence factors are, in order, BMI, operation time, WGT and age. In the training group, among the five machine learning models, the accuracy of gbm was 0.891, followed by the gbm algorithm at 0.876. The AUC values of the five machine learning algorithms, from high to low, are forest (0.962), gbm (0.922), GradientBoosting (0.898), DecisionTree (0.790) and Logistic (0.748). The precision of forest is the highest (0.957), followed by the GradientBoosting algorithm (0.878). In the test group, Logistic had the highest accuracy (0.801), followed by the forest algorithm and gbm; the AUC values of the five algorithms, from high to low, are forest (0.795), GradientBoosting (0.774), DecisionTree (0.773), Logistic (0.771) and gbm (0.771). Among the five machine learning algorithms, Logistic had the highest precision (1.000), followed by gbm (0.487). Machine learning can predict the recurrence of gastric cancer in patients after an operation. In addition, the top four factors affecting postoperative recurrence of gastric cancer were BMI, operation time, WGT and age.
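The 80/20 split combined with fivefold cross-validation can be sketched with index bookkeeping alone. This is a minimal illustration, assuming that cross-validation is run on the training portion only, which the abstract does not state; the function names, the sample count of 100, and the seed are assumptions.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and hold out `test_fraction` of them as a test set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for a reproducible split
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    # Deal the indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, validation


# 100 hypothetical patients: 80 for training, 20 held out for the final test.
train_idx, test_idx = train_test_split_indices(100)
splits = list(k_fold_indices(train_idx, k=5))
```

Keeping the test indices out of every fold means the reported test-group metrics come from data the models never saw during tuning.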

