Predicting Pseudogene–miRNA Associations Based on Feature Fusion and Graph Auto-Encoder

2021 ◽  
Vol 12 ◽  
Author(s):  
Shijia Zhou ◽  
Weicheng Sun ◽  
Ping Zhang ◽  
Li Li

Pseudogenes were originally regarded as non-functional components scattered in the genome during evolution. Recent studies have shown that pseudogenes can be transcribed into long non-coding RNAs and play key roles at multiple functional levels in different physiological and pathological processes. MicroRNAs (miRNAs) are a type of non-coding RNA that plays important regulatory roles in cells. Numerous studies have shown that pseudogenes and miRNAs interact and form a competing endogenous RNA (ceRNA) network with mRNA that regulates biological processes and is involved in diseases. Exploring the associations between pseudogenes and miRNAs will facilitate the clinical diagnosis of some diseases. Here, we propose a prediction model, PMGAE (Pseudogene–MiRNA association prediction based on the Graph Auto-Encoder), which incorporates feature fusion, a graph auto-encoder (GAE), and eXtreme Gradient Boosting (XGBoost). First, we calculated three types of similarity between nodes (Jaccard similarity, cosine similarity, and Pearson similarity) based on the biological characteristics of pseudogenes and miRNAs. Subsequently, we fused these similarities to construct a similarity profile as the initial representation features for nodes. Then, we aggregated the similarity profiles and associations of nodes through a GAE to obtain low-dimensional representation vectors of the nodes. In the last step, we fed these representation vectors into an XGBoost classifier to predict new pseudogene–miRNA associations (PMAs). The results of five-fold cross-validation show that PMGAE achieves a mean AUC of 0.8634 and a mean AUPR of 0.8966. Case studies further substantiated the reliability of PMGAE for mining PMAs and for studying endogenous RNA networks in relation to diseases.
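The similarity and fusion steps described above can be illustrated with a minimal sketch. It assumes each node is represented by a binary association profile and that fusion means concatenating the three similarity rows; the abstract does not specify either detail, and the toy data and the function name `similarity_profile` are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of two binary vectors: |a AND b| / |a OR b|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b) if std_a and std_b else 0.0


def similarity_profile(profiles, i):
    """Concatenate the three similarities of node i to every node (assumed fusion)."""
    row = []
    for sim in (jaccard, cosine, pearson):
        row.extend(sim(profiles[i], p) for p in profiles)
    return row


# Hypothetical toy data: 3 pseudogenes x 4 miRNAs, 1 = known association.
profiles = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
fused = similarity_profile(profiles, 0)
print(len(fused))  # 3 similarity types x 3 nodes
```

In a pipeline like the one described, rows of this kind would serve as the initial node features passed to the graph auto-encoder.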

2019 ◽  
Vol 2019 ◽  
pp. 1-13 ◽  
Author(s):  
Yunxin Xie ◽  
Chenyang Zhu ◽  
Yue Lu ◽  
Zhengwei Zhu

Lithology identification is an indispensable part of geological research and petroleum engineering. In recent years, several mathematical approaches have been used to improve the accuracy of lithology classification. Building on our earlier work, which assessed machine learning models for formation lithology classification, we optimize boosting approaches to improve the classification ability of our models using data collected from the Daniudi and Hangjinqi gas fields. Three boosting models, namely AdaBoost, Gradient Tree Boosting, and eXtreme Gradient Boosting, are evaluated with 5-fold cross-validation. Regularization is applied to Gradient Tree Boosting and eXtreme Gradient Boosting to avoid overfitting. After tuning the hyperparameters of each boosting model, we use stacking to combine the three optimized models and improve classification accuracy. Results suggest that the optimized stacked boosting model outperforms each single optimized boosting model on evaluation metrics such as precision, recall, and F1 score. The confusion matrix also shows that the stacked model is better at distinguishing sandstone classes.
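As an illustration of how precision, recall, and F1 score relate to a confusion matrix, here is a minimal sketch. The matrix values and the three classes are made up for the example and are not results from the study.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class.

    cm[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted differently
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Made-up confusion matrix for three hypothetical lithology classes.
cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = per_class_metrics(cm)
for label, (p, r, f) in enumerate(metrics):
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Off-diagonal cells show which classes are confused with each other, which is the kind of evidence the abstract refers to when comparing sandstone classes.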


2020 ◽  
Author(s):  
Yongxian Fan ◽  
Wanru Wang ◽  
Qingqi Zhu

Abstract A terminator is a DNA sequence that gives RNA polymerase the transcriptional termination signal. Identifying terminators correctly can improve genome annotation and, more importantly, has considerable application value in disease diagnosis and therapy. However, accurate prediction methods are lacking and urgently needed. Therefore, we proposed a terminator prediction method, “iterb-PPse”, which incorporates 47 nucleotide properties into PseKNC-I and PseKNC-II and uses Extreme Gradient Boosting to predict terminators in Escherichia coli and Bacillus subtilis. Combining these with the preceding methods, we employed three new feature extraction methods (K-pwm, Base-content, and Nucleotidepro) to formulate raw samples. A two-step method was applied to select features. When identifying terminators based on the optimized features, we compared five single models as well as 16 ensemble models. As a result, the accuracy of our method on the benchmark dataset reached 99.88%, higher than that of the existing state-of-the-art predictor iTerm-PseKNC, in a five-fold cross-validation test repeated 100 times. Its prediction accuracy on two independent datasets reached 94.24% and 99.45%, respectively. For the convenience of users, software with the same name was developed on the basis of “iterb-PPse”. The open software and source code of “iterb-PPse” are available at https://github.com/Sarahyouzi/iterb-PPse.


Author(s):  
William Stive Fajardo-Moreno ◽  
Rubén Dario Acosta Velásquez ◽  
Ivan Dario Castaño Pérez ◽  
Leonardo Espinosa-Leal

In this chapter, the results concerning the modeling of companies' disappearance from Bogota's market using machine learning methods are presented. The authors use the available information from Bogota's Chamber of Commerce, where the companies are registered yearly. The dataset comprises the years 2017 to 2020 with almost 3 million registries. In this work, a deep analysis of the different features of the data is presented and explained. Next, four state-of-the-art machine learning models are trained for comparison: logistic regression (LR), extreme learning machine (ELM), random forest (RF), and extreme gradient boosting (XGBoost), all with five-fold cross-validation and 50 steps in the randomized grid search. All methods showed excellent performance, with an average area under the curve (AUC) of 0.895, and XGBoost was the best overall (0.97). These results are in agreement with the state-of-the-art values in the field and will be of paramount importance to assess companies' stability for Bogota's local economy.
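For readers unfamiliar with the AUC figures quoted above, the following minimal sketch computes the area under the ROC curve from its rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative. The labels and scores are invented for illustration, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 1 = company disappeared, 0 = company still registered.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives context for the reported values.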


Sensors ◽  
2019 ◽  
Vol 19 (20) ◽  
pp. 4383 ◽  
Author(s):  
Alqahtani ◽  
Gumaei ◽  
Mathkour ◽  
Maher Ben Ismail

An intrusion detection system is an essential security tool for protecting the services and infrastructure of wireless sensor networks from unseen and unpredictable attacks. Few machine learning approaches have been proposed for intrusion detection in wireless sensor networks, and those that exist have achieved reasonable results. However, these approaches still need to be more accurate and efficient when faced with imbalanced data in network traffic. In this paper, we propose a new model to detect intrusion attacks based on a genetic algorithm and an extreme gradient boosting (XGBoost) classifier, called the GXGBoost model. The latter is a gradient boosting model designed to improve on the performance of traditional models in detecting minority classes of attacks in the highly imbalanced traffic data of wireless sensor networks. A set of experiments was conducted on the wireless sensor network-detection system (WSN-DS) dataset using holdout and 10-fold cross-validation techniques. The results of the 10-fold cross-validation tests revealed that the proposed approach outperformed state-of-the-art approaches and other ensemble learning classifiers, with high detection rates of 98.2%, 92.9%, 98.9%, and 99.5% for flooding, scheduling, grayhole, and blackhole attacks, respectively, in addition to 99.9% for normal traffic.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3230 ◽  
Author(s):  
Petra Povalej Brzan ◽  
Zoran Obradovic ◽  
Gregor Stiglic

Background Reduction of readmissions after discharge represents an important challenge for many hospitals and has attracted the interest of many researchers in the past few years. Most of the studies in this field focus on building cross-sectional predictive models that aim to predict the occurrence of readmission within 30 days based on information from the current hospitalization. The aim of this study is to demonstrate the gain in predictive performance obtained by including information from historical hospitalization records among morbidly obese patients. Methods The California Statewide inpatient database was used to build regularized logistic regression models for prediction of readmission in morbidly obese patients (n = 18,881). Temporal features were extracted from historical patient hospitalization records in a one-year timeframe. Five different datasets of patients were prepared based on the number of available hospitalizations per patient. Sample size of the five datasets ranged from 4,787 patients with more than five hospitalizations to 20,521 patients with at least two hospitalization records in one year. A 10-fold cross-validation was repeated 100 times to assess the variability of the results. Additionally, random forest and extreme gradient boosting were used to confirm the results. Results Area under the ROC curve increased significantly when including information from up to three historical records on all datasets. The inclusion of more than three historical records was not efficient. Similar results can be observed for the Brier score and PPV. The number of selected predictors corresponded to the complexity of the dataset, ranging from an average of 29.50 selected features on the smallest dataset to 184.96 on the largest dataset, based on 100 repetitions of 10-fold cross-validation.
Discussion The results show positive influence of adding information from historical hospitalization records on predictive performance using all predictive modeling techniques used in this study. We can conclude that it is advantageous to build separate readmission prediction models in subgroups of patients with more hospital admissions by aggregating information from up to three previous hospitalizations.
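The Brier score and positive predictive value (PPV) mentioned in the Results can be computed as in the sketch below. The outcome labels, predicted probabilities, and the 0.5 decision threshold are assumptions made for illustration and do not come from the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def ppv(y_true, y_prob, threshold=0.5):
    """Share of predicted positives (probability >= threshold) that are true positives."""
    predicted_pos = [y for y, p in zip(y_true, y_prob) if p >= threshold]
    return sum(predicted_pos) / len(predicted_pos) if predicted_pos else 0.0


# Invented example: 1 = readmitted within 30 days, 0 = not readmitted.
y_true = [1, 0, 1, 0]
y_prob = [0.8, 0.3, 0.6, 0.7]

print(round(brier_score(y_true, y_prob), 3))
print(round(ppv(y_true, y_prob), 3))
```

A lower Brier score indicates better calibrated probabilities, while a higher PPV means fewer false alarms among the patients flagged for readmission.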


2020 ◽  
Author(s):  
Xizhe Wang ◽  
Lijun Zhang

Abstract Because steam turbine faults occur frequently and cause huge losses, it is important to identify the fault category. A steam turbine clustering fault diagnosis method based on t-distributed stochastic neighbor embedding (t-SNE) and extreme gradient boosting (XGBoost) is proposed. First, the t-SNE algorithm is used to map high-dimensional data to a low-dimensional space, and data clustering is performed in that low-dimensional space. Combined with the fault records of the power plant, the fault data and healthy data in the clustering result are distinguished. Then, the imbalance in the data is handled with the synthetic minority over-sampling technique (SMOTE) to obtain a steam turbine characteristic dataset with fault labels. Finally, XGBoost is used to solve this multiclass classification problem. In the experiment, the method achieved the best performance, with an overall accuracy of 97% and early warning at least two hours in advance. The experimental results show that this method can effectively evaluate the state of power plant equipment and issue fault warnings.
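The oversampling step can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbors. This is a simplified stand-in, not the authors' implementation; the function name `smote_sketch`, the toy points, and the parameter defaults are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical 2-D fault samples (for example, coordinates after dimensionality reduction).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies between two existing minority samples, the new data stay inside the region already occupied by the minority class.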


2021 ◽  
Vol 9 (7) ◽  
pp. e003299
Author(s):  
Rivka R Colen ◽  
Christian Rolfo ◽  
Murat Ak ◽  
Mira Ayoub ◽  
Sara Ahmed ◽  
...  

The need to identify biomarkers to predict immunotherapy response for rare cancers has been long overdue. We aimed to study this in our paper, ‘Radiomics analysis for predicting pembrolizumab response in patients with advanced rare cancers’. In this response to the Letter to the Editor by Cunha et al, we explain and discuss the reasons behind choosing LASSO (Least Absolute Shrinkage and Selection Operator) as the feature selection method and XGBoost (eXtreme Gradient Boosting) with LOOCV (Leave-One-Out Cross-Validation) as the classifier for our radiomics models. We also highlight the care that was taken to avoid overfitting the models. Further, we checked the features for multicollinearity. Additionally, we performed 10-fold cross-validation instead of LOOCV to assess the predictive performance of our radiomics models.
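The relationship between LOOCV and 10-fold cross-validation can be shown with a small index splitter: leave-one-out is the special case in which the number of folds equals the number of samples. This is a minimal sketch with hypothetical sample counts; it does not shuffle the data, which a real analysis usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; k == n_samples gives leave-one-out."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical cohort sizes, chosen only for illustration.
loocv = list(k_fold_indices(5, 5))        # 5 folds, each testing one sample
ten_fold = list(k_fold_indices(25, 10))   # 10 folds of 2 or 3 samples

print(len(loocv), len(ten_fold))
```

With small cohorts, LOOCV trains on nearly all the data in every round, while 10-fold cross-validation holds out larger test sets and needs fewer model fits.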


2014 ◽  
Vol 9 (S 01) ◽  
Author(s):  
MP Ashton ◽  
I Tan ◽  
L Mackin ◽  
C Elso ◽  
E Chu ◽  
...  
