Ranking near-native candidate protein structures via random forest classification

2019 ◽  
Vol 20 (S25) ◽  
Author(s):  
Hongjie Wu ◽  
Hongmei Huang ◽  
Weizhong Lu ◽  
Qiming Fu ◽  
Yijie Ding ◽  
...  

Abstract Background: In ab initio protein-structure prediction, a large set of structural decoys is often generated, from which the best five or three candidates must be selected. The central structures of the clusters with the largest number of neighbors are frequently regarded as the near-native protein structures with the lowest free energy; however, limitations in clustering methods and three-dimensional structural-distance assessments make it difficult to identify the exact order of the best five or three near-native candidate structures. Results: To address this issue, we propose a method that re-ranks the candidate structures via random forest classification, using intra- and inter-cluster features derived from the clustering results. Comparative analysis indicated that our method identified the order of the candidate structures better than the current methods SPICKER, Calibur, and Durandal. The first model identified by our method was closer to the native structure in 12 of 43 cases versus four for SPICKER, and was the same as the native structure in up to 27 of 43 cases versus 14 for Calibur and in up to eight of 43 cases versus two for Durandal. Conclusions: In this study, we presented an improved method based on random forest classification that transforms the problem of re-ranking the candidate structures into a binary classification problem. Our results indicate that this method re-ranks candidate structures more effectively than the other methods compared.
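As a rough illustration of the re-ranking idea in the abstract above, the following sketch computes a few intra- and inter-cluster features from a pairwise decoy distance matrix and then orders the cluster centres by a classifier-style score. The feature names (`size_frac`, `mean_intra`, `mean_inter`), the toy distance matrix, and the `toy_score` function are all assumptions made for illustration; in particular, `toy_score` is a hand-written stand-in for the probability a trained random forest would output, not the authors' model or feature set.

```python
from statistics import mean

def cluster_features(dist, clusters):
    """Build one feature dict per cluster.

    dist: symmetric matrix of pairwise decoy distances (e.g. RMSD).
    clusters: list of (centre_index, member_indices) pairs; members
              include the centre itself.
    """
    total = sum(len(members) for _, members in clusters)
    feats = []
    for centre, members in clusters:
        intra = [dist[centre][m] for m in members if m != centre]
        inter = [dist[centre][c] for c, _ in clusters if c != centre]
        feats.append({
            "centre": centre,
            # Intra-cluster features: relative size and compactness.
            "size_frac": len(members) / total,
            "mean_intra": mean(intra) if intra else 0.0,
            # Inter-cluster feature: separation from the other centres.
            "mean_inter": mean(inter) if inter else 0.0,
        })
    return feats

def rerank(feats, prob_near_native):
    """Order candidates by a binary classifier's 'near-native' score."""
    return sorted(feats, key=prob_near_native, reverse=True)

# Toy data (hypothetical): five decoys grouped into two clusters.
dist = [
    [0.0, 1.0, 2.0, 5.0, 6.0],
    [1.0, 0.0, 1.5, 5.0, 6.0],
    [2.0, 1.5, 0.0, 4.0, 5.0],
    [5.0, 5.0, 4.0, 0.0, 1.0],
    [6.0, 6.0, 5.0, 1.0, 0.0],
]
clusters = [(0, [0, 1, 2]), (3, [3, 4])]

def toy_score(f):
    # Stand-in for a trained classifier's predicted probability (assumption).
    return f["size_frac"] - 0.1 * f["mean_intra"]

ranked = rerank(cluster_features(dist, clusters), toy_score)
print([f["centre"] for f in ranked])
```

In a real pipeline the scoring function would be replaced by the positive-class probability of a classifier trained on decoys with known distance to the native structure.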

2021 ◽  
Vol 12 ◽  
Author(s):  
Yuan Zhao ◽  
Zhao-Yu Fang ◽  
Cui-Xiang Lin ◽  
Chao Deng ◽  
Yun-Pei Xu ◽  
...  

In recent years, single-cell RNA-seq (scRNA-seq) has become increasingly popular in fields such as biology and medical research. Analyzing scRNA-seq data can reveal complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods for analyzing scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy, so it is important to select genes with cell type specificity. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells using the genes selected by each method. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.
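The general idea of permutation-based gene scoring can be sketched as follows: shuffle one gene's expression values across cells and measure how much a classifier's accuracy drops. To keep the example dependency-free, a nearest-centroid classifier stands in for the random forest, accuracy is measured on the training cells, and the expression matrix is invented toy data. These are all simplifying assumptions, and RFCell's actual permutation scheme may differ.

```python
import random

def fit_centroids(X, y):
    """Mean expression profile per cell-type label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )

def accuracy(centroids, X, y):
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(X, y, n_repeats=20, seed=0):
    """Mean accuracy drop when each gene (column) is shuffled across cells."""
    rng = random.Random(seed)
    model = fit_centroids(X, y)
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy expression matrix (cells x genes): gene 0 separates the two
# cell types, gene 1 is uninformative noise.
X = [[9.0, 1.0], [8.0, 2.0], [9.5, 1.5], [8.5, 2.5],
     [1.0, 1.2], [0.5, 2.2], [1.5, 1.4], [0.8, 2.4]]
y = ["A"] * 4 + ["B"] * 4

imp = permutation_importance(X, y)
selected = [j for j, score in enumerate(imp) if score > 0]
print(selected)
```

Genes whose permutation leaves accuracy unchanged carry little cell-type information under this model and would be dropped before clustering.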


2020 ◽  
Vol 492 (4) ◽  
pp. 5075-5088 ◽  
Author(s):  
R M Arnason ◽  
P Barmby ◽  
N Vulic

ABSTRACT Identifying X-ray binary (XRB) candidates in nearby galaxies requires distinguishing them from possible contaminants including foreground stars and background active galactic nuclei. This work investigates the use of supervised machine learning algorithms to identify high-probability XRB candidates. Using a catalogue of 943 Chandra X-ray sources in the Andromeda galaxy, we trained and tested several classification algorithms using the X-ray properties of 163 sources with previously known types. Amongst the algorithms tested, we find that random forest classifiers give the best performance and work better in a binary classification (XRB/non-XRB) context than with multiple classes. Evaluating our method by comparing with classifications from visible-light and hard X-ray observations as part of the Panchromatic Hubble Andromeda Treasury, we find compatibility at the 90 per cent level, although we caution that the number of sources in common is rather small. The estimated probability that an object is an XRB agrees well between the random forest binary and multiclass approaches and we find that the classifications with the highest confidence are in the XRB class. The most discriminating X-ray bands for classification are the 1.7–2.8, 0.5–1.0, 2.0–4.0, and 2.0–7.0 keV photon flux ratios. Of the 780 unclassified sources in the Andromeda catalogue, we identify 16 new high-probability XRB candidates and tabulate their properties for follow-up.


2020 ◽  
Author(s):  
Milan Voršilák ◽  
Michal Kolář ◽  
Ivan Čmelo ◽  
Daniel Svozil

Abstract SYBA (SYnthetic Bayesian Accessibility) is a fragment-based method for the rapid classification of organic compounds as easy- (ES) or hard-to-synthesize (HS). SYBA is based on the Bayesian analysis of the frequency of molecular fragments in the database of ES and HS molecules. It was trained on ES molecules available in the ZINC15 database and on HS molecules generated by the Nonpher methodology. SYBA was compared with a random forest, which was used as a baseline method, as well as with two other methods for synthetic accessibility assessment: SAScore and SCScore. When used with their suggested thresholds, SYBA improves over random forest classification, albeit marginally, and outperforms SAScore and SCScore. However, with thresholds optimized by the analysis of ROC curves, SAScore improves considerably and yields results similar to those of SYBA. Because SYBA is based merely on fragment contributions, it can be used to analyze how individual molecular parts contribute to a compound's synthetic accessibility. Though SYBA was developed to quickly assess compound synthetic accessibility, its underlying Bayesian framework is a general approach that can be applied to any binary classification problem. Therefore, SYBA can be easily re-trained to classify compounds by other physico-chemical or biological properties. SYBA is publicly available at https://github.com/lich-uct/syba under the GNU General Public License.
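A fragment-contribution score of the kind described above can be sketched as a sum of per-fragment log-likelihood ratios. The version below is a simplified illustration, not SYBA's actual implementation or API: it scores only fragments that are present, uses Laplace smoothing with an assumed `alpha`, treats a positive total as "easy to synthesize", and represents fragments by made-up string labels rather than real fragment fingerprints.

```python
import math
from collections import Counter

def fit_fragment_scores(es_mols, hs_mols, alpha=1.0):
    """Log-ratio score per fragment from ES and HS training molecules.

    es_mols / hs_mols: lists of fragment collections, one per molecule.
    alpha: Laplace smoothing constant (assumed value).
    """
    es_counts = Counter(f for mol in es_mols for f in set(mol))
    hs_counts = Counter(f for mol in hs_mols for f in set(mol))
    scores = {}
    for frag in set(es_counts) | set(hs_counts):
        # Smoothed probability that a molecule of each class contains frag.
        p_es = (es_counts[frag] + alpha) / (len(es_mols) + 2 * alpha)
        p_hs = (hs_counts[frag] + alpha) / (len(hs_mols) + 2 * alpha)
        scores[frag] = math.log(p_es / p_hs)
    return scores

def score_molecule(fragments, scores):
    """Return (total score, per-fragment contributions).

    Fragments never seen in training contribute 0.
    """
    contributions = {f: scores.get(f, 0.0) for f in set(fragments)}
    return sum(contributions.values()), contributions

# Hypothetical training sets; the fragment labels are placeholders.
es_mols = [{"benzene", "acid"}, {"benzene", "amine"}, {"acid", "amine"}]
hs_mols = [{"spiro", "bridged"}, {"spiro", "amine"}, {"bridged", "benzene"}]

scores = fit_fragment_scores(es_mols, hs_mols)
total_easy, parts_easy = score_molecule({"benzene", "acid"}, scores)
total_hard, parts_hard = score_molecule({"spiro", "bridged"}, scores)
print(total_easy > 0, total_hard < 0)
```

Because the total is a plain sum, the `contributions` dictionary shows directly which fragments push a compound toward the ES or HS class, and retraining on different labelled sets repurposes the same scorer for another binary property.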


2016 ◽  
Vol 146 ◽  
pp. 370-385 ◽  
Author(s):  
Adam Hedberg-Buenz ◽  
Mark A. Christopher ◽  
Carly J. Lewis ◽  
Kimberly A. Fernandes ◽  
Laura M. Dutca ◽  
...  
