Accurate Sequence-Based Prediction of Deleterious nsSNPs with Multiple Sequence Profiles and Putative Binding Residues

Biomolecules ◽  
2021 ◽  
Vol 11 (9) ◽  
pp. 1337
Author(s):  
Ruiyang Song ◽  
Baixin Cao ◽  
Zhenling Peng ◽  
Christopher J. Oldfield ◽  
Lukasz Kurgan ◽  
...  

Non-synonymous single nucleotide polymorphisms (nsSNPs) may result in pathogenic changes that are associated with human diseases. Accurate prediction of these deleterious nsSNPs is in high demand. The existing predictors of deleterious nsSNPs secure modest levels of predictive performance, leaving room for improvement. We propose a new sequence-based predictor, DMBS, which addresses the need to improve the predictive quality. The design of DMBS relies on the observation that deleterious mutations are likely to occur at highly conserved and functionally important positions in the protein sequence. Correspondingly, we introduce two innovative components. First, we improve the conservation estimates computed from multiple sequence profiles based on two complementary databases and two complementary alignment algorithms. Second, we utilize putative annotations of functional/binding residues produced by two state-of-the-art sequence-based methods. These inputs are processed by a random forest model that provides favorable predictive performance when empirically compared against five other machine-learning algorithms. Empirical results on four benchmark datasets reveal that DMBS achieves AUC > 0.94, outperforming current methods, including protein structure-based approaches. In particular, DMBS secures AUC = 0.97 for the SNPdbe and ExoVar datasets, compared to the AUC = 0.70 and 0.88, respectively, obtained by the best available methods. Further tests on the independent HumVar dataset show that our method significantly outperforms the state-of-the-art method SNPdryad. We conclude that DMBS provides accurate predictions that can effectively guide wet-lab experiments in a high-throughput manner.
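A minimal sketch of the kind of pipeline this abstract describes, assuming hypothetical per-mutation features (two conservation estimates plus two putative binding/functional-residue scores) fed to a random forest and scored by AUC; it is not the authors' actual DMBS implementation:

```python
# Hedged sketch: conservation scores from two profile sources and two putative
# binding/functional-residue propensities are concatenated per mutation and
# classified with a random forest. Shapes and feature meanings are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_mutations = 1000

X = np.column_stack([
    rng.random(n_mutations),   # conservation estimate, profile/database 1
    rng.random(n_mutations),   # conservation estimate, profile/database 2
    rng.random(n_mutations),   # putative binding-residue propensity, method 1
    rng.random(n_mutations),   # putative functional-residue propensity, method 2
])
y = rng.integers(0, 2, n_mutations)  # toy labels: 1 = deleterious, 0 = neutral

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```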

2021 ◽  
Author(s):  
Atiq Rehman ◽  
Samir Brahim Belhaouari

Abstract Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two of the proposed techniques use only a one-dimensional distance vector to detect the outliers, so irrespective of the data's dimensionality, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found to outperform the state-of-the-art methods when tested on several benchmark datasets.
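A minimal sketch of the general idea of collapsing high-dimensional data to a single distance vector for unsupervised outlier detection; the centroid distance and robust threshold used here are illustrative assumptions, not the authors' exact techniques:

```python
# Hedged sketch: reduce the data to a one-dimensional vector of distances to
# the centroid, then flag points whose distance is extreme under a robust
# (median + k*MAD) threshold, regardless of the original dimensionality.
import numpy as np

def detect_outliers(X, k=3.0):
    centroid = X.mean(axis=0)
    dist = np.linalg.norm(X - centroid, axis=1)   # 1-D regardless of X's dimensions
    med = np.median(dist)
    mad = np.median(np.abs(dist - med)) + 1e-12   # robust spread estimate
    return dist > med + k * mad                   # boolean outlier mask

# Toy usage: a 2-D Gaussian cluster with a few injected anomalies.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(8, 0.5, (5, 2))])
print(detect_outliers(X).sum(), "points flagged as outliers")
```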


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 15-16
Author(s):  
Pablo A S Fonseca ◽  
Massimo Tornatore ◽  
Angela Cánovas

Abstract Reduced fertility is one of the main causes of economic losses in dairy farms. The cost of a stillbirth is estimated at US$938 per case in Holstein herds. Machine learning (ML) is gaining popularity in the livestock sector as a means of identifying hidden patterns and due to its potential to address dimensionality problems. Here we investigate the application of ML algorithms for the prediction of cows with higher stillbirth susceptibility in two scenarios: cows with >25% and >33.33% stillbirths among their birth records. These thresholds correspond to percentiles 75 (still_75) and 90 (still_90), respectively. A total of 10,570 cows and 50,541 birth records were collected to perform a haplotype-based genome-wide association study. Five hundred significant pseudo-single nucleotide polymorphisms (pseudo-SNPs) (false discovery rate < 0.05) were used as input features for the ML-based predictions of whether a cow is in the top 75th or 90th percentile. Table 1 shows the classification performance of the investigated ML and linear models. The ML models outperformed the linear models for both thresholds. In general, still_75 showed higher F1 values than still_90, suggesting a lower misclassification ratio when a less stringent threshold is used. The accuracy of the models in our study is higher than ML-based prediction accuracies reported in other breeds, e.g. the accuracies of 0.46 and 0.67 achieved using SNPs for body weight in Brahman and fertility traits in Nellore, respectively. The XGBoost algorithm shows the highest balanced accuracy (BA; 0.625), F1-score (0.588) and area under the curve (AUC; 0.688), suggesting that XGBoost achieves the highest predictive performance and the lowest difference in misclassification ratio between classes. ML applied over haplotype libraries is an interesting approach for detecting animals with higher susceptibility to stillbirths, owing to its higher predictive accuracy and relatively lower misclassification ratio.
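A minimal sketch of the kind of XGBoost classification reported above, assuming hypothetical 0/1/2-coded pseudo-SNP genotypes and binary percentile-class labels; the data shapes and hyperparameters are illustrative, not the study's actual setup:

```python
# Hedged sketch: pseudo-SNP genotype features predict membership in a
# high-stillbirth percentile class, evaluated with BA, F1 and AUC.
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(2000, 500)).astype(float)   # 500 pseudo-SNP genotypes
y = rng.integers(0, 2, size=2000)                        # 1 = above percentile threshold

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]
print("BA :", balanced_accuracy_score(y_te, pred))
print("F1 :", f1_score(y_te, pred))
print("AUC:", roc_auc_score(y_te, proba))
```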


2021 ◽  
Vol 11 (7) ◽  
pp. 3227
Author(s):  
Lkhagvadorj Munkhdalai ◽  
Keun Ho Ryu ◽  
Oyun-Erdene Namsrai ◽  
Nipon Theera-Umpon

Credit scoring is the process of determining whether a borrower will be successful or unsuccessful in repaying a loan, using the borrower's qualitative and quantitative characteristics. In recent years, machine learning algorithms have been widely studied in the development of credit scoring models. Although efficiently classifying good and bad borrowers is a core objective of a credit scoring model, there is still a need for models that can explain the relationship between input and output. In this work, we propose a novel partially interpretable adaptive softmax (PIA-Soft) regression model to achieve both state-of-the-art predictive performance and a marginal interpretation of the relationship between input and output. We augment softmax regression with neural networks to make it adaptive to each borrower. Our PIA-Soft model consists of two main components: linear (softmax regression) and non-linear (neural network). The linear part explains the fundamental relationship between input and output variables. The non-linear part serves to improve prediction performance by identifying non-linear relationships among features for each borrower. Experimental results on public benchmark datasets show that the proposed model not only outperforms the machine learning baselines but also provides explanations that relate logically to the real world.
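A minimal sketch of a PIA-Soft-style architecture as described above: a linear softmax component whose logits are adjusted by a small neural network, so the linear weights remain interpretable while the non-linear part lifts accuracy. Layer sizes and the exact combination rule are assumptions, not the authors' implementation:

```python
# Hedged sketch: final logits = interpretable linear score + NN correction.
import torch
import torch.nn as nn

class PIASoftSketch(nn.Module):
    def __init__(self, n_features, n_classes, hidden=32):
        super().__init__()
        self.linear = nn.Linear(n_features, n_classes)   # interpretable part
        self.mlp = nn.Sequential(                        # non-linear adjustment
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        # Borrower-specific correction added to the linear (softmax) logits.
        return self.linear(x) + self.mlp(x)

# Toy usage: 2-class credit scoring with 10 borrower features.
model = PIASoftSketch(n_features=10, n_classes=2)
x = torch.randn(4, 10)
print(torch.softmax(model(x), dim=1))
```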


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Quang-Vinh Dang

Purpose: This study aims to explain, in a human-understandable way, the state-of-the-art machine learning models used in the intrusion detection problem, and to study the relationship between the explainability and the performance of the models.
Design/methodology/approach: The authors study a recent intrusion dataset collected from real-world scenarios and use state-of-the-art machine learning algorithms to detect the intrusion. The authors apply several novel techniques to explain the models, then manually evaluate the explanations. The authors then compare the performance of the model before and after explainability-based feature selection.
Findings: The authors confirm their hypothesis and claim that, by enforcing explainability, the model becomes more robust, requires less computational power and achieves better predictive performance.
Originality/value: The authors draw their conclusions based on their own research and experimental work.
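A minimal sketch of explainability-based feature selection of the kind compared above. The abstract does not name the explanation technique; permutation importance is used here as one common proxy, and the dataset and thresholds are illustrative:

```python
# Hedged sketch: train on all features, rank features by permutation
# importance (an explanation proxy), retrain on the top features, compare AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=40, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

full = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("AUC, all features:", roc_auc_score(y_te, full.predict_proba(X_te)[:, 1]))

# Keep only the features the explanation deems important, then retrain.
imp = permutation_importance(full, X_te, y_te, n_repeats=5, random_state=0)
top = np.argsort(imp.importances_mean)[-10:]
reduced = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr[:, top], y_tr)
print("AUC, top features:", roc_auc_score(y_te, reduced.predict_proba(X_te[:, top])[:, 1]))
```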


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i735-i744
Author(s):  
Fuhao Zhang ◽  
Wenbo Shi ◽  
Jian Zhang ◽  
Min Zeng ◽  
Min Li ◽  
...  

Abstract
Motivation: Knowledge of protein-binding residues (PBRs) improves our understanding of protein–protein interactions, contributes to the prediction of protein functions and facilitates protein–protein docking calculations. While many sequence-based predictors of PBRs have been published, they offer modest levels of predictive performance, and most of them cross-predict residues that interact with other partners. One unexplored option to improve the predictive quality is to design consensus predictors that combine results produced by multiple methods.
Results: We empirically investigate the predictive performance of a representative set of nine predictors of PBRs. We report substantial differences in predictive quality when these methods are used to predict individual proteins, which contrasts with the dataset-level benchmarks that are currently used to assess and compare these methods. Our analysis provides new insights into the cross-prediction concern, dissects complementarity between predictors and demonstrates that the predictive performance of the top methods depends on unique characteristics of the input protein sequence. Using these insights, we developed PROBselect, a first-of-its-kind consensus predictor of PBRs. Our design is based on dynamic predictor selection at the protein level, where the selection relies on regression-based models that accurately estimate the predictive performance of the selected predictors directly from the sequence. Empirical assessment using a low-similarity test dataset shows that PROBselect provides significantly improved predictive quality when compared with the current predictors and conventional consensuses that combine residue-level predictions. Moreover, PROBselect informs the users about the expected predictive quality for the prediction generated from a given input protein.
Availability and implementation: PROBselect is available at http://bioinformatics.csu.edu.cn/PROBselect/home/index.
Supplementary information: Supplementary data are available at Bioinformatics online.
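A minimal sketch of the dynamic predictor-selection idea described above: one regression model per base predictor estimates its expected per-protein quality from sequence-derived features, and the predictor with the highest estimate is used for that protein. The features, base predictors and quality scores here are hypothetical placeholders:

```python
# Hedged sketch of protein-level dynamic selection among nine base predictors.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_proteins, n_seq_features, n_predictors = 300, 12, 9

seq_feats = rng.random((n_proteins, n_seq_features))    # per-protein sequence features
observed_auc = rng.random((n_proteins, n_predictors))   # per-predictor quality (training data)

# One regressor per base predictor: sequence features -> expected quality.
selectors = [Ridge().fit(seq_feats, observed_auc[:, j]) for j in range(n_predictors)]

def select_predictor(protein_features):
    estimates = [m.predict(protein_features[None, :])[0] for m in selectors]
    return int(np.argmax(estimates))   # index of the predictor expected to do best

print("Selected predictor:", select_predictor(rng.random(n_seq_features)))
```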


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Atiq ur Rehman ◽  
Samir Brahim Belhaouari

Abstract Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. In order to detect the anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods considering data compactness and other properties. The newly proposed ideas are found efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two of the proposed techniques transform the data into a one-dimensional distance space to detect the outliers, so irrespective of the data's dimensionality, the techniques remain computationally inexpensive and feasible. Comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes are found to outperform the state-of-the-art methods when tested on several benchmark datasets.


2020 ◽  
Vol 17 (6) ◽  
pp. 847-856
Author(s):  
Shengbing Ren ◽  
Xiang Zhang

The problem of synthesizing adequate inductive invariants lies at the heart of automated software verification. State-of-the-art machine learning algorithms for synthesizing invariants have gradually shown excellent performance. However, synthesizing disjunctive invariants remains a difficult task. In this paper, we propose a method, k++ Support Vector Machine (k++SVM), which integrates k-means++ and SVM to synthesize conjunctive and disjunctive invariants. First, given a program, we execute it to collect program states. Next, k++SVM adopts k-means++ to cluster the positive samples and then applies SVM to distinguish each positive-sample cluster from all negative samples, synthesizing the candidate invariants. Finally, a set of theories founded on Hoare logic is used to check whether the candidate invariants are true invariants. If the candidate invariants fail the check, we sample more states and repeat the algorithm. The experimental results show that k++SVM is comparable with the algorithms for Intersection Of Half-spaces (IOH) and more efficient than the Interproc tool. Furthermore, it is shown that our method can synthesize conjunctive and disjunctive invariants automatically.
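A minimal sketch of the clustering-then-SVM idea, under the assumption that positive and negative program states are available as numeric vectors; it is not the authors' k++SVM tool:

```python
# Hedged sketch: cluster positive program states with k-means++, train one
# linear SVM per cluster against all negative states; the disjunction of the
# learned half-space predicates is a candidate (possibly disjunctive) invariant.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy positive states in two separated regions, negatives in between.
pos = np.vstack([rng.normal([-5, 0], 0.5, (50, 2)), rng.normal([5, 0], 0.5, (50, 2))])
neg = rng.normal([0, 0], 0.5, (80, 2))

clusters = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit_predict(pos)

candidate_predicates = []
for c in np.unique(clusters):
    X = np.vstack([pos[clusters == c], neg])
    y = np.r_[np.ones((clusters == c).sum()), np.zeros(len(neg))]
    svm = SVC(kernel="linear").fit(X, y)
    # Each SVM yields a linear predicate w.x + b >= 0 covering one positive cluster.
    candidate_predicates.append((svm.coef_[0], svm.intercept_[0]))

def candidate_invariant(x):
    # Disjunction over the per-cluster half-space predicates.
    return any(w @ x + b >= 0 for w, b in candidate_predicates)

print(candidate_invariant(np.array([-5.0, 0.0])), candidate_invariant(np.array([0.0, 0.0])))
```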


2021 ◽  
Vol 16 (1) ◽  
pp. 1-23
Author(s):  
Min-Ling Zhang ◽  
Jun-Peng Fang ◽  
Yi-Bo Wang

In multi-label classification, the task is to induce predictive models that can assign a set of relevant labels to an unseen instance. The strategy of label-specific features has been widely employed in learning from multi-label examples, where the classification model for predicting the relevancy of each class label is induced from its tailored features rather than the original features. Existing approaches work by generating a group of tailored features for each class label independently, so label correlations are not fully considered in the label-specific feature generation process. In this article, we extend the existing strategy by proposing a simple yet effective approach based on BiLabel-specific features. Specifically, a group of tailored features is generated for a pair of class labels with heuristic prototype selection and embedding. Thereafter, the predictions of classifiers induced from BiLabel-specific features are ensembled to determine the relevancy of each class label for the unseen instance. To thoroughly evaluate the BiLabel-specific features strategy, extensive experiments are conducted over a total of 35 benchmark datasets. Comparative studies against state-of-the-art label-specific features techniques clearly validate the superiority of utilizing BiLabel-specific features to yield stronger generalization performance for multi-label classification.
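A minimal sketch of the pairwise (BiLabel) tailored-feature idea for a single label pair, assuming k-means centers as the prototypes and distances to them as the embedded features; the prototype selection and ensembling rules here are simplified assumptions, not the authors' exact method:

```python
# Hedged sketch: tailored features for a label pair (j, k) are distances to
# prototypes chosen per joint assignment of the two labels; a classifier
# trained on them casts one vote per label in the final ensemble.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.random((400, 20))            # instances
Y = rng.integers(0, 2, (400, 5))     # 5-label multi-label targets

def bilabel_features(X, Y, j, k, per_class=3):
    joint = Y[:, j] * 2 + Y[:, k]
    prototypes = []
    for c in np.unique(joint):
        Xc = X[joint == c]
        n = min(per_class, len(Xc))
        prototypes.append(KMeans(n_clusters=n, n_init=10, random_state=0).fit(Xc).cluster_centers_)
    return pairwise_distances(X, np.vstack(prototypes)), joint

j, k = 0, 1
Z, joint = bilabel_features(X, Y, j, k)
clf = LogisticRegression(max_iter=1000).fit(Z, joint)   # 4-class joint-assignment model

# Votes from this pair's classifier (one member of the ensemble) for 5 instances.
pred_joint = clf.predict(Z[:5])
print("label", j, "votes:", pred_joint // 2, "| label", k, "votes:", pred_joint % 2)
```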

