Identification of Disease-Specific Single Amino Acid Polymorphisms Using a Simple Random Forest at Protein-level

2021 ◽  
Vol 16 ◽  
Author(s):  
Jian He ◽  
Rongao Yuan ◽  
Lei Xu ◽  
Yanzhi Guo ◽  
Menglong Li

Background: The number of human genetic variants deposited in publicly available databases has been increasing exponentially. Among these variants, non-synonymous single nucleotide polymorphisms (nsSNPs), also known as single amino acid polymorphisms (SAPs), have been shown to correlate strongly with phenotypic variation in traits and diseases. Objective: However, the detailed mechanisms governing the disease association of SAPs remain unclear. Further investigation of new attributes and improved prediction are therefore increasingly urgent, since a large number of unknown disease-related SAPs remain to be investigated. Method: Based on the principle of random forest (RF), we first constructed a new, effective prediction model for SAPs associated with a particular disease from protein sequences. Four common sequence-signature extraction methods were applied separately to select the optimal features. SAP peptide lengths from 12 to 202 were then optimized. Results: The optimal models achieve accuracy higher than 90% and area under the curve (AUC) above 0.9 on all 11 external testing datasets. Finally, the good performance on an independent test set, with accuracy higher than 95%, demonstrates the superiority of our method. Conclusion: In this paper, based on random forest (RF), we constructed 11 disease-association prediction models for SAPs at the protein sequence level. All models yield prediction accuracy higher than 90% and AUC above 0.9. Because our method uses only protein sequence information, it is more universal than methods that depend on additional information or predictions about the proteins.
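The peptide-length optimization described above can be pictured with a small sketch. The abstract does not say how the SAP peptides are positioned around the variant, so the symmetric window, the `flank` parameter, and the `X` padding below are all assumptions made for illustration, not the authors' procedure.

```python
def sap_window(sequence: str, position: int, flank: int, pad: str = "X") -> str:
    """Return the peptide centred on a 1-based SAP position.

    Assumption: a symmetric window with `flank` residues on each side,
    padded with `pad` where the window runs past either end of the protein.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    idx = position - 1
    left = sequence[max(0, idx - flank):idx]
    right = sequence[idx + 1:idx + 1 + flank]
    # Pad so that every window has the same length, 2 * flank + 1.
    left = pad * (flank - len(left)) + left
    right = right + pad * (flank - len(right))
    return left + sequence[idx] + right


def apply_sap(sequence: str, position: int, new_residue: str) -> str:
    """Return the sequence with the residue at a 1-based position substituted."""
    idx = position - 1
    return sequence[:idx] + new_residue + sequence[idx + 1:]
```

Under these assumptions, scanning `flank` over a range of values and retraining the classifier on each set of windows would be one way to search for a peptide length that works best.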

Blood ◽  
1991 ◽  
Vol 78 (3) ◽  
pp. 681-687 ◽  
Author(s):  
A Goldberger ◽  
M Kolodziej ◽  
M Poncz ◽  
JS Bennett ◽  
PJ Newman

Abstract The subunits that comprise the platelet-specific integrin alpha IIb beta 3 are polymorphic in nature, with several allelic forms present in the human gene pool. Minor changes in the secondary and tertiary structures of platelet membrane glycoproteins (GP) IIb and IIIa encoded by these alleles can result in an alloimmune reaction after transfusion or during pregnancy. To better understand the molecular structure of the PlA alloantigen system, located on GPIIIa, and the Bak alloantigen on GPIIb, we used a heterologous mammalian expression system to express these integrin subunits in their known polymorphic forms. An expression vector containing the PlA1 form of a GPIIIa cDNA, which encodes a leucine at amino acid 33 (Leu33), was modified to express the PlA2-associated form encoding a proline at amino acid 33 (Pro33). Similarly, a Baka GPIIb cDNA expressing an isoleucine at amino acid 843 (Ile843) was modified to express the Bakb form containing a serine at the same position (Ser843). Transfection of these vectors into COS cells resulted in the synthesis of GPIIb and GPIIIa molecules that were identical in size to those present in platelet lysates. Immunoprecipitation of the GPIIIa-transfected COS lysates with PlA-specific alloantisera indicated that the Leu33 form was recognized only by anti-PlA1 sera while the Pro33 form was bound only by anti-PlA2 sera, showing that single amino acid polymorphisms are necessary and sufficient to direct the formation of the PlA1 and PlA2 alloepitopes. Similar experiments with Bak allele-specific expression vectors indicated that while the amino acid polymorphism (Ile843 ↔ Ser843) was necessary, posttranslational processing of pro-IIb was required for efficient exposure of both the Baka and Bakb alloepitopes.


2020 ◽  
Author(s):  
Victoria Garcia-Montemayor ◽  
Alejandro Martin-Malo ◽  
Carlo Barbieri ◽  
Francesco Bellocchio ◽  
Sagrario Soriano ◽  
...  

Abstract Background Besides the classic logistic regression analysis, non-parametric methods based on machine learning techniques such as random forest are presently used to generate predictive models. The aim of this study was to evaluate random forest mortality prediction models in haemodialysis patients. Methods Data were acquired from incident haemodialysis patients between 1995 and 2015. Prediction of mortality at 6 months, 1 year and 2 years of haemodialysis was calculated using random forest and the accuracy was compared with logistic regression. Baseline data were constructed with the information obtained during the initial period of regular haemodialysis. Aiming to increase accuracy concerning baseline information of each patient, the period of time used to collect data was set at 30, 60 and 90 days after the first haemodialysis session. Results There were 1571 incident haemodialysis patients included. The mean age was 62.3 years and the average Charlson comorbidity index was 5.99. The mortality prediction models obtained by random forest appear to be adequate in terms of accuracy [area under the curve (AUC) 0.68–0.73] and superior to logistic regression models (ΔAUC 0.007–0.046). Results indicate that both random forest and logistic regression develop mortality prediction models using different variables. Conclusions Random forest is an adequate method, and superior to logistic regression, to generate mortality prediction models in haemodialysis patients.
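The AUC comparison reported above (ΔAUC between random forest and logistic regression) can be illustrated with a minimal sketch. It computes AUC as the fraction of positive/negative pairs that a model ranks correctly, which is a standard equivalent definition. The function names and the toy scores are hypothetical and do not come from the study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Counts the share of (positive, negative) pairs in which the positive
    case receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def delta_auc(labels, scores_a, scores_b):
    """Difference in AUC between model A and model B on the same cases."""
    return auc(labels, scores_a) - auc(labels, scores_b)


# Toy example: 1 = died within the horizon, 0 = survived (made-up values).
labels = [0, 0, 1, 1]
model_a = [0.1, 0.2, 0.3, 0.9]   # hypothetical random forest probabilities
model_b = [0.1, 0.4, 0.35, 0.8]  # hypothetical logistic regression probabilities
print(auc(labels, model_a), auc(labels, model_b), delta_auc(labels, model_a, model_b))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a cohort of this size but would normally be replaced by a rank-based formula or a library routine.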


2014 ◽  
Vol 16 (suppl 5) ◽  
pp. v202-v202
Author(s):  
C. L. Nilsson ◽  
A. Vegvari ◽  
E. Mostovenko ◽  
C. F. Lichti ◽  
D. Fenyo ◽  
...  

2007 ◽  
Vol 05 (06) ◽  
pp. 1215-1231 ◽  
Author(s):  
YUM LINA YIP ◽  
NATHALIE LACHENAL ◽  
VIOLAINE PILLET ◽  
ANNE-LISE VEUTHEY

The UniProt/Swiss-Prot Knowledgebase records about 30,500 variants in 5,664 proteins (Release 52.2). Most of these variants are manually curated single amino acid polymorphisms (SAPs) with references to the literature. In order to keep the list of published documents related to SAPs up to date, an automatic information retrieval method is developed to recover texts mentioning SAPs. The method is based on the use of regular expressions (patterns) and rules for the detection and validation of mutations. When evaluated using a corpus of 9,820 PubMed references, the precision of the retrieval was determined to be 89.5% over all variants. It was also found that the use of nonstandard mutation nomenclature and sequence positional correction is necessary to retrieve a significant number of relevant articles. The method was applied to the 5,664 proteins with variants. This was performed by first submitting a PubMed query to retrieve articles using gene or protein names and a list of mutation-related keywords; the SAP detection procedure was then used to recover relevant documents. The method was found to be efficient in retrieving new references on known polymorphisms. New references on known SAPs will be rendered accessible to the public via the Swiss-Prot variant pages.
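As a rough illustration of pattern-based detection followed by validation, the sketch below matches two common mutation notations (three-letter, such as Leu33Pro or p.Leu33Pro, and one-letter, such as L33P) and then checks the wild-type residue against a protein sequence, optionally trying positional offsets. The patterns, function names, and offset list are assumptions for illustration only. The actual expressions and rules used in the described method are not given in the abstract.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
_AA3 = "|".join(THREE_TO_ONE)
_AA1 = "".join(sorted(THREE_TO_ONE.values()))

# Illustrative pattern: optional "p." prefix, then wild type, position, mutant.
MUTATION = re.compile(
    rf"\b(?:p\.)?(?:(?P<wt3>{_AA3})(?P<pos3>\d+)(?P<mt3>{_AA3})"
    rf"|(?P<wt1>[{_AA1}])(?P<pos1>\d+)(?P<mt1>[{_AA1}]))\b"
)


def find_mentions(text):
    """Return (wild_type, position, mutant) tuples in one-letter code."""
    mentions = []
    for m in MUTATION.finditer(text):
        if m.group("wt3"):
            mentions.append((THREE_TO_ONE[m.group("wt3")],
                             int(m.group("pos3")),
                             THREE_TO_ONE[m.group("mt3")]))
        else:
            mentions.append((m.group("wt1"), int(m.group("pos1")), m.group("mt1")))
    return mentions


def resolve_position(mention, sequence, offsets=(0,)):
    """Validate a mention against a sequence.

    Returns the 1-based position at which the wild-type residue is found,
    trying each offset in turn (for example, to account for a numbering
    shift between a paper and the database sequence), or None if no
    offset matches.
    """
    wild_type, position, _ = mention
    for offset in offsets:
        idx = position - 1 + offset
        if 0 <= idx < len(sequence) and sequence[idx] == wild_type:
            return idx + 1
    return None
```

A one-letter pattern this permissive will also match strings that are not mutations, which is why a validation step against the protein sequence is useful.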


PLoS ONE ◽  
2015 ◽  
Vol 10 (9) ◽  
pp. e0137379 ◽  
Author(s):  
Scott H. Millen ◽  
Mineo Watanabe ◽  
Eiji Komatsu ◽  
Fuminori Yamaguchi ◽  
Yuki Nagasawa ◽  
...  

Life ◽  
2021 ◽  
Vol 11 (8) ◽  
pp. 866
Author(s):  
Sony Hartono Wijaya ◽  
Farit Mochamad Afendi ◽  
Irmanida Batubara ◽  
Ming Huang ◽  
Naoaki Ono ◽  
...  

Background: We performed in silico prediction of the interactions between compounds of Jamu herbs and human proteins by utilizing data-intensive science and machine learning methods. Verifying the proteins that are targeted by compounds of natural herbs will be helpful to select natural herb-based drug candidates. Methods: Initially, data related to compounds, target proteins, and interactions between them were collected from open access databases. Compounds are represented by molecular fingerprints, whereas amino acid sequences are represented by numerical protein descriptors. Then, prediction models that predict the interactions between compounds and target proteins were constructed using support vector machine and random forest. Results: A random forest model constructed based on MACCS fingerprint and amino acid composition obtained the highest accuracy. We used the best model to predict target proteins for 94 important Jamu compounds and assessed the results by supporting evidence from published literature and other sources. There are 27 compounds that can be validated by professional doctors, and those compounds belong to seven efficacy groups. Conclusion: By comparing the efficacy of predicted compounds and the relations of the targeted proteins with diseases, we found that some compounds might be considered as drug candidates.
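To make the feature representation concrete, here is a minimal sketch of how a compound fingerprint and an amino acid composition descriptor might be joined into one input vector for a classifier. The fingerprint is treated as a plain list of bits supplied by the caller, and the function names are hypothetical. How the study actually generated and combined its descriptors is not stated in the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each of the 20 standard amino acids.

    Non-standard characters are ignored when computing the fractions.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def pair_features(fingerprint_bits, sequence):
    """Concatenate compound bits and protein composition into one vector.

    Assumption: `fingerprint_bits` is a list of 0/1 values computed
    elsewhere (for example by a cheminformatics toolkit).
    """
    return [float(b) for b in fingerprint_bits] + amino_acid_composition(sequence)
```

Each compound-protein pair then becomes one row of a feature matrix, labelled by whether the interaction is known, which is the form that support vector machine and random forest classifiers expect.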


2021 ◽  
Author(s):  
Yudong Li ◽  
Zhongke Feng ◽  
Ziyu Zhao ◽  
Wenyuan Ma ◽  
Shilin Chen ◽  
...  

Abstract Forest fires can cause serious harm. Scientifically predicting forest fires is an important basis for preventing them. Currently, there is little research on the prediction of long time-series forest fires in China. Choosing a suitable forest fire prediction model and predicting the probability of Chinese forest fire occurrence are of great importance to China’s forest fire prevention and control work. Based on fire hotspot, meteorological, terrain, vegetation, infrastructure, and socioeconomic data collected from 2003 to 2016, we used a random forest model as a feature-selection method to identify 13 major drivers of forest fires in China. The forest fire prediction models developed in this study are based on four machine-learning algorithms: an artificial neural network, a radial basis function network, a support-vector machine, and a random forest. The models were evaluated using five performance indicators: accuracy, precision, recall, F1 score, and area under the curve. We used the optimal model to obtain the probability of forest fire occurrence in various provinces in China and created a spatial distribution map of the areas with high incidences of forest fires. The results showed that the prediction accuracy of the four forest fire prediction models was between 75.8% and 89.2%, and the area under the curve value was between 0.840 and 0.960. The random forest model had the highest accuracy (89.2%) and area under the curve value (0.96); thus, it was used as the optimal model to predict the probability of forest fire occurrence in China. The prediction results indicate that the areas with high incidences of forest fires are mainly concentrated in north-eastern China (Heilongjiang Province and northern Inner Mongolia Autonomous Region) and south-eastern China (including Fujian Province and Jiangxi Province). In areas at high risk of forest fire, management departments can improve forest fire prevention and control by establishing watch towers and using other monitoring equipment. This study helps in understanding the main drivers of forest fires in China, provides a reference for the selection of high-precision forest fire prediction models, and provides a scientific basis for China’s forest fire prevention and control work.
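Four of the five performance indicators named in this abstract follow directly from the confusion matrix of a binary classifier. The sketch below shows the standard definitions. The function name and the toy labels are illustrative assumptions, and area under the curve is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = fire, 0 = no fire)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, not data from the study.
print(classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Reporting precision and recall alongside accuracy matters when fire events are much rarer than non-events, since accuracy alone can look high for a model that rarely predicts a fire.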

