SPRINT-Gly: predicting N- and O-linked glycosylation sites of human and mouse proteins by using sequence and predicted structural properties

2019 ◽  
Vol 35 (20) ◽  
pp. 4140-4146 ◽  
Author(s):  
Ghazaleh Taherzadeh ◽  
Abdollah Dehzangi ◽  
Maryam Golchin ◽  
Yaoqi Zhou ◽  
Matthew P Campbell

Abstract Motivation: Protein glycosylation is one of the most abundant post-translational modifications and plays an important role in immune responses, intercellular signaling, inflammation and host-pathogen interactions. However, due to the poor ionization efficiency and microheterogeneity of glycopeptides, identifying glycosylation sites is a challenging task, and there is a demand for computational methods. Here, we constructed the largest dataset of human and mouse glycosylation sites to train deep learning neural networks and support vector machine classifiers to predict N-/O-linked glycosylation sites, respectively. Results: The method, called SPRINT-Gly, achieved consistent results between ten-fold cross-validation and independent test for predicting human and mouse glycosylation sites. For N-glycosylation, a mouse-trained model performs equally well in human glycoproteins and vice versa; however, due to significant differences in O-linked sites, separate models were generated. Overall, SPRINT-Gly is 18% and 50% higher in Matthews correlation coefficient than the next best method compared in N-linked and O-linked sites, respectively. This improved performance is due to the inclusion of novel structure and sequence-based features. Availability and implementation: http://sparks-lab.org/server/SPRINT-Gly/ Supplementary information: Supplementary data are available at Bioinformatics online.
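
For readers unfamiliar with the Matthews correlation coefficient used in the comparison above, the following is a minimal sketch of its standard definition from confusion-matrix counts. The function name and the zero-denominator convention are choices made for this illustration; nothing here is taken from the SPRINT-Gly code.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0 when a row or column of the
        # confusion matrix is empty and the ratio is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike plain accuracy, this score stays near zero for a classifier that always predicts the majority class, which is why it is often preferred when non-glycosylated sites greatly outnumber glycosylated ones.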

2018 ◽  
Vol 1 (1) ◽  
pp. 120-130 ◽  
Author(s):  
Chunxiang Qian ◽  
Wence Kang ◽  
Hao Ling ◽  
Hua Dong ◽  
Chengyao Liang ◽  
...  

A support vector machine (SVM) model optimized by K-fold cross-validation was built to predict and evaluate the degradation of concrete strength in a complicated marine environment. Several other models, such as an artificial neural network (ANN) and a decision tree (DT), were also built and compared with the SVM to determine which one could make the most accurate predictions. Both material factors and environmental factors that influence the results were considered. The material factors mainly involved the original concrete strength and the amounts of cement replaced by fly ash and slag. The environmental factors consisted of the concentrations of Mg²⁺, SO₄²⁻ and Cl⁻, the temperature and the exposure time. The prediction results showed that the optimized SVM model performed better than the other models in predicting concrete strength. Based on the SVM model, a variable-limitation simulation method was used to determine the sensitivity of the various factors and their degree of influence on the degradation of concrete strength.
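
The abstract does not state the value of K or how the folds were formed. As a rough sketch of the K-fold idea it relies on, the helper below (a hypothetical `k_fold_indices`, with an assumed shuffle and seed) splits sample indices into K nearly equal folds, each serving once as the held-out set.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for K-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffling with a fixed seed is an assumption of this sketch.
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

In a hyperparameter search, each candidate setting would be scored by its average error over the K held-out folds, and the setting with the best average kept.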


2020 ◽  
Vol 27 (3) ◽  
pp. 178-186 ◽  
Author(s):  
Ganesan Pugalenthi ◽  
Varadharaju Nithya ◽  
Kuo-Chen Chou ◽  
Govindaraju Archunan

Background: N-glycosylation is one of the most important post-translational mechanisms in eukaryotes. It occurs predominantly in the N-X-[S/T] sequon, where X is any amino acid other than proline. However, not all N-X-[S/T] sequons in proteins are glycosylated. Therefore, accurate prediction of N-glycosylation sites is essential to understand the N-glycosylation mechanism. Objective: In this article, our motivation is to develop a computational method to predict N-glycosylation sites in eukaryotic protein sequences. Methods: We report a random forest method, Nglyc, to predict N-glycosylation sites from protein sequence using 315 sequence features. The method was trained on a dataset of 600 N-glycosylation sites and 600 non-glycosylation sites and tested on a dataset containing 295 N-glycosylation sites and 253 non-glycosylation sites. Nglyc prediction was compared with the NetNGlyc, EnsembleGly and GPP methods. Further, the performance of Nglyc was evaluated using human and mouse N-glycosylation sites. Results: The Nglyc method achieved an overall training accuracy of 0.8033 with all 315 features. Performance comparison with NetNGlyc, EnsembleGly and GPP shows that Nglyc performs better than the other methods, with high sensitivity and specificity. Conclusion: Our method achieved an overall accuracy of 0.8248 with 0.8305 sensitivity and 0.8182 specificity. The comparison study shows that our method performs better than the other methods. The applicability and success of our method were further evaluated using human and mouse N-glycosylation sites. The Nglyc method is freely available at https://github.com/bioinformaticsML/ Ngly.
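
The N-X-[S/T] rule described above is simple enough to express as a pattern match. The sketch below only enumerates candidate sequons; it is not the Nglyc predictor, and the function name and 1-based position convention are assumptions of this illustration.

```python
import re

# Asn followed by any residue except Pro, then Ser or Thr.
# The lookahead keeps the match one residue wide, so overlapping
# sequons such as "NNTS" are all reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(sequence):
    """Return 1-based positions of Asn residues that start an N-X-[S/T] sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


# Example: the Asn at position 6 is skipped because it is followed by Pro.
print(find_sequons("MNGSANPTNNTS"))  # [2, 9, 10]
```

Because not every sequon is glycosylated, a list like this would serve only as the candidate set that a classifier then labels as glycosylated or not.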


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i857-i865
Author(s):  
Derrick Blakely ◽  
Eamon Collins ◽  
Ritambhara Singh ◽  
Andrew Norton ◽  
Jack Lanchantin ◽  
...  

Abstract Motivation: Gapped k-mer kernels with support vector machines (gkm-SVMs) have achieved strong predictive performance on regulatory DNA sequences on modestly sized training sets. However, existing gkm-SVM algorithms suffer from slow kernel computation time, as they depend exponentially on the sub-sequence feature length, number of mismatch positions, and the task’s alphabet size. Results: In this work, we introduce a fast and scalable algorithm for calculating gapped k-mer string kernels. Our method, named FastSK, uses a simplified kernel formulation that decomposes the kernel calculation into a set of independent counting operations over the possible mismatch positions. This simplified decomposition allows us to devise a fast Monte Carlo approximation that rapidly converges. FastSK can scale to much greater feature lengths, allows us to consider more mismatches, and is performant on a variety of sequence analysis tasks. On multiple DNA transcription factor binding site prediction datasets, FastSK consistently matches or outperforms the state-of-the-art gkmSVM-2.0 algorithms in area under the ROC curve, while achieving average speedups in kernel computation of ∼100× and speedups of ∼800× for large feature lengths. We further show that FastSK outperforms character-level recurrent and convolutional neural networks while achieving low variance. We then extend FastSK to 7 English-language medical named entity recognition datasets and 10 protein remote homology detection datasets. FastSK consistently matches or outperforms these baselines. Availability and implementation: Our algorithm is available as a Python package and as C++ source code at https://github.com/QData/FastSK Supplementary information: Supplementary data are available at Bioinformatics online.
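
The idea of decomposing the kernel into independent counts over mismatch positions can be illustrated with a naive, exact version. The sketch below is not the FastSK implementation and includes no Monte Carlo approximation; the function names, the parameters `g` (window length) and `k` (number of kept positions), and the unnormalised output are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def _gapped_counts(seq, g, kept):
    """Count length-g windows of seq, projected onto the kept positions."""
    return Counter(
        tuple(seq[i + p] for p in kept)
        for i in range(len(seq) - g + 1)
    )


def gapped_kmer_kernel(x, y, g, k):
    """Naive gapped k-mer similarity between sequences x and y.

    For every way of keeping k of the g window positions (the other
    g - k positions are free to mismatch), count the window pairs of
    x and y that agree on the kept positions, and sum the counts.
    """
    if not 1 <= k <= g:
        raise ValueError("need 1 <= k <= g")
    total = 0
    for kept in combinations(range(g), k):
        cx = _gapped_counts(x, g, kept)
        cy = _gapped_counts(y, g, kept)
        # Each position set contributes an independent counting step.
        total += sum(c * cy[key] for key, c in cx.items())
    return total
```

The outer loop runs over all C(g, k) position sets, which grows quickly with g. Sampling a subset of these sets instead of enumerating them is one way a Monte Carlo estimate of the same sum could be formed.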


2010 ◽  
Vol 26 (9) ◽  
pp. 1219-1224 ◽  
Author(s):  
Yongjin Li ◽  
Jagdish C. Patra

Abstract Motivation: Clinical diseases are characterized by distinct phenotypes. To identify disease genes is to elucidate the gene–phenotype relationships. Mutations in functionally related genes may result in similar phenotypes, so it is reasonable to predict disease-causing genes by integrating phenotypic data and genomic data. Some genetic diseases are genetically or phenotypically similar and may share common pathogenetic mechanisms. Identifying the relationships between diseases will facilitate a better understanding of their pathogenetic mechanisms. Results: In this article, we constructed a heterogeneous network by connecting the gene network and phenotype network using the phenotype–gene relationship information from the OMIM database. We extended the random walk with restart algorithm to the heterogeneous network. The algorithm prioritizes the genes and phenotypes simultaneously. We used leave-one-out cross-validation to evaluate its ability to find gene–phenotype relationships. The results showed improved performance over previous work. We also used the algorithm to disclose hidden disease associations that cannot be found by the gene network or phenotype network alone. We identified 18 hidden disease associations, most of which were supported by literature evidence. Availability: The MATLAB code of the program is available at http://www3.ntu.edu.sg/home/aspatra/research/Yongjin_BI2010.zip Supplementary information: Supplementary data are available at Bioinformatics online.
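
As background for the algorithm named above, here is a minimal sketch of the basic random walk with restart on a single weighted network. The paper's extension to a combined gene and phenotype network is not reproduced; the adjacency format, the default restart probability of 0.7 and the convergence tolerance are all assumptions of this illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seeds.

    adj   : dict mapping node -> dict of neighbour -> edge weight
            (every neighbour must also be a key of adj)
    seeds : collection of start nodes, weighted equally
    """
    nodes = list(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    out_weight = {n: sum(adj[n].values()) for n in nodes}
    for _ in range(max_iter):
        # Restart term: a fraction of the mass jumps back to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        # Walk term: the rest moves along edges in proportion to weight.
        for u in nodes:
            if out_weight[u] == 0:
                continue  # isolated node: its mass is not propagated
            for v, w in adj[u].items():
                nxt[v] += (1 - restart) * p[u] * w / out_weight[u]
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p
```

Ranking nodes by the returned probabilities gives a proximity ordering relative to the seeds, which is the general principle behind using such walks to prioritize candidates.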


2014 ◽  
Vol 26 (01) ◽  
pp. 1450002 ◽  
Author(s):  
Hanguang Xiao

The early detection and intervention of artery stenosis is very important to reduce the mortality of cardiovascular disease. A novel method for predicting artery stenosis was proposed that uses the input impedance of the systemic arterial tree and a support vector machine (SVM). Based on a transmission line model of a 55-segment systemic arterial tree, the input impedance of the arterial tree was calculated using a recursive algorithm. A sample database of the input impedance was established by specifying different positions and degrees of artery stenosis. An SVM prediction model was trained using the sample database, and 10-fold cross-validation was used to evaluate its performance. The effects of stenosis position and degree on the accuracy of the prediction were discussed. The results showed that the mean specificity, sensitivity and overall accuracy of the SVM are 80.2%, 98.2% and 89.2%, respectively, for the 50% threshold of stenosis degree. Increasing the threshold of the stenosis degree from 10% to 90% increases the overall accuracy from 82.2% to 97.4%. Increasing the distance of the stenosed artery from the heart gradually decreases the overall accuracy from 97.1% to 58%. Increasing the stenosis degree to 90% raises the prediction accuracy of the SVM to more than 90% for stenosis of a peripheral artery. The simulation demonstrated theoretically the feasibility of the proposed method for predicting artery stenosis via the input impedance of the systemic arterial tree and SVM.
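
The three figures quoted above are related through the confusion matrix. The sketch below uses the standard definitions; the counts in the usage note are hypothetical values chosen only to reproduce the reported rates for an assumed balanced sample of 500 stenosed and 500 healthy cases, and are not data from the study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    Assumes at least one positive and one negative case.
    """
    sensitivity = tp / (tp + fn)   # fraction of true positives detected
    specificity = tn / (tn + fp)   # fraction of true negatives detected
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```

With the hypothetical counts `binary_metrics(tp=491, tn=401, fp=99, fn=9)`, the three values come out at about 0.982, 0.802 and 0.892. When the two classes are equally sized, accuracy is the mean of sensitivity and specificity, which matches the pattern in the reported 98.2%, 80.2% and 89.2%.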


2018 ◽  
Vol 35 (16) ◽  
pp. 2757-2765 ◽  
Author(s):  
Balachandran Manavalan ◽  
Shaherin Basith ◽  
Tae Hwan Shin ◽  
Leyi Wei ◽  
Gwang Lee

Abstract Motivation: Cardiovascular disease is the primary cause of death globally, accounting for approximately 17.7 million deaths per year. One of the risks linked with cardiovascular diseases and other complications is hypertension. Naturally derived bioactive peptides with antihypertensive activities serve as promising alternatives to pharmaceutical drugs. So far, there has been no comprehensive analysis, assessment of diverse features or implementation of various machine-learning (ML) algorithms for antihypertensive peptide (AHTP) model construction. Results: In this study, we utilized six different ML algorithms, namely, Adaboost, extremely randomized tree (ERT), gradient boosting (GB), k-nearest neighbor, random forest (RF) and support vector machine (SVM), using 51 feature descriptors derived from eight different feature encodings for the prediction of AHTPs. Because ERT-based trained models performed consistently better than other algorithms regardless of the feature descriptors, we treated them as baseline predictors, whose predicted probability of AHTPs was further used as input features separately for four different ML algorithms (ERT, GB, RF and SVM), and developed their corresponding meta-predictors using a two-step feature selection protocol. Subsequently, the integration of the four meta-predictors through an ensemble learning approach improved the balanced prediction performance and model robustness on the independent dataset. Upon comparison with existing methods, mAHTPred showed superior performance, with an overall improvement of approximately 6–7% in both benchmarking and independent datasets. Availability and implementation: The user-friendly online prediction tool, mAHTPred, is freely accessible at http://thegleelab.org/mAHTPred. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Vol 34 ◽  
pp. 52-58 ◽  
Author(s):  
Jingsheng Shi ◽  
Guanglei Zhao ◽  
Yibing Wei

The dynamic balance between acetylation and deacetylation of histones plays a crucial role in the epigenetic regulation of gene expression. It is equilibrated by two families of enzymes: histone acetyltransferases and histone deacetylases (HDACs). HDACs repress transcription by regulating the conformation of the higher-order chromatin structure. HDAC inhibitors have recently become a class of chemical agents for potential treatment of the abnormal chromatin remodeling process involved in certain cancers. In this study, we constructed a large dataset to predict the activity value of HDAC1 inhibitors. Each compound was represented with seven fingerprints, and computational models were subsequently developed to predict HDAC1 inhibitors via five machine learning methods: naïve Bayes, k-nearest neighbor, C4.5 decision tree, random forest, and support vector machine (SVM) algorithms. The best-performing model combined the CDK fingerprint with SVM and exhibited an accuracy of 0.89. This model also performed best in five-fold cross-validation. Some representative substructure alerts responsible for HDAC1 inhibition were identified by using MoSS in KNIME, which could facilitate the identification of HDAC1 inhibitors.


2017 ◽  
Vol 17 (2) ◽  
pp. 29-38
Author(s):  
Ratih Purwati ◽  
Gunawan Ariyanto

Face recognition is a computer technology for identifying human faces from digital images stored in a database. A human face changes shape according to its expression. Human facial expressions resemble one another, so recognizing whose expression it is can be somewhat difficult. Face recognition continues to be an active topic in computer vision research. Human faces are commonly used in features of social media applications such as Snapchat, Snapgram from Instagram and many other social media applications that rely on this technology. This study analyzes human facial expression recognition using Local Binary Pattern features and seeks the most effective extension of the basic Local Binary Pattern algorithm by combining Histogram Equalization, Support Vector Machine and K-fold cross-validation, so as to improve the recognition of human face images. Several human face databases were fed into the system: JAFFE, which contains face images of 10 Japanese women with 7 emotional expressions (angry, sad, happy, disgusted, surprised, afraid and neutral); YALE, which contains face images of Americans; and the CALTECH dataset, which consists of 450 images of people at 896 x 592 pixels stored in JPEG format. The data were then adjusted to the facial texture of each image.
From the combination of the three methods above and the experiments carried out, the best face recognition results were obtained with the JAFFE dataset at a resolution of 92 x 112 pixels; a high level of processor usage can affect the computation time when running the system, resulting in more accurate predictions.


Author(s):  
Tiago Oliveira ◽  
Morten Thaysen-Andersen ◽  
Nicolle H. Packer ◽  
Daniel Kolarich

Protein glycosylation is one of the most common post-translational modifications that are essential for cell function across all domains of life. Changes in glycosylation are considered a hallmark of many diseases, thus making glycoproteins important diagnostic and prognostic biomarker candidates and therapeutic targets. Glycoproteomics, the study of glycans and their carrier proteins in a system-wide context, is becoming a powerful tool in glycobiology that enables the functional analysis of protein glycosylation. This ‘Hitchhiker's guide to glycoproteomics’ is intended as a starting point for anyone who wants to explore the emerging world of glycoproteomics. The review moves from the techniques that have been developed for the characterisation of single glycoproteins to technologies that may be used for a successful complex glycoproteome characterisation. Examples of the variety of approaches, methodologies, and technologies currently used in the field are given. This review introduces the common strategies to capture glycoprotein-specific and system-wide glycoproteome data from tissues, body fluids, or cells, and a perspective on how integration into a multi-omics workflow enables a deep identification and characterisation of glycoproteins — a class of biomolecules essential in regulating cell function.


BMC Cancer ◽  
2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Ying Zhu ◽  
Wang Yao ◽  
Bing-Chen Xu ◽  
Yi-Yan Lei ◽  
Qi-Kun Guo ◽  
...  

Abstract Objectives: To develop and validate a radiomics model for evaluating treatment response to immune-checkpoint inhibitor plus chemotherapy (ICI + CT) in patients with advanced esophageal squamous cell carcinoma (ESCC). Methods: A total of 64 patients with advanced ESCC receiving first-line ICI + CT at two centers between January 2019 and June 2020 were enrolled in this study. Both 2D ROIs and 3D ROIs were segmented. ComBat correction was applied to minimize the potential bias on the results due to different scan protocols. A total of 788 features were extracted, and radiomics models were built on corrected/uncorrected 2D and 3D features by using 5-fold cross-validation. The performance of the radiomics models was assessed by their discrimination, calibration and clinical usefulness with independent validation. Results: Five features and the support vector machine algorithm were selected to build the 2D uncorrected, 2D corrected, 3D uncorrected and 3D corrected radiomics models. The 2D radiomics models significantly outperformed the 3D radiomics models in both primary and validation cohorts. When ComBat correction was used, the performance of 2D models was better (p = 0.0059) in the training cohort, and significantly better (p < 0.0001) in the validation cohort. The 2D corrected radiomics model yielded the optimal performance and was used to build the nomogram. The calibration curve of the radiomics model demonstrated good agreement between prediction and observation, and the decision curve analysis confirmed the clinical utility. Conclusions: The easy-to-use 2D corrected radiomics model could facilitate noninvasive preselection of ESCC patients who would benefit from ICI + CT.

