splice site prediction
Recently Published Documents


TOTAL DOCUMENTS

62
(FIVE YEARS 9)

H-INDEX

13
(FIVE YEARS 2)

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nicolas Scalzitti ◽  
Arnaud Kress ◽  
Romain Orhand ◽  
Thomas Weber ◽  
Luc Moulinier ◽  
...  

Abstract Background Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking. Results We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms. Conclusions Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.


2021 ◽  
Author(s):  
Robin Steinhaus ◽  
Sebastian Proft ◽  
Markus Schuelke ◽  
David N Cooper ◽  
Jana Marie Schwarz ◽  
...  

Abstract Here we present an update to MutationTaster, our DNA variant effect prediction tool. The new version uses a different prediction model and attains higher accuracy than its predecessor, especially for rare benign variants. In addition, we have integrated many sources of data that only became available after the last release (such as gnomAD and ExAC pLI scores) and changed the splice site prediction model. To more easily assess the relevance of detected known disease mutations to the clinical phenotype of the patient, MutationTaster now provides information on the diseases they cause. Further changes represent a major overhaul of the interfaces to increase user-friendliness whilst many changes under the hood have been designed to accelerate the processing of uploaded VCF files. We also offer an API for the rapid automated query of smaller numbers of variants from within other software. MutationTaster2021 integrates our disease mutation search engine, MutationDistiller, to prioritise variants from VCF files using the patient's clinical phenotype. The novel version is available at https://www.genecascade.org/MutationTaster2021/. This website is free and open to all users and there is no login requirement.


Gene X ◽  
2020 ◽  
Vol 5 ◽  
pp. 100035 ◽  
Author(s):  
Somayah Albaradei ◽  
Arturo Magana-Mora ◽  
Maha Thafar ◽  
Mahmut Uludag ◽  
Vladimir B. Bajic ◽  
...  

2020 ◽  
Vol 18 (04) ◽  
pp. 2050024
Author(s):  
Santhosh Amilpur ◽  
Raju Bhukya

Splice site prediction is crucial for understanding underlying gene regulation, gene function for better genome annotation. Many computational methods exist for recognizing the splice sites. Although most of the methods achieve a competent performance, their interpretability remains challenging. Moreover, all traditional machine learning methods manually extract features, which is tedious job. To address these challenges, we propose a deep learning-based approach (EDeepSSP) that employs convolutional neural networks (CNNs) architecture for automatic feature extraction and effectively predicts splice sites. Our model, EDeepSSP, divulges the opaque nature of CNN by extracting significant motifs and explains why these motifs are vital for predicting splice sites. In this study, experiments have been conducted on six benchmark acceptors and donor datasets of humans, cress, and fly. The results show that EDeepSSP has outperformed many state-of-the-art approaches. EDeepSSP achieves the highest area under the receiver operating characteristic curve (AUC_ROC) and area under the precision-recall curve (AUC_PR) of 99.32% and 99.26% on human donor datasets, respectively. We also analyze various filter activities, feature activations, and extracted significant motifs responsible for the splice site prediction. Further, we validate the learned motifs of our model against known motifs of JASPAR splice site database.


PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e9470
Author(s):  
Thanyathorn Thanapattheerakul ◽  
Worrawat Engchuan ◽  
Jonathan H. Chan

Mutations that cause an error in the splicing of a messenger RNA (mRNA) can lead to diseases in humans. Various computational models have been developed to recognize the sequence pattern of the splice sites. In recent studies, Convolutional Neural Network (CNN) architectures were shown to outperform other existing models in predicting the splice sites. However, an insufficient effort has been put into extending the CNN model to predict the effect of the genomic variants on the splicing of mRNAs. This study proposes a framework to elaborate on the utility of CNNs to assess the effect of splice variants on the identification of potential disease-causing variants that disrupt the RNA splicing process. Five models, including three CNN-based and two non-CNN machine learning based, were trained and compared using two existing splice site datasets, Genome Wide Human splice sites (GWH) and a dataset provided at the Deep Learning and Artificial Intelligence winter school 2018 (DLAI). The donor sites were also used to test on the HSplice tool to evaluate the predictive models. To improve the effectiveness of predictive models, two datasets were combined. The CNN model with four convolutional layers showed the best splice site prediction performance with an AUPRC of 93.4% and 88.8% for donor and acceptor sites, respectively. The effects of variants on splicing were estimated by applying the best model on variant data from the ClinVar database. Based on the estimation, the framework could effectively differentiate pathogenic variants from the benign variants (p = 5.9 × 10−7). These promising results support that the proposed framework could be applied in future genetic studies to identify disease causing loci involving the splicing mechanism. The datasets and Python scripts used in this study are available on the GitHub repository at https://github.com/smiile8888/rna-splice-sites-recognition.


2019 ◽  
Vol 20 (S23) ◽  
Author(s):  
Ruohan Wang ◽  
Zishuai Wang ◽  
Jianping Wang ◽  
Shuaicheng Li

Abstract Background Identifying splice sites is a necessary step to analyze the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent on splice sites, and many other patterns are also on splice sites with important biological functions. Meanwhile, the dinucleotides occur frequently at the sequences without splice sites, which makes the prediction prone to generate false positives. Most existing tools select all the sequences with the two dimers and then focus on distinguishing the true splice sites from those pseudo ones. Such an approach will lead to a decrease in false positives; however, it will result in non-canonical splice sites missing. Result We have designed SpliceFinder based on convolutional neural network (CNN) to predict splice sites. To achieve the ab initio prediction, we used human genomic data to train our neural network. An iterative approach is adopted to reconstruct the dataset, which tackles the data unbalance problem and forces the model to learn more features of splice sites. The proposed CNN obtains the classification accuracy of 90.25%, which is 10% higher than the existing algorithms. The method outperforms other existing methods in terms of area under receiver operating characteristics (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact position of splice sites on long genomic sequences with a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder generates results in about half lower false positive while keeping recall higher than 0.8. Also, SpliceFinder captures the non-canonical splice sites. In addition, SpliceFinder performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. Conclusion Based on CNN, we have proposed a new ab initio splice site prediction tool, SpliceFinder, which generates less false positives and can detect non-canonical splice sites. Additionally, SpliceFinder is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.


2019 ◽  
Vol 17 (06) ◽  
pp. 1950034
Author(s):  
Abolfazl Bahrami ◽  
Ali Najafi ◽  
Mohammadreza Hashemi ◽  
Seyed Reza Miraie-Ashtiani

This study aimed to introduce an algorithm and identify intein motif and blocks involved in protein splicing, and explore the underlying methods in the development of detection of protein motifs. Inteins are mobile protein splicing elements capable of self-splicing post-translationally. They exist in viruses and bacteriophage, notwithstanding this broad phylogenetic distribution, all inteins apportion common structural features. A method was developed to predict intein in a raw sequence, using a ranking and scoring scheme based on amino acid [Formula: see text] value tables. This method aided in the identification and assessment of patterns characterizing the intein sequences. New intein conserved properties are revealed and the known ones are described and localized. We have computed the [Formula: see text] value of each amino acid at block A positions [Formula: see text] to [Formula: see text], block B positions [Formula: see text] to [Formula: see text] and block G positions [Formula: see text]7 to [Formula: see text] for the three categories. The consensus amino acids thus found are listed at the end of each row. We gave statistics for the distance between the blocks, block A to B, block B to F, and block F to G with the average being 66.1, 294, and 10.2 amino acids, respectively. The actual blocks A, B, and G of the one intein found in vacuolar membrane ATPase subunit, a precursor protein, are ranked 1. The results indicate all of the block sequences that are found in nine proteins are ranked at top of the list. The intein sequence is used to search the databases for intein-like proteins. Understanding the functional, structural, and dynamical aspects of inteins is important for intein engineering and the betterment of intein database.


Sign in / Sign up

Export Citation Format

Share Document