A deep learning framework with hybrid encoding for protein coding regions prediction in biological sequences

2020 ◽  
Author(s):  
Chao Wei ◽  
Junying Zhang ◽  
Xiguo Yuan ◽  
Zongzhen He ◽  
Guojun Liu

ABSTRACT Protein coding region prediction is a very important but overlooked subtask for tasks such as prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed them into a classifier for the final prediction. However, the encoding scheme directly influences the classifier's ability to capture coding features, and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a method for predicting protein coding regions in transcript sequences based on a bidirectional recurrent neural network with non-overlapping kmer features, and achieved considerable improvement over existing methods, but there is still much room to improve performance. In fact, kmer features that count the occurrence frequency of trinucleotides reflect only the local sequence order information between the most contiguous nucleotides and lose almost all the global sequence order information. In view of this, we here present a deep learning framework with hybrid encoding for protein coding region prediction in biological sequences, which effectively exploits global sequence order information, non-overlapping kmer features and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, our proposed method significantly outperforms existing state-of-the-art methods.
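
The limitation of trinucleotide frequency features described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `nonoverlapping_kmers` and `kmer_frequencies`, the `frame` parameter, and the toy sequences are assumptions made for illustration only.

```python
from collections import Counter

def nonoverlapping_kmers(seq, k=3, frame=0):
    """Split seq into consecutive, non-overlapping k-mers starting at `frame`.

    Hypothetical helper: a trailing fragment shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(frame, len(seq) - k + 1, k)]

def kmer_frequencies(seq, k=3, frame=0):
    """Relative frequency of each non-overlapping k-mer in seq."""
    tokens = nonoverlapping_kmers(seq, k, frame)
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {kmer: n / len(tokens) for kmer, n in counts.items()}

# Two toy sequences built from the same trinucleotides in a different order.
a = "ATGGCCATGTAA"   # ATG GCC ATG TAA
b = "TAAATGGCCATG"   # TAA ATG GCC ATG

print(nonoverlapping_kmers(a))
print(kmer_frequencies(a) == kmer_frequencies(b))
```

The two sequences tokenize differently but yield identical frequency vectors, which is the sense in which frequency counts discard the global order of the trinucleotides.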

1991 ◽  
Vol 11 (3) ◽  
pp. 1770-1776
Author(s):  
R G Collum ◽  
D F Clayton ◽  
F W Alt

We found that the canary N-myc gene is highly related to mammalian N-myc genes in both the protein-coding region and the long 3' untranslated region. Examined coding regions of the canary c-myc gene were also highly related to their mammalian counterparts, but in contrast to N-myc, the canary and mammalian c-myc genes were quite divergent in their 3' untranslated regions. We readily detected N-myc and c-myc expression in the adult canary brain and found N-myc expression both at sites of proliferating neuronal precursors and in mature neurons.


2020 ◽  
Vol 21 (15) ◽  
pp. 5222 ◽  
Author(s):  
Xiao-Nan Fan ◽  
Shao-Wu Zhang ◽  
Song-Yao Zhang ◽  
Jin-Jie Ni

Long non-coding RNAs (lncRNAs) play crucial roles in diverse biological processes and human complex diseases. Distinguishing lncRNAs from protein-coding transcripts is a fundamental step for analyzing the lncRNA functional mechanism. However, the experimental identification of lncRNAs is expensive and time-consuming. In this study, we presented an alignment-free multimodal deep learning framework (namely lncRNA_Mdeep) to distinguish lncRNAs from protein-coding transcripts. LncRNA_Mdeep incorporated three different input modalities; a multimodal deep learning framework was then built to learn high-level abstract representations and predict the probability that a transcript was an lncRNA. LncRNA_Mdeep achieved 98.73% prediction accuracy in a 10-fold cross-validation test on humans. Compared with eight other state-of-the-art methods, lncRNA_Mdeep showed 93.12% prediction accuracy in an independent test on humans, which was 0.94%~15.41% higher than that of the other eight methods. In addition, the results on 11 cross-species datasets showed that lncRNA_Mdeep was a powerful predictor of lncRNAs.


2019 ◽  
Vol 109 (6) ◽  
pp. 983-992 ◽  
Author(s):  
Dan Edward V. Villamor ◽  
Kenneth C. Eastwell

Western X (WX) disease, caused by ‘Candidatus Phytoplasma pruni’, is a devastating disease of sweet cherry resulting in the production of small, bitter-flavored fruits that are unmarketable. Escalation of WX disease in Washington State prompted the development of a rapid detection assay based on recombinase polymerase amplification (RPA) to facilitate timely removal and replacement of diseased trees. Here, we report on a reliable RPA assay targeting putative immunodominant protein coding regions that showed comparable sensitivity to polymerase chain reaction (PCR) in detecting ‘Ca. Phytoplasma pruni’ from crude sap of sweet cherry tissues. Apart from the predominant strain of ‘Ca. Phytoplasma pruni’, the RPA assay also detected a novel strain of phytoplasma from several WX-affected trees. Multilocus sequence analyses using the immunodominant protein A (idpA), imp, rpoE, secY, and 16S ribosomal RNA regions from several ‘Ca. Phytoplasma pruni’ isolates from WX-affected trees showed that this novel phytoplasma strain represents a new subgroup within the 16SrIII group. Examination of high-throughput sequencing data from total RNA of WX-affected trees revealed that the imp coding region is highly expressed, and as supported by quantitative reverse transcription PCR data, it showed higher RNA transcript levels than the previously proposed idpA coding region of ‘Ca. Phytoplasma pruni’.


Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1324
Author(s):  
Garin Newcomb ◽  
Khalid Sayood

One of the important steps in the annotation of genomes is the identification of regions in the genome which code for proteins. Most annotation approaches rely on signals extracted from a genomic region to decide whether the region is a protein coding region. Motivated by the fact that these regions are information-bearing structures, we propose signals based on measures motivated by the average mutual information for use in this task. We show that these signals can be used to identify coding and noncoding sequences with high accuracy. We also show that these signals are robust across species, phyla, and kingdoms and can, therefore, be used in species-agnostic genome annotation algorithms for identifying protein coding regions. These in turn could be used for gene identification.
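
For orientation, the textbook average mutual information between nucleotides separated by a fixed lag can be computed as in the Python sketch below. The abstract does not specify the authors' actual signals, which are only described as motivated by this quantity; the function name `average_mutual_information` and the choice to estimate the marginals from the same set of pairs are assumptions.

```python
import math
from collections import Counter

def average_mutual_information(seq, lag):
    """Average mutual information (in bits) between symbols `lag` positions apart.

    I(lag) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p1(x) * p2(y))),
    where p(x, y) is the empirical frequency of the pair (seq[i], seq[i + lag])
    and p1, p2 are the marginals of the first and second symbol of those pairs.
    """
    seq = seq.upper()
    n_pairs = len(seq) - lag
    if lag < 1 or n_pairs <= 0:
        raise ValueError("lag must be >= 1 and shorter than the sequence")
    pairs = Counter((seq[i], seq[i + lag]) for i in range(n_pairs))
    first = Counter(seq[:n_pairs])   # symbols that start a pair
    second = Counter(seq[lag:])      # symbols that end a pair
    ami = 0.0
    for (x, y), count in pairs.items():
        p_xy = count / n_pairs
        p_x = first[x] / n_pairs
        p_y = second[y] / n_pairs
        ami += p_xy * math.log2(p_xy / (p_x * p_y))
    return ami

# A homopolymer carries no information about the next symbol;
# a strictly alternating sequence is fully predictable at lag 1.
print(average_mutual_information("AAAAAAAA", 1))
print(average_mutual_information("ACACACAC", 1))
```

Evaluating the function over a range of lags gives a profile per sequence window; how such a profile would be turned into a coding/noncoding decision is left open here because the abstract does not say.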


2020 ◽  
Author(s):  
Antoine Despinasse ◽  
Yongjin Park ◽  
Michael Lapi ◽  
Manolis Kellis

ABSTRACT Despite all the work done, mapping GWAS SNPs in non-coding regions to their target genes remains a challenge. A SNP can be associated with target genes by eQTL analysis. Here we introduce a method to make these eQTLs more robust. Instead of correlating gene expression with the SNP value, as in eQTL analysis, we correlate it with epigenomic data. Such epigenomic data is very expensive and noisy. We therefore predict the epigenomic data from the DNA sequence using the deep learning framework DeepSEA (Zhou and Troyanskaya, 2015).


2019 ◽  
Author(s):  
If Barnes ◽  
Ximena Ibarra-Soria ◽  
Stephen Fitzgerald ◽  
Jose Gonzalez ◽  
Claire Davidson ◽  
...  

Abstract Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with over 850 in human and nearly 1500 genes in mouse. The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon.


2021 ◽  
Vol 12 ◽  
Author(s):  
Fabien Degalez ◽  
Frédéric Jehl ◽  
Kévin Muret ◽  
Maria Bernard ◽  
Frédéric Lecerf ◽  
...  

Most single-nucleotide polymorphisms (SNPs) are located in non-coding regions, but the fraction usually studied is harbored in protein-coding regions because potential impacts on proteins are relatively easy to predict by popular tools such as the Variant Effect Predictor. These tools annotate variants independently without considering the potential effect of grouped or haplotypic variations, often called “multi-nucleotide variants” (MNVs). Here, we used a large RNA-seq dataset to survey MNVs, comprising 382 chicken samples originating from 11 populations analyzed in the companion paper in which 9.5M SNPs (including 3.3M SNPs with reliable genotypes) were detected. We focused our study on in-codon MNVs and evaluated their potential mis-annotation. Using GATK HaplotypeCaller read-based phasing results, we identified 2,965 MNVs observed in at least five individuals located in 1,792 genes. We found that 41.1% of them showed a novel impact when compared to the effect of their constituent SNPs analyzed separately. The biggest impact variation flux concerns the originally annotated stop-gained consequences, for which around 95% were rescued; this flux is followed by the missense consequences, for which 37% were reannotated with a different amino acid. We then present in more depth the rescued stop-gained MNVs and give an illustration in the SLC27A4 gene. As previously shown in human datasets, our results in chicken demonstrate the value of haplotype-aware variant annotation and the importance of considering MNVs in the coding region, particularly when searching for severe functional consequences such as stop-gained variants.
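
The notion of a rescued stop-gained variant can be made concrete with a small Python sketch that annotates two SNPs in the same codon first separately and then together. It uses the standard genetic code; the helper names `apply_snps` and `consequence`, the simplified consequence labels, and the example codon are illustrative assumptions and do not come from the paper or from any annotation tool.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def apply_snps(codon, snps):
    """Return the codon after applying (position, alt_base) substitutions."""
    bases = list(codon)
    for pos, alt in snps:
        bases[pos] = alt
    return "".join(bases)

def consequence(ref_codon, alt_codon):
    """Simplified consequence label for a codon change (illustrative labels)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"

ref = "CAG"                      # glutamine
snp1, snp2 = (0, "T"), (2, "C")  # hypothetical phased SNPs in the same codon

# Annotated independently, the first SNP looks like a stop-gained variant.
print(consequence(ref, apply_snps(ref, [snp1])))        # CAG -> TAG
print(consequence(ref, apply_snps(ref, [snp2])))        # CAG -> CAC
# Annotated together on the same haplotype, the stop is rescued.
print(consequence(ref, apply_snps(ref, [snp1, snp2])))  # CAG -> TAC
```

The combined call is only valid when both alternate alleles sit on the same haplotype, which is why phasing information is needed before in-codon SNPs are merged.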


Author(s):  
Min Zeng ◽  
Yifan Wu ◽  
Chengqian Lu ◽  
Fuhao Zhang ◽  
Fang-Xiang Wu ◽  
...  

Abstract Long non-coding RNAs (lncRNAs) are a class of RNA molecules with more than 200 nucleotides. A growing amount of evidence reveals that subcellular localization of lncRNAs can provide valuable insights into their biological functions. Existing computational methods for predicting lncRNA subcellular localization use k-mer features to encode lncRNA sequences. However, the sequence order information is lost by using only k-mer features. We proposed a deep learning framework, DeepLncLoc, to predict lncRNA subcellular localization. In DeepLncLoc, we introduced a new subsequence embedding method that keeps the order information of lncRNA sequences. The subsequence embedding method first divides a sequence into several consecutive subsequences, then extracts the patterns of each subsequence, and finally combines these patterns to obtain a complete representation of the lncRNA sequence. After that, a text convolutional neural network is employed to learn high-level features and perform the prediction task. Compared with traditional machine learning models, popular representation methods and existing predictors, DeepLncLoc achieved better performance, which shows that DeepLncLoc could effectively predict lncRNA subcellular localization. Our study not only presented a novel computational model for predicting lncRNA subcellular localization but also introduced a new subsequence embedding method which is expected to be applied in other sequence-based prediction tasks. The DeepLncLoc web server is freely accessible at http://bioinformatics.csu.edu.cn/DeepLncLoc/, and source code and datasets can be downloaded from https://github.com/CSUBioGroup/DeepLncLoc.
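
The divide, extract, and combine steps of the subsequence embedding described in this abstract might be sketched in Python as follows. This is not the DeepLncLoc code: the abstract does not say how the patterns of each subsequence are extracted, so plain k-mer counts are used here as a stand-in, and the function names and the parameters `n_parts` and `k` are assumptions.

```python
from itertools import product

def split_into_subsequences(seq, n_parts):
    """Split seq into n_parts consecutive chunks of near-equal length."""
    base, extra = divmod(len(seq), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks

def kmer_count_vector(sub, k, alphabet="ACGU"):
    """Overlapping k-mer counts for one subsequence (stand-in pattern extractor)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(sub) - k + 1):
        j = index.get(sub[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous symbols
            vec[j] += 1
    return vec

def subsequence_embedding(seq, n_parts=4, k=2):
    """One feature row per consecutive subsequence; row order keeps sequence order."""
    seq = seq.upper().replace("T", "U")
    return [kmer_count_vector(c, k) for c in split_into_subsequences(seq, n_parts)]

forward = subsequence_embedding("AAAACCCC", n_parts=2, k=2)
reverse = subsequence_embedding("CCCCAAAA", n_parts=2, k=2)
print(forward == reverse[::-1])
```

Because each row describes one region of the transcript, swapping the two halves of the toy sequence swaps the rows, whereas a single whole-sequence count vector would barely change. The resulting matrix has the shape a convolutional model over positions would expect, though the network itself is outside the scope of this sketch.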


2020 ◽  
Author(s):  
Xiao-Nan Fan ◽  
Shao-Wu Zhang ◽  
Song-Yao Zhang ◽  
Jin-Jie Ni

Abstract Background: Long non-coding RNAs (lncRNAs) play crucial roles in diverse biological processes and human complex diseases. Distinguishing lncRNAs from protein-coding transcripts is a fundamental step for analyzing the lncRNA functional mechanism. However, the experimental identification of lncRNAs is expensive and time-consuming. Results: In this study, we present an alignment-free multimodal deep learning framework (namely lncRNA_Mdeep) to distinguish lncRNAs from protein-coding transcripts. LncRNA_Mdeep incorporates three different input modalities (i.e. OFH modality, k-mer modality, and sequence modality); a multimodal deep learning framework is then built to learn high-level abstract representations and predict the probability that a transcript is an lncRNA. Conclusions: LncRNA_Mdeep achieves 98.73% prediction accuracy in a 10-fold cross-validation test on human. Compared with eight other state-of-the-art methods, lncRNA_Mdeep shows 93.12% prediction accuracy in an independent test on human, which is 0.94%~15.41% higher than that of the other eight methods. In addition, the results on 11 cross-species datasets show that lncRNA_Mdeep is a powerful predictor for identifying lncRNAs. The source code can be downloaded from https://github.com/NWPU-903PR/lncRNA_Mdeep.


2019 ◽  
Author(s):  
If H. A. Barnes ◽  
Ximena Ibarra-Soria ◽  
Stephen Fitzgerald ◽  
Jose M. Gonzalez ◽  
Claire Davidson ◽  
...  

ABSTRACT Olfactory receptor (OR) genes are the largest multi-gene family in the mammalian genome, with over 850 in human and nearly 1500 genes in mouse. The expansion of the OR gene repertoire has occurred through numerous duplication events followed by diversification, resulting in a large number of highly similar paralogous genes. These characteristics have made the annotation of the complete OR gene repertoire a complex task. Most OR genes have been predicted in silico and are typically annotated as intronless coding sequences. Here we have developed an expert curation pipeline to analyse and annotate every OR gene in the human and mouse reference genomes. By combining evidence from structural features, evolutionary conservation and experimental data, we have unified the annotation of these gene families, and have systematically determined the protein-coding potential of each locus. We have defined the non-coding regions of many OR genes, enabling us to generate full-length transcript models. We found that 13 human and 41 mouse OR loci have coding sequences that are split across two exons. These split OR genes are conserved across mammals, and are expressed at the same level as protein-coding OR genes with an intronless coding region. Our findings challenge the long-standing and widespread notion that the coding region of a vertebrate OR gene is contained within a single exon.

