Deep learning predicts short non-coding RNA functions from only raw sequence data

2020 ◽  
Vol 16 (11) ◽  
pp. e1008415
Author(s):  
Teresa Maria Rosaria Noviello ◽  
Francesco Ceccarelli ◽  
Michele Ceccarelli ◽  
Luigi Cerulo

Small non-coding RNAs (ncRNAs) are short non-coding sequences involved in gene regulation in many biological processes and diseases. The lack of a complete understanding of their biological functionality, especially in a genome-wide scenario, has demanded new computational approaches to annotate their roles. Secondary structure is widely regarded as a determinant of RNA function, and machine learning approaches have successfully predicted RNA function from secondary structure information. Here we show that RNA function can be predicted with good accuracy from a lightweight representation of sequence information, without the need to compute secondary structure features, which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared to recent secondary structure based methods, the proposed solution is more robust to sequence boundary noise and drastically reduces the computational cost, allowing for large data volume annotations. Scripts and datasets to reproduce the results of experiments proposed in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep.
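As an illustration of what a lightweight sequence representation can look like, the sketch below one-hot encodes an RNA sequence into a fixed-length matrix. The abstract does not state the exact encoding used in the study, so the function name `one_hot_encode`, the alphabet order, and the zero-padding scheme are assumptions made for this example only.

```python
# Hypothetical helper: the abstract does not specify the encoding actually used.
ALPHABET = "ACGU"  # assumed column order


def one_hot_encode(seq, length):
    """Encode an RNA sequence as a `length` x 4 matrix of 0/1 values.

    Sequences longer than `length` are truncated; shorter ones are padded
    with all-zero rows. Bases outside ACGU (for example N) also become
    all-zero rows.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    rows = [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq[:length]]
    # Pad with fresh zero rows so no two rows share the same list object.
    rows.extend([0, 0, 0, 0] for _ in range(length - len(rows)))
    return rows


if __name__ == "__main__":
    for row in one_hot_encode("ACGU", 6):
        print(row)
```

A matrix of this kind can be fed directly to a convolutional or recurrent network, which is what makes skipping the secondary structure computation attractive in terms of cost.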

2020 ◽  
Author(s):  
Teresa M.R. Noviello ◽  
Michele Ceccarelli ◽  
Luigi Cerulo

Abstract Non-coding RNAs (ncRNAs) are small non-coding sequences involved in gene regulation in many biological processes and diseases. The lack of a complete understanding of their biological functionality, especially in a genome-wide scenario, has demanded new computational approaches to annotate their roles. Secondary structure is widely regarded as a determinant of RNA function, and machine learning approaches have successfully predicted RNA function from secondary structure information. Here we show that RNA function can be predicted with good accuracy from raw sequence information, without the need to compute secondary structure features, which is computationally expensive. This finding appears to go against the dogma of secondary structure being a key determinant of function in RNA. Compared to recent secondary structure based methods, the proposed solution is more robust to sequence boundary noise and drastically reduces the computational cost, allowing for large data volume annotations. Scripts and datasets to reproduce the results of experiments proposed in this study are available at: https://github.com/bioinformatics-sannio/ncrna-deep


Pathogens ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 925 ◽  
Author(s):  
Marta Szabat ◽  
Dagny Lorent ◽  
Tomasz Czapik ◽  
Maria Tomaszewska ◽  
Elzbieta Kierzek ◽  
...  

Influenza is an important research subject around the world because of its threat to humanity. Influenza A virus (IAV) causes seasonal epidemics and sporadic but dangerous pandemics. Rapid antigenic changes and recombination of the viral RNA genome contribute to the reduced effectiveness of vaccination and anti-influenza drugs. Hence, there is a need to develop new antiviral drugs and strategies to limit the spread of influenza. IAV is a single-stranded negative-sense RNA virus with a genome (viral RNA, vRNA) consisting of eight segments. Segments within the influenza virion are assembled into viral ribonucleoprotein (vRNP) complexes that are independent transcription-replication units. Each step in the influenza life cycle is regulated by the RNA and is dependent on its interplay and dynamics. Therefore, viral RNA can be a suitable target for the design of novel therapeutics. Here, we briefly described examples of anti-influenza strategies based on antisense oligonucleotides (ASOs), small interfering RNA (siRNA), microRNA (miRNA) and catalytic nucleic acids. In particular, we focused on the vRNA structure-function relationship and presented the advantages of using secondary structure information in predicting therapeutic targets, as well as the potential future of this field.


2021 ◽  
Author(s):  
Maureen Rebecca Smith ◽  
Maria Trofimova ◽  
Ariane Weber ◽  
Yannick Duport ◽  
Denise Kuhnert ◽  
...  

As of May 2021, over 160 million SARS-CoV-2 infections have been reported worldwide. Yet, the true number of infections is unknown and believed to exceed the reported numbers by several fold, depending on national testing policies that can strongly affect the proportion of undetected cases. To overcome this testing bias and better assess SARS-CoV-2 transmission dynamics, we propose a genome-based computational pipeline, GInPipe, to reconstruct the SARS-CoV-2 incidence dynamics through time. After validating GInPipe against in silico generated outbreak data, as well as more complex phylodynamic analyses, we use the pipeline to reconstruct incidence histories in Denmark, Scotland, Switzerland, and Victoria (Australia) solely from viral sequence data. The proposed method robustly reconstructs the different pandemic waves in the investigated countries and regions, does not require phylodynamic reconstruction, and can be directly applied to publicly deposited SARS-CoV-2 sequencing data sets. We observe differences in the relative magnitude of reconstructed versus reported incidences during times with sparse availability of diagnostic tests. Using the reconstructed incidence dynamics, we assess how testing policies may have affected the probability of diagnosing and reporting infected individuals. We find that under-reporting was highest in mid 2020 in all analysed countries, coinciding with liberal testing policies at times of low test capacities. Due to the increased use of real-time sequencing, it is envisaged that GInPipe can complement established surveillance tools to monitor the SARS-CoV-2 pandemic and evaluate testing policies. The method executes within minutes on very large data sets and is freely available as a fully automated pipeline from https://github.com/KleistLab/GInPipe.


2019 ◽  
Vol 19 (1) ◽  
Author(s):  
Russell J. S. Orr ◽  
Marianne N. Haugen ◽  
Björn Berning ◽  
Philip Bock ◽  
Robyn L. Cumming ◽  
...  

Abstract Background Understanding the phylogenetic relationships among species is one of the main goals of systematic biology. Simultaneously, credible phylogenetic hypotheses are often the first requirement for unveiling the evolutionary history of traits and for modelling macroevolutionary processes. However, many non-model taxa have not yet been sequenced to an extent such that statistically well-supported molecular phylogenies can be constructed for these purposes. Here, we use a genome-skimming approach to extract sequence information for 15 mitochondrial and 2 ribosomal operon genes from the cheilostome bryozoan family, Adeonidae, Busk, 1884, whose current systematics is based purely on morphological traits. The members of the Adeonidae are, like all cheilostome bryozoans, benthic, colonial, marine organisms. Adeonids are also geographically widely-distributed, often locally common, and are sometimes important habitat-builders. Results We successfully genome-skimmed 35 adeonid colonies representing 6 genera (Adeona, Adeonellopsis, Bracebridgia, Adeonella, Laminopora and Cucullipora). We also contributed 16 new, circularised mitochondrial genomes to the eight previously published for cheilostome bryozoans. Using the aforementioned mitochondrial and ribosomal genes, we inferred the relationships among these 35 samples. Contrary to some previous suggestions, the Adeonidae is a robustly supported monophyletic clade. However, the genera Adeonella and Laminopora are in need of revision: Adeonella is polyphyletic and Laminopora paraphyletically forms a clade with some Adeonella species. Additionally, we assign a sequence clustering identity using cox1 barcoding region of 99% at the species and 83% at the genus level. Conclusions We provide sequence data, obtained via genome-skimming, that greatly increases the resolution of the phylogenetic relationships within the adeonids. 
We present a highly-supported topology based on 17 genes and substantially increase availability of circularised cheilostome mitochondrial genomes, and highlight how we can extend our pipeline to other bryozoans.


Genome ◽  
2005 ◽  
Vol 48 (3) ◽  
pp. 411-416 ◽  
Author(s):  
Hikmet Budak ◽  
Robert C Shearman ◽  
Ismail Dweikat

Buffalograss (Buchloë dactyloides (Nutt.) Engelm.), a C4 turfgrass species, is native to the Great Plains region of North America. The evolutionary implications of buffalograss are unclear. Sequencing of the rbcL and matK genes from the plastid genome and the cob gene from the mitochondrial genome was examined to elucidate buffalograss evolution. This study is the first to report sequencing of these genes from organelle genomes in the genus Buchloë. Comparisons of sequence data from the mitochondrial and plastid genomes revealed that all genotypes had the same cytoplasmic origin. Some rearrangements were detected in the mitochondrial genome. The buffalograss genome appears to have evolved through the rearrangements of convergent subgenomic domains. Combined analyses of plastid genes suggest that the evolutionary process in the Buchloë accessions studied was monophyletic rather than polyphyletic. However, since plastid and mitochondrial genomes are generally uniparentally inherited, the evolutionary history of these genomes may not reflect the evolutionary history of the organism, especially in a species in which out-crossing is common. The sequence information obtained from this study can be used as a genome-specific marker for investigation of the buffalograss polyploidy complex and testing of the mode of plastid and mitochondrial transmission in the genus Buchloë. Key words: buffalograss, evolution, organelle genomes, turfgrass.


2019 ◽  
Author(s):  
Jakob Peder Pettersen

Abstract Background Structural RNA genes play important and varied roles in gene expression and its regulation. Finding such RNA genes in a genome poses a challenge, which in most cases is solved by homology approaches. Ab initio methods for prediction exist, but have not been explored much. Results We introduce hairpin, which identifies potential structural RNA genes based only on the sequence. We use the algorithm to predict RNA genes in Escherichia coli K-12. When looking at very short regions of the genome, we do not get results differing very much from a random shuffling of the genome. However, at longer stretches there is a clear biological signal. It turns out that none of the regions predicted to code for RNA genes have such an annotation in the literature. Conclusions Arbitrary DNA sequences seem to give rise to transcripts with secondary structures similar to real ncRNA. We therefore conclude that exclusively looking at secondary structure base-pairings is in general a futile approach.
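The comparison against a random shuffling of the genome can be sketched as a simple permutation test. The code below is not the hairpin algorithm, whose details the abstract does not give; it uses a deliberately crude, assumed statistic (`count_stems`, which counts short windows followed after a small loop by their reverse complement) only to show how an observed score is compared with scores from shuffled sequences. All function names and parameter defaults are hypothetical.

```python
import random

# Assumes an unambiguous DNA alphabet; other symbols raise KeyError.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_stems(seq, stem=4, min_loop=3, max_loop=8):
    """Count windows of length `stem` that are followed, after a loop of
    min_loop..max_loop bases, by their exact reverse complement."""
    count = 0
    for i in range(len(seq) - stem + 1):
        target = reverse_complement(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if seq[j:j + stem] == target:
                count += 1
    return count


def shuffle_p_value(seq, n_shuffles=200, seed=0):
    """Empirical p-value: fraction of shuffled sequences (same base
    composition) scoring at least as high as the observed sequence."""
    rng = random.Random(seed)
    observed = count_stems(seq)
    bases = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        if count_stems("".join(bases)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    region = "GGGGAAACCCC"
    print(count_stems(region), shuffle_p_value(region))
```

A p-value close to 1 for short regions would correspond to the abstract's observation that short stretches look much like shuffled sequence, whereas a small p-value would indicate signal beyond base composition.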


Author(s):  
Fenny Martha Dwivany ◽  
Muhammad Rifki Ramadhan ◽  
Carolin Lim ◽  
Agus Sutanto ◽  
Husna Nugrahapraja ◽  
...  

Banana is one of the most essential commodities on the island of Bali. It is important not only as a source of nutrition but also for cultural and religious purposes. However, the genetic diversity of Bali bananas has not been explored; therefore, in this study, we focused on their genetic relationships using a molecular approach. This research aimed to determine the genetic relationship of Bali banana cultivars using the internal transcribed spacer 2 (ITS-2) region as a molecular marker. A total of 39 banana samples (Musa spp. L.) were collected from Bali. The ITS-2 DNA regions were then amplified and sequenced from both ends. ITS-2 sequences were predicted using the ITS2 Database (http://its2.bioapps.biozentrum.uni-wuerzburg.de/). Multiple sequence alignment was performed using ClustalX for the nucleotide-based tree and LocARNA to provide the secondary structure information. Phylogenetic trees were constructed using neighbor-joining (Kimura 2-parameter model, 1,000 bootstrap replicates). The results showed that two clades were formed: one clade was abundant in the A genome (AA and AAA), and the other was rich in the B genome (BB and ABB). This result suggested that cultivars with similar genomic compositions tended to be grouped within the same clade and separated from those with different genomic compositions. This study suggests that ITS-2 sequences in bananas are quite similar to one another and differ considerably from those of other families. Secondary structure has been described as providing more robust resolving power in phylogenetic analysis.
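The Kimura 2-parameter model mentioned above converts the observed proportions of transitions (P) and transversions (Q) between two aligned sequences into a distance, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)). The following is a minimal sketch of that standard formula, not the pipeline used in the study; the function name `k2p_distance` and the choice to skip gapped or ambiguous columns are assumptions for illustration.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Columns containing gaps or ambiguous bases are skipped (an assumed
    convention). math.log or math.sqrt raises ValueError when the
    sequences are too divergent for the formula to be defined.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguity code
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable columns")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

A matrix of such pairwise distances is the input that a neighbor-joining tree builder operates on.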


2020 ◽  
Vol 52 (1) ◽  
Author(s):  
Sara de las Heras-Saldana ◽  
Bryan Irvine Lopez ◽  
Nasir Moghaddar ◽  
Woncheoul Park ◽  
Jong-eun Park ◽  
...  

Abstract Background In this study, we assessed the accuracy of genomic prediction for carcass weight (CWT), marbling score (MS), eye muscle area (EMA) and back fat thickness (BFT) in Hanwoo cattle when using genomic best linear unbiased prediction (GBLUP), weighted GBLUP (wGBLUP), and a BayesR model. For these models, we investigated the potential gain from using pre-selected single nucleotide polymorphisms (SNPs) from a genome-wide association study (GWAS) on imputed sequence data and from gene expression information. We used data on 13,717 animals with carcass phenotypes and imputed sequence genotypes that were split in an independent GWAS discovery set of varying size and a remaining set for validation of prediction. Expression data were used from a Hanwoo gene expression experiment based on 45 animals. Results Using a larger number of animals in the reference set increased the accuracy of genomic prediction whereas a larger independent GWAS discovery dataset improved identification of predictive SNPs. Using pre-selected SNPs from GWAS in GBLUP improved accuracy of prediction by 0.02 for EMA and up to 0.05 for BFT, CWT, and MS, compared to a 50 k standard SNP array that gave accuracies of 0.50, 0.47, 0.58, and 0.47, respectively. Accuracy of prediction of BFT and CWT increased when BayesR was applied with the 50 k SNP array (0.02 and 0.03, respectively) and was further improved by combining the 50 k array with the top-SNPs (0.06 and 0.04, respectively). By contrast, using BayesR resulted in limited improvement for EMA and MS. wGBLUP did not improve accuracy but increased prediction bias. Based on the RNA-seq experiment, we identified informative expression quantitative trait loci, which, when used in GBLUP, improved the accuracy of prediction slightly, i.e. between 0.01 and 0.02. SNPs that were located in genes, the expression of which was associated with differences in trait phenotype, did not contribute to a higher prediction accuracy. 
Conclusions Our results show that, in Hanwoo beef cattle, when SNPs are pre-selected from GWAS on imputed sequence data, the accuracy of prediction improves only slightly whereas the contribution of SNPs that are selected based on gene expression is not significant. The benefit of statistical models to prioritize selected SNPs for estimating genomic breeding values is trait-specific and depends on the genetic architecture of each trait.


Author(s):  
Gururaj Tejeshwar ◽  
Siddesh Gaddadadevra Mat

Introduction: The primary structure of a protein is a polypeptide chain made up of a sequence of amino acids. Interactions between the atoms of the backbone fold the polypeptide locally, giving rise to the secondary structure. Alignments can be made more accurate by the inclusion of secondary structure information. Objective: It is difficult to identify the sequence information embedded in the secondary structure of the protein. However, deep learning methods can be used to identify the sequence information in protein structures. Methods: The scope of the proposed work is to increase the accuracy of identifying the sequence information in the primary structure and the tertiary structure, thereby increasing the accuracy of the predicted protein secondary structure (PSS). In this proposed work, homology is eliminated by a Recurrent Neural Network (RNN) based network that consists of three layers, namely a bi-directional Long Short-Term Memory (LSTM) layer, a time-distributed layer and a Softmax layer. Results: The proposed LDS model achieves an accuracy of approx. 86% for the prediction of the three-state secondary structure of the protein. Conclusion: The gap between the number of protein primary structures and secondary structures we know is huge and increasing. Machine learning is trying to reduce this gap. In most previous attempts at predicting the secondary structure of proteins, the data are divided according to the homology of the proteins. This limits the efficiency of the predicting model and limits the inputs given to such models. Hence, in our model, homology has not been considered while collecting the data for training or testing our model. As a result, our model is not affected by the homology of the protein fed to it, which removes that restriction, so any protein can be fed to it.
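Accuracy for three-state secondary structure prediction is conventionally reported as the per-residue fraction of correctly assigned states (often called Q3). The abstract does not say exactly how its figure was computed, so the sketch below shows only this common definition; the function name `q3_accuracy` and the H/E/C state labels are assumptions.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one.

    Both arguments are equal-length strings over an assumed three-state
    alphabet: H (helix), E (strand), C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


if __name__ == "__main__":
    print(q3_accuracy("HHHEECCC", "HHHEECCH"))
```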

