Computational studies on RNA processing in higher eukaryotes

2021 ◽  
Author(s):  
Mirko Brüggemann

Most cellular processes are regulated by RNA-binding proteins (RBPs). These RBPs usually use defined binding sites to recognize and directly interact with their target RNA molecules. Individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP) experiments are an important tool to describe such interactions in cell cultures in vivo. This experimental protocol yields millions of individual sequencing reads from which the binding spectrum of the RBP under study can be deduced. In this PhD thesis, I studied how RNA processing is driven by RBP binding by analyzing iCLIP-derived sequencing datasets. First, I described a complete data analysis pipeline to detect RBP binding sites from iCLIP sequencing reads. This workflow covers all essential processing steps, from the first quality control to the final annotation of binding sites. I described the accurate integration of biological iCLIP replicates to boost the initial peak calling step while ensuring high specificity through replicate reproducibility analysis. Further, I proposed a routine to level binding site width to streamline downstream analysis processes. This was exemplified in the reanalysis of the binding spectrum of the U2 small nuclear RNA auxiliary factor 2 (U2AF2, U2AF65). I recaptured the known tendency of U2AF65 to bind predominantly to intronic sequences of protein-coding genes, where it likely recognizes the polypyrimidine tract as part of the core spliceosome machinery. In the second part of my thesis, I analyzed the binding spectrum of the serine and arginine rich splicing factor 6 (SRSF6) in the context of diabetes. In pancreatic beta-cells, the expression of SRSF6 is regulated by the transcription factor GLIS3, which is encoded by a diabetes susceptibility gene. It is known that SRSF6 promotes beta-cell death through the splicing dysregulation of genes essential to beta-cell function and survival. However, the exact mechanism by which these RNAs are targeted by SRSF6 remains poorly understood. Here, I applied the defined iCLIP processing pipeline to describe the binding landscape of the splicing factor SRSF6 in the human pancreatic beta-cell line EndoC-βH1. The initial binding site definition revealed predominant binding to coding sequences (CDS) of protein-coding genes. This was followed up by extensive motif analysis, which revealed a purine-rich binding motif not previously described in humans. SRSF6 seemed to specifically recognize repetitions of the triplet GAA. I also showed that the number of contiguous triplets correlated with increasing binding site strength. I further integrated RNA-sequencing data from the same cell type, under SRSF6 knockdown (KD) and basal conditions, to analyze SRSF6-related splicing changes. I showed that the exact positioning of SRSF6 on alternatively spliced exons regulates the produced transcript isoforms. This mechanism seemed to control exons in several known susceptibility genes for diabetes. In summary, in my PhD thesis, I presented a comprehensive workflow for the processing of iCLIP-derived sequencing data. I applied this pipeline to a dataset from pancreatic beta-cells to unveil the impact of SRSF6-mediated splicing changes. Thus, my analysis provides novel insights into the regulation of diabetes susceptibility genes.
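The abstract relates the number of contiguous GAA triplets to binding site strength. As a minimal sketch of how such a count could be taken from a binding site sequence, the snippet below finds the longest run of back-to-back GAA triplets. The function name `longest_gaa_run` and the example sequences are hypothetical and are not taken from the thesis; the thesis' own motif analysis may work quite differently.

```python
import re


def longest_gaa_run(seq: str) -> int:
    """Return the largest number of back-to-back GAA triplets in `seq`.

    Illustrative helper only (hypothetical name, not from the thesis).
    Input is case-insensitive; RNA 'U' is mapped to 'T' for uniformity.
    """
    seq = seq.upper().replace("U", "T")
    # Each match is one maximal run of consecutive GAA triplets.
    runs = re.findall(r"(?:GAA)+", seq)
    return max((len(run) // 3 for run in runs), default=0)


if __name__ == "__main__":
    # Made-up example sequences, for illustration only.
    for site in ["CCGAAGAAGAATT", "GAAGAACGAA", "ACGTACGT"]:
        print(site, longest_gaa_run(site))
```

A per-site count like this could then be compared against a binding strength score, for example by grouping sites by run length.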

2019 ◽  
Vol 8 (31) ◽  
Author(s):  
Rikky W. Purbojati ◽  
Daniela I. Drautz-Moses ◽  
Akira Uchida ◽  
Anthony Wong ◽  
Megan E. Clare ◽  
...  

Brevundimonas sp. strain SGAir0440 was isolated from indoor air samples collected in Singapore. Its genome was assembled using single-molecule real-time sequencing data, resulting in one circular chromosome with a length of 3.1 Mbp. The genome consists of 3,033 protein-coding genes, 48 tRNAs, and 6 rRNA operons.


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Jun Yao ◽  
Douglas C Wu ◽  
Ryan M Nottingham ◽  
Alan M Lambowitz

Human plasma contains > 40,000 different coding and non-coding RNAs that are potential biomarkers for human diseases. Here, we used thermostable group II intron reverse transcriptase sequencing (TGIRT-seq) combined with peak calling to simultaneously profile all RNA biotypes in apheresis-prepared human plasma pooled from healthy individuals. Extending previous TGIRT-seq analysis, we found that human plasma contains largely fragmented mRNAs from > 19,000 protein-coding genes, abundant full-length, mature tRNAs and other structured small non-coding RNAs, and less abundant tRNA fragments and mature and pre-miRNAs. Many of the mRNA fragments identified by peak calling correspond to annotated protein-binding sites and/or have stable predicted secondary structures that could afford protection from plasma nucleases. Peak calling also identified novel repeat RNAs, miRNA-sized RNAs, and putatively structured intron RNAs of potential biological, evolutionary, and biomarker significance, including a family of full-length excised intron RNAs, subsets of which correspond to mirtron pre-miRNAs or agotrons.


GigaScience ◽  
2019 ◽  
Vol 8 (10) ◽  
Author(s):  
Yunhai Guo ◽  
Yi Zhang ◽  
Qin Liu ◽  
Yun Huang ◽  
Guangyao Mao ◽  
...  

Abstract Background: Achatina fulica, the giant African snail, is the largest terrestrial mollusk species. Owing to its voracious appetite, wide environmental adaptability, high growth rate, and reproductive capacity, it has become an invasive species across the world, mainly in Southeast Asia, Japan, the western Pacific islands, and China. This pest can damage agricultural crops and is an intermediate host of many parasites that can threaten human health. However, genomic information of A. fulica remains limited, hindering genetic and genomic studies for invasion control and management of the species. Findings: Using a k-mer–based method, we estimated the A. fulica genome size to be 2.12 Gb, with a high repeat content up to 71%. Roughly 101.6 Gb genomic long-read data of A. fulica were generated from the Pacific Biosciences sequencing platform and assembled to produce a first A. fulica genome of 1.85 Gb with a contig N50 length of 726 kb. Using contact information from the Hi-C sequencing data, we successfully anchored 99.32% contig sequences into 31 chromosomes, leading to the final contig and scaffold N50 length of 721 kb and 59.6 Mb, respectively. The continuity, completeness, and accuracy were evaluated by genome comparison with other mollusk genomes, BUSCO assessment, and genomic read mapping. A total of 23,726 protein-coding genes were predicted from the assembled genome, among which 96.34% of the genes were functionally annotated. The phylogenetic analysis using whole-genome protein-coding genes revealed that A. fulica separated from a common ancestor with Biomphalaria glabrata ∼182 million years ago. Conclusion: To our knowledge, the A. fulica genome is the first terrestrial mollusk genome published to date. The chromosome sequence of A. fulica will provide the research community with a valuable resource for population genetics and environmental adaptation studies for the species, as well as investigations of the chromosome-level of evolution within mollusks.
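The contig and scaffold N50 values reported above follow the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. The sketch below illustrates that definition on made-up lengths. The function name `n50` is an assumption chosen for this example, and the code is not the authors' assembly tooling.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths.

    Standard definition: sort lengths in descending order and return the
    length at which the cumulative sum first covers at least half of the
    total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point halves.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy lengths, for illustration only (not from the A. fulica assembly).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```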


2008 ◽  
Vol 36 (4) ◽  
pp. 590-594 ◽  
Author(s):  
Sylvain Egloff ◽  
Dawn O'Reilly ◽  
Shona Murphy

In addition to protein-coding genes, mammalian pol II (RNA polymerase II) transcribes independent genes for some non-coding RNAs, including the spliceosomal U1 and U2 snRNAs (small nuclear RNAs). snRNA genes differ from protein-coding genes in several key respects and some of the mechanisms involved in expression are gene-type-specific. For example, snRNA gene promoters contain an essential PSE (proximal sequence element) unique to these genes, the RNA-encoding regions contain no introns, elongation of transcription is P-TEFb (positive transcription elongation factor b)-independent and RNA 3′-end formation is directed by a 3′-box rather than a cleavage and polyadenylation signal. However, the CTD (C-terminal domain) of pol II closely couples transcription with RNA 5′ and 3′ processing in expression of both gene types. Recently, it was shown that snRNA promoter-specific recognition of the 3′-box RNA processing signal requires a novel phosphorylation mark on the pol II CTD. This new mark plays a critical role in the recruitment of the snRNA gene-specific RNA-processing complex, Integrator. These new findings provide the first example of a phosphorylation mark on the CTD heptapeptide that can be read in a gene-type-specific manner, reinforcing the notion of a CTD code. Here, we review the control of expression of snRNA genes from initiation to termination of transcription.


2021 ◽  
Author(s):  
Noah Dukler ◽  
Mehreen R Mughal ◽  
Ritika Ramani ◽  
Yi-Fei Huang ◽  
Adam Siepel

Genome sequencing of tens of thousands of human individuals has recently enabled the measurement of large selective effects for mutations to protein-coding genes. Here we describe a new method, called ExtRaINSIGHT, for measuring similar selective effects at individual sites in noncoding as well as in coding regions of the human genome. ExtRaINSIGHT estimates the prevalence of strong purifying selection, or "ultraselection" (λs), as the fractional depletion of rare single-nucleotide variants (minor allele frequency <0.1%) in a target set of genomic sites relative to matched sites that are putatively neutrally evolving, in a manner that controls for local variation and neighbor-dependence in mutation rate. We show using simulations that, above an appropriate threshold, λs is closely related to the average site-specific selection coefficient against heterozygous point mutations, as predicted at mutation-selection balance. Applying ExtRaINSIGHT to 71,702 whole genome sequences from gnomAD v3, we find particularly strong evidence of ultraselection in evolutionarily ancient miRNAs and neuronal protein-coding genes, as well as at splice sites. Moreover, our estimated selection coefficient against heterozygous amino-acid replacements across the genome (at 1.4%) is substantially larger than previous estimates based on smaller sample sizes. By contrast, we find weak evidence of ultraselection in other noncoding RNAs and transcription factor binding sites, and only modest evidence in ultraconserved elements and human accelerated regions. We estimate that ~0.3-0.5% of the human genome is ultraselected, with one third to one half of ultraselected sites falling in coding regions. These estimates suggest ~0.3-0.4 lethal or nearly lethal de novo mutations per potential human zygote, together with ~2 de novo mutations that are more weakly deleterious. Overall, our study sheds new light on the genome-wide distribution of fitness effects for new point mutations by combining deep new sequencing data sets and classical theory from population genetics.
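To make "fractional depletion of rare variants" concrete, the sketch below computes a naive version of the quantity: one minus the ratio of the rare-variant rate at target sites to the rate at matched neutral sites. This is only an illustration of the idea. It is not the ExtRaINSIGHT model, which according to the abstract also controls for local variation and neighbor-dependence in mutation rate. The function name, argument names, and counts are all hypothetical.

```python
def fractional_depletion(
    target_rare: int,
    target_sites: int,
    neutral_rare: int,
    neutral_sites: int,
) -> float:
    """Naive depletion of rare variants in target sites vs. neutral sites.

    Returns 1 - (target rate / neutral rate). A value near 0 means no
    depletion; a value near 1 means rare variants are almost absent from
    the target set. Simplified illustration, NOT the ExtRaINSIGHT estimator.
    """
    if target_sites <= 0 or neutral_sites <= 0:
        raise ValueError("site counts must be positive")
    if neutral_rare <= 0:
        raise ValueError("neutral set must contain rare variants")
    target_rate = target_rare / target_sites
    neutral_rate = neutral_rare / neutral_sites
    return 1.0 - target_rate / neutral_rate


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(fractional_depletion(30, 1000, 50, 1000))
```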


2013 ◽  
Vol 42 (5) ◽  
pp. 2820-2832 ◽  
Author(s):  
Nicolas Philippe ◽  
Elias Bou Samra ◽  
Anthony Boureux ◽  
Alban Mancheron ◽  
Florence Rufflé ◽  
...  

Abstract Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.


1999 ◽  
Vol 144 (4) ◽  
pp. 617-629 ◽  
Author(s):  
Kelly P. Smith ◽  
Phillip T. Moen ◽  
Karen L. Wydner ◽  
John R. Coleman ◽  
Jeanne B. Lawrence

Analysis of six endogenous pre-mRNAs demonstrates that localization at the periphery or within splicing factor-rich (SC-35) domains is not restricted to a few unusually abundant pre-mRNAs, but is apparently a more common paradigm of many protein-coding genes. Different genes are preferentially transcribed and their RNAs processed in different compartments relative to SC-35 domains. These differences do not simply correlate with the complexity, nuclear abundance, or position within overall nuclear space. The distribution of spliceosome assembly factor SC-35 did not simply mirror the distribution of individual pre-mRNAs, but rather suggested that individual domains contain both specific pre-mRNA(s) as well as excess splicing factors. This is consistent with a multifunctional compartment, to which some gene loci and their RNAs have access and others do not. Despite similar molar abundance in muscle fiber nuclei, nascent transcript “trees” of highly complex dystrophin RNA are cotranscriptionally spliced outside of SC-35 domains, whereas posttranscriptional “tracks” of more mature myosin heavy chain transcripts overlap domains. Further analyses supported that endogenous pre-mRNAs exhibit distinct structural organization that may reflect not only the expression and complexity of the gene, but also constraints of its chromosomal context and kinetics of its RNA metabolism.


2021 ◽  
Vol 12 ◽  
Author(s):  
Yongle Hu ◽  
Dongna Ma ◽  
Shuju Ning ◽  
Qi Ye ◽  
Xuanxuan Zhao ◽  
...  

Strobilanthes cusia (Nees) Kuntze is an important plant used to process the traditional Chinese herbal medicines “Qingdai” and “Nanbanlangen”. The key active ingredients are indole alkaloids (IAs) that exert antibacterial, antiviral, and antitumor pharmacological activities and serve as natural dyes. We assembled the S. cusia genome at the chromosome level through combined PacBio circular consensus sequencing (CCS) and Hi-C sequencing data. Hi-C data revealed a draft genome size of 913.74 Mb, with 904.18 Mb of contigs anchored into 16 pseudo-chromosomes. Contig N50 and scaffold N50 were 35.59 and 68.44 Mb, respectively. Of the 32,974 predicted protein-coding genes, 96.52% were functionally annotated in public databases. We predicted 675.66 Mb of repetitive sequences; 47.08% of sequences were long terminal repeat (LTR) retrotransposons. Moreover, 983 Strobilanthes-specific genes (SSGs) were identified for the first time, accounting for ~2.98% of all protein-coding genes. Further, 245 putative centromeric and 29 putative telomeric fragments were identified. The transcriptome analysis identified 2,975 differentially expressed genes (DEGs) enriched in phenylpropanoid, flavonoid, and triterpenoid biosynthesis. This systematic characterization of key enzyme-coding genes associated with the IA pathway and the basic helix-loop-helix (bHLH) transcription factor family formed a network from the shikimate pathway to the indole alkaloid synthesis pathway in S. cusia. The high-quality S. cusia genome presented herein is an essential resource for traditional Chinese medicine genomics studies and for understanding the genetic underpinning of IA biosynthesis.


2019 ◽  
Author(s):  
Alice Lunardon ◽  
Nathan R. Johnson ◽  
Emily Hagerott ◽  
Tamia Phifer ◽  
Seth Polydore ◽  
...  

Abstract Plant endogenous small RNAs (sRNAs) are important regulators of gene expression. There are two broad categories of plant sRNAs: microRNAs (miRNAs) and endogenous short interfering RNAs (siRNAs). MicroRNA loci are relatively well-annotated but comprise only a small minority of the total sRNA pool; siRNA locus annotations have lagged far behind. Here, we used a large dataset of published and newly generated sRNA sequencing data (1,333 sRNA-seq libraries containing over 20 billion reads) and a uniform bioinformatic pipeline to produce comprehensive sRNA locus annotations of 47 diverse plants, yielding over 2.7 million sRNA loci. The two most numerous classes of siRNA loci produced mainly 24 nucleotide and 21 nucleotide siRNAs, respectively. 24 nucleotide-dominated siRNA loci usually occurred in intergenic regions, especially at the 5’-flanking regions of protein-coding genes. In contrast, 21 nucleotide-dominated siRNA loci were most often derived from double-stranded RNA precursors copied from spliced mRNAs. Genic 21 nucleotide-dominated loci were especially common from disease resistance genes, including from a large number of monocots. Individual siRNA sequences of all types showed very little conservation across species, while mature miRNAs were more likely to be conserved. We developed a web server where our data and several search and analysis tools are freely accessible at http://plantsmallrnagenes.science.psu.edu.


2019 ◽  
Vol 39 (2) ◽  
Author(s):  
Yingai Zhang ◽  
Shunlan Wang ◽  
Jingchuan Xiao ◽  
Hailong Zhou

Abstract Hepatocellular carcinoma (HCC) is the most frequent primary liver cancer and has a poor outcome. The present study aimed to investigate the key genes implicated in the progression and prognosis of HCC. The RNA-sequencing data of HCC were extracted from The Cancer Genome Atlas (TCGA) database. Using the R package DESeq, the differentially expressed genes (DEGs) were analyzed. Based on the CluePedia plug-in in Cytoscape software, enrichment analysis for the protein-coding genes amongst the DEGs was conducted. Subsequently, a protein–protein interaction (PPI) network was built with Cytoscape software. Using the survival package, the genes that could distinguish the survival differences of the HCC samples were explored. Moreover, quantitative real-time reverse transcription-PCR (qRT-PCR) experiments were used to detect the expression of key genes. There were 2193 DEGs in HCC samples. For the protein-coding genes amongst the DEGs, multiple functional terms and pathways were enriched. In the PPI network, cyclin-dependent kinase 1 (CDK1), polo-like kinase 1 (PLK1), Fos proto-oncogene, AP-1 transcription factor subunit (FOS), serum amyloid A1 (SAA1), and lysophosphatidic acid receptor 3 (LPAR3) were hub nodes. CDK1 interacting with PLK1 and FOS, and LPAR3 interacting with FOS and SAA1, were found in the PPI network. Amongst the 40 network modules, 4 had scores of at least 10. Survival analysis showed that anterior gradient 2 (AGR2) and RLN3 could differentiate the high- and low-risk groups, which was confirmed by qRT-PCR. CDK1, PLK1, FOS, SAA1, and LPAR3 might be key genes affecting the progression of HCC. Besides, AGR2 and RLN3 might be implicated in the prognosis of HCC.

