DeeReCT-TSS: A novel meta-learning-based method annotates TSS in multiple cell types based on DNA sequences and RNA-seq data

Author(s):  
Juexiao Zhou ◽  
Bin Zhang ◽  
Haoyang Li ◽  
Longxi Zhou ◽  
Zhongxiao Li ◽  
...  

Abstract: The accurate annotation of transcription start sites (TSSs) and their usage is critical for the mechanistic understanding of gene regulation under different biological contexts. To this end, on one hand, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner. On the other hand, various computational tools have been developed for in silico prediction of TSSs based solely on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset and thus produce large numbers of false positive predictions when applied at genome scale. To address these issues, we present DeeReCT-TSS, a deep-learning-based method that is capable of identifying TSSs across the whole genome based on both DNA sequences and conventional RNA-seq data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous TSS annotation on 10 cell types, which enables the identification of cell-type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets from the ENCODE project by correlating our predicted TSSs with experimentally defined TSS chromatin states. Our application, pre-trained models and data are available at https://github.com/JoshuaChou2018/DeeReCT-TSS_release.
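
The abstract's point about balanced training sets can be made concrete with a small calculation. The sketch below is not taken from DeeReCT-TSS; the sensitivity, false positive rate, and site counts are hypothetical values chosen only to show how precision falls once the same classifier is applied to a genome-wide candidate set in which true TSSs are rare.

```python
def precision(sensitivity: float, false_positive_rate: float,
              n_pos: int, n_neg: int) -> float:
    """Fraction of positive calls that are true positives."""
    true_positives = sensitivity * n_pos
    false_positives = false_positive_rate * n_neg
    return true_positives / (true_positives + false_positives)


# Hypothetical classifier: 95% sensitivity, 5% false positive rate.
SENSITIVITY = 0.95
FALSE_POSITIVE_RATE = 0.05

# Balanced benchmark: as many non-TSS sites as TSS sites (assumed counts).
balanced = precision(SENSITIVITY, FALSE_POSITIVE_RATE, 10_000, 10_000)

# Genome-wide scan: the same TSSs among 1000x more non-TSS positions (assumed).
genome_wide = precision(SENSITIVITY, FALSE_POSITIVE_RATE, 10_000, 10_000_000)

print(f"precision on balanced data: {balanced:.3f}")
print(f"precision genome-wide:      {genome_wide:.3f}")
```

Under these assumed numbers the classifier's error rates are unchanged, yet most genome-wide positive calls are false, because the negatives outnumber the positives by three orders of magnitude.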

2021 ◽  
Author(s):  
Juexiao Zhou ◽  
Bin Zhang ◽  
Haoyang Li ◽  
Longxi Zhou ◽  
Zhongxiao Li ◽  
...  

The accurate annotation of TSSs and their usage is critical for the mechanistic understanding of gene regulation under different biological contexts. To this end, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner. Various computational tools have also been developed for in silico prediction of TSSs based solely on genomic sequences. Most of these tools produce large numbers of false positive predictions when applied at genome scale. Here, we present DeeReCT-TSS, a deep-learning-based method that is capable of identifying TSSs across the whole genome based on DNA sequences and conventional RNA-seq data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous TSS annotation on 10 cell types, which enables the identification of cell-type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets from the ENCODE project by correlating our predicted TSSs with experimentally defined TSS chromatin states.


2020 ◽  
Vol 295 (12) ◽  
pp. 3990-4000 ◽  
Author(s):  
Sandeep Singh ◽  
Karol Szlachta ◽  
Arkadi Manukyan ◽  
Heather M. Raimer ◽  
Manikarna Dinda ◽  
...  

DNA double-stranded breaks (DSBs) are strongly associated with active transcription, and promoter-proximal pausing of RNA polymerase II (Pol II) is a critical step in transcriptional regulation. Mapping the distribution of DSBs along actively expressed genes and identifying the location of DSBs relative to pausing sites can provide mechanistic insights into transcriptional regulation. Using genome-wide DNA break mapping/sequencing techniques at single-nucleotide resolution in human cells, we found that DSBs are preferentially located around transcription start sites of highly transcribed and paused genes and that Pol II promoter-proximal pausing sites are enriched in DSBs. We observed that DSB frequency at pausing sites increases as the strength of pausing increases, regardless of whether the pausing sites are near or far from annotated transcription start sites. Inhibition of topoisomerase I and II by camptothecin and etoposide treatment, respectively, increased DSBs at the pausing sites as the concentrations of drugs increased, demonstrating the involvement of topoisomerases in DSB generation at the pausing sites. DNA breaks generated by topoisomerases are short-lived because of the religation activity of these enzymes, which these drugs inhibit; therefore, the observation of increased DSBs with increasing drug doses at pausing sites indicated active recruitment of topoisomerases to these sites. Furthermore, the enrichment and locations of DSBs at pausing sites were shared among different cell types, suggesting that Pol II promoter-proximal pausing is a common regulatory mechanism. Our findings support a model in which topoisomerases participate in Pol II promoter-proximal pausing and indicate that DSBs at pausing sites contribute to transcriptional activation.


2016 ◽  
Author(s):  
Francisco Avila Cobos ◽  
Jasper Anckaert ◽  
Pieter-Jan Volders ◽  
Dries Rombaut ◽  
Jo Vandesompele ◽  
...  

Summary: Reconstructing transcript models from RNA-sequencing (RNA-seq) data and establishing these as independent transcriptional units can be a challenging task. The Zipper plot is an application that enables users to interrogate putative transcription start sites (TSSs) in relation to various features that are indicative of transcriptional activity. These features are obtained from publicly available datasets including CAGE-sequencing (CAGE-seq), ChIP-sequencing (ChIP-seq) for histone marks and DNase-sequencing (DNase-seq). The Zipper plot application requires three input fields (chromosome, genomic coordinate (hg19) of the TSS and strand) and generates a report that includes a detailed summary table, a Zipper plot and several statistics derived from this plot. Availability and Implementation: The Zipper plot is implemented using the statistical programming language R and is freely available at http://[email protected]; [email protected]; [email protected] Supplementary information: Supplementary Methods available online.


2021 ◽  
Author(s):  
Jill E Moore ◽  
Xiao-Ou Zhang ◽  
Shaimae I Elhajjajy ◽  
Kaili Fan ◽  
Fairlie Reese ◽  
...  

Accurate transcription start site (TSS) annotations are essential for understanding transcriptional regulation and its role in human disease. Gene collections such as GENCODE contain annotations for tens of thousands of TSSs, but not all of these annotations are experimentally validated nor do they contain information on cell type-specific usage. Therefore, we sought to generate a collection of experimentally validated TSSs by integrating RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression (RAMPAGE) data from 115 cell and tissue types, which resulted in a collection of approximately 50 thousand representative RAMPAGE peaks. These peaks were primarily proximal to GENCODE-annotated TSSs and were concordant with other transcription assays. Because RAMPAGE uses paired-end reads, we were then able to connect peaks to transcripts by analyzing the genomic positions of the 3' ends of read mates. Using this paired-end information, we classified the vast majority (37 thousand) of our RAMPAGE peaks as verified TSSs, updating TSS annotations for 20% of GENCODE genes. We also found that these updated TSS annotations were supported by epigenomic and other transcriptomic datasets. To demonstrate the utility of this RAMPAGE rPeak collection, we intersected it with the NHGRI/EBI GWAS catalog and identified new candidate GWAS genes. Overall, our work demonstrates the importance of integrating experimental data to further refine TSS annotations and provides a valuable resource for the biological community.


2021 ◽  
Author(s):  
Jose M. G. Vilar ◽  
Leonor Saiz

The prevalent one-dimensional alignment of genomic signals to a reference landmark is a cornerstone of current methods to study transcription and its DNA-dependent processes, but it is prone to mask potential relations among multiple DNA elements. We developed a systematic approach to align genomic signals to multiple locations simultaneously by expanding the dimensionality of the genomic-coordinate space. We analyzed transcription in humans and uncovered a complex dependence on the relative position of neighboring transcription start sites (TSSs) that is consistently conserved among cell types. The dependence ranges from enhancement to suppression of transcription depending on the relative distances to the TSSs, their intragenic position, and the transcriptional activity of the gene. Our results reveal a conserved hierarchy of alternative TSS usage within a previously unrecognized level of genomic organization and provide a general methodology to analyze complex functional relationships among multiple types of DNA elements.


2019 ◽  
Author(s):  
Bo Yan ◽  
George Tzertzinis ◽  
Ira Schildkraut ◽  
Laurence Ettwiller

Abstract: Methodologies for determining eukaryotic Transcription Start Sites (TSS) rely on the selection of the 5’ canonical cap structure of Pol-II transcripts and consequently ignore entire classes of TSS derived from other RNA polymerases, which play critical roles in various cell functions. To overcome this limitation, we developed ReCappable-seq and identified TSS from Pol-II and non-Pol-II transcripts at nucleotide resolution. Applied to the human transcriptome, ReCappable-seq identifies Pol-II TSS with higher specificity than CAGE and reveals a rich landscape of TSS associated notably with Pol-III transcripts, which were previously not possible to study on a genome-wide scale. Novel TSS consistent with non-Pol-II transcripts can be found in the nuclear and mitochondrial genomes. By identifying TSS derived from all RNA polymerases, ReCappable-seq reveals distinct epigenetic marks among Pol-II and non-Pol-II TSS and provides a unique opportunity to concurrently interrogate the regulatory landscape of coding and non-coding RNA.


2017 ◽  
Author(s):  
Charles Cole ◽  
Ashley Byrne ◽  
Anna E. Beaudin ◽  
E. Camilla Forsberg ◽  
Christopher Vollmers

Abstract: RNA-seq is a powerful technique to investigate and quantify entire transcriptomes. Recent advances in the field have made it possible to explore the transcriptomes of single cells. However, most widely used RNA-seq protocols fail to provide crucial information regarding transcription start sites. Here we present a protocol, Tn5Prime, that takes advantage of the Tn5 transposase-based Smartseq2 protocol to create RNA-seq libraries that capture the 5’ end of transcripts. The Tn5Prime method dramatically streamlines the 5’ capture process and is both cost-effective and reliable. By applying Tn5Prime to bulk RNA and single-cell samples, we were able to define transcription start sites as well as quantify transcriptomes with high accuracy and reproducibility. Additionally, similar to 3’-end-based high-throughput methods like Drop-Seq and 10X Genomics Chromium, the 5’ capture Tn5Prime method allows the introduction of cellular identifiers during reverse transcription, simplifying the analysis of large numbers of single cells. In contrast to 3’-end-based methods, Tn5Prime also enables the assembly of the variable 5’ ends of antibody sequences present in single B-cell data. Therefore, Tn5Prime presents a robust tool for both basic and applied research into the adaptive immune system and beyond.


2021 ◽  
Author(s):  
Olga Borisovna Botvinnik ◽  
Pranathi Vemuri ◽  
N. Tessa Pierce Ward ◽  
Phoenix Aja Logan ◽  
Saba Nafees ◽  
...  

Single-cell RNA-seq (scRNA-seq) is a powerful tool for cell type identification but is not readily applicable to organisms without well-annotated reference genomes. Of the approximately 10 million animal species predicted to exist on Earth, >99.9% do not have any submitted genome assembly. To enable scRNA-seq for the vast majority of animals on the planet, here we introduce the concept of "k-mer homology," combining biochemical synonyms in degenerate protein alphabets with uniform data subsampling via MinHash into a pipeline called Kmermaid, to directly detect similar cell types across species from transcriptomic data without the need for a reference genome. Underpinning Kmermaid is the tool Orpheum, a memory-efficient method for extracting high-confidence protein-coding sequences from RNA-seq data. After validating Kmermaid using datasets from human and mouse lung, we applied Kmermaid to the Chinese horseshoe bat (Rhinolophus sinicus), where we propagated cellular compartment labels at high fidelity. Our pipeline provides a high-throughput tool that enables analyses of transcriptomic data across divergent species' transcriptomes in a genome- and gene annotation-agnostic manner. Thus, the combination of Kmermaid and Orpheum identifies cell type-specific sequences that may be missing from genome annotations and empowers molecular cellular phenotyping for novel model organisms and species.
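
The "k-mer homology" idea in this abstract, comparing k-mers after collapsing biochemically similar amino acids and subsampling them with MinHash, can be illustrated with a minimal sketch. This is not Kmermaid's or Orpheum's code: the six-group alphabet is an assumed Dayhoff-style grouping, and the function names, k-mer length, and sketch size are hypothetical choices made for illustration.

```python
import hashlib

# Assumed six-group reduced alphabet (Dayhoff-style); illustrative only.
GROUPS = {"a": "C", "b": "AGPST", "c": "DENQ", "d": "HKR", "e": "ILMV", "f": "FWY"}
GROUP_OF = {aa: code for code, members in GROUPS.items() for aa in members}


def encode(protein: str) -> str:
    """Map each residue to its group code; unknown residues become 'x'."""
    return "".join(GROUP_OF.get(aa, "x") for aa in protein.upper())


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(items: set, size: int) -> set:
    """Bottom-k MinHash sketch: keep the `size` smallest 64-bit hashes."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        for item in items
    )
    return set(hashes[:size])


def similarity(a: set, b: set, size: int) -> float:
    """Estimate Jaccard similarity from two bottom-k sketches."""
    smallest_union = sorted(a | b)[:size]
    if not smallest_union:
        return 0.0
    shared = sum(1 for h in smallest_union if h in a and h in b)
    return shared / len(smallest_union)


def compare(p1: str, p2: str, k: int = 3, size: int = 64) -> float:
    """Similarity of two protein sequences in the reduced alphabet."""
    s1 = sketch(kmers(encode(p1), k), size)
    s2 = sketch(kmers(encode(p2), k), size)
    return similarity(s1, s2, size)


# Conservative substitutions (K/R, V/I, A/S, D/E) vanish after encoding.
print(encode("MKVLAD"), encode("MRILSE"))
print(compare("MKVLAD", "MRILSE"))
```

In this toy case the two peptides differ at four of six positions, yet every substitution stays within a group, so their reduced-alphabet k-mers coincide. That tolerance to conservative substitutions is what lets k-mers be matched across divergent species.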


Author(s):  
Emily Warman ◽  
David Forrest ◽  
Joseph T. Wade ◽  
David C. Grainger

Abstract: Promoters are DNA sequences that stimulate the initiation of transcription. In all prokaryotes, promoters are believed to drive transcription in a single direction. Here we show that prokaryotic promoters are frequently bidirectional and drive divergent transcription. Mechanistically, this occurs because key promoter elements have inherent symmetry and often coincide on opposite DNA strands. Reciprocal stimulation between divergent transcription start sites also contributes. Horizontally acquired DNA is enriched for bidirectional promoters, suggesting that they represent an early step in prokaryotic promoter evolution.


2021 ◽  
Author(s):  
Yanling Peng ◽  
Qitong Huang ◽  
Rui Kamada ◽  
Keiko Ozato ◽  
Yubo Zhang ◽  
...  

Alternative transcription start site (TSS) usage plays a critical role in gene transcription regulation in mammals. However, precisely identifying alternative TSSs remains challenging at the genome-wide level. Here, we report a single-cell genomic technology for alternative TSS annotation and cell heterogeneity detection. In this method, we use the Fluidigm C1 system to capture individual cells of interest, the SMARTer cDNA synthesis kit to recover full-length cDNAs, and then a dual priming oligonucleotide system to specifically enrich TSSs for genomic analysis. We apply this method to a genome-wide study of alternative TSS identification in two different IFN-β-stimulated mouse embryonic fibroblasts (MEFs). We quantify the performance of our method and find it as accurate as other single-cell methods for the detection of TSSs. Our data also clearly discriminate the two IFN-β-stimulated MEFs. Moreover, our results indicate that 82% of expressed genes in these two cell types contain multiple TSSs, which is much higher than previous predictions based on CAGE (58%) or empirical determination (54%) in various cell types. This indicates that alternative TSSs are more pervasive than expected and implies that our strategy could locate them with unprecedented sensitivity. This should help elucidate their biological significance in the future.

