Quantitative RNA-seq analysis of the Campylobacter jejuni transcriptome

Campylobacter jejuni is the most common bacterial cause of foodborne disease in the developed world. Its general physiology and biochemistry, as well as the mechanisms enabling it to colonize and cause disease in various hosts, are not well understood, and new approaches are required to understand its basic biology. High-throughput sequencing technologies provide unprecedented opportunities for functional genomic research. Recent studies have shown that direct Illumina sequencing of cDNA (RNA-seq) is a useful technique for the quantitative and qualitative examination of transcriptomes. In this study we report RNA-seq analyses of the transcriptomes of C. jejuni (NCTC11168) and its rpoN mutant. This has allowed the identification of hitherto unknown transcriptional units, and further defines the regulon that is dependent on rpoN for expression. The analysis of the NCTC11168 transcriptome was supplemented by additional proteomic analysis using liquid chromatography-MS. The transcriptomic and proteomic datasets represent an important resource for the Campylobacter research community.

Impact of human gene annotations on RNA-seq differential expression analysis

10.21203/rs.3.rs-301856/v1 ◽

2021 ◽

Author(s):

Yu Hamaguchi ◽

Chao Zeng ◽

Michiaki Hamada

Keyword(s):

Differential Expression ◽

High Throughput ◽

High Throughput Sequencing ◽

Human Gene ◽

Gene Annotation ◽

Differential Expression Analysis ◽

Rna Seq ◽

Gene Annotations ◽

Sequencing Technologies ◽

The Impact

Abstract Background: Differential expression (DE) analysis of RNA-seq data typically depends on gene annotations. Different sets of gene annotations are available for the human genome and are continually updated, a process complicated by the development and application of high-throughput sequencing technologies. However, the impact of the complexity of gene annotations on DE analysis remains unclear. Results: Using “mappability”, a metric of the complexity of gene annotation, we compared three distinct human gene annotations, GENCODE, RefSeq, and NONCODE, and evaluated how mappability affected DE analysis. We found that mappability was significantly different among the human gene annotations. We also found that increasing mappability improved the performance of DE analysis, and that the impact of mappability was mainly evident in the quantification step and propagated downstream of DE analysis systematically. Conclusions: We assessed how the complexity of gene annotations affects DE analysis using mappability. Our findings indicate that the growth and complexity of gene annotations negatively impact the performance of DE analysis, suggesting that an approach that excludes unnecessary gene models from gene annotations improves the performance of DE analysis.

Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression

Bioinformatics ◽

10.1093/bioinformatics/bty936 ◽

2018 ◽

Vol 35 (12) ◽

pp. 2066-2074 ◽

Cited By ~ 11

Author(s):

Yuansheng Liu ◽

Zuguo Yu ◽

Marcel E Dinger ◽

Jinyan Li

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Reference Sequence ◽

Supplementary Information ◽

The Novel ◽

Rna Seq ◽

File Size ◽

Sequencing Technologies ◽

Efficient Storage ◽

Merging Process

Abstract Motivation: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. Results: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix–prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20–80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. This performance is mainly attributed to exploiting the redundancy of the repetitive substrings in the long contigs. Availability and implementation: https://github.com/yuansliu/minicom Supplementary information: Supplementary data are available at Bioinformatics online.
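The indexing idea in this abstract can be illustrated with a minimal sketch. It assumes plain lexicographic ordering of k-mers and ignores reverse complements; the function names (`kmers`, `window_minimizers`, `read_minimizer`, `group_by_minimizer`) are hypothetical and are not taken from the minicom code base, which may order and select minimizers differently.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-mers of seq in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def window_minimizers(seq, w, k):
    """(w, k)-minimizers: the smallest k-mer in every window of w consecutive k-mers.

    Assumption: 'smallest' means lexicographically smallest, ties broken by position.
    Returns a set of (kmer, position) pairs.
    """
    ks = kmers(seq, k)
    selected = set()
    for start in range(len(ks) - w + 1):
        window = ks[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        selected.add((window[offset], start + offset))
    return selected


def read_minimizer(seq, k):
    """Smallest k-mer of a whole read (a single 'large k' minimizer per read)."""
    return min(kmers(seq, k))


def group_by_minimizer(reads, k):
    """Subgroup reads that share the same read-level minimizer."""
    groups = defaultdict(list)
    for read in reads:
        if len(read) >= k:  # reads shorter than k have no k-mer
            groups[read_minimizer(read, k)].append(read)
    return dict(groups)
```

Reads that overlap heavily tend to share their smallest k-mer, so they fall into the same subgroup; this is the property that makes per-subgroup contig construction plausible. In this toy form, `group_by_minimizer(["ACGTAC", "CGTACG", "TTTTGG"], 3)` puts the first two reads together under `ACG`.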

A comprehensive review of scaffolding methods in genome assembly

Briefings in Bioinformatics ◽

10.1093/bib/bbab033 ◽

2021 ◽

Author(s):

Junwei Luo ◽

Yawei Wei ◽

Mengna Lyu ◽

Zhengjiang Wu ◽

Xiaoyan Liu ◽

...

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Rapid Development ◽

Genomic Research ◽

Future Research ◽

Sequencing Data ◽

Sequencing Technologies ◽

Biological Studies ◽

Downstream Analysis

Abstract In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.

Pipeliner: A Nextflow-based framework for the definition of sequencing data processing pipelines

10.1101/476515 ◽

2018 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Tanya Karagiannis ◽

Kritika Karri ◽

Dileep Kishore ◽

Yusuke Koga ◽

...

Keyword(s):

Data Processing ◽

High Throughput Sequencing ◽

Digital Gene Expression ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Computing Environments ◽

Scripting Language ◽

Definition Of ◽

User Friendly

Abstract The advent of high-throughput sequencing technologies has led to the need for flexible and user-friendly data pre-processing platforms. The Pipeliner framework provides an out-of-the-box solution for processing various types of sequencing data. It combines the Nextflow scripting language and Anaconda package manager to generate modular computational workflows. We have used Pipeliner to create several pipelines for sequencing data processing including bulk RNA-seq, single-cell RNA-seq (scRNA-seq), as well as Digital Gene Expression (DGE) data. This report highlights the design methodology behind Pipeliner, which enables the development of highly flexible and reproducible pipelines that are easy to extend and maintain on multiple computing environments. We also provide a quick start user guide demonstrating how to set up and execute available pipelines with toy datasets.

Combining tRNA sequencing methods to characterize plant tRNA expression and post-transcriptional modification

10.1101/790451 ◽

2019 ◽

Author(s):

Jessica M. Warren ◽

Thalia Salinas-Giegé ◽

Guillaume Hummel ◽

Nicole L. Coots ◽

Joshua M. Svendsen ◽

...

Keyword(s):

High Throughput Sequencing ◽

Digital Pcr ◽

Full Length ◽

Trna Genes ◽

Rna Seq ◽

Preparation Methods ◽

Trna Modifications ◽

Sequencing Technologies ◽

Trna Sequencing ◽

Almost All

Abstract Differences in tRNA expression have been implicated in a remarkable number of biological processes. There is growing evidence that tRNA genes can play dramatically different roles depending on both expression and post-transcriptional modification, yet sequencing tRNAs to measure abundance and detect modifications remains challenging. Their secondary structure and extensive post-transcriptional modifications interfere with RNA-seq library preparation methods and have limited the utility of high-throughput sequencing technologies. Here, we combine two modifications to standard RNA-seq methods by treating with the demethylating enzyme AlkB and ligating with tRNA-specific adapters in order to sequence tRNAs from four species of flowering plants, a group that has been shown to have some of the most extensive rates of post-transcriptional tRNA modifications. This protocol has the advantage of detecting full-length tRNAs and sequence variants that can be used to infer many post-transcriptional modifications. We used the resulting data to produce a modification index of almost all unique reference tRNAs in Arabidopsis thaliana, which exhibited many anciently conserved similarities with humans but also positions that appear to be “hot spots” for modifications in angiosperm tRNAs. We also found evidence based on northern blot analysis and droplet digital PCR that, even after demethylation treatment, tRNA-seq can produce highly biased estimates of absolute expression levels, most likely due to biased reverse transcription. Nevertheless, the generation of full-length tRNA sequences with modification data is still promising for assessing differences in relative tRNA expression across treatments, tissues or subcellular fractions, and may help elucidate the functional roles of tRNA modifications.

Impact of human gene annotations on RNA-seq differential expression analysis

BMC Genomics ◽

10.1186/s12864-021-08038-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yu Hamaguchi ◽

Chao Zeng ◽

Michiaki Hamada

Keyword(s):

Differential Expression ◽

High Throughput ◽

High Throughput Sequencing ◽

Human Gene ◽

Gene Annotation ◽

Differential Expression Analysis ◽

Rna Seq ◽

Gene Annotations ◽

Sequencing Technologies ◽

The Impact

Abstract Background: Differential expression (DE) analysis of RNA-seq data typically depends on gene annotations. Different sets of gene annotations are available for the human genome and are continually updated, a process complicated by the development and application of high-throughput sequencing technologies. However, the impact of the complexity of gene annotations on DE analysis remains unclear. Results: Using “mappability”, a metric of the complexity of gene annotation, we compared three distinct human gene annotations, GENCODE, RefSeq, and NONCODE, and evaluated how mappability affected DE analysis. We found that mappability was significantly different among the human gene annotations. We also found that increasing mappability improved the performance of DE analysis, and that the impact of mappability was mainly evident in the quantification step and propagated downstream of DE analysis systematically. Conclusions: We assessed how the complexity of gene annotations affects DE analysis using mappability. Our findings indicate that the growth and complexity of gene annotations negatively impact the performance of DE analysis, suggesting that an approach that excludes unnecessary gene models from gene annotations improves the performance of DE analysis.

Host-virus chimeric events in SARS-CoV2 infected cells are infrequent and artifactual

10.1101/2021.02.17.431704 ◽

2021 ◽

Author(s):

Bingyu Yan ◽

Srishti Chakravorty ◽

Carmen Mirabelli ◽

Luopin Wang ◽

Jorge L. Trujillo-Ochoa ◽

...

Keyword(s):

Genome Rearrangement ◽

High Throughput Sequencing ◽

Rna Virus ◽

Fruit Fly ◽

Template Switching ◽

Rna Seq ◽

Sequencing Technologies ◽

Cellular Genes ◽

Infected Cells ◽

Alignment Errors

Abstract Pathogenic mechanisms underlying severe SARS-CoV2 infection remain largely unelucidated. High-throughput sequencing technologies that capture genome and transcriptome information are key approaches to gain detailed mechanistic insights from infected cells. These techniques readily detect both pathogen and host-derived sequences, providing a means of studying host-pathogen interactions. Recent studies have reported the presence of host-virus chimeric (HVC) RNA in RNA-seq data from SARS-CoV2 infected cells and interpreted these findings as evidence of viral integration in the human genome as a potential pathogenic mechanism. Since SARS-CoV2 is a positive-sense RNA virus that replicates in the cytoplasm and has no nuclear phase in its life cycle, it is biologically unlikely to be in a location where splicing events could result in genome integration. Here, we investigated the biological authenticity of HVC events. In contrast to true biological events such as mRNA splicing and genome rearrangement events, which generate reproducible chimeric sequencing fragments across different biological isolates, we found that HVC events across >100 RNA-seq libraries from patients with COVID-19 and infected cell lines were highly irreproducible. RNA-seq library preparation is inherently error-prone due to random template switching during reverse transcription of RNA to cDNA. By counting chimeric events observed when constructing an RNA-seq library from human RNA and spike-in RNA from an unrelated species, such as fruit fly, we estimated that ~1% of RNA-seq reads are artifactually chimeric. In SARS-CoV2 RNA-seq we found that the frequency of HVC events was, in fact, not greater than this background “noise”. Finally, we developed a novel experimental approach to enrich SARS-CoV2 sequences from bulk RNA of infected cells. This method enriched viral sequences but did not enrich for HVC events, suggesting that the majority of HVC events are, in all likelihood, artifacts of library construction. In conclusion, our findings indicate that HVC events observed in RNA-sequencing libraries from SARS-CoV2 infected cells are extremely rare and are likely artifacts arising from random template switching by reverse transcriptase and/or sequence alignment errors. Therefore, the observed HVC events do not support SARS-CoV2 fusion to cellular genes and/or integration into human genomes.
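The spike-in estimate described in this abstract comes down to counting reads whose aligned segments come from more than one species. A minimal sketch of that counting step follows; the input representation (one list of species labels per read) and the function name `chimeric_fraction` are assumptions made for illustration and do not reflect the authors' pipeline.

```python
def chimeric_fraction(read_segments):
    """Fraction of reads whose aligned segments map to more than one species.

    read_segments: one list of species labels per read, e.g. ["human", "fly"].
    This input format is a hypothetical simplification of real alignment output.
    """
    if not read_segments:
        return 0.0
    chimeric = sum(1 for segments in read_segments if len(set(segments)) > 1)
    return chimeric / len(read_segments)


# Toy input for illustration only; these are not measured data.
toy_reads = [["human"], ["human", "fly"], ["fly"], ["human", "human"]]
toy_fraction = chimeric_fraction(toy_reads)
```

Computing the same fraction for host-virus reads in an infected sample and comparing it with the spike-in background is the kind of comparison the abstract reports.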

Evaluation of tools for long read RNA-seq splice-aware alignment

10.1101/126656 ◽

2017 ◽

Cited By ~ 1

Author(s):

Krešimir Križanović ◽

Amina Echchiki ◽

Julien Roux ◽

Mile Šikić

Keyword(s):

High Throughput Sequencing ◽

Genetic Research ◽

Error Rates ◽

Rna Seq ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Gapped Alignment ◽

Long Read ◽

Gene Expression Levels

Abstract Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long PacBio or even ONT MinION reads. Results: The tools were tested on synthetic and real datasets from the PacBio and ONT MinION technologies, and both alignment quality and resource usage were compared across tools. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads. Availability: https://github.com/kkrizanovic/[email protected]

The Influence of Memory-Aware Computation on Distributed BLAST

Current Bioinformatics ◽

10.2174/1574893613666180601080811 ◽

2019 ◽

Vol 14 (2) ◽

pp. 157-163

Author(s):

Majid Hajibaba ◽

Mohsen Sharifi ◽

Saeid Gorgin

Keyword(s):

Search Time ◽

Genomic Research ◽

Local Alignment ◽

Negative Effects ◽

Sequencing Technologies ◽

Percent Improvement ◽

Fast Processing ◽

Search Tool ◽

Memory Awareness ◽

Generation Sequencing

Background: One of the pivotal challenges in genomic research today is the fast processing of voluminous data, such as the data generated by high-throughput next-generation sequencing technologies. BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in bioinformatics, has been shown to be very slow in this regard. Objective: To improve the performance of BLAST on voluminous data, we applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We used a master-worker model together with a memory-aware technique in which the master partitions the whole dataset into equal chunks, one chunk per worker, and each worker then further splits and formats its allocated chunk according to the size of its memory. Each worker searches its splits one by one with a list of queries. Results: We chose a list of queries with different lengths to run insensitive searches in a huge database called UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared with when they were not memory-aware. Experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory awareness in formatting a bulky database when running BLAST can improve performance significantly while preventing unexpected crashes in low-memory environments. Although distributed computing attempts to reduce search time by partitioning and distributing database portions, our memory-aware technique additionally alleviates the negative effects of page faults on performance.
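The two-level partitioning described in the Method can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes every database record has the same size in bytes (real sequence databases have variable-length records), and the names `partition_equal` and `split_for_memory` are hypothetical.

```python
def partition_equal(n_records, n_workers):
    """Master step: divide record indices into near-equal contiguous chunks, one per worker.

    Returns a list of half-open (start, end) index ranges.
    """
    base, extra = divmod(n_records, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def split_for_memory(chunk, record_bytes, memory_bytes):
    """Worker step: cut a chunk into splits that each fit in the worker's memory.

    Assumption: fixed record_bytes per record. At least one record is placed in
    each split even if a single record exceeds memory_bytes.
    """
    start, end = chunk
    per_split = max(1, memory_bytes // record_bytes)
    return [(s, min(s + per_split, end)) for s in range(start, end, per_split)]
```

For example, `partition_equal(10, 3)` gives three chunks of 4, 3 and 3 records, and a worker with room for two records at a time would process its chunk as a sequence of splits of at most two records. Keeping each split within memory is what avoids the page faults the abstract mentions.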

Reassortment of Genome Segments Creates Stable Lineages Among Strains of Orchid Fleck Virus Infecting Citrus in Mexico

Phytopathology ◽

10.1094/phyto-07-19-0253-fi ◽

2020 ◽

Vol 110 (1) ◽

pp. 106-120 ◽

Cited By ~ 1

Author(s):

Avijit Roy ◽

Andrew L. Stone ◽

Gabriel Otero-Colina ◽

Gang Wei ◽

Ronald H. Brlansky ◽

...

Keyword(s):

High Throughput Sequencing ◽

Sensu Stricto ◽

Genome Segment ◽

Rt Pcr ◽

Sequence Comparisons ◽

Orchid Fleck Virus ◽

Reverse Transcription Pcr ◽

Sequencing Technologies ◽

Negative Sense

The genus Dichorhavirus contains viruses with bipartite, negative-sense, single-stranded RNA genomes that are transmitted by flat mites to hosts that include orchids, coffee, the genus Clerodendrum, and citrus. A dichorhavirus infecting citrus in Mexico is classified as a citrus strain of orchid fleck virus (OFV-Cit). We previously used RNA sequencing technologies on OFV-Cit samples from Mexico to develop an OFV-Cit–specific reverse transcription PCR (RT-PCR) assay. During assay validation, OFV-Cit–specific RT-PCR failed to produce an amplicon from some samples with clear symptoms of OFV-Cit. Characterization of this virus revealed that dichorhavirus-like particles were found in the nucleus. High-throughput sequencing of small RNAs from these citrus plants revealed a novel citrus strain of OFV, OFV-Cit2. Sequence comparisons with known orchid and citrus strains of OFV showed variation in the protein products encoded by genome segment 1 (RNA1). Strains of OFV clustered together based on host of origin, whether orchid or citrus, and were clearly separated from other dichorhaviruses described from infected citrus in Brazil. The variation in RNA1 between the original (now OFV-Cit1) and the new (OFV-Cit2) strain was not observed with genome segment 2 (RNA2), but instead, a common RNA2 molecule was shared among strains of OFV-Cit1 and -Cit2, a situation strikingly similar to OFV infecting orchids. We also collected mites at the affected groves, identified them as Brevipalpus californicus sensu stricto, and confirmed that they were infected by OFV-Cit1 or with both OFV-Cit1 and -Cit2. OFV-Cit1 and -Cit2 have coexisted at the same site in Toliman, Queretaro, Mexico since 2012. OFV strain-specific diagnostic tests were developed.
