Characterization of segmental duplications and large inversions using Linked-Reads

Mapping Intimacies ◽

10.1101/394528 ◽

2018 ◽

Cited By ~ 4

Author(s):

Fatih Karaoglanoglu ◽

Camir Ricketts ◽

Marzieh Eslami Rasekh ◽

Ezgi Ebren ◽

Iman Hajirasouliha ◽

...

Keyword(s):

High Throughput Sequencing ◽

Segmental Duplications ◽

Sequencing Data ◽

Full Spectrum ◽

Genomic Structural Variation ◽

Split Read ◽

Long Read ◽

Novel Algorithms ◽

Insertion Locus

Abstract:

Many algorithms aimed at characterizing genomic structural variation (SV) have been developed since the inception of high-throughput sequencing. However, the full spectrum of SVs in the human genome has not yet been assessed. Most of the existing methods focus on discovery and genotyping of deletions, insertions, and mobile elements. Detection of balanced SVs with no gain or loss of genomic segments (e.g., inversions) is a particularly challenging task. Long read sequencing has been leveraged to find short inversions, but there is still a need to develop methods to detect large genomic inversions. Furthermore, there are currently no algorithms to predict the insertion locus of large interspersed segmental duplications.

Here we propose novel algorithms to characterize large (>40 Kbp) interspersed segmental duplications and (>80 Kbp) inversions using Linked-Read sequencing data. Linked-Read sequencing provides long range information, where Illumina reads are tagged with barcodes that can be used to assign short reads to pools of larger (30-50 Kbp) molecules. Our methods rely on the split molecule sequence signature that we have previously described [11]. Similar to split reads, split molecules are large segments of DNA that span an SV breakpoint. Therefore, when mapped to the reference genome, the mapping of these segments is discontinuous. We redesign our earlier algorithm, VALOR, to specifically leverage Linked-Read sequencing data to discover large inversions and characterize interspersed segmental duplications. We implement our new algorithms in a new software package called VALOR2.

Availability: VALOR2 is available at https://github.com/BilkentCompGen/valor.
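The split molecule signature described in this abstract can be illustrated with a minimal sketch. It is not VALOR2's actual algorithm: the function names, the input format (barcode, chromosome, position triples), and all thresholds are assumptions made for illustration. Reads sharing a barcode are grouped into intervals, and a barcode whose reads form two distant intervals that could together fit in one molecule is reported as a candidate.

```python
from collections import defaultdict


def cluster_positions(positions, max_gap):
    """Group positions into (start, end) intervals.

    A new interval starts whenever the gap to the previous
    position exceeds max_gap.
    """
    intervals = []
    for pos in sorted(positions):
        if intervals and pos - intervals[-1][1] <= max_gap:
            intervals[-1][1] = pos
        else:
            intervals.append([pos, pos])
    return [tuple(iv) for iv in intervals]


def split_molecule_candidates(reads, max_gap=10_000,
                              min_split_distance=80_000,
                              max_molecule_len=50_000):
    """Return (barcode, chrom, left_interval, right_interval) candidates.

    reads: iterable of (barcode, chrom, pos) tuples (hypothetical format).
    All thresholds are illustrative assumptions, not values from VALOR2.
    """
    by_barcode = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_barcode[(barcode, chrom)].append(pos)

    candidates = []
    for (barcode, chrom), positions in by_barcode.items():
        intervals = cluster_positions(positions, max_gap)
        # Compare each pair of neighbouring intervals for this barcode.
        for left, right in zip(intervals, intervals[1:]):
            distance = right[0] - left[1]
            combined = (left[1] - left[0]) + (right[1] - right[0])
            if distance >= min_split_distance and combined <= max_molecule_len:
                candidates.append((barcode, chrom, left, right))
    return candidates


example_reads = [
    ("BC1", "chr1", 1_000), ("BC1", "chr1", 5_000), ("BC1", "chr1", 9_000),
    ("BC1", "chr1", 200_000), ("BC1", "chr1", 204_000),
    ("BC2", "chr1", 1_000), ("BC2", "chr1", 6_000),
]
candidates = split_molecule_candidates(example_reads)
```

In real Linked-Read data two unrelated molecules can share a barcode, so a single discontinuous barcode is weak evidence; a practical caller would need support from multiple barcodes at the same pair of loci.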

TagSeqTools: a flexible and comprehensive analysis pipeline for NAD tagSeq data

10.1101/2020.03.09.982934 ◽

2020 ◽

Cited By ~ 1

Author(s):

Huan Zhong ◽

Zongwei Cai ◽

Zhu Yang ◽

Yiji Xia

Keyword(s):

Rna Sequencing ◽

Comprehensive Analysis ◽

Enzymatic Reactions ◽

Computational Tool ◽

Sequencing Data ◽

Analysis Pipeline ◽

Oxford Nanopore ◽

Long Read ◽

Identification And Characterization

Abstract:

NAD tagSeq has recently been developed for the identification and characterization of NAD+-capped RNAs (NAD-RNAs). This method adopts a strategy of chemo-enzymatic reactions to label the NAD-RNAs with a synthetic RNA tag before subjecting them to Oxford Nanopore direct RNA sequencing. A computational tool designed for analyzing the sequencing data of tagged RNA will facilitate the broader application of this method. Hence, we introduce TagSeqTools as a flexible, general pipeline for the identification and quantification of tagged RNAs (i.e., NAD+-capped RNAs) using long-read transcriptome sequencing data generated by the NAD tagSeq method. TagSeqTools comprises two major modules: TagSeek for differentiating tagged and untagged reads, and TagSeqQuant for the quantitative and further characterization analysis of genes and isoforms. The pipeline also integrates some advanced functions to identify antisense or splicing events, and supports data reformatting for visualization. TagSeqTools therefore provides a convenient and comprehensive workflow for researchers to analyze the data produced by the NAD tagSeq method or other tagging-based experiments using Oxford Nanopore direct RNA sequencing. The pipeline is available at https://github.com/dorothyzh/TagSeqTools, under Apache License 2.0.
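As a rough illustration of what differentiating tagged from untagged reads can involve, the sketch below searches the start of each read for a tag sequence while tolerating a few mismatches. This is not the TagSeek implementation: the tag sequence, the search window, the mismatch limit, and the function names are all hypothetical, and a mismatch-only comparison ignores the insertions and deletions common in nanopore reads.

```python
def is_tagged(read, tag, search_window=40, max_mismatches=2):
    """Return True if tag occurs in the first search_window bases of read,
    allowing up to max_mismatches substitutions (no indels)."""
    prefix = read[:search_window]
    for i in range(len(prefix) - len(tag) + 1):
        window = prefix[i:i + len(tag)]
        mismatches = sum(x != y for x, y in zip(window, tag))
        if mismatches <= max_mismatches:
            return True
    return False


def partition_reads(reads, tag, **kwargs):
    """Split (name, sequence) pairs into tagged and untagged read names."""
    tagged, untagged = [], []
    for name, seq in reads:
        (tagged if is_tagged(seq, tag, **kwargs) else untagged).append(name)
    return tagged, untagged


HYPOTHETICAL_TAG = "ACGTTGCAAGTC"  # placeholder, not the real NAD tagSeq tag
example_reads = [
    ("exact", "GG" + "ACGTTGCAAGTC" + "TTTTTTTT"),
    ("one_mismatch", "GG" + "ACGTAGCAAGTC" + "TTTTTTTT"),
    ("no_tag", "T" * 24),
]
tagged, untagged = partition_reads(example_reads, HYPOTHETICAL_TAG)
```

A production tool would more likely use an alignment-based score that handles indels; the mismatch count is used here only to keep the example short.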

SQANTI: extensive characterization of long read transcript sequences for quality control in full-length transcriptome identification and quantification

10.1101/118083 ◽

2017 ◽

Cited By ~ 9

Author(s):

Manuel Tardaguila ◽

Lorena de la Fuente ◽

Cristina Marti ◽

Cécile Pereira ◽

Francisco Jose Pardo-Palacios ◽

...

Keyword(s):

High Throughput Sequencing ◽

Full Length ◽

The Novel ◽

Extensive Evaluation ◽

Long Reads ◽

Long Read ◽

Novel Transcripts ◽

Mouse Transcriptome

Abstract:

High-throughput sequencing of full-length transcripts using long reads has paved the way for the discovery of thousands of novel transcripts, even in very well annotated organisms such as mice and humans. Nonetheless, there is a need for studies and tools that characterize these novel isoforms. Here we present SQANTI, an automated pipeline for the classification of long-read transcripts that computes 47 descriptors that can be used to assess the quality of the data and of the preprocessing pipelines. We applied SQANTI to a neuronal mouse transcriptome using PacBio long reads and illustrate how the tool is effective in readily describing the composition of and characterizing the full-length transcriptome. We perform extensive evaluation of ToFU PacBio transcripts by PCR to reveal that an important number of the novel transcripts are technical artifacts of the sequencing approach, and that SQANTI quality descriptors can be used to engineer a filtering strategy to remove them. Most novel transcripts in this curated transcriptome are novel combinations of existing splice sites, result more frequently in novel ORFs than novel UTRs, and are enriched in both general metabolic and neural-specific functions. We show that these new transcripts have a major impact on the correct quantification of transcript levels by state-of-the-art short-read based quantification algorithms. By comparing our iso-transcriptome with public proteomics databases, we find that alternative isoforms are elusive to proteogenomics detection and are variable in protein changes with respect to the principal isoform of their genes. SQANTI allows the user to maximize the analytical outcome of long read technologies by providing the tools to deliver quality-evaluated and curated full-length transcriptomes. SQANTI is available at https://bitbucket.org/ConesaLab/sqanti.

Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies

PeerJ ◽

10.7717/peerj.1839 ◽

2016 ◽

Vol 4 ◽

pp. e1839 ◽

Cited By ~ 57

Author(s):

Tom O. Delmont ◽

A. Murat Eren

Keyword(s):

High Throughput Sequencing ◽

Draft Genome ◽

Cost Effective ◽

Single Copy ◽

Eukaryotic Genome ◽

Sequencing Data ◽

Bacterial Genomes ◽

Long Read ◽

Domains Of Life ◽

Genome Assemblies

High-throughput sequencing provides a fast and cost-effective means to recover genomes of organisms from all domains of life. However, adequate curation of the assembly results against potential contamination of non-target organisms requires advanced bioinformatics approaches and practices. Here, we re-analyzed the sequencing data generated for the tardigrade Hypsibius dujardini, and created a holistic display of the eukaryotic genome assembly using DNA data originating from two groups and eleven sequencing libraries. By using bacterial single-copy genes, k-mer frequencies, and coverage values of scaffolds, we could identify and characterize multiple near-complete bacterial genomes from the raw assembly, and curate a 182 Mbp draft genome for H. dujardini supported by RNA-Seq data. Our results indicate that most contaminant scaffolds were assembled from Moleculo long-read libraries, and most of these contaminants differed between library preparations. Our re-analysis shows that visualization and curation of eukaryotic genome assemblies can benefit from tools designed to address the needs of today’s microbiologists, who are constantly challenged by the difficulties associated with the identification of distinct microbial genomes in complex environmental metagenomes.

Strengths and Biases of High-Throughput Sequencing Data in the Characterization of Freshwater Ciliate Microbiomes

Microbial Ecology ◽

10.1007/s00248-016-0912-8 ◽

2016 ◽

Vol 73 (4) ◽

pp. 865-875 ◽

Cited By ~ 6

Author(s):

Vittorio Boscaro ◽

Alessia Rossi ◽

Claudia Vannini ◽

Franco Verni ◽

Sergei I. Fokin ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data

Multi-platform discovery of haplotype-resolved structural variation in human genomes

10.1101/193144 ◽

2017 ◽

Cited By ~ 32

Author(s):

Mark J.P. Chaisson ◽

Ashley D. Sanders ◽

Xuefang Zhao ◽

Ankit Malhotra ◽

David Porubsky ◽

...

Keyword(s):

Genome Sequencing ◽

Large Scale ◽

Structural Variation ◽

High Throughput Sequencing ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Full Spectrum ◽

Variant Discovery ◽

Sequencing Technologies ◽

Sequencing Studies

Abstract:

The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome, most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.

neoantigenR: An annotation based pipeline for tumor neoantigen identification from sequencing data

10.1101/171843 ◽

2017 ◽

Cited By ~ 4

Author(s):

Shaojun Tang ◽

Subha Madhavan

Keyword(s):

Alternative Splicing ◽

Cancer Immunotherapy ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Peptide Epitopes ◽

Cancer Antigens ◽

Long Read ◽

Personalized Cancer ◽

Specific Peptide

Abstract:

Studies indicate that more than 90% of human genes are alternatively spliced, suggesting the complexity of the transcriptome assembly and analysis. The splicing process is often disrupted, resulting in both functional and non-functional end-products (Sveen et al. 2016) in many cancers. Harnessing the immune system to fight against malignant cancers carrying aberrantly mutated or spliced products is becoming a promising approach to cancer therapy. Advances in immune checkpoint blockade have elicited adaptive immune responses with promising clinical responses to treatments against human malignancies (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017). Emerging data suggest that recognition of patient-specific mutation-associated cancer antigens (i.e. from alternative splicing isoforms) may allow scientists to dissect the immune response in the activity of clinical immunotherapies (Schumacher and Schreiber 2015). The advent of high-throughput sequencing technology has provided a comprehensive view of both splicing aberrations and somatic mutations across a range of human malignancies, allowing for a deeper understanding of the interplay of various disease mechanisms.

Meanwhile, studies show that the number of transcript isoforms reported to date may be limited by short-read sequencing due to the inherent limitation of transcriptome reconstruction algorithms, whereas long-read sequencing is able to significantly improve the detection of alternative splicing variants since there is no need to assemble full-length transcripts from short reads. The analysis of these high-throughput long-read sequencing data may permit a systematic view of tumor-specific peptide epitopes (also known as neoantigens) that could serve as targets for immunotherapy (Tumor Neoantigens in Personalized Cancer Immunotherapy 2017).

Currently, there is no software pipeline available that can efficiently produce mutation-associated cancer antigens from raw high-throughput sequencing data on patient tumor DNA (The Problem with Neoantigen Prediction 2017). To address this issue, we introduce an R package that allows the discovery of peptide epitope candidates, which are the tumor-specific peptide fragments containing potential functional neoantigens. These peptide epitopes consist of structural variants including insertions, deletions, alternative sequences, and peptides from nonsynonymous mutations. Analysis of these precursor candidates with widely used tools such as netMHC allows for the accurate in-silico prediction of neoantigens. The pipeline named neoantigeR is currently hosted at https://github.com/ICBI/neoantigeR.

Precise characterization of somatic structural variations and mobile element insertions from paired long-read sequencing data with nanomonsv

10.1101/2020.07.22.214262 ◽

2020 ◽

Author(s):

Yuichi Shiraishi ◽

Junji Koya ◽

Kenichi Chiba ◽

Yuki Saito ◽

Ai Okada ◽

...

Keyword(s):

Matched Control ◽

Mobile Element ◽

Sequencing Data ◽

Structural Variations ◽

Short Read ◽

Long Read ◽

Functional Consequences ◽

Single Base Resolution ◽

Mutational Processes

Abstract:

We introduce our novel software, nanomonsv, for detecting somatic structural variations (SVs) at single-base resolution using tumor and matched control long-read sequencing data. Using paired long-read sequencing data from three cancer cell lines and their matched lymphoblastoid lines, we demonstrate that our approach can identify not only somatic SVs that can be captured with short-read technologies but also novel ones, especially those whose breakpoints are located in repeat regions. In addition, we have developed a workflow for classifying mobile element insertions while elucidating their in-depth properties such as 5′ truncations, internal inversions, and source sites in the case of LINE1 transductions. Finally, we identify complex SVs probably caused by replication mechanisms or telomere crisis by examining the co-occurrence of multiple somatic SVs in common supporting reads. In summary, our approaches applied to cancer long-read sequencing data can reveal various features of somatic SVs and will lead to further understanding of mutational processes and functional consequences of somatic SVs.

Characterization of the mitochondrial genome of Arge bella Wei & Du sp. nov. (Hymenoptera: Argidae)

PeerJ ◽

10.7717/peerj.6131 ◽

2018 ◽

Vol 6 ◽

pp. e6131 ◽

Cited By ~ 3

Author(s):

Shiyu Du ◽

Gengyun Niu ◽

Tommi Nyman ◽

Meicai Wei

Keyword(s):

Mitochondrial Genome ◽

High Throughput Sequencing ◽

Complete Mitochondrial Genome ◽

Nucleotide Composition ◽

Sequencing Data ◽

Protein Coding ◽

High Throughput Sequencing Data ◽

Rna Genes ◽

Ancestral Type

We describe Arge bella Wei & Du sp. nov., a large and beautiful species of Argidae from south China, and report its mitochondrial genome based on high-throughput sequencing data. We present the gene order, nucleotide composition of protein-coding genes (PCGs), and the secondary structures of RNA genes. The nearly complete mitochondrial genome of A. bella has a length of 15,576 bp and a typical set of 37 genes (22 tRNAs, 13 PCGs, and 2 rRNAs). Three tRNAs are rearranged in the A. bella mitochondrial genome as compared to the ancestral type in insects: trnM and trnQ are shuffled, while trnW is translocated from the trnW-trnC-trnY cluster to a location downstream of trnI. All PCGs are initiated by ATN codons, and terminated with TAA, TA or T as stop codons. All tRNAs have a typical cloverleaf secondary structure, except for trnS1. H821 of rrnS and H976 of rrnL are redundant. A phylogenetic analysis based on mitochondrial genome sequences of A. bella, 21 other symphytan species, two apocritan representatives, and four outgroup taxa supports the placement of Argidae as sister to the Pergidae within the symphytan superfamily Tenthredinoidea.

Discovery of tandem and interspersed segmental duplications using high throughput sequencing

10.1101/393694 ◽

2018 ◽

Cited By ~ 1

Author(s):

Arda Soylev ◽

Thong Le ◽

Hajar Amini ◽

Can Alkan ◽

Fereydoun Hormozdiari

Keyword(s):

False Discovery Rate ◽

High Throughput ◽

High Throughput Sequencing ◽

Real Data ◽

Whole Genome Sequencing Data ◽

Data Sets ◽

Segmental Duplications ◽

Sequencing Data ◽

Link Type ◽

False Discovery

Abstract:

Motivation: Several algorithms have been developed that use high throughput sequencing technology to characterize structural variations. Most of the existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions, and short inversions. In fact, complex SVs are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions; and duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve the calling accuracy of simpler variants.

Results: We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing data sets. We integrated these methods into our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real data sets. In the simulation experiments, at 30× coverage, TARDIS achieved 96% sensitivity with only a 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state of the art methods. Furthermore, we showed a surprisingly low false discovery rate for our approach in predicting tandem, direct and inverted interspersed segmental duplications on CHM1 (less than 5% for the top 50 predictions).

Availability: TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/
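For readers unfamiliar with the two metrics quoted in this abstract, the short sketch below shows how sensitivity and false discovery rate are computed from call counts. The counts are made up for illustration and are not the TARDIS evaluation data; they were chosen only so that the results match the 96% and 4% figures.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real variants that were called: TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no real variants to evaluate")
    return true_positives / total


def false_discovery_rate(true_positives, false_positives):
    """Fraction of calls that are wrong: FP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no calls to evaluate")
    return false_positives / total


# Hypothetical counts: 100 simulated variants, 96 recovered, 4 spurious calls.
tp, fn, fp = 96, 4, 4
sens = sensitivity(tp, fn)
fdr = false_discovery_rate(tp, fp)
```

The two metrics use different denominators: sensitivity is measured against the set of true variants, while false discovery rate is measured against the set of calls made.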

Experimental Design-Based Functional Mining and Characterization of High-Throughput Sequencing Data in the Sequence Read Archive

PLoS ONE ◽

10.1371/journal.pone.0077910 ◽

2013 ◽

Vol 8 (10) ◽

pp. e77910 ◽

Cited By ~ 20

Author(s):

Takeru Nakazato ◽

Tazro Ohta ◽

Hidemasa Bono

Keyword(s):

Experimental Design ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Sequence Read Archive ◽

High Throughput Sequencing Data