WGA-LP: a pipeline for Whole Genome Assembly of contaminated reads

Summary: Whole Genome Assembly (WGA) of bacterial genomes from short reads has become a common task as advances in DNA sequencing technology have made it cheaper. Genome assembly has no absolute gold standard (Del Angel et al. (2018)) and requires a sequence of steps, each of which can involve combinations of many different tools. However, the quality of the final assembly is always strongly related to the quality of the input data. With this in mind we built WGA-LP, a package that connects state-of-the-art programs and novel scripts to check and improve the quality of both samples and resulting assemblies. WGA-LP, with its conservative decontamination approach, has been shown to be capable of creating high-quality assemblies even in the case of contaminated reads. Availability and Implementation: WGA-LP is available on GitHub (https://github.com/redsnic/WGA-LP) and Docker Hub (https://hub.docker.com/r/redsnic/wgalp). The web app for node visualization is hosted by shinyapps.io (https://redsnic.shinyapps.io/ContigCoverageVisualizer/). Contact: Nicolò Rossi, [email protected] Supplementary information: Supplementary data are available at bioRxiv online.

PDR: a new genome assembly evaluation metric based on genetics concerns

Bioinformatics ◽

10.1093/bioinformatics/btaa704 ◽

2020 ◽

Author(s):

Luyu Xie ◽

Limsoon Wong

Keyword(s):

Genome Assembly ◽

Pairwise Distance ◽

Supplementary Information ◽

Supplementary Data ◽

Assembly Quality ◽

Genetic Studies ◽

A Genome ◽

Assembly Evaluation ◽

Evaluation Metric

Abstract Motivation Existing genome assembly evaluation metrics provide only limited insight into specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity, and correctness into consideration. We also propose an approximation implementation to accelerate PDR computation. Results Our results on publicly available datasets affirm PDR’s ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicate that the error introduced by the approximation is extremely small and thus negligible. Availability and implementation https://github.com/XLuyu/PDR. Supplementary information Supplementary data are available at Bioinformatics online.

A Multireference-Based Whole Genome Assembly for the Obligate Ant-Following Antbird, Rhegmatorhina melanosticta (Thamnophilidae)

Diversity ◽

10.3390/d11090144 ◽

2019 ◽

Vol 11 (9) ◽

pp. 144 ◽

Cited By ~ 4

Author(s):

Laís Coelho ◽

Lukas Musher ◽

Joel Cracraft

Keyword(s):

Genome Assembly ◽

High Throughput Sequencing ◽

Population Genomics ◽

De Novo ◽

Structural Difference ◽

Whole Genome ◽

Sequencing Technology ◽

A Genome ◽

Avian Genomes ◽

Chromosome Level

Current generation high-throughput sequencing technology has facilitated the generation of more genomic-scale data than ever before, thus greatly improving our understanding of avian biology across a range of disciplines. Recent developments in linked-read sequencing (Chromium 10×) and reference-based whole-genome assembly offer an exciting prospect of more accessible chromosome-level genome sequencing in the near future. We sequenced and assembled a genome of the Hairy-crested Antbird (Rhegmatorhina melanosticta), which represents the first publicly available genome for any antbird (Thamnophilidae). Our objectives were to (1) assemble scaffolds to chromosome level based on multiple reference genomes, and report on differences relative to other genomes, (2) assess genome completeness and compare content to other related genomes, and (3) assess the suitability of linked-read sequencing technology for future studies in comparative phylogenomics and population genomics studies. Our R. melanosticta assembly was both highly contiguous (de novo scaffold N50 = 3.3 Mb, reference based N50 = 53.3 Mb) and relatively complete (contained close to 90% of evolutionarily conserved single-copy avian genes and known tetrapod ultraconserved elements). The high contiguity and completeness of this assembly enabled the genome to be successfully mapped to the chromosome level, which uncovered a consistent structural difference between R. melanosticta and other avian genomes. Our results are consistent with the observation that avian genomes are structurally conserved. Additionally, our results demonstrate the utility of linked-read sequencing for non-model genomics. Finally, we demonstrate the value of our R. melanosticta genome for future researchers by mapping reduced representation sequencing data, and by accurately reconstructing the phylogenetic relationships among a sample of thamnophilid species.
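The contiguity figures quoted above are N50 values: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The following is a minimal sketch of that calculation; the function name `n50` and the example lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in base pairs (not taken from the paper).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
example_n50 = n50(example_lengths)  # total is 300, so half is 150
```

Because N50 depends only on the length distribution, it says nothing about correctness or completeness, which is why the abstract reports gene-content completeness alongside it.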

Metassembler: Merging and optimizing de novo genome assemblies

10.1101/016352 ◽

2015 ◽

Author(s):

Alejandro Hernandez Wences ◽

Michael Schatz

Keyword(s):

Open Source ◽

Genome Assembly ◽

De Novo ◽

A Genome ◽

Genome Assemblies ◽

Multiple Algorithms

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.

Aligning optical maps to de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz069 ◽

2019 ◽

Vol 35 (18) ◽

pp. 3250-3256 ◽

Cited By ~ 1

Author(s):

Kingshuk Mukherjee ◽

Bahar Alipanahi ◽

Tamer Kahveci ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Sequence Data ◽

Supplementary Information ◽

De Bruijn Graph ◽

Structural Variations ◽

Regular Feature ◽

A Genome ◽

De Bruijn ◽

Optical Maps

Abstract Motivation Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have mainly been used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single-molecule optical maps (called Rmaps) or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the usage of optical maps in the sequence assembly step itself. Results We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. Availability and implementation The software for aligning optical maps to a de Bruijn graph, omGraph, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/omGraph. Supplementary information Supplementary data are available at Bioinformatics online.
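For readers unfamiliar with the graph being aligned against, the sketch below builds a de Bruijn graph from short reads in the textbook way: each k-mer becomes an edge from its (k-1)-length prefix to its (k-1)-length suffix. This is an illustrative assumption about the data structure only. It is not omGraph's implementation, and the function name `de_bruijn_graph` and the toy reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read; reads shorter than k add nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads (hypothetical): two overlapping fragments of the sequence ACGTA.
toy_graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

Overlapping reads contribute the same edges, so the graph collapses redundant coverage into a single path; aligning an Rmap then amounts to finding a path whose restriction-site spacing matches the Rmap's fragment sizes.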

Whole-genome sequencing of eukaryotes: From sequencing of DNA fragments to a genome assembly

Russian Journal of Genetics ◽

10.1134/s102279541705012x ◽

2017 ◽

Vol 53 (6) ◽

pp. 631-639 ◽

Cited By ~ 4

Author(s):

K. S. Zadesenets ◽

N. I. Ershov ◽

N. B. Rubtsov

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Assembly ◽

Whole Genome ◽

Dna Fragments ◽

A Genome

Whole-Genome Sequencing of Sinocyclocheilus maitianheensis Reveals Phylogenetic Evolution and Immunological Variances in Various Sinocyclocheilus Fishes

Frontiers in Genetics ◽

10.3389/fgene.2021.736500 ◽

2021 ◽

Vol 12 ◽

Author(s):

Ruihan Li ◽

Xiaoai Wang ◽

Chao Bian ◽

Zijian Gao ◽

Yuanwei Zhang ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Genome Assembly ◽

Genetic Resource ◽

Whole Genome ◽

Protein Coding ◽

A Genome ◽

Surface Dwelling ◽

Phylogenetic Evolution ◽

Comparative Phylogenetic Analysis

An adult Sinocyclocheilus maitianheensis, a surface-dwelling golden-line barbel fish, was collected from the Maitian river (Kunming City, Yunnan Province, China) for whole-genome sequencing, assembly, and annotation. We obtained a genome assembly of 1.7 Gb with a scaffold N50 of 1.4 Mb and a contig N50 of 24.7 kb. A total of 39,977 protein-coding genes were annotated. Based on a comparative phylogenetic analysis of five Sinocyclocheilus species and five other representative vertebrates with published genome sequences, we found that S. maitianheensis is close to Sinocyclocheilus anophthalmus (a cave-restricted species from a similar locality). Moreover, the assembled genomes of S. maitianheensis and four other Sinocyclocheilus counterparts were used for a fourfold degenerative third-codon transversion (4dTv) analysis. The recent whole-genome duplication (WGD) event was therefore estimated to have occurred about 18.1 million years ago. Our results also revealed a decreasing tendency in the copy number of many important genes related to immunity and apoptosis in cave-restricted Sinocyclocheilus species. In summary, we report the first genome assembly of S. maitianheensis, which provides a valuable genetic resource for comparative studies on cavefish biology, species protection, and practical aquaculture of this potentially economical fish.

InfoTrim: A DNA Read Quality Trimmer Using Entropy

10.1101/201442 ◽

2017 ◽

Author(s):

Jacob Porter ◽

Liqing Zhang

Keyword(s):

Genome Assembly ◽

Bisulfite Sequencing ◽

Low Complexity ◽

Whole Genome ◽

Biological Sequence ◽

Alignment Quality ◽

Whole Genome Bisulfite Sequencing ◽

Sequence Complexity ◽

Genome Bisulfite Sequencing

Abstract Biological DNA reads are often trimmed before mapping, genome assembly, and other tasks to improve the quality of the results. Biological sequence complexity relates to alignment quality as low complexity regions can align poorly. There are many read trimmers, but many do not use sequence complexity for trimming. Alignment of reads generated from whole genome bisulfite sequencing is especially challenging since bisulfite treated reads tend to reduce sequence complexity. InfoTrim, a new read trimmer, was created to explore these issues. It is evaluated against five other trimmers using four read mappers on real and simulated bisulfite treated DNA data. InfoTrim produces reasonable results consistent with other trimmers.
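One common way to quantify the sequence complexity mentioned above is the Shannon entropy of the base composition of a read. The sketch below is an illustration under that assumption and is not InfoTrim's actual scoring function, which the abstract does not specify; `shannon_entropy` and the example reads are hypothetical. It also shows why bisulfite-treated reads score lower: unmethylated cytosines are read as thymines, which shrinks the effective alphabet.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy, in bits per base, of the base composition of seq."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the observed bases.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical read and a fully C-to-T converted copy of it.
original_read = "ACGTACGT"
converted_read = original_read.replace("C", "T")

entropy_before = shannon_entropy(original_read)   # four equally frequent bases
entropy_after = shannon_entropy(converted_read)   # only A, G and T remain
```

A trimmer built on this idea could drop read ends whose entropy falls below a chosen threshold, though where to set that threshold is a tuning decision the sketch does not address.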

plot2DO: a tool to assess the quality and distribution of genomic data

10.1101/189449 ◽

2017 ◽

Author(s):

Răzvan V. Chereji

Keyword(s):

Micrococcal Nuclease ◽

Supplementary Information ◽

Gene Promoters ◽

Functional Regions ◽

Genome Wide ◽

Micrococcal Nuclease Digestion ◽

A Genome ◽

Nuclease Digestion ◽

Wide Scale

Abstract Summary Micrococcal nuclease digestion followed by deep sequencing (MNase-seq) is the most used method to investigate nucleosome organization on a genome-wide scale. We present plot2DO, a software package for creating 2D occupancy plots, which allows biologists to evaluate the quality of MNase-seq data and to visualize the distribution of nucleosomes near the functional regions of the genome (e.g. gene promoters, origins of replication, etc.). Availability and Implementation The plot2DO open source package is freely available on GitHub at https://github.com/rchereji/plot2DO under the MIT license. Contact [email protected] Supplementary Information Supplementary data are available at Bioinformatics online.

TIGER: inferring DNA replication timing from whole-genome sequence data

Bioinformatics ◽

10.1093/bioinformatics/btab166 ◽

2021 ◽

Cited By ~ 1

Author(s):

Amnon Koren ◽

Dashiell J Massey ◽

Alexa N Bracci

Keyword(s):

Dna Replication ◽

Genome Sequence ◽

Genomic Dna ◽

Sequence Data ◽

Replication Timing ◽

Whole Genome Sequence ◽

Supplementary Information ◽

Whole Genome ◽

Genome Sequence Data ◽

Dna Replication Timing

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online.

Chromosome-level genome assembly of a regenerable maize inbred line A188

Genome Biology ◽

10.1186/s13059-021-02396-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Guifang Lin ◽

Cheng He ◽

Jun Zheng ◽

Dal-Hoe Koo ◽

Ha Le ◽

...

Keyword(s):

Inbred Line ◽

Genome Assembly ◽

Gene Function ◽

Maize Inbred Line ◽

Carotenoid Cleavage Dioxygenase ◽

Structural Variations ◽

Embryonic Callus ◽

Network Analyses ◽

A Genome ◽

Chromosome Level

Abstract Background The maize inbred line A188 is an attractive model for elucidation of gene function and improvement due to its high embryogenic capacity and many contrasting traits to the first maize reference genome, B73, and other elite lines. The lack of a genome assembly of A188 limits its use as a model for functional studies. Results Here, we present a chromosome-level genome assembly of A188 using long reads and optical maps. Comparison of A188 with B73 using both whole-genome alignments and read depths from sequencing reads identifies approximately 1.1 Gb of syntenic sequences as well as extensive structural variation, including a 1.8-Mb duplication containing the Gametophyte factor1 locus for unilateral cross-incompatibility, and six inversions of 0.7 Mb or greater. Increased copy number of carotenoid cleavage dioxygenase 1 (ccd1) in A188 is associated with elevated expression during seed development. High ccd1 expression in seeds together with low expression of yellow endosperm 1 (y1) reduces carotenoid accumulation, accounting for the white seed phenotype of A188. Furthermore, transcriptome and epigenome analyses reveal enhanced expression of defense pathways and altered DNA methylation patterns of the embryonic callus. Conclusions The A188 genome assembly provides a high-resolution sequence for a complex genome species and a foundational resource for analyses of genome variation and gene function in maize. The genome, in comparison to B73, contains extensive intra-species structural variations and other genetic differences. Expression and network analyses identify discrete profiles for embryonic callus and other tissues.
