An Improved Genome Assembly of Azadirachta indica A. Juss.

Rapid Mycobacterium tuberculosis spoligotyping from uncorrected long reads using Galru

10.1101/2020.05.31.126490 ◽

2020 ◽

Author(s):

Andrew J. Page ◽

Nabil-Fareed Alikhan ◽

Michael Strinden ◽

Thanh Le Viet ◽

Timofey Skvortsov

Keyword(s):

Mycobacterium Tuberculosis ◽

State Of The Art ◽

Sequence Data ◽

Human Pathogen ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Long Reads ◽

Long Read

Abstract: Spoligotyping of Mycobacterium tuberculosis provides a subspecies classification of this major human pathogen. Spoligotypes can be predicted from short read genome sequencing data; however, no methods exist for long read sequence data such as from Nanopore or PacBio. We present a novel software package, Galru, which can rapidly detect the spoligotype of a Mycobacterium tuberculosis sample from as little as a single uncorrected long read. It allows for near real-time spoligotyping from long read data as it is being sequenced, giving rapid sample typing. We compare it to the existing state-of-the-art software and find it performs identically to the results obtained from short read sequencing data. Galru is freely available from https://github.com/quadram-institute-bioscience/galru under the GPLv3 open source licence.
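For readers unfamiliar with how a spoligotype is usually written down, the sketch below shows the conventional encoding of a 43-spacer presence/absence pattern as a 15-digit octal code. It is not Galru's detection method, which the abstract does not describe; the function name `spoligotype_to_octal` is hypothetical and chosen only for illustration.

```python
def spoligotype_to_octal(binary: str) -> str:
    """Convert a 43-character presence/absence string to the 15-digit octal code.

    '1' means the spacer was detected, '0' means it was absent.
    """
    if len(binary) != 43 or set(binary) - {"0", "1"}:
        raise ValueError("expected 43 characters, each '0' or '1'")
    # Spacers 1-42 form 14 triplets; each triplet is read as a 3-bit number.
    triplets = [binary[i:i + 3] for i in range(0, 42, 3)]
    octal = "".join(str(int(t, 2)) for t in triplets)
    # Spacer 43 is appended unchanged as the final digit (0 or 1).
    return octal + binary[42]


if __name__ == "__main__":
    # Pattern with spacers 1-34 absent and 35-43 present.
    print(spoligotype_to_octal("0" * 34 + "1" * 9))
```

The octal form is only a compact label for the binary pattern, so two tools agree on a spoligotype exactly when they agree on all 43 presence/absence calls.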

LGAAP: Leishmaniinae Genome Assembly and Annotation Pipeline

Microbiology Resource Announcements ◽

10.1128/mra.00439-21 ◽

2021 ◽

Vol 10 (29) ◽

Author(s):

Hatim Almutairi ◽

Michael D. Urbaniak ◽

Michelle D. Bates ◽

Narissara Jariyapan ◽

Godwin Kwakye-Nuako ◽

...

Keyword(s):

Open Source ◽

Genome Assembly ◽

Computational Pipeline ◽

Sequencing Data ◽

Annotation Pipeline ◽

Short Read ◽

Short Read Sequencing ◽

Comparable Size

We present the LGAAP computational pipeline, which was successfully used to assemble six genomes of the parasite subfamily Leishmaniinae to chromosome-scale completeness from a combination of long- and short-read sequencing data. LGAAP is open source, and we suggest that it may easily be ported for assembly of any genome of comparable size (∼35 Mb).

Draft genome assembly and transcriptome sequencing of the golden algae Hydrurus foetidus (Chrysophyceae)

F1000Research ◽

10.12688/f1000research.16734.1 ◽

2019 ◽

Vol 8 ◽

pp. 401

Author(s):

Jon Bråte ◽

Janina Fuss ◽

Kjetill S. Jakobsen ◽

Dag Klaveness

Keyword(s):

Genome Assembly ◽

Draft Genome ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Draft Genome Assembly ◽

Alpine Regions ◽

Long Reads ◽

Branching Patterns ◽

Variable Morphology ◽

Generation Sequencing

Hydrurus foetidus is a freshwater alga belonging to the phylum Heterokonta. It thrives in cold rivers in polar and high alpine regions. It has several morphological traits reminiscent of single-celled eukaryotes, but can also form macroscopic thalli. Despite its ability to produce polyunsaturated fatty acids, its life under cold conditions and its variable morphology, very little is known about its genome and transcriptome. Here, we present an extensive set of next-generation sequencing data, including genomic short reads from Illumina sequencing and long reads from Nanopore sequencing, as well as full length cDNAs from PacBio IsoSeq sequencing and a small RNA dataset (smaller than 200 bp) sequenced with Illumina. We combined this data with, to our knowledge, the first draft genome assembly of a chrysophyte alga. The assembly consists of 5069 contigs with a total assembly size of 171 Mb and a 77% BUSCO completeness. The new data generated here may contribute to a better understanding of the evolution and ecological roles of chrysophyte algae, as well as help to resolve the branching patterns within the Heterokonta.

DACCOR - Detection, charACterization, and reconstruction of Repetitive regions in bacterial genomes

10.7287/peerj.preprints.3480v1 ◽

2017 ◽

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases of the genotypers. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterwards the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR, which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.

DACCOR–Detection, characterization, and reconstruction of repetitive regions in bacterial genomes

PeerJ ◽

10.7717/peerj.4742 ◽

2018 ◽

Vol 6 ◽

pp. e4742 ◽

Cited By ~ 1

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping-based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references, such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterward, the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR (detection, characterization, and reconstruction of repetitive regions), which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.

Bacterial genome reduction as a result of short read sequence assembly

10.1101/091314 ◽

2016 ◽

Cited By ~ 1

Author(s):

Charles H.D. Williamson ◽

Andrew Sanchez ◽

Adam Vazquez ◽

Joshua Gutman ◽

Jason W. Sahl

Keyword(s):

Genome Assembly ◽

Bacterial Genome ◽

Draft Genome ◽

Coding Region ◽

Entire Genome ◽

Short Read ◽

Short Reads ◽

Repeat Structure ◽

Short Read Sequence ◽

Genome Assemblies

Abstract: High-throughput comparative genomics has changed our view of bacterial evolution and relatedness. Many genomic comparisons, especially those regarding the accessory genome that is variably conserved across strains in a species, are performed using assembled genomes. For completed genomes, an assumption is made that the entire genome was incorporated into the genome assembly, while for draft assemblies, often constructed from short sequence reads, an assumption is made that genome assembly is an approximation of the entire genome. To understand the potential effects of short read assemblies on the estimation of the complete genome, we downloaded all completed bacterial genomes from GenBank, simulated short reads, assembled the simulated short reads and compared the resulting assembly to the completed assembly. Although most simulated assemblies demonstrated little reduction, others were reduced by as much as 25%, which was correlated with the repeat structure of the genome. A comparative analysis of lost coding region sequences demonstrated that up to 48 CDSs, or up to ~112,000 bases of coding region sequence, were missing from some draft assemblies compared to their finished counterparts. Although this effect was observed to some extent in 32% of genomes, only minimal effects were observed on pan-genome statistics when using simulated draft genome assemblies. The benefits and limitations of using draft genome assemblies should be fully realized before interpreting data from assembly-based comparative analyses.
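The "reduction" reported above is most naturally read as the fraction of the finished genome length that is absent from the draft assembly. The sketch below assumes that reading; the function name `percent_reduction` and the genome sizes in the example are hypothetical and are not taken from the study.

```python
def percent_reduction(finished_bp: int, draft_bp: int) -> float:
    """Percentage of the finished genome length missing from the draft assembly."""
    if finished_bp <= 0:
        raise ValueError("finished genome length must be positive")
    return 100.0 * (finished_bp - draft_bp) / finished_bp


if __name__ == "__main__":
    # Toy numbers: a 4 Mb finished genome whose simulated draft totals 3 Mb.
    print(percent_reduction(4_000_000, 3_000_000))
```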

DACCOR - Detection, charACterization, and reconstruction of Repetitive regions in bacterial genomes

10.7287/peerj.preprints.3480 ◽

2017 ◽

Author(s):

Alexander Seitz ◽

Friederike Hanssen ◽

Kay Nieselt

Keyword(s):

Base Pair ◽

Repetitive Sequence ◽

Reference Genome ◽

De Novo ◽

Treponema Pallidum ◽

Sequencing Data ◽

Bacterial Genomes ◽

Short Read ◽

Short Reads ◽

Short Read Sequencing

The reconstruction of genomes using mapping based approaches with short reads experiences difficulties when resolving repetitive regions. These repetitive regions in genomes result in low mapping qualities of the respective reads, which in turn lead to many unresolved bases of the genotypers. Currently, the reconstruction of these regions is often based on modified references in which the repetitive regions are masked. However, for many references such masked genomes are not available or are based on repetitive regions of other genomes. Our idea is to identify repetitive regions in the reference genome de novo. These regions can then be used to reconstruct them separately using short read sequencing data. Afterwards the reconstructed repetitive sequence can be inserted into the reconstructed genome. We present the program DACCOR, which performs these steps automatically. Our results show an increased base pair resolution of the repetitive regions in the reconstruction of Treponema pallidum samples, resulting in fewer unresolved bases.

Evaluating the accuracy of Listeria monocytogenes assemblies from quasimetagenomic samples using long and short reads

BMC Genomics ◽

10.1186/s12864-021-07702-2 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Seth Commichaux ◽

Kiran Javkar ◽

Padmini Ramachandran ◽

Niranjan Nagarajan ◽

Denis Bertrand ◽

...

Keyword(s):

Public Health ◽

Public Health Response ◽

High Quality ◽

Short Read ◽

Short Reads ◽

The Core ◽

Long Reads ◽

Health Response ◽

Long Read ◽

Core Genes

Abstract. Background: Whole genome sequencing of cultured pathogens is the state-of-the-art public health response for the bioinformatic source tracking of illness outbreaks. Quasimetagenomics can substantially reduce the amount of culturing needed before a high quality genome can be recovered. Highly accurate short read data is analyzed for single nucleotide polymorphisms and multi-locus sequence types to differentiate strains but cannot span many genomic repeats, resulting in highly fragmented assemblies. Long reads can span repeats, resulting in much more contiguous assemblies, but have lower accuracy than short reads.

Results: We evaluated the accuracy of Listeria monocytogenes assemblies from enrichments (quasimetagenomes) of naturally-contaminated ice cream using long read (Oxford Nanopore) and short read (Illumina) sequencing data. Accuracy of ten assembly approaches, over a range of sequencing depths, was evaluated by comparing sequence similarity of genes in assemblies to a complete reference genome. Long read assemblies reconstructed a circularized genome as well as a 71 kbp plasmid after 24 h of enrichment; however, high error rates prevented high fidelity gene assembly, even at 150X depth of coverage. Short read assemblies accurately reconstructed the core genes after 28 h of enrichment but produced highly fragmented genomes. Hybrid approaches demonstrated promising results but had biases based upon the initial assembly strategy. Short read assemblies scaffolded with long reads accurately assembled the core genes after just 24 h of enrichment, but were highly fragmented. Long read assemblies polished with short reads reconstructed a circularized genome and plasmid and assembled all the genes after 24 h enrichment but with less fidelity for the core genes than the short read assemblies.

Conclusion: The integration of long and short read sequencing of quasimetagenomes expedited the reconstruction of a high quality pathogen genome compared to either platform alone. A new and more complete level of information about genome structure, gene order and mobile elements can be added to the public health response by incorporating long read analyses with the standard short read WGS outbreak response.

REscan: inferring repeat expansions and structural variation in paired-end short read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa753 ◽

2020 ◽

Author(s):

Russell Lewis McLaughlin

Keyword(s):

Structural Variation ◽

Sequence Data ◽

Neurological Diseases ◽

Repeat Expansion ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions ◽

Paired End Sequencing

Abstract. Motivation: Repeat expansions are an important class of genetic variation in neurological diseases. However, the identification of novel repeat expansions using conventional sequencing methods is a challenge due to their typical lengths relative to short sequence reads and difficulty in producing accurate and unique alignments for repetitive sequence. However, this latter property can be harnessed in paired-end sequencing data to infer the possible locations of repeat expansions and other structural variation. Results: This article presents REscan, a command-line utility that infers repeat expansion loci from paired-end short read sequencing data by reporting the proportion of reads orientated towards a locus that do not have an adequately mapped mate. A high REscan statistic relative to a population of data suggests a repeat expansion locus for experimental follow-up. This approach is validated using genome sequence data for 259 cases of amyotrophic lateral sclerosis, of which 24 are positive for a large repeat expansion in C9orf72, showing that REscan statistics readily discriminate repeat expansion carriers from non-carriers. Availability and implementation: C source code at https://github.com/rlmcl/rescan (GNU General Public Licence v3).
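The statistic described in the abstract, the proportion of reads orientated towards a locus that lack an adequately mapped mate, can be illustrated with a small sketch. This is not REscan's C implementation: the `Read` record, the `mate_ok` flag, the `window` parameter and the rule used to decide that a read "points toward" the locus are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Read:
    pos: int          # leftmost mapped position of the read
    is_reverse: bool  # True if the read maps to the reverse strand
    mate_ok: bool     # True if the mate is adequately mapped (assumed flag)


def points_toward(read: Read, locus: int, window: int) -> bool:
    """Assumed rule: forward reads upstream of the locus, or reverse reads
    downstream of it, within `window` bp, are orientated towards the locus."""
    if read.is_reverse:
        return locus < read.pos <= locus + window
    return locus - window <= read.pos < locus


def rescan_like_statistic(reads: Iterable[Read], locus: int,
                          window: int = 500) -> Optional[float]:
    """Proportion of locus-facing reads whose mate is not adequately mapped."""
    toward = [r for r in reads if points_toward(r, locus, window)]
    if not toward:
        return None  # no informative reads near this locus
    return sum(not r.mate_ok for r in toward) / len(toward)


if __name__ == "__main__":
    toy_reads = [
        Read(900, False, True), Read(950, False, False),
        Read(1100, True, False), Read(1200, True, True),
        Read(2000, False, False),  # too far away, ignored
    ]
    print(rescan_like_statistic(toy_reads, locus=1000))
```

As the abstract notes, a single value is only meaningful relative to a population of samples, so in practice the statistic at a locus would be compared across many genomes rather than thresholded in isolation.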

First draft genome of loach (Oreonectes shuilongensis; Cypriniformes: Nemacheilidae) provide insights into the evolution of cavefish

10.21203/rs.3.rs-192229/v1 ◽

2021 ◽

Author(s):

Zhijin Liu ◽

Xuekun Qian ◽

Ziming Wang ◽

Huamei Wen ◽

Ling Han ◽

...

Keyword(s):

Single Molecule ◽

Genome Assembly ◽

Eye Development ◽

Draft Genome ◽

Evolutionary Process ◽

Integrated Approach ◽

Sequencing Data ◽

Retina Development ◽

Draft Genome Assembly ◽

Surface Dwelling

Abstract. Background: Loaches of the superfamily Cobitoidea (Cypriniformes, Nemacheilidae) are small elongated bottom-dwelling freshwater fishes with several barbels near the mouth. The genus Oreonectes with 18 currently recognized species contains representatives for all three key stages of the evolutionary process (a surface-dwelling lifestyle, facultative cave persistence, and permanent cave dwelling). Some Oreonectes species show typical cave dwelling-related traits, such as partial or complete leucism and regression of the eyes, rendering them suitable study objects of micro-evolution. Genome information of Oreonectes species is therefore an indispensable resource for research into the evolution of cavefishes.

Results: Here we assembled the genome sequence of O. shuilongensis, a surface-dwelling species, using an integrated approach that combined PacBio single-molecule real-time sequencing and Illumina X-ten paired-end sequencing. Based on a total of 50.9 Gb of sequencing data, our genome assembly from Canu and Pilon spans approximately 515.64 Mb (estimated coverage of 100 ×), containing 803 contigs with an N50 of 5.58 Mb. 25,247 protein-coding genes were predicted, of which 95.65% have been functionally annotated. We also performed genome re-sequencing of three additional cave-dwelling Oreonectes fishes. Twenty-nine pseudogenes annotated using DAVID showed significant enrichment for the GO terms of “eye development” and “retina development in camera-type eye”. It is presumed that these pseudogenes might lead to eye degeneration of semi/complete cave-dwelling Oreonectes species. Furthermore, Mc1r (melanocortin-1 receptor) is pseudogenized by a deletion in O. daqikongensis, likely blocking biosynthesis of melanin and leading to the albino phenotype.

Conclusions: We here report the first draft genome assembly of Oreonectes fishes, which is also the first genome reference for Cobitidea fishes. Pseudogenization of genes related to body color and eye development may be responsible for loss of pigmentation and vision deterioration in cave-dwelling species. This genome assembly will contribute to the study of the evolution and adaptation of fishes within Oreonectes and beyond (Cobitidea).
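The contig N50 quoted in the abstract above is a standard contiguity measure: the largest length L such that contigs of length at least L together cover at least half of the assembly. A minimal sketch of that definition follows; the function name `n50` and the toy contig lengths are illustrative and are not data from the study.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    # Toy assembly of five contigs totalling 20 units.
    print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, it says nothing about correctness; two assemblies with the same N50 can differ widely in misassembly rate.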