RandAL: a randomized approach to aligning DNA sequences to reference genomes

Abstract: Advances in genomics have the potential to revolutionize clinical diagnostics. Here, we examine the microbiome of vitreous (intraocular body fluid) from patients who developed endophthalmitis following cataract surgery or intravitreal injection. Endophthalmitis is an inflammation of the intraocular cavity and can lead to a permanent loss of vision. As controls, we included vitreous from endophthalmitis-negative patients, balanced salt solution used during vitrectomy, and DNA extraction blanks. We compared two DNA isolation procedures and found that an ultraclean production of reagents appeared to reduce background DNA in these low microbial biomass samples. We created a curated microbial genome database (>5700 genomes) and designed a metagenomics workflow with filtering steps to reduce DNA sequences originating from: i) human hosts, ii) ambiguities/contaminants in public microbial reference genomes, and iii) the environment. In nearly all cases, our metagenomic read classification revealed the same microorganism as was determined in cultivation- and mass spectrometry-based analyses. For some patients, we identified the sequence type of the microorganism and antibiotic resistance genes through analyses of whole genome sequence (WGS) assemblies of isolates and metagenomic assemblies. Together, we conclude that genomics-based analyses of human ocular body fluid specimens can provide actionable information relevant to infectious disease management.


Assembly and Annotation of an Ashkenazi Human Reference Genome

10.1101/2020.03.18.997395 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alaina Shumate ◽

Aleksey V. Zimin ◽

Rachel M. Sherman ◽

Daniela Puiu ◽

Justin M. Wagner ◽

...

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Gene Families ◽

Gene Content ◽

Specific Reference ◽

Protein Coding ◽

Human Reference Genome ◽

Protein Coding Genes ◽

Reference Genomes ◽

Similar Gene

Abstract: Here we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. 40 of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. 11 genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.
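The figures quoted in this abstract can be put side by side with a little arithmetic. The sketch below only restates the numbers given above; the variable names are illustrative and are not taken from the paper.

```python
# Figures quoted in the abstract above.
ash1_bases = 2_973_118_650      # nucleotides in Ash1
grch38_bases = 2_937_639_212    # nucleotides in GRCh38, as quoted
total_genes = 20_157            # protein-coding genes annotated on Ash1
near_identical = 19_563         # genes >99% identical to their GRCh38 counterparts

# How much additional sequence Ash1 contains, in bases and as a percentage.
extra_bases = ash1_bases - grch38_bases
extra_percent = 100 * extra_bases / grch38_bases

# Share of annotated genes that are >99% identical, and the remainder.
near_identical_percent = 100 * near_identical / total_genes
remaining_genes = total_genes - near_identical

print(f"Ash1 has {extra_bases:,} more bases ({extra_percent:.2f}% more)")
print(f"{near_identical_percent:.2f}% of genes are >99% identical; "
      f"{remaining_genes} are not")
```

On these numbers, Ash1 carries roughly 35.5 Mb of additional sequence, and about 97% of its annotated protein-coding genes are near-identical to GRCh38.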


Re-examination of two diatom reference genomes using long-read sequencing

BMC Genomics ◽

10.1186/s12864-021-07666-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gina V. Filloramo ◽

Bruce A. Curtis ◽

Emma Blanche ◽

John M. Archibald

Keyword(s):

Repetitive Dna ◽

Dna Sequences ◽

Optical Mapping ◽

Sequence Data ◽

Thalassiosira Pseudonana ◽

Rapid Expansion ◽

Model Organisms ◽

Algal Group ◽

Long Read ◽

Reference Genomes

Background: The marine diatoms Thalassiosira pseudonana and Phaeodactylum tricornutum are valuable model organisms for exploring the evolution, diversity and ecology of this important algal group. Their reference genomes, published in 2004 and 2008, respectively, were the product of traditional Sanger sequencing. In the case of T. pseudonana, optical restriction site mapping was employed to further clarify and contextualize chromosome-level scaffolds. While both genomes are considered highly accurate and reasonably contiguous, they still contain many unresolved regions and unordered/unlinked scaffolds.
Results: We have used Oxford Nanopore Technologies long-read sequencing to update and validate the quality and contiguity of the T. pseudonana and P. tricornutum genomes. Fine-scale assessment of our long-read derived genome assemblies allowed us to resolve previously uncertain genomic regions, further characterize complex structural variation, and re-evaluate the repetitive DNA content of both genomes. We also identified 1862 previously undescribed genes in T. pseudonana. In P. tricornutum, we used transposable element detection software to identify 33 novel copia-type LTR-RT insertions, indicating ongoing activity and rapid expansion of this superfamily as the organism continues to be maintained in culture. Finally, Bionano optical mapping of P. tricornutum chromosomes was combined with long-read sequence data to explore the potential of long-read sequencing and optical mapping for resolving haplotypes.
Conclusion: Despite its potential to yield highly contiguous scaffolds, long-read sequencing is not a panacea. Even for relatively small nuclear genomes such as those investigated herein, repetitive DNA sequences cause problems for current genome assembly algorithms. Determining whether a long-read derived genomic assembly is ‘better’ than one produced using traditional sequence data is not straightforward. Our revised reference genomes for P. tricornutum and T. pseudonana nevertheless provide additional insight into the structure and evolution of both genomes, thereby providing a more robust foundation for future diatom research.


Accurate sequence variant genotyping in cattle using variation-aware genome graphs

10.1101/460345 ◽

2018 ◽

Cited By ~ 1

Author(s):

Danang Crysnanto ◽

Christine Wurmser ◽

Hubert Pausch

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Sequence Data ◽

Sequence Variant ◽

Sequence Variants ◽

Sequencing Data ◽

Reference Allele ◽

Reference Genomes ◽

Genome Graph ◽

Genotype Concordance

Background: The genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of sequence variation within a species, reference allele bias may occur at highly polymorphic or diverged regions of the genome. Graph-based methods make it possible to compare sequencing reads to a variation-aware genome graph that incorporates non-redundant DNA sequences that segregate within a species. Using whole-genome sequencing data of 49 Original Braunvieh cattle, we compared the accuracy and sensitivity of graph-based sequence variant genotyping with the Graphtyper software to two widely used methods that rely on linear reference genomes, i.e., GATK and SAMtools. Results: We discovered 21,140,196, 20,262,913 and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the lowest number of Mendelian inconsistencies for both SNPs and indels in nine sire-son pairs with sequence data. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all tools evaluated, particularly for animals that had been sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32%, 99.46%, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved the genotype concordance slightly but also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but less than GATK.
Conclusions: Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants which is not possible with the current implementations of state-of-the-art methods that rely on linear reference genomes.
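The three comparison metrics named in this abstract can be illustrated with a small sketch. The definitions below follow one common convention and are an assumption here; the abstract does not spell out the exact formulas the authors used. Genotypes are coded as alternate-allele counts (0, 1, 2), the microarray calls are treated as truth, and the function name `concordance_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def concordance_metrics(seq_gt: Sequence[int],
                        array_gt: Sequence[int]) -> Dict[str, float]:
    """Compare sequence-derived genotypes with array genotypes (truth).

    Assumed definitions (one common convention, not confirmed by the source):
      concordance            = matching genotypes / all compared sites
      non_ref_sensitivity    = truth non-reference sites also called
                               non-reference / all truth non-reference sites
      non_ref_discrepancy    = mismatching genotypes / sites where at least
                               one of the two calls is non-reference
    """
    if len(seq_gt) != len(array_gt):
        raise ValueError("genotype vectors must have the same length")

    pairs = list(zip(seq_gt, array_gt))
    nan = float("nan")

    # Overall concordance across every compared site.
    matches = sum(s == a for s, a in pairs)
    concordance = matches / len(pairs) if pairs else nan

    # Sensitivity is restricted to sites that are non-reference on the array.
    truth_nonref = [(s, a) for s, a in pairs if a > 0]
    called_nonref = sum(s > 0 for s, _ in truth_nonref)
    sensitivity = called_nonref / len(truth_nonref) if truth_nonref else nan

    # Discrepancy ignores sites where both calls are homozygous reference.
    informative = [(s, a) for s, a in pairs if not (s == 0 and a == 0)]
    mismatches = sum(s != a for s, a in informative)
    discrepancy = mismatches / len(informative) if informative else nan

    return {
        "concordance": concordance,
        "non_ref_sensitivity": sensitivity,
        "non_ref_discrepancy": discrepancy,
    }


# Toy data: six sites for one animal (values are made up for illustration).
metrics = concordance_metrics(seq_gt=[0, 1, 2, 1, 0, 2],
                              array_gt=[0, 1, 2, 2, 0, 1])
print(metrics)
```

Excluding sites where both calls are homozygous reference keeps the discrepancy figure from being diluted by the many sites at which every method trivially agrees.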


GRAFIMO: Variant and haplotype aware motif scanning on pangenome graphs

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009444 ◽

2021 ◽

Vol 17 (9) ◽

pp. e1009444

Author(s):

Manuel Tognon ◽

Vincenzo Bonnici ◽

Erik Garrison ◽

Rosalba Giugno ◽

Luca Pinello

Keyword(s):

Dna Sequences ◽

Binding Sites ◽

Specific Binding ◽

Genomic Variation ◽

Link Type ◽

Scanning Procedure ◽

Expression Of Genes ◽

Potential Binding ◽

Command Line Tool ◽

Reference Genomes

Transcription factors (TFs) are proteins that promote or reduce the expression of genes by binding short genomic DNA sequences known as transcription factor binding sites (TFBS). While several tools have been developed to scan for potential occurrences of TFBS in linear DNA sequences or reference genomes, no tool exists to find them in pangenome variation graphs (VGs). VGs are sequence-labelled graphs that can efficiently encode collections of genomes and their variants in a single, compact data structure. Because VGs can losslessly compress large pangenomes, TFBS scanning in VGs can efficiently capture how genomic variation affects the potential binding landscape of TFs in a population of individuals. Here we present GRAFIMO (GRAph-based Finding of Individual Motif Occurrences), a command-line tool for the scanning of known TF DNA motifs represented as Position Weight Matrices (PWMs) in VGs. GRAFIMO extends the standard PWM scanning procedure by considering variations and alternative haplotypes encoded in a VG. Using GRAFIMO on a VG based on individuals from the 1000 Genomes project we recover several potential binding sites that are enhanced, weakened or missed when scanning only the reference genome, and which could constitute individual-specific binding events. GRAFIMO is available as an open-source tool, under the MIT license, at https://github.com/pinellolab/GRAFIMO and https://github.com/InfOmics/GRAFIMO.
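The standard PWM scanning procedure that the abstract refers to can be sketched on plain linear sequences. This is not GRAFIMO's code or API: the toy motif, the two haplotype strings, and the function names are all hypothetical, and a real variation graph is replaced here by simply scanning a reference and an alternative haplotype separately.

```python
import math
from typing import Dict, List, Tuple

Column = Dict[str, float]


def log_odds_pwm(prob_columns: List[Column],
                 background: float = 0.25) -> List[Column]:
    """Convert per-position base probabilities into log2-odds scores.

    Probabilities must be non-zero (e.g. after adding pseudocounts).
    A uniform background frequency is assumed.
    """
    return [{base: math.log2(p / background) for base, p in col.items()}
            for col in prob_columns]


def scan(sequence: str, pwm: List[Column]) -> List[Tuple[int, float]]:
    """Score every window of the motif width; returns (start, score) pairs."""
    width = len(pwm)
    return [(start, sum(pwm[j][sequence[start + j]] for j in range(width)))
            for start in range(len(sequence) - width + 1)]


# Hypothetical 3-bp motif that prefers "ACG".
motif = log_odds_pwm([
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
])

# Two haplotypes that differ by a single G>T variant at position 4.
reference_hap = "TTACGTT"
alternative_hap = "TTACTTT"

best_ref = max(scan(reference_hap, motif), key=lambda hit: hit[1])
best_alt = max(scan(alternative_hap, motif), key=lambda hit: hit[1])
print("reference best hit:", best_ref)
print("alternative best hit:", best_alt)
```

In this toy case the best-scoring window starts at the same position on both haplotypes, but the variant lowers its score, which is the kind of weakened site that a reference-only scan would not reveal.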


Removing Host-derived DNA Sequences from Microbial Metagenomes via Mapping to Reference Genomes

The Plant Microbiome - Methods in Molecular Biology ◽

10.1007/978-1-0716-1040-4_13 ◽

2020 ◽

pp. 147-153

Author(s):

Yun Kit Yeoh

Keyword(s):

Dna Sequences ◽

Reference Genomes


Sensitive detection of pre-integration intermediates of LTR retrotransposons in crop plants

10.1101/317479 ◽

2018 ◽

Author(s):

Jungnam Cho ◽

Matthias Benoit ◽

Marco Catoni ◽

Hajk-Georg Drost ◽

Anna Brestovitsky ◽

...

Keyword(s):

Dna Sequences ◽

Crop Plants ◽

Ltr Retrotransposon ◽

Ltr Retrotransposons ◽

Evolutionary Time ◽

Developmentally Regulated ◽

Bioinformatic Pipeline ◽

Alternative Approach ◽

Reference Genomes

Abstract: Retrotransposons have played an important role in the evolution of host genomes [1,2]. Their impact on host chromosomes is mainly deduced from the composition of DNA sequences, which have been fixed over evolutionary time. These studies provide important “snapshots” reflecting historical activities of transposons but do not predict current transposition potential. We previously reported Sequence-Independent Retrotransposon Trapping (SIRT) as a methodology that, by identification of extrachromosomal linear DNA (eclDNA), revealed the presence of active LTR retrotransposons in Arabidopsis [9]. Unfortunately, SIRT cannot be applied to large and transposon-rich genomes of crop plants. We have since developed an alternative approach named ALE-seq (amplification of LTR of eclDNAs followed by sequencing). ALE-seq reveals sequences of 5’ LTRs of eclDNAs after two-step amplification: in vitro transcription and subsequent reverse transcription. Using ALE-seq in rice, we detected eclDNAs for a novel Copia family LTR retrotransposon, Go-on, which is activated by heat stress. Sequencing of rice accessions revealed that Go-on has preferentially accumulated in indica rice grown at higher temperatures. Furthermore, ALE-seq applied to tomato fruits identified a developmentally regulated Gypsy family of retrotransposons. Importantly, a bioinformatic pipeline adapted for ALE-seq data analyses allows the direct and reference-free annotation of new active retroelements. This pipeline allows assessment of LTR retrotransposon activities in organisms for which genomic sequences and/or reference genomes are unavailable or are of low quality.


Transcription factor/DNA interactions visualized by electron spectroscopic imaging

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100122654 ◽

1992 ◽

Vol 50 (1) ◽

pp. 450-451

Author(s):

David P. Bazett-Jones ◽

Mark L. Brown

Keyword(s):

Transcription Factor ◽

Dna Sequences ◽

Transcription Initiation ◽

Molecular Mechanisms ◽

Phosphorus Content ◽

Spectroscopic Imaging ◽

Electron Spectroscopic Imaging ◽

Dna Backbone ◽

Two Parameters ◽

High Level

A multisubunit RNA polymerase enzyme is ultimately responsible for transcription initiation and elongation of RNA, but recognition of the proper start site by the enzyme is regulated by general, temporal and gene-specific trans-factors interacting at promoter and enhancer DNA sequences. To understand the molecular mechanisms which precisely regulate the transcription initiation event, it is crucial to elucidate the structure of the transcription factor/DNA complexes involved. Electron spectroscopic imaging (ESI) provides the opportunity to visualize individual DNA molecules. Enhancement of DNA contrast with ESI is accomplished by imaging with electrons that have interacted with inner shell electrons of phosphorus in the DNA backbone. Phosphorus detection at this intermediately high level of resolution (≈1 nm) permits selective imaging of the DNA, to determine whether the protein factors compact, bend or wrap the DNA. Simultaneously, mass analysis and phosphorus content can be measured quantitatively, using adjacent DNA or tobacco mosaic virus (TMV) as mass and phosphorus standards. These two parameters provide stoichiometric information relating the ratios of protein:DNA content.


DNA sequence mapping in interphase and metaphase chromosomes by fluorescence in situ hybridization

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100122885 ◽

1992 ◽

Vol 50 (1) ◽

pp. 496-497

Author(s):

Barbara Trask ◽

Susan Allen ◽

Anne Bergmann ◽

Mari Christensen ◽

Anne Fertitta ◽

...

Keyword(s):

In Situ Hybridization ◽

Dna Sequence ◽

Dna Sequences ◽

Dual Band ◽

Nick Translation ◽

Metaphase Chromosomes ◽

Band Pass ◽

Texas Red ◽

Fluorescent Spot

Using fluorescence in situ hybridization (FISH), the positions of DNA sequences can be discretely marked with a fluorescent spot. The efficiency of marking DNA sequences of the size cloned in cosmids is 90-95%, and the fluorescent spots produced after FISH are ≈0.3 μm in diameter. Sites of two sequences can be distinguished using two-color FISH. Different reporter molecules, such as biotin or digoxigenin, are incorporated into DNA sequence probes by nick translation. These reporter molecules are labeled after hybridization with different fluorochromes, e.g., FITC and Texas Red. The development of dual band pass filters (Chromatechnology) allows these fluorochromes to be photographed simultaneously without registration shift.
