Optimization of I/O Intensive Genome Assemblies on the Cori Supercomputer with Burst Buffer

Abstract Genome-enabled biotechnologies have the potential to accelerate breeding efforts in long-lived perennial crop species. Despite the transformative potential of molecular tools in pecan and other outcrossing tree species, highly heterozygous genomes, significant presence–absence gene content variation, and histories of interspecific hybridization have constrained breeding efforts. To overcome these challenges, here, we present diploid genome assemblies and annotations of four outbred pecan genotypes, including a PacBio HiFi chromosome-scale assembly of both haplotypes of the ‘Pawnee’ cultivar. Comparative analysis and pan-genome integration reveal substantial and likely adaptive interspecific genomic introgressions, including an over-retained haplotype introgressed from bitternut hickory into pecan breeding pedigrees. Further, by leveraging our pan-genome presence–absence and functional annotation database among genomes and within the two outbred haplotypes of the ‘Lakota’ genome, we identify candidate genes for pest and pathogen resistance. Combined, these analyses and resources highlight significant progress towards functional and quantitative genomics in highly diverse and outbred crops.

Chromosomal assembly of the nuclear genome of the endosymbiont-bearing trypanosomatid Angomonas deanei

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkaa018 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

John W Davey ◽

Carolina M C Catta-Preta ◽

Sally James ◽

Sarah Forrester ◽

Maria Cristina M Motta ◽

...

Keyword(s):

Chromosome Number ◽

Noncoding Rnas ◽

Nuclear Genome ◽

Supernumerary Chromosome ◽

Ribosomal Rnas ◽

Protein Coding ◽

Transfer Rnas ◽

Protein Coding Genes ◽

Oxford Nanopore ◽

Genome Assemblies

Abstract Angomonas deanei is an endosymbiont-bearing trypanosomatid with several highly fragmented genome assemblies and unknown chromosome number. We present an assembly of the A. deanei nuclear genome based on Oxford Nanopore sequence that resolves into 29 complete or close-to-complete chromosomes. The assembly has several previously unknown special features; it has a supernumerary chromosome, a chromosome with a 340-kb inversion, and there is a translocation between two chromosomes. We also present an updated annotation of the chromosomal genome with 10,365 protein-coding genes, 59 transfer RNAs, 26 ribosomal RNAs, and 62 noncoding RNAs.

The Distribution of Several Genomic Virulence Determinants Does Not Corroborate the Established Serotyping Classification of Bacillus thuringiensis

International Journal of Molecular Sciences ◽

10.3390/ijms22052244 ◽

2021 ◽

Vol 22 (5) ◽

pp. 2244

Author(s):

Anton E. Shikov ◽

Yury V. Malovichko ◽

Arseniy A. Lobov ◽

Maria E. Belousova ◽

Anton A. Nizhnikov ◽

...

Keyword(s):

Bacillus Thuringiensis ◽

Protein Identification ◽

Core Gene ◽

Comparative Genomic ◽

Virulence Determinants ◽

Strain Characterization ◽

Proteomic Approach ◽

Liquid Chromatography Tandem Mass ◽

Identity Threshold ◽

Genome Assemblies

Bacillus thuringiensis, commonly referred to as Bt, has long interested microbiologists because of its highly effective insecticidal properties, which make Bt a prominent source of biologicals. To categorize the multitude of Bt strains discovered, serotyping assays are used in which flagellin serves as the primary seroreactive molecule. Despite its convenience, this approach is not indicative of Bt strains’ phenotypes, nor does it reflect actual phylogenetic relationships within the species. In this respect, comparative genomic and proteomic techniques appear more informative, but their use in Bt strain classification remains limited. In the present work, we used a bottom-up proteomic approach based on fluorescent two-dimensional difference gel electrophoresis (2D-DIGE) coupled with liquid chromatography/tandem mass spectrometry (LC-MS/MS) protein identification to assess which stage of Bt culture, vegetative or spore, would be more informative for strain characterization. To this end, the proteomic differences for the israelensis-attributed strains were assessed to compare sporulating cultures of the virulent derivative to the avirulent one as well as to the vegetative-stage virulent bacteria. Using the same approach, virulent spores of the israelensis strain were also compared to the spores of strains belonging to two other major Bt serovars, namely darmstadiensis and thuringiensis. The identified proteins were analyzed for the presence of the respective genes in the 104 openly available Bt genome assemblies with specified serovar attributions. Of 21 proteins identified, 15 were found to be encoded in all of these assemblies at a 67% identity threshold, including several virulence factors. Notably, individual phylogenies of these core genes conformed to neither the serotyping nor the flagellin-based phylogeny but corroborated the reconstruction based on phylogenomics approaches in terms of tree topology similarity. In turn, the distribution of accessory protein genes was not confined to the existing serovars. The obtained results indicate that neither gene presence nor the core gene sequence may serve as a distinctive basis for serovar attribution, undermining the notion that the serotyping system reflects strains’ phenotypic or genetic similarity. We also provide a set of loci that fit the phylogenomics data plausibly and thus may serve for draft phylogeny estimation of novel strains.
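The core versus accessory split described in this abstract can be illustrated with a short sketch. It assumes that, for each identified protein, the best-hit percent identity in every assembly has already been computed by some search tool. The function name, the assembly labels, and the identity values are hypothetical toy data, not results from the study; only the 67% threshold comes from the abstract.

```python
def core_genes(best_identity, assemblies, threshold=67.0):
    """Return genes whose best hit meets the threshold in every assembly.

    best_identity maps gene -> {assembly name -> best-hit percent identity}.
    An assembly missing from a gene's dictionary counts as "gene absent".
    """
    return sorted(
        gene
        for gene, hits in best_identity.items()
        if all(hits.get(name, 0.0) >= threshold for name in assemblies)
    )


# Hypothetical toy input: three assemblies, three proteins.
assemblies = ["asm_A", "asm_B", "asm_C"]
best_identity = {
    "protein_1": {"asm_A": 99.1, "asm_B": 88.0, "asm_C": 67.0},  # present everywhere
    "protein_2": {"asm_A": 95.0, "asm_B": 66.9, "asm_C": 91.2},  # below threshold in asm_B
    "protein_3": {"asm_A": 100.0, "asm_B": 97.5},                # no hit in asm_C
}

core = core_genes(best_identity, assemblies)
accessory = sorted(set(best_identity) - set(core))
print("core:", core)
print("accessory:", accessory)
```

Treating a missing hit as 0% identity is a simplifying choice made here so that absent genes fall into the accessory set.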

Genome Assembly of the Canadian Two-row Malting Barley Cultivar AAC Synergy

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab031 ◽

2021 ◽

Author(s):

Wayne Xu ◽

James R Tucker ◽

Wubishet A Bekele ◽

Frank M You ◽

Yong-Bi Fu ◽

...

Keyword(s):

Reference Genome ◽

Single Copy ◽

Barley Cultivar ◽

Malting Barley ◽

Orthologous Genes ◽

Hordeum Vulgare L ◽

Chromosome Conformation ◽

Mate Pair ◽

Genome Assemblies ◽

First Time

Abstract Barley (Hordeum vulgare L.) is one of the most important global crops. The six-row barley cultivar Morex reference genome has been used by the barley research community worldwide. However, this reference genome can have limitations when used for genomic and genetic diversity analysis studies, gene discovery, and marker development when working in two-row germplasm that is more common to Canadian barley. Here we assembled, for the first time, the genome sequence of a Canadian two-row malting barley, cultivar AAC Synergy. We applied deep Illumina paired-end reads, long mate-pair reads, PacBio sequences, 10x Chromium linked-read libraries, and chromosome conformation capture sequencing (Hi-C) to generate a contiguous assembly. The genome assembled from super-scaffolds had a size of 4.85 Gb, an N50 of 2.32 Mb, and an estimated 93.9% of complete genes from a plant database (BUSCO, benchmarking universal single-copy orthologous genes). After removal of small scaffolds (< 300 Kb), the assembly was arranged into pseudomolecules of 4.14 Gb in size with seven chromosomes plus unanchored scaffolds. The completeness and annotation of the assembly were assessed by comparing it with the updated version of the six-row Morex and the recently released two-row Golden Promise genome assemblies.
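The N50 reported in this abstract is the scaffold length at which scaffolds of that length or longer together cover at least half of the assembly. A minimal sketch of that calculation, together with a length filter of the kind used to drop small scaffolds, might look as follows. The function names and the toy scaffold lengths are illustrative assumptions, not the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def drop_small(lengths, min_length):
    """Keep only scaffolds that are at least min_length long."""
    return [length for length in lengths if length >= min_length]


# Hypothetical scaffold lengths (arbitrary units).
scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(scaffolds))                  # N50 of the full set
print(n50(drop_small(scaffolds, 40)))  # N50 after filtering
```

Filtering out short scaffolds changes the total length and therefore the N50, so the two figures are best reported together with the cut-off that was applied.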

Analysis of 56,348 Genomes Identifies the Relationship between Antibiotic and Metal Resistance and the Spread of Multidrug-Resistant Non-Typhoidal Salmonella

Microorganisms ◽

10.3390/microorganisms9071468 ◽

2021 ◽

Vol 9 (7) ◽

pp. 1468

Author(s):

Gavin J. Fenske ◽

Joy Scaria

Keyword(s):

Antibiotic Resistance ◽

Foodborne Pathogen ◽

Antibiotic Resistance Genes ◽

Multidrug Resistant ◽

Metal Resistance ◽

Group 3 ◽

Group 2 ◽

Occurrence Matrix ◽

Genome Assemblies ◽

Group 1

Salmonella enterica is a common foodborne pathogen that causes both enteric and systemic infections in hosts. Antibiotic resistance is common in certain serovars of the pathogen and is of great concern to public health. Recent reports have documented the co-occurrence of metal resistance with antibiotic resistance in one serovar of S. enterica. Therefore, we sought to identify possible co-occurrence in a large genomic dataset. Genome assemblies of 56,348 strains of S. enterica comprising 20 major serovars were downloaded from NCBI. The downloaded assemblies were quality controlled and in silico serotyped to ensure consistency and avoid improper annotation from public databases. Metal and antibiotic resistance genes were identified in the genomes, as were plasmid replicons. Co-occurrent genes were identified by constructing a co-occurrence matrix and grouping that matrix using k-means clustering, which yielded three groups of co-occurrent genes. Group 1 comprised the pco and sil operons, which confer resistance to copper and silver, respectively, and was distributed across four serovars. Group 2 contained the majority of the genes, and little to no co-occurrence was observed. Metal and antibiotic co-occurrence was identified in group 3, which contained genes conferring resistance to arsenic, mercury, beta-lactams, sulfonamides, and tetracyclines. Group 3 genes were also associated with an IncQ1 class plasmid replicon. Metal and antibiotic co-occurrence from group 3 genes is mostly confined to one clade of S. enterica I 4,[5],12:i:-.
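The two-step idea in this abstract, building a gene co-occurrence matrix and then grouping its rows with k-means, can be sketched in a few lines of plain Python. Everything below is an assumption made for illustration: the function names, the deterministic seeding of the clustering, and the five toy genomes are not taken from the study, and the gene labels are used only as placeholders.

```python
from itertools import combinations


def cooccurrence_matrix(genomes, genes):
    """Count, for every gene pair, the genomes that carry both genes."""
    index = {gene: i for i, gene in enumerate(genes)}
    size = len(genes)
    matrix = [[0] * size for _ in range(size)]
    for present in genomes:
        carried = sorted(index[g] for g in set(present) if g in index)
        for i in carried:
            matrix[i][i] += 1  # diagonal: genomes carrying gene i
        for i, j in combinations(carried, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


def kmeans(rows, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k distinct rows."""
    centroids = []
    for row in rows:
        if list(row) not in centroids:
            centroids.append(list(row))
        if len(centroids) == k:
            break
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
            )
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data: each set lists the resistance genes found in one genome.
genes = ["pcoA", "silA", "merA", "blaTEM"]
genomes = [
    {"pcoA", "silA"},
    {"pcoA", "silA"},
    {"pcoA", "silA"},
    {"merA", "blaTEM"},
    {"merA", "blaTEM"},
]

matrix = cooccurrence_matrix(genomes, genes)
labels = kmeans(matrix, k=2)
for gene, label in zip(genes, labels):
    print(gene, "-> group", label)
```

In practice a library implementation of k-means with randomized restarts would usually replace the hand-written loop; the deterministic seeding here only keeps the toy example reproducible.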

Mitochondrial Genome Assemblies of Elysia timida and Elysia cornigera and the Response of Mitochondrion-Associated Metabolism during Starvation

Genome Biology and Evolution ◽

10.1093/gbe/evx129 ◽

2017 ◽

Vol 9 (7) ◽

pp. 1873-1879 ◽

Cited By ~ 7

Author(s):

Cessa Rauch ◽

Gregor Christa ◽

Jan de Vries ◽

Christian Woehle ◽

Sven B. Gould

Keyword(s):

Mitochondrial Genome ◽

Genome Assemblies

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab034 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Jean-Marc Aury ◽

Benjamin Istace

Keyword(s):

Single Molecule ◽

Direct Consequence ◽

High Quality ◽

Sequencing Errors ◽

Coding Regions ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

Genome Assemblies

Abstract Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and then, using efficient algorithms, providing high-quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect the nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long reads) to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.

Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C

Nature Communications ◽

10.1038/s41467-020-20536-y ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zev N. Kronenberg ◽

Arang Rhie ◽

Sergey Koren ◽

Gregory T. Concepcion ◽

Paul Peluso ◽

...

Keyword(s):

Zebra Finch ◽

Cultured Cells ◽

De Novo ◽

Single Cells ◽

Variant Calling ◽

Chromatin Interaction ◽

Extended Haplotype ◽

Benchmark Datasets ◽

And Performance ◽

Genome Assemblies

Abstract Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without parental data, and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97%, compared to 80–91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.

Genome sequencing of four culinary herbs reveals terpenoid genes underlying chemodiversity in the Nepetoideae

DNA Research ◽

10.1093/dnares/dsaa016 ◽

2020 ◽

Vol 27 (3) ◽

Author(s):

Nolan Bornowski ◽

John P Hamilton ◽

Pan Liao ◽

Joshua C Wood ◽

Natalia Dudareva ◽

...

Keyword(s):

Leaf Tissue ◽

Terpene Synthases ◽

Genomic Analyses ◽

Genes Encoding ◽

Medicinal Properties ◽

Culinary Herbs ◽

Rosmarinus Officinalis L ◽

Origanum Majorana ◽

Genome Assemblies ◽

High Quality Genome

Abstract Species within the mint family, Lamiaceae, are widely used for their culinary, cultural, and medicinal properties due to production of a wide variety of specialized metabolites, especially terpenoids. To further our understanding of genome diversity in the Lamiaceae and to provide a resource for mining biochemical pathways, we generated high-quality genome assemblies of four economically important culinary herbs, namely, sweet basil (Ocimum basilicum L.), sweet marjoram (Origanum majorana L.), oregano (Origanum vulgare L.), and rosemary (Rosmarinus officinalis L.), and characterized their terpenoid diversity through metabolite profiling and genomic analyses. A total of 25 monoterpenes and 11 sesquiterpenes were identified in leaf tissue from the four species. Genes encoding enzymes responsible for the biosynthesis of precursors for mono- and sesquiterpene synthases were identified in all four species. Across all four species, a total of 235 terpene synthases were identified, ranging from 27 in O. majorana to 137 in the tetraploid O. basilicum. This study provides valuable resources for further investigation of the genetic basis of chemodiversity in these important culinary herbs.
