Evolutionary superscaffolding and chromosome anchoring to improve Anopheles genome assemblies

BMC Biology ◽  
2020 ◽  
Vol 18 (1) ◽  
Author(s):  
Robert M. Waterhouse ◽  
Sergey Aganezov ◽  
Yoann Anselmetti ◽  
Jiyoung Lee ◽  
Livio Ruzzante ◽  
...  

Abstract Background New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from ‘finished’. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. Results We evaluated and employed 3 gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: 6 with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and 3 with new assemblies based on re-scaffolding or long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: 7 for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further 7 with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. Conclusions Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our evaluations show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.
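The idea of a consensus set of scaffold adjacencies can be illustrated with a small voting sketch. This is not the procedure used in the paper: the function names, the scaffold labels, and the rule of keeping an adjacency when at least two methods agree are all assumptions made for illustration, and orientation and conflicting adjacencies are ignored.

```python
from collections import Counter
from typing import FrozenSet, Iterable, Set

# An adjacency is modelled as an unordered pair of scaffold names (assumption).
Adjacency = FrozenSet[str]


def adjacency(a: str, b: str) -> Adjacency:
    """Build an unordered scaffold pair so that (a, b) equals (b, a)."""
    return frozenset((a, b))


def consensus_adjacencies(
    predictions: Iterable[Set[Adjacency]], min_support: int = 2
) -> Set[Adjacency]:
    """Keep adjacencies predicted by at least `min_support` methods."""
    counts: Counter = Counter()
    for method_set in predictions:
        # Each method contributes one vote per adjacency it predicts.
        counts.update(method_set)
    return {adj for adj, votes in counts.items() if votes >= min_support}


# Hypothetical predictions from three synteny-based methods.
method_a = {adjacency("s1", "s2"), adjacency("s2", "s3")}
method_b = {adjacency("s2", "s1"), adjacency("s3", "s4")}
method_c = {adjacency("s1", "s2"), adjacency("s3", "s4"), adjacency("s5", "s6")}

consensus = consensus_adjacencies([method_a, method_b, method_c])
```

In this toy input, only the pairs s1/s2 and s3/s4 are supported by two or more methods, so only they would be carried forward for confirmation against physical mapping or RNAseq evidence.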

2018 ◽  
Author(s):  
Robert M. Waterhouse ◽  
Sergey Aganezov ◽  
Yoann Anselmetti ◽  
Jiyoung Lee ◽  
Livio Ruzzante ◽  
...  

Abstract Background New sequencing technologies have lowered financial barriers to whole genome sequencing, but resulting assemblies are often fragmented and far from ‘finished’. Updating multi-scaffold drafts to chromosome-level status can be achieved through experimental mapping or re-sequencing efforts. Avoiding the costs associated with such approaches, comparative genomic analysis of gene order conservation (synteny) to predict scaffold neighbours (adjacencies) offers a potentially useful complementary method for improving draft assemblies. Results We employed three gene synteny-based methods applied to 21 Anopheles mosquito assemblies to produce consensus sets of scaffold adjacencies. For subsets of the assemblies, we integrated these with additional supporting data to confirm and complement the synteny-based adjacencies: six with physical mapping data that anchor scaffolds to chromosome locations, 13 with paired-end RNA sequencing (RNAseq) data, and three with new assemblies based on re-scaffolding or Pacific Biosciences long-read data. Our combined analyses produced 20 new superscaffolded assemblies with improved contiguities: seven for which assignments of non-anchored scaffolds to chromosome arms span more than 75% of the assemblies, and a further seven with chromosome anchoring including an 88% anchored Anopheles arabiensis assembly and, respectively, 73% and 84% anchored assemblies with comprehensively updated cytogenetic photomaps for Anopheles funestus and Anopheles stephensi. Conclusions Experimental data from probe mapping, RNAseq, or long-read technologies, where available, all contribute to successful upgrading of draft assemblies. Our comparisons show that gene synteny-based computational methods represent a valuable alternative or complementary approach. Our improved Anopheles reference assemblies highlight the utility of applying comparative genomics approaches to improve community genomic resources.


2021 ◽  
Vol 10 (46) ◽  
Author(s):  
Kentaro Miyazaki ◽  
Natsuko Tokito

Complete genome resequencing was conducted for Thermus thermophilus strain TMY by hybrid assembly of Oxford Nanopore Technologies long-read and MGI short-read data. Errors in the previously reported genome sequence determined by PacBio technology alone were corrected, allowing for high-quality comparative genomic analysis of closely related T. thermophilus genomes.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Xian-Gui Yi ◽  
Xia-Qing Yu ◽  
Jie Chen ◽  
Min Zhang ◽  
Shao-Wei Liu ◽  
...  

Abstract Cerasus serrulata is a flowering cherry germplasm resource for ornamental purposes. In this work, we present a de novo chromosome-scale genome assembly of C. serrulata by the use of Nanopore and Hi-C sequencing technologies. The assembled C. serrulata genome is 265.40 Mb across 304 contigs and 67 scaffolds, with a contig N50 of 1.56 Mb and a scaffold N50 of 31.12 Mb. It contains 29,094 coding genes, 27,611 (94.90%) of which are annotated in at least one functional database. Synteny analysis indicated that C. serrulata and C. avium have 333 syntenic blocks composed of 14,072 genes. Blocks on chromosome 01 of C. serrulata are distributed on all chromosomes of C. avium, implying that chromosome 01 is the most ancient or active of the chromosomes. The comparative genomic analysis confirmed that C. serrulata has 740 expanded gene families, 1031 contracted gene families, and 228 rapidly evolving gene families. By the use of 656 single-copy orthologs, a phylogenetic tree composed of 10 species was constructed. The present C. serrulata species diverged from Prunus yedoensis ~17.34 million years ago (Mya), while the divergence of C. serrulata and C. avium was estimated to have occurred ∼21.44 Mya. In addition, a total of 148 MADS-box family gene members were identified in C. serrulata, accompanying the loss of the AGL32 subfamily and the expansion of the SVP subfamily. The MYB and WRKY gene families comprising 372 and 66 genes could be divided into seven and eight subfamilies in C. serrulata, respectively, based on clustering analysis. Nine hundred forty-one plant disease-resistance genes (R-genes) were detected by searching C. serrulata within the PRGdb. This research provides high-quality genomic information about C. serrulata as well as insights into the evolutionary history of Cerasus species.
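The contig and scaffold N50 values quoted above follow the standard definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch, with toy lengths that are not taken from the C. serrulata assembly:

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example (hypothetical lengths in kb): total is 290, half is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
```

Here the two longest sequences together cover 150 of 290, so the N50 is 70.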


2016 ◽  
Author(s):  
Eric Disdero ◽  
Jonathan Filée

Abstract Motivation Population genomic analysis of transposable elements has greatly benefited from recent advances in sequencing technologies. However, the propensity of transposable elements to nest in highly repeated regions of genomes limits the efficiency of bioinformatic tools when short-read sequencing technology is used. Results LoRTE is the first tool able to use PacBio long-read sequences to identify transposon deletions and insertions between a reference genome and genomes of different strains or populations. Tested against Drosophila melanogaster PacBio datasets, LoRTE appears to be a reliable and broadly applicable tool to study the dynamics and evolutionary impact of transposable elements using low-coverage, long-read sequences. Availability and Implementation LoRTE is available at http://www.egce.cnrs-gif.fr/?p=6422. It is written in Python 2.7 and only requires the NCBI BLAST+ package. LoRTE can be used on a standard computer with limited RAM resources and reasonable running time even with large [email protected]


Author(s):  
James Fulton ◽  
Jeremy Brawner ◽  
Jose Huguet-Tapia ◽  
Katherine E Smith ◽  
Randy Fernandez ◽  
...  

Fusarium wilt, caused by Fusarium oxysporum f. sp. niveum (Fon), is a soilborne disease that significantly limits yield in watermelon (Citrullus lanatus) and occasionally causes the loss of an entire year’s harvest. Reference-quality de novo genomic assemblies of pathogenic and non-pathogenic strains were generated using a combination of next-generation and third-generation sequencing technologies. Chromosomal-level genomes were produced with representatives from all Fon races, facilitating comparative genomic analysis and the identification of chromosomal structural variation. Syntenic analysis between isolates allowed differentiation of the core and lineage-specific portions of their genomes. This research will support future efforts to refine the scientific understanding of the molecular and genetic factors underpinning the Fon host range, develop diagnostic assays for each of the four races, and decipher the evolutionary history of race 3.


2019 ◽  
Vol 11 (7) ◽  
pp. 1959-1964 ◽  
Author(s):  
Jessica M Nelson ◽  
Duncan A Hauser ◽  
José A Gudiño ◽  
Yessenia A Guadalupe ◽  
John C Meeks ◽  
...  

Abstract Plant endosymbiosis with nitrogen-fixing cyanobacteria has independently evolved in diverse plant lineages, offering a unique window to study the evolution and genetics of plant–microbe interaction. However, very few complete genomes exist for plant cyanobionts, and therefore little is known about their genomic and functional diversity. Here, we present four complete genomes of cyanobacteria isolated from bryophytes. Nanopore long-read sequencing allowed us to obtain circular contigs for all the main chromosomes and most of the plasmids. We found that despite having a low 16S rRNA sequence divergence, the four isolates exhibit considerable genome reorganizations and variation in gene content. Furthermore, three of the four isolates possess genes encoding vanadium (V)-nitrogenase (vnf), which is uncommon among diazotrophs and has not been previously reported in plant cyanobionts. In two cases, the vnf genes were found on plasmids, implying possible plasmid-mediated horizontal gene transfers. Comparative genomic analysis of vnf-containing cyanobacteria further identified a conserved gene cluster. Many genes in this cluster have not been functionally characterized and would be promising candidates for future studies to elucidate V-nitrogenase function and regulation.


DNA Research ◽  
2019 ◽  
Vol 26 (4) ◽  
pp. 353-363 ◽  
Author(s):  
Xiu Feng ◽  
Yintao Jia ◽  
Ren Zhu ◽  
Kang Chen ◽  
Yifeng Chen

Abstract The lakes on the Qinghai-Tibet Plateau (QTP) are the largest and highest lake group in the world. Gymnocypris selincuoensis is the only cyprinid fish living in Lake Selincuo, the largest lake on the QTP. However, genetic resources for this species are still lacking, limiting molecular and genetic studies. In this study, the transcriptome of G. selincuoensis was generated for the first time using PacBio Iso-Seq and Illumina RNA-seq. A full-length (FL) transcriptome with 75,435 transcripts was obtained by Iso-Seq, with an N50 length of 3,870 bp. Among all transcripts, 75,016 were annotated to public databases, 64,710 contained complete open reading frames and 2,811 were long non-coding RNAs. Based on all-vs.-all BLAST, 2,069 alternative splicing events were detected, and 80% of them were validated by reverse transcription polymerase chain reaction (RT-PCR). A tissue gene expression atlas showed that the number of detected expressed transcripts ranged from 37,397 in brain to 19,914 in muscle, with 10,488 transcripts detected in all seven tissues. Comparative genomic analysis with other cyprinid fishes identified 77 orthologous genes with potential positive selection (Ka/Ks > 0.3). A total of 56,696 perfect simple sequence repeats were identified from FL transcripts. Our results provide valuable genetic resources for further studies on adaptive evolution, gene expression and population genetics in G. selincuoensis and other congeneric fishes.


2021 ◽  
Author(s):  
Thomas Hackl ◽  
Florian Trigodet ◽  
A Murat Eren ◽  
Steven J Biller ◽  
John M Eppley ◽  
...  

Long-read sequencing technologies hold great promise for the genomic analysis of complex samples such as microbial communities. Yet, despite improving accuracy, basic gene prediction on long-read data is still often impaired by frameshifts resulting from small indels. Consensus polishing using either complementary short reads or, to a lesser extent, the long reads themselves can mitigate this effect but requires universally high sequencing depth, which is difficult to achieve in complex samples where the majority of community members are rare. Here we present proovframe, a software tool implementing an alternative approach to overcome frameshift errors in long-read assemblies and raw long reads. We utilize protein-to-nucleotide alignments against reference databases to pinpoint indels in contigs or reads and correct them by deleting or inserting 1-2 bases, thereby conservatively restoring reading-frame fidelity in aligned regions. Using simulated and real-world benchmark data, we show that proovframe performs comparably to short-read-based polishing on assembled data, works well with remote protein homologs, and can even be applied to raw reads directly. Together, our results demonstrate that protein-guided frameshift correction significantly improves the analyzability of long-read data both in combination with and as an alternative to common polishing strategies. Proovframe is available from https://github.com/thackl/proovframe.
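The correction step described above, deleting or inserting 1-2 bases at positions flagged by an alignment, can be sketched as follows. This is an illustrative toy, not proovframe's code or interface: the function name, the `(position, delta)` edit format, and the use of `N` as the inserted placeholder base are all assumptions, and the alignment step that would locate the indels is omitted.

```python
from typing import List, Tuple


def apply_frame_edits(seq: str, edits: List[Tuple[int, int]]) -> str:
    """Apply frame-restoring edits to a nucleotide sequence.

    Each edit is (position, delta): a positive delta inserts that many
    placeholder 'N' bases at `position`; a negative delta deletes that
    many bases starting at `position`. Only 1-2 base edits are allowed.
    """
    # Work from right to left so earlier positions stay valid.
    for pos, delta in sorted(edits, reverse=True):
        if abs(delta) not in (1, 2):
            raise ValueError("only 1-2 base corrections are supported")
        if delta > 0:
            seq = seq[:pos] + "N" * delta + seq[pos:]
        else:
            seq = seq[:pos] + seq[pos - delta:]
    return seq


# Hypothetical 15-base open reading frame and two damaged copies of it.
reference = "ATGAAACCCGGGTAA"
with_extra_base = "ATGAAAACCCGGGTAA"   # one spurious A after position 6
with_missing_base = "ATGAACCCGGGTAA"   # one A lost at position 5

fixed_extra = apply_frame_edits(with_extra_base, [(6, -1)])
fixed_missing = apply_frame_edits(with_missing_base, [(5, 1)])
```

Both corrected sequences are again a multiple of three bases long, so downstream gene prediction would see an intact reading frame even though the inserted base itself is unknown.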


2019 ◽  
Author(s):  
Han Ming Gan ◽  
Christopher M. Austin

Abstract Background Vibrio parahaemolyticus MVP1 was isolated from a Malaysian aquaculture farm affected with shrimp acute hepatopancreatic necrosis disease (AHPND). Its genome was previously sequenced on the Illumina MiSeq platform and assembled de novo, producing a relatively fragmented assembly. Although the binary toxin genes linked to AHPND were identified in the MVP1 draft genome, they were localized on a very small contig, precluding proper analysis of the gene neighbourhood. Methods The genome of Vibrio parahaemolyticus MVP1 was sequenced on the Nanopore MinION device to obtain long reads that can span longer repeats and improve genome contiguity. De novo genome assembly was subsequently performed using a long-read-only assembler (Flye) followed by genome polishing, as well as a hybrid assembler (Unicycler). Results Long-read-only assembly produced three complete circular MVP1 contigs consisting of chromosome 1, chromosome 2 and the pVa plasmid that carries the pirABvp binary toxin genes. Polishing of the long-read assembly with Illumina short reads was necessary to remove indel errors. The complete assembly of the pVa plasmid could not be achieved using Illumina reads alone because identical repetitive elements flanking the binary toxin genes led to multiple contigs, whereas these regions were fully spanned by the Nanopore long reads, resulting in a single contig. In addition, alignment of Illumina reads to the complete genome assembly indicated a sequencing bias, as read depth was lowest in low-GC genomic regions. Comparative genomic analysis revealed the presence of a gene cluster coding for additional insecticidal toxins in chromosome 2 of MVP1 that may further contribute to host pathogenesis, pending functional validation. Scanning of all publicly available V. parahaemolyticus genomes revealed the presence of a single AinS-family quorum-sensing system in this species that can be targeted for future microbial management. Conclusions We generated the first chromosome-scale genome assembly of a Malaysian pirABvp-bearing V. parahaemolyticus isolate. Structural variations identified from comparative genomic analysis provide new insights into the genomic features of V. parahaemolyticus MVP1 that may be associated with host colonization and pathogenicity.


2019 ◽  
Author(s):  
S Arredondo-Alonso ◽  
J Top ◽  
AC Schürch ◽  
A McNally ◽  
S Puranen ◽  
...  

Abstract Enterococcus faecium is a gut commensal of many mammals but is also recognized as a major nosocomial human pathogen, as it is listed on the WHO global priority list of multi-drug resistant organisms. Previous research has suggested that nosocomial strains have multiple zoonotic origins and are only distantly related to those involved in human commensal colonization. Here we present the first comprehensive population-wide joint genomic analysis of hospital, commensal and animal isolates using both short- and long-read sequencing techniques. This enabled us to investigate the population plasmidome, core genome variation and genome architecture in detail, using a combination of machine learning, population genomics and genome-wide co-evolution analysis. We observed a high level of genome plasticity with large-scale inversions and heterogeneous chromosome sizes, collectively painting a high-resolution picture of the adaptive landscape of E. faecium, and identified plasmids as the main indicator for host-specificity. Given the increasing availability of long-read sequencing technologies, our approach could be widely applied to other human and animal pathogen populations to unravel fine-scale mechanisms of their evolution.

