Coverage-Versus-Length Plots, a Simple Quality Control Step for de Novo Yeast Genome Sequence Assemblies

2019 ◽  
pp. g3.200745.2018 ◽  
Author(s):  
Alexander P. Douglass ◽  
Caoimhe E. O'Brien ◽  
Benjamin Offei ◽  
Aisling Y. Coughlan ◽  
Raúl A. Ortiz-Merino ◽  
...  
Cell Reports ◽  
2019 ◽  
Vol 26 (3) ◽  
pp. 759-774.e5 ◽  
Author(s):  
Markus Habich ◽  
Silja Lucia Salscheider ◽  
Lena Maria Murschall ◽  
Michaela Nicole Hoehne ◽  
Manuel Fischer ◽  
...  

2018 ◽  
Author(s):  
Alexander P. Douglass ◽  
Caoimhe E. O’Brien ◽  
Benjamin Offei ◽  
Aisling Y. Coughlan ◽  
Raúl A. Ortiz-Merino ◽  
...  

Abstract Illumina sequencing has revolutionized yeast genomics, with prices for commercial draft genome sequencing now below $200. The popular SPAdes assembler makes it simple to generate a de novo genome assembly for any yeast species. However, whereas making genome assemblies has become routine, understanding what they contain is still challenging. Here, we show how graphing the information that SPAdes provides about the length and coverage of each scaffold can be used to investigate the nature of an assembly, and to diagnose possible problems. Scaffolds derived from mitochondrial DNA, ribosomal DNA, and yeast plasmids can be identified by their high coverage. Contaminating data, such as cross-contamination from other samples in a multiplex sequencing run, can be identified by its low coverage. Scaffolds derived from the bacteriophage PhiX174 and Lambda DNAs that are frequently used as molecular standards in Illumina protocols can also be detected. Assemblies of yeast genomes with high heterozygosity, such as interspecies hybrids, often contain two types of scaffold: regions of the genome where the two alleles assembled into two separate scaffolds and each has a coverage level C, and regions where the two alleles co-assembled (collapsed) into a single scaffold that has a coverage level 2C. Visualizing the data with Coverage-versus-Length (CVL) plots, which can be done using Microsoft Excel or Google Sheets, provides a simple method to understand the structure of a genome assembly and detect aberrant scaffolds or contigs. We provide a Python script that allows assemblies to be filtered to remove contaminants identified in CVL plots.

100-word article summary: We describe a simple new method, Coverage-versus-Length plots, for examining de novo genome sequence assemblies. These plots enable researchers to detect scaffolds that have unusually high or unusually low coverage, which allows contaminants, and scaffolds that come from atypical parts of the organism’s DNA complement, to be detected. We show that contaminants are common in yeast genomes sequenced in multiplex Illumina runs. We provide instructions for making plots using Microsoft Excel or Google Sheets, and software for filtering assemblies to remove contaminants. Contaminants can be detected and removed, even without knowing their source.
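The idea of reading length and coverage from scaffold names and flagging outliers can be illustrated with a minimal Python sketch. This is not the authors' script. It assumes scaffold headers follow the usual SPAdes pattern `NODE_<n>_length_<L>_cov_<C>`, and the function names, the example headers, and the cut-off ratios (`low_ratio`, `high_ratio`) are hypothetical choices made only for illustration.

```python
import re

# Assumed SPAdes-style scaffold name, e.g. "NODE_1_length_950000_cov_40.5".
HEADER_RE = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")

def parse_header(header):
    """Return (length, coverage) from a scaffold header, or None if it does not match."""
    match = HEADER_RE.search(header)
    if match is None:
        return None
    return int(match.group(2)), float(match.group(3))

def classify(coverage, nuclear_cov, low_ratio=0.2, high_ratio=3.0):
    """Label a scaffold by coverage relative to the typical nuclear level.

    The cut-off ratios are hypothetical; in practice they would be chosen
    by inspecting the coverage-versus-length plot for the assembly.
    """
    ratio = coverage / nuclear_cov
    if ratio < low_ratio:
        return "low"      # candidate contaminant
    if ratio > high_ratio:
        return "high"     # e.g. mitochondrial DNA, rDNA, plasmid, spike-in
    return "typical"      # includes collapsed 2C scaffolds at the default cut-off

def filter_low_coverage(headers, nuclear_cov):
    """Keep headers that are not classified as low coverage (unparseable ones are kept)."""
    kept = []
    for header in headers:
        parsed = parse_header(header)
        if parsed is None or classify(parsed[1], nuclear_cov) != "low":
            kept.append(header)
    return kept

# Made-up headers for illustration only.
headers = [
    ">NODE_1_length_950000_cov_40.5",
    ">NODE_57_length_5386_cov_2100.0",
    ">NODE_88_length_3200_cov_1.7",
]
kept = filter_low_coverage(headers, nuclear_cov=40.0)
```

The `(length, coverage)` pairs returned by `parse_header` are the two columns one would plot against each other, typically on logarithmic axes, in a spreadsheet or plotting library.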


2016 ◽  
Author(s):  
Meike Becker ◽  
Nils Andersen ◽  
Helmut Erlenkeuser ◽  
Matthew P. Humphreys ◽  
Toste Tanhua ◽  
...  

Abstract. The stable carbon isotope composition of dissolved inorganic carbon (δ13C-DIC) can be used to quantify fluxes within the carbon system. For example, knowing the δ13C-DIC signature of the inorganic carbon pool can help to describe the exchange between ocean and atmosphere as well as the amount of anthropogenic carbon in the water column. The measurements can also be used for evaluating modeled carbon fluxes, for making basin-wide estimates, and for studying seasonal and interannual variability or decadal trends in interior ocean biogeochemistry. For all these purposes, it is not only important to have a sufficient amount of data, but these data must also be internally consistent and of high quality. In this study, we present a δ13C-DIC dataset for the North Atlantic, which has undergone secondary quality control. The data originate from oceanographic research cruises between 1981 and 2012. During a primary quality control step based on simple range tests, obviously bad data were flagged. In a second quality control step, biases between measurements from different cruises were quantified through a crossover analysis using nearby data of the respective cruises, and absolute values of biased cruises were adjusted in the data product. The crossover analysis was possible for 22 of the 29 cruises in our dataset and adjustments were applied to 10 of these. The internal accuracy of this dataset is 0.017 ‰. The dataset is available via CDIAC at http://cdiac.ornl.gov/oceans/ndp_096/NAC13v1.html, doi:10.3334/CDIAC/OTG.NAC13v1.


2020 ◽  
Vol 48 (18) ◽  
pp. e106-e106 ◽  
Author(s):  
Jenna E Gallegos ◽  
Mark F Rogers ◽  
Charlotte A Cialek ◽  
Jean Peccoud

Abstract Plasmids are a foundational tool for basic and applied research across all subfields of biology. Increasingly, researchers in synthetic biology are relying on and developing massive libraries of plasmids as vectors for directed evolution, combinatorial gene circuit tests, and for CRISPR multiplexing. Verification of plasmid sequences following synthesis is a crucial quality control step that creates a bottleneck in plasmid fabrication workflows. Crucially, researchers often elect to forego the cumbersome verification step, potentially leading to reproducibility and—depending on the application—security issues. In order to facilitate plasmid verification to improve the quality and reproducibility of life science research, we developed a fast, simple, and open source pipeline for assembly and verification of plasmid sequences from Illumina reads. We demonstrate that our pipeline, which relies on de novo assembly, can also be used to detect contaminating sequences in plasmid samples. In addition to presenting our pipeline, we discuss the role for verification and quality control in the increasingly complex life science workflows ushered in by synthetic biology.


2020 ◽  
Author(s):  
Jenna E. Gallegos ◽  
Mark F. Rogers ◽  
Charlotte Cialek ◽  
Jean Peccoud

Abstract Plasmids are a foundational tool for basic and applied research across all subfields of biology. Increasingly, researchers in synthetic biology are relying on and developing massive libraries of plasmids as vectors for directed evolution, combinatorial gene circuit tests, and for CRISPR multiplexing. Verification of plasmid sequences following synthesis is a crucial quality control step that creates a bottleneck in plasmid fabrication workflows. Crucially, researchers often elect to forego the cumbersome verification step, potentially leading to reproducibility and, depending on the application, security issues. In order to facilitate plasmid verification to improve the quality and reproducibility of life science research, we developed a fast, simple, and open source pipeline for assembly and verification of plasmid sequences from Illumina reads. We demonstrate that our pipeline, which relies on de novo assembly, can also be used to detect contaminating sequences in plasmid samples. In addition to presenting our pipeline, we discuss the role for verification and quality control in the increasingly complex life science workflows ushered in by synthetic biology.


2016 ◽  
Vol 8 (2) ◽  
pp. 559-570 ◽  
Author(s):  
Meike Becker ◽  
Nils Andersen ◽  
Helmut Erlenkeuser ◽  
Matthew P. Humphreys ◽  
Toste Tanhua ◽  
...  

Abstract. The stable carbon isotope composition of dissolved inorganic carbon (δ13C-DIC) can be used to quantify fluxes within the carbon system. For example, knowing the δ13C signature of the inorganic carbon pool can help in describing the amount of anthropogenic carbon in the water column. The measurements can also be used for evaluating modeled carbon fluxes, for making basin-wide estimates of anthropogenic carbon, and for studying seasonal and interannual variability or decadal trends in interior ocean biogeochemistry. For all these purposes, it is not only important to have a sufficient amount of data, but these data must also be internally consistent and of high quality. In this study, we present a δ13C-DIC dataset for the North Atlantic which has undergone secondary quality control. The data originate from oceanographic research cruises between 1981 and 2014. During a primary quality control step based on simple range tests, obviously bad data were flagged. In a second quality control step, biases between measurements from different cruises were quantified through a crossover analysis using nearby data of the respective cruises, and values of biased cruises were adjusted in the data product. The crossover analysis was possible for 24 of the 32 cruises in our dataset, and adjustments were applied to 11 cruises. The internal accuracy of this dataset is 0.017 ‰. The dataset is available via the Carbon Dioxide Information Analysis Center (CDIAC) at http://cdiac.ornl.gov/oceans/ndp_096/NAC13v1.html, doi:10.3334/CDIAC/OTG.NAC13v1.
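The crossover step described above, quantifying the bias between two cruises from nearby data, can be sketched in a much simplified form as follows. This is an illustration under stated assumptions, not the procedure used for the data product: the sample layout `(lat, lon, value)`, the function names, the 200 km search radius, and the example values are all hypothetical, and a real analysis would also restrict the comparison by depth and weight the offsets.

```python
import math
from statistics import mean

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def crossover_offset(cruise_a, cruise_b, max_km=200.0):
    """Mean difference (A minus B) over all sample pairs closer than max_km.

    Each cruise is a list of (lat, lon, value) tuples. Returns None when
    the two cruises have no nearby samples, i.e. no crossover is possible.
    """
    diffs = [
        va - vb
        for (lat_a, lon_a, va) in cruise_a
        for (lat_b, lon_b, vb) in cruise_b
        if haversine_km(lat_a, lon_a, lat_b, lon_b) <= max_km
    ]
    return mean(diffs) if diffs else None

# Made-up values for illustration only.
cruise_a = [(50.0, -30.0, 1.00), (60.0, -20.0, 0.90)]
cruise_b = [(50.0, -30.0, 0.95), (10.0, -40.0, 0.50)]
offset = crossover_offset(cruise_a, cruise_b)
```

An offset that stays consistently away from zero across several crossovers would suggest a cruise-level bias, which is the situation in which an adjustment might be applied.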


BioTechniques ◽  
2001 ◽  
Vol 31 (1) ◽  
pp. 62-65 ◽  
Author(s):  
E. Taylor ◽  
D. Cogdell ◽  
K. Coombes ◽  
L. Hu ◽  
L. Ramdas ◽  
...  

Genes ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 246
Author(s):  
Xiaomeng Chen ◽  
Rui Li ◽  
Yonglin Wang ◽  
Aining Li

An emerging poplar canker caused by the gram-negative bacterium, Lonsdalea populi, has led to high mortality of hybrid poplars Populus × euramericana in China and Europe. The molecular bases of pathogenicity and bark adaptation of L. populi have become a focus of recent research. This study revealed the whole genome sequence and identified putative virulence factors of L. populi. A high-quality L. populi genome sequence was assembled de novo, with a genome size of 3,859,707 bp, containing approximately 3434 genes and 107 RNAs (75 tRNA, 22 rRNA, and 10 ncRNA). The L. populi genome contained 380 virulence-associated genes, mainly encoding for adhesion, extracellular enzymes, secretory systems, and two-component transduction systems. The genome had 110 carbohydrate-active enzyme (CAZy)-coding genes and putative secreted proteins. The antibiotic-resistance database annotation listed that L. populi was resistant to penicillin, fluoroquinolone, and kasugamycin. Analysis of comparative genomics found that L. populi exhibited the highest homology with the L. britannica genome and L. populi encompassed 1905 specific genes, 1769 dispensable genes, and 1381 conserved genes, suggesting high evolutionary diversity and genomic plasticity. Moreover, the pan genome analysis revealed that the N-5-1 genome is an open genome. These findings provide important resources for understanding the molecular basis of the pathogenicity and biology of L. populi and the poplar-bacterium interaction.


Author(s):  
Corrinne E Grover ◽  
Daojun Yuan ◽  
Mark A Arick ◽  
Emma R Miller ◽  
Guanjing Hu ◽  
...  

Abstract Cotton is an important textile crop whose gains in production over the last century have been challenged by various diseases. Because many modern cultivars are susceptible to several pests and pathogens, breeding efforts have included attempts to introgress wild, naturally resistant germplasm into elite lines. Gossypium stocksii is a wild cotton species native to Africa, which is part of a clade of vastly understudied species. Most of what is known about this species comes from pest resistance surveys and/or breeding efforts, which suggests that G. stocksii could be a valuable reservoir of natural pest resistance. Here we present a high-quality de novo genome sequence for G. stocksii. We compare the G. stocksii genome with resequencing data from a closely related, understudied species (G. somalense) to generate insight into the relatedness of these cotton species. Finally, we discuss the utility of the G. stocksii genome for understanding pest resistance in cotton, particularly resistance to cotton leaf curl virus.

