Assessing species coverage and assembly quality of rapidly accumulating sequenced genomes

2021
Author(s): Romain Feron, Robert Michael Waterhouse

Ambitious initiatives to coordinate genome sequencing of Earth's biodiversity mean that the accumulation of genomic data is growing rapidly. In addition to cataloguing biodiversity, these data provide the basis for understanding biological function and evolution. Accurate and complete genome assemblies offer a comprehensive and reliable foundation upon which to advance our understanding of organismal biology at genetic, species, and ecosystem levels. However, ever-changing sequencing technologies and analysis methods mean that available data are often heterogeneous in quality. In order to guide forthcoming genome generation efforts and promote efficient prioritisation of resources, it is thus essential to define and monitor taxonomic coverage and quality of the data. Here we present an automated analysis workflow that surveys genome assemblies from the United States National Center for Biotechnology Information (NCBI), assesses their completeness using the relevant Benchmarking Universal Single-Copy Orthologue (BUSCO) datasets, and collates the results into an interactively browsable resource. We apply our workflow to produce a community resource of available assemblies from the phylum Arthropoda, the Arthropoda Assembly Assessment Catalogue. Using this resource, we survey current taxonomic coverage and assembly quality at the NCBI, we examine how key assembly metrics relate to gene content completeness, and we compare results from using different BUSCO lineage datasets. These results demonstrate how the workflow can be used to build a community resource that enables large-scale assessments to survey species coverage and data quality of available genome assemblies, and to guide prioritisations for ongoing and future sampling, sequencing, and genome generation initiatives.
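
To make the assessment step concrete, here is a minimal sketch of how a single assembly downloaded from the NCBI could be scored with the BUSCO command-line tool and its short summary parsed. The file names, output directory, and lineage dataset (arthropoda_odb10) are placeholders, and this is not the catalogue's actual workflow code.

```python
import re
import subprocess
from pathlib import Path

def run_busco(assembly_fasta: str, lineage: str, out_name: str, threads: int = 4) -> dict:
    """Run BUSCO in genome mode and return the completeness summary.

    Illustrative only: paths and the lineage name (e.g. 'arthropoda_odb10')
    are placeholders; BUSCO must be installed and on PATH.
    """
    subprocess.run(
        ["busco", "-i", assembly_fasta, "-l", lineage,
         "-m", "genome", "-o", out_name, "--cpu", str(threads)],
        check=True,
    )
    # BUSCO writes a short summary file inside the output directory; locate it
    # and pull out the percentages of complete (C), single-copy (S),
    # duplicated (D), fragmented (F) and missing (M) BUSCOs.
    summary = next(Path(out_name).glob("short_summary*.txt"))
    match = re.search(
        r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%",
        summary.read_text())
    keys = ("complete", "single_copy", "duplicated", "fragmented", "missing")
    return dict(zip(keys, map(float, match.groups())))

# Example: assess one arthropod assembly downloaded from NCBI (hypothetical path).
# scores = run_busco("GCA_000001.1_genomic.fna", "arthropoda_odb10", "busco_GCA_000001")
# print(scores)
```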

2008
Vol 136 (9), pp. 3465-3476
Author(s): Paul J. Roebber, Kyle L. Swanson, Jugal K. Ghorai

Abstract This research examines whether an adequate representation of flow features on the synoptic scale allows for the skillful inference of mesoscale precipitating systems. The focus is on the specific problem of landfalling systems on the west coast of the United States for a variety of synoptic types that lead to significant rainfall. The methodology emphasizes rigorous hypothesis testing within a controlled hindcast setting to quantify the significance of the results, and the role of lateral boundary conditions is explicitly accounted for. The hypotheses that (a) uncertainty in the large-scale analysis and (b) upstream buffer size have no impact on the skill of precipitation simulations are each rejected at a high level of confidence: mean precipitation skill is higher where analysis uncertainty is low and for small nested grids. This indicates that an important connection exists between the quality of the synoptic information and predictability at the mesoscale in this environment, despite the absence of explicit mesoscale detail in the initialization or boundary conditions. Further, the flow-through of synoptic information strongly constrains the evolution of the mesoscale, such that a small upstream buffer produces superior results consistent with the higher quality of the information crossing the boundary. Some preliminary evidence that synoptic type influences precipitation skill is also found. The implications of these results for data assimilation, forecasting, and climate modeling are discussed.
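
As a generic illustration of the kind of significance testing described (not the authors' exact procedure), the sketch below runs a two-sample permutation test on the difference in mean precipitation skill between two hindcast configurations; the skill arrays are assumed inputs.

```python
import numpy as np

def permutation_test_mean_diff(skill_a, skill_b, n_perm=10_000, seed=None):
    """Two-sample permutation test on the difference of mean skill scores.

    A generic illustration of hypothesis testing on hindcast skill (e.g. threat
    scores for low- vs high-uncertainty analyses); not the study's exact method.
    """
    rng = np.random.default_rng(seed)
    skill_a, skill_b = np.asarray(skill_a), np.asarray(skill_b)
    observed = skill_a.mean() - skill_b.mean()
    pooled = np.concatenate([skill_a, skill_b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign cases to the two groups at random
        diff = pooled[:len(skill_a)].mean() - pooled[len(skill_a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return observed, count / n_perm  # observed difference and two-sided p-value

# diff, p = permutation_test_mean_diff(low_uncertainty_skill, high_uncertainty_skill)
```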


2021
Vol 12
Author(s): Tyler D. Bechtel, John G. Gibbons

Listeria monocytogenes is the major causative agent of the foodborne illness listeriosis. Listeriosis presents as flu-like symptoms in healthy individuals but can be fatal for children, the elderly, pregnant women, and immunocompromised individuals. Estimates suggest that L. monocytogenes causes ∼1,600 illnesses and ∼260 deaths annually in the United States. L. monocytogenes can survive and persist in a variety of harsh environments, including conditions encountered in the production of fermented dairy products such as cheese. For instance, microbial growth is often limited in soft cheese fermentation because of harsh pH, water content, and salt concentrations. Nevertheless, L. monocytogenes has caused a number of deadly listeriosis outbreaks through the contamination of cheese. The purpose of this study was to understand whether genetically distinct populations of L. monocytogenes are associated with particular foods, including cheese and dairy. To address this goal, we analyzed the population genetic structure of 504 L. monocytogenes strains isolated from food with publicly available genome assemblies. We identified 10 genetically distinct populations spanning L. monocytogenes lineages I, II, and III and serotypes 1/2a, 1/2b, 1/2c, 4b, and 4c. We observed an overrepresentation of isolates from specific populations with cheese (population 2), fruit/vegetable (population 2), seafood (populations 5, 8, and 9), and meat (population 10). We used the Large-Scale BLAST Score Ratio pipeline and Roary to identify genes unique to population 1 and population 2 in comparison with all other populations, and screened for the presence of antimicrobial resistance genes and virulence genes across all isolates. We identified >40 genes that were present at high frequency in populations 1 and 2 and absent in most other isolates. Many of these genes encode transcription factors and cell surface-anchored proteins. Additionally, we found that the virulence genes aut and ami were entirely or partially deleted in population 2. These results indicate that some L. monocytogenes populations may exhibit associations with particular foods, including cheese, and that gene content may contribute to this pattern.
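
As a rough illustration of the gene-content comparison (not the exact pipeline used in the study), the sketch below derives per-population gene frequencies from a Roary gene_presence_absence.csv and flags genes common in population 2 but rare elsewhere; the isolate-to-population table and its column names are assumptions.

```python
import pandas as pd

# Roary's gene_presence_absence.csv has gene metadata columns followed by one
# column per isolate, non-empty where the gene is present in that isolate.
roary = pd.read_csv("gene_presence_absence.csv", low_memory=False)
populations = pd.read_csv("isolate_populations.csv")  # assumed columns: isolate, population

known = set(populations["isolate"])
isolate_cols = [c for c in roary.columns if c in known]
presence = roary.set_index("Gene")[isolate_cols].notna()  # genes x isolates, boolean

# Fraction of isolates in each population carrying each gene.
pop_of = populations.set_index("isolate")["population"]
freq = presence.T.groupby(pop_of).mean().T  # rows: genes, columns: populations

# Genes at high frequency (>=90%) in population 2 but rare (<=10%) in every other population.
others = [p for p in freq.columns if p != 2]
candidates = freq[(freq[2] >= 0.9) & (freq[others].max(axis=1) <= 0.1)]
print(candidates.head())
```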


2015
Vol 9S4, pp. BBI.S29333
Author(s): Stefan E. Seemann, Christian Anthon, Oana Palasca, Jan Gorodkin

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, producing these is still a major challenge, and many domesticated animal genomes still need to be sequenced more deeply in order to produce high-quality assemblies. In the meantime, ironically, the amount of RNA-seq and other next-generation data being produced frequently far exceeds that of the genomic sequence, and basic comparative analysis is often hampered by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNA-seq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.
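
A minimal sketch of the central measurement, assuming samtools is installed and RNA-seq reads have already been aligned to each assembly: compute the fraction of mapped reads per species and correlate it with an assembly metric such as scaffold N50. The BAM paths and N50 values are placeholders.

```python
import re
import subprocess
from scipy.stats import spearmanr

def mapped_fraction(bam_path: str) -> float:
    """Fraction of RNA-seq reads mapped, parsed from `samtools flagstat` output."""
    out = subprocess.run(["samtools", "flagstat", bam_path],
                         capture_output=True, text=True, check=True).stdout
    pct = re.search(r"mapped \(([\d.]+)%", out).group(1)
    return float(pct) / 100.0

# Hypothetical inputs: one RNA-seq alignment per assembly plus its scaffold N50.
species_bams = {"cow": "cow_rnaseq.bam", "pig": "pig_rnaseq.bam", "goat": "goat_rnaseq.bam"}
scaffold_n50 = {"cow": 6.4e6, "pig": 5.8e5, "goat": 8.7e7}  # illustrative values

fractions = {sp: mapped_fraction(bam) for sp, bam in species_bams.items()}
rho, p = spearmanr([fractions[s] for s in species_bams],
                   [scaffold_n50[s] for s in species_bams])
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```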


2016
Author(s): Karyn Meltz Steinberg, Tina Graves Lindsay, Valerie A. Schneider, Mark J.P. Chaisson, Chad Tomlinson, ...

Abstract De novo assembly of human genomes is now a tractable effort due in part to advances in sequencing and mapping technologies. We use PacBio single-molecule, real-time (SMRT) sequencing and BioNano genomic maps to construct the first de novo assembly of NA19240, a Yoruban individual from Africa. This chromosome-scaffolded assembly of 3.08 Gb with a contig N50 of 7.25 Mb and a scaffold N50 of 78.6 Mb represents one of the most contiguous high-quality human genomes. We utilize a BAC library derived from NA19240 DNA and novel haplotype-resolving sequencing technologies and algorithms to characterize regions of complex genomic architecture that are normally lost due to compression to a linear haploid assembly. Our results demonstrate that multiple technologies are still necessary for complete genomic representation, particularly in regions of highly identical segmental duplications. Additionally, we show that diploid assembly has utility in improving the quality of de novo human genome assemblies.
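
For readers unfamiliar with the contiguity metrics quoted above, the short function below computes N50 from a list of contig or scaffold lengths; the toy lengths are illustrative only.

```python
def n50(lengths: list[int]) -> int:
    """N50: the length L such that sequences of length >= L cover
    at least half of the total assembly size."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

# Toy example: five contigs totalling 100 kb; N50 is 30 kb here.
print(n50([40_000, 30_000, 15_000, 10_000, 5_000]))
```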


2019
Author(s): Nico Borgsmüller, Yoann Gloaguen, Tobias Opialla, Eric Blanc, Emilie Sicard, ...

Abstract Lack of reliable peak detection impedes automated analysis of large-scale GC-MS metabolomics datasets. Performance and outcome of individual peak-picking algorithms can differ widely depending on the algorithmic approach and parameters as well as the data acquisition method, so comparing and contrasting algorithms is difficult. Here we present a workflow for improved peak picking (WiPP), a parameter-optimising, multi-algorithm peak detection workflow for GC-MS metabolomics. WiPP evaluates the quality of detected peaks using a machine learning-based classification scheme with seven peak classes. The quality information returned by the classifier for each individual peak is merged with the results from the different peak detection algorithms to create one final high-quality peak set for immediate downstream analysis; medium- and low-quality peaks are kept for further inspection. By applying WiPP to standard compound mixes and a complex biological dataset, we demonstrate that peak detection is improved through this novel way of assigning peak quality, automated parameter optimisation, and integration of results across the different embedded peak-picking algorithms. Furthermore, our approach can provide an impartial performance comparison of different peak-picking algorithms. WiPP is freely available on GitHub (https://github.com/bihealth/WiPP) under the MIT licence.
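
The sketch below illustrates the general idea of classifier-guided merging of peak sets, not WiPP's actual implementation: peaks reported by several picking algorithms are scored by a trained classifier, only high-quality classes are retained, and near-duplicates are collapsed by retention time. The class labels, retention-time tolerance, and algorithm names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def merge_peaks(peak_sets, clf, high_quality=("sharp", "acceptable"), rt_tol=2.0):
    """Keep classifier-approved peaks from several algorithms and merge near-duplicates.

    peak_sets: {algorithm_name: [(retention_time, feature_vector), ...]}
    clf: a fitted classifier whose labels include the `high_quality` classes.
    """
    kept = []
    for algo, peaks in peak_sets.items():
        features = np.array([f for _, f in peaks])
        for (rt, _), label in zip(peaks, clf.predict(features)):
            if label in high_quality:
                kept.append((rt, algo, label))
    kept.sort()  # sort by retention time, then collapse peaks closer than rt_tol seconds
    merged = []
    for rt, algo, label in kept:
        if merged and rt - merged[-1][0] < rt_tol:
            continue
        merged.append((rt, algo, label))
    return merged

# clf = SVC().fit(training_features, training_labels)   # labelled peak classes (assumed available)
# final_peaks = merge_peaks({"algoA": peaks_a, "algoB": peaks_b}, clf)
```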


2020
Author(s): Jean-Marc Aury, Benjamin Istace

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (on the order of kilobases to megabases) and, using efficient algorithms, of providing genome assemblies of high contiguity and completeness in repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the reading frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture of the two haplotypes and can contain premature stop codons. Several methods have been developed to polish genome assemblies using short reads; generally, they inspect nucleotides one by one and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and typically switch from one haplotype to another. Herein we propose Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from short reads to polish genome assemblies, in particular assemblies of diploid and heterozygous genomes.
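
As a toy illustration of why phasing information helps (this is not Hapo-G's algorithm), the sketch below builds a majority-vote consensus from short reads carrying a given haplotype tag (HP, as set by a phasing tool such as WhatsHap), so that corrections at a site never mix the two haplotypes; the BAM path and contig name are placeholders.

```python
from collections import Counter
import pysam

def haplotype_consensus(bam_path: str, contig: str, haplotype: int = 1) -> dict:
    """Toy illustration: majority-vote base calls using only short reads tagged
    as belonging to one haplotype, so corrections never mix haplotypes.
    Returns {reference_position: consensus_base}."""
    calls = {}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for column in bam.pileup(contig):
            counts = Counter()
            for read in column.pileups:
                aln = read.alignment
                if read.is_del or read.is_refskip or not aln.has_tag("HP"):
                    continue
                if aln.get_tag("HP") == haplotype:
                    counts[aln.query_sequence[read.query_position]] += 1
            if counts:
                calls[column.reference_pos] = counts.most_common(1)[0][0]
    return calls

# consensus_h1 = haplotype_consensus("short_reads_vs_assembly.bam", "contig_1", haplotype=1)
```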


2019
Author(s): Bita Khalili, Mattia Tomasoni, Mirjam Mattei, Roger Mallol Parera, Reyhan Sonmez, ...

Abstract Identification of metabolites in large-scale 1H NMR data from human biofluids remains challenging due to the complexity of the spectra and their sensitivity to pH and ionic concentrations. In this work, we test the capacity of three analysis tools to extract metabolite signatures from 968 NMR profiles of human urine samples. Specifically, we studied sets of co-varying features derived from Principal Component Analysis (PCA), the Iterative Signature Algorithm (ISA), and Averaged Correlation Profiles (ACP), a new method we devised inspired by the STOCSY approach. We used our previously developed metabomatching method to match the sets generated by these algorithms to NMR spectra of individual metabolites available in public databases. Based on the number and quality of the matches, we concluded that both ISA and ACP can robustly identify about a dozen metabolites, half of which were shared, while PCA did not produce any signatures with robust matches.
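
A minimal sketch of the PCA-based route only (ISA, ACP, and metabomatching itself are not shown): each principal component's loadings over the chemical-shift bins are treated as a pseudospectrum, and its strongest bins are reported for matching against reference metabolite spectra. The matrix dimensions, ppm range, and random placeholder data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for 968 urine NMR profiles binned into 1600 ppm bins.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(968, 1600))
ppm_bins = np.linspace(0.5, 9.5, 1600)  # assumed chemical-shift axis

# Fit PCA on standardized profiles and read each component's loadings as a pseudospectrum.
pca = PCA(n_components=10).fit(StandardScaler().fit_transform(profiles))
for i, loadings in enumerate(pca.components_):
    top = np.argsort(np.abs(loadings))[-15:]  # 15 strongest bins for this component
    print(f"PC{i + 1} pseudospectrum peaks near ppm:", np.round(ppm_bins[top], 2))
```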


2007
Vol 11 (2_suppl), pp. S14-S22
Author(s): Yves Poulin, Aditya K. Gupta, John D. Amiss

Etanercept is a fully human dimeric fusion protein that reversibly binds tumor necrosis factor α. The first approved indication for etanercept was the treatment of rheumatoid arthritis. It has also been shown to be highly efficacious in numerous large-scale trials for the treatment of plaque psoriasis; this indication has been approved in Canada, the United States, and Europe. The recommended dosing of etanercept for plaque psoriasis is 50 mg twice weekly for 12 weeks, followed by a maintenance dose of 50 mg per week. Etanercept given at 50 mg twice weekly for 12 weeks significantly improved plaque psoriasis, as assessed by the Psoriasis Area and Severity Index (PASI), in which a 75% reduction in PASI scores (PASI 75) has been the gold standard for judging effective therapy. Dosing for 12 weeks produced PASI 75 rates of 47 to 49% in the phase 3 clinical trials. Longer treatment periods at this dosage, from 24 to 48 weeks, have been investigated, with PASI 75 rates increasing to 63%. The importance of quality of life for psoriasis patients has been the focus of recent trials, and etanercept has been shown to significantly improve quality-of-life measures. Interim results from a phase 3b study suggest that etanercept may help reduce the burden of health care resource use by patients with psoriasis. Etanercept has also shown efficacy in nail psoriasis. Case reports indicate that etanercept may be useful in psoriatic erythroderma, pustular psoriasis, guttate psoriasis, and palmopustular psoriasis. Etanercept is an effective biologic agent currently approved for the management of plaque psoriasis and psoriatic arthritis.


1988
Vol 16 (4), pp. 323-335
Author(s): Urs Karrer

A research project was started in 1985 to explore large-scale production systems that have a strong impact on the development of quality courseware. The exploration and evaluation of these production systems contribute to explaining the overall unsatisfactory quality of courseware. This article focuses on the results of a survey conducted in January 1987 addressing more than sixty profit and nonprofit institutions in England, the Federal Republic of Germany, the Netherlands, Switzerland, and the United States. The survey revealed interesting results in various fields. The five working hypotheses (concerning production strategy, production approach, and quality factors for courseware development) were confirmed to a great extent. These results may be instructive for institutions that have recently joined this area and/or are planning to do so.


Nature
2020
Vol 587 (7833), pp. 246-251
Author(s): Joel Armstrong, Glenn Hickey, Mark Diekhans, Ian T. Fiddes, Adam M. Novak, ...

Abstract New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies1–3. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database4 increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies5 are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus6, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.
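
As a hedged sketch of how a small Progressive Cactus run is typically launched (assuming the cactus executable is installed; the genome names, guide tree, and FASTA paths are placeholders): a seqFile lists a Newick guide tree followed by name/path lines, and cactus is invoked with a job store, the seqFile, and an output HAL file.

```python
import subprocess
from pathlib import Path

# Write a minimal seqFile: a Newick guide tree on the first line, then "name path" lines.
seqfile = Path("amniotes.txt")
seqfile.write_text(
    "((human:0.006,chimp:0.006):0.02,mouse:0.08);\n"
    "human /data/human.fa\n"
    "chimp /data/chimp.fa\n"
    "mouse /data/mouse.fa\n"
)

# cactus <jobStore> <seqFile> <outputHal>; the job store holds the workflow's intermediate state.
subprocess.run(["cactus", "./jobstore", str(seqfile), "amniotes.hal"], check=True)
```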

