SCGid: a consensus approach to contig filtering and genome prediction from single-cell sequencing libraries of uncultured eukaryotes

2019 ◽  
Vol 36 (7) ◽  
pp. 1994-2000
Author(s):  
Kevin R Amses ◽  
William J Davis ◽  
Timothy Y James

Abstract Motivation Whole-genome sequencing of uncultured eukaryotic genomes is complicated by difficulties in acquiring sufficient amounts of tissue. Single-cell genomics (SCG) by multiple displacement amplification provides a technical workaround, yielding whole-genome libraries which can be assembled de novo. Downsides of multiple displacement amplification include coverage biases and exacerbation of contamination. These factors affect assembly continuity and fidelity, complicating discrimination of genomes from contamination and noise by available tools. Uncultured eukaryotes and their relatives are often underrepresented in large sequence data repositories, further impairing identification and separation. Results We compare the ability of filtering approaches to remove contamination and resolve eukaryotic draft genomes from SCG metagenomes, finding significant variation in outcomes. To address these inconsistencies, we introduce a consensus approach that is codified in the SCGid software package. SCGid filters assemblies in parallel using different approaches, yielding three intermediate drafts from which a consensus is drawn. Using genuine and mock SCG metagenomes, we show that our approach corrects for variation among draft genomes predicted by individual approaches and outperforms them in recapitulating published drafts in a fast and repeatable way, providing a useful alternative to available methods and manual curation. Availability and implementation The SCGid package is implemented in Python and R. Source code is available at http://www.github.com/amsesk/SCGid under the GNU GPL 3.0 license. Supplementary information Supplementary data are available at Bioinformatics online.
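The core idea of the consensus step described above can be sketched as a majority vote over contig sets. This is an illustrative sketch only, not SCGid's actual API; the filter names and contig IDs are hypothetical.

```python
# Illustrative sketch (not SCGid's code): draw a consensus draft genome
# from three independently filtered intermediate drafts by majority vote
# on contig membership.
from collections import Counter

def consensus_draft(drafts, min_votes=2):
    """drafts: list of sets of contig IDs retained by each filtering
    approach. A contig enters the consensus if at least min_votes
    approaches kept it."""
    votes = Counter(c for draft in drafts for c in draft)
    return {c for c, n in votes.items() if n >= min_votes}

# Three hypothetical intermediate drafts from parallel filters
gc_cov = {"ctg1", "ctg2", "ctg3"}    # GC/coverage-based filter
taxonomy = {"ctg1", "ctg3", "ctg4"}  # taxonomy-based filter
codons = {"ctg1", "ctg2", "ctg4"}    # codon-usage-based filter

print(sorted(consensus_draft([gc_cov, taxonomy, codons])))
# ['ctg1', 'ctg2', 'ctg3', 'ctg4']
```

Raising `min_votes` to 3 would keep only contigs all three approaches agree on, trading completeness for purity.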

2016 ◽  
Author(s):  
Peter A. Andrews ◽  
Ivan Iossifov ◽  
Jude Kendall ◽  
Steven Marks ◽  
Lakshmi Muthuswamy ◽  
...  

Abstract Motivation Standard genome sequence alignment tools, primarily designed to find one alignment per read, have difficulty detecting inversion, translocation and large insertion and deletion (indel) events. Moreover, dedicated split-read alignment methods that depend only upon the reference genome may misidentify or find too many potential split-read alignments because of reference genome anomalies. Methods We introduce MUMdex, a Maximal Unique Match (MUM)-based genomic analysis software package consisting of a sequence aligner to the reference genome, a storage-indexing format and analysis software. Discordant reference alignments of MUMs are especially suitable for identifying inversion, translocation and large indel differences in unique regions. Extracted population databases are used as filters for flaws in the reference genome. We describe the concepts underlying MUM-based analysis, the software implementation and its usage. Results We demonstrate via simulation that the MUMdex aligner and alignment format are able to correctly detect and record genomic events. We characterize alignment performance and output file sizes for human whole-genome data and compare to Bowtie 2 and the BAM format. Preliminary results demonstrate the practicality of the analysis approach by detecting de novo mutation candidates in human whole-genome DNA sequence data from 510 families. We provide a population database of events from these families for use by others. Availability http://mumdex.com/ Contact [email protected] (or [email protected]) Supplementary information Supplementary data are available online.
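A Maximal Unique Match, the building block of the analysis above, is a substring that occurs exactly once in each sequence and cannot be extended on either side while remaining a shared match. This brute-force toy (nothing like MUMdex's suffix-array-based aligner, and only practical for short strings) makes the definition concrete:

```python
# Toy MUM finder (illustrative only; MUMdex uses an indexed aligner).
# A MUM occurs exactly once in each sequence and is not extendable
# left or right while still matching in both.

def mums(a, b, min_len=3):
    out = []
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):
            s = a[i:j]
            if a.count(s) == 1 and b.count(s) == 1:
                # If the left- or right-extended string still occurs in b,
                # s is part of a longer match and thus not maximal.
                left_ext = i > 0 and a[i - 1:j] in b
                right_ext = j < len(a) and a[i:j + 1] in b
                if not (left_ext or right_ext):
                    out.append((s, i, b.find(s)))
    return out

print(mums("GATTACAGG", "TTGATTACAA"))
# [('GATTACA', 0, 2)]
```

Discordance between the positions of neighboring MUMs is what signals an inversion, translocation or large indel.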


Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Abstract Motivation Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges. Results We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale. Availability and Implementation TIGER is available at https://github.com/TheKorenLab/TIGER. Supplementary information Supplementary data are available at Bioinformatics online
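The signal described above can be sketched in a few lines: in proliferating cells, early-replicating loci are present in more copies than late-replicating ones, so a normalized, smoothed read-depth profile along a chromosome approximates replication timing. This is a conceptual illustration, not TIGER's implementation, which additionally corrects for other sources of coverage variation.

```python
# Conceptual sketch of the replication-timing signal (not TIGER's code):
# normalized, smoothed read depth per genomic bin, centered at zero so
# positive values suggest earlier replication.

def timing_profile(bin_counts, window=3):
    """bin_counts: read counts per fixed-size genomic bin along one
    chromosome. Returns a mean-centered, moving-average profile."""
    mean = sum(bin_counts) / len(bin_counts)
    norm = [c / mean for c in bin_counts]        # copy-number-like ratio
    half = window // 2
    smooth = []
    for i in range(len(norm)):
        lo, hi = max(0, i - half), min(len(norm), i + half + 1)
        smooth.append(sum(norm[lo:hi]) / (hi - lo))
    return [s - 1.0 for s in smooth]             # center at zero

profile = timing_profile([120, 110, 100, 90, 80])
# profile slopes from positive (early) to negative (late)
```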


2020 ◽  
Vol 36 (10) ◽  
pp. 3242-3243 ◽  
Author(s):  
Samuel O’Donnell ◽  
Gilles Fischer

Abstract Summary MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de novo assemblies generated by third-generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details. Availability and implementation https://github.com/SAMtoBAM/MUMandCo. Supplementary information Supplementary data are available at Bioinformatics online.
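The coordinate logic behind WGA-based SV calling can be illustrated with a minimal classifier: compare the unaligned gap between consecutive alignment blocks on the reference against the corresponding gap on the query assembly. This is a hypothetical sketch of the general principle, not MUM&Co's actual script, which handles many more cases (tandem duplications, translocations, overlapping blocks).

```python
# Hypothetical sketch of gap-based SV classification from whole-genome
# alignment coordinates (not MUM&Co's implementation).

def classify_gap(ref_gap, qry_gap, strand_flip, min_sv=50):
    """ref_gap/qry_gap: unaligned bases between adjacent alignment
    blocks on the reference and on the query assembly.
    strand_flip: the second block aligns on the opposite strand."""
    if strand_flip:
        return "inversion"
    if qry_gap - ref_gap >= min_sv:
        return "insertion"   # query carries extra sequence
    if ref_gap - qry_gap >= min_sv:
        return "deletion"    # query lacks reference sequence
    return "no_sv"

print(classify_gap(0, 120, False))   # insertion
print(classify_gap(200, 10, False))  # deletion
print(classify_gap(0, 0, True))      # inversion
```

The 50 bp floor mirrors the tool's stated minimum SV size.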


2021 ◽  
Author(s):  
Víctor García-Olivares ◽  
Adrián Muñoz-Barrera ◽  
José Miguel Lorenzo-Salazar ◽  
Carlos Zaragoza-Trello ◽  
Luis A. Rubio-Rodríguez ◽  
...  

Abstract The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, given its relevance, we also assess the best-performing tool on third-generation, long noisy-read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (either based on a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct the mtDNA analyses from HTS data.
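At its simplest, haplogroup classification of the kind benchmarked above amounts to scoring a sample's observed mtDNA variants against each haplogroup's defining-variant set. The sketch below is purely illustrative: the defining sets are hypothetical abbreviations, and real classifiers such as Haplocheck work on the full Phylotree hierarchy with weighted variants.

```python
# Simplified, rule-based haplogroup scoring (illustrative only; the
# defining-variant sets below are hypothetical placeholders).

DEFINING = {
    "H":  {"263G", "750G", "1438G"},
    "U5": {"263G", "3197C", "9477A", "13617C"},
}

def assign_haplogroup(observed):
    """observed: set of variant labels called in the sample.
    Returns the best-scoring haplogroup and all scores (fraction of
    each haplogroup's defining variants that were observed)."""
    scores = {hg: len(v & observed) / len(v) for hg, v in DEFINING.items()}
    return max(scores, key=scores.get), scores

hg, scores = assign_haplogroup({"263G", "3197C", "9477A", "13617C"})
print(hg)  # U5
```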


2019 ◽  
Author(s):  
Mostafa Karimi ◽  
Shaowen Zhu ◽  
Yue Cao ◽  
Yang Shen

Abstract Motivation Facing data quickly accumulating on protein sequence and structure, this study addresses the following question: to what extent can current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds? Results We have developed novel deep generative models, constructed a low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate higher yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with a structure predictor over representative novel folds (including one not even part of the basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space. Availability Data and source code will be available upon request. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5089 ◽  
Author(s):  
Bruno A. S. de Medeiros ◽  
Brian D. Farrell

Whole-genome amplification by multiple displacement amplification (MDA) is a promising technique to enable the use of samples with only a limited amount of DNA for the construction of RAD-seq libraries. Previous work has shown that, when the amount of DNA used in the MDA reaction is large, double-digest RAD-seq (ddRAD) libraries prepared with amplified genomic DNA result in data that are indistinguishable from libraries prepared directly from genomic DNA. Based on this observation, here we evaluate the quality of ddRAD libraries prepared from MDA-amplified genomic DNA when the amount of input genomic DNA and the coverage obtained for samples are variable. By simultaneously preparing libraries for five species of weevils (Coleoptera, Curculionidae), we also evaluate the likelihood that potential contaminants will be encountered in the assembled dataset. Overall, our results indicate that MDA may not be able to rescue all samples with small amounts of DNA, but it does produce ddRAD libraries adequate for studies of phylogeography and population genetics even when conditions are not optimal. We find that MDA makes it harder to predict the number of loci that will be obtained for a given sequencing effort, with some samples behaving like traditional libraries and others yielding fewer loci than expected. This seems to be caused both by stochastic and deterministic effects during amplification. Further, the reduction in loci is stronger in libraries with lower amounts of template DNA for the MDA reaction. Even though a few samples exhibit substantial levels of contamination in raw reads, the effect is very small in the final dataset, suggesting that filters imposed during dataset assembly are important in removing contamination.
Importantly, samples with strong signs of contamination and biases in heterozygosity were also those with fewer loci shared in the final dataset, suggesting that stringent filtering of samples with significant amounts of missing data is important when assembling data derived from MDA-amplified genomic DNA. Overall, we find that the combination of MDA and ddRAD results in high-quality datasets for population genetics as long as the sequence data is properly filtered during assembly.
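The stringent sample filtering recommended above can be sketched as a simple missing-data threshold over the assembled genotype matrix. This is an illustrative sketch, not the authors' pipeline; the threshold and data layout are assumptions.

```python
# Illustrative sample filter (not the authors' pipeline): drop samples
# whose fraction of missing loci exceeds a threshold, since samples
# with heavy missing data also showed contamination and heterozygosity
# biases in the study above.

def filter_samples(matrix, max_missing=0.5):
    """matrix: dict mapping sample -> list of genotypes, with None
    marking a missing locus. Returns only samples passing the cutoff."""
    kept = {}
    for sample, loci in matrix.items():
        missing = sum(g is None for g in loci) / len(loci)
        if missing <= max_missing:
            kept[sample] = loci
    return kept

data = {"sampleA": [1, 1, None, 1],       # 25% missing -> kept
        "sampleB": [None, None, None, 1]}  # 75% missing -> dropped
print(sorted(filter_samples(data)))
# ['sampleA']
```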


2017 ◽  
Author(s):  
Catherine Anscombe ◽  
Raju.V Misra ◽  
Saheer Gharbia

Abstract Whilst next-generation sequencing is frequently used to whole-genome sequence bacteria from cultures, it is rarely applied directly to clinical samples. Therefore, this study addresses the issue of applying NGS microbial diagnostics directly to blood samples. To demonstrate the potential of direct-from-blood sequencing, a bacteria-spiked blood model was developed. Horse blood was spiked with clinical samples of E. coli and S. aureus, and a process developed to isolate bacterial cells whilst removing the majority of host DNA. One sample of each isolate was then amplified using ϕ29 multiple displacement amplification (MDA) and sequenced. The total processing time, from sample to amplified DNA ready for sequencing, was 3.5 hours, significantly faster than the 18-hour overnight culture step which is typically required. Both bacteria showed 100% survival through the processing. The direct-from-sample sequencing resulted in greater than 92% genome coverage of the pathogens whilst limiting the sequencing of host genome (less than 7% of all reads). Analysis of de novo assembled reads allowed accurate genotypic antibiotic resistance prediction. The sample processing is easily applicable to multiple sequencing platforms. Overall, this model demonstrates the potential to rapidly generate whole-genome bacterial data directly from blood.


2019 ◽  
Vol 35 (21) ◽  
pp. 4207-4212 ◽  
Author(s):  
Narciso M Quijada ◽  
David Rodríguez-Lázaro ◽  
Jose María Eiros ◽  
Marta Hernández

Abstract Motivation The progress of High Throughput Sequencing (HTS) technologies and the reduction in the sequencing costs are such that Whole Genome Sequencing (WGS) could replace many traditional laboratory assays and procedures. Exploiting the volume of data produced by HTS platforms requires substantial computing skills and this is the main bottleneck in the implementation of WGS as a routine laboratory technique. The way in which the vast amount of results are presented to researchers and clinicians with no specialist knowledge of genome sequencing is also a significant issue. Results Here we present TORMES, a user-friendly pipeline for WGS analysis of bacteria from any origin generated by HTS on Illumina platforms. TORMES is designed for non-bioinformatician users, and automates the steps required for WGS analysis directly from the raw sequence data: sequence quality filtering, de novo assembly, draft genome ordering against a reference, genome annotation, multi-locus sequence typing (MLST), searching for antibiotic resistance and virulence genes, and pangenome comparisons. Once the analysis is finished, TORMES generates an interactive web-like report that can be opened in any web browser and shared and revised by researchers in a simple manner. TORMES can be run by using very simple commands and represents a quick and easy way to perform WGS analysis. Availability and implementation TORMES is freely available at https://github.com/nmquijada/tormes. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Author(s):  
Dmitry Meleshko ◽  
Patrick Marks ◽  
Stephen Williams ◽  
Iman Hajirasouliha

Abstract Motivation Emerging Linked-Read (aka read-cloud) technologies such as the 10x Genomics Chromium system have great potential for accurate detection and phasing of large-scale human genome structural variations (SVs). By leveraging the long-range information encoded in Linked-Read sequencing, computational techniques are able to detect and characterize complex structural variations that were previously undetectable by short-read methods. However, there is no available Linked-Read method for detection and assembly of novel sequence insertions, DNA sequences present in a given sequenced sample but missing in the reference genome, without requiring whole-genome de novo assembly. In this paper, we propose a novel integrated alignment-based and local-assembly-based algorithm, Novel-X, that effectively uses the barcode information encoded in Linked-Read sequencing datasets to improve detection of such events without the need for whole-genome de novo assembly. We evaluated our method on two haploid human genomes, CHM1 and CHM13, sequenced on the 10x Genomics Chromium system. These genomes have also recently been characterized with high-coverage PacBio long reads. We also tested our method on NA12878, the well-known HapMap CEPH diploid genome, and the child genome in a Yoruba trio (NA19240), which was recently studied on multiple sequencing platforms. Detecting insertion events is very challenging using short reads, and the only viable available solution is long-read sequencing (e.g. PacBio or ONT). Our experiments, however, show that Novel-X finds many insertions that cannot be found by state-of-the-art tools using short-read sequencing data but that are present in PacBio data.
Since Linked-Read sequencing is significantly cheaper than long-read sequencing, our method using Linked-Reads enables routine large-scale screenings of sequenced genomes for novel sequence insertions. Availability Software is freely available at https://github.com/1dayac/novel_insertions Contact [email protected] Supplementary information Supplementary data are available at https://github.com/1dayac/novel_insertions_supplementary
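The barcode signal that Novel-X leverages can be illustrated with a minimal grouping step: reads sharing a 10x barcode come from the same long DNA molecule, so unmapped reads whose barcode also carries mapped reads near a candidate site are candidates for local assembly of an insertion. This is a conceptual sketch of that idea only, not Novel-X's implementation.

```python
# Conceptual sketch of barcode pooling for local insertion assembly
# (illustrative; not Novel-X's code). Reads are modeled as
# (barcode, sequence, mapped) tuples.
from collections import defaultdict

def pool_by_barcode(reads):
    """Return barcode -> unmapped read sequences, keeping only barcodes
    that also contain mapped reads (i.e. anchored near the reference),
    as candidates for local assembly of a novel insertion."""
    mapped_bcs, pools = set(), defaultdict(list)
    for bc, seq, mapped in reads:
        if mapped:
            mapped_bcs.add(bc)
        else:
            pools[bc].append(seq)
    return {bc: seqs for bc, seqs in pools.items() if bc in mapped_bcs}

reads = [("AAA", "ACGT", True),   # mapped read anchors barcode AAA
         ("AAA", "TTTT", False),  # unmapped mate: insertion candidate
         ("CCC", "GGGG", False)]  # no anchor for CCC: discarded
print(pool_by_barcode(reads))
# {'AAA': ['TTTT']}
```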

