Improved Compression of DNA Sequencing Data with Cascading Bloom Filters

2018 ◽  
Vol 29 (08) ◽  
pp. 1249-1255
Author(s):  
Kamil Salikhov

Modern DNA sequencing technologies generate prodigious volumes of sequence data consisting of short DNA fragments (reads). Storing and transferring this data is often challenging. With this motivation, several specialized compression methods have been developed. In this paper, we present an improvement of the lossless reference-free compression algorithm, suggested by Rozov et al., based on the technique of cascading Bloom filters. Through computational experiments on real data, we demonstrate that our method results in a significant associated memory reduction in practice.
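The cascading Bloom filter idea mentioned above can be illustrated with a minimal sketch. It assumes the queries come from a known finite candidate set, so that false positives can be enumerated at build time. The names `BloomFilter`, `build_cascade`, and `query`, and the sizing parameters, are illustrative assumptions and are not taken from the paper; the paper's improvements are not reproduced here.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def build_cascade(members, universe, levels=3, bits_per_item=8, num_hashes=3):
    """Build `levels` filters; each stores the false positives of the previous one."""
    filters = []
    to_store = set(members)
    to_check = set(universe) - to_store
    for _ in range(levels):
        bf = BloomFilter(max(8, bits_per_item * len(to_store)), num_hashes)
        for item in to_store:
            bf.add(item)
        filters.append(bf)
        false_positives = {item for item in to_check if item in bf}
        # Roles alternate: the next filter stores this level's false positives.
        to_store, to_check = false_positives, to_store
    return filters, to_store  # the last set is kept explicitly


def query(item, filters, residual):
    """Exact membership test for any item of the universe used at build time."""
    for depth, bf in enumerate(filters):
        if item not in bf:
            # Rejection at an odd-indexed filter means the item is a member.
            return depth % 2 == 1
    return (item in residual) == (len(filters) % 2 == 0)
```

For items of the candidate set, `query` returns an exact answer, because every false positive of one filter is either rejected by a later filter or recorded in the small explicit residual set.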

2019 ◽  
Vol 2019 ◽  
pp. 1-9 ◽  
Author(s):  
Maleeha Najam ◽  
Raihan Ur Rasool ◽  
Hafiz Farooq Ahmad ◽  
Usman Ashraf ◽  
Asad Waqar Malik

Storing and processing large DNA sequences has long been a major problem because of the increasing volume of DNA sequence data. A number of solutions have been proposed, but they require significant computation and memory. Therefore, an efficient storage and pattern matching solution is required for DNA sequencing data. Bloom filters (BFs) are an efficient data structure, mostly used in bioinformatics for classification of DNA sequences. In this paper, we explore uses of BFs beyond classification. The proposed solution is based on Multiple Bloom Filters (MBFs) and finds all the locations and the number of repetitions of a specified pattern inside a DNA sequence. Both of these factors are extremely important in determining the type and intensity of any disease. This paper serves as a first effort towards optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. We expect that further optimizations of the proposed solution can bring remarkable results, as this paper presents a proof-of-concept implementation for a given set of data using the proposed MBF technique. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Leah L. Weber ◽  
Mohammed El-Kebir

Background. Cancer arises from an evolutionary process where somatic mutations give rise to clonal expansions. Reconstructing this evolutionary process is useful for treatment decision-making as well as understanding evolutionary patterns across patients and cancer types. In particular, classifying a tumor’s evolutionary process as either linear or branched and understanding what cancer types and which patients have each of these trajectories could provide useful insights for both clinicians and researchers. While comprehensive cancer phylogeny inference from single-cell DNA sequencing data is challenging due to limitations with current sequencing technology and the complexity of the resulting problem, current data might provide sufficient signal to accurately classify a tumor’s evolutionary history as either linear or branched. Results. We introduce the Linear Perfect Phylogeny Flipping (LPPF) problem as a means of testing two alternative hypotheses for the pattern of evolution, which we prove to be NP-hard. We develop Phyolin, which uses constraint programming to solve the LPPF problem. Through both in silico experiments and real data application, we demonstrate the performance of our method, outperforming a competing machine learning approach. Conclusion. Phyolin is an accurate, easy to use and fast method for classifying an evolutionary trajectory as linear or branched given a tumor’s single-cell DNA sequencing data.
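To make the linear versus branched distinction concrete, the sketch below checks whether an error-free binary cell-by-mutation matrix is consistent with a linear phylogeny. It assumes the usual characterization that evolution is linear when the sets of cells carrying each mutation are nested, forming a chain under inclusion. The function name `is_linear_phylogeny` is hypothetical. This is not Phyolin and does not solve the LPPF problem, which additionally allows entries to be flipped to account for sequencing errors.

```python
def is_linear_phylogeny(matrix):
    """Return True if the mutation columns of a 0/1 matrix form a nested chain.

    matrix: list of rows, one per cell; matrix[i][j] == 1 means cell i
    carries mutation j. Assumes error-free data (no false positives/negatives).
    """
    if not matrix:
        return True
    num_mutations = len(matrix[0])
    # For each mutation, collect the set of cells that carry it.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j])
        for j in range(num_mutations)
    ]
    # Sorted by size, a chain under inclusion has each set inside the next.
    carriers.sort(key=len)
    return all(small <= large for small, large in zip(carriers, carriers[1:]))
```

For example, `is_linear_phylogeny([[1, 0, 0], [1, 1, 0], [1, 1, 1]])` holds because each mutation's carriers contain those of the next, whereas `[[1, 1, 0], [1, 0, 1]]` is branched: the second and third mutations occur in disjoint cells.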


2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM


PeerJ ◽  
2015 ◽  
Vol 3 ◽  
pp. e1419 ◽  
Author(s):  
Jose E. Kroll ◽  
Jihoon Kim ◽  
Lucila Ohno-Machado ◽  
Sandro J. de Souza

Motivation. Alternative splicing events (ASEs) are prevalent in the transcriptome of eukaryotic species and are known to influence many biological phenomena. The identification and quantification of these events are crucial for a better understanding of biological processes. Next-generation DNA sequencing technologies have allowed deep characterization of transcriptomes and made it possible to address these issues. ASEs analysis, however, represents a challenging task especially when many different samples need to be compared. Some popular tools for the analysis of ASEs are known to report thousands of events without annotations and/or graphical representations. A new tool for the identification and visualization of ASEs is here described, which can be used by biologists without a solid bioinformatics background. Results. A software suite named Splicing Express was created to perform ASEs analysis from transcriptome sequencing data derived from next-generation DNA sequencing platforms. Its major goal is to serve the needs of biomedical researchers who do not have bioinformatics skills. Splicing Express performs automatic annotation of transcriptome data (GTF files) using gene coordinates available from the UCSC genome browser and allows the analysis of data from all available species. The identification of ASEs is done by a known algorithm previously implemented in another tool named Splooce. As a final result, Splicing Express creates a set of HTML files composed of graphics and tables designed to describe the expression profile of ASEs among all analyzed samples. By using RNA-Seq data from the Illumina Human Body Map and the Rat Body Map, we show that Splicing Express is able to perform all tasks in a straightforward way, identifying well-known specific events. Availability and Implementation. Splicing Express is written in Perl and is suitable to run only in UNIX-like systems. More details can be found at: http://www.bioinformatics-brazil.org/splicingexpress.


2017 ◽  
Author(s):  
Luke Zappia ◽  
Belinda Phipson ◽  
Alicia Oshlack

As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
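As a rough illustration of what a gamma-Poisson count model looks like, the sketch below draws a mean expression level per gene from a gamma distribution, perturbs it per cell with a second gamma draw, and then samples an integer count from a Poisson distribution. This is not the Splatter API or the Splat model itself (Splatter is an R/Bioconductor package). The function names, parameter names, and default values are assumptions chosen for illustration only.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(num_genes, num_cells, shape=2.0, scale=1.5,
                    dispersion=0.5, seed=0):
    """Return a genes-by-cells matrix of gamma-Poisson counts.

    shape, scale: gamma parameters for per-gene mean expression (assumed values).
    dispersion:   controls cell-to-cell variability around each gene mean.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(num_genes):
        gene_mean = rng.gammavariate(shape, scale)
        row = []
        for _ in range(num_cells):
            # Gamma draw with expectation gene_mean, then a Poisson count.
            rate = rng.gammavariate(1.0 / dispersion, gene_mean * dispersion)
            row.append(sample_poisson(rate, rng))
        counts.append(row)
    return counts
```

Calling `simulate_counts(num_genes=5, num_cells=4, seed=1)` yields a 5 by 4 matrix of non-negative integers; fixing `seed` makes the simulation reproducible.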


2019 ◽  
Author(s):  
Xian Fan ◽  
Mohammadamin Edrisi ◽  
Nicholas Navin ◽  
Luay Nakhleh

Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. Here we review the major steps that are followed by these methods when analyzing such data, and then review the strengths and limitations of the methods individually. In terms of segmenting the genome into regions of different copy numbers, we categorize the methods into three groups, select a representative method from each group that has been commonly used in this context, and benchmark them on simulated as well as real datasets. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.


2021 ◽  
Vol 8 ◽  
Author(s):  
Elizabeth S. Krestoff ◽  
James P. Creecy ◽  
Wayne D. Lord ◽  
Michelle L. Haynie ◽  
James A. Coyer ◽  
...  

Because Kemp’s ridley sea turtles (Lepidochelys kempii) are critically endangered and closely related to the vulnerable olive ridleys (L. olivacea), it is essential for forensic investigations and conservation efforts to distinguish these species when only skeletal elements remain. DNA extraction and analysis by DNA sequencing of genetic markers is the only method to determine species identity reliably, yet these methods are significantly compromised when DNA becomes degraded. To evaluate the role that time and environment play in obtaining high-quality DNA sequencing data, we placed skeletal elements of a terrestrial mammal (Bos taurus) and L. kempii in a supratidal and subtidal environment for 3 years. Bi-annual sampling revealed that after 3 years, mitochondrial DNA (mtDNA) consistently identified each species from each environment. Our results show that mtDNA recovery from bone and identification for Kemp’s ridley sea turtles was possible up to 3 years in both environments. All sequencing data obtained were accurate and robust, but DNA sequencing results were not consistent after 664 days of exposure. Our findings led us to conclude that if sufficient DNA is extracted from bone samples, then high-quality sequence data can be obtained, and the resulting sequence data accurately reflect the reference sequence for the given gene marker. This study provides evidence that DNA can be extracted and analyzed from challenging biological substrates, like bone, when these substrates are exposed to seasonally dynamic maritime environmental conditions for up to 3 years.


2020 ◽  
Author(s):  
Collin Giguere ◽  
Harsh Vardhan Dubey ◽  
Vishal Kumar Sarsani ◽  
Hachem Saddiki ◽  
Shai He ◽  
...  

Background. Recently, it has become possible to collect next-generation DNA sequencing data sets that are composed of multiple samples from multiple biological units where each of these samples may be from a single cell or bulk tissue. However, no tool yet exists for simulating DNA sequencing data from such a nested sampling arrangement with single-cell and bulk samples so that developers of analysis methods can assess accuracy and precision. Results. We have developed a tool that simulates DNA sequencing data from hierarchically grouped (correlated) samples where each sample is designated bulk or single-cell. Our tool uses a simple configuration file to define the experimental arrangement and can be integrated into software pipelines for testing of variant callers or other genomic tools. Conclusions. The DNA sequencing data generated by our simulator is representative of real data and integrates seamlessly with standard downstream analysis tools.


2016 ◽  
Author(s):  
Joseph Ward ◽  
Christian Cole ◽  
Melanie Febrer ◽  
Geoffrey Barton

Motivation. The current generation of DNA sequencing technologies produces a large amount of data quickly. All of these data need to pass some form of quality control processing and checking before they can be used for any analysis. The large number of samples that are run through Illumina sequencing machines makes the process of quality control an onerous and time-consuming task that requires multiple pieces of information from several sources. Results. AlmostSignificant is an open-source platform for aggregating multiple sources of quality metrics as well as meta-data associated with DNA sequencing runs from Illumina sequencing machines. AlmostSignificant is a graphical platform to streamline the quality control of DNA sequencing data, to collect and store these data for future reference and to collect extra meta-data associated with the sequencing runs to check for errors and monitor the volume of data produced by the associated machines. AlmostSignificant has been used to track the quality of over 80 sequencing runs covering over 2500 samples produced over the last three years. Availability. The code and documentation for AlmostSignificant are freely available at https://github.com/bartongroup/[email protected], [email protected]


2017 ◽  
Author(s):  
Soroush Samadian ◽  
Jeff P. Bruce ◽  
Trevor J. Pugh

Somatic copy number variations (CNVs) play a crucial role in development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms to computationally infer CNV profiles from a variety of data types including exome and targeted sequence data; currently the most prevalent types of cancer genomics data. However, systematic evaluation and comparison of these tools remains challenging due to a lack of ground truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python to introduce user-defined haplotype-phased allele-specific copy number events into an existing Binary Alignment Mapping (BAM) file, with a focus on targeted and exome sequencing experiments. As input, this tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates for introduction of gains and losses (bed file), and an optional file defining known haplotypes (vcf format). To improve runtime performance, Bamgineer introduces the desired CNVs in parallel using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof-of-principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20-100%, 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for development and systematic benchmarking of CNV calling algorithms by users using locally-generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.
Author summary. We present Bamgineer, a software program to introduce user-defined, haplotype-specific copy number variants (CNVs) at any frequency into standard Binary Alignment Mapping (BAM) files. Copy number gains are simulated by introducing new DNA sequencing read pairs sampled from existing reads and modified to contain SNPs of the haplotype of interest. This approach retains biases of the original data such as local coverage, strand bias, and insert size. Deletions are simulated by removing reads corresponding to one or both haplotypes. In our proof-of-principle study, we simulated copy number profiles from 10 cancer types at varying cellularity levels typically encountered in clinical samples. We also demonstrated introduction of low frequency CNVs into cell-free DNA sequencing data that retained the bimodal fragment size distribution characteristic of these data. Bamgineer is flexible and enables users to simulate CNVs that reflect characteristics of locally-generated sequence files and can be used for many applications including development and benchmarking of CNV inference tools for a variety of data types.

