GTC: a novel attempt to maintenance of huge genome collections compressed

2017 ◽  
Author(s):  
Agnieszka Danek ◽  
Sebastian Deorowicz

Abstract Motivation and Results: We present GTC, a novel compressed data structure for representing huge collections of genetic variation data. GTC significantly outperforms existing solutions in terms of compression ratio and the time needed to answer various types of queries. We show that the largest publicly available database, of about 60 thousand haplotypes at about 40 million SNPs, can be stored in less than 4 Gbytes, while queries related to variants are answered in a fraction of a second. Availability: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Mikko Rautiainen ◽  
Tobias Marschall

Abstract Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner


2011 ◽  
Vol 69 (4) ◽  
pp. 353-359 ◽  
Author(s):  
Kwang H. Choi ◽  
Brandon W. Higgs ◽  
Jens R. Wendland ◽  
Jonathan Song ◽  
Francis J. McMahon ◽  
...  

2013 ◽  
Vol 41 (W1) ◽  
pp. W104-W108 ◽  
Author(s):  
Tune H. Pers ◽  
Piotr Dworzyński ◽  
Cecilia Engel Thomas ◽  
Kasper Lage ◽  
Søren Brunak

2017 ◽  
Author(s):  
Fatemeh Almodaresi ◽  
Hirak Sarkar ◽  
Rob Patro

Abstract We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/Pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.
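
As a rough illustration of the hashing-based k-mer lookup the abstract describes, the sketch below maps every k-mer of a set of unitigs to a (unitig id, offset) pair. It is not Pufferfish's implementation: a plain Python dict stands in for the minimum perfect hash, reverse complements and the sampling scheme are ignored, and the names `build_kmer_index` and `lookup_kmer` are hypothetical.

```python
from typing import Optional


def build_kmer_index(unitigs: list[str], k: int) -> dict[str, tuple[int, int]]:
    """Map each k-mer to the (unitig id, offset) where it first occurs.

    Assumption: a dict replaces the minimum perfect hash named in the
    abstract, and only the forward strand is indexed.
    """
    index: dict[str, tuple[int, int]] = {}
    for unitig_id, sequence in enumerate(unitigs):
        for offset in range(len(sequence) - k + 1):
            # setdefault keeps the first occurrence if a k-mer repeats.
            index.setdefault(sequence[offset:offset + k], (unitig_id, offset))
    return index


def lookup_kmer(index: dict[str, tuple[int, int]], kmer: str) -> Optional[tuple[int, int]]:
    """Return the (unitig id, offset) of a k-mer, or None if it is absent."""
    return index.get(kmer)


if __name__ == "__main__":
    idx = build_kmer_index(["ACGTA", "GGCT"], k=3)
    print(lookup_kmer(idx, "GTA"))
    print(lookup_kmer(idx, "TTT"))
```

A dict stores every key explicitly, which is the space cost that a minimum perfect hash combined with a succinct sequence representation is meant to avoid.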


2021 ◽  
Author(s):  
Kwangbom Choi ◽  
Matthew J. Vincent ◽  
Gary A. Churchill

Abstract Summary: The abundance of genomic features such as gene expression is often estimated from the observed total number of alignment incidences in the targeted genome regions. We introduce a generic data structure and associated file format for alignment incidence data so that method developers can create novel pipelines comprising models, each optimal for read alignment, post-alignment QC, and quantification across multiple sequencing modalities. Availability and Implementation: alntools software is freely available at https://github.com/churchill-lab/alntools under the MIT license.


2021 ◽  
Author(s):  
Aaron Chuah ◽  
Sean Li ◽  
Andrea Do ◽  
Matt A Field ◽  
T. Daniel Andrews

Abstract Summary: Missense mutations that change protein stability are strongly associated with human inherited genetic disease. With the recent availability of predicted structures for all human proteins generated using the AlphaFold2 prediction model, genome-wide assessment of the stability effects of genetic variation can, for the first time, be easily performed. This facilitates the interrogation of personal genetic variation for potentially pathogenic effects through the application of stability metrics. Here, we present a novel algorithm to prioritise variants predicted to strongly destabilise essential proteins, available as both a standalone software package and a web-based tool. We demonstrate the utility of this tool by showing that at values of the Stability Sort Z-score above 1.6, pathogenic, protein-destabilising variants from ClinVar are detected at a 58% enrichment, over and above the destabilising (but presumably non-pathogenic) variation already present in the HapMap NA12878 genome. Availability and Implementation: StabilitySort is available both as a web service (http://130.56.244.113/StabilitySort/) and as a standalone system (https://gitlab.com/baaron/StabilitySort).


2016 ◽  
Author(s):  
Mehran Karimzadeh ◽  
Carl Ernst ◽  
Anshul Kundaje ◽  
Michael M. Hoffman

Abstract Motivation: Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding, and chemical modifications. Every region in a genome assembly has a property called mappability, which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. At best, sequencing assays will produce misleadingly low numbers of reads in these regions. At worst, these regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. While many tools consider mappability during the read mapping process, subsequent analysis often loses this information. Both to correct assumptions of uniformity in downstream analysis, and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. Results: We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. With a read length of 24 bp, 18.7% of the unmodified genome and 33.5% of the bisulfite-converted genome are not uniquely mappable. This complicates interpretation of functional genomics experiments using short-read sequencing, especially in regulatory regions. For example, 81% of human CpG islands overlap with regions that are not uniquely mappable. Similarly, in some ENCODE ChIP-seq datasets, up to 50% of peaks overlap with regions that are not uniquely mappable. We also explored differentially methylated regions from a case-control study and identified regions that were not uniquely mappable. In the widely used 450K methylation array, 4,230 probes are not uniquely mappable. Genome mappability is higher with longer sequencing reads, but most publicly available ChIP-seq and reduced representation bisulfite sequencing datasets have shorter reads. Therefore, uneven and low mappability remains a concern in a majority of existing data. Availability: A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at http://bismap.hoffmanlab.org for use with the UCSC and Ensembl genome browsers. We have deposited in Zenodo the current version of our software (https://doi.org/10.5281/zenodo.800648) and the mappability data used in this project (https://doi.org/10.5281/zenodo.800645). In addition, the software (https://bitbucket.org/hoffmanlab/umap) is freely available under the GNU General Public License, version 3 (GPLv3).
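
The notion of unique mappability in this abstract can be illustrated with a toy computation. The sketch below is not the Umap or Bismap code: it assumes exact matching on the forward strand only, treats a base as uniquely mappable when at least one unique k-mer covers it, and models bisulfite conversion as a plain C-to-T substitution. All function names are hypothetical.

```python
from collections import Counter


def unique_kmer_starts(genome: str, k: int) -> list[bool]:
    """For each k-mer start position, report whether that k-mer occurs exactly once."""
    n_starts = len(genome) - k + 1
    if n_starts <= 0:
        return []
    counts = Counter(genome[i:i + k] for i in range(n_starts))
    return [counts[genome[i:i + k]] == 1 for i in range(n_starts)]


def uniquely_mappable_fraction(genome: str, k: int) -> float:
    """Fraction of bases covered by at least one unique k-mer (toy definition)."""
    if not genome:
        return 0.0
    covered = [False] * len(genome)
    for start, is_unique in enumerate(unique_kmer_starts(genome, k)):
        if is_unique:
            for pos in range(start, start + k):
                covered[pos] = True
    return sum(covered) / len(genome)


def bisulfite_convert(genome: str) -> str:
    """Simplified forward-strand conversion: every C is read as T."""
    return genome.replace("C", "T")


if __name__ == "__main__":
    genome = "ACGTATGT"
    print(uniquely_mappable_fraction(genome, 4))
    print(uniquely_mappable_fraction(bisulfite_convert(genome), 4))
```

In this toy input the conversion turns ACGT into ATGT, creating a repeated 4-mer, which mirrors the abstract's point that bisulfite-converted genomes are less uniquely mappable than unmodified ones.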


2017 ◽  
Author(s):  
Sebastian Deorowicz ◽  
Joanna Walczyszyn ◽  
Agnieszka Debudaj-Grabysz

Abstract Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data has become a challenge. Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, e.g., gzip. The experiments performed also yielded an analysis of the influence of protein family size on the compression ratio. Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/ Supplementary material: Supplementary data are available at the publisher Web site.
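
To give a feel for a positional Burrows–Wheeler transform over a non-binary alphabet, here is a minimal sketch. It is an assumption-laden illustration, not MSAC's algorithm: it only emits the permuted columns, using a stable sort on the current symbol to update the row order, and the function name `positional_bwt` is hypothetical.

```python
def positional_bwt(rows: list[str]) -> list[str]:
    """Return the permuted columns of a positional BWT over equal-length rows.

    Before each column is emitted, rows are ordered by their reversed
    prefixes; a stable sort on the current symbol maintains that order
    for any alphabet size, not just a binary one.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same length")

    order = list(range(len(rows)))  # row order before column 0
    columns = []
    for col in range(width):
        columns.append("".join(rows[i][col] for i in order))
        # Python's sort is stable, so ties keep the previous prefix order.
        order = sorted(order, key=lambda i: rows[i][col])
    return columns


if __name__ == "__main__":
    alignment = ["MKA", "MRG", "MKA", "MRG"]
    print(positional_bwt(alignment))
```

In the toy alignment the last column reads AGAG in input order, but it is emitted as a column with two runs, because rows sharing the same preceding symbols are placed next to each other. Longer runs of identical symbols are what a downstream entropy or run-length coder can exploit.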


2022 ◽  
Vol 16 (2) ◽  
pp. 1-21
Author(s):  
Michael Nelson ◽  
Sridhar Radhakrishnan ◽  
Chandra Sekharan ◽  
Amlan Chatterjee ◽  
Sudhindra Gopal Krishna

Time-evolving web and social network graphs are modeled as a set of pages/individuals (nodes) and their arcs (links/relationships) that change over time. Due to their popularity, they have become increasingly massive in terms of their number of nodes, arcs, and lifetimes. However, these graphs are extremely sparse throughout their lifetimes. For example, it is estimated that Facebook has over a billion vertices, yet at any point in time, it has far fewer than 0.001% of all possible relationships. These large sparse graphs may not fit in most main memories when stored using underlying representations such as a series of adjacency matrices or adjacency lists. We propose building a compressed data structure that has a compressed binary tree corresponding to each row of each adjacency matrix of the time-evolving graph. We do not explicitly construct the adjacency matrix, and our algorithms take the time-evolving arc list representation as input for its construction. Our compressed structure allows for directed and undirected graphs, faster arc and neighborhood queries, as well as the ability for arcs and frames to be added and removed directly from the compressed structure (streaming operations). We use publicly available network data sets such as Flickr, Yahoo!, and Wikipedia in our experiments and show that our new technique performs as well as or better than our benchmarks on all datasets in terms of compression size and other vital metrics.
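
The idea of one binary tree per adjacency-matrix row, built straight from an arc list, can be sketched as follows. This is a simplified stand-in under stated assumptions, not the authors' compressed encoding: each row is a binary tree over the column range in which all-zero subtrees are simply absent, nodes are plain Python lists, and only insertion and queries are shown. The class and helper names are hypothetical.

```python
def _insert(node, lo, hi, col):
    """Insert column `col` into the tree covering [lo, hi); return the new node."""
    if hi - lo == 1:
        return True  # leaf: this single column holds an arc
    if node is None:
        node = [None, None]  # children for the left and right half-ranges
    mid = (lo + hi) // 2
    if col < mid:
        node[0] = _insert(node[0], lo, mid, col)
    else:
        node[1] = _insert(node[1], mid, hi, col)
    return node


def _contains(node, lo, hi, col):
    """Walk down the tree; a missing subtree means the whole range is zero."""
    while node is not None:
        if hi - lo == 1:
            return True
        mid = (lo + hi) // 2
        if col < mid:
            node, hi = node[0], mid
        else:
            node, lo = node[1], mid
    return False


def _columns(node, lo, hi):
    """Yield all stored columns in increasing order."""
    if node is None:
        return
    if hi - lo == 1:
        yield lo
        return
    mid = (lo + hi) // 2
    yield from _columns(node[0], lo, mid)
    yield from _columns(node[1], mid, hi)


class TimeEvolvingGraph:
    """One pruned binary tree per (frame, source node); no matrix is built."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.rows = {}  # (frame, source) -> tree root

    def add_arc(self, frame: int, source: int, target: int) -> None:
        if not (0 <= source < self.num_nodes and 0 <= target < self.num_nodes):
            raise ValueError("node id out of range")
        key = (frame, source)
        self.rows[key] = _insert(self.rows.get(key), 0, self.num_nodes, target)

    def has_arc(self, frame: int, source: int, target: int) -> bool:
        if not 0 <= target < self.num_nodes:
            return False
        return _contains(self.rows.get((frame, source)), 0, self.num_nodes, target)

    def neighbors(self, frame: int, source: int) -> list[int]:
        return list(_columns(self.rows.get((frame, source)), 0, self.num_nodes))


if __name__ == "__main__":
    graph = TimeEvolvingGraph(num_nodes=8)
    for frame, u, v in [(0, 1, 5), (0, 1, 2), (1, 1, 7)]:
        graph.add_arc(frame, u, v)
    print(graph.neighbors(0, 1), graph.has_arc(1, 1, 5))
```

Because empty subtrees are never allocated, a row with few arcs needs on the order of (number of arcs) times log(number of nodes) tree nodes rather than one cell per column, which is the property that makes a per-row tree attractive for very sparse graphs.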

