GTC: a novel attempt to maintenance of huge genome collections compressed

Mapping Intimacies ◽

10.1101/131649 ◽

2017 ◽

Author(s):

Agnieszka Danek ◽

Sebastian Deorowicz

Keyword(s):

Genetic Variation ◽

Data Structure ◽

Compression Ratio ◽

Link Type ◽

Variation Data ◽

Compressed Data

Abstract. Results: We present GTC, a novel compressed data structure for representation of huge collections of genetic variation data. GTC significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database, of about 60 thousand haplotypes at about 40 million SNPs, can be stored in less than 4 Gbytes, while the queries related to variants are answered in a fraction of a second. Availability: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/ Contact: [email protected]

GraphAligner: rapid and versatile sequence-to-graph alignment

Genome Biology ◽

10.1186/s13059-020-02157-2 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Mikko Rautiainen ◽

Tobias Marschall

Keyword(s):

Genetic Variation ◽

Error Correction ◽

Genome Assembly ◽

State Of The Art ◽

Source Code ◽

The State ◽

Graph Alignment ◽

Link Type ◽

Long Reads

Abstract. Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner

Coalescent-Based Method for Learning Parameters of Admixture Events from Large-Scale Genetic Variation Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2013.98 ◽

2013 ◽

Vol 10 (5) ◽

pp. 1137-1149

Author(s):

Ming-Chi Tsai ◽

Guy Blelloch ◽

R. Ravi ◽

Russell Schwartz

Keyword(s):

Genetic Variation ◽

Large Scale ◽

Variation Data

Gene Expression and Genetic Variation Data Implicate PCLO in Bipolar Disorder

Biological Psychiatry ◽

10.1016/j.biopsych.2010.09.042 ◽

2011 ◽

Vol 69 (4) ◽

pp. 353-359 ◽

Cited By ~ 39

Author(s):

Kwang H. Choi ◽

Brandon W. Higgs ◽

Jens R. Wendland ◽

Jonathan Song ◽

Francis J. McMahon ◽

...

Keyword(s):

Gene Expression ◽

Bipolar Disorder ◽

Genetic Variation ◽

Variation Data

MetaRanker 2.0: a web server for prioritization of genetic variation data

Nucleic Acids Research ◽

10.1093/nar/gkt387 ◽

2013 ◽

Vol 41 (W1) ◽

pp. W104-W108 ◽

Cited By ~ 21

Author(s):

Tune H. Pers ◽

Piotr Dworzyński ◽

Cecilia Engel Thomas ◽

Kasper Lage ◽

Søren Brunak

Keyword(s):

Genetic Variation ◽

Web Server ◽

Variation Data

A space and time-efficient index for the compacted colored de Bruijn graph

10.1101/191874 ◽

2017 ◽

Cited By ~ 3

Author(s):

Fatemeh Almodaresi ◽

Hirak Sarkar ◽

Rob Patro

Keyword(s):

Data Structure ◽

Pattern Search ◽

De Bruijn Graph ◽

Existing Structures ◽

Link Type ◽

Reference Information ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Asymptotically Efficient ◽

Fast Access

Abstract. We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/Pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.
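
The role of minimum perfect hashing in the abstract above can be illustrated with a toy sketch. This is not Pufferfish's implementation: the function names (`slot`, `build_index`, `lookup`), the brute-force seed search, and the choice of storing the first position of each k-mer are all assumptions made for illustration. The point it shows is that the index keeps only a seed and a position array, not the k-mers themselves, so every query must be verified against the reference sequence because a perfect hash also returns a slot for k-mers that are absent.

```python
import hashlib


def slot(kmer: str, seed: int, n_slots: int) -> int:
    """Deterministic seeded hash of a k-mer into [0, n_slots)."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % n_slots


def build_index(sequence: str, k: int, max_seed: int = 100_000):
    """Find a seed that maps every distinct k-mer to its own slot.

    Brute-force search is only practical for toy inputs; it stands in
    for a real minimum perfect hash construction (assumption).
    """
    first_position = {}
    for i in range(len(sequence) - k + 1):
        first_position.setdefault(sequence[i:i + k], i)
    kmers = list(first_position)
    n = len(kmers)
    for seed in range(max_seed):
        if len({slot(km, seed, n) for km in kmers}) == n:
            positions = [0] * n
            for km in kmers:
                positions[slot(km, seed, n)] = first_position[km]
            return seed, positions
    raise RuntimeError("no collision-free seed found; use a real MPHF library")


def lookup(sequence: str, k: int, seed: int, positions: list, query: str):
    """Return the stored position of `query`, or None if it is absent."""
    if len(query) != k or not positions:
        return None
    pos = positions[slot(query, seed, len(positions))]
    # The hash gives a slot for any input, so confirm against the sequence.
    return pos if sequence[pos:pos + k] == query else None


reference = "ACGTACGTTT"
seed, positions = build_index(reference, 4)
hit = lookup(reference, 4, seed, positions, "CGTT")
miss = lookup(reference, 4, seed, positions, "GGGG")
```

Here `hit` is the position of the k-mer in the toy reference and `miss` is `None`. The space saving comes from dropping the keys: a conventional hash table would store all six distinct 4-mers alongside their positions.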

Decoupling alignment strategy from feature quantification using a standard alignment incidence data structure

10.1101/2021.02.16.431379 ◽

2021 ◽

Author(s):

Kwangbom Choi ◽

Matthew J. Vincent ◽

Gary A. Churchill

Keyword(s):

Gene Expression ◽

Data Structure ◽

Genomic Feature ◽

File Format ◽

Read Alignment ◽

Incidence Data ◽

Link Type ◽

Generic Data ◽

Alignment Strategy ◽

Standard Alignment

Abstract. Summary: The abundance of genomic features such as gene expression is often estimated from the observed total number of alignment incidences in the targeted genome regions. We introduce a generic data structure and associated file format for alignment incidence data so that method developers can create novel pipelines comprising models, each optimal for read alignment, post-alignment QC, and quantification across multiple sequencing modalities. Availability and Implementation: alntools software is freely available at https://github.com/churchill-lab/alntools under the MIT license. Contact: [email protected] or [email protected]

StabilitySort: assessment of protein stability changes on a genome-wide scale to prioritise potentially pathogenic genetic variation

10.1101/2021.11.28.470298 ◽

2021 ◽

Author(s):

Aaron Chuah ◽

Sean Li ◽

Andrea Do ◽

Matt A Field ◽

T. Daniel Andrews

Keyword(s):

Genetic Variation ◽

Protein Stability ◽

Missense Mutations ◽

Link Type ◽

Genome Wide ◽

Pathogenic Variation ◽

A Genome ◽

Human Proteins ◽

Wide Scale ◽

The Stability

Abstract. Summary: Missense mutations that change protein stability are strongly associated with human inherited genetic disease. With the recent availability of predicted structures for all human proteins generated using the AlphaFold2 prediction model, genome-wide assessment of the stability effects of genetic variation can, for the first time, be easily performed. This facilitates the interrogation of personal genetic variation for potentially pathogenic effects through the application of stability metrics. Here, we present a novel algorithm to prioritise variants predicted to strongly destabilise essential proteins, available as both a standalone software package and a web-based tool. We demonstrate the utility of this tool by showing that at values of the Stability Sort Z-score above 1.6, pathogenic, protein-destabilising variants from ClinVar are detected at a 58% enrichment, over and above the destabilising (but presumably non-pathogenic) variation already present in the HapMap NA12878 genome. Availability and Implementation: StabilitySort is available as a web service (http://130.56.244.113/StabilitySort/) and can be deployed as a standalone system (https://gitlab.com/baaron/StabilitySort). Contact: [email protected]

Umap and Bismap: quantifying genome and methylome mappability

10.1101/095463 ◽

2016 ◽

Cited By ~ 5

Author(s):

Mehran Karimzadeh ◽

Carl Ernst ◽

Anshul Kundaje ◽

Michael M. Hoffman

Keyword(s):

Genetic Variation ◽

Bisulfite Sequencing ◽

Cpg Islands ◽

Read Length ◽

Methylation Array ◽

Short Read ◽

Short Read Sequencing ◽

Sequencing Errors ◽

Link Type ◽

A Genome

Abstract. Motivation: Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding, and chemical modifications. Every region in a genome assembly has a property called mappability, which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. At best, sequencing assays will produce misleadingly low numbers of reads in these regions. At worst, these regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. While many tools consider mappability during the read mapping process, subsequent analysis often loses this information. Both to correct assumptions of uniformity in downstream analysis, and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. Results: We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. With a read length of 24 bp, 18.7% of the unmodified genome and 33.5% of the bisulfite-converted genome is not uniquely mappable. This complicates interpretation of functional genomics experiments using short-read sequencing, especially in regulatory regions. For example, 81% of human CpG islands overlap with regions that are not uniquely mappable. Similarly, in some ENCODE ChIP-seq datasets, up to 50% of peaks overlap with regions that are not uniquely mappable.
We also explored differentially methylated regions from a case-control study and identified regions that were not uniquely mappable. In the widely used 450K methylation array, 4,230 probes are not uniquely mappable. Genome mappability is higher with longer sequencing reads, but most publicly available ChIP-seq and reduced representation bisulfite sequencing datasets have shorter reads. Therefore, uneven and low mappability remains a concern in a majority of existing data. Availability: A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at http://bismap.hoffmanlab.org for use with the UCSC and Ensembl genome browsers. We have deposited in Zenodo the current version of our software (https://doi.org/10.5281/zenodo.800648) and the mappability data used in this project (https://doi.org/10.5281/zenodo.800645). In addition, the software (https://bitbucket.org/hoffmanlab/umap) is freely available under the GNU General Public License, version 3 (GPLv3). Contact: [email protected]
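
The notion of unique mappability described in the abstract above can be made concrete with a small sketch. It is not the Umap software: the function names and the toy genome are assumptions, only the forward strand is considered, and reads are modelled as exact k-mers without sequencing errors. The sketch flags every read start whose k-mer occurs exactly once, then marks each base that is covered by at least one such unique k-mer.

```python
from collections import Counter


def unique_kmer_starts(genome: str, k: int) -> list:
    """Flag each k-mer start position whose k-mer occurs exactly once."""
    n_starts = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(n_starts))
    return [counts[genome[i:i + k]] == 1 for i in range(n_starts)]


def covered_by_unique_kmer(genome: str, k: int) -> list:
    """Flag each base that lies inside at least one unique k-mer."""
    covered = [False] * len(genome)
    for start, is_unique in enumerate(unique_kmer_starts(genome, k)):
        if is_unique:
            for pos in range(start, start + k):
                covered[pos] = True
    return covered


toy_genome = "ACGTACGTTT"  # hypothetical sequence; "ACGT" occurs twice
starts = unique_kmer_starts(toy_genome, 4)
covered = covered_by_unique_kmer(toy_genome, 4)
not_unique_fraction = 1 - sum(covered) / len(covered)
```

In this toy genome the two read starts carrying `ACGT` are not unique, and only the first base is left uncovered by any unique 4-mer. Increasing `k` makes repeated k-mers rarer, which matches the abstract's remark that mappability is higher with longer reads.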

MSAC: Compression of multiple sequence alignment files

10.1101/240341 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz ◽

Joanna Walczyszyn ◽

Agnieszka Debudaj-Grabysz

Keyword(s):

Sequence Alignment ◽

Compression Ratio ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Link Type ◽

Bioinformatics Databases ◽

Supplementary Material ◽

Burrows Wheeler Transform

Abstract. Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB are multiple sequence alignments of protein families. The largest such database, Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data has become a challenge. Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA, as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, e.g., gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio. Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/ Contact: [email protected] Supplementary material: Supplementary data are available at the publisher Web site.
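
The transform named in the abstract above can be sketched in a few lines. This is a generic positional Burrows–Wheeler transform over an arbitrary alphabet, not MSAC's code: the function names and the toy alignment are assumptions, and the entropy coding that a real compressor would apply afterwards is omitted. At each column, rows are visited in the order given by their preceding columns, and the order is then updated by a stable sort on the current symbol, so rows with similar histories end up adjacent and their symbols tend to form runs.

```python
def pbwt_forward(rows: list) -> list:
    """Return one permuted column string per alignment column."""
    if not rows:
        return []
    order = list(range(len(rows)))  # rows ordered by their reversed prefixes
    permuted_columns = []
    for col in range(len(rows[0])):
        permuted_columns.append("".join(rows[r][col] for r in order))
        # A stable sort keeps the previous order as the tie-break.
        order = sorted(order, key=lambda r: rows[r][col])
    return permuted_columns


def pbwt_inverse(permuted_columns: list) -> list:
    """Rebuild the original rows from the permuted columns."""
    if not permuted_columns:
        return []
    n_rows = len(permuted_columns[0])
    order = list(range(n_rows))
    rows = [[] for _ in range(n_rows)]
    for column in permuted_columns:
        symbol_of = {}
        for rank, r in enumerate(order):
            rows[r].append(column[rank])
            symbol_of[r] = column[rank]
        # Replay the same stable sort that the forward pass used.
        order = sorted(order, key=lambda r: symbol_of[r])
    return ["".join(chars) for chars in rows]


alignment = ["AC-GT", "AC-GA", "TC-GT", "ACAGT"]  # hypothetical aligned rows
transformed = pbwt_forward(alignment)
restored = pbwt_inverse(transformed)
```

The inverse needs no side information because the row order at each column depends only on columns that have already been decoded.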

Queryable Compression on Time-evolving Web and Social Networks with Streaming

ACM Transactions on the Web ◽

10.1145/3495012 ◽

2022 ◽

Vol 16 (2) ◽

pp. 1-21

Author(s):

Michael Nelson ◽

Sridhar Radhakrishnan ◽

Chandra Sekharan ◽

Amlan Chatterjee ◽

Sudhindra Gopal Krishna

Keyword(s):

Data Structure ◽

Adjacency Matrix ◽

Binary Tree ◽

Network Data ◽

Data Sets ◽

Undirected Graphs ◽

Sparse Graphs ◽

Compressed Data ◽

Over Time ◽

Better Than

Time-evolving web and social network graphs are modeled as a set of pages/individuals (nodes) and their arcs (links/relationships) that change over time. Due to their popularity, they have become increasingly massive in terms of their number of nodes, arcs, and lifetimes. However, these graphs are extremely sparse throughout their lifetimes. For example, it is estimated that Facebook has over a billion vertices, yet at any point in time, it has far fewer than 0.001% of all possible relationships. These large sparse graphs may not fit in most main memories when stored with underlying representations such as a series of adjacency matrices or adjacency lists. We propose building a compressed data structure that has a compressed binary tree corresponding to each row of each adjacency matrix of the time-evolving graph. We do not explicitly construct the adjacency matrix, and our algorithms take the time-evolving arc list representation as input for its construction. Our compressed structure allows for directed and undirected graphs, faster arc and neighborhood queries, as well as the ability for arcs and frames to be added and removed directly from the compressed structure (streaming operations). We use publicly available network data sets such as Flickr, Yahoo!, and Wikipedia in our experiments and show that our new technique performs as well as or better than our benchmarks on all datasets in terms of compression size and other vital metrics.
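
The idea of one binary tree per adjacency-matrix row, built directly from an arc list, can be sketched as follows. The abstract does not describe the authors' encoding, so everything here is an assumption: the class name `SparseRowTree`, the heap-style node numbering, and the use of a Python set to hold only the non-zero nodes. The sketch shows how all-zero subtrees are simply never stored, and how arcs can be added, removed, and queried without materialising the matrix.

```python
class SparseRowTree:
    """One adjacency-matrix row as a binary tree that omits all-zero subtrees."""

    def __init__(self, n_columns: int):
        self.width = 1
        while self.width < n_columns:
            self.width *= 2
        self.nodes = set()  # heap indices of non-zero nodes; the root is 1

    def add_arc(self, column: int) -> None:
        node = self.width + column
        # Mark the leaf and every ancestor that is not yet marked.
        while node >= 1 and node not in self.nodes:
            self.nodes.add(node)
            node //= 2

    def remove_arc(self, column: int) -> None:
        node = self.width + column
        if node not in self.nodes:
            return
        self.nodes.discard(node)
        node //= 2
        # Prune ancestors whose subtrees have become all-zero.
        while (node >= 1 and 2 * node not in self.nodes
               and 2 * node + 1 not in self.nodes):
            self.nodes.discard(node)
            node //= 2

    def has_arc(self, column: int) -> bool:
        return self.width + column in self.nodes

    def neighbors(self) -> list:
        columns = []
        stack = [1] if 1 in self.nodes else []
        while stack:
            node = stack.pop()
            if node >= self.width:
                columns.append(node - self.width)
            else:
                # Push right before left so columns come out in ascending order.
                stack.extend(c for c in (2 * node + 1, 2 * node) if c in self.nodes)
        return columns


def build_frames(arcs, n_nodes: int) -> dict:
    """Build {frame: {source: SparseRowTree}} from (frame, source, target) arcs."""
    frames = {}
    for frame, source, target in arcs:
        rows = frames.setdefault(frame, {})
        if source not in rows:
            rows[source] = SparseRowTree(n_nodes)
        rows[source].add_arc(target)
    return frames


arc_list = [(0, 0, 3), (0, 0, 1), (1, 0, 3), (1, 2, 0)]  # hypothetical arcs
frames = build_frames(arc_list, 4)
row = frames[0][0]
```

A neighborhood query descends only into marked children, so its cost depends on the number of arcs in the row rather than on the row width. An undirected graph could be handled by inserting each arc in both directions.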
