Crypt4GH: a file format standard enabling native access to encrypted data

Abstract Motivation The majority of genome analysis tools and pipelines require data to be decrypted for access. This potentially leaves sensitive genetic data exposed, either because the unencrypted data is not removed after analysis, or because the data leaves traces on the permanent storage medium. Results We defined a file container specification enabling direct byte-level compatible random access to encrypted genetic data stored in community standards such as SAM/BAM/CRAM/VCF/BCF. By standardizing this format, we show how it can be added as a native file format to genomic libraries, enabling direct analysis of encrypted data without the need to create a decrypted copy. Availability and implementation The Crypt4GH specification can be found at: http://samtools.github.io/hts-specs/crypt4gh.pdf. Supplementary information Supplementary data are available at Bioinformatics online.
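As a rough illustration of how a segmented container can allow random access without decrypting the whole file, the Python sketch below maps a plaintext byte range to the encrypted bytes that would have to be read. The layout constants (64 KiB plaintext segments, each stored with a 12-byte nonce and a 16-byte authentication tag) and the function name encrypted_span are assumptions made for this example and are not stated in the abstract; the specification linked above is the authoritative source. No decryption is performed here.

```python
# Assumed layout constants (illustrative, not taken from the abstract).
SEGMENT_PLAIN = 65536                 # plaintext bytes per segment
NONCE_LEN = 12                        # per-segment nonce stored before the ciphertext
MAC_LEN = 16                          # per-segment authentication tag stored after it
SEGMENT_CIPHER = NONCE_LEN + SEGMENT_PLAIN + MAC_LEN


def encrypted_span(plain_offset, length, header_len):
    """Locate the encrypted bytes covering a plaintext range.

    Returns (file_offset, byte_count, skip):
      file_offset  where to start reading in the encrypted file
      byte_count   upper bound on bytes to read (the final segment of a
                   file may be shorter than a full segment)
      skip         plaintext bytes to discard after decrypting the
                   first segment
    """
    if plain_offset < 0 or length <= 0:
        raise ValueError("plain_offset must be >= 0 and length must be > 0")
    first = plain_offset // SEGMENT_PLAIN
    last = (plain_offset + length - 1) // SEGMENT_PLAIN
    file_offset = header_len + first * SEGMENT_CIPHER
    byte_count = (last - first + 1) * SEGMENT_CIPHER
    skip = plain_offset - first * SEGMENT_PLAIN
    return file_offset, byte_count, skip
```

Because each segment is independently authenticated in this assumed layout, a reader only needs to fetch and decrypt the segments overlapping the requested range.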


cd2sbgnml: bidirectional conversion between CellDesigner and SBGN formats

Bioinformatics ◽

10.1093/bioinformatics/btz969 ◽

2020 ◽

Vol 36 (8) ◽

pp. 2620-2622 ◽

Cited By ~ 3

Author(s):

Irina Balaur ◽

Ludovic Roy ◽

Alexander Mazein ◽

S Gökberk Karaca ◽

Ugur Dogrusoz ◽

...

Keyword(s):

Systems Biology ◽

Web Service ◽

Large Scale ◽

Supplementary Information ◽

Markup Language ◽

File Format ◽

Signalling Network ◽

Lesser General Public License ◽

Systems Biology Markup Language ◽

General Public License

Abstract Motivation CellDesigner is a well-established biological map editor used in many large-scale scientific efforts. However, interoperability between the Systems Biology Graphical Notation (SBGN) Markup Language (SBGN-ML) and CellDesigner’s proprietary Systems Biology Markup Language (SBML) extension formats remains a challenge due to the proprietary extensions used in CellDesigner files. Results We introduce a library named cd2sbgnml and an associated web service for bidirectional conversion between CellDesigner’s proprietary SBML extension and SBGN-ML formats. We discuss the functionality of the cd2sbgnml converter, which was successfully used for the translation of comprehensive large-scale diagrams such as the RECON Human Metabolic network and the complete Atlas of Cancer Signalling Network, from the CellDesigner file format into SBGN-ML. Availability and implementation The cd2sbgnml conversion library and the web service were developed in Java, and distributed under the GNU Lesser General Public License v3.0. The sources along with a set of examples are available on GitHub (https://github.com/sbgn/cd2sbgnml and https://github.com/sbgn/cd2sbgnml-webservice, respectively). Supplementary information Supplementary data are available at Bioinformatics online.


SPRING: a next-generation compressor for FASTQ data

Bioinformatics ◽

10.1093/bioinformatics/bty1015 ◽

2018 ◽

Vol 35 (15) ◽

pp. 2674-2676 ◽

Cited By ~ 18

Author(s):

Shubham Chandak ◽

Kedar Tatwawadi ◽

Idoia Ochoa ◽

Mikel Hernaez ◽

Tsachy Weissman

Keyword(s):

High Throughput Sequencing ◽

Random Access ◽

Lossless Compression ◽

General Purpose ◽

Supplementary Information ◽

High Coverage ◽

Sequencing Technologies ◽

Long Read ◽

Previous State ◽

Computational Resources

Abstract Motivation High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. Results In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina’s NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information Supplementary data are available at Bioinformatics online.


RCytoGPS: An R Package for Reading and Visualizing Cytogenetics Data

Bioinformatics ◽

10.1093/bioinformatics/btab683 ◽

2021 ◽

Author(s):

Zachary B Abrams ◽

Dwayne G Tally ◽

Lynne V Abruzzo ◽

Kevin R Coombes

Keyword(s):

Large Scale ◽

International System ◽

Genetic Data ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Computational Tools ◽

Text Format ◽

Karyotype Analyses ◽

Computational Analyses

Abstract Summary Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. Availability and Implementation Freely available at https://CRAN.R-project.org/package=RCytoGPS. The code for the underlying CytoGPS software can be found at https://github.com/i2-wustl/CytoGPS. Supplementary information There is no supplementary data.


emeraLD: Rapid Linkage Disequilibrium Estimation with Massive Data Sets

10.1101/301366 ◽

2018 ◽

Cited By ~ 1

Author(s):

Corbin Quick ◽

Christian Fuchsberger ◽

Daniel Taliun ◽

Gonçalo Abecasis ◽

Michael Boehnke ◽

...

Keyword(s):

Linkage Disequilibrium ◽

Association Studies ◽

Random Access ◽

Supplementary Information ◽

Data Sets ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Genome Wide ◽

Wide Range ◽

Supplementary Material

Abstract Summary Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies (GWAS). Large genetic data sets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD orders of magnitude faster than existing tools. Availability and Implementation emeraLD is implemented in C++, and is open source under GPLv3. Source code, documentation, an R interface, and utilities for analysis of summary statistics are freely available at http://github.com/statgen/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
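To make the role of sparsity concrete, here is a minimal Python sketch of the standard r² measure of LD between two biallelic variants, computed from sets of haplotype indices that carry the minor allele. This is not emeraLD's implementation; the function name ld_r2 and the set-based representation are assumptions chosen for illustration. The point is only that, when minor alleles are rare, the work scales with the number of carriers rather than with the total number of haplotypes.

```python
def ld_r2(carriers_a, carriers_b, n_haplotypes):
    """Squared correlation (r^2) between two biallelic variants.

    carriers_a, carriers_b: sets of haplotype indices carrying the
    minor allele at variant A and variant B (a sparse representation).
    n_haplotypes: total number of haplotypes in the sample.
    """
    p_a = len(carriers_a) / n_haplotypes
    p_b = len(carriers_b) / n_haplotypes
    # Frequency of haplotypes carrying both minor alleles.
    p_ab = len(carriers_a & carriers_b) / n_haplotypes
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                   # a monomorphic variant has no defined r^2
    return d * d / denom
```

For example, ld_r2({0, 1}, {0, 1}, 4) describes two variants carried by exactly the same haplotypes, and ld_r2({0, 1}, {0, 2}, 4) describes two variants whose carriers overlap no more than expected by chance.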


sangeranalyseR: simple and interactive analysis of Sanger sequencing data in R

10.1101/2020.05.18.102459 ◽

2020 ◽

Author(s):

Kuan-Hao Chao ◽

Kirston Barton ◽

Sarah Palmer ◽

Robert Lanfear

Keyword(s):

Sanger Sequencing ◽

Reference Sequence ◽

Supplementary Information ◽

File Format ◽

Bioconductor Package ◽

Sequencing Data ◽

Interactive Analysis ◽

Link Type ◽

Online Documentation ◽

Wide Range

Abstract Summary sangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly performed actions including read trimming, detecting secondary peaks, viewing chromatograms, and detecting indels using a reference sequence. All parameters can be adjusted interactively either in R or in the associated Shiny applications. sangeranalyseR comes with extensive online documentation, and outputs detailed interactive HTML reports. Availability and implementation sangeranalyseR is implemented in R and released under an MIT license. It is available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on GitHub (https://github.com/roblanf/sangeranalyseR). Supplementary information Documentation at https://sangeranalyser.readthedocs.io/.


TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes

Bioinformatics ◽

10.1093/bioinformatics/btz157 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3679-3683 ◽

Cited By ~ 8

Author(s):

Aritra Bose ◽

Vassilis Kalantzis ◽

Eugenia-Maria Kontopoulou ◽

Mai Elkady ◽

Peristera Paschou ◽

...

Keyword(s):

Principal Component Analysis ◽

Large Scale ◽

Human Genetics ◽

Random Access ◽

Principal Component ◽

Component Analysis ◽

Supplementary Information ◽

Subspace Iteration ◽

System Memory ◽

Traditional Approaches

Abstract Motivation Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets grow ever larger, traditional approaches based on loading the entire dataset into the system memory (Random Access Memory) become impractical and out-of-core implementations are the only viable alternative. Results We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. Availability and implementation Source code and documentation are both available at https://github.com/aritra90/TeraPCA. Supplementary information Supplementary data are available at Bioinformatics online.


PyRanges: efficient comparison of genomic intervals in Python

10.1101/609396 ◽

2019 ◽

Cited By ~ 1

Author(s):

Endre Bakken Stovner ◽

Pål Sætrom

Keyword(s):

Supplementary Information ◽

Supplementary Data ◽

Genomic Libraries ◽

Link Type ◽

Simple Set ◽

Set Operations ◽

Wide Range ◽

Genomic Analyses ◽

Associated Data ◽

Memory Efficient

Abstract Summary Complex genomic analyses often use sequences of simple set operations like intersection, overlap, and nearest on genomic intervals. These operations, coupled with some custom programming, allow a wide range of analyses to be performed. To this end, we have written PyRanges, a data structure for representing and manipulating genomic intervals and their associated data in Python. Run single-threaded on binary set operations, PyRanges is, in median, 2.3-9.6 times faster than the popular R GenomicRanges library and is equally memory efficient; run multi-threaded on 8 cores, our library is up to 123 times faster. PyRanges is therefore ideally suited both for individual analyses and as a foundation for future genomic libraries in Python. Availability PyRanges is available open-source under the MIT license at https://github.com/biocore-NTNU/pyranges and documentation exists at https://biocore-NTNU.github.io/pyranges/ Supplementary information Supplementary data are available.
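For readers unfamiliar with the set operations mentioned above, the following plain-Python sketch shows what an intersection of two interval sets on a single chromosome computes. It deliberately does not use the PyRanges API; the function name intersect, the half-open (start, end) tuples, and the assumption that intervals within each input do not overlap one another are all choices made for this illustration.

```python
def intersect(a, b):
    """Intersect two lists of half-open (start, end) intervals.

    Assumes all intervals lie on the same chromosome and that the
    intervals within each list do not overlap one another.
    """
    a = sorted(a)
    b = sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:                 # the two current intervals overlap
            out.append((start, end))
        # Advance whichever interval ends first; it cannot overlap
        # anything further along in the other list.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

After sorting, the sweep visits each interval once, which is why such operations can be made fast on large interval sets.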


RCytoGPS: An R Package for Reading and Visualizing Cytogenetics Data

10.1101/2021.03.16.389791 ◽

2021 ◽

Author(s):

Zachary B. Abrams ◽

Dwayne G. Tally ◽

Lynne V. Abruzzo ◽

Kevin R. Coombes

Keyword(s):

Large Scale ◽

International System ◽

Genetic Data ◽

R Package ◽

Supplementary Information ◽

Supplementary Data ◽

Computational Tools ◽

Text Format ◽

Karyotype Analyses ◽

Computational Analyses

Abstract Summary Cytogenetics data, or karyotypes, are among the most common clinically used forms of genetic data. Karyotypes are stored as standardized text strings using the International System for Human Cytogenomic Nomenclature (ISCN). Historically, these data have not been used in large-scale computational analyses due to limitations in the ISCN text format and structure. Recently developed computational tools such as CytoGPS have enabled large-scale computational analyses of karyotypes. To further enable such analyses, we have now developed RCytoGPS, an R package that takes JSON files generated from CytoGPS.org and converts them into objects in R. This conversion facilitates the analysis and visualizations of karyotype data. In effect this tool streamlines the process of performing large-scale karyotype analyses, thus advancing the field of computational cytogenetic pathology. Availability and Implementation Freely available at https://CRAN.R-project.org/package=RCytoGPS Supplementary information Supplementary data are available at Bioinformatics online.


Inference of population admixture network from local gene genealogies: a coalescent-based maximum likelihood approach

Bioinformatics ◽

10.1093/bioinformatics/btaa465 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i326-i334

Author(s):

Yufeng Wu

Keyword(s):

Maximum Likelihood ◽

Population Genetic ◽

Network Inference ◽

Demographic History ◽

Genetic Data ◽

Supplementary Information ◽

Population Admixture ◽

Population Genetic Data ◽

Multispecies Coalescent ◽

Population Demographic

Abstract Motivation Population admixture is an important subject in population genetics. Inferring population demographic history with admixture under the so-called admixture network model from population genetic data is an established problem in genetics. Existing admixture network inference approaches work with single genetic polymorphisms. While these methods are usually very fast, they do not fully utilize the information [e.g. linkage disequilibrium (LD)] contained in population genetic data. Results In this article, we develop a new admixture network inference method called GTmix. Unlike existing methods, GTmix works with local gene genealogies that can be inferred from population haplotypes. Local gene genealogies represent the evolutionary history of sampled haplotypes and contain the LD information. GTmix performs coalescent-based maximum likelihood inference of admixture networks with inferred local genealogies based on the well-known multispecies coalescent (MSC) model. GTmix utilizes various techniques to speed up the likelihood computation on the MSC model and the optimal network search. Our simulations show that GTmix can infer more accurate admixture networks with much smaller data than existing methods, even when these existing methods are given much larger data. GTmix is reasonably efficient and can analyze population genetic datasets of current interest. Availability and implementation The program GTmix is available for download at: https://github.com/yufengwudcs/GTmix. Supplementary information Supplementary data are available at Bioinformatics online.


Occupancy spectrum distribution: application for coalescence simulation with generic mergers

Bioinformatics ◽

10.1093/bioinformatics/btaa090 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3279-3280

Author(s):

Arnaud Becheler ◽

L Lacey Knowles

Keyword(s):

Joint Probability ◽

Order Relation ◽

Genetic Data ◽

Supplementary Information ◽

Joint Probability Distribution ◽

Efficient Simulation ◽

Different Types ◽

Spectrum Distribution ◽

Low Probability ◽

And Performance

Abstract Motivation As the density of sampled populations increases, especially as studies incorporate aspects of the spatial landscape to study evolutionary processes, efficient simulation of genetic data under the coalescent becomes a primary challenge. Beyond the computational demands, coalescence-based simulation strategies have to be reconsidered because traditional assumptions about the dynamics of coalescing lineages within local populations may be violated (e.g. more than two daughter lineages may coalesce to a parent at low population densities). Specifically, to efficiently assign n lineages to m parents, the order relation between n and m strongly affects the relevant algorithm for the coalescent simulator (e.g. only when n<2m is it reasonable to assume that two lineages, at most, can be assigned to the same parent). Controlling the details of the simulation model as a function of n and m is then crucial to represent the assignment process accurately and efficiently, but current implementations make it difficult to switch between different types of lineage mergers at run-time or even compile-time. Results With the described occupancy spectrum and an algorithm that generates the support of the joint probability distribution of the occupancy spectrum, computation is much faster than realizing the whole assignment process under the coalescent. Using general definitions of lineage mergers, which also make the codebase reusable, we implement several variants of coalescent mergers, including an approximation where low-probability spectra are discarded. Comparisons of the runtimes and performance of the different highly reusable C++ coalescence mergers (binary, multiple, hybrids) are given, and we illustrate their potential utility with example applications. Availability and implementation All components are integrated into Quetzal, an open-source C++ library for coalescence available at https://becheler.github.io/pages/quetzal.html. Supplementary information Supplementary data are available at Bioinformatics online.
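To clarify the quantity involved, the Python sketch below realizes one random assignment of n lineages to m parents and tabulates its occupancy spectrum, taken here to mean the vector whose k-th entry counts the parents that received exactly k lineages. This is the naive, fully realized assignment that the abstract describes as the slower approach, shown only to define the object; it is not Quetzal's algorithm, and the function name occupancy_spectrum and the uniform choice of parent are assumptions made for this example.

```python
import random
from collections import Counter


def occupancy_spectrum(n, m, rng):
    """Assign n lineages to m parents uniformly at random.

    Returns a list `spectrum` of length n + 1 in which spectrum[k] is
    the number of parents that received exactly k lineages.
    """
    # Number of lineages assigned to each parent that received any.
    per_parent = Counter(rng.randrange(m) for _ in range(n))
    spectrum = [0] * (n + 1)
    for k in per_parent.values():
        spectrum[k] += 1
    spectrum[0] = m - len(per_parent)     # parents left without lineages
    return spectrum


rng = random.Random(42)                   # fixed seed for reproducibility
spectrum = occupancy_spectrum(10, 6, rng)
```

Any spectrum produced this way satisfies two bookkeeping identities: its entries sum to m, and the sum of k * spectrum[k] equals n. A merger event corresponds to any entry with k >= 2 being non-zero.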
