IBDkin: fast estimation of kinship coefficients from identity by descent segments

Ying Zhou; Sharon R Browning; Brian L Browning

doi:10.1093/bioinformatics/btaa569

Bioinformatics ◽

10.1093/bioinformatics/btaa569 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4519-4520

Author(s):

Ying Zhou ◽

Sharon R Browning ◽

Brian L Browning

Keyword(s):

Software Package ◽

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

UK Biobank ◽

Identity By Descent ◽

Fast Estimation ◽

Kinship Coefficients ◽

Related Individuals ◽

The UK

Abstract Motivation Estimation of pairwise kinship coefficients in large datasets is computationally challenging because the number of related individuals increases quadratically with sample size. Results We present IBDkin, a software package written in C for estimating kinship coefficients from identity by descent (IBD) segments. We use IBDkin to estimate kinship coefficients for 7.95 billion pairs of individuals in the UK Biobank who share at least one detected IBD segment with length ≥ 4 cM. Availability and implementation https://github.com/YingZhou001/IBDkin. Supplementary information Supplementary data are available at Bioinformatics online.
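As a rough illustration of how a kinship coefficient can be derived from IBD segments, the sketch below applies the textbook relation φ = k1/4 + k2/2, where k1 and k2 are the fractions of the genome shared IBD on one and on both haplotypes. It is not taken from IBDkin and may differ from the estimator the package implements. The `IbdSegment` type, the function name, and the assumption that the input segments do not overlap are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IbdSegment:
    """Hypothetical record for one IBD segment shared by a pair of individuals."""
    start_cm: float   # segment start, in centimorgans
    end_cm: float     # segment end, in centimorgans
    ibd_state: int    # 1 = one haplotype shared, 2 = both haplotypes shared


def kinship_from_segments(segments: Iterable[IbdSegment],
                          genome_length_cm: float) -> float:
    """Estimate kinship as k1/4 + k2/2 from non-overlapping IBD segments.

    Assumes segments do not overlap and that genome_length_cm is the total
    genetic length over which IBD was called.
    """
    if genome_length_cm <= 0:
        raise ValueError("genome_length_cm must be positive")

    ibd1_cm = 0.0
    ibd2_cm = 0.0
    for seg in segments:
        length = seg.end_cm - seg.start_cm
        if length < 0:
            raise ValueError("segment end precedes segment start")
        if seg.ibd_state == 1:
            ibd1_cm += length
        elif seg.ibd_state == 2:
            ibd2_cm += length
        else:
            raise ValueError("ibd_state must be 1 or 2")

    k1 = ibd1_cm / genome_length_cm   # fraction of genome shared IBD1
    k2 = ibd2_cm / genome_length_cm   # fraction of genome shared IBD2
    return 0.25 * k1 + 0.5 * k2
```

Under these assumptions, a pair sharing one haplotype IBD across the entire genome (as a parent and child would) gets a kinship of 0.25, and a pair sharing IBD1 over half the genome gets 0.125.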

MetumpX—a metabolomics support package for untargeted mass spectrometry

Bioinformatics ◽

10.1093/bioinformatics/btz765 ◽

2019 ◽

Vol 36 (5) ◽

pp. 1647-1648 ◽

Cited By ~ 1

Author(s):

Bilal Wajid ◽

Hasan Iqbal ◽

Momina Jamil ◽

Hafsa Rafique ◽

Faria Anwar

Keyword(s):

Mass Spectrometry ◽

Data Analysis ◽

Small Molecules ◽

Software Package ◽

Life Sciences ◽

Supplementary Information ◽

Supplementary Data ◽

Software Packages ◽

Develop Software ◽

User Friendly

Abstract Motivation Metabolomics is a data analysis and interpretation field that studies the functions of small molecules within the organism. Consequently, metabolomics requires life-science researchers to be comfortable downloading, installing and scripting software that is mostly not user friendly and lacks basic GUIs. As researchers struggle with these skills, there is a pressing need for software packages that automatically install software pipelines, shortening the time needed to build software workstations. Therefore, this paper presents MetumpX, a software package that eases the installation of 103 software tools by automatically resolving their individual dependencies while allowing users to choose which software works best for them. Results MetumpX is an Ubuntu-based software package that facilitates easy download and installation of 103 tools spread across the standard metabolomics pipeline. As far as the authors know, MetumpX is the only solution of its kind where the focus lies on automating the development of software workstations. Availability and implementation https://github.com/hasaniqbal777/MetumpX-bin. Supplementary information Supplementary data are available at Bioinformatics online.

SOAPMetaS: profiling large metagenome datasets efficiently on distributed clusters

Bioinformatics ◽

10.1093/bioinformatics/btaa697 ◽

2020 ◽

Author(s):

Shixu He ◽

Zhibo Huang ◽

Xiaohan Wang ◽

Lin Fang ◽

Shengkang Li ◽

...

Keyword(s):

Big Data ◽

Large Volume ◽

Machine Tools ◽

High Performance ◽

Marker Gene ◽

Source Code ◽

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Multiple Sample

Abstract Summary The rapid increase in data size in metagenome research has raised the demand for new tools to process large datasets efficiently. To accelerate metagenome profiling in big-data scenarios, we developed SOAPMetaS, a marker gene-based multiple-sample metagenome profiling tool built on Apache Spark. SOAPMetaS demonstrates high performance and scalability when processing large datasets. It can process 80 samples of FASTQ data, totalling 416 GiB, in around half an hour, and the accuracy of its species profiling results is similar to that of MetaPhlAn2. SOAPMetaS can deal with a large volume of metagenome data more efficiently than commonly used single-machine tools. Availability and implementation Source code is implemented in Java and freely available at https://github.com/BGI-flexlab/SOAPMetaS. Supplementary information Supplementary data are available at Bioinformatics online.

Crosslink: A fast, scriptable genetic mapper for outcrossing species

10.1101/135277 ◽

2017 ◽

Cited By ~ 6

Author(s):

Robert J. Vickerstaff ◽

Richard J. Harrison

Keyword(s):

Large Datasets ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Mapping Software ◽

Outcrossing Species ◽

Supplementary Material ◽

Novel Approaches ◽

Similar Accuracy ◽

General Public License

Abstract Summary Crosslink is genetic mapping software for outcrossing species designed to run efficiently on large datasets by combining the best from existing tools with novel approaches. Tests show it runs much faster than several comparable programs whilst retaining a similar accuracy. Availability and implementation Available under the GNU General Public License version 2 from https://github.com/eastmallingresearch/[email protected] Supplementary information Supplementary data are available at Bioinformatics online and from https://github.com/eastmallingresearch/crosslink/releases/tag/v0.5.

Phylonium: fast estimation of evolutionary distances from large samples of similar genomes

Bioinformatics ◽

10.1093/bioinformatics/btz903 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2040-2046 ◽

Cited By ~ 2

Author(s):

Fabian Klötzl ◽

Bernhard Haubold

Keyword(s):

Disease Outbreaks ◽

Supplementary Information ◽

Whole Genome ◽

Command Line ◽

Supplementary Data ◽

Large Samples ◽

Fast Estimation ◽

Unix Command ◽

Similar Accuracy ◽

Single Sequence

Abstract Motivation Tracking disease outbreaks by whole-genome sequencing leads to the collection of large samples of closely related sequences. Five years ago, we published a method to accurately compute all pairwise distances for such samples by indexing each sequence. Since indexing is slow, we now ask whether it is possible to achieve similar accuracy when indexing only a single sequence. Results We have implemented this idea in the program phylonium and show that it is as accurate as its predecessor and roughly 100 times faster when applied to all 2678 Escherichia coli genomes contained in ENSEMBL. One of the best published programs for rapidly computing pairwise distances, mash, analyzes the same dataset four times faster but, with default settings, it is less accurate than phylonium. Availability and implementation Phylonium runs under the UNIX command line; its C++ sources and documentation are available from github.com/evolbioinf/phylonium. Supplementary information Supplementary data are available at Bioinformatics online.

VCFShark: how to squeeze a VCF file

Bioinformatics ◽

10.1093/bioinformatics/btab211 ◽

2021 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Danek ◽

Marek Kokot

Keyword(s):

Large Datasets ◽

Main Memory ◽

Supplementary Information ◽

Genotype Data ◽

Supplementary Data ◽

Variant Call Format ◽

Variant Call ◽

Order Of Magnitude ◽

Better Than ◽

De Facto Standards

Abstract Summary Variant Call Format (VCF) files with results of sequencing projects take a lot of space. We propose VCFShark, which compresses VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements below 30 GB allow the tool to be used on typical workstations even for large datasets. Availability and implementation https://github.com/refresh-bio/vcfshark. Supplementary information Supplementary data are available at Bioinformatics online.

STAR Chimeric Post For Rapid Detection of Circular RNA and Fusion Transcripts

10.1101/139808 ◽

2017 ◽

Author(s):

Nicholas K. Akers ◽

Eric E. Schadt ◽

Bojan Losic

Keyword(s):

False Positive ◽

Rapid Detection ◽

Software Package ◽

Circular RNA ◽

Large Datasets ◽

Supplementary Information ◽

High Dimensional ◽

RNA Detection ◽

Chimeric RNA ◽

Detection And Quantification

Abstract Motivation The biological relevance of chimeric RNA alignments is now well established. Chimeras arising as chromosomal fusions are often drivers of cancer, and recently discovered circular RNA are only now being characterized. While software already exists for fusion discovery and quantitation, high false positive rates and high run-times hamper scalable fusion discovery on large datasets. Furthermore, very little software is available for circular RNA detection and quantification. Results Here we present STAR Chimeric Post (STARChip), a novel software package that processes chimeric alignments from the STAR aligner and produces annotated circular RNA and high precision fusions in a rapid, efficient, and scalable manner that is appropriate for high dimensional medical omics datasets. Availability and Implementation STARChip is available at https://github.com/LosicLab/[email protected] or [email protected] Supplementary Information Supplementary figures and tables are available online.

capC-MAP: software for analysis of Capture-C data

Bioinformatics ◽

10.1093/bioinformatics/btz480 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4773-4775 ◽

Cited By ~ 1

Author(s):

Adam Buckle ◽

Nick Gilbert ◽

Davide Marenduzzo ◽

Chris A Brackley

Keyword(s):

Software Package ◽

Experimental Methods ◽

Ease Of Use ◽

Supplementary Information ◽

Command Line ◽

Supplementary Data ◽

Chromosome Conformation ◽

Chromatin Interactions ◽

Genome Wide ◽

Genomic Locations

Abstract Summary Capture-C is a member of the chromosome-conformation-capture family of experimental methods which probes the 3D organization of chromosomes within the cell nucleus. It provides high-resolution information on the genome-wide chromatin interactions from a set of ‘target’ genomic locations, and is growing in popularity as a tool for improving our understanding of cis-regulation and gene function. Yet, analysis of the data is complicated, and to date there has been no dedicated or easy-to-use software to automate the process. We present capC-MAP, a software package for the analysis of Capture-C data. Availability and implementation Implemented with both ease of use and flexibility in mind, capC-MAP is a suite of programs written in C++ and Python, where each program can be run separately, or an entire analysis can be performed with a single command line. It is available under an open-source licence at https://github.com/cbrackley/capC-MAP, as well as via the conda package manager, and should run on any standard Unix-style system. Supplementary information Supplementary data are available at Bioinformatics online.

Efficient haplotype matching between a query and a panel for genealogical search

Bioinformatics ◽

10.1093/bioinformatics/btz347 ◽

2019 ◽

Vol 35 (14) ◽

pp. i233-i241 ◽

Cited By ~ 5

Author(s):

Ardalan Naseri ◽

Erwin Holzhauser ◽

Degui Zhi ◽

Shaojie Zhang

Keyword(s):

Additional Data ◽

Simulated Data ◽

Supplementary Information ◽

Large Panel ◽

Wide Availability ◽

Speed Up ◽

On Line ◽

Related Individuals ◽

The UK ◽

Burrows Wheeler Transform

Abstract Motivation With the wide availability of whole-genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a pre-computed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple ‘good enough’ matches are desired. Results In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5 that can find all L-long matches, that is, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce ‘virtual insertion’ of the query into the PBWT matrix of the panel, and then scan up and down for the PBWT match blocks with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using simulated data and the UK Biobank data. Our results show that the proposed algorithms can efficiently detect related individuals for a given query in very large cohorts, which enables a fast on-line query search. Availability and implementation genome.ucf.edu/pbwt-query Supplementary information Supplementary data are available at Bioinformatics online.
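To make the notion of an L-long match concrete, the following sketch enumerates every maximal run of identical alleles of at least a given length between a query haplotype and each panel haplotype. It is a deliberately naive brute-force baseline that scans every site of every haplotype; it is not PBWT-Query or L-PBWT-Query and uses no PBWT index. The function name, the string encoding of haplotypes, and the half-open `(start, end)` site convention are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def l_long_matches(query: str,
                   panel: Sequence[str],
                   min_len: int) -> List[Tuple[int, int, int]]:
    """Return (panel_index, start, end) for each maximal matching run
    with end - start >= min_len, where [start, end) is a half-open
    interval of site indices.

    Brute force: O(number of haplotypes * number of sites).
    """
    n_sites = len(query)
    matches = []
    for idx, hap in enumerate(panel):
        if len(hap) != n_sites:
            raise ValueError("all haplotypes must have the same number of sites")
        run_start = None
        # Iterate one position past the end so a run reaching the last
        # site is closed and reported like any other run.
        for pos in range(n_sites + 1):
            same = pos < n_sites and hap[pos] == query[pos]
            if same:
                if run_start is None:
                    run_start = pos
            else:
                if run_start is not None and pos - run_start >= min_len:
                    matches.append((idx, run_start, pos))
                run_start = None
    return matches
```

A baseline like this is mainly useful for checking the output of an indexed method on small inputs, since its cost grows with the full size of the panel for every query.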

Parent-of-origin effects in the UK Biobank

10.21203/rs.3.rs-1073842/v1 ◽

2021 ◽

Author(s):

Olivier Delaneau ◽

Robin Hofmeister ◽

Simone Rubinacci ◽

Diogo Ribeiro ◽

Zoltan Kutalik ◽

...

Keyword(s):

Platelet Count ◽

Molecular Mechanisms ◽

Probabilistic Approach ◽

UK Biobank ◽

Chromosome X ◽

Identity By Descent ◽

Large Dataset ◽

Genome Wide ◽

Parent Of Origin ◽

The UK

Abstract Identical genetic variations can have different phenotypic effects depending on their parent of origin (PofO). Yet, studies focussing on PofO effects have been largely limited in terms of sample size due to the need of parental genomes or known genealogies. Here, we used a novel probabilistic approach to infer PofO of individual alleles in the UK Biobank that does not require parental genomes nor prior knowledge of genealogy. Our model uses Identity-By-Descent (IBD) sharing with second- and third-degree relatives to assign alleles to parental groups and leverages chromosome X data in males to distinguish maternal from paternal groups. When combined with robust haplotype inference and haploid imputation, this allowed us to infer the PofO at 5.4 million variants genome-wide for 26,393 UK Biobank individuals. We used this large dataset to systematically screen 59 biomarkers and 38 anthropomorphic phenotypes for PofO effects and discovered 101 significant associations, demonstrating that this type of effects is widespread. Notably, we retrieved well known PofO effects, such as the MEG3/DLK1 locus on platelet count, and we discovered many new ones often at loci outside currently known imprinted regions and previously thought to harbour additive associations, implying that the underlying molecular mechanisms may be more complex than expected.

Fast and accurate long-range phasing in a UK Biobank cohort

10.1101/028282 ◽

2015 ◽

Cited By ~ 5

Author(s):

Po-Ru Loh ◽

Pier Francesco Palamara ◽

Alkes L Price

Keyword(s):

Long Range ◽

Error Rate ◽

Rare Variants ◽

Imputation Accuracy ◽

Computational Cost ◽

UK Biobank ◽

Identical By Descent ◽

Icelandic Population ◽

Related Individuals ◽

The UK

Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here, we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N=150K samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1-2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈0.3%, corresponding to perfect phase in most 10Mb segments). We also observed that when used within an imputation pipeline, Eagle pre-phasing improved downstream imputation accuracy compared to pre-phasing in batches using existing methods (as necessary to achieve comparable computational cost).
