A statistical nonparametric method for identifying consistently important features across samples

Mapping Intimacies ◽

10.1101/833624 ◽

2019 ◽

Author(s):

Natalie Sauerwald ◽

Carl Kingsford

Keyword(s):

Human Body ◽

Simulated Data ◽

Housekeeping Genes ◽

Housekeeping Gene ◽

False Positives ◽

Supplementary Information ◽

Sequencing Data ◽

Data Types ◽

Link Type ◽

Body Map

Abstract: In many applications, a consistently high measurement across many samples can indicate particularly meaningful or useful information for quality control or biological interpretation. Identification of these strong features among many others can be challenging, especially when the samples cannot be expected to have the same distribution or range of values. We present a general method called conserved feature discovery (CFD) for identifying features with consistently strong signals across multiple conditions or samples. Given any real-valued data, CFD requires no parameters, makes no assumptions on the shape of the underlying sample distributions, and is robust to differences across these distributions. We show that with high probability CFD identifies all true positives and no false positives under certain assumptions on the median and variance distributions of the feature measurements. Using simulated data, we show that CFD is tolerant to a small percentage of poor-quality samples and robust to false positives. Applying CFD to RNA sequencing data from the Human Body Map project and GTEx, we identify housekeeping genes as highly expressed genes across tissue types and compare to housekeeping gene lists from previous methods. CFD is consistent between the Human Body Map and GTEx data sets, and identifies lists of genes enriched for basic cellular processes as expected. The framework can be easily adapted for many data types and desired feature properties. Availability: Code for CFD and scripts to reproduce the figures and analysis in this work are available at https://github.com/Kingsford-Group/cfd. Supplementary information: Supplementary data are available at https://github.com/Kingsford-Group/cfd.

hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

Abstract Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.
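As a rough illustration of what a geneset enrichment test computes, the sketch below implements a generic hypergeometric over-representation test in plain Python. It is not hypeR's API (hypeR is an R package), and the function name, arguments, and the assumption that both gene lists are subsets of a background of known size are all hypothetical choices made for this example.

```python
from math import comb


def hypergeom_enrichment_pvalue(signature, geneset, background_size):
    """Return P(overlap >= observed) under the hypergeometric null.

    Assumes (for this sketch) that `signature` and `geneset` are both
    drawn from a background of `background_size` genes.
    """
    signature, geneset = set(signature), set(geneset)
    k = len(signature & geneset)   # observed overlap
    n = len(signature)             # genes in the signature
    K = len(geneset)               # genes in the geneset
    N = background_size            # genes in the background

    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / comb(N, n)


if __name__ == "__main__":
    p = hypergeom_enrichment_pvalue({"a", "b", "c"}, {"a", "b", "x", "y"}, 10)
    print(f"enrichment p-value: {p:.4f}")
```

A small p-value indicates that the signature overlaps the geneset more than expected by chance. When many genesets are tested, the p-values would normally be corrected for multiple testing, which this sketch leaves out.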

Whisper: Read sorting allows robust mapping of sequencing data

10.1101/240358 ◽

2017 ◽

Author(s):

Sebastian Deorowicz ◽

Agnieszka Debudaj-Grabysz ◽

Adam Gudyś ◽

Szymon Grabowski

Keyword(s):

Reference Genome ◽

Variant Calling ◽

Real Data ◽

Supplementary Information ◽

Sequencing Data ◽

Suffix Arrays ◽

Link Type ◽

Mapping Tool ◽

Reverse Complement ◽

Comparable Accuracy

Abstract Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. Mistakes made at this computationally challenging stage cannot be recovered easily. Results: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known Bowtie2 and BWA-MEM tools at a comparable accuracy (validated in a variant calling pipeline). Availability: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. Supplementary information: Supplementary data are available at the publisher's website.
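The idea of sorting reads and looking them up in suffix arrays of the reference and its reverse complement can be sketched in a few lines of Python. This is only a toy illustration under strong assumptions: it finds exact matches only, builds the suffix arrays naively, and uses hypothetical function names. It is not Whisper's implementation, which handles mismatches, indels, parallelism, and on-disk storage.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for exact occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def map_reads(reference, reads):
    """Map sorted, de-duplicated reads against both strands."""
    rc_reference = reverse_complement(reference)
    indexes = {
        "+": (reference, build_suffix_array(reference)),
        "-": (rc_reference, build_suffix_array(rc_reference)),
    }
    hits = {}
    # Sorted reads query neighbouring regions of the suffix array in turn.
    for read in sorted(set(reads)):
        hits[read] = [
            (strand, pos)
            for strand, (text, sa) in indexes.items()
            for pos in find_occurrences(text, sa, read)
        ]
    return hits


if __name__ == "__main__":
    print(map_reads("ACGTACGTTT", ["CGTA", "AAAC"]))
```

Positions reported for the "-" strand are offsets into the reverse-complemented reference string, not forward-strand coordinates.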

re-Searcher: GUI-based bioinformatics tool for simplified genomics data mining of VCF files

PeerJ ◽

10.7717/peerj.11333 ◽

2021 ◽

Vol 9 ◽

pp. e11333

Author(s):

Daniyar Karabayev ◽

Askhat Molkenov ◽

Kaiyrgali Yerulanuly ◽

Ilyas Kabimoldayev ◽

Asset Daniyarov ◽

...

Keyword(s):

Web Application ◽

High Throughput Sequencing ◽

Sequencing Data ◽

Data Types ◽

Standard Format ◽

Standard Data ◽

Additional Information ◽

Link Type ◽

Sequencing Platforms ◽

User Friendly

Background High-throughput sequencing platforms generate a massive amount of high-dimensional genomic datasets that are available for analysis. Modern and user-friendly bioinformatics tools for analysis and interpretation of genomics data have become essential for the analysis of sequencing data. Different standard data types and file formats have been developed to store and analyze sequence and genomics data. Variant Call Format (VCF) is the most widespread genomics file type and standard format containing genomic information and variants of sequenced samples. Results Existing tools for processing VCF files don’t usually have an intuitive graphical interface, but instead have just a command-line interface that may be challenging to use for the broader biomedical community interested in genomics data analysis. re-Searcher solves this problem by pre-processing VCF files in chunks to avoid overloading the computer's RAM. The tool can be used as a standalone, user-friendly, multiplatform GUI application as well as a web application (https://nla-lbsb.nu.edu.kz). The software, including source code as well as tested VCF files and additional information, is publicly available in the GitHub repository (https://github.com/LabBandSB/re-Searcher).

sangeranalyseR: simple and interactive analysis of Sanger sequencing data in R

10.1101/2020.05.18.102459 ◽

2020 ◽

Author(s):

Kuan-Hao Chao ◽

Kirston Barton ◽

Sarah Palmer ◽

Robert Lanfear

Keyword(s):

Sanger Sequencing ◽

Reference Sequence ◽

Supplementary Information ◽

File Format ◽

Bioconductor Package ◽

Sequencing Data ◽

Interactive Analysis ◽

Link Type ◽

Online Documentation ◽

Wide Range

Abstract Summary: sangeranalyseR is an interactive R/Bioconductor package and two associated Shiny applications designed for analysing Sanger sequencing data from the ABIF file format in R. It allows users to go from loading reads to saving aligned contigs in a few lines of R code. sangeranalyseR provides a wide range of options for a number of commonly-performed actions including read trimming, detecting secondary peaks, viewing chromatograms, and detecting indels using a reference sequence. All parameters can be adjusted interactively either in R or in the associated Shiny applications. sangeranalyseR comes with extensive online documentation, and outputs detailed interactive HTML reports. Availability and implementation: sangeranalyseR is implemented in R and released under an MIT license. It is available for all platforms on Bioconductor (https://bioconductor.org/packages/sangeranalyseR) and on GitHub (https://github.com/roblanf/sangeranalyseR). Supplementary information: Documentation at https://sangeranalyser.readthedocs.io/.

NanoPack: visualizing and processing long read sequencing data

10.1101/237180 ◽

2017 ◽

Cited By ~ 2

Author(s):

Wouter De Coster ◽

Svenn D’Hert ◽

Darrin T. Schultz ◽

Marc Cruts ◽

Christine Van Broeckhoven

Keyword(s):

Web Service ◽

Graphical User Interface ◽

Source Code ◽

Supplementary Information ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Oxford Nanopore ◽

Long Read ◽

Oxford Nanopore Technologies

Abstract Summary: Here we describe NanoPack, a set of tools developed for visualization and processing of long read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and Implementation: The NanoPack tools are written in Python3 and released under the GNU GPL3.0 Licence. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Supplementary information: Supplementary tables and figures are available at Bioinformatics online.

Bivartect: accurate and memory-saving breakpoint detection by direct read comparison

Bioinformatics ◽

10.1093/bioinformatics/btaa059 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2725-2730

Author(s):

Keisuke Shimmura ◽

Yuki Kato ◽

Yukio Kawahara

Keyword(s):

Genome Editing ◽

High Throughput Sequencing ◽

Variant Calling ◽

Simulated Data ◽

Supplementary Information ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Single Node ◽

Single Nucleotide ◽

Target Sites

Abstract Motivation Genetic variant calling with high-throughput sequencing data has been recognized as a useful tool for better understanding of disease mechanisms and detection of potential off-target sites in genome editing. Since most variant calling algorithms rely on initial mapping onto a reference genome and tend to predict many variant candidates, variant calling remains challenging in terms of predicting variants with few false positives. Results Here we present Bivartect, a simple yet versatile variant caller based on direct comparison of short sequence reads between normal and mutated samples. Bivartect can detect not only single nucleotide variants but also insertions/deletions, inversions and their complexes. Bivartect achieves high predictive performance with an elaborate memory-saving mechanism, which allows Bivartect to run on a computer with a single node for analyzing small omics data. Tests with simulated benchmark and real genome-editing data indicate that Bivartect was comparable to state-of-the-art variant callers in positive predictive value for detection of single nucleotide variants, even though it yielded a substantially smaller number of candidates. These results suggest that Bivartect, a reference-free approach, will contribute to the identification of germline mutations as well as off-target sites introduced during genome editing with high accuracy. Availability and implementation Bivartect is implemented in C++ and available along with in silico simulated data at https://github.com/ykat0/bivartect. Supplementary information Supplementary data are available at Bioinformatics online.

Identification and Testing of Reference Genes for qRT-PCR Analysis During Pear Fruit Development

10.21203/rs.3.rs-910473/v1 ◽

2021 ◽

Author(s):

Guoming Wang ◽

Zhihua Guo ◽

Xueping Wang ◽

Hongru Gao ◽

Kaijie Qi ◽

...

Keyword(s):

Fruit Development ◽

Reference Genes ◽

Housekeeping Genes ◽

Housekeeping Gene ◽

Pcr Analysis ◽

Quantitative Real Time Pcr ◽

Sequencing Data ◽

Pear Fruit ◽

Qrt Pcr ◽

Fruit Development And Ripening

Abstract Background: Quantitative real-time PCR (qRT-PCR) is currently one of the most reliable and improved tools for analyzing gene expression. Various studies have shown that the expression of housekeeping genes varies with cultivar, tissue and treatment. Reliable and stable reference genes therefore need to be identified and evaluated according to different experimental requirements. Result: In this study, 10 candidate reference genes were initially screened based on the transcriptome sequencing data of four pear fruit development stages of three different pear cultivars, including a candidate housekeeping gene PbrTUB. Furthermore, we ranked the expression stability of the 10 candidate reference genes using the algorithms GeNorm, NormFinder, BestKeeper and RefFinder. Finally, the results showed that Pbr028511, Pbr038418 and Pbr041114 were the most stable reference genes in Cuiguan, Housui and Xueqing fruit, respectively. Conclusion: These results provide a valuable resource that serves as a significant reference for gene function explorations and molecular mechanism studies in fruit development and ripening of different pear cultivars.

Inference of genome 3D architecture by modeling overdispersion of Hi-C data

10.1101/2021.02.04.429864 ◽

2021 ◽

Author(s):

Nelle Varoquaux ◽

William S. Noble ◽

Jean-Philippe Vert

Keyword(s):

3D Model ◽

Negative Binomial ◽

Simulated Data ◽

Poisson Model ◽

Random Variable ◽

Dispersion Parameter ◽

Supplementary Information ◽

Data Sets ◽

Link Type ◽

The Mean

We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data. We first confirm the presence of overdispersion in several real Hi-C data sets, and we show that the overdispersion arises even in simulated data sets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms: three MDS-based methods (ShRec3D, ChromSDE, and Pastis-MDS) and a statistical method based on a Poisson model of the data (Pastis-PM). We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions. A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license. Supplementary information is available at https://nellev.github.io/pastisnb/
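The variance argument can be made concrete with one common parametrization of the two count models. The symbols $c_{ij}$ (contact count between loci $i$ and $j$), $\mu_{ij}$ (its mean) and $r$ (dispersion) are notation assumed here for illustration; the abstract does not state which parametrization Pastis-NB uses.

```latex
\begin{align*}
  \text{Poisson:} \quad
    & \mathbb{E}[c_{ij}] = \mu_{ij}, \qquad
      \operatorname{Var}(c_{ij}) = \mu_{ij} \\
  \text{Negative binomial:} \quad
    & \mathbb{E}[c_{ij}] = \mu_{ij}, \qquad
      \operatorname{Var}(c_{ij}) = \mu_{ij} + \frac{\mu_{ij}^{2}}{r},
      \quad r > 0
\end{align*}
```

Under this parametrization the negative binomial variance always exceeds the mean, and letting $r \to \infty$ recovers the Poisson case, which is why a separate dispersion parameter can absorb overdispersion that a Poisson model cannot.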

Detecting evolutionary patterns of cancers using consensus trees

Bioinformatics ◽

10.1093/bioinformatics/btaa801 ◽

2020 ◽

Vol 36 (Supplement_2) ◽

pp. i684-i691

Author(s):

Sarah Christensen ◽

Juho Kim ◽

Nicholas Chia ◽

Oluwasanmi Koyejo ◽

Mohammed El-Kebir

Keyword(s):

Evolutionary Process ◽

Simulated Data ◽

Therapy Response ◽

Supplementary Information ◽

Driver Mutations ◽

Sequencing Data ◽

Large Solution ◽

Evolutionary Patterns ◽

Cancer Subtypes ◽

Evolutionary Trajectories

Abstract Motivation While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. Results To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate the Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. Availability and implementation https://github.com/elkebir-group/RECAP. Supplementary information Supplementary data are available at Bioinformatics online.

LAmbDA: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection

Bioinformatics ◽

10.1093/bioinformatics/btz295 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4696-4706 ◽

Cited By ~ 6

Author(s):

Travis S Johnson ◽

Tongxin Wang ◽

Zhi Huang ◽

Christina Y Yu ◽

Yi Wu ◽

...

Keyword(s):

Domain Adaptation ◽

Simulated Data ◽

Cell Types ◽

Supplementary Information ◽

Batch Effects ◽

Data Types ◽

Single Model ◽

Model Type ◽

Multiple Datasets ◽

Systematic Biases

Abstract Motivation Rapid advances in single cell RNA sequencing (scRNA-seq) have produced higher-resolution cellular subtypes in multiple tissues and species. Methods are increasingly needed across datasets and species to (i) remove systematic biases, (ii) model multiple datasets with ambiguous labels and (iii) classify cells and map cell type labels. However, most methods only address one of these problems on broad cell types or simulated data using a single model type. It is also important to address higher-resolution cellular subtypes, subtype labels from multiple datasets, models trained on multiple datasets simultaneously and generalizability beyond a single model type. Results We developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets (even from different species) and applied our framework on simulated, pancreas and brain scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent cell subtype labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy simulated 1 datasets: 90%; simulated 2 datasets: 94%; pancreas datasets: 88% and brain datasets: 66%) using LAmbDA Feedforward 1 Layer Neural Network with bagging. This method achieved higher weighted accuracy in labeling cellular subtypes than two other state-of-the-art methods, scmap and CaSTLe in brain (66% versus 60% and 32%). Furthermore, it achieved better performance in correctly predicting ambiguous cellular subtype labels across datasets in 88% of test cases compared with CaSTLe (63%), scmap (50%) and MetaNeighbor (50%). LAmbDA is model- and dataset-independent and generalizable to diverse data types representing an advance in biocomputing. Availability and implementation github.com/tsteelejohnson91/LAmbDA Supplementary information Supplementary data are available at Bioinformatics online.
