blupADC: An R package and shiny toolkit for comprehensive genetic data analysis in animal and plant breeding

Abstract. Summary: Genetic analysis is a systematic and complex procedure in animal and plant breeding. With the rapid development of high-throughput genotyping techniques and algorithms, animal and plant breeding has entered the genomic era. However, software that can carry out comprehensive genetic analyses in routine animal and plant breeding programs is lacking. To make the whole genetic analysis workflow in animal and plant breeding straightforward, we developed a powerful, robust and fast R package that includes genomic data format conversion, genomic data quality control and genotype imputation, breed composition analysis, pedigree tracing, analysis and visualization, pedigree-based and genomic-based relationship matrix construction, and genomic evaluation. In addition, to simplify the use of this package, we also developed a shiny toolkit for users. Availability and implementation: blupADC is developed primarily in R with core functions written in C++. The development version is maintained at https://github.com/TXiang-lab/blupADC. Supplementary information: Supplementary data are available online.

hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽ 2019 ◽ Cited By ~ 1

Author(s): Anthony Federico ◽ Stefano Monti

Keyword(s): High Throughput Sequencing ◽ R Package ◽ Supplementary Information ◽ Sequencing Data ◽ Wide Audience ◽ Popular Method ◽ Link Type ◽ High Throughput Sequencing Data ◽ One Stop ◽ Recent Version

Abstract. Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution for performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.
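As an illustration of what a geneset enrichment test computes, the following minimal Python sketch implements a one-sided hypergeometric over-representation test, one common enrichment method. It is not hypeR's code or API: the function name `hypergeom_enrichment_p`, the toy gene symbols, and the background size of 100 genes are assumptions made for this example only.

```python
from math import comb


def hypergeom_enrichment_p(overlap, signature_size, geneset_size, background_size):
    """Upper-tail probability P(X >= overlap) under the hypergeometric null.

    X is the number of geneset members found when `signature_size` genes are
    drawn without replacement from a background of `background_size` genes,
    of which `geneset_size` belong to the geneset.
    """
    max_overlap = min(signature_size, geneset_size)
    total = comb(background_size, signature_size)
    tail = sum(
        comb(geneset_size, k)
        * comb(background_size - geneset_size, signature_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy inputs (hypothetical): a signature of 5 genes tested against one geneset.
signature = {"TP53", "BRCA1", "MYC", "EGFR", "KRAS"}
geneset = {"TP53", "BRCA1", "ATM", "CHEK2"}
background_size = 100  # assumed number of genes that could have been detected

overlap = len(signature & geneset)
p_value = hypergeom_enrichment_p(
    overlap, len(signature), len(geneset), background_size
)
print(f"overlap={overlap}, p-value={p_value:.4g}")
```

In a workflow with many signatures and genesets, the same test would be repeated for each pair and the resulting p-values corrected for multiple testing.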

gwasurvivr: an R package for genome wide survival analysis

10.1101/326033 ◽ 2018

Author(s): Abbas A Rizvi ◽ Ezgi Karaesmen ◽ Martin Morgan ◽ Leah Preus ◽ Junke Wang ◽ ...

Keyword(s): Survival Analysis ◽ Cox Model ◽ R Package ◽ Supplementary Information ◽ Parameter Estimates ◽ Survival Analyses ◽ Link Type ◽ Genome Wide ◽ Size Number ◽ Simple Interface

Abstract. Summary: To address the limited software options for performing survival analyses with millions of SNPs, we developed gwasurvivr, an R/Bioconductor package with a simple interface for conducting genome-wide survival analyses using VCF (output from the Michigan or Sanger imputation servers), IMPUTE2 or PLINK files. To decrease the number of iterations needed for convergence when optimizing the parameter estimates in the Cox model, we modified the R package survival: the covariates in the model are first fit without the SNP, and those parameter estimates are used as initial points. We benchmarked gwasurvivr against other software capable of conducting genome-wide survival analysis (genipe, SurvivalGWAS_SV, and GWASTools). gwasurvivr is significantly faster and shows better scalability as sample size, number of SNPs and number of covariates increase. Availability and implementation: gwasurvivr, including source code, documentation, and a vignette, is available at http://bioconductor.org/packages/gwasurvivr. Contact: Abbas Rizvi, [email protected]; Lara E Sucheston-Campbell, [email protected]. Supplementary information: Supplementary data are available at https://github.com/suchestoncampbelllab/gwasurvivr_manuscript
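The warm-start idea described in this abstract can be sketched in a few lines of Python. This is not gwasurvivr's implementation (which modifies the R package survival); it is a minimal Newton-Raphson fit of the Cox partial likelihood on simulated data, assuming no tied event times. The function name `fit_cox`, the simulated effect sizes, and the censoring scheme are all assumptions made for illustration.

```python
import numpy as np


def fit_cox(X, time, event, beta_init, tol=1e-8, max_iter=50):
    """Newton-Raphson for the Cox partial likelihood (assumes no tied times).

    Returns the coefficient estimates and the number of iterations used.
    """
    order = np.argsort(-time)  # descending time: risk set of row i is rows 0..i
    X = X[order]
    event = event[order].astype(float)
    beta = np.asarray(beta_init, dtype=float).copy()
    for iteration in range(1, max_iter + 1):
        risk = np.exp(X @ beta)
        s0 = np.cumsum(risk)
        s1 = np.cumsum(risk[:, None] * X, axis=0)
        s2 = np.cumsum(risk[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
        mean = s1 / s0[:, None]
        # Score vector: sum over events of (x_i - weighted mean over the risk set).
        grad = (event[:, None] * (X - mean)).sum(axis=0)
        # Information matrix: sum over events of the weighted covariance.
        cov = s2 / s0[:, None, None] - mean[:, :, None] * mean[:, None, :]
        info = (event[:, None, None] * cov).sum(axis=0)
        step = np.linalg.solve(info, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            return beta, iteration
    return beta, max_iter


# Simulated data (hypothetical): two covariates plus one SNP coded 0/1/2.
rng = np.random.default_rng(0)
n = 500
covariates = rng.normal(size=(n, 2))
snp = rng.binomial(2, 0.3, size=n).astype(float)
X_full = np.column_stack([covariates, snp])
true_beta = np.array([0.8, -0.5, 0.2])
event_time = rng.exponential(1.0 / np.exp(X_full @ true_beta))
censor_time = rng.exponential(2.0, size=n)
time = np.minimum(event_time, censor_time)
event = event_time <= censor_time

# Step 1: fit the covariates once, without the SNP.
beta_cov, _ = fit_cox(covariates, time, event, np.zeros(2))

# Step 2: fit covariates + SNP, starting from the covariate-only estimates.
beta_warm, iters_warm = fit_cox(X_full, time, event, np.append(beta_cov, 0.0))

# For comparison: the same model started from all zeros.
beta_cold, iters_cold = fit_cox(X_full, time, event, np.zeros(3))
print("warm-start iterations:", iters_warm)
print("cold-start iterations:", iters_cold)
```

Both starts reach the same estimates. Because the covariate-only fit is shared across all SNPs, it only needs to be computed once, and each per-SNP fit then typically starts close to its optimum.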

dbgap2x: An R package to explore and extract data from the database of Genotypes and Phenotypes (dbGaP)

Bioinformatics ◽ 10.1093/bioinformatics/btz680 ◽ 2019 ◽ Cited By ~ 1

Author(s): Grégoire Versmée ◽ Laura Versmée ◽ Mikaël Dusenne ◽ Niloofar Jalali ◽ Paul Avillach

Keyword(s): Data Sharing ◽ Large Scale ◽ Genomic Data ◽ R Package ◽ National Institutes Of Health ◽ Supplementary Information ◽ Supplementary Data ◽ Complex Procedure ◽ Range Of Functions ◽ The Relationship

Abstract. Summary: Based on the Genomic Data Sharing Policy issued in August 2007, the National Institutes of Health (NIH) has supported several repositories such as the database of Genotypes and Phenotypes (dbGaP). dbGaP is an online repository that provides access to large-scale genetic and phenotypic datasets from more than 1,000 studies. However, navigating the website and understanding the relationships between the studies are not easy tasks. Moreover, decrypting the files is a complex procedure. In this study we propose the dbgap2x R package, which covers a broad range of functions for searching dbGaP studies, exploring the characteristics of a study and easily decrypting the files from dbGaP. Availability and implementation: dbgap2x is an R package with the code available at https://github.com/gversmee/dbgap2x. A containerized version including the package, a Jupyter server, and an example notebook is available at https://hub.docker.com/r/gversmee/dbgap2x. Supplementary information: Supplementary data are available at Bioinformatics online.

Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr

10.1101/190926 ◽ 2017 ◽ Cited By ~ 1

Author(s): Florian Privé ◽ Hugues Aschard ◽ Michael G.B. Blum

Keyword(s): Data Analysis ◽ Large Scale ◽ Genomic Data ◽ Supplementary Information ◽ Risk Scores ◽ Analysis Pipeline ◽ Polygenic Risk ◽ Link Type ◽ Genome Wide ◽ R Packages

Abstract. Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge that severely slows down genomic analyses. Specialized software has been developed for every part of the analysis pipeline to handle large genomic data. However, combining all of these tools into a single data analysis pipeline can be technically difficult. Results: Here we present two R packages, bigstatsr and bigsnpr, that allow management and analysis of large-scale genomic data to be performed within a single comprehensive framework. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementations of existing methods. In particular, the packages implement a fast derivation of Principal Component Analysis, functions to remove SNPs in Linkage Disequilibrium, and algorithms to learn Polygenic Risk Scores on millions of SNPs. We illustrate applications of the two R packages by analysing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. Availability: https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/ Contact: [email protected] & [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
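The memory-mapping approach mentioned in this abstract can be illustrated with a short Python sketch. It does not use bigstatsr or bigsnpr; it relies on NumPy's `numpy.memmap` to keep a genotype matrix on disk and process it in column blocks. The matrix dimensions, block size, file name, and random genotypes are all assumptions made for this example.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes; real datasets would be far larger than available RAM.
n_individuals, n_variants, block_size = 200, 1000, 128
shape = (n_individuals, n_variants)
rng = np.random.default_rng(42)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "genotypes.bin")

    # Write a genotype matrix (allele counts 0/1/2) to a file on disk.
    geno = np.memmap(path, dtype=np.uint8, mode="w+", shape=shape)
    geno[:] = rng.integers(0, 3, size=shape, dtype=np.uint8)
    geno.flush()
    del geno

    # Reopen read-only: data are paged in from disk only when accessed.
    geno = np.memmap(path, dtype=np.uint8, mode="r", shape=shape)

    # Compute allele frequencies one block of variants at a time, so only
    # one block needs to be held in memory at once.
    allele_freq = np.empty(n_variants)
    for start in range(0, n_variants, block_size):
        stop = min(start + block_size, n_variants)
        block = np.asarray(geno[:, start:stop], dtype=np.float64)
        allele_freq[start:stop] = block.mean(axis=0) / 2.0

    # Reference computation that loads everything at once (feasible here only
    # because the toy matrix is small).
    in_memory_freq = np.asarray(geno, dtype=np.float64).mean(axis=0) / 2.0

    del geno  # release the mapping before the temporary directory is removed

print("first five allele frequencies:", allele_freq[:5])
```

The block size trades memory use against the number of disk reads; the same block-wise pattern extends to other per-variant statistics.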

eQTpLot: an R package for the visualization and colocalization of eQTL and GWAS signals

10.1101/2020.08.26.268268 ◽ 2020

Author(s): Theodore G. Drivas ◽ Anastasia Lucas ◽ Marylyn D. Ritchie

Keyword(s): Quantitative Trait Loci ◽ Quantitative Trait ◽ Genomic Data ◽ R Package ◽ Expression Quantitative Trait Loci ◽ Summary Statistics ◽ Link Type ◽ Trait Loci ◽ Genomic Studies

Summary: Genomic studies increasingly integrate expression quantitative trait loci (eQTL) information into their analysis pipelines, but few tools exist for the visualization of colocalization between eQTL and GWAS results. To address this issue, we developed the intuitive R package eQTpLot, which takes as input GWAS and eQTL summary statistics to generate a series of plots visualizing colocalization, correlation, and enrichment between eQTL and GWAS signals for a given gene-trait pair. We believe eQTpLot will prove a useful tool for investigators seeking a convenient and customizable visualization of genomic data colocalization. Availability and implementation: The eQTpLot R package and tutorial are available at https://github.com/RitchieLab/[email protected]

ClusterMine: a Knowledge-integrated Clustering Approach based on Expression Profiles of Gene Sets

10.1101/255711 ◽ 2018

Author(s): Hong-Dong Li ◽ Yunpei Xu ◽ Xiaoshu Zhu ◽ Quan Liu ◽ Gilbert S. Omenn ◽ ...

Keyword(s): Expression Profiles ◽ R Package ◽ Biological Data ◽ Supplementary Information ◽ Consensus Clustering ◽ Cluster Membership ◽ Link Type ◽ Novel Approach ◽ Gene Sets ◽ Biological Interpretation

Abstract. Motivation: Clustering analysis is essential for understanding complex biological data. In widely used methods such as hierarchical clustering (HC) and consensus clustering (CC), expression profiles of all genes are often used to assess similarity between samples for clustering. These methods output sample clusters but cannot indicate which gene sets (functions) contribute most to the clustering, so the interpretability of their results is limited. We hypothesized that integrating prior knowledge of annotated biological processes would not only achieve satisfactory clustering performance but also, more importantly, enable potential biological interpretation of clusters. Results: Here we report ClusterMine, a novel approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets, e.g., in Gene Ontology. In addition to outputting the cluster membership of each sample as conventional approaches do, it outputs the gene sets that are most likely to contribute to the clustering, a feature that facilitates biological interpretation. Using three cancer datasets, two single-cell RNA-sequencing based cell differentiation datasets, one cell cycle dataset and two datasets of cells of different tissue origins, we found that ClusterMine achieved similar or better clustering performance and that the top-scored gene sets prioritized by ClusterMine are biologically relevant. Implementation and availability: ClusterMine is implemented as an R package and is freely available at: www.genemine.org/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

adductomicsR: A package for detection and quantification of protein adducts from mass spectra of tryptic digests

10.1101/463331 ◽ 2018

Author(s): Josie Hayes ◽ William M. B. Edmands ◽ Yukiko Yano ◽ Hasmik Grigoryan ◽ Courtney Schiffman ◽ ...

Keyword(s): Mass Spectra ◽ High Resolution Mass Spectrometry ◽ Internal Standard ◽ R Package ◽ Protein Adducts ◽ Supplementary Information ◽ Link Type ◽ Protein Digests ◽ Modified Peptides ◽ Time Drift

Abstract. Summary: Liquid chromatography-high resolution mass spectrometry (LC-HRMS) has been used to establish a method, referred to as ‘adductomics’, for characterisation of putative protein adducts at selected loci in human serum albumin (HSA). Applications of this method have been limited by the lack of software for untargeted analysis of modified peptides in protein digests. Here we present adductomicsR, an open-source R package for processing LC-HRMS data from analysis of adducted HSA peptides. The software interrogates mass spectra to correct for retention-time drift, and to discover and quantify putative adducts along with those for a housekeeping peptide and internal standard. Availability and implementation: adductomicsR is written in R and publicly available at https://github.com/JosieLHayes/adductomicsR, which includes a vignette with example data. Supplementary information: mzXML files for the vignette and test dataset are available in an associated data package adductData (https://github.com/JosieLHayes/adductData). [email protected] Section: APPLICATIONS NOTE

Improving the value of public RNA-seq expression data by phenotype prediction

10.1101/145656 ◽ 2017 ◽ Cited By ~ 2

Author(s): Shannon E. Ellis ◽ Leonardo Collado-Torres ◽ Jeffrey T. Leek

Keyword(s): In Silico ◽ Tissue Sample ◽ Genomic Data ◽ R Package ◽ Training Data ◽ Expression Data ◽ RNA Seq ◽ Phenotype Prediction ◽ Link Type ◽ Public Data

Abstract. Background: Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. Results: We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements, using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project (https://jhubiostatistics.shinyapps.io/recount/). We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package (https://github.com/leekgroup/phenopredict) and the predictions for recount2 are available from the recount R package (https://bioconductor.org/packages/release/bioc/html/recount.html). Conclusion: By leveraging massive public data sets to generate a well-phenotyped set of expression data for more than 70,000 human samples, we make expression data available for use on a scale that was not previously feasible.

MutSpot: detection of non-coding mutation hotspots in cancer genomes

10.1101/740944 ◽ 2019 ◽ Cited By ~ 1

Author(s): Yu Amanda Guo ◽ Mei Mei Chang ◽ Anders Jacobsen Skanderup

Keyword(s): Somatic Mutations ◽ R Package ◽ Supplementary Information ◽ Patient Specific ◽ Supplementary Data ◽ Link Type ◽ Genome Wide ◽ Cancer Genomes ◽ User Friendly ◽ Regulatory DNA

Abstract. Summary: Recurrence and clustering of somatic mutations (hotspots) in cancer genomes may indicate positive selection and involvement in tumorigenesis. MutSpot performs genome-wide inference of mutation hotspots in non-coding and regulatory DNA of cancer genomes. MutSpot performs feature selection across hundreds of epigenetic and sequence features followed by estimation of position- and patient-specific background somatic mutation probabilities. MutSpot is user-friendly, works on a standard workstation, and scales to thousands of cancer genomes. Availability and implementation: MutSpot is implemented as an R package and is available at https://github.com/skandlab/MutSpot/ Supplementary information: Supplementary data are available at https://github.com/skandlab/MutSpot/

BiomeHorizon: visualizing microbiome time series data in R

10.1101/2021.08.29.458140 ◽ 2021

Author(s): Isaac Fink ◽ Richard J. Abdill ◽ Ran Blekhman ◽ Laura Grieneisen

Keyword(s): Time Series ◽ Open Source ◽ Time Series Data ◽ R Package ◽ Supplementary Information ◽ Series Data ◽ Link Type ◽ Microbiome Research ◽ Microbiome Data ◽ Over Time

Abstract. Summary: A key aspect of microbiome research is analysis of longitudinal dynamics using time series data. A method to visualize both the proportional and absolute change in the abundance of multiple taxa across multiple subjects over time is needed. We developed BiomeHorizon, an open-source R package that visualizes longitudinal compositional microbiome data using horizon plots. Availability and implementation: BiomeHorizon is available at https://github.com/blekhmanlab/biomehorizon/ and released under the MIT license. A guide with step-by-step instructions for using the package is provided at https://blekhmanlab.github.io/biomehorizon/. The guide also provides code to reproduce all plots in this [email protected], [email protected], [email protected] Supplementary information: None
