NetBoxR: Automated Discovery of Biological Process Modules by Network Analysis in R

Mapping Intimacies ◽

10.1101/2020.06.02.129387 ◽

2020 ◽

Author(s):

Eric Minwei Liu ◽

Augustin Luna ◽

Guanlan Dong ◽

Chris Sander

Keyword(s):

Cell Biology ◽

Large Scale ◽

Clustering Algorithm ◽

High Throughput Sequencing ◽

R Package ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Network Clustering ◽

Data Types

AbstractSummaryLarge-scale sequencing projects, such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have accumulated a variety of high throughput sequencing and molecular profiling data, but it is still challenging to identify potentially causal genetic mutations in cancer as well as in other diseases in an automated fashion. We developed the NetBoxR package written in the R programming language, that makes use of the NetBox algorithm to identify candidate cancer-related processes. The algorithm makes use of a networkbased approach that combines prior knowledge with a network clustering algorithm, obviating the need for and the limitation of functionally curated gene sets. A key aspect of this approach is its ability to combine multiple data types, such as mutations and copy number alterations, leading to more reliable identification of functional modules. We make the tool available in the Bioconductor R ecosystem for applications in cancer research and cell biology.Availability and implementationThe NetBoxR package is free and open-sourced under the GNU GPL-3 license R package available at https://www.bioconductor.org/packages/release/bioc/html/[email protected]; [email protected]; [email protected] informationNone

Download Full-text

Mercator: An R Package for Visualization of Distance Matrices

10.1101/733261 ◽

2019 ◽

Author(s):

Zachary B. Abrams ◽

Caitlin E. Coombes ◽

Suli Li ◽

Kevin R. Coombes

Keyword(s):

Large Scale ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Data Types ◽

Vast Number ◽

Large Scale Data ◽

Scientific Disciplines ◽

Compare And Contrast ◽

Scale Data

AbstractSummaryUnsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection.Availability and ImplementationMercator is freely available at the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html)[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

MOVIE: Multi-Omics Visualization of Estimated contributions

10.1101/379115 ◽

2018 ◽

Author(s):

Sean D. McCabe ◽

Dan-Yu Lin ◽

Michael I. Love

Keyword(s):

Correlation Analysis ◽

Canonical Correlation Analysis ◽

Canonical Correlation ◽

High Specificity ◽

R Package ◽

Data Type ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Data Types ◽

Cancer Data

AbstractSummaryThe growth of multi-omics datasets has given rise to many methods for identifying sources of common variation across data types. The unsupervised nature of these methods makes it difficult to evaluate their performance. We present MOVIE, Multi-Omics Visualization of Estimated contributions, as a framework for evaluating the degree of overfitting and the stability of unsupervised multi-omics methods. MOVIE plots the contributions of one data type against another to produce contribution plots, where contributions are calculated for each subject and each data type from the results of each multi-omics method. The usefulness of MOVIE is demonstrated by applying existing multi-omics methods to permuted null data and breast cancer data from The Cancer Genome Atlas. Contribution plots indicated that principal components-based Canonical Correlation Analysis overfit null data, while Sparse multiple Canonical Correlation Analysis and Multi-Omics Factor Analysis provided stable results with high specificity for both the real and permuted null datasets.AvailabilityMOVIE is available as an R package at https://github.com/mccabes292/[email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text

Inferring perturbation profiles of cancer samples

Bioinformatics ◽

10.1093/bioinformatics/btab113 ◽

2021 ◽

Author(s):

Martin Pirkl ◽

Niko Beerenwinkel

Keyword(s):

Indirect Evidence ◽

R Package ◽

The Cancer Genome Atlas ◽

Supplementary Information ◽

Patient Specific ◽

Driver Genes ◽

Cancer Driver ◽

Molecular Alterations ◽

Incomplete Coverage ◽

Gene Perturbations

Abstract Motivation Cancer is one of the most prevalent diseases in the world. Tumors arise due to important genes changing their activity, e.g. when inhibited or over-expressed. But these gene perturbations are difficult to observe directly. Molecular profiles of tumors can provide indirect evidence of gene perturbations. However, inferring perturbation profiles from molecular alterations is challenging due to error-prone molecular measurements and incomplete coverage of all possible molecular causes of gene perturbations. Results We have developed a novel mathematical method to analyze cancer driver genes and their patient-specific perturbation profiles. We combine genetic aberrations with gene expression data in a causal network derived across patients to infer unobserved perturbations. We show that our method can predict perturbations in simulations, CRISPR perturbation screens and breast cancer samples from The Cancer Genome Atlas. Availability and implementation The method is available as the R-package nempi at https://github.com/cbg-ethz/nempi and http://bioconductor.org/packages/nempi. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Large-scale profiling of microRNAs for The Cancer Genome Atlas

Nucleic Acids Research ◽

10.1093/nar/gkv808 ◽

2015 ◽

Vol 44 (1) ◽

pp. e3-e3 ◽

Cited By ~ 62

Author(s):

Andy Chu ◽

Gordon Robertson ◽

Denise Brooks ◽

Andrew J. Mungall ◽

Inanc Birol ◽

...

Keyword(s):

Large Scale ◽

Cancer Genome ◽

The Cancer Genome Atlas ◽

Cancer Genome Atlas ◽

Genome Atlas

Download Full-text

CASS: A distributed network clustering algorithm based on structure similarity for large-scale network

PLoS ONE ◽

10.1371/journal.pone.0203670 ◽

2018 ◽

Vol 13 (10) ◽

pp. e0203670 ◽

Cited By ~ 1

Author(s):

Jungrim Kim ◽

Mincheol Shin ◽

Jeongwoo Kim ◽

Chihyun Park ◽

Sujin Lee ◽

...

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Network Clustering ◽

Distributed Network ◽

Large Scale Network ◽

Scale Network ◽

Structure Similarity

Download Full-text

An R Package for Divergence Analysis of Omics Data

10.1101/720391 ◽

2019 ◽

Author(s):

Wikum Dinalankara ◽

Qian Ke ◽

Donald Geman ◽

Luigi Marchionni

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

The Cancer Genome Atlas ◽

High Dimensional ◽

Omics Data ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Ternary Code ◽

Cancer Genome Atlas ◽

Level Analysis

AbstractGiven the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with sample high throughput sequencing data from the Cancer Genome Atlas.

Download Full-text

hypeR: An R Package for Geneset Enrichment Workflows

10.1101/656637 ◽

2019 ◽

Cited By ~ 1

Author(s):

Anthony Federico ◽

Stefano Monti

Keyword(s):

High Throughput Sequencing ◽

R Package ◽

Supplementary Information ◽

Sequencing Data ◽

Wide Audience ◽

Popular Method ◽

Link Type ◽

High Throughput Sequencing Data ◽

One Stop ◽

Recent Version

ABSTRACTSummaryGeneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization, and sharing methods in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution to performing geneset enrichment for a wide audience and range of use cases.Availability and implementationThe most recent version of the package is available at https://github.com/montilab/hypeR.Supplementary informationComprehensive documentation and tutorials, are available at https://montilab.github.io/hypeR-docs.

Download Full-text

ADFinder: accurate detection of programmed DNA elimination using NGS high-throughput sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa226 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3632-3636 ◽

Cited By ~ 2

Author(s):

Weibo Zheng ◽

Jing Chen ◽

Thomas G Doak ◽

Weibo Song ◽

Ying Yan

Keyword(s):

High Throughput ◽

Large Scale ◽

High Throughput Sequencing ◽

Supplementary Information ◽

Sequencing Data ◽

Source Codes ◽

High Throughput Sequencing Data ◽

Dna Elimination ◽

Multiple Alternative ◽

Dna Splicing

Abstract Motivation Programmed DNA elimination (PDE) plays a crucial role in the transitions between germline and somatic genomes in diverse organisms ranging from unicellular ciliates to multicellular nematodes. However, software specific for the detection of DNA splicing events is scarce. In this paper, we describe Accurate Deletion Finder (ADFinder), an efficient detector of PDEs using high-throughput sequencing data. ADFinder can predict PDEs with relatively low sequencing coverage, detect multiple alternative splicing forms in the same genomic location and calculate the frequency for each splicing event. This software will facilitate research of PDEs and all down-stream analyses. Results By analyzing genome-wide DNA splicing events in two micronuclear genomes of Oxytricha trifallax and Tetrahymena thermophila, we prove that ADFinder is effective in predicting large scale PDEs. Availability and implementation The source codes and manual of ADFinder are available in our GitHub website: https://github.com/weibozheng/ADFinder. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Efficient weighted univariate clustering maps outstanding dysregulated genomic zones in human cancers

Bioinformatics ◽

10.1093/bioinformatics/btaa613 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5027-5036 ◽

Cited By ~ 3

Author(s):

Mingzhou Song ◽

Hua Zhong

Keyword(s):

Clustering Algorithm ◽

Human Cancer ◽

Clustering Algorithms ◽

Search Space ◽

R Package ◽

Supplementary Information ◽

Diagnostic Biomarkers ◽

Cancer Types ◽

Molecular Patterns ◽

Pan Cancer

Abstract Motivation Chromosomal patterning of gene expression in cancer can arise from aneuploidy, genome disorganization or abnormal DNA methylation. To map such patterns, we introduce a weighted univariate clustering algorithm to guarantee linear runtime, optimality and reproducibility. Results We present the chromosome clustering method, establish its optimality and runtime and evaluate its performance. It uses dynamic programming enhanced with an algorithm to reduce search-space in-place to decrease runtime overhead. Using the method, we delineated outstanding genomic zones in 17 human cancer types. We identified strong continuity in dysregulation polarity—dominance by either up- or downregulated genes in a zone—along chromosomes in all cancer types. Significantly polarized dysregulation zones specific to cancer types are found, offering potential diagnostic biomarkers. Unreported previously, a total of 109 loci with conserved dysregulation polarity across cancer types give insights into pan-cancer mechanisms. Efficient chromosomal clustering opens a window to characterize molecular patterns in cancer genome and beyond. Availability and implementation Weighted univariate clustering algorithms are implemented within the R package ‘Ckmeans.1d.dp’ (4.0.0 or above), freely available at https://cran.r-project.org/package=Ckmeans.1d.dp. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

AbstractMotivationRapid developments in sequencing technologies have boosted generating high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption.ResultsWe report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With 102 – 104-fold reduction of storage space, CRAFT performs fast query for gigabytes of data within seconds or minutes, achieving comparable performance as six state-of-the-art alignment-free measures.AvailabilityCRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected] informationSupplementary data are available at Bioinformatics online.

Download Full-text