Exploring evolutionary relationships across the genome using topology weighting

When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data

Genome Biology ◽

10.1186/s13059-019-1809-x ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 13

Author(s):

Will P. M. Rowe

Keyword(s):

Genomic Data ◽

The Past ◽

Link Type ◽

Practical Guide ◽

Current State ◽

Great Utility ◽

State Of The Field

Abstract Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching.
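
As a rough illustration of the "small, approximate summaries" this abstract describes, the sketch below builds a bottom-k MinHash sketch of a sequence's k-mers and estimates the Jaccard similarity between two sequences from their sketches alone. It is a minimal sketch under stated assumptions, not code from the reviewed article or its workbooks: the function names, the k-mer length, the sketch size, and the choice of BLAKE2b as the hash function are all illustrative.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=5, size=16):
    """Keep only the `size` smallest k-mer hashes as the sketch."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both input sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = bottom_k_sketch("ACGTACGTTGCAACGTTAGC")
    b = bottom_k_sketch("ACGTACGTTGCAACGTTAGG")
    print(jaccard_estimate(a, b))
```

Each sketch holds at most `size` integers however long the input sequence is, which is the source of the low compute and memory cost; the trade-off is that the similarity is an estimate whose variance shrinks as `size` grows.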

BioJS InterMineTable Component: A BioJS component for displaying data from InterMine compatible webservice endpoints

F1000Research ◽

10.12688/f1000research.3-46.v1 ◽

2014 ◽

Vol 3 ◽

pp. 46 ◽

Cited By ~ 1

Author(s):

Alexis Kalderimis ◽

Radek Stepan ◽

Julie Sullivan ◽

Rachel Lyne ◽

Michael Lyne ◽

...

Keyword(s):

Genomic Data ◽

Data Warehouses ◽

Data Types ◽

Link Type ◽

Wide Range ◽

Flexible Queries

Summary: The InterMineTable component is a reusable JavaScript component as part of the BioJS project. It enables users to embed powerful table-based query facilities in their websites with access to genomic data-warehouses such as http://www.flymine.org, which allow users to perform flexible queries over a wide range of integrated data types. Availability: http://github.com/alexkalderimis/im-tables-biojs; http://github.com/biojs/biojs; http://dx.doi.org/10.5281/zenodo.8301.

BioJS InterMine List Analysis: A BioJS component for displaying graphical or statistical analysis of collections of items from InterMine endpoints

F1000Research ◽

10.12688/f1000research.3-45.v1 ◽

2014 ◽

Vol 3 ◽

pp. 45 ◽

Cited By ~ 1

Author(s):

Alexis Kalderimis ◽

Radek Stepan ◽

Julie Sullivan ◽

Rachel Lyne ◽

Michael Lyne ◽

...

Keyword(s):

Statistical Analysis ◽

Genomic Data ◽

Data Warehouses ◽

Data Types ◽

Link Type ◽

Wide Range ◽

Flexible Queries

Summary: The InterMineTable component is a reusable JavaScript component as part of the BioJS project. It enables users to embed powerful table-based query facilities in their websites with access to genomic data-warehouses such as http://www.flymine.org, which allow users to perform flexible queries over a wide range of integrated data types. Availability: http://github.com/alexkalderimis/im-tables-biojs; http://github.com/biojs/biojs; http://dx.doi.org/10.5281/zenodo.8301.

blupADC: An R package and shiny toolkit for comprehensive genetic data analysis in animal and plant breeding

10.1101/2021.09.09.459557 ◽

2021 ◽

Author(s):

Quanshun Mei ◽

Chuanke Fu ◽

Jieling Li ◽

Shuhong Zhao ◽

Tao Xiang

Keyword(s):

Genetic Analysis ◽

Plant Breeding ◽

Genomic Data ◽

R Package ◽

Genotype Imputation ◽

Supplementary Information ◽

Composition Analysis ◽

Relationship Matrix ◽

Link Type ◽

Plant Breeding Program

Abstract Summary: Genetic analysis is a systematic and complex procedure in animal and plant breeding. With fast development of high-throughput genotyping techniques and algorithms, animal and plant breeding has entered into a genomic era. However, there is a lack of software, which can be used to process comprehensive genetic analyses, in the routine animal and plant breeding program. To make the whole genetic analysis in animal and plant breeding straightforward, we developed a powerful, robust and fast R package that includes genomic data format conversion, genomic data quality control and genotype imputation, breed composition analysis, pedigree tracing, analysis and visualization, pedigree-based and genomic-based relationship matrix construction, and genomic evaluation. In addition, to simplify the application of this package, we also developed a shiny toolkit for users. Availability and implementation: blupADC is developed primarily in R with core functions written in C++. The development version is maintained at https://github.com/TXiang-lab/blupADC. Supplementary information: Supplementary data are available online.

Wedding higher taxonomic ranks with metabolic signatures coded in prokaryotic genomes

10.1101/044115 ◽

2016 ◽

Author(s):

Gregorio Iraola ◽

Hugo Naya

Keyword(s):

Best Practices ◽

Polyphasic Taxonomy ◽

Bacterial Species ◽

Genomic Data ◽

Link Type ◽

Prokaryotic Genomes ◽

Functional Features ◽

Definition Of ◽

Ecological Coherence ◽

Metabolic Signatures

Taxonomy of prokaryotes has remained a controversial discipline due to the extreme plasticity of microorganisms, causing inconsistencies between phenotypic and genotypic classifications. The genomics era has enhanced taxonomy but also opened new debates about the best practices for incorporating genomic data into polyphasic taxonomy protocols, which are fairly biased towards the identification of bacterial species. Here we use an extensive dataset of Archaea and Bacteria to prove that metabolic signatures coded in their genomes are informative traits that allow to accurately classify organisms coherently to higher taxonomic ranks, and to associate functional features with the definition of taxa. Our results support the ecological coherence of higher taxonomic ranks and reconciles taxonomy with traditional chemotaxonomic traits inferred from genomes. KARL, a simple and free tool useful for assisting polyphasic taxonomy or to perform functional prospections is also presented (https://github.com/giraola/KARL).

Pairwise comparisons across species are problematic when analyzing functional genomic data

10.1101/107177 ◽

2017 ◽

Cited By ~ 4

Author(s):

Casey W. Dunn ◽

Felipe Zapata ◽

Catriona Munro ◽

Stefan Siebert ◽

Andreas Hejnol

Keyword(s):

Gene Expression ◽

Evolutionary Process ◽

Opportunity To Learn ◽

Genomic Data ◽

The Other ◽

Evolutionary Relationships ◽

Pairwise Comparisons ◽

Functional Genomic ◽

Functional Genomic Data ◽

Comparative Functional Genomics

Abstract There is considerable interest in comparing functional genomic data across species. One goal of such work is to provide an integrated understanding of genome and phenotype evolution. Most comparative functional genomic studies have relied on multiple pairwise comparisons between species, an approach that does not incorporate information about the evolutionary relationships among species. The statistical problems that arise from not considering these relationships can lead pairwise approaches to the wrong conclusions, and are a missed opportunity to learn about biology that can only be understood in an explicit phylogenetic context. Here we examine two recently published studies that compare gene expression across species with pairwise methods, and find reason to question the original conclusions of both. One study interpreted pairwise comparisons of gene expression as support for the ortholog conjecture, the hypothesis that orthologs tend to be more similar than paralogs. The other study interpreted pairwise comparisons of embryonic gene expression across distantly related animals as evidence for a distinct evolutionary process that gave rise to phyla. In each study, distinct patterns of pairwise similarity among species were originally interpreted as evidence of particular evolutionary processes, but instead we find they reflect species relationships. These reanalyses concretely demonstrate the inadequacy of pairwise comparisons for analyzing functional genomic data across species. It will be critical to adopt phylogenetic comparative methods in future functional genomic work. Fortunately, phylogenetic comparative biology is also a rapidly advancing field with many methods that can be directly applied to functional genomic data. Significance: Comparisons of genome function between species are providing important insight into the evolutionary origins of diversity. Here we demonstrate that comparative functional genomics studies can come to the wrong conclusions if they do not take the relationships of species into account and instead rely on pairwise comparisons between species, as is common practice. We re-examined two previously published studies and found problems with pairwise comparisons that draw both their original conclusions into question. One study had found support for the ortholog conjecture and the other had concluded that the evolution of gene expression was different between animal phyla than within them. Our results demonstrate that to answer evolutionary questions about genome function, it is critical to consider evolutionary relationships.

Dhaka: Variational Autoencoder for Unmasking Tumor Heterogeneity from Single Cell Genomic Data

10.1101/183863 ◽

2017 ◽

Cited By ~ 4

Author(s):

Sabrina Rashid ◽

Sohrab Shah ◽

Ziv Bar-Joseph ◽

Ravi Pandya

Keyword(s):

Gene Expression ◽

Single Cell ◽

Tumor Heterogeneity ◽

Genomic Data ◽

Feature Space ◽

Marker Genes ◽

Tumor Evolution ◽

Evolutionary Trajectory ◽

Link Type ◽

Variational Autoencoder

Abstract Motivation: Intra-tumor heterogeneity is one of the key confounding factors in deciphering tumor evolution. Malignant cells exhibit variations in their gene expression, copy numbers, and mutation even when originating from a single progenitor cell. Single cell sequencing of tumor cells has recently emerged as a viable option for unmasking the underlying tumor heterogeneity. However, extracting features from single cell genomic data in order to infer their evolutionary trajectory remains computationally challenging due to the extremely noisy and sparse nature of the data. Results: Here we describe ‘Dhaka’, a variational autoencoder method which transforms single cell genomic data to a reduced dimension feature space that is more efficient in differentiating between (hidden) tumor subpopulations. Our method is general and can be applied to several different types of genomic data including copy number variation from scDNA-Seq and gene expression from scRNA-Seq experiments. We tested the method on synthetic and 6 single cell cancer datasets where the number of cells ranges from 250 to 6000 for each sample. Analysis of the resulting feature space revealed subpopulations of cells and their marker genes. The features are also able to infer the lineage and/or differentiation trajectory between cells greatly improving upon prior methods suggested for feature extraction and dimensionality reduction of such data. Availability and Implementation: All the datasets used in the paper are publicly available and developed software package is available on Github https://github.com/MicrosoftGenomics/Dhaka. Supporting info and Software: https://github.com/MicrosoftGenomics/Dhaka

Plotgardener: Cultivating precise multi-panel figures in R

10.1101/2021.09.08.459338 ◽

2021 ◽

Author(s):

Nicole E Kramer ◽

Eric S Davis ◽

Craig D Wenger ◽

Erika M Deoudes ◽

Sarah M Parker ◽

...

Keyword(s):

Programming Languages ◽

Genomic Data ◽

Data Access ◽

Manuscript Preparation ◽

Data Sets ◽

New Paradigm ◽

Link Type ◽

Bioconductor Project ◽

Invaluable Tool ◽

R Programming

The R programming language is one of the most widely used programming languages for transforming raw genomic data sets into meaningful biological conclusions through analysis and visualization, which has been largely facilitated by infrastructure and tools developed by the Bioconductor project. However, existing plotting packages rely on relative positioning and sizing of plots, which is often sufficient for exploratory analysis but is poorly suited for the creation of publication-quality multi-panel images inherent to scientific manuscript preparation. We present plotgardener, a coordinate-based genomic data visualization package that offers a new paradigm for multi-plot figure generation in R. Plotgardener allows precise, programmatic control over the placement, aesthetics, and arrangements of plots while maximizing user experience through fast and memory-efficient data access, support for a wide variety of data and file types, and tight integration with the Bioconductor environment. Plotgardener also allows precise placement and sizing of ggplot2 plots, making it an invaluable tool for R users and data scientists from virtually any discipline. Availability: Package: https://bioconductor.org/packages/plotgardener; Code: https://github.com/PhanstielLab/plotgardener; Documentation: https://phanstiellab.github.io/plotgardener/

Better quality score compression through sequence-based quality smoothing

BMC Bioinformatics ◽

10.1186/s12859-019-2883-5 ◽

2019 ◽

Vol 20 (S9) ◽

Cited By ~ 3

Author(s):

Yoshihiro Shibuya ◽

Matteo Comin

Keyword(s):

High Precision ◽

Exponential Growth ◽

Genomic Data ◽

Suffix Array ◽

Quality Score ◽

Smoothing Algorithm ◽

Snp Calling ◽

Link Type ◽

Downstream Analysis ◽

Ngs Data

Abstract Motivation Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data unfortunately not followed by an exponential growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values associated to each read. Those values are often more diversified than necessary. Because of that, many tools such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analysis like SNP calling. Results We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers and an effective smoothing algorithm to maintain high precision for SNP calling pipelines, while reducing quality scores entropy. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality scores compression by smoothing leading to improved compressibility of FASTQ files. The succinct k-mers dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF smoothing algorithm can improve genotyping accuracy while using less resources. Availability https://github.com/yhhshb/yalff

Prediction of lithium response using genomic data

Scientific Reports ◽

10.1038/s41598-020-80814-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

William Stone ◽

Abraham Nunes ◽

Kazufumi Akiyama ◽

Nirmala Akula ◽

Raffaella Ardau ◽

...

Keyword(s):

Cross Validation ◽

Genomic Data ◽

Postsynaptic Membrane ◽

Classification Performance ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Genomic Dataset ◽

Task Classification ◽

Lithium Response

Abstract Predicting lithium response prior to treatment could both expedite therapy and avoid exposure to side effects. Since lithium responsiveness may be heritable, its predictability based on genomic data is of interest. We thus evaluate the degree to which lithium response can be predicted with a machine learning (ML) approach using genomic data. Using the largest existing genomic dataset in the lithium response literature (n = 2210 across 14 international sites; 29% responders), we evaluated the degree to which lithium response could be predicted based on 47,465 genotyped single nucleotide polymorphisms using a supervised ML approach. Under appropriate cross-validation procedures, lithium response could be predicted to above-chance levels in two constituent sites (Halifax, Cohen’s kappa 0.15, 95% confidence interval, CI [0.07, 0.24]; and Würzburg, kappa 0.2 [0.1, 0.3]). Variants with shared importance in these models showed over-representation of postsynaptic membrane related genes. Lithium response was not predictable in the pooled dataset (kappa 0.02 [−0.01, 0.04]), although non-trivial performance was achieved within a restricted dataset including only those patients followed prospectively (kappa 0.09 [0.04, 0.14]). Genomic classification of lithium response remains a promising but difficult task. Classification performance could potentially be improved by further harmonization of data collection procedures.
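
The abstract above reports classifier performance as Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement implied by the label frequencies. The sketch below shows that calculation on toy labels. The function name and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa between true labels and predicted labels."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    # Observed agreement: fraction of samples where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement: product of marginal label frequencies, summed per class.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[c] * pred_counts[c] for c in classes) / (n * n)

    if p_e == 1.0:
        # Only one class present in both inputs; kappa is undefined,
        # so this sketch returns 0.0 by convention.
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy labels: 1 = responder, 0 = non-responder (hypothetical data).
    truth = [1, 1, 0, 0]
    preds = [1, 0, 0, 0]
    print(cohens_kappa(truth, preds))  # 0.5
```

Under this definition, a classifier that always predicts the majority class scores 0 regardless of its raw accuracy, which is why kappa values near 0 are read as chance-level performance on an imbalanced dataset such as one with 29% responders.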