HyDe: A Python Package for Genome-Scale Hybridization Detection

Abstract The analysis of hybridization and gene flow among closely related taxa is a common goal for researchers studying speciation and phylogeography. Many methods for hybridization detection use simple site pattern frequencies from observed genomic data and compare them to null models that predict an absence of gene flow. The theory underlying the detection of hybridization using these site pattern probabilities exploits the relationship between the coalescent process for gene trees within population trees and the process of mutation along the branches of the gene trees. For certain models, site patterns are predicted to occur in equal frequency (i.e., their difference is 0), producing a set of functions called phylogenetic invariants. In this paper we introduce HyDe, a software package for detecting hybridization using phylogenetic invariants arising under the coalescent model with hybridization. HyDe is written in Python, and can be used interactively or through the command line using pre-packaged scripts. We demonstrate the use of HyDe on simulated data, as well as on two empirical data sets from the literature. We focus in particular on identifying individual hybrids within population samples and on distinguishing between hybrid speciation and gene flow. HyDe is freely available as an open source Python package under the GNU GPL v3 on both GitHub (https://github.com/pblischak/HyDe) and the Python Package Index (PyPI: https://pypi.python.org/pypi/phyde).
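
The site pattern frequencies mentioned in the abstract can be illustrated with a minimal sketch. It assumes four aligned sequences given as plain strings, and the function name `site_pattern_counts` is hypothetical; it is not part of the HyDe API and does not reproduce HyDe's test statistic.

```python
from collections import Counter


def site_pattern_counts(seqs):
    """Count four-taxon site patterns across aligned sequences.

    `seqs` is a list of four equal-length strings (an assumed input format).
    Each alignment column is recoded by order of first appearance, so the
    column 'ACCA' becomes 'ABBA' and 'GGTT' becomes 'AABB'.
    """
    if len(seqs) != 4 or len({len(s) for s in seqs}) != 1:
        raise ValueError("expected four aligned sequences of equal length")
    counts = Counter()
    for column in zip(*seqs):
        if any(base not in "ACGT" for base in column):
            continue  # skip gaps and ambiguity codes
        labels = {}
        pattern = "".join(
            labels.setdefault(base, "ABCD"[len(labels)]) for base in column
        )
        counts[pattern] += 1
    return counts


counts = site_pattern_counts(["ACGTA", "ACGTC", "ACTTC", "AGTTA"])
# Difference between two pattern counts; under some no-gene-flow models
# differences of this kind are expected to be close to zero.
difference = counts["ABBA"] - counts["ABAB"]
```

A difference between two pattern counts, such as `difference` above, is the kind of quantity an invariant-based test examines; which patterns are compared, and how significance is assessed, depends on the model and is not shown here.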

Quartet-based computations of internode certainty provide accurate and robust measures of phylogenetic incongruence

10.1101/168526 ◽

2017 ◽

Cited By ~ 8

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz von Looz ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Branch Support ◽

Robust Measures ◽

Genome Scale ◽

Scale Data

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets but relatively few measures have been developed to quantify it. Internode Certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internode (or internal branch) among a set of phylogenetic trees and complement regular branch support statistics in assessing the confidence of the inferred phylogenetic relationships. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, the calculation of IC scores requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing data is high, current approaches that adjust bipartition frequencies in partial gene trees tend to overestimate IC scores and alternative adjustment approaches differ substantially from each other in their scores. To overcome these issues, we developed three new measures for calculating internode certainty that are based on the frequencies of quartets, which naturally apply to both comprehensive and partial trees. Our comparison of these new quartet-based measures to previous bipartition-based measures on simulated data shows that: 1) on comprehensive trees, both types of measures yield highly similar IC scores; 2) on partial trees, quartet-based measures generate more accurate IC scores; and 3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in the phylogenetic relationships to be assessed. Additionally, analysis of 15 empirical phylogenomic data sets using our quartet-based measures suggests that numerous relationships remain unresolved despite the availability of genome-scale data. Finally, we provide an efficient open-source implementation of these quartet-based measures in the program QuartetScores, which is freely available at https://github.com/algomaus/QuartetScores.
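
To make the idea of an IC score concrete, here is a minimal sketch of an entropy-style score for one bipartition and its most frequent conflicting bipartition. It illustrates the general bipartition-based idea only; it does not reproduce the three quartet-based measures or the QuartetScores program, and the function name and the sign convention for the case where the conflict is more frequent are assumptions of this example.

```python
import math


def internode_certainty(freq_main, freq_conflict):
    """Entropy-style certainty for a branch and its strongest conflict.

    `freq_main` and `freq_conflict` are the numbers of trees supporting the
    bipartition of interest and the most prevalent conflicting bipartition.
    The score is 1 when there is no conflict and 0 when both are equally
    supported. Returning a negative value when the conflict is more frequent
    is a convention assumed here.
    """
    total = freq_main + freq_conflict
    if total <= 0:
        raise ValueError("at least one supporting tree is required")
    score = 1.0
    for freq in (freq_main, freq_conflict):
        p = freq / total
        if p > 0:
            score += p * math.log2(p)  # adds a non-positive entropy term
    return score if freq_main >= freq_conflict else -score
```

Because the score depends only on relative frequencies, any bias in how those frequencies are adjusted for partial gene trees carries straight through to the score, which is the problem the abstract describes.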

Quartet-Based Computations of Internode Certainty Provide Robust Measures of Phylogenetic Incongruence

Systematic Biology ◽

10.1093/sysbio/syz058 ◽

2019 ◽

Vol 69 (2) ◽

pp. 308-324 ◽

Cited By ~ 7

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz Von Looz ◽

...

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Statistical Confidence ◽

Branch Support ◽

Robust Measures ◽

Genome Scale

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets. Internode certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internal branch among a set of phylogenetic trees and complement regular branch support measures (e.g., bootstrap, posterior probability) that instead assess the statistical confidence of inference. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, IC score calculation typically requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing taxa is high, the scores yielded by current approaches that adjust bipartition frequencies in partial gene trees differ substantially from each other and tend to be overestimates. To overcome these issues, we developed three new IC measures based on the frequencies of quartets, which naturally apply to both complete and partial trees. Comparison of our new quartet-based measures to previous bipartition-based measures on simulated data shows that: (1) on complete data sets, both quartet-based and bipartition-based measures yield very similar IC scores; (2) IC scores of quartet-based measures on a given data set with and without missing taxa are more similar than the scores of bipartition-based measures; and (3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in phylogenetic inference than bipartition-based measures. Additionally, the analysis of an empirical mammalian phylogenomic data set using our quartet-based measures reveals the presence of substantial levels of incongruence for numerous internal branches. An efficient open-source implementation of these quartet-based measures is freely available in the program QuartetScores (https://github.com/lutteropp/QuartetScores).

IMPROVED ESTIMATION OF COVARIANCE MATRIX IN HOTELLING’S T² FOR MICROARRAY DATA

International Journal of Pharmacy and Pharmaceutical Sciences ◽

10.22159/ijpps.2016v8s2.15215 ◽

2016 ◽

Vol 8 (2) ◽

pp. 26 ◽

Cited By ~ 1

Author(s):

Suryaefiza Karjanto ◽

Norazan Mohamed Ramli ◽

Nor Azura Md Ghani

Keyword(s):

Covariance Matrix ◽

Microarray Data ◽

Simulated Data ◽

Principal Component ◽

Data Sets ◽

Similar Data ◽

Shrinkage Methods ◽

Differential Gene ◽

The Relationship ◽

Simulated Data Sets

The relationship between genes in gene set analysis of microarray data is analyzed using Hotelling’s T², but the test can be applied only when the number of samples is larger than the number of variables, which is uncommon in microarray data. Thus, in this study, we proposed shrinkage approaches to estimating the covariance matrix in Hotelling’s T², particularly to address the high-dimensionality problem in microarray data. Three shrinkage covariance methods were proposed in this study and are referred to as Shrink A, Shrink B and Shrink C. The three proposed shrinkage methods were compared with the Regularized Covariance Matrix Approach (RCMAT) and Kong’s Principal Component Analysis (KPCA). The performance of the proposed methods was assessed using several cases of simulated data sets. In many cases, the Shrink A method performed the best, followed by the Shrink C and RCMAT methods. In contrast, both the Shrink B and KPCA methods showed relatively poor results. The study contributes to the establishment of a modified multivariate approach to differential gene expression analysis and is expected to be applicable in other areas with similar data characteristics.
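
The role of shrinkage in Hotelling’s T² can be sketched in a few lines. The example below assumes a generic estimator that shrinks the sample covariance toward its own diagonal with a hand-picked weight `lam`; it is not Shrink A, B or C from the study, whose definitions are not given in the abstract, and all function names are hypothetical.

```python
def sample_covariance(rows):
    """Return (column means, unbiased sample covariance) for a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return means, cov


def shrink_to_diagonal(cov, lam):
    """(1 - lam) * S + lam * diag(S): off-diagonals are scaled by (1 - lam)."""
    p = len(cov)
    return [[cov[i][j] if i == j else (1.0 - lam) * cov[i][j]
             for j in range(p)] for i in range(p)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def hotelling_t2(rows, lam=0.0, mu0=None):
    """One-sample T^2 = n * d' S*^{-1} d, where d = mean - mu0."""
    n = len(rows)
    means, cov = sample_covariance(rows)
    mu0 = mu0 or [0.0] * len(means)
    d = [m - m0 for m, m0 in zip(means, mu0)]
    s_star = shrink_to_diagonal(cov, lam)
    return n * sum(di * xi for di, xi in zip(d, solve(s_star, d)))
```

With three samples and four variables, `hotelling_t2(rows, lam=0.0)` fails because the sample covariance is singular, while any `lam` strictly between 0 and 1 gives an invertible matrix as long as every variable has non-zero variance. That is the practical motivation for shrinkage when variables outnumber samples.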

Co-estimating Reticulate Phylogenies and Gene Trees from Multi-locus Sequence Data

10.1101/095539 ◽

2016 ◽

Cited By ~ 6

Author(s):

Dingqiao Wen ◽

Luay Nakhleh

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Simulated Data ◽

Biological Data ◽

Generative Model ◽

Divergence Times ◽

Gene Trees ◽

Lineage Sorting ◽

Coalescence Times

Abstract The multispecies network coalescent (MSNC) is a stochastic process that captures how gene trees grow within the branches of a phylogenetic network. Coupling the MSNC with a stochastic mutational process that operates along the branches of the gene trees gives rise to a generative model of how multiple loci from within and across species evolve in the presence of both incomplete lineage sorting (ILS) and reticulation (e.g., hybridization). We report on a Bayesian method for sampling the parameters of this generative model, including the species phylogeny, gene trees, divergence times, and population sizes, from DNA sequences of multiple independent loci. We demonstrate the utility of our method by analyzing simulated data and reanalyzing three biological data sets. Our results demonstrate the significance of not only co-estimating species phylogenies and gene trees, but also accounting for reticulation and ILS simultaneously. In particular, we show that when gene flow occurs, our method accurately estimates the evolutionary histories, coalescence times, and divergence times. Tree inference methods, on the other hand, underestimate divergence times and overestimate coalescence times when the evolutionary history is reticulate. While the MSNC corresponds to an abstract model of “intermixture,” we study the performance of the model and method on simulated data generated under a gene flow model. We show that the method accurately infers the most recent time at which gene flow occurs. Finally, we demonstrate the application of the new method to a 106-locus yeast data set. [Multispecies network coalescent; reticulation; incomplete lineage sorting; phylogenetic network; Bayesian inference; RJMCMC.]

EXACT SOLUTIONS FOR SPECIES TREE INFERENCE FROM DISCORDANT GENE TREES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013420055 ◽

2013 ◽

Vol 11 (05) ◽

pp. 1342005 ◽

Cited By ~ 16

Author(s):

WEN-CHIEH CHANG ◽

PAWEŁ GÓRECKI ◽

OLIVER EULENSTEIN

Keyword(s):

Exact Solutions ◽

Gene Tree ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Worst Case

Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well-studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst-case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.
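
The RF supertree problem scores a candidate species tree by its Robinson-Foulds (RF) distance to the input trees. The sketch below computes that distance for two rooted trees on the same leaf set, written as nested tuples. The representation and function names are assumptions of this example, and it shows only the scoring step, not the dynamic programming algorithms described in the abstract.

```python
def clusters(tree):
    """Return the non-trivial clusters (leaf sets of internal nodes) of a
    rooted tree given as nested tuples of leaf-name strings."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)  # the root cluster is shared by every tree
    return found


def rf_distance(tree_a, tree_b):
    """Rooted RF distance: number of clusters found in exactly one tree."""
    return len(clusters(tree_a) ^ clusters(tree_b))


def total_rf_cost(candidate, gene_trees):
    """Sum of RF distances from one candidate species tree to all gene trees."""
    return sum(rf_distance(candidate, gene_tree) for gene_tree in gene_trees)
```

An exact method has to find the candidate that minimizes `total_rf_cost` over all possible species trees. Because the number of candidate trees grows super-exponentially with the number of taxa, exhaustive enumeration is impractical beyond very small instances.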

HyDe: A Python Package for Genome-Scale Hybridization Detection

Systematic Biology ◽

10.1093/sysbio/syy023 ◽

2018 ◽

Vol 67 (5) ◽

pp. 821-829 ◽

Cited By ~ 36

Author(s):

Paul D Blischak ◽

Julia Chifman ◽

Andrea D Wolfe ◽

Laura S Kubatko

Keyword(s):

Genome Scale ◽

Python Package ◽

Hybridization Detection

The Effects of Pollen and Seed Migration on Nuclear-Dicytoplasmic Systems. II. A New Method for Estimating Plant Gene Flow From Joint Nuclear-Cytoplasmic Data

Genetics ◽

10.1093/genetics/155.2.833 ◽

2000 ◽

Vol 155 (2) ◽

pp. 833-854 ◽

Cited By ~ 2

Author(s):

Maria E Orive ◽

Marjorie A Asmussen

Keyword(s):

Gene Flow ◽

Simulated Data ◽

Likelihood Method ◽

Data Sets ◽

Mixed Mating ◽

Hybrid Zones ◽

Plant Gene ◽

Seed Migration ◽

Genotype Frequencies ◽

Source Populations

Abstract A new maximum-likelihood method is developed for estimating unidirectional pollen and seed flow in mixed-mating plant populations from counts of joint nuclear-cytoplasmic genotypes. Data may include multiple unlinked nuclear markers with a single maternally or paternally inherited cytoplasmic marker, or with two cytoplasmic markers inherited through opposite parents, as in many conifer species. Migration rate estimates are based on fitting the equilibrium genotype frequencies under continent-island models of plant gene flow to the data. Detailed analysis of their equilibrium structures indicates when each of the three nuclear-cytoplasmic systems allows gene flow estimation and shows that, in general, it is easier to estimate seed than pollen migration. Three-locus nuclear-dicytoplasmic data only increase the conditions allowing seed migration estimates; however, the additional dicytonuclear disequilibria allow more accurate estimates of both forms of gene flow. Estimates and their confidence limits for simulated data sets confirm that two-locus data with paternal cytoplasmic inheritance provide better estimates than those with maternal inheritance, while three-locus dicytonuclear data with three modes of inheritance generally provide the most reliable estimates for both types of gene flow. Similar results are obtained for hybrid zones receiving pollen and seed flow from two source populations. An estimation program is available upon request.

Medusa: software to build and analyze ensembles of genome-scale metabolic network reconstructions

10.1101/547174 ◽

2019 ◽

Cited By ~ 1

Author(s):

Gregory L. Medlock ◽

Jason A. Papin

Keyword(s):

Machine Learning ◽

Experimental Data ◽

Computational Biology ◽

Metabolic Network ◽

Metabolic Networks ◽

Ensemble Simulations ◽

Link Type ◽

Genome Scale ◽

Python Package

Abstract Uncertainty in the structure and parameters of networks is ubiquitous across computational biology. In constraint-based reconstruction and analysis of metabolic networks, this uncertainty is present both during the reconstruction of networks and in simulations performed with them. Here, we present Medusa, a Python package for the generation and analysis of ensembles of genome-scale metabolic network reconstructions. Medusa builds on the COBRApy package for constraint-based reconstruction and analysis by compressing a set of models into a compact ensemble object, providing functions for the generation of ensembles using experimental data, and extending constraint-based analyses to ensemble scale. We demonstrate how Medusa can be used to generate ensembles and perform ensemble simulations, and how machine learning can be used in conjunction with Medusa to guide the curation of genome-scale metabolic network reconstructions. Medusa is available under the permissive MIT license from the Python Packaging Index (https://pypi.org/) and from GitHub (https://github.com/gregmedlock/Medusa/), and comprehensive documentation is available at https://medusa.readthedocs.io/en/latest/.

ProbAnnoWeb and ProbAnnoPy: probabilistic annotation and gap-filling of metabolic reconstructions

10.1101/151258 ◽

2017 ◽

Author(s):

Brendan King ◽

Terry Farrah ◽

Matthew Richards ◽

Michael Mundy ◽

Evangelos Simeonidis ◽

...

Keyword(s):

Web Service ◽

Gap Filling ◽

Flux Balance ◽

Link Type ◽

Likelihood Score ◽

Genome Scale ◽

Python Package

Abstract Summary: Gap-filling is a necessary step to produce quality genome-scale metabolic reconstructions capable of flux-balance simulation. Most available gap-filling tools use an organism-agnostic approach, where reactions are selected from a database to fill gaps without consideration of the target organism. Conversely, our likelihood-based gap-filling with probabilistic annotations selects candidate reactions based on a likelihood score derived specifically from the target organism’s genome. Here, we present two new implementations of probabilistic annotation and likelihood-based gap-filling: a web service called ProbAnnoWeb, and a standalone Python package called ProbAnnoPy. Availability and Implementation: Our tools are available as a web service with no installation needed (ProbAnnoWeb), available at http://probannoweb.systemsbiology.net, and as a local Python package implementation (ProbAnnoPy), available for download at http://github.com/PriceLab/probannopy. Contact: [email protected]; [email protected]

Inference of genome 3D architecture by modeling overdispersion of Hi-C data

10.1101/2021.02.04.429864 ◽

2021 ◽

Author(s):

Nelle Varoquaux ◽

William S. Noble ◽

Jean-Philippe Vert

Keyword(s):

3D Model ◽

Negative Binomial ◽

Simulated Data ◽

Poisson Model ◽

Random Variable ◽

Dispersion Parameter ◽

Supplementary Information ◽

Data Sets ◽

Link Type ◽

The Mean

We address the challenge of inferring a consensus 3D model of genome architecture from Hi-C data. Existing approaches most often rely on a two-step algorithm: first convert the contact counts into distances, then optimize an objective function akin to multidimensional scaling (MDS) to infer a 3D model. Other approaches use a maximum likelihood approach, modeling the contact counts between two loci as a Poisson random variable whose intensity is a decreasing function of the distance between them. However, a Poisson model of contact counts implies that the variance of the data is equal to the mean, a relationship that is often too restrictive to properly model count data. We first confirm the presence of overdispersion in several real Hi-C data sets, and we show that the overdispersion arises even in simulated data sets. We then propose a new model, called Pastis-NB, where we replace the Poisson model of contact counts by a negative binomial one, which is parametrized by a mean and a separate dispersion parameter. The dispersion parameter allows the variance to be adjusted independently from the mean, thus better modeling overdispersed data. We compare the results of Pastis-NB to those of several previously published algorithms: three MDS-based methods (ShRec3D, ChromSDE, and Pastis-MDS) and a statistical method based on a Poisson model of the data (Pastis-PM). We show that the negative binomial inference yields more accurate structures on simulated data, and more robust structures than other models across real Hi-C replicates and across different resolutions. A Python implementation of Pastis-NB is available at https://github.com/hiclib/pastis under the BSD license. Supplementary information is available at https://nellev.github.io/pastisnb/.
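
The difference between the two count models can be seen by comparing their log-likelihoods on overdispersed counts. The sketch below assumes one common negative binomial parametrization, with variance equal to mean + mean² / dispersion; the parametrization used by Pastis-NB may differ, and the function names and toy counts are hypothetical.

```python
import math


def nb_variance(mean, dispersion):
    """Variance under the assumed parametrization: mean + mean**2 / dispersion."""
    return mean + mean ** 2 / dispersion


def poisson_loglik(counts, mean):
    """Log-likelihood of counts under a Poisson model (variance == mean)."""
    return sum(c * math.log(mean) - mean - math.lgamma(c + 1) for c in counts)


def negbin_loglik(counts, mean, dispersion):
    """Log-likelihood under a negative binomial with the given mean and
    dispersion; smaller dispersion values mean stronger overdispersion."""
    r = dispersion
    p = r / (r + mean)  # success probability implied by mean and dispersion
    return sum(
        math.lgamma(c + r) - math.lgamma(r) - math.lgamma(c + 1)
        + r * math.log(p) + c * math.log(1.0 - p)
        for c in counts
    )


# Toy counts with mean 10 but a much larger spread than a Poisson model allows.
toy_counts = [0, 1, 2, 3, 5, 8, 20, 41]
```

On `toy_counts`, `negbin_loglik(toy_counts, 10.0, 1.0)` is higher than `poisson_loglik(toy_counts, 10.0)`, since the Poisson model heavily penalizes the large counts. As the dispersion parameter grows, the negative binomial approaches the Poisson model.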
