VESPA: Very large-scale Evolutionary and Selective Pressure Analyses

10.7287/peerj.preprints.1895v1 ◽

2016 ◽

Author(s):

Andrew E. Webb ◽

Thomas A. Walsh ◽

Mary J O'Connell

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Selective Pressure ◽

Gene Families ◽

Pressure Variation ◽

Phylogeny Reconstruction ◽

Entire Genome ◽

Protein Coding ◽

Large Sets ◽

Pressure Analysis

Large-scale molecular evolutionary analyses of protein-coding sequences require a number of interrelated preparatory steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can present significant challenges, particularly when working with entire genomes from large sets of species. We present VESPA, software capable of automating a selective pressure analysis using codeML, in addition to the preparatory analyses and summary statistics. VESPA is written in Python and is designed to run within a UNIX environment. Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to help the user understand the results. VESPA may be found at the following website: www.mol-evol.org/VESPA


VESPA: Very large-scale Evolutionary and Selective Pressure Analyses

10.7287/peerj.preprints.1895 ◽

2017 ◽

Author(s):

Andrew E. Webb ◽

Thomas A. Walsh ◽

Mary J O'Connell

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Selective Pressure ◽

Gene Families ◽

Pressure Variation ◽

Phylogeny Reconstruction ◽

Entire Genome ◽

Protein Coding ◽

Large Sets ◽

Pressure Analysis

Large-scale molecular evolutionary analyses of protein-coding sequences require a number of interrelated preparatory steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can present significant challenges, particularly when working with entire genomes from large sets of species. We present VESPA, software capable of automating a selective pressure analysis using codeML, in addition to the preparatory analyses and summary statistics. VESPA is written in Python and is designed to run within a UNIX environment. Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to help the user understand the results. VESPA may be found at the following website: www.mol-evol.org/VESPA


VESPA: Very large-scale Evolutionary and Selective Pressure Analyses

10.7287/peerj.preprints.1895v2 ◽

2017 ◽

Author(s):

Andrew E. Webb ◽

Thomas A. Walsh ◽

Mary J O'Connell

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Selective Pressure ◽

Gene Families ◽

Pressure Variation ◽

Phylogeny Reconstruction ◽

Entire Genome ◽

Protein Coding ◽

Large Sets ◽

Pressure Analysis

Large-scale molecular evolutionary analyses of protein-coding sequences require a number of interrelated preparatory steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can present significant challenges, particularly when working with entire genomes from large sets of species. We present VESPA, software capable of automating a selective pressure analysis using codeML, in addition to the preparatory analyses and summary statistics. VESPA is written in Python and is designed to run within a UNIX environment. Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to help the user understand the results. VESPA may be found at the following website: www.mol-evol.org/VESPA


Arabidopsis Genes Essential for Seedling Viability: Isolation of Insertional Mutants and Molecular Cloning

Genetics ◽

10.1093/genetics/159.4.1765 ◽

2001 ◽

Vol 159 (4) ◽

pp. 1765-1778

Author(s):

Gregory J Budziszewski ◽

Sharon Potter Lewis ◽

Lyn Wegrich Glover ◽

Jennifer Reineke ◽

Gary Jones ◽

...

Keyword(s):

Large Scale ◽

Protein Translocation ◽

Gene Families ◽

Mutant Phenotype ◽

Lethal Mutant ◽

A Genome ◽

Genes Encoding ◽

High Level ◽

Mutant Lines ◽

Genome Scale

Abstract We have undertaken a large-scale genetic screen to identify genes with a seedling-lethal mutant phenotype. From screening ~38,000 insertional mutant lines, we identified >500 seedling-lethal mutants, completed cosegregation analysis of the insertion and the lethal phenotype for >200 mutants, molecularly characterized 54 mutants, and provided a detailed description for 22 of them. Most of the seedling-lethal mutants seem to affect chloroplast function because they display altered pigmentation and affect genes encoding proteins predicted to have chloroplast localization. Although a high level of functional redundancy in Arabidopsis might be expected because 65% of genes are members of gene families, we found that 41% of the essential genes found in this study are members of Arabidopsis gene families. In addition, we isolated several interesting classes of mutants and genes. We found three mutants in the recently discovered nonmevalonate isoprenoid biosynthetic pathway and mutants disrupting genes similar to Tic40 and tatC, which are likely to be involved in chloroplast protein translocation. Finally, we directly compared T-DNA and Ac/Ds transposon mutagenesis methods in Arabidopsis on a genome scale. In each population, we found only about one-third of the insertion mutations cosegregated with a mutant phenotype.


Systematic Detection of Large-Scale Multi-Gene Horizontal Transfer in Prokaryotes

Molecular Biology and Evolution ◽

10.1093/molbev/msab043 ◽

2021 ◽

Author(s):

Lina Kloub ◽

Sean Gosselin ◽

Matthew Fullmer ◽

Joerg Graf ◽

J Peter Gogarten ◽

...

Keyword(s):

Gene Transfer ◽

Large Scale ◽

Single Gene ◽

Gene Families ◽

Microbial Evolution ◽

Phylogenetic Distance ◽

Secretion Systems ◽

Type Iii Secretion Systems ◽

A Genome ◽

Conserved Gene

Abstract Horizontal gene transfer (HGT) is central to prokaryotic evolution. However, little is known about the “scale” of individual HGT events. In this work, we introduce the first computational framework to help answer the following fundamental question: How often does more than one gene get horizontally transferred in a single HGT event? Our method, called HoMer, uses phylogenetic reconciliation to infer single-gene HGT events across a given set of species/strains, employs several techniques to account for inference error and uncertainty, combines that information with gene order information from extant genomes, and uses statistical analysis to identify candidate horizontal multi-gene transfers (HMGTs) in both extant and ancestral species/strains. HoMer is highly scalable and can be easily used to infer HMGTs across hundreds of genomes. We apply HoMer to a genome-scale dataset of over 22000 gene families from 103 Aeromonas genomes and identify a large number of plausible HMGTs of various scales at both small and large phylogenetic distances. Analysis of these HMGTs reveals interesting relationships between gene function, phylogenetic distance, and frequency of multi-gene transfer. Among other insights, we find that (i) the observed relative frequency of HMGT increases as divergence between genomes increases, (ii) HMGTs often have conserved gene functions, and (iii) rare genes are frequently acquired through HMGT. We also analyze in detail HMGTs involving the zonula occludens toxin and type III secretion systems. By enabling the systematic inference of HMGTs on a large scale, HoMer will facilitate a more accurate and more complete understanding of HGT and microbial evolution.


GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

10.1101/779066 ◽

2019 ◽

Cited By ~ 3

Author(s):

Benoit Morel ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis ◽

Gergely J. Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.
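The abstract evaluates accuracy by relative Robinson-Foulds distance to the true tree. As a rough illustration of that metric only (not GeneRax code), the sketch below compares two small trees. It assumes trees are given as nested Python tuples with string leaf names, compares the unrooted bipartitions they induce, and normalizes by the total number of non-trivial bipartitions in both trees; the abstract does not state which normalization the authors used, and the function names are hypothetical.

```python
def leaf_sets(tree):
    """Return (leaves under this node, leaf sets of every internal node at or below it)."""
    if isinstance(tree, str):          # a leaf is just its name
        return frozenset([tree]), []
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def bipartitions(tree):
    """Non-trivial bipartitions of the tree, each stored as the side without a reference leaf."""
    all_leaves, clades = leaf_sets(tree)
    ref = min(all_leaves)              # fixed reference leaf makes each split canonical
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def relative_rf(tree_a, tree_b):
    """Symmetric difference of bipartition sets divided by their combined size (0.0 to 1.0)."""
    a, b = bipartitions(tree_a), bipartitions(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
rerooted = ("A", ("B", ("C", ("D", "E"))))   # same unrooted topology as true_tree

print(relative_rf(true_tree, inferred))   # one of two splits differs in each tree
print(relative_rf(true_tree, rerooted))   # rooting does not change the bipartitions
```

A value of 0 means the two trees share every non-trivial bipartition; a value of 1 means they share none. Both trees must have the same leaf set for the comparison to be meaningful.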


SINE jumping contributes to large-scale polymorphisms in the pig genomes

10.21203/rs.3.rs-352249/v1 ◽

2021 ◽

Author(s):

Cai Chen ◽

Enrico D'Alessandro ◽

Eduard Murani ◽

Yao Zheng ◽

Domenico Giosa ◽

...

Keyword(s):

Genetic Analysis ◽

Population Genetic ◽

Large Scale ◽

Structural Variations ◽

Population Genetic Analysis ◽

Protein Coding ◽

Pig Breeds ◽

A Genome ◽

Trait Locus ◽

And Cluster Analysis

Abstract Background: Molecular markers based on retrotransposon insertion polymorphisms (RIPs) have been developed and are widely used in plants and animals. Short interspersed nuclear elements (SINEs) exert wide impacts on gene activity and even on phenotypes. However, SINE RIP profiles in livestock remain largely unknown and have not been characterized in pigs. Results: Our data revealed that SINEA1 displayed the most polymorphic insertions (22.5% intragenic and 26.5% intergenic), followed by SINEA2 (10.5% intragenic and 9% intergenic) and SINEA3 (12.5% intragenic and 5.0% intergenic). We developed a genome-wide SINE RIP mining protocol and obtained a large number of SINE RIPs (36,284), with over 80% accuracy and an even distribution across chromosomes (14.5/Mb), and with 74.34% of SINE RIPs generated by the SINEA1 element. Over 65% of pig SINE RIPs overlap with genes, with significant enrichment in the first and second introns of protein-coding and long non-coding RNA genes. Nearly half of the RIPs are common in these pig breeds. Sixteen SINE RIPs were applied to population genetic analysis in 23 pig breeds; the phylogenetic tree and cluster analysis were generally consistent with the geographical distributions of native pig breeds in China. Conclusions: Our analysis revealed that SINEA1–3 elements, particularly SINEA1, are highly polymorphic across different pig breeds and generate large-scale structural variations in the pig genomes. Over 35,000 SINE RIP markers were obtained. These data indicate that young SINE elements play important roles in creating new genetic variations and shaping the evolution of the pig genome, and also provide strong evidence to support the great potential of SINE RIPs as genetic markers, which can be used for population genetic analysis and quantitative trait locus (QTL) mapping in pigs.


gEVE: a genome-based endogenous viral element database provides comprehensive viral protein-coding sequences in mammalian genomes

Database ◽

10.1093/database/baw087 ◽

2016 ◽

Vol 2016 ◽

pp. baw087 ◽

Cited By ~ 17

Author(s):

So Nakagawa ◽

Mahoko Ueda Takahashi

Keyword(s):

Viral Protein ◽

Protein Coding ◽

Coding Sequences ◽

A Genome ◽

Mammalian Genomes


Exploration of the Germline Genome of the Ciliate Chilodonella uncinata through Single-Cell Omics (Transcriptomics and Genomics)

mBio ◽

10.1128/mbio.01836-17 ◽

2018 ◽

Vol 9 (1) ◽

Cited By ~ 12

Author(s):

Xyrus X. Maurer-Alcalá ◽

Rob Knight ◽

Laura A. Katz

Keyword(s):

Single Cell ◽

Large Scale ◽

Gc Content ◽

Gene Families ◽

Genome Rearrangements ◽

Paramecium Tetraurelia ◽

Protein Coding ◽

Genome Data ◽

Large Gene ◽

Disproportionate Number

ABSTRACT Separate germline and somatic genomes are found in numerous lineages across the eukaryotic tree of life, often separated into distinct tissues (e.g., in plants, animals, and fungi) or distinct nuclei sharing a common cytoplasm (e.g., in ciliates and some foraminifera). In ciliates, germline-limited (i.e., micronuclear-specific) DNA is eliminated during the development of a new somatic (i.e., macronuclear) genome in a process that is tightly linked to large-scale genome rearrangements, such as deletions and reordering of protein-coding sequences. Most studies of germline genome architecture in ciliates have focused on the model ciliates Oxytricha trifallax, Paramecium tetraurelia, and Tetrahymena thermophila, for which the complete germline genome sequences are known. Outside of these model taxa, only a few dozen germline loci have been characterized from a limited number of cultivable species, which is likely due to difficulties in obtaining sufficient quantities of “purified” germline DNA in these taxa. Combining single-cell transcriptomics and genomics, we have overcome these limitations and provide the first insights into the structure of the germline genome of the ciliate Chilodonella uncinata, a member of the understudied class Phyllopharyngea. Our analyses reveal the following: (i) large gene families contain a disproportionate number of genes from scrambled germline loci; (ii) germline-soma boundaries in the germline genome are demarcated by substantial shifts in GC content; (iii) single-cell omics techniques provide large-scale quality germline genome data with limited effort, at least for ciliates with extensively fragmented somatic genomes. Our approach provides an efficient means to understand better the evolution of genome rearrangements between germline and soma in ciliates.

IMPORTANCE Our understanding of the distinctions between germline and somatic genomes in ciliates has largely relied on studies of a few model genera (e.g., Oxytricha, Paramecium, Tetrahymena). We have used single-cell omics to explore germline-soma distinctions in the ciliate Chilodonella uncinata, which likely diverged from the better-studied ciliates ~700 million years ago. The analyses presented here indicate that developmentally regulated genome rearrangements between germline and soma are demarcated by rapid transitions in local GC composition and lead to diversification of protein families. The approaches used here provide the basis for future work aimed at discerning the evolutionary impacts of germline-soma distinctions among diverse ciliates.


GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss

Molecular Biology and Evolution ◽

10.1093/molbev/msaa141 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2763-2774 ◽

Cited By ~ 5

Author(s):

Benoit Morel ◽

Alexey M Kozlov ◽

Alexandros Stamatakis ◽

Gergely J Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).
