FUSTr: a tool to find gene families under selection in transcriptomes

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4234 ◽  
Author(s):  
T. Jeffrey Cole ◽  
Michael S. Brewer

Background: The recent proliferation of large amounts of biodiversity transcriptomic data has resulted in an ever-expanding need for scalable and user-friendly tools capable of answering large-scale molecular evolution questions. FUSTr identifies gene families involved in the process of adaptation. It finds genes under strong positive selection in transcriptomic datasets and automatically detects isoform designation patterns in transcriptome assemblies to maximize phylogenetic independence in downstream analysis. Results: When applied to previously studied spider transcriptomic data as well as simulated data, FUSTr successfully grouped coding sequences into proper gene families and correctly identified those under strong positive selection in relatively little time. Conclusions: FUSTr provides a useful tool for novice bioinformaticians to characterize the molecular evolution of organisms throughout the tree of life using large transcriptomic biodiversity datasets, and it can utilize multi-processor high-performance computational facilities.

2017 ◽  
Author(s):  
T. Jeffrey Cole ◽  
Michael S. Brewer

Abstract FUSTr is a tool for finding genes under strong positive selection in transcriptomic datasets; it automatically detects isoform designation patterns in transcriptome assemblies to maximize phylogenetic independence in downstream analysis. When applied to previously studied spider toxin families as well as simulated data, FUSTr successfully grouped coding sequences into proper gene families and correctly identified those under strong positive selection. FUSTr provides a tool capable of utilizing multi-processor high-performance computational facilities and is scalable for large transcriptomic biodiversity datasets. Availability: FUSTr is freely available under a GNU license and can be downloaded at https://github.com/tijeco/[email protected]


Genetics ◽  
2003 ◽  
Vol 165 (4) ◽  
pp. 2269-2282
Author(s):  
D Mester ◽  
Y Ronin ◽  
D Minkov ◽  
E Nevo ◽  
A Korol

Abstract This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
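The size of the search space and the minimum-length criterion described in this abstract can be illustrated with a minimal Python sketch. It is not the evolution-strategy algorithm from the article: it uses exhaustive search, which is only feasible for a handful of markers, and the marker positions and the distance matrix `rf` are hypothetical values that treat recombination frequency as a simple additive distance.

```python
from itertools import permutations
from math import factorial


def count_orders(n):
    """Number of distinct orders of n loci; an order and its reverse are the same map."""
    return factorial(n) // 2 if n > 1 else 1


def map_length(order, rf):
    """Total map length: sum of pairwise distances between adjacent markers."""
    return sum(rf[a][b] for a, b in zip(order, order[1:]))


def best_order(rf):
    """Exhaustively find the order with minimal total length (small n only)."""
    n = len(rf)
    # Keep one representative of each order/reverse pair, giving n!/2 candidates.
    candidates = (p for p in permutations(range(n)) if p[0] < p[-1])
    return min(candidates, key=lambda p: map_length(p, rf))


# Hypothetical example: four markers at made-up positions along one linkage group.
positions = [0.25, 0.0, 0.3, 0.1]
rf = [[abs(x - y) for y in positions] for x in positions]

print(count_orders(4))   # number of candidate orders for 4 loci
print(best_order(rf))    # order that minimizes total map length
```

Because the candidate count grows as n!/2, this brute-force approach is already impractical for a few dozen markers, which is why heuristic optimizers are applied to maps with hundreds of markers.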


2019 ◽  
Author(s):  
Benoit Morel ◽  
Alexey M. Kozlov ◽  
Alexandros Stamatakis ◽  
Gergely J. Szöllősi

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges species tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.


2020 ◽  
Vol 37 (9) ◽  
pp. 2763-2774 ◽  
Author(s):  
Benoit Morel ◽  
Alexey M Kozlov ◽  
Alexandros Stamatakis ◽  
Gergely J Szöllősi

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020).  
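The relative Robinson-Foulds distance used as the accuracy measure above can be sketched in a few lines of Python. This is an illustrative variant only: it compares the clades of rooted trees written as nested tuples and normalizes by the total number of non-trivial clades in both trees. The abstract does not state which normalization or rooting convention the authors used, and the two example trees are hypothetical.

```python
def clades(tree):
    """Return (all leaves, set of non-trivial clades) for a nested-tuple rooted tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):          # a leaf is a plain string label
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)                  # record the clade below this internal node
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)              # the root clade is shared by every tree
    return all_leaves, found


def relative_rf(tree_a, tree_b):
    """Clades found in only one tree, divided by the total number of clades in both."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    total = len(clades_a) + len(clades_b)
    return len(clades_a ^ clades_b) / total if total else 0.0


# Hypothetical five-leaf example: the inferred tree misplaces leaf B.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))

print(relative_rf(true_tree, inferred_tree))
```

A value of 0 means the two trees contain exactly the same clades, and a value of 1 means they share none apart from the root.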


2015 ◽  
Vol 11 (7) ◽  
pp. 20150349 ◽  
Author(s):  
Alexander Van Nynatten ◽  
Devin Bloom ◽  
Belinda S. W. Chang ◽  
Nathan R. Lovejoy

Incursions of marine water into South America during the Miocene prompted colonization of freshwater habitats by ancestrally marine species and present a unique opportunity to study the molecular evolution of adaptations to varying environments. Freshwater and marine environments are distinct in both spectra and average intensities of available light. Here, we investigate the molecular evolution of rhodopsin, the photosensitive pigment in the eye that activates in response to light, in a clade of South American freshwater anchovies derived from a marine ancestral lineage. Using likelihood-based comparative sequence analyses, we found evidence for positive selection in the rhodopsin of freshwater anchovy lineages at sites known to be important for aspects of rhodopsin function such as spectral tuning. No evidence was found for positive selection in marine lineages, nor in three other genes not involved in vision. Our results suggest that an increased rate of rhodopsin evolution was driven by diversification into freshwater habitats, thereby constituting a rare example of molecular evolution mirroring large-scale palaeogeographic events.


2019 ◽  
Author(s):  
Manoj Kumar ◽  
Cameron Thomas Ellis ◽  
Qihong Lu ◽  
Hejia Zhang ◽  
Mihai Capota ◽  
...  

Advanced brain imaging analysis methods, including multivariate pattern analysis (MVPA), functional connectivity, and functional alignment, have become powerful tools in cognitive neuroscience over the past decade. These tools are implemented in custom code and separate packages, often requiring different software and language proficiencies. Although usable by expert researchers, novice users face a steep learning curve. These difficulties stem from the use of new programming languages (e.g., Python), learning how to apply machine-learning methods to high-dimensional fMRI data, and minimal documentation and training materials. Furthermore, most standard fMRI analysis packages (e.g., AFNI, FSL, SPM) focus on preprocessing and univariate analyses, leaving a gap in how to integrate with advanced tools. To address these needs, we developed BrainIAK (brainiak.org), an open-source Python software package that seamlessly integrates several cutting-edge, computationally efficient techniques with other Python packages (e.g., Nilearn, Scikit-learn) for file handling, visualization, and machine learning. To disseminate these powerful tools, we developed user-friendly tutorials (in Jupyter format; https://brainiak.org/tutorials/) for learning BrainIAK and advanced fMRI analysis in Python more generally. These materials cover techniques including: MVPA (pattern classification and representational similarity analysis); parallelized searchlight analysis; background connectivity; full correlation matrix analysis; inter-subject correlation; inter-subject functional connectivity; shared response modeling; event segmentation using hidden Markov models; and real-time fMRI. For long-running jobs or large memory needs we provide detailed guidance on high-performance computing clusters. These notebooks were successfully tested at multiple sites, including as problem sets for courses at Yale and Princeton universities and at various workshops and hackathons. 
These materials are freely shared, with the hope that they become part of a pool of open-source software and educational materials for large-scale, reproducible fMRI analysis and accelerated discovery.


2019 ◽  
Vol 3 (1) ◽  
pp. e201900546
Author(s):  
Matthias Blum ◽  
Pierre-Etienne Cholley ◽  
Valeriya Malysheva ◽  
Samuel Nicaise ◽  
Julien Moehlin ◽  
...  

The enormous amount of freely accessible functional genomics data is an invaluable resource for interrogating the biological function of multiple DNA-interacting players and chromatin modifications by large-scale comparative analyses. However, in practice, interrogating large collections of public data requires major efforts for (i) reprocessing available raw reads, (ii) incorporating quality assessments to exclude artefactual and low-quality data, and (iii) processing data by using high-performance computation. Here, we present qcGenomics, a user-friendly online resource for ultrafast retrieval, visualization, and comparative analysis of tens of thousands of genomics datasets to gain new functional insight from global or focused multidimensional data integration.


2019 ◽  
Vol 11 (7) ◽  
pp. 1897-1908 ◽  
Author(s):  
Zhenhua Zhang ◽  
Changfeng Qu ◽  
Ru Yao ◽  
Yuan Nie ◽  
Chenjie Xu ◽  
...  

Abstract Psychrophilic green algae from independent phylogenetic lines thrive in extreme polar environments, but the hypothesis that their psychrophilic characteristics appeared through parallel routes of molecular evolution remains untested. The recent surge of transcriptome data enables large-scale evolutionary analyses to investigate the genetic basis for the adaptations to the Antarctic extreme environment, and the identification of the selective forces that drive molecular evolution is the foundation for understanding the strategies of cold adaptation. Here, we conducted transcriptome sequencing of two Antarctic psychrophilic green algae (Chlamydomonas sp. ICE-L and Tetrabaena socialis) and performed positive selection and convergent substitution analyses to investigate their molecular convergence and adaptive strategies against extreme cold conditions. Our results revealed a considerable number of shared positively selected genes and significant evidence of molecular convergence in the two Antarctic psychrophilic algae. Significant evidence of positive selection and convergent substitution was detected in genes associated with photosynthetic machinery, multiple antioxidant systems, and several crucial translation elements in Antarctic psychrophilic algae. Our study reveals that the psychrophilic algae possess a more stable photosynthetic apparatus and multiple protective mechanisms, and it provides new clues to parallel adaptive evolution in Antarctic psychrophilic green algae.


2020 ◽  
Author(s):  
Muhammad Zulfiqar Ahmad ◽  
Xiangsheng Zeng ◽  
Qiang Dong ◽  
Sehrish Manan ◽  
Huanan Jin ◽  
...  

Abstract Background: Members of the BAHD acyltransferase (ACT) family play important roles in plant defence against biotic and abiotic stresses. Previous genome-wide studies explored different acyltransferase gene families, but no study so far has reported an overall genome-wide or positive selection analysis of the BAHD family genes in Glycine max. A better understanding of the functions that specific members of this family play in stress defence can lead to better breeding strategies for stress tolerance. Results: A total of 103 genes of the BAHD family (GmACT genes) were mined from the soybean genome, which could be grouped into four phylogenetic clades (I-IV). Clade III was further divided into two sub-clades (IIIA and IIIB). In each clade, the gene structures and motifs were relatively conserved. These 103 genes were distributed unequally on all 20 chromosomes, and 16 paralogous pairs were found within the family. Positive selection analysis revealed important amino acids under strong positive selection, which suggests that the evolution of this gene family modulated soybean domestication. The expression of most ACT genes in soybean was repressed by Al3+ and fungal elicitor exposure, except for GmACT84, whose expression increased 2- and 3-fold, respectively, under these conditions. The promoter region of GmACT84 contains the maximum number of stress-responsive elements among all GmACT genes and is especially enriched in MYB-related elements. Some GmACT genes showed condition-specific expression, while others showed constitutive expression in all soybean tissues or conditions analysed. Conclusions: This study provided a genome-wide analysis of the BAHD gene family and assessed their expression profiles. We found evidence of strong positive selection on GmACT genes.
Our findings will help efforts of functional characterisation of ACT genes in soybean in order to discover their involvement in growth, development, and defence mechanisms.


Author(s):  
Fuqiang Ma ◽  
Chun Yin Lau ◽  
Chaogu Zheng

Abstract The F-box and chemosensory GPCR (csGPCR) gene families are greatly expanded in nematodes, including the model organism Caenorhabditis elegans, compared to insects and vertebrates. However, the intraspecific evolution of these two gene families in nematodes remains unexamined. In this study, we analyzed the genomic sequences of 330 recently sequenced wild isolates of C. elegans using a range of population genetics approaches. We found that F-box and csGPCR genes, especially the Srw family csGPCRs, showed much more diversity than other gene families. Population structure analysis and phylogenetic analysis divided the wild strains into eight non-Hawaiian and three Hawaiian subpopulations. Some Hawaiian strains appeared to be more ancestral than all other strains. F-box and csGPCR genes maintained a great amount of the ancestral variants in the Hawaiian subpopulation, and their divergence among the non-Hawaiian subpopulations contributed significantly to population structure. F-box genes are mostly located at the chromosomal arms, and a high recombination rate correlates with their large polymorphism. Moreover, using both neutrality tests and Extended Haplotype Homozygosity analysis, we identified signatures of strong positive selection in the F-box and csGPCR genes among the wild isolates, especially in the non-Hawaiian population. Accumulation of high-frequency derived alleles in these genes was found in the non-Hawaiian population, leading to divergence from the ancestral genotype. In summary, we found that F-box and csGPCR genes harbour a large pool of natural variants, which may be subjected to positive selection. These variants are mostly mapped to the substrate-recognition domains of F-box proteins and the extracellular and intracellular regions of csGPCRs, possibly resulting in advantages during adaptation by affecting protein degradation and the sensing of environmental cues, respectively.

