PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data

Author(s):  
Jacob L Steenwyk ◽  
Thomas J Buida ◽  
Abigail L Labella ◽  
Yuanning Li ◽  
Xing-Xing Shen ◽  
...  

Abstract Motivation Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes, and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock, and collapsing bipartitions (internal branches) with low support. Results To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes; and (3) identifying lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining, and deriving biological meaning from increasingly large phylogenomic datasets. Availability PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPI (https://pypi.org/project/phykit/), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT). Supplementary information Supplementary data are available on figshare (doi: 10.6084/m9.figshare.13118600) and at Bioinformatics online.
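As a rough illustration of use case (2), gene-gene covariation of evolutionary rates can be pictured as a correlation between the branch lengths of two gene trees measured on the same set of branches. The sketch below is a simplified stand-in, not PhyKIT's implementation: the function names `relative_rates` and `pearson` are hypothetical, the branch lengths are made-up inputs, and normalizing by species-tree branch lengths is an assumption about how rates might be made comparable.

```python
from math import sqrt


def relative_rates(gene_branches, species_branches):
    """Divide each gene-tree branch length by the matching species-tree branch.

    Assumption: both lists describe the same branches in the same order.
    """
    if len(gene_branches) != len(species_branches):
        raise ValueError("branch lists must have the same length")
    return [g / s for g, s in zip(gene_branches, species_branches)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical branch lengths for two genes and a species tree.
species = [0.10, 0.20, 0.40, 0.05]
gene_a = [0.12, 0.30, 0.44, 0.04]
gene_b = [0.11, 0.28, 0.42, 0.03]

r = pearson(relative_rates(gene_a, species), relative_rates(gene_b, species))
print(f"covariation (Pearson r) = {r:.3f}")
```

A coefficient near 1 would suggest the two genes speed up and slow down on the same branches, which is the kind of signal the abstract associates with functional relationships.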

Author(s):  
Jacob L. Steenwyk ◽  
Thomas J. Buida ◽  
Abigail L. Labella ◽  
Yuanning Li ◽  
Xing-Xing Shen ◽  
...  

Abstract Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes, and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock, and collapsing bipartitions (internal branches) with low support. To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes; and (3) identifying lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining, and deriving biological meaning from increasingly large phylogenomic datasets. PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), and documentation, including user tutorials, is available online (https://jlsteenwyk.com/PhyKIT).


Author(s):  
Jun Wang ◽  
Pu-Feng Du ◽  
Xin-Yu Xue ◽  
Guang-Ping Li ◽  
Yuan-Ke Zhou ◽  
...  

Abstract Summary Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we present VisFeature, a software tool that lets users intuitively visualize and analyze statistical features of all types of biological sequences, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignment and statistical feature generation functions. Availability and implementation VisFeature is a desktop application implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Fabian Sievers ◽  
Desmond G Higgins

Abstract Motivation Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure the accuracy of a multiple sequence alignment. SSPA correlates well with the sum-of-pairs (SP) score if the results are averaged over many alignments, but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest. Results We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks. Availability and implementation QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar and comprises reference and non-reference sequence sets and a scoring script. Supplementary information Supplementary data are available at Bioinformatics online.
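For readers unfamiliar with the sum-of-pairs score mentioned above, the following is a minimal sketch of one common definition: the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. It is not the QuanTest2 scoring script; the function names and the toy alignments are hypothetical, and gaps are assumed to be written as `-`.

```python
def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    alignment: list of equal-length gapped strings, one per sequence.
    Each residue is identified as (sequence index, ungapped residue index).
    """
    residue_counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_idx, residue_counters[seq_idx]))
                residue_counters[seq_idx] += 1
        # Every pair of residues in the same column counts as aligned.
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


# Toy example: the test alignment misplaces one gap.
reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
print(sum_of_pairs_score(test, reference))
```

In the toy example, two of the three reference pairs survive in the test alignment, so the score falls below 1.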


2019 ◽  
Author(s):  
Anton Suvorov ◽  
Joshua Hochuli ◽  
Daniel R. Schrider

Abstract Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with its own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.


Author(s):  
Saisai Sun ◽  
Wenkai Wang ◽  
Zhenling Peng ◽  
Jianyi Yang

Abstract Motivation In recent years, deep neural networks have accurately predicted inter-residue contacts and distances in proteins, significantly improving the accuracy of predicted protein structure models. In contrast, fewer studies have addressed the prediction of RNA inter-nucleotide 3D closeness. Results We propose a new algorithm named RNAcontact for the prediction of RNA inter-nucleotide 3D closeness. RNAcontact is built on deep residual neural networks. The covariance information from multiple sequence alignments and the predicted secondary structure are used as the input features of the networks. Experiments show that RNAcontact achieves precisions of 0.8 and 0.6 for the top L/10 and top L predictions, respectively (where L is the length of an RNA), on an independent test set, significantly higher than other evolutionary coupling methods. Analysis shows that about 1/3 of the correctly predicted 3D closenesses are not base pairings of secondary structure, which are critical to the determination of RNA structure. In addition, we demonstrate that the predicted 3D closeness can be used as distance restraints to guide RNA structure folding by the 3dRNA package. Models built with the predicted 3D closeness are more accurate than models built without it. Availability and implementation The webserver and a standalone package are available at: http://yanglab.nankai.edu.cn/RNAcontact/. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
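The top L/10 and top L precision figures quoted above follow a common evaluation convention: rank all predicted nucleotide pairs by score, keep the highest-scoring L/k pairs, and report the fraction that are true contacts. The sketch below shows that convention under stated assumptions; it is not taken from RNAcontact, and the function name `top_precision`, the `min_separation` filter, and the toy data are hypothetical.

```python
def top_precision(scores, true_pairs, length, k=1, min_separation=1):
    """Precision of the top length // k predictions.

    scores: dict mapping (i, j) with i < j to a predicted closeness score.
    true_pairs: set of (i, j) pairs that are close in the known structure.
    length: sequence length L.
    min_separation: assumed filter on j - i; benchmarks differ on this value.
    """
    n_top = max(1, length // k)
    candidates = [
        (score, pair)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_separation
    ]
    # Highest score first; only the score is used as the sort key.
    candidates.sort(key=lambda item: item[0], reverse=True)
    top = [pair for _, pair in candidates[:n_top]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_pairs)
    return hits / len(top)


# Toy example with L = 10 and three scored pairs.
scores = {(0, 5): 0.9, (1, 6): 0.8, (2, 7): 0.1}
true_pairs = {(0, 5), (2, 7)}
print(top_precision(scores, true_pairs, length=10, k=10))  # top L/10
print(top_precision(scores, true_pairs, length=10, k=1))   # top L
```

When fewer than L/k pairs are scored, this sketch divides by the number of pairs actually selected; other conventions divide by L/k instead, which would lower the reported value.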


2020 ◽  
Author(s):  
Lalitha Guruprasad

Coronavirus disease 2019 (COVID-19) is a pandemic infectious disease caused by the novel Severe Acute Respiratory Syndrome coronavirus-2 (SARS CoV-2). SARS CoV-2 is transmitted more rapidly and readily than SARS CoV. Both SARS CoV and SARS CoV-2 recognize the human angiotensin converting enzyme-2 (ACE-2) receptor via their glycosylated spike proteins. We generated multiple sequence alignments and phylogenetic trees for representative spike proteins of CoV and CoV-2 from various host sources in order to analyze the specificity in SARS CoV-2 spike proteins required for causing infection in humans. Our results show that two sequence motifs in the N-terminal domain, "MESEFR" and "SYLTPG", are specific to human SARS CoV-2 and pangolin SARS CoV. In the receptor binding domain (RBD), three sequence loops, VGGNY (loop 1), YQAGSTPC (loop 2) and EGFNCY (loop 3), and a tethered disulfide bridge Cys480-Cys488 connecting loops 2 and 3 are structural determinants for the recognition of the human ACE-2 receptor. The complete genome analysis of representative SARS CoVs from bat, civet, pangolin and human host sources and human SARS CoV-2 identified the bat genome (GenBank code: MN996532.1) and the pangolin SARS CoV genomes as closest to the recent novel human SARS CoV-2 genomes. The bat CoV genomes (GenBank codes: MG772933 and MG772934) are evolutionary intermediates in the mutagenesis progression towards becoming human SARS CoV-2.


2019 ◽  
Vol 5 ◽  
Author(s):  
Alexis Criscuolo

This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
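The step of turning a pairwise dissimilarity into an estimated number of substitutions can be illustrated with a standard correction. The abstract does not say which transformation JolyTree applies, so the sketch below swaps in the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), purely as an example; the function names and the dissimilarity values are hypothetical.

```python
from math import inf, log


def jukes_cantor_distance(p):
    """Jukes-Cantor estimate of substitutions per site.

    p: observed proportion of differing sites, with 0 <= p <= 1.
    The estimate is unbounded once p reaches 0.75 (saturation).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return 0.0
    if p >= 0.75:
        return inf
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def transform_matrix(dissimilarities):
    """Apply the correction to every entry of a square dissimilarity matrix."""
    return [[jukes_cantor_distance(p) for p in row] for row in dissimilarities]


# Hypothetical pairwise dissimilarities between three genomes.
raw = [
    [0.00, 0.10, 0.30],
    [0.10, 0.00, 0.25],
    [0.30, 0.25, 0.00],
]
for row in transform_matrix(raw):
    print(["%.4f" % d for d in row])
```

The corrected distances are always at least as large as the raw dissimilarities, because the correction accounts for multiple substitutions at the same site; a matrix like this could then feed any distance-based tree-building method.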

