Profile Comparer Extended: phylogeny of lytic polysaccharide monooxygenase families using profile hidden Markov model alignments

F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 1834 ◽  
Author(s):  
Gerben P. Voshol ◽  
Peter J. Punt ◽  
Erik Vijgenboom

Insight into the inter- and intra-family relationships of protein families is important, since it can aid understanding of substrate specificity evolution and help assign putative functions to proteins of unknown function. Studying both inter- and intra-family relationships requires the ability to build phylogenetic trees using the most sensitive sequence similarity search methods (e.g. profile hidden Markov model (pHMM)–pHMM alignments). However, existing solutions require a very long calculation time to obtain the phylogenetic tree, so a faster protocol is needed to make this approach practical for research. To contribute to this goal, we extended the original Profile Comparer program (PRC) to construct large pHMM phylogenetic trees several orders of magnitude faster than pHMM-tree. As an example, PRC Extended (PRCx) was used to study the phylogeny of over 10,000 sequences of lytic polysaccharide monooxygenase (LPMO) from over seven families. Using the newly developed program, we were able to reveal previously unknown homologs of LPMOs, namely the PFAM Egh16-like family. Moreover, we show that the substrate specificities have evolved independently several times within the LPMO superfamily. Furthermore, the LPMO phylogenetic tree does not seem to follow taxonomy-based classification.
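To illustrate the general idea of turning pairwise profile comparisons into a tree, the following is a minimal sketch only. It assumes that pairwise distances between profiles have already been derived from pHMM–pHMM alignment scores, and it uses plain UPGMA clustering as a stand-in; the abstract does not state which tree-building method PRCx uses, and the function name `upgma` and the toy distances are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Return a Newick topology built by UPGMA (illustrative stand-in only).

    names: list of leaf labels (e.g. profile identifiers)
    dist:  dict mapping a pair (a, b) of labels to a non-negative distance
    """
    # Cluster label -> number of leaves it contains
    sizes = {n: 1 for n in names}
    # Store distances under unordered keys so lookup order does not matter
    d = {frozenset(k): v for k, v in dist.items()}
    while len(sizes) > 1:
        # Merge the closest pair of current clusters
        a, b = min(combinations(sorted(sizes), 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        # Distance to the new cluster is the size-weighted average
        for c in sizes:
            if c not in (a, b):
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"
```

With four hypothetical profiles where A is close to B and C is close to D, the sketch groups each close pair first and then joins the two groups.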

2018 ◽  
Vol 13 (5) ◽  
pp. 1081-1095 ◽  
Author(s):  
Zhongliu Zhuo ◽  
Yang Zhang ◽  
Zhi-li Zhang ◽  
Xiaosong Zhang ◽  
Jingzhong Zhang

2003 ◽  
Vol 310 (2) ◽  
pp. 574-579 ◽  
Author(s):  
Norihiro Kikuchi ◽  
Yeon-Dae Kwon ◽  
Masanori Gotoh ◽  
Hisashi Narimatsu

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xujie Ren ◽  
Tao Shang ◽  
Yatong Jiang ◽  
Jianwei Liu

In the era of big data, next-generation sequencing produces a large amount of genomic data, which will further advance research in the biological sciences. However, the growth in data scale often leads to privacy issues. Even if the data are not public, an attacker can still steal private information through a member inference attack. In this paper, we propose a private profile hidden Markov model (PHMM) with differential identifiability for gene sequence clustering. By adding random noise to the model, the probability of identifying individuals in the database is limited. The gene sequences can then be clustered in an unsupervised manner, without labels, according to the output scores of the private PHMM. The variation of the divergence distance in the experimental results shows that the added noise distorts the profile hidden Markov model to a certain extent, and the maximum divergence distance can reach 15.47 when the amount of data is small. In addition, a cosine similarity comparison of the clustering model before and after adding noise shows that, as the privacy parameter changes, the clustering model is distorted to a lesser or greater degree, which allows it to defend against member inference attacks.
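The before-and-after comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' mechanism: the abstract does not specify the noise distribution or how it is calibrated to differential identifiability, so Laplace noise with a free `scale` parameter is assumed here, and the names `perturb_distribution` and `cosine_similarity` are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def perturb_distribution(probs, scale, rng):
    """Add Laplace noise to a probability vector, then renormalize.

    Assumption: Laplace noise is used for illustration only. A Laplace(0, scale)
    sample is drawn as the difference of two exponential samples with mean `scale`.
    """
    noisy = []
    for p in probs:
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        # Clip so that every entry stays strictly positive
        noisy.append(max(p + noise, 1e-9))
    total = sum(noisy)
    return [x / total for x in noisy]


# Hypothetical emission distribution over the four nucleotides at one match state
emissions = [0.7, 0.1, 0.1, 0.1]
rng = random.Random(0)
private_emissions = perturb_distribution(emissions, scale=0.05, rng=rng)
similarity = cosine_similarity(emissions, private_emissions)
```

A larger `scale` corresponds to stronger perturbation, so the similarity between the original and perturbed distributions would be expected to drop as the noise level grows.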


2019 ◽  
Author(s):  
Michael C. Grundler ◽  
Daniel L. Rabosky

The evolutionary dynamics of complex ecological traits – including multistate representations of diet, habitat, and behavior – remain poorly understood. Reconstructing the tempo, mode, and historical sequence of transitions involving such traits poses many challenges for comparative biologists, owing to their multidimensional nature and intraspecific variability. Continuous-time Markov chains (CTMC) are commonly used to model ecological niche evolution on phylogenetic trees but are limited by the assumption that taxa are monomorphic and that states are univariate categorical variables. Thus, a necessary first step when using standard CTMC models is to categorize species into a pre-determined number of ecological states. This approach potentially confounds interpretation of state assignments with effects of sampling variation because it does not directly incorporate empirical observations of resource use into the statistical inference model. The neglect of sampling variation, along with univariate representations of true multivariate phenotypes, potentially leads to the distortion and loss of information, with substantial implications for downstream macroevolutionary analyses. In this study, we develop a hidden Markov model using a Dirichlet-multinomial framework to model resource use evolution on phylogenetic trees. Unlike existing CTMC implementations, states are unobserved probability distributions from which observed data are sampled. Our approach is expressly designed to model ecological traits that are intra-specifically variable and to account for uncertainty in state assignments of terminal taxa arising from effects of sampling variation. The method uses multivariate count data for individual species to simultaneously infer the number of ecological states, the proportional utilization of different resources by different states, and the phylogenetic distribution of ecological states among living species and their ancestors.
The method is general and may be applied to any data expressible as a set of observational counts from different categories.
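As a small illustration of the Dirichlet-multinomial building block mentioned above, the sketch below evaluates the standard Dirichlet-multinomial log-probability of a vector of observational counts given concentration parameters. It covers only this one ingredient; how the authors combine it with the hidden Markov model on the tree is not described in the abstract, and the function name and example counts are hypothetical.

```python
import math


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of category counts under a Dirichlet-multinomial.

    counts: non-negative integer counts per category (e.g. prey items per diet class)
    alpha:  positive Dirichlet concentration parameters, one per category
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log n! - sum(log x_k!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalizers after integrating out the proportions
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    log_ratio += sum(math.lgamma(x + k) - math.lgamma(k) for x, k in zip(counts, alpha))
    return log_coef + log_ratio


# Hypothetical example: 4 observations over 2 resource categories, flat prior
log_p = dirichlet_multinomial_logpmf([3, 1], [1.0, 1.0])
```

With two categories and `alpha = [1.0, 1.0]`, every split of the 4 observations is equally likely, so the probability of any one split is 1/5; this gives a quick sanity check of the formula.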

