ComPotts: Optimal alignment of coevolutionary models for protein sequences

Author(s):  
Hugo Talibart ◽  
François Coste

Abstract. To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models (pHMMs), which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition. Due to the presence of non-local dependencies, aligning two Potts models is computationally hard. To tackle this task, we introduce an Integer Linear Programming formulation of the problem and present ComPotts, an implementation able to compute the optimal alignment of two Potts models representing proteins in tractable time. A first experiment on 59 low-sequence-identity pairwise alignments, extracted from 3 reference alignments from the SISYPHUS and BaliBase3 databases, shows that ComPotts finds better alignments than the other tested methods in the majority of these cases.

2020 ◽  
Author(s):  
Hugo Talibart ◽  
François Coste

Abstract. Background: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models (pHMMs), which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. Results: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed on a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark that have the lowest sequence identity (between 3% and 20%) and for which reliable Potts models can be built for each sequence to be aligned. This experiment confirms that Potts models can be aligned in reasonable time (1′37″ on average on these alignments). The contribution of couplings is evaluated by comparison with HHalign and with PPalign without couplings. Although the Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean F1 score and finds significantly better alignments than HHalign and PPalign without couplings in some cases. Conclusions: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments nevertheless suggest that new research on the inference of Potts models is needed to make them more comparable and more suitable for homology search. We think that PPalign’s guaranteed optimality will be a powerful asset for unbiased investigations in this direction.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Hugo Talibart ◽  
François Coste

Abstract. Background: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. Methods: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed on a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark that have the lowest sequence identity (between 3% and 20%) and for which reliable Potts models can be built for each sequence to be aligned. This experiment confirms that Potts models can be aligned in reasonable time (1′37″ on average on these alignments). The contribution of couplings is evaluated by comparison with HHalign and with independent-site PPalign. Although the Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean F1 score and finds significantly better alignments than HHalign and PPalign without couplings in some cases. Conclusions: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments nevertheless suggest that new research on the inference of Potts models is needed to make them more comparable and more suitable for homology search. We think that PPalign’s guaranteed optimality will be a powerful asset for unbiased investigations in this direction.


Author(s):  
Sávio Soares Dias ◽  
Luidi Gelabert Simonetti ◽  
Luiz Satoru Ochi

The present paper tackles the Car Renter Salesman Problem (CaRS), a variant of the Traveling Salesman Problem. In CaRS, the goal is to travel through a set of cities using rented vehicles at minimum cost, that is, to establish an optimal route in which a rented vehicle of a possibly different type is used for each trip. Since CaRS is NP-hard, we present a heuristic approach to tackle it. The approach is based on a Multi-Start Iterated Local Search metaheuristic, where the local search step is based on the Random Variable Neighborhood Descent methodology. An Integer Linear Programming formulation based on a quadratic formulation from the literature is also proposed. Computational results for the proposed heuristic on Euclidean instances outperform current state-of-the-art results. The proposed formulation also has stronger bounds and a stronger relaxation than others from the literature.


2019 ◽  
Author(s):  
Pedro A. A. Penna ◽  
Nelson D. A. Mascarenhas

Synthetic aperture radar (SAR) imaging systems rely on coherent processing, which causes multiplicative speckle noise to appear. This noise gives a granular appearance to the terrestrial surface scene, impairing its interpretation. Current state-of-the-art filters in the remote sensing area rely on the similarity between patches. The goal of this manuscript is to present a method that adapts the non-local means (NLM) algorithm so that it can mitigate this noise. Our research considers single-look speckle and the NLM in the Haar wavelet domain, with intensity SAR images. To achieve our goal, we used the Exponential-Polynomial (EP) and Gamma distributions to describe the Haar coefficients. Stochastic distances based on these two distributions were also formulated and embedded in the original NLM technique. Finally, we present analyses and comparisons on real scenarios to demonstrate the competitive performance of the proposed method against some recent filters from the literature.


2019 ◽  
Author(s):  
Yutong Qiu ◽  
Cong Ma ◽  
Han Xie ◽  
Carl Kingsford

Abstract. Transcriptomic structural variants (TSVs), structural variants that affect expressed regions, are common, especially in cancer. Detecting TSVs is a challenging computational problem. Sample heterogeneity (including differences between alleles in diploid organisms) is a critical confounding factor when identifying TSVs. To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangement Problem (MCAP), which seeks k genome rearrangements to maximize the number of reads that are concordant with at least one rearrangement. This directly models the situation of a heterogeneous or diploid sample. We prove that MCAP is NP-hard and provide an approximation algorithm for k = 1 and an approximation algorithm for the diploid case (k = 2) assuming an oracle for k = 1. Combining these, we obtain an approximation algorithm for MCAP when k = 2 (without an oracle). We also present an integer linear programming formulation for general k. We completely characterize the graph structures that require k > 1 to satisfy all edges and show that such structures are prevalent in cancer samples. We evaluate our algorithms on 381 TCGA samples and 2 cancer cell lines and show improved performance compared to the state-of-the-art TSV-calling tool, SQUID.


1995 ◽  
Vol 38 (5) ◽  
pp. 1126-1142 ◽  
Author(s):  
Jeffrey W. Gilger

This paper is an introduction to behavioral genetics for researchers and practitioners in language development and disorders. The specific aims are to illustrate some essential concepts and to show how behavioral genetic research can be applied to the language sciences. Past genetic research on language-related traits has tended to focus on simple etiology (i.e., the heritability or familiality of language skills). The current state of the art, however, suggests that great promise lies in addressing more complex questions through behavioral genetic paradigms. In terms of future goals, it is suggested that: (a) more behavioral genetic work of all types should be done, including replications and expansions of preliminary studies already in print; (b) work should focus on fine-grained, theory-based phenotypes with research designs that can address complex questions in language development; and (c) work in this area should utilize a variety of samples and methods (e.g., twin and family samples, heritability and segregation analyses, linkage and association tests).


1976 ◽  
Vol 21 (7) ◽  
pp. 497-498
Author(s):  
STANLEY GRAND

10.37236/24 ◽  
2002 ◽  
Vol 1000 ◽  
Author(s):  
A. Di Bucchianico ◽  
D. Loeb

We survey the mathematical literature on umbral calculus (otherwise known as the calculus of finite differences) from its roots in the 19th century (and earlier) as a set of “magic rules” for lowering and raising indices, through its rebirth in the 1970s as Rota’s school set it on a firm logical foundation using operator methods, to the current state of the art with numerous generalizations and applications. The survey itself is complemented by a fairly complete bibliography (over 500 references), which we expect to update regularly.


2009 ◽  
Vol 5 (4) ◽  
pp. 359-366 ◽  
Author(s):  
Osvaldo Santos-Filho ◽  
Anton Hopfinger ◽  
Artem Cherkasov ◽  
Ricardo de Alencastro
