Eliminating redundancy among protein sequences using submodular optimization

Abstract. Motivation: Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success in many fields but is not yet widely used in biology. We apply submodular optimization to the problem of removing redundancy in protein sequence data sets. This is a common step in many bioinformatics and structural biology workflows, including creation of non-redundant training sets for sequence and structural models as well as selection of “operational taxonomic units” from metagenomics data. Results: We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods. In particular, we compare to a widely used heuristic algorithm implemented in software tools such as CD-HIT, as well as to a variety of standard clustering methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is theoretically optimal under some assumptions, and it is flexible and intuitive because it applies generic methods to optimize one of a variety of objective functions. This application serves as a model for how submodular optimization can be applied to other discrete problems in biology. Availability: Source code is available at https://github.com/mlibbrecht/[email protected]
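
As a rough illustration of the kind of selection this abstract describes, the sketch below greedily maximizes a facility-location function, one common monotone submodular objective. The choice of objective, the toy similarity matrix, and the names `facility_location` and `greedy_select` are assumptions made for illustration and are not taken from the authors' code. For monotone submodular objectives, greedy selection of this kind is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
def facility_location(sim, subset):
    """Sum, over all items, of the similarity to the closest chosen representative."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_select(sim, k):
    """Pick up to k representatives by repeatedly adding the item with the largest gain."""
    chosen = []
    remaining = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        current = facility_location(sim, chosen)
        # Marginal gain of adding candidate j to the current selection.
        best = max(
            sorted(remaining),
            key=lambda j: facility_location(sim, chosen + [j]) - current,
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical pairwise similarities: items 0-2 form one group, items 3-4 another.
sim = [
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.10, 0.10],
    [0.80, 0.85, 1.00, 0.00, 0.10],
    [0.10, 0.10, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.10, 0.90, 1.00],
]
representatives = greedy_select(sim, 2)
```

With this toy matrix the first pick comes from the larger group and the second from the other group, because a second pick from the same group adds little coverage.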

Choosing Non-redundant Representative Subsets Of Protein Sequence Data Sets Using Submodular Optimization

Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '18 ◽

10.1145/3233547.3233717 ◽

2018 ◽

Author(s):

Maxwell W. Libbrecht ◽

Jeffrey A. Bilmes ◽

William Stafford Noble

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Data Sets ◽

Submodular Optimization ◽

Protein Sequence Data

Choosing non-redundant representative subsets of protein sequence data sets using submodular optimization

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.25461 ◽

2018 ◽

Vol 86 (4) ◽

pp. 454-466 ◽

Cited By ~ 6

Author(s):

Maxwell W. Libbrecht ◽

Jeffrey A. Bilmes ◽

William Stafford Noble

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Data Sets ◽

Submodular Optimization ◽

Protein Sequence Data

Faculty Opinions recommendation of Continuous molecular evolution of protein-domain structures by single amino acid changes.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1066706.523896 ◽

2007 ◽

Author(s):

Reinhard Sterner

Keyword(s):

Amino Acid ◽

Molecular Evolution ◽

Protein Domain ◽

Single Amino Acid ◽

Domain Structures

Faculty Opinions recommendation of Continuous molecular evolution of protein-domain structures by single amino acid changes.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1066706.519619 ◽

2007 ◽

Author(s):

Torsten Schwede

Keyword(s):

Amino Acid ◽

Molecular Evolution ◽

Protein Domain ◽

Single Amino Acid ◽

Domain Structures

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Entropy ◽

10.3390/e23050530 ◽

2021 ◽

Vol 23 (5) ◽

pp. 530

Author(s):

Milton Silva ◽

Diogo Pratas ◽

Armando J. Pinho

Keyword(s):

Protein Sequence ◽

Sequence Data ◽

Specific Protein ◽

General Purpose ◽

Amino Acid Sequences ◽

Input Size ◽

Protein Sequence Data ◽

Analysis Application ◽

Straightforward Solution ◽

Human Coronaviruses

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also uses less memory than AC, with requirements about seven times lower, and its memory usage is not affected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
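
To make the idea of compression-based similarity concrete, the sketch below computes the normalized compression distance (NCD) with the general-purpose `zlib` compressor. This is a stand-in technique chosen for illustration: it is not AC2, and it is not necessarily the similarity measure the authors use. The sequences and the names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a stand-in for a protein compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower values suggest more shared content."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy sequences: a reference, a one-residue variant, and an unrelated one.
reference = "MKTAYIAKQRQISFVKSHFSRQ" * 20
variant = reference[:200] + "W" + reference[201:]
unrelated = "GSHMLEDPWNNWCCYTHPLGE" * 20

d_variant = ncd(reference, variant)
d_unrelated = ncd(reference, unrelated)
```

The variant should sit closer to the reference than the unrelated sequence does, because concatenating two near-identical sequences adds little to the compressed size.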

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Summary: When analyzing sequence data, genetic variants are considered one by one, taking no account of whether or not they are found in the same individual. However, variant combinations might be key players in some diseases, as variants that are neutral on their own can become deleterious when combined. GEMPROT is a new analysis tool that visualizes, from a phased VCF file, the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation: GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information: Supplementary data are available at Bioinformatics online.

Comparison of dimensionality reduction and clustering methods for SARS-CoV-2 genome

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i4.2803 ◽

2021 ◽

Vol 10 (4) ◽

pp. 2170-2180

Author(s):

Untari N. Wisesty ◽

Tati Rajab Mengko

Keyword(s):

Dimensionality Reduction ◽

Dimensional Reduction ◽

Clustering Algorithm ◽

Sequence Data ◽

Clustering Algorithms ◽

Gaussian Mixture Models ◽

Reduction Process ◽

Principal Component ◽

Gaussian Mixture ◽

Clustering Methods

This paper analyzes SARS-CoV-2 genome variation by comparing the results of genome clustering from several clustering algorithms and the distribution of sequences in each cluster. The clustering algorithms used are K-means, Gaussian mixture models, agglomerative hierarchical clustering, mean-shift clustering, and DBSCAN. However, clustering algorithms are weak at grouping very high-dimensional data such as genome data, so a dimensionality reduction step is needed. In this research, dimensionality reduction was carried out using principal component analysis (PCA) and an autoencoder method, with three models that produce 2, 10, and 50 features. The main contributions are the dimensionality reduction and clustering scheme for SARS-CoV-2 sequence data and the performance analysis of each experiment for each scheme and for the hyperparameters of each method. Based on the experimental results, PCA combined with DBSCAN achieves the highest silhouette score, 0.8770, with three clusters when using two features. However, dimensionality reduction using the autoencoder needs more iterations to converge. In testing with Indonesian sequence data, more than half of the sequences fall into one cluster and the rest are distributed between the other two clusters.
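
The silhouette score used above to compare clusterings can be sketched in a few lines. The version below is a minimal pure-Python illustration on hypothetical 2-D points, assuming Euclidean distance and at least two clusters; it is not the authors' pipeline, and the function name `silhouette_score` here is only a local helper.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette over all points; assumes at least two distinct labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # By convention, a singleton cluster contributes 0.
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical points forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Scores near 1 indicate compact, well-separated clusters, while scores below 0 indicate that points sit closer to another cluster than to their own.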

Phylogenetic Analysis of Protein Sequence Data Using the Randomized Axelerated Maximum Likelihood (RAXML) Program

Current Protocols in Molecular Biology ◽

10.1002/0471142727.mb1911s96 ◽

2011 ◽

Vol 96 (1) ◽

Cited By ~ 19

Author(s):

Antonis Rokas

Keyword(s):

Phylogenetic Analysis ◽

Maximum Likelihood ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequence Data

EdClust: A heuristic sequence clustering method with higher sensitivity

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500360 ◽

2021 ◽

Author(s):

Ming Cao ◽

Qinke Peng ◽

Ze-Gang Wei ◽

Fei Liu ◽

Yi-Fan Hou

Keyword(s):

Large Scale ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Sequencing Data ◽

Clustering Method ◽

Cluster Number ◽

Sequence Clustering ◽

Downstream Analysis ◽

Heuristic Clustering

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) that uses Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust produces fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
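
For readers unfamiliar with heuristic sequence clustering, the sketch below shows the general greedy incremental scheme: process sequences from longest to shortest and assign each to the first seed it matches above a threshold, otherwise make it a new seed. This is an illustration of the general idea only, not edClust's algorithm. The standard-library `difflib.SequenceMatcher` ratio stands in for an alignment-based identity (it is not Edlib), and the sequences and the names `identity` and `greedy_cluster` are hypothetical.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real tool would use sequence alignment."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering: the longest unassigned sequence seeds a cluster."""
    seeds = []
    clusters = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for seed in seeds:
            if identity(seed, seq) >= threshold:
                clusters[seed].append(seq)  # Join the first sufficiently similar seed.
                break
        else:
            seeds.append(seq)  # No seed matched, so this sequence starts a new cluster.
            clusters[seq] = [seq]
    return clusters


# Hypothetical toy input: two near-identical sequences and one unrelated sequence.
clusters = greedy_cluster(
    ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQISFV", "GSHMLEDPWWWNNN"],
    threshold=0.9,
)
```

Because each sequence is compared only against existing seeds, the result depends on processing order and on the threshold, which is one reason such heuristics can overestimate the number of clusters.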