Improved DNA-versus-Protein Homology Search for Protein Fossils

bioRxiv ◽

10.1101/2021.01.25.428050 ◽

2021 ◽

Author(s):

Yin Yao ◽

Martin C. Frith

Keyword(s):

Evolutionary History ◽

Sequence Data ◽

Protein Sequences ◽

Search Method ◽

Substitution Matrix ◽

Homology Search ◽

Protein Homology ◽

Noncoding DNA ◽

Genetic Codes ◽

Versus Protein

Abstract: Protein fossils, i.e. noncoding DNA descended from coding DNA, arise frequently from transposable elements (TEs), decayed genes, and viral integrations. They can reveal, and mislead about, evolutionary history and relationships. They have been detected by comparing DNA to protein sequences, but current methods are not optimized for this task. We describe a powerful DNA-protein homology search method. We use a 64×21 substitution matrix, which is fitted to sequence data, automatically learning the genetic code. We detect subtly homologous regions by considering alternative possible alignments between them, and calculate significance (probability of occurring by chance between random sequences). Our method detects TE protein fossils much more sensitively than blastx, and > 10× faster. Of the ~7 major categories of eukaryotic TE, three have not been found in mammals: we find two of them in the human genome, polinton and DIRS/Ngaro. This method increases our power to find ancient fossils, and perhaps to detect non-standard genetic codes. The alternative-alignments and significance paradigm is not specific to DNA-protein comparison, and could benefit homology search generally.
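To make the shape of a 64×21 codon-versus-amino-acid matrix concrete, the following is a minimal sketch, not the authors' method. It assumes a toy scoring scheme (+1 when a codon encodes the symbol under the standard genetic code, -1 otherwise), whereas the abstract describes a matrix fitted to sequence data. The names `toy_codon_matrix` and `score_ungapped` are hypothetical and exist only for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... (last base varies fastest).
STANDARD_CODE = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # 64 rows
SYMBOLS = "ACDEFGHIKLMNPQRSTVWY*"  # 20 amino acids + stop = 21 columns


def toy_codon_matrix(match=1, mismatch=-1):
    """Build a 64x21 score table as a dict of dicts.

    Assumption: scores are hand-set from the standard code. A fitted
    matrix, as in the abstract, would hold values learned from data.
    """
    translation = dict(zip(CODONS, STANDARD_CODE))
    return {
        codon: {
            symbol: (match if translation[codon] == symbol else mismatch)
            for symbol in SYMBOLS
        }
        for codon in CODONS
    }


def score_ungapped(dna, protein, matrix):
    """Sum codon-vs-residue scores for an in-frame, gapless alignment."""
    if len(dna) != 3 * len(protein):
        raise ValueError("dna must be exactly three times as long as protein")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return sum(matrix[codon][aa] for codon, aa in zip(codons, protein))
```

For example, `score_ungapped("ATGGCC", "MA", toy_codon_matrix())` adds the scores for ATG versus M and GCC versus A. Frameshifts, gaps, alternative alignments and significance estimates, which the abstract also mentions, are outside the scope of this sketch.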

Improved DNA-versus-Protein Homology Search for Protein Fossils

Algorithms for Computational Biology - Lecture Notes in Computer Science ◽

10.1007/978-3-030-74432-8_11 ◽

2021 ◽

pp. 146-158

Author(s):

Yin Yao ◽

Martin C. Frith

Keyword(s):

Homology Search ◽

Protein Homology ◽

Versus Protein

Improving the Sensitivity and Specificity of Protein Homology Search by Incorporating Predicted Secondary Structures

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720006002119 ◽

2006 ◽

Vol 4 (3) ◽

pp. 709-720 ◽

Cited By ~ 1

Author(s):

Bin Ma ◽

Lieyu Wu ◽

Kaizhong Zhang

Keyword(s):

Sensitivity And Specificity ◽

Structure Prediction ◽

Protein Sequences ◽

Secondary Structures ◽

Homology Search ◽

Search Performance ◽

Protein Homology ◽

Protein Secondary Structures

In this paper, we improve the homology search performance by the combination of the predicted protein secondary structures and protein sequences. Previous research suggested that the straightforward combination of predicted secondary structures did not improve the homology search performance, mostly because of the errors in the structure prediction. We solved this problem by taking into account the confidence scores output by the prediction programs.

COMER2: GPU-accelerated sensitive and specific homology searches

Bioinformatics ◽

10.1093/bioinformatics/btaa185 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3570-3572

Author(s):

Mindaugas Margelevičius

Keyword(s):

Open Source Software ◽

Graphics Processing Units ◽

State Of The Art ◽

Sequence Data ◽

Search Method ◽

Homology Search ◽

Supplementary Information ◽

Cross Platform ◽

Graphics Processing ◽

Sequence Profiles

Summary: Searching for homology in the vast amount of sequence data has a particular emphasis on its speed. We present a completely rewritten version of the sensitive homology search method COMER based on alignment of protein sequence profiles, which is capable of searching big databases even on a lightweight laptop. By harnessing the power of CUDA-enabled graphics processing units, it is up to 20 times faster than HHsearch, a state-of-the-art method using vectorized instructions on modern CPUs. Availability and implementation: COMER2 is cross-platform open-source software available at https://sourceforge.net/projects/comer2 and https://github.com/minmarg/comer2. It can be easily installed from source code or using stand-alone installers. Supplementary information: Supplementary data are available at Bioinformatics online.

The evolutionary history of HSA7/16 synteny in vertebrates: a critical interpretation of comparative cytogenetic and genome sequence data

Caryologia ◽

10.1080/00087114.2013.829691 ◽

2013 ◽

Vol 66 (3) ◽

pp. 236-242

Author(s):

Barbara Picone ◽

Luca Sineo

Keyword(s):

Genome Sequence ◽

Evolutionary History ◽

Sequence Data ◽

Genome Sequence Data ◽

History Of ◽

Critical Interpretation

PepPat, a pattern-based oligopeptide homology search method and the identification of a novel tachykinin-like peptide

Mammalian Genome ◽

10.1007/s00335-002-3061-y ◽

2003 ◽

Vol 14 (5) ◽

pp. 341-349 ◽

Cited By ~ 11

Author(s):

Ying Jiang ◽

Ge Gao ◽

Gang Fang ◽

Eric L. Gustafson ◽

Maureen Laverty ◽

...

Keyword(s):

Search Method ◽

Homology Search

Techniques for the verification of minimal phylogenetic trees illustrated with ten mammalian haemoglobin sequences

Biochemical Journal ◽

10.1042/bj1870065 ◽

1980 ◽

Vol 187 (1) ◽

pp. 65-74 ◽

Cited By ~ 12

Author(s):

D Penny ◽

M D Hendy ◽

L R Foulds

Keyword(s):

Amino Acid ◽

Phylogenetic Tree ◽

Protein Sequence ◽

Phylogenetic Trees ◽

Sequence Data ◽

Protein Sequences ◽

Nucleotide Sequences ◽

Amino Acid Sequences ◽

Minimal Tree ◽

Protein Sequence Data

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

GEMPROT: visualization of the impact on the protein of the genetic variants found on each haplotype

Bioinformatics ◽

10.1093/bioinformatics/bty993 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2492-2494

Author(s):

Tania Cuppens ◽

Thomas E Ludwig ◽

Pascal Trouvé ◽

Emmanuelle Genin

Keyword(s):

Genetic Variants ◽

Protein Sequence ◽

Sequence Data ◽

Protein Sequences ◽

Supplementary Information ◽

Analysis Tool ◽

Functional Protein ◽

Key Players ◽

On Line ◽

The Impact

Summary: When analyzing sequence data, genetic variants are considered one by one, taking no account of whether or not they are found in the same individual. However, variant combinations might be key players in some diseases as variants that are neutral on their own can become deleterious when associated together. GEMPROT is a new analysis tool that allows, from a phased vcf file, to visualize the consequences of the genetic variants on the protein. At the level of an individual, the program shows the variants on each of the two protein sequences and the Pfam functional protein domains. When data on several individuals are available, GEMPROT lists the haplotypes found in the sample and can compare the haplotype distributions between different sub-groups of individuals. By offering a global visualization of the gene with the genetic variants present, GEMPROT makes it possible to better understand the impact of combinations of genetic variants on the protein sequence. Availability and implementation: GEMPROT is freely available at https://github.com/TaniaCuppens/GEMPROT. An on-line version is also available at http://med-laennec.univ-brest.fr/GEMPROT/. Supplementary information: Supplementary data are available at Bioinformatics online.

Origins of the new influenza A(H1N1) virus: time to take action

Eurosurveillance ◽

10.2807/ese.14.22.19228-en ◽

2009 ◽

Vol 14 (22) ◽

Cited By ~ 9

Author(s):

G M Nava ◽

M S Attene-Ramos ◽

J K Ang ◽

M Escorcia

Keyword(s):

Influenza A ◽

H1N1 Virus ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Interspecies Transmission ◽

Genetic Evolution ◽

Protein Homology ◽

Influenza A H1N1 ◽

Insight Into

To gain insight into the possible origins of the 2009 outbreak of new influenza A(H1N1), we performed two independent analyses of genetic evolution of the new influenza A(H1N1) virus. Firstly, protein homology analyses of more than 400 sequences revealed that this virus most likely evolved from recent swine viruses. Secondly, phylogenetic analyses of 5,214 protein sequences of influenza A(H1N1) viruses (avian, swine and human) circulating in North America for the last two decades (from 1989 to 2009) indicated that the new influenza A(H1N1) virus possesses a distinctive evolutionary trait (genetic distinctness). This appears to be a particular characteristic in pig-human interspecies transmission of influenza A. Thus these analyses contribute to the evidence of the role of pig populations as “mixing vessels” for influenza A(H1N1) viruses.

Genome-scale DNA sequence data and the evolutionary history of placental mammals

Data in Brief ◽

10.1016/j.dib.2018.04.094 ◽

2018 ◽

Vol 18 ◽

pp. 1972-1975 ◽

Cited By ~ 7

Author(s):

Shaoyuan Wu ◽

Scott Edwards ◽

Liang Liu

Keyword(s):

DNA Sequence ◽

Evolutionary History ◽

Sequence Data ◽

DNA Sequence Data ◽

History Of ◽

Genome Scale

Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180130100838 ◽

2018 ◽

Vol 21 (2) ◽

pp. 100-110 ◽

Cited By ~ 3

Author(s):

Chun Li ◽

Jialing Zhao ◽

Changzhong Wang ◽

Yuhua Yao

Keyword(s):

DNA Binding ◽

Protein Sequence ◽

Protein Identification ◽

Binding Proteins ◽

Graphical Representation ◽

Sequence Data ◽

Protein Sequences ◽

DNA Binding Proteins ◽

Support Vector ◽

Letter Sequence

Aim and Objective: The rapid increase in the amount of protein sequence data available leads to an urgent need for novel computational algorithms to analyze and compare these sequences. This study is undertaken to develop an efficient computational approach for timely encoding protein sequences and extracting the hidden information. Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, and then a graph without loops and multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was thus constructed to characterize a protein sequence numerically. Results: By using the proposed mathematical descriptor of a protein sequence, similarity comparisons among β-globin proteins of 17 species and 72 spike proteins of coronaviruses were made, respectively. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that our method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods with improvement in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M. Conclusion: These results suggested that the generalized PseAAC model was very efficient for comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.