HH-suite3 for fast remote homology detection and deep protein annotation

Abstract Background HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. Results We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor 4 and HHblits by a factor 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite. Conclusion The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.

HH-suite3 for fast remote homology detection and deep protein annotation

10.1101/560029 ◽

2019 ◽

Cited By ~ 9

Author(s):

Martin Steinegger ◽

Markus Meier ◽

Milot Mirdita ◽

Harald Vöhringer ◽

Stephan J. Haunsberger ◽

...

Keyword(s):

Open Source ◽

Large Scale ◽

Message Passing Interface ◽

Viterbi Algorithm ◽

Markov Models ◽

Sequence Similarity ◽

Pairwise Alignment ◽

Homology Detection ◽

Sequence Alignments ◽

Multiple Sequence

Abstract Background HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous sequences. Results We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. This accelerated HHsearch by a factor 4 and HHblits by a factor 2 over the previous version 2.0.16. HHblits3 is ~10× faster than PSI-BLAST and ~20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over servers in a cluster using OpenMP and message passing interface (MPI). The free, open-source, GNU GPL(v3)-licensed software is available at https://github.com/soedinglab/hh-suite. Conclusion The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
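As a rough illustration of the dynamic programming that the abstract refers to, the following is a minimal sketch of textbook Viterbi decoding for a generic HMM in log space. It is not the pairwise profile HMM alignment used by HH-suite and it is not vectorized; the function name, the dictionary-based model layout, and the toy states are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple

LogTable = Dict[str, Dict[str, float]]


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: LogTable,
    log_emit: LogTable,
) -> Tuple[List[str], float]:
    """Return the most probable state path and its log-probability.

    Generic textbook Viterbi (hypothetical helper, not HH-suite code).
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back: List[Dict[str, str]] = [{}]

    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev

    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]
```

The inner maximization over predecessor states is the step that a SIMD implementation would evaluate for several cells at once; how HH-suite3 lays out that computation is not described in the abstract.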

Recent Advances in Protein Homology Detection Propelled by Inter-Residue Interaction Map Threading

Frontiers in Molecular Biosciences ◽

10.3389/fmolb.2021.643752 ◽

2021 ◽

Vol 8 ◽

Author(s):

Sutanu Bhattacharya ◽

Rahmatullah Roche ◽

Md Hossain Shuvo ◽

Debswapna Bhattacharya

Keyword(s):

Structure Prediction ◽

Accurate Estimation ◽

Protein Homology ◽

Homology Detection ◽

Sequence Alignments ◽

Homologous Proteins ◽

Multiple Sequence ◽

Additional Information ◽

Residue Interaction ◽

Interaction Map

Sequence-based protein homology detection has emerged as one of the most sensitive and accurate approaches to protein structure prediction. Despite the success, homology detection remains very challenging for weakly homologous proteins with divergent evolutionary profile. Very recently, deep neural network architectures have shown promising progress in mining the coevolutionary signal encoded in multiple sequence alignments, leading to reasonably accurate estimation of inter-residue interaction maps, which serve as a rich source of additional information for improved homology detection. Here, we summarize the latest developments in protein homology detection driven by inter-residue interaction map threading. We highlight the emerging trends in distant-homology protein threading through the alignment of predicted interaction maps at various granularities ranging from binary contact maps to finer-grained distance and orientation maps as well as their combination. We also discuss some of the current limitations and possible future avenues to further enhance the sensitivity of protein homology detection.

Analysis of the diversity of the glycoside hydrolase family 130 in mammal gut microbiomes reveals a novel mannoside-phosphorylase function

Microbial Genomics ◽

10.1099/mgen.0.000404 ◽

2020 ◽

Vol 6 (10) ◽

Author(s):

Ao Li ◽

Elisabeth Laville ◽

Laurence Tarquis ◽

Vincent Lombard ◽

David Ropartz ◽

...

Keyword(s):

Glycoside Hydrolase ◽

Sequence Similarity ◽

Gut Bacteria ◽

Glycoside Hydrolase Family ◽

Sequence Alignments ◽

Multiple Sequence ◽

Content Type ◽

Multiple Sequence Alignments ◽

Hydrolase Family

Mannoside phosphorylases are involved in the intracellular metabolization of mannooligosaccharides, and are also useful enzymes for the in vitro synthesis of oligosaccharides. They are found in glycoside hydrolase family GH130. Here we report on an analysis of 6308 GH130 sequences, including 4714 from the human, bovine, porcine and murine microbiomes. Using sequence similarity networks, we divided the diversity of sequences into 15 mostly isofunctional meta-nodes; of these, 9 contained no experimentally characterized member. By examining the multiple sequence alignments in each meta-node, we predicted the determinants of the phosphorolytic mechanism and linkage specificity. We thus hypothesized that eight uncharacterized meta-nodes would be phosphorylases. These sequences are characterized by the absence of signal peptides and of the catalytic base. Those sequences with the conserved E/K, E/R and Y/R pairs of residues involved in substrate binding would target β-1,2-, β-1,3- and β-1,4-linked mannosyl residues, respectively. These predictions were tested by characterizing members of three of the uncharacterized meta-nodes from gut bacteria. We discovered the first known β-1,4-mannosyl-glucuronic acid phosphorylase, which targets a motif of the Shigella lipopolysaccharide O-antigen. This work uncovers a reliable strategy for the discovery of novel mannoside-phosphorylases, reveals possible interactions between gut bacteria, and identifies a biotechnological tool for the synthesis of antigenic oligosaccharides.

Molecular and Symptom Analysis Reveal the Presence of New Phytoplasmas Associated with Sugarcane Grassy Shoot Disease in India

Plant Disease ◽

10.1094/pdis-91-11-1413 ◽

2007 ◽

Vol 91 (11) ◽

pp. 1413-1418 ◽

Cited By ~ 17

Author(s):

Kanchan Nasare ◽

Amit Yadav ◽

Anil K. Singh ◽

K. B. Shivasharanappa ◽

Y. S. Nerkar ◽

...

Keyword(s):

16S rRNA ◽

Sequence Similarity ◽

PCR Amplification ◽

Saccharum Officinarum ◽

23S rRNA ◽

rRNA Gene ◽

Sequence Alignments ◽

High Sequence Similarity ◽

Multiple Sequence ◽

Very High

A total of 240 sugarcane (Saccharum officinarum) plants showing phenotypic symptoms of sugarcane grassy shoot (SCGS) disease were collected from three states of India, Maharashtra, Karnataka, and Uttar Pradesh. Phytoplasmas were detected in all symptomatic samples by the polymerase chain reaction (PCR) amplification of phytoplasma-specific 16S rRNA gene and 16S-23S rRNA spacer region (SR) sequences. No amplification was observed when DNA from asymptomatic plant samples was used as a template. Sixteen samples were selected on the basis of phenotypic symptoms and geographic location, and cloning and sequencing of the 16S rRNA and spacer regions were performed. Multiple sequence alignments of the 16S rRNA sequences revealed that they share very high sequence similarity with phytoplasmas of rice yellow dwarf, 16SrXI. However, the 16S-23S rRNA SR sequence analysis revealed that while the majority of phytoplasmas shared very high (>99%) sequence similarity with previously reported sugarcane phytoplasmas, two of them, namely BV2 (DQ380342) and VD7 (DQ380343), shared relatively low sequence similarity (79 and 84%, respectively). Therefore, these two phytoplasmas may be previously unreported ones that cause significant yield losses in sugarcane in India.
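The similarity percentages quoted above come from pairwise comparison of aligned sequences. The abstract does not state how similarity was computed, so the sketch below assumes one common definition: percent identity over alignment columns where neither sequence has a gap. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (an assumption; other conventions exist)
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy example: 4 of the 5 ungapped columns match
print(percent_identity("ACGT-A", "ACGAAA"))
```

Other conventions divide by the full alignment length or by the shorter sequence length, which can shift the reported percentage.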

Transcriptome Ortholog Alignment Sequence Tools (TOAST) for Phylogenomic Dataset Assembly

10.21203/rs.2.16269/v4 ◽

2020 ◽

Author(s):

Dustin J. Wcisel ◽

J. Thomas Howard ◽

Jeffrey A. Yoder ◽

Alex Dornburg

Keyword(s):

Missing Data ◽

Open Source ◽

Research Question ◽

Single Copy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Technologies ◽

Phylogenomic Analyses ◽

The Impact

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.
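The idea of reassembling alignments under a user-defined missing-data threshold can be sketched as follows. TOAST itself is an R package; this Python sketch does not reproduce its API. The function names, the nested-dictionary layout, and the choice of `-` and `?` as missing-data characters are assumptions made for illustration.

```python
from typing import Dict

Alignment = Dict[str, str]  # taxon name -> aligned sequence for one locus


def missing_fraction(alignment: Alignment, missing_chars: str = "-?") -> float:
    """Fraction of alignment cells that hold a missing-data character."""
    total = sum(len(seq) for seq in alignment.values())
    if total == 0:
        return 1.0  # treat an empty locus as entirely missing
    missing = sum(
        seq.count(ch) for seq in alignment.values() for ch in missing_chars
    )
    return missing / total


def filter_loci(loci: Dict[str, Alignment], max_missing: float) -> Dict[str, Alignment]:
    """Keep only loci whose missing-data fraction is at most max_missing."""
    return {
        name: aln
        for name, aln in loci.items()
        if missing_fraction(aln) <= max_missing
    }


# Toy data: locusA is 25% missing, locusB is 87.5% missing
toy_loci = {
    "locusA": {"sp1": "ACGT", "sp2": "AC--"},
    "locusB": {"sp1": "----", "sp2": "A???"},
}
kept = filter_loci(toy_loci, max_missing=0.5)
print(sorted(kept))
```

A per-taxon threshold (dropping species rather than loci) would follow the same pattern with the loops swapped.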

Transcriptome Ortholog Alignment Sequence Tools (TOAST) for Phylogenomic Dataset Assembly

10.21203/rs.2.16269/v1 ◽

2019 ◽

Author(s):

Alex Dornburg ◽

Dustin J. Wcisel ◽

J. Thomas Howard ◽

Jeffrey A. Yoder

Keyword(s):

Missing Data ◽

Open Source ◽

Single Copy ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Technologies ◽

Phylogenomic Analyses ◽

The Cost ◽

The Impact

Abstract Background Advances in next-generation sequencing technologies have reduced the cost of whole transcriptome analyses, allowing characterization of non-model species at unprecedented levels. The rapid pace of transcriptomic sequencing has driven the public accumulation of a wealth of data for phylogenomic analyses; however, a lack of tools aimed towards phylogeneticists to efficiently identify orthologous sequences currently hinders effective harnessing of this resource. Results We introduce TOAST, an open source R software package that can utilize the ortholog searches based on the software Benchmarking Universal Single-Copy Orthologs (BUSCO) to assemble multiple sequence alignments of orthologous loci from transcriptomes for any group of organisms. By streamlining search, query, and alignment, TOAST automates the generation of locus and concatenated alignments, and also presents a series of outputs from which users can not only explore missing data patterns across their alignments, but also reassemble alignments based on user-defined acceptable missing data levels for a given research question. Conclusions TOAST provides a comprehensive set of tools for assembly of sequence alignments of orthologs for comparative transcriptomic and phylogenomic studies. This software empowers easy assembly of public and novel sequences for any target database of candidate orthologs, and fills a critically needed niche for tools that enable quantification and testing of the impact of missing data. As open-source software, TOAST is fully customizable for integration into existing or novel custom informatic pipelines for phylogenomic inference.

Identification and analysis of proton-translocating pyrophosphatases in the methanogenic archaeon Methanosarcina mazei

Archaea ◽

10.1155/2002/371325 ◽

2002 ◽

Vol 1 (1) ◽

pp. 1-7 ◽

Cited By ~ 9

Author(s):

Sebastian Bäumer ◽

Sabine Lentes ◽

Gerhard Gottschalk ◽

Uwe Deppenmeier

Keyword(s):

Specific Activity ◽

Sequence Data ◽

Sequence Similarity ◽

Open Reading Frames ◽

Inorganic Pyrophosphatase ◽

Amino Acid Sequence Similarity ◽

Sequence Alignments ◽

Multiple Sequence ◽

Inorganic Pyrophosphate ◽

Reading Frames

Analysis of genome sequence data from the methanogenic archaeon Methanosarcina mazei Gö1 revealed the existence of two open reading frames encoding proton-translocating pyrophosphatases (PPases). These open reading frames are linked by a 750-bp intergenic region containing TC-rich stretches and are transcribed in opposite directions. The corresponding polypeptides are referred to as Mvp1 and Mvp2 and consist of 671 and 676 amino acids, respectively. Both enzymes represent extremely hydrophobic, integral membrane proteins with 15 predicted transmembrane segments and an overall amino acid sequence similarity of 50.1%. Multiple sequence alignments revealed that Mvp1 is closely related to eukaryotic PPases, whereas Mvp2 shows highest homologies to bacterial PPases. Northern blot experiments with RNA from methanol-grown cells harvested in the mid-log growth phase indicated that only Mvp2 was produced under these conditions. Analysis of washed membranes showed that Mvp2 had a specific activity of 0.34 U mg (protein)⁻¹. Proton translocation experiments with inverted membrane vesicles prepared from methanol-grown cells showed that hydrolysis of 1 mol of pyrophosphate was coupled to the translocation of about 1 mol of protons across the cytoplasmic membrane. Appropriate conditions for mvp1 expression could not be determined yet. The pyrophosphatases of M. mazei Gö1 represent the first examples of this enzyme class in methanogenic archaea and may be part of their energy-conserving system. Abbreviations: DCCD, N,N′-dicyclohexylcarbodiimide; PPase, inorganic pyrophosphatase; PPi, inorganic pyrophosphate; Δp, proton motive force.

Contrastive learning on protein embeddings enlightens midnight zone at lightning speed

10.1101/2021.11.14.468528 ◽

2021 ◽

Author(s):

Michael Heinzinger ◽

Maria Littmann ◽

Ian Sillitoe ◽

Nicola Bordin ◽

Christine Orengo ◽

...

Keyword(s):

Structure Prediction ◽

Sequence Similarity ◽

3D Structure ◽

Three Dimensional ◽

Hierarchical Classification ◽

Language Models ◽

Sequence Alignments ◽

Sequence Comparisons ◽

Multiple Sequence ◽

3D Structures

Thanks to the recent advances in protein three-dimensional (3D) structure prediction, in particular through AlphaFold 2 and RoseTTAFold, the abundance of protein 3D information will explode over the next year(s). Expert resources based on 3D structures such as SCOP and CATH have been organizing the complex sequence-structure-function relations into a hierarchical classification schema. Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI) transferring annotations from a protein with experimentally known annotation to a query without annotation. Here, we presented a novel approach that expands the concept of HBI from a low-dimensional sequence-distance lookup to the level of a high-dimensional embedding-based annotation transfer (EAT). Secondly, we introduced a novel solution using single protein sequence representations from protein Language Models (pLMs), so-called embeddings (Prose, ESM-1b, ProtBERT, and ProtT5), as input to contrastive learning, by which a new set of embeddings was created that optimized constraints captured by hierarchical classifications of protein 3D structures. These new embeddings (dubbed ProtTucker) clearly improved what was historically referred to as threading or fold recognition. Thereby, the new embeddings enabled the intrusion into the midnight zone of protein comparisons, i.e., the region in which the level of pairwise sequence similarity is akin to random relations and therefore is hard to navigate by HBI methods. Cautious benchmarking showed that ProtTucker reached much further than advanced sequence comparisons without the need to compute alignments, allowing it to be orders of magnitude faster. Code is available at https://github.com/Rostlab/EAT .
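Embedding-based annotation transfer can be pictured as a nearest-neighbour lookup in embedding space. The sketch below is not the code from the linked repository: it assumes Euclidean distance, tiny two-dimensional toy vectors in place of real pLM embeddings, and hypothetical names throughout.

```python
import math
from typing import Dict, Sequence, Tuple

# Lookup set: protein id -> (embedding vector, known annotation)
Lookup = Dict[str, Tuple[Sequence[float], str]]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def transfer_annotation(query: Sequence[float], lookup: Lookup) -> Tuple[str, str, float]:
    """Return (nearest protein id, its annotation, distance to the query)."""
    if not lookup:
        raise ValueError("lookup set must not be empty")
    best_id = min(lookup, key=lambda pid: euclidean(query, lookup[pid][0]))
    embedding, label = lookup[best_id]
    return best_id, label, euclidean(query, embedding)


# Toy lookup set with made-up labels
toy_lookup: Lookup = {
    "p1": ([0.0, 0.0], "fold A"),
    "p2": ([3.0, 4.0], "fold B"),
}
hit_id, hit_label, hit_dist = transfer_annotation([2.5, 3.5], toy_lookup)
print(hit_id, hit_label, round(hit_dist, 3))
```

Because no alignment is computed, each lookup costs only one distance evaluation per reference protein, which is consistent with the speed advantage the abstract describes. A distance cutoff for rejecting unreliable hits would be a natural extension.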

Wei2GO: weighted sequence similarity-based protein function prediction

10.1101/2020.04.24.059501 ◽

2020 ◽

Author(s):

Maarten J.M.F Reijnders

Keyword(s):

Gene Ontology ◽

Open Source ◽

Protein Function ◽

Large Scale ◽

Sequence Similarity ◽

Protein Function Prediction ◽

Function Prediction ◽

Computational Time ◽

Web Servers ◽

Weighted Sequence

Abstract Background Protein function prediction is an important part of bioinformatics and genomics studies. There are many different predictors available, however most of these are in the form of web-servers instead of open-source locally installable versions. Such local versions are necessary to perform large scale genomics studies due to the presence of limitations imposed by web servers such as queues, prediction speed, and updatability of databases. Methods This paper describes Wei2GO: a weighted sequence similarity and python-based open-source protein function prediction software. It uses DIAMOND and HMMScan sequence alignment searches against the UniProtKB and Pfam databases respectively, transfers Gene Ontology terms from the reference protein to the query protein, and uses a weighing algorithm to calculate a score for the Gene Ontology annotations. Results Wei2GO is compared against the Argot2 and Argot2.5 web servers, which use a similar concept, and DeepGOPlus which acts as a reference. Wei2GO shows an increase in performance according to precision and recall curves, Fmax scores, and Smin scores for biological process and molecular function ontologies. Computational time compared to Argot2 and Argot2.5 is decreased from several hours to several minutes. Availability Wei2GO is written in Python 3, and can be found at https://gitlab.com/mreijnders/Wei2GO

Mandrake: visualising microbial population structure by embedding millions of genomes into a low-dimensional representation

10.1101/2021.10.28.466232 ◽

2021 ◽

Author(s):

John A Lees ◽

Gerry Tonkin-Hill ◽

Zhirong Yang ◽

Jukka Corander

Keyword(s):

Population Structure ◽

Large Scale ◽

Population Genomics ◽

Bacterial Species ◽

Population Based ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Dimensional Reduction Method ◽

Low Dimensional

In less than a decade, population genomics of microbes has progressed from the effort of sequencing dozens of strains to thousands, or even tens of thousands of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here we present Mandrake, an efficient implementation of a dimensional reduction method tailored for the needs of large-scale population genomics. Mandrake is capable of visualising population structure from millions of whole genomes and we illustrate its usefulness with several data sets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/).
