FASTA Herder: a web application to trim protein sequence sets

Abstract The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs have high levels of identity among themselves, which hinders multiple sequence alignment computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.

FASTA Herder: a web application to trim protein sequence sets

ScienceOpen Research ◽

10.14293/s2199-1006.1.sor-life.a67837.v1 ◽

2014 ◽

Keyword(s):

Web Application ◽

Sequence Similarity ◽

Functional Prediction ◽

Multiple Sequence ◽

High Identity ◽

Protein Databases ◽

Large Numbers ◽

Machine Learning Applications ◽

Similarity Searches ◽

Amino Acid Conservation

The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs have high levels of identity among themselves, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html.
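
The idea of culling redundant homologs from a FASTA file can be illustrated with a minimal sketch. This is not the FASTA Herder algorithm, whose details the abstract does not give. It is a generic greedy filter under stated assumptions: similarity is approximated with Python's `difflib.SequenceMatcher` rather than a real alignment, "near-full length" is approximated by a length ratio, and the function names and both thresholds (`min_similarity`, `min_length_ratio`) are hypothetical.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_redundant(a, b, min_similarity=0.7, min_length_ratio=0.9):
    """Assumed redundancy test: similar length AND similar content.

    SequenceMatcher.ratio() is a stand-in for alignment-based identity.
    """
    if not a or not b:
        return a == b
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    if length_ratio < min_length_ratio:
        return False
    similarity = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return similarity >= min_similarity


def cull_redundant(records, min_similarity=0.7, min_length_ratio=0.9):
    """Greedy culling: visit sequences longest first and keep one only if it
    is not redundant with any sequence already kept."""
    kept = []
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(
            is_redundant(seq, kept_seq, min_similarity, min_length_ratio)
            for _, kept_seq in kept
        ):
            kept.append((header, seq))
    return kept
```

Each kept sequence acts as the representative of the sequences it displaced. A real tool would replace the similarity stand-in with identity computed from pairwise alignments.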

PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL

BMC Bioinformatics ◽

10.1186/1471-2105-13-s4-s2 ◽

2012 ◽

Vol 13 (Suppl 4) ◽

pp. S2 ◽

Cited By ~ 82

Author(s):

Emanuele Bramucci ◽

Alessandro Paiardini ◽

Francesco Bossa ◽

Stefano Pascarella

Keyword(s):

Homology Modeling ◽

Sequence Similarity ◽

Sequence Structure ◽

Multiple Sequence ◽

Similarity Searches

WASPS: web-assisted symbolic plasmid synteny server

Bioinformatics ◽

10.1093/bioinformatics/btz745 ◽

2019 ◽

Author(s):

Catherine Badel ◽

Violette Da Cunha ◽

Ryan Catchpole ◽

Patrick Forterre ◽

Jacques Oberto

Keyword(s):

Web Service ◽

Dna Sequences ◽

Sequence Similarity ◽

Supplementary Information ◽

Orthologous Protein ◽

Large Numbers ◽

Daunting Task ◽

Internet Explorer ◽

Genome Analyses ◽

Similarity Searches

Abstract Motivation: Comparative plasmid genome analyses require complex tools and the manipulation of large numbers of sequences, and constitute a daunting task for the wet bench experimentalist. Dedicated plasmid databases are scarce, comprise only bacterial plasmids and provide access exclusively to sequence similarity searches. Results: We have developed Web-Assisted Symbolic Plasmid Synteny (WASPS), a web service granting protein and DNA sequence similarity searches against a database comprising all completely sequenced natural plasmids from bacterial, archaeal and eukaryal origin. This database pre-calculates orthologous protein clustering and enables WASPS to generate fully resolved plasmid synteny maps in real time using internal and user-provided DNA sequences. Availability and implementation: WASPS works with all current browsers such as Firefox, Edge or Safari, while the best functionality is achieved with Chrome. Internet Explorer is not supported. WASPS is freely accessible at https://archaea.i2bc.paris-saclay.fr/wasps/. Supplementary information: Supplementary data are available at Bioinformatics online.

VECTOR SPACE INDEXING FOR BIOSEQUENCE SIMILARITY SEARCHES

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213005002405 ◽

2005 ◽

Vol 14 (05) ◽

pp. 811-826 ◽

Cited By ~ 1

Author(s):

OZGUR OZTURK ◽

HAKAN FERHATOSMANOGLU

Keyword(s):

Nearest Neighbor ◽

Sequence Similarity ◽

Distance Functions ◽

Data Sets ◽

Index Structures ◽

K Nearest Neighbor ◽

Protein Databases ◽

Approximation Quality ◽

Similarity Searches ◽

Better Than

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries and both (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
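
The transformation of sequences into numerical vectors followed by a k-NN query can be sketched as below. This is an illustration only. The abstract does not define FD2 or WD2, so the sketch substitutes a plain L1 (Manhattan) distance between 2-gram count vectors, and it scans the database linearly instead of using an R-tree or scalar quantization index. The function names are hypothetical.

```python
from collections import Counter


def two_gram_counts(seq):
    """Map a sequence to a sparse vector of overlapping 2-gram counts."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def l1_distance(u, v):
    """L1 distance between two sparse count vectors (assumed stand-in for
    the paper's frequency-based distance, which is not defined here)."""
    return sum(abs(u[key] - v[key]) for key in set(u) | set(v))


def nearest(query, database, k):
    """Return the k database sequences closest to the query in 2-gram space,
    using a linear scan rather than an index structure."""
    query_vec = two_gram_counts(query)
    ranked = sorted(
        database,
        key=lambda seq: l1_distance(query_vec, two_gram_counts(seq)),
    )
    return ranked[:k]
```

An index structure matters when the database is large: the linear scan here touches every vector, whereas an index is meant to prune most of them before any distance is computed.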

PredNTS: Improved and Robust Prediction of Nitrotyrosine Sites by Integrating Multiple Sequence Features

International Journal of Molecular Sciences ◽

10.3390/ijms22052704 ◽

2021 ◽

Vol 22 (5) ◽

pp. 2704

Author(s):

Andi Nur Nilamyani ◽

Firda Nurul Auliah ◽

Mohammad Ali Moni ◽

Watshara Shoombuatong ◽

Md Mehedi Hasan ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Web Application ◽

Computational Prediction ◽

Vital Role ◽

Machine Learning Algorithms ◽

Recursive Feature Elimination ◽

Post Translational Modification ◽

Multiple Sequence ◽

Sequence Features

Nitrotyrosine, which is generated by numerous reactive nitrogen species, is a type of protein post-translational modification. Identification of site-specific nitration modification on tyrosine is a prerequisite to understanding the molecular function of nitrated proteins. Thanks to the progress of machine learning, computational prediction can play a vital role before the biological experimentation. Herein, we developed a computational predictor PredNTS by integrating multiple sequence features including K-mer, composition of k-spaced amino acid pairs (CKSAAP), AAindex, and binary encoding schemes. The important features were selected by the recursive feature elimination approach using a random forest classifier. Finally, we linearly combined the successive random forest (RF) probability scores generated by the different, single encoding-employing RF models. The resultant PredNTS predictor achieved an area under a curve (AUC) of 0.910 using five-fold cross validation. It outperformed the existing predictors on a comprehensive and independent dataset. Furthermore, we investigated several machine learning algorithms to demonstrate the superiority of the employed RF algorithm. The PredNTS is a useful computational resource for the prediction of nitrotyrosine sites. The web-application with the curated datasets of the PredNTS is publicly available.
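
One of the encodings named above, the composition of k-spaced amino acid pairs (CKSAAP), can be sketched as follows. The sketch uses the usual definition of the encoding: for a gap of k residues, count every pair of residues separated by exactly k positions and normalise by the number of such pairs. The window length and the range of k used by PredNTS are not stated in the abstract, so the example values are assumptions, and the function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the features.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(seq, k):
    """Return the 400-dimensional CKSAAP vector of seq for gap size k.

    Pairs containing non-standard residues are skipped but still count
    towards the normalising total.
    """
    n_pairs = len(seq) - k - 1
    if n_pairs <= 0:
        return [0.0] * len(PAIRS)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(n_pairs):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / n_pairs for p in PAIRS]
```

Concatenating the vectors for several values of k (for example k = 0 to 5, an assumed range) gives a fixed-length feature vector that a random forest classifier can consume.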

Molecular characterization and phylogenetic analysis of NBS-LRR genes in wild relatives of eggplant (Solanum melongena L

Indian Journal of Agricultural Research ◽

10.18805/ijare.a-4793 ◽

2018 ◽

Author(s):

Sona. S Dev ◽

P. Poornima ◽

Akhil Venu

Keyword(s):

Phylogenetic Analysis ◽

Amino Acid ◽

Sequence Similarity ◽

Interleukin 1 ◽

Preliminary Investigation ◽

Solanum Melongena ◽

Wild Relatives ◽

Amino Acid Sequences ◽

R Genes ◽

Multiple Sequence

Eggplant or brinjal (Solanum melongena L.) is highly susceptible to various soil-borne diseases. The extensive use of chemical fungicides to combat these diseases can be minimized by identification of resistance gene analogs (RGAs) in wild species of cultivated plants. In the present study, degenerate PCR primers for the conserved regions of nucleotide binding site-leucine rich repeat (NBS-LRR) were used to amplify RGAs from wild relatives of eggplant (black nightshade (Solanum nigrum), Indian nightshade (Solanum violaceum) and Solanum incanum) which showed resistance to the bacterial wilt pathogen, Ralstonia solanacearum, in the preliminary investigation. The amino acid sequences of the amplicons, when compared to each other and to the amino acid sequences of known RGAs deposited in GenBank, revealed significant sequence similarity. The phylogenetic analysis indicated that they belonged to the toll interleukin-1 receptors (TIR)-NBS-LRR type R-genes. Multiple sequence alignment with other known R genes showed significant homology with P-loop, Kinase 2 and GLPL domains of NBS-LRR class genes. There has been no report on R genes from these wild eggplants and hence the diversity analysis of these novel RGAs can lead to the identification of other novel R genes within the germplasm of different brinjal plants as well as other species of Solanum.

Analysis of the diversity of the glycoside hydrolase family 130 in mammal gut microbiomes reveals a novel mannoside-phosphorylase function

Microbial Genomics ◽

10.1099/mgen.0.000404 ◽

2020 ◽

Vol 6 (10) ◽

Author(s):

Ao Li ◽

Elisabeth Laville ◽

Laurence Tarquis ◽

Vincent Lombard ◽

David Ropartz ◽

...

Keyword(s):

Glycoside Hydrolase ◽

Sequence Similarity ◽

Gut Bacteria ◽

Glycoside Hydrolase Family ◽

Sequence Alignments ◽

Multiple Sequence ◽

Content Type ◽

Multiple Sequence Alignments ◽

Hydrolase Family

Mannoside phosphorylases are involved in the intracellular metabolization of mannooligosaccharides, and are also useful enzymes for the in vitro synthesis of oligosaccharides. They are found in glycoside hydrolase family GH130. Here we report on an analysis of 6308 GH130 sequences, including 4714 from the human, bovine, porcine and murine microbiomes. Using sequence similarity networks, we divided the diversity of sequences into 15 mostly isofunctional meta-nodes; of these, 9 contained no experimentally characterized member. By examining the multiple sequence alignments in each meta-node, we predicted the determinants of the phosphorolytic mechanism and linkage specificity. We thus hypothesized that eight uncharacterized meta-nodes would be phosphorylases. These sequences are characterized by the absence of signal peptides and of the catalytic base. Those sequences with the conserved E/K, E/R and Y/R pairs of residues involved in substrate binding would target β-1,2-, β-1,3- and β-1,4-linked mannosyl residues, respectively. These predictions were tested by characterizing members of three of the uncharacterized meta-nodes from gut bacteria. We discovered the first known β-1,4-mannosyl-glucuronic acid phosphorylase, which targets a motif of the Shigella lipopolysaccharide O-antigen. This work uncovers a reliable strategy for the discovery of novel mannoside-phosphorylases, reveals possible interactions between gut bacteria, and identifies a biotechnological tool for the synthesis of antigenic oligosaccharides.

Rfam 14: expanded coverage of metagenomic, viral and microRNA families

Nucleic Acids Research ◽

10.1093/nar/gkaa1047 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D192-D200 ◽

Cited By ~ 2

Author(s):

Ioanna Kalvari ◽

Eric P Nawrocki ◽

Nancy Ontiveros-Palacios ◽

Joanna Argasinska ◽

Kevin Lamkiewicz ◽

...

Keyword(s):

Similarity Search ◽

Sequence Similarity ◽

Sequence Similarity Search ◽

Covariance Model ◽

Rna Sequences ◽

Multiple Sequence ◽

The Family ◽

Recent Developments ◽

Community Contribution ◽

Website Features

Abstract Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

Identification of a PRC2 Accessory Subunit Required for Subtelomeric H3K27 Methylation in Neurospora crassa

Molecular and Cellular Biology ◽

10.1128/mcb.00003-20 ◽

2020 ◽

Vol 40 (11) ◽

Author(s):

Kevin J. McNaught ◽

Elizabeth T. Wiles ◽

Eric U. Selker

Keyword(s):

Neurospora Crassa ◽

Transcriptional Repression ◽

Sequence Similarity ◽

Model Organism ◽

Content Type ◽

Polycomb Repressive Complex 2 ◽

H3k27 Methylation ◽

Accessory Subunit ◽

Genomic Regions ◽

Similarity Searches

ABSTRACT Polycomb repressive complex 2 (PRC2) catalyzes methylation of histone H3 at lysine 27 (H3K27) in genomic regions of most eukaryotes and is critical for maintenance of the associated transcriptional repression. However, the mechanisms that shape the distribution of H3K27 methylation, such as recruitment of PRC2 to chromatin and/or stimulation of PRC2 activity, are unclear. Here, using a forward genetic approach in the model organism Neurospora crassa, we identified two alleles of a gene, NCU04278, encoding an unknown PRC2 accessory subunit (PAS). Loss of PAS resulted in losses of H3K27 methylation concentrated near the chromosome ends and derepression of a subset of associated subtelomeric genes. Immunoprecipitation followed by mass spectrometry confirmed reciprocal interactions between PAS and known PRC2 subunits, and sequence similarity searches demonstrated that PAS is not unique to N. crassa. PAS homologs likely influence the distribution of H3K27 methylation and underlying gene repression in a variety of fungal lineages.

Partial sequence analysis of mitochondrial cytochrome B gene of Labeo calbasu of Bangladesh

Journal of Biodiversity Conservation and Bioresource Management ◽

10.3329/jbcbm.v5i1.42182 ◽

2019 ◽

Vol 5 (1) ◽

pp. 25-30

Author(s):

RA Begum ◽

MT Alam ◽

H Jahan ◽

MS Alam

Keyword(s):

Genetic Diversity ◽

Cytochrome B ◽

Tissue Sample ◽

Sequence Similarity ◽

Cytochrome B Gene ◽

Mitochondrial Cytochrome ◽

Cyt B ◽

Multiple Sequence ◽

Mitochondrial Cytochrome B ◽

Cyt B Gene

Labeo calbasu (Family Cyprinidae) was studied at the DNA level to assess genetic diversity within and between species. The mitochondrial cytochrome b (cyt-b) gene of L. calbasu was sequenced and compared to the corresponding sequences of other Labeo species. DNA was isolated from a tissue sample of L. calbasu using the phenol:chloroform extraction method. Forward and reverse primers were designed to amplify the target region of the cytochrome b gene. A standard PCR protocol was used for the amplification of the desired region. The forward and reverse sequences obtained were then aligned and edited to a final length of 510 nucleotides, which was submitted to the NCBI GenBank database. A nucleotide BLAST search of this sequence at NCBI returned 100% sequence similarity with the L. calbasu sequence of the same region of the cyt-b gene. Multiple sequence alignment of the sequence with sequences from seven more Labeo species revealed 120 polymorphic sites, which mark the diversity among the species and might be used in molecular identification of the Labeo species. The constructed phylogenetic tree showed the relationships among the Labeo species. This research demonstrated the usefulness of a mitochondrial DNA-based approach in species identification. Further, the data will provide appropriate background for studying genetic diversity within the Labeo species in general and L. calbasu in particular. J. Biodivers. Conserv. Bioresour. Manag. 2019, 5(1): 25-30
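
Counting polymorphic sites in a multiple sequence alignment, as reported above, can be sketched in a few lines. The sketch assumes that a site is polymorphic when its column contains more than one distinct base, and that gaps (`-`) and ambiguous `N` calls are ignored. The abstract does not state how the authors treated gaps, so that choice is an assumption, as is the function name.

```python
def polymorphic_sites(aligned):
    """Return the 0-based column indices at which the aligned sequences
    show more than one distinct base, ignoring gaps and N."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    sites = []
    for index, column in enumerate(zip(*aligned)):
        bases = {c.upper() for c in column if c.upper() not in {"-", "N"}}
        if len(bases) > 1:
            sites.append(index)
    return sites
```

The length of the returned list is the number of polymorphic sites, and the indices themselves locate candidate positions for distinguishing species.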
