FASTA Herder: a web application to trim protein sequence sets

Author(s):  
Miguel Andrade ◽  
Caroline Louis-Jeune ◽  
Carol Perez-Iratxeta

Abstract The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity, which hinders multiple sequence alignment computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.

The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html.
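The idea of culling redundant near-full-length homologs can be illustrated with a minimal sketch. This is not the FASTA Herder algorithm, which the abstract does not specify; it is a generic greedy filter under stated assumptions. The function names, the default thresholds (`min_similarity`, `min_length_ratio`) and the use of `difflib.SequenceMatcher` as a crude stand-in for alignment-based identity are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    """Crude similarity proxy (assumption): not a real alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cull_redundant(records, min_similarity=0.7, min_length_ratio=0.9):
    """Greedily keep one representative per group of near-full-length homologs."""
    kept = []
    # Visit the longest sequences first so representatives are the most complete.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        redundant = False
        for _, rep in kept:
            # Only sequences covering most of the representative may be merged.
            near_full_length = len(seq) >= min_length_ratio * len(rep)
            if near_full_length and similarity(seq, rep) >= min_similarity:
                redundant = True
                break
        if not redundant:
            kept.append((header, seq))
    return kept
```

In this sketch a short fragment is never merged into a longer representative, so it survives the filter; a real tool might treat fragments differently.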


2012 ◽  
Vol 13 (Suppl 4) ◽  
pp. S2 ◽  
Author(s):  
Emanuele Bramucci ◽  
Alessandro Paiardini ◽  
Francesco Bossa ◽  
Stefano Pascarella

2019 ◽  
Author(s):  
Catherine Badel ◽  
Violette Da Cunha ◽  
Ryan Catchpole ◽  
Patrick Forterre ◽  
Jacques Oberto

Abstract Motivation Comparative plasmid genome analyses require complex tools and the manipulation of large numbers of sequences, and constitute a daunting task for the wet bench experimentalist. Dedicated plasmid databases are sparse, comprise only bacterial plasmids and provide access exclusively to sequence similarity searches. Results We have developed Web-Assisted Symbolic Plasmid Synteny (WASPS), a web service granting protein and DNA sequence similarity searches against a database comprising all completely sequenced natural plasmids from bacterial, archaeal and eukaryal origin. This database pre-calculates orthologous protein clustering and enables WASPS to generate fully resolved plasmid synteny maps in real time using internal and user-provided DNA sequences. Availability and implementation WASPS works with all current browsers such as Firefox, Edge or Safari, while the best functionality is achieved with Chrome. Internet Explorer is not supported. WASPS is freely accessible at https://archaea.i2bc.paris-saclay.fr/wasps/. Supplementary information Supplementary data are available at Bioinformatics online.


2005 ◽  
Vol 14 (05) ◽  
pp. 811-826 ◽  
Author(s):  
OZGUR OZTURK ◽  
HAKAN FERHATOSMANOGLU

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries and both (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
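The transformation of subsequences into numerical vectors can be sketched as follows. This is only an illustration under stated assumptions: it maps a sequence to its overlapping 2-gram counts and compares vectors with a plain L1 distance, which is not necessarily the FD2 or WD2 distance defined in the paper. The function names and the DNA alphabet default are hypothetical.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"


def two_gram_vector(seq, alphabet=DNA):
    """Map a sequence to a fixed-length vector of overlapping 2-gram counts."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # One dimension per possible 2-gram, in lexicographic alphabet order.
    return [counts["".join(p)] for p in product(alphabet, repeat=2)]


def l1_distance(u, v):
    """Sum of absolute coordinate differences (an assumed, simple distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

Vectors of this kind have a fixed dimensionality (16 for DNA 2-grams), which is what makes them suitable for multi-dimensional index structures such as R-trees.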


2021 ◽  
Vol 22 (5) ◽  
pp. 2704
Author(s):  
Andi Nur Nilamyani ◽  
Firda Nurul Auliah ◽  
Mohammad Ali Moni ◽  
Watshara Shoombuatong ◽  
Md Mehedi Hasan ◽  
...  

Nitrotyrosine, which is generated by numerous reactive nitrogen species, is a type of protein post-translational modification. Identification of site-specific nitration modification on tyrosine is a prerequisite to understanding the molecular function of nitrated proteins. Thanks to the progress of machine learning, computational prediction can play a vital role before biological experimentation. Herein, we developed a computational predictor, PredNTS, by integrating multiple sequence features including K-mer, composition of k-spaced amino acid pairs (CKSAAP), AAindex, and binary encoding schemes. The important features were selected by the recursive feature elimination approach using a random forest classifier. Finally, we linearly combined the successive random forest (RF) probability scores generated by the different RF models, each of which employs a single encoding. The resultant PredNTS predictor achieved an area under the curve (AUC) of 0.910 using five-fold cross validation. It outperformed the existing predictors on a comprehensive and independent dataset. Furthermore, we investigated several machine learning algorithms to demonstrate the superiority of the employed RF algorithm. PredNTS is a useful computational resource for the prediction of nitrotyrosine sites. The web application with the curated datasets of PredNTS is publicly available.
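One of the encodings named above, the composition of k-spaced amino acid pairs (CKSAAP), can be sketched in a few lines. This is a generic illustration of the encoding, not the PredNTS implementation; the function name, the `k_max` default and the handling of non-standard residues are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(seq, k_max=2, alphabet=AMINO_ACIDS):
    """Frequencies of residue pairs separated by k residues, for k = 0..k_max."""
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    features = []
    for k in range(k_max + 1):
        n_windows = len(seq) - k - 1  # number of k-spaced pairs in the sequence
        counts = dict.fromkeys(pairs, 0)
        for i in range(max(n_windows, 0)):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # pairs with non-standard residues are skipped
                counts[pair] += 1
        denom = n_windows if n_windows > 0 else 1
        features.extend(counts[p] / denom for p in pairs)
    return features
```

With the 20 standard amino acids, each value of k contributes 400 features, so `k_max=2` yields a 1200-dimensional vector per sequence window.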


Author(s):  
Sona. S Dev ◽  
P. Poornima ◽  
Akhil Venu

Eggplant or brinjal (Solanum melongena L.) is highly susceptible to various soil-borne diseases. The extensive use of chemical fungicides to combat these diseases can be minimized by identification of resistance gene analogs (RGAs) in wild species of cultivated plants. In the present study, degenerate PCR primers for the conserved regions of nucleotide binding site-leucine rich repeat (NBS-LRR) were used to amplify RGAs from wild relatives of eggplant (black nightshade (Solanum nigrum), Indian nightshade (Solanum violaceum) and Solanum incanum) which showed resistance to the bacterial wilt pathogen, Ralstonia solanacearum, in the preliminary investigation. The amino acid sequences of the amplicons, when compared to each other and to the amino acid sequences of known RGAs deposited in GenBank, revealed significant sequence similarity. The phylogenetic analysis indicated that they belonged to the toll interleukin-1 receptor (TIR)-NBS-LRR type R genes. Multiple sequence alignment with other known R genes showed significant homology with the P-loop, Kinase 2 and GLPL domains of NBS-LRR class genes. There has been no report on R genes from these wild eggplants, and hence the diversity analysis of these novel RGAs can lead to the identification of other novel R genes within the germplasm of different brinjal plants as well as other species of Solanum.


2020 ◽  
Vol 6 (10) ◽  
Author(s):  
Ao Li ◽  
Elisabeth Laville ◽  
Laurence Tarquis ◽  
Vincent Lombard ◽  
David Ropartz ◽  
...  

Mannoside phosphorylases are involved in the intracellular metabolization of mannooligosaccharides, and are also useful enzymes for the in vitro synthesis of oligosaccharides. They are found in glycoside hydrolase family GH130. Here we report on an analysis of 6308 GH130 sequences, including 4714 from the human, bovine, porcine and murine microbiomes. Using sequence similarity networks, we divided the diversity of sequences into 15 mostly isofunctional meta-nodes; of these, 9 contained no experimentally characterized member. By examining the multiple sequence alignments in each meta-node, we predicted the determinants of the phosphorolytic mechanism and linkage specificity. We thus hypothesized that eight uncharacterized meta-nodes would be phosphorylases. These sequences are characterized by the absence of signal peptides and of the catalytic base. Those sequences with the conserved E/K, E/R and Y/R pairs of residues involved in substrate binding would target β-1,2-, β-1,3- and β-1,4-linked mannosyl residues, respectively. These predictions were tested by characterizing members of three of the uncharacterized meta-nodes from gut bacteria. We discovered the first known β-1,4-mannosyl-glucuronic acid phosphorylase, which targets a motif of the Shigella lipopolysaccharide O-antigen. This work uncovers a reliable strategy for the discovery of novel mannoside-phosphorylases, reveals possible interactions between gut bacteria, and identifies a biotechnological tool for the synthesis of antigenic oligosaccharides.
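The general idea behind a sequence similarity network can be sketched as a thresholded graph whose connected components form groups of related sequences. This is a simplified illustration, not the procedure used to define the meta-nodes in the study; the function name, the input format (precomputed pairwise scores) and the threshold are hypothetical.

```python
def similarity_network_components(nodes, scored_pairs, threshold):
    """Group nodes connected by edges whose score meets the threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    # Largest groups first; ties broken by member names for stable output.
    return sorted(groups.values(), key=lambda g: (-len(g), g))
```

Raising the threshold splits the network into more, smaller groups, which is the usual knob for trading coverage against functional homogeneity.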


2020 ◽  
Vol 49 (D1) ◽  
pp. D192-D200 ◽  
Author(s):  
Ioanna Kalvari ◽  
Eric P Nawrocki ◽  
Nancy Ontiveros-Palacios ◽  
Joanna Argasinska ◽  
Kevin Lamkiewicz ◽  
...  

Abstract Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.


2020 ◽  
Vol 40 (11) ◽  
Author(s):  
Kevin J. McNaught ◽  
Elizabeth T. Wiles ◽  
Eric U. Selker

ABSTRACT Polycomb repressive complex 2 (PRC2) catalyzes methylation of histone H3 at lysine 27 (H3K27) in genomic regions of most eukaryotes and is critical for maintenance of the associated transcriptional repression. However, the mechanisms that shape the distribution of H3K27 methylation, such as recruitment of PRC2 to chromatin and/or stimulation of PRC2 activity, are unclear. Here, using a forward genetic approach in the model organism Neurospora crassa, we identified two alleles of a gene, NCU04278, encoding an unknown PRC2 accessory subunit (PAS). Loss of PAS resulted in losses of H3K27 methylation concentrated near the chromosome ends and derepression of a subset of associated subtelomeric genes. Immunoprecipitation followed by mass spectrometry confirmed reciprocal interactions between PAS and known PRC2 subunits, and sequence similarity searches demonstrated that PAS is not unique to N. crassa. PAS homologs likely influence the distribution of H3K27 methylation and underlying gene repression in a variety of fungal lineages.


Author(s):  
RA Begum ◽  
MT Alam ◽  
H Jahan ◽  
MS Alam

Labeo calbasu (Family Cyprinidae) was studied at the DNA level to assess genetic diversity within and between species. The mitochondrial cytochrome b (cyt-b) gene of L. calbasu was sequenced and compared to the corresponding sequences of other Labeo species. DNA was isolated from a tissue sample of L. calbasu using the phenol:chloroform extraction method. Forward and reverse primers were designed to amplify the target region of the cytochrome b gene. A standard PCR protocol was used for the amplification of the desired region. The forward and reverse sequences obtained were then aligned and edited to a final length of 510 nucleotides, which was submitted to the NCBI GenBank database. Nucleotide BLAST of this sequence at NCBI showed 100% sequence similarity with the L. calbasu sequence of the same region of the cyt-b gene. Multiple sequence alignment of the sequence with those of seven more Labeo species revealed 120 polymorphic sites, which mark the diversity among the species and might be used in molecular identification of the Labeo species. A phylogenetic tree was constructed to show the relationships among the Labeo species. This research demonstrated the usefulness of a mitochondrial DNA-based approach in species identification. Further, the data will provide an appropriate background for studying within-species genetic diversity of the Labeo species in general and of L. calbasu in particular. J. Biodivers. Conserv. Bioresour. Manag. 2019, 5(1): 25-30
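Counting polymorphic sites in a multiple sequence alignment amounts to finding the columns in which the aligned sequences disagree. A minimal sketch, assuming the alignment is given as equal-length strings and that gaps and ambiguous `N` bases are ignored (both assumptions, as is the function name):

```python
def polymorphic_sites(alignment, ignore="-N"):
    """Return 0-based column indices where the aligned sequences differ."""
    lengths = {len(s) for s in alignment}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must have equal length")
    sites = []
    for i, column in enumerate(zip(*alignment)):
        # Distinct bases in this column, excluding gaps and ambiguity codes.
        bases = {b.upper() for b in column} - set(ignore)
        if len(bases) > 1:
            sites.append(i)
    return sites
```

The number of polymorphic sites is then `len(polymorphic_sites(alignment))`.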

