RAFTS3: Rapid Alignment-Free Tool for Sequence Similarity Search

2016 ◽  
Author(s):  
Ricardo Assunção Vialle ◽  
Fábio de Oliveira Pedrosa ◽  
Vinicius Almir Weiss ◽  
Dieval Guizelini ◽  
Juliana Helena Tibaes ◽  
...  

Abstract Background: Similarity search of a given protein sequence against a database is an essential task in genome analysis. Sequence alignment is the most widely used method to perform such analysis. Although this approach is efficient, the time required to perform searches against large databases is always a challenge. Alignment-free techniques offer alternatives for comparing sequences without the need for alignment. Results: Here we developed RAFTS3, a fast protein similarity search tool that utilizes a filter step for candidate selection based on shared k-mers and a comparison measure using a binary matrix of co-occurrence of amino acid residues. RAFTS3 performed searches many times faster than BLASTp against large protein databases, such as NR, Pfam or UniRef, with a small loss of sensitivity depending on the degree of similarity of the sequences. Conclusions: RAFTS3 is a new alternative for fast comparison of protein sequences, genome annotation and biological data mining. The source code and the standalone files for Windows and Linux platforms are available at: https://sourceforge.net/projects/rafts3/
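The two-stage idea in this abstract (a shared k-mer filter followed by a comparison on a binary residue co-occurrence matrix) can be illustrated with a minimal Python sketch. This is not the RAFTS3 implementation: the k-mer length, the `min_shared` threshold, the choice of adjacent residue pairs as the "co-occurrence" definition, and the Jaccard-style score are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(database, k=3):
    """Map each k-mer to the ids of the database sequences that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def candidates(query, index, k=3, min_shared=2):
    """Filter step: ids sharing at least min_shared distinct k-mers with the query.

    Results are ordered by shared k-mer count (descending), then by id.
    The threshold is an illustrative assumption, not a RAFTS3 parameter.
    """
    counts = defaultdict(int)
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    hits = [(n, seq_id) for seq_id, n in counts.items() if n >= min_shared]
    return [seq_id for n, seq_id in sorted(hits, key=lambda h: (-h[0], h[1]))]


def cooccurrence(seq):
    """Binary 20x20 matrix: entry [a][b] is 1 if residue a is directly followed by b.

    Using adjacent pairs is an assumption; the abstract does not define the matrix.
    """
    pos = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(AMINO_ACIDS) for _ in AMINO_ACIDS]
    for a, b in zip(seq, seq[1:]):
        if a in pos and b in pos:
            matrix[pos[a]][pos[b]] = 1
    return matrix


def matrix_similarity(m1, m2):
    """Jaccard-style score: cells set in both matrices over cells set in either."""
    both = sum(x & y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))
    either = sum(x | y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))
    return both / either if either else 0.0


# Toy database; the sequences are made up for the example.
database = {"a": "MKVLAAGIV", "b": "MKVLTTTTT", "c": "WWWWWWW"}
index = build_index(database)
query = "MKVLAAG"
ranked = sorted(
    candidates(query, index),
    key=lambda sid: -matrix_similarity(cooccurrence(query), cooccurrence(database[sid])),
)
```

The filter keeps the expensive comparison away from sequences that share almost nothing with the query, which is where an alignment-free search of this kind would get its speed.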

2017 ◽  
Author(s):  
Wentian Li ◽  
Jerome Freudenberg ◽  
Jan Freudenberg

Abstract The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search based on k-mer (k-tuple, k-gram, oligo of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., they detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate as yet unrecognized, and potentially more ancestral, NUMTs. We introduce a “Manhattan plot” style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. Further inspection of the k-mer-based NUMT predictions, however, shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding that the mitochondrial k-mer distribution is closer to those of transposable sequences and, specifically, close to some types of LTR.
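The score described in this abstract, the reciprocal of the Jensen-Shannon divergence between two k-mer frequency distributions, can be sketched in Python as follows. This is an illustration only: the windowing of the nuclear genome, the base-2 logarithm, the handling of a zero divergence, and all function names are assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 logs), between 0 and 1."""
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            div += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * math.log2(qi / mi)
    return div


def numt_score(window, mito, k=4):
    """Reciprocal of the divergence: larger means the window looks more mitochondrial.

    Returning infinity for identical distributions is an assumption of this sketch.
    """
    d = js_divergence(kmer_freqs(window, k), kmer_freqs(mito, k))
    return math.inf if d == 0 else 1.0 / d
```

Plotting `numt_score` for successive genomic windows against their position would give a profile of the "Manhattan plot" kind the abstract mentions, with peaks at windows whose k-mer composition is closest to the mitogenome.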


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dimitri Boeckaerts ◽  
Michiel Stock ◽  
Bjorn Criel ◽  
Hans Gerstmans ◽  
Bernard De Baets ◽  
...  

Abstract Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.


2007 ◽  
Vol 401 (3) ◽  
pp. 623-633 ◽  
Author(s):  
Emily R. Slepkov ◽  
Jan K. Rainey ◽  
Brian D. Sykes ◽  
Larry Fliegel

The mammalian NHE (Na+/H+ exchanger) is a ubiquitously expressed integral membrane protein that regulates intracellular pH by removing a proton in exchange for an extracellular sodium ion. Of the nine known isoforms of the mammalian NHEs, the first isoform discovered (NHE1) is the most thoroughly characterized. NHE1 is involved in numerous physiological processes in mammals, including regulation of intracellular pH, cell-volume control, cytoskeletal organization, heart disease and cancer. NHE comprises two domains: an N-terminal membrane domain that functions to transport ions, and a C-terminal cytoplasmic regulatory domain that regulates the activity and mediates cytoskeletal interactions. Although the exact mechanism of transport by NHE1 remains elusive, recent studies have identified amino acid residues that are important for NHE function. In addition, progress has been made regarding the elucidation of the structure of NHEs. Specifically, the structure of a single TM (transmembrane) segment from NHE1 has been solved, and the high-resolution structure of the bacterial Na+/H+ antiporter NhaA has recently been elucidated. In this review we discuss what is known about both functional and structural aspects of NHE1. We relate the known structural data for NHE1 to the NhaA structure, where TM IV of NHE1 shows surprising structural similarity with TM IV of NhaA, despite little primary sequence similarity. Further experiments that will be required to fully understand the mechanism of transport and regulation of the NHE1 protein are discussed.


2000 ◽  
Vol 350 (2) ◽  
pp. 369-379 ◽  
Author(s):  
Dietrich LOEBEL ◽  
Andrea SCALONI ◽  
Sara PAOLINI ◽  
Carlo FINI ◽  
Lino FERRARA ◽  
...  

Boar submaxillary glands produce the sex-specific salivary lipocalin (SAL), which binds steroidal sex pheromones as endogenous ligands. The cDNA encoding SAL was cloned and sequenced. From a single individual, two protein isoforms, differing in three amino acid residues, were purified and structurally characterized by a combined Edman degradation/MS approach. These experiments ascertained that the mature polypeptide is composed of 168 amino acid residues and that one of the three putative glycosylation sites is post-translationally modified, and they determined the structure of the bound glycosidic moieties. Two of the cysteine residues are paired together in a disulphide bridge, whereas the remaining two occur as free thiols. SAL bears sequence similarity to other lipocalins; on this basis, a three-dimensional model of the protein has been built. A SAL isoform was expressed in Escherichia coli in good yields. Protein chemistry and CD experiments verified that the recombinant product shows the same redox state at the cysteine residues and the same conformation as the natural protein, thus suggesting similar folding. Binding experiments on natural and recombinant SAL were performed with the fluorescent probe 1-aminoanthracene, which was efficiently displaced by the steroidal sex pheromone, as well as by several odorants.


1996 ◽  
Vol 43 (3) ◽  
pp. 507-513 ◽  
Author(s):  
D Stachowiak ◽  
A Polanowski ◽  
G Bieniarz ◽  
T Wilusz

Two serine proteinase inhibitors (ELTI I and ELTI II) have been isolated from mature seeds of Echinocystis lobata by ammonium sulfate fractionation, methanol precipitation, ion exchange chromatography, affinity chromatography on immobilized anhydrotrypsin and HPLC. ELTI I and ELTI II consist of 33 and 29 amino-acid residues, respectively. The primary structures of these inhibitors are as follows:
ELTI I: KEEQRVCPRILMRCKRDSDCLAQCTCQQSGFCG
ELTI II: RVCPRILMRCKRDSDCLAQCTCQQSGFCG
The inhibitors show sequence similarity with the squash inhibitor family. ELTI I differs from ELTI II only by the presence of the NH2-terminal tetrapeptide Lys-Glu-Glu-Gln. The association constants (Ka) of ELTI I and ELTI II with bovine trypsin were determined to be 6.6 × 10^10 M^-1 and 3.1 × 10^11 M^-1, whereas the association constants of these inhibitors with cathepsin G were 1.2 × 10^7 M^-1 and 1.1 × 10^7 M^-1, respectively.
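The stated relationship between the two sequences can be checked directly. The short Python sketch below uses only the sequences, lengths and association constants quoted in the abstract; the variable names and the three-letter lookup table (restricted to the residues needed here) are introduced for illustration.

```python
# Sequences as given in the abstract.
elti_1 = "KEEQRVCPRILMRCKRDSDCLAQCTCQQSGFCG"
elti_2 = "RVCPRILMRCKRDSDCLAQCTCQQSGFCG"

# Standard one-letter to three-letter codes for the residues of interest.
three_letter = {"K": "Lys", "E": "Glu", "Q": "Gln"}

# ELTI I should be ELTI II plus a short N-terminal extension.
extension = elti_1[: len(elti_1) - len(elti_2)]
is_pure_extension = elti_1 == extension + elti_2
extension_name = "-".join(three_letter[aa] for aa in extension)

# Ratio of association constants (trypsin over cathepsin G), values from the abstract.
ka_trypsin = {"ELTI I": 6.6e10, "ELTI II": 3.1e11}
ka_cathepsin_g = {"ELTI I": 1.2e7, "ELTI II": 1.1e7}
selectivity = {name: ka_trypsin[name] / ka_cathepsin_g[name] for name in ka_trypsin}
```

With these values, both inhibitors bind trypsin more tightly than cathepsin G by a factor of several thousand or more.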


2005 ◽  
Vol 22 (4) ◽  
pp. 487-492
Author(s):  
Hubert Cantalloube ◽  
Jacques Chomilier ◽  
Sylvain Chiusa ◽  
Mathieu Lonquety ◽  
Jean-Louis Spadoni ◽  
...  

2000 ◽  
Vol 113 (23) ◽  
pp. 4143-4149 ◽  
Author(s):  
J. Li ◽  
G.I. Lee ◽  
S.R. Van Doren ◽  
J.C. Walker

The forkhead-associated (FHA) domain is a phosphopeptide-binding domain first identified in a group of forkhead transcription factors but is present in a wide variety of proteins from both prokaryotes and eukaryotes. In yeast and human, many proteins containing an FHA domain are found in the nucleus and involved in DNA repair, cell cycle arrest, or pre-mRNA processing. In plants, the FHA domain is part of a protein that is localized to the plasma membrane and participates in the regulation of receptor-like protein kinase signaling pathways. Recent studies show that a functional FHA domain consists of 120–140 amino acid residues, which is significantly larger than the sequence motif first described. Although FHA domains do not exhibit extensive sequence similarity, they share similar secondary and tertiary structures, featuring a sandwich of two anti-parallel β-sheets. One intriguing finding is that FHA domains may bind phosphothreonine, phosphoserine and sometimes phosphotyrosine, distinguishing them from other well-studied phosphoprotein-binding domains. The diversity of proteins containing FHA domains and potential differences in binding specificities suggest the FHA domain is involved in coordinating diverse cellular processes.


Author(s):  
Alexander Thomasian

Data storage requirements have consistently increased over time. According to the latest WinterCorp survey (http://www/WinterCorp.com), “The size of the world’s largest databases has tripled every two years since 2001.” With database sizes in excess of 1 terabyte, there is a clear need for storage systems that are both cost effective and highly reliable. Historically, large databases were implemented on mainframe systems. These systems are large and expensive to purchase and maintain. In recent years, large data warehouse applications have been deployed on Linux and Windows hosts as replacements for the existing mainframe systems. These systems are significantly less expensive to purchase and require fewer resources to run and maintain. With large databases it is less feasible, and less cost effective, to use tapes for backup and restore. The time required to copy terabytes of data from a database to a serial medium (streaming tape) is measured in hours, which would significantly degrade performance and decrease availability. Alternatives to serial backup include local replication, mirroring, or geoplexing of data. The increasing demands of larger databases must be met by less expensive disk storage systems that are nonetheless highly reliable and less susceptible to data loss. This article is organized into five sections. The first section provides background information that serves to introduce the concepts of disk arrays. The following three sections detail the concepts used to build complex storage systems: (i) Redundant Arrays of Independent Disks (RAID); (ii) multilevel RAID (MRAID); (iii) concurrency control and storage transactions. The conclusion contains a brief survey of modular storage prototypes.


2020 ◽  
Vol 49 (D1) ◽  
pp. D192-D200 ◽  
Author(s):  
Ioanna Kalvari ◽  
Eric P Nawrocki ◽  
Nancy Ontiveros-Palacios ◽  
Joanna Argasinska ◽  
Kevin Lamkiewicz ◽  
...  

Abstract Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.

