Multi-query sequence BLAST output examination with MuSeqBox

Abstract
Background: Accurate prediction of protein structure is fundamentally important for understanding the biological function of proteins. Template-based modeling, including protein threading and homology modeling, is a popular method for protein tertiary structure prediction. However, accurate template-query alignment and template selection remain very challenging, especially for proteins with only distant homologs available.
Results: We propose a new template-based modeling method called ThreaderAI to improve protein tertiary structure prediction. ThreaderAI formulates the task of aligning a query sequence with a template as the classical pixel classification problem in computer vision and naturally applies a deep residual neural network to the prediction. ThreaderAI first employs deep learning to predict a residue-residue aligning probability matrix by integrating sequence profile, predicted sequential structural features, and predicted residue-residue contacts, and then builds the template-query alignment by applying a dynamic programming algorithm to the probability matrix. We evaluated our method on both template-query alignment accuracy and protein threading. Experimental results show that ThreaderAI outperforms the currently popular template-based modeling methods HHpred and CNFpred and the latest contact-assisted method CEthreader, especially on proteins that do not have close homologs with known structures. In particular, in terms of alignment accuracy measured with TM-score, ThreaderAI outperforms HHpred, CNFpred, and CEthreader by 56, 13, and 11%, respectively, on template-query pairs at the fold level of similarity from SCOPe data. On CASP13’s TBM-hard data, ThreaderAI outperforms HHpred, CNFpred, and CEthreader by 16, 9, and 8% in terms of TM-score, respectively.
Conclusions: These results demonstrate that, with the help of deep learning, ThreaderAI can significantly improve the accuracy of template-based structure prediction, especially for distant-homology proteins.
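
The abstract describes building an alignment by running dynamic programming over a residue-residue probability matrix. A minimal sketch of that general idea follows. It is not ThreaderAI's implementation: the function name `align_from_probabilities`, the linear `gap_penalty`, and the global (Needleman-Wunsch style) recurrence are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def align_from_probabilities(
    prob: List[List[float]], gap_penalty: float = 0.1
) -> List[Tuple[int, int]]:
    """Return (query_index, template_index) pairs that maximize the summed
    alignment probability minus a linear gap penalty (hypothetical scoring)."""
    n = len(prob)
    m = len(prob[0]) if n else 0
    # score[i][j]: best score for the first i query and first j template residues.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + prob[i - 1][j - 1],  # align i with j
                score[i - 1][j] - gap_penalty,             # gap in template
                score[i][j - 1] - gap_penalty,             # gap in query
            )
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + prob[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap_penalty:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    toy = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8]]
    print(align_from_probabilities(toy))
```

In this toy matrix the second template residue has low probability for every query residue, so the sketch leaves it unaligned.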


PPIT: an R package for inferring microbial taxonomy from nifH sequences

Bioinformatics ◽

10.1093/bioinformatics/btab100 ◽

2021 ◽

Author(s):

Bennett J Kapili ◽

Anne E Dekas

Keyword(s):

Gene Transfer ◽

Horizontal Gene Transfer ◽

Query Sequence ◽

Marker Gene ◽

R Package ◽

Supplementary Information ◽

Marker Genes ◽

Pairwise Identity ◽

Metabolic Marker ◽

Microbial Taxonomy

Abstract
Motivation: Linking microbial community members to their ecological functions is a central goal of environmental microbiology. When assigned taxonomy, amplicon sequences of metabolic marker genes can suggest such links, thereby offering an overview of the phylogenetic structure underpinning particular ecosystem functions. However, inferring microbial taxonomy from metabolic marker gene sequences remains a challenge, particularly for the frequently sequenced nitrogen fixation marker gene, nitrogenase reductase (nifH). Horizontal gene transfer in recent nifH evolutionary history can confound taxonomic inferences drawn from the pairwise identity methods used in existing software. Other methods for inferring taxonomy are not standardized and require manual inspection that is difficult to scale.
Results: We present Phylogenetic Placement for Inferring Taxonomy (PPIT), an R package that infers microbial taxonomy from nifH amplicons using both phylogenetic and sequence identity approaches. After users place query sequences on a reference nifH gene tree provided by PPIT (n = 6317 full-length nifH sequences), PPIT searches the phylogenetic neighborhood of each query sequence and attempts to infer microbial taxonomy. An inference is drawn only if references in the phylogenetic neighborhood (1) are taxonomically consistent and (2) share sufficient pairwise identity with the query, thereby avoiding erroneous inferences due to known horizontal gene transfer events. We find that PPIT returns a higher proportion of correct taxonomic inferences than BLAST-based approaches at the cost of fewer total inferences. We demonstrate PPIT on deep-sea sediment and find that Deltaproteobacteria are the most abundant potential diazotrophs. Using this dataset we show that emending PPIT inferences based on visual inspection of query sequence placement can achieve taxonomic inferences for nearly all sequences in a query set. We additionally discuss how users can apply PPIT to the analysis of other marker genes.
Availability: PPIT is freely available to non-commercial users at https://github.com/bkapili/ppit. Installation includes a vignette that demonstrates package use and reproduces the nifH amplicon analysis discussed here. The raw nifH amplicon sequence data have been deposited in the GenBank, EMBL, and DDBJ databases under BioProject number PRJEB37167.
Supplementary information: Supplementary data are available at Bioinformatics online.
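
The two-condition rule stated in the abstract (taxonomic consistency plus sufficient pairwise identity) can be sketched as below. This is an illustration only and not PPIT's R interface: the function `infer_taxonomy`, the `min_identity` threshold of 0.9, and the choice to require the threshold for every neighbor are assumptions.

```python
from typing import List, Optional, Tuple


def infer_taxonomy(
    neighbors: List[Tuple[str, float]], min_identity: float = 0.9
) -> Optional[str]:
    """neighbors holds (taxon, pairwise identity to the query) for each
    reference in the query's phylogenetic neighborhood.

    Returns a taxon only when both conditions hold; otherwise None,
    meaning no inference is drawn."""
    if not neighbors:
        return None
    taxa = {taxon for taxon, _ in neighbors}
    # Condition 1: the neighborhood must be taxonomically consistent.
    if len(taxa) != 1:
        return None
    # Condition 2 (assumed form): every neighbor must meet the identity threshold.
    if any(identity < min_identity for _, identity in neighbors):
        return None
    return next(iter(taxa))


if __name__ == "__main__":
    consistent = [("Deltaproteobacteria", 0.95), ("Deltaproteobacteria", 0.92)]
    mixed = [("Deltaproteobacteria", 0.95), ("Firmicutes", 0.94)]
    print(infer_taxonomy(consistent))  # a single agreed taxon
    print(infer_taxonomy(mixed))       # None: neighbors disagree
```

Declining to answer when neighbors disagree is what trades fewer total inferences for a higher proportion of correct ones, as the abstract reports.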


SHOOT: phylogenetic gene search and ortholog inference

10.1101/2021.09.01.458564 ◽

2021 ◽

Author(s):

David Emms ◽

Steven Kelly

Keyword(s):

Phylogenetic Analysis ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Trees ◽

Query Sequence ◽

Gene Tree ◽

Biological Research ◽

Gene Sequences ◽

Multiple Sequence ◽

Gene Search

Determining the evolutionary relationships between gene sequences is fundamental to comparative biological research. However, conducting such analyses requires a high degree of technical proficiency in several computational tasks, including gene family construction, multiple sequence alignment, and phylogenetic inference. Here we present SHOOT, an easy-to-use phylogenetic search engine for fast and accurate phylogenetic analysis of biological sequences. SHOOT searches a user-provided query sequence against a database of phylogenetic trees of gene sequences (gene trees) and returns a gene tree with the given query sequence correctly grafted within it. We show that SHOOT can perform this search and placement with speed comparable to that of a conventional BLAST search. We demonstrate that SHOOT phylogenetic placements are as accurate as conventional multiple sequence alignment and maximum likelihood tree inference approaches. We further show that SHOOT can be used to identify orthologs with accuracy equivalent to that of conventional orthology inference methods. In summary, SHOOT is an accurate and fast tool for complete phylogenetic analysis of novel query sequences. An easy-to-use web server is available online at www.shoot.bio.


Decona: From demultiplexing to consensus for Nanopore amplicon data

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65029 ◽

2021 ◽

Vol 4 ◽

Author(s):

Saskia Oosterbroek ◽

Karlijn Doorenspleet ◽

Reindert Nijland ◽

Lara Jansen

Keyword(s):

Sequence Data ◽

Variant Calling ◽

Environmental Dna ◽

Laptop Computer ◽

Consensus Sequences ◽

Sequencing Errors ◽

Blast Output ◽

Command Line Tool ◽

Microbial Symbionts ◽

User Friendly

Sequencing of long amplicons is one of the major benefits of Nanopore technologies, as it allows for reads much longer than those from Illumina. One of the major challenges in the analysis of these long Nanopore reads is the relatively high error rate. Sequencing errors are generally corrected by consensus generation and polishing. This is still a challenge for mixed samples such as metabarcoding environmental DNA, bulk DNA, mixed amplicon PCRs, and contaminated samples, because sequence data have to be clustered before consensus generation. To this end, we developed Decona (https://github.com/Saskia-Oosterbroek/decona), a command line tool that creates consensus sequences from mixed (metabarcoding) samples using a single command. Decona uses the CD-hit algorithm to cluster reads after demultiplexing (qcat) and filtering (NanoFilt). The sequences in each cluster are subsequently aligned (Minimap2), consensus sequences are generated (Racon), and these are finally polished (Medaka). Variant calling of the clusters (Medaka) is optional. With the integration of the BLAST+ application, Decona not only generates consensus sequences but also produces BLAST output if desired. The program can be used on a laptop computer, making it suitable for use under field conditions. Amplicon data ranging from 300 to 7500 nucleotides were successfully processed by Decona, creating consensus sequences reaching over 99.9% read identity. This included fish datasets (environmental DNA from filtered water) from a curated aquarium, vertebrate datasets that were contaminated with human sequences, and the separation of sponge sequences from their countless microbial symbionts. Decona considerably simplifies and speeds up post-sequencing processes, providing consensus sequences and BLAST output through a single command. Classifying consensus sequences instead of raw sequences improves classification accuracy and drastically decreases the number of sequences that need to be classified. Overall, it is a user-friendly option for researchers with limited knowledge of script-based data processing.


A STATIC OPTIMALITY TRANSFORMATION WITH APPLICATIONS TO PLANAR POINT LOCATION

International Journal of Computational Geometry & Applications ◽

10.1142/s0218195912600084 ◽

2012 ◽

Vol 22 (04) ◽

pp. 327-340 ◽

Cited By ~ 6

Author(s):

JOHN IACONO ◽

WOLFGANG MULZER

Keyword(s):

Data Structures ◽

Prior Information ◽

Query Sequence ◽

Binary Search ◽

Asymptotic Performance ◽

Search Trees ◽

Point Location ◽

Static Structure ◽

Additional Information ◽

Planar Point Location

Over the last decade, there have been several data structures that, given a planar subdivision and a probability distribution over the plane, provide a way for answering point location queries that is fine-tuned for the distribution. All these methods suffer from the requirement that the query distribution must be known in advance. We present a new data structure for point location queries in planar triangulations. Our structure is asymptotically as fast as the optimal structures, but it requires no prior information about the queries. This is a 2-D analogue of the jump from Knuth's optimum binary search trees (discovered in 1971) to the splay trees of Sleator and Tarjan in 1985. While the former need to know the query distribution, the latter are statically optimal. This means that we can adapt to the query sequence and achieve the same asymptotic performance as an optimum static structure, without needing any additional information.
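
For readers unfamiliar with the term, the static optimality property of splay trees mentioned above is usually stated as the following bound (Sleator and Tarjan's Static Optimality Theorem). The notation is chosen here for illustration and is not taken from the paper: item $i$ is accessed $f_i \ge 1$ times in a sequence of $m = \sum_i f_i$ accesses over $n$ items.

```latex
% Total time for the whole access sequence on a splay tree:
\[
  T(f_1, \dots, f_n) \;=\; O\!\left( m + \sum_{i=1}^{n} f_i \log \frac{m}{f_i} \right)
\]
% The sum equals m times the empirical entropy of the access frequencies,
% which is, up to constant factors, the cost of the best static search tree
% built with full knowledge of the f_i.
```

The point location result in the abstract is described as the planar analogue of this guarantee: matching the best distribution-aware static structure without being told the distribution.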


An Exploration of the Triplet Periodicity in Nucleotide Sequences with a Mature Self-Adaptive Spectral Rotation Approach

Journal of Applied Mathematics ◽

10.1155/2014/176943 ◽

2014 ◽

Vol 2014 ◽

pp. 1-9 ◽

Cited By ~ 1

Author(s):

Bo Chen ◽

Ping Ji

Keyword(s):

Random Walk ◽

Complex Plane ◽

Query Sequence ◽

Nucleotide Sequences ◽

Sequence Pattern ◽

Coding Regions ◽

Persistent Pattern ◽

Triplet Periodicity ◽

Self Adaptive

Previously, a self-adaptive spectral rotation (SASR) method was developed for predicting coding regions in nucleotide sequences, based on a universal statistical feature of coding regions named triplet periodicity (TP). It outputs a random walk, the TP walk, in the complex plane for the query sequence. Each step in the walk corresponds to a position in the sequence and is generated from a long-term statistic of the TP in the sequence. The coding regions (TP intensive) are then visually discriminated from the noncoding ones (without TP) in the TP walk. In this paper, the behavior of the walks for random nucleotide sequences is further investigated qualitatively. A slightly leftward trend (a negative noise) is observed in such walks, which was not reported in the previous SASR literature. An improved SASR, named the mature SASR, is proposed in order to eliminate the noise and correct the TP walks. Furthermore, a potential sequence pattern opposite to the TP persistent pattern, the TP antipersistent pattern, is explored. Applications of the algorithms to simulated datasets show their capability to detect such a potential sequence pattern.
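
To make the notion of triplet periodicity concrete, the sketch below computes the classical Fourier measure of a period-3 signal in a nucleotide sequence. This is a swapped-in, textbook technique and not the SASR or mature SASR method of the paper; the function name `period3_power` is hypothetical.

```python
import cmath


def period3_power(seq: str) -> float:
    """Sum, over the four nucleotides, of the squared magnitude of the
    Fourier component at frequency 1/3 of that nucleotide's indicator
    sequence. Large values indicate a strong period-3 pattern."""
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        # Each occurrence of `base` at position n contributes one unit
        # vector in the complex plane, rotated by 120 degrees per position.
        component = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, symbol in enumerate(seq)
            if symbol == base
        )
        total += abs(component) ** 2
    return total


if __name__ == "__main__":
    print(period3_power("ACG" * 10))  # strongly periodic toy sequence
    print(period3_power("A" * 30))    # no period-3 structure
```

In the periodic toy sequence every occurrence of a given base falls on the same codon position, so its unit vectors all point the same way and add up; in the homopolymer they cancel over each full turn.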


BLAST2GENE: a comprehensive conversion of BLAST output into independent genes and gene fragments

Bioinformatics ◽

10.1093/bioinformatics/bth225 ◽

2004 ◽

Author(s):

M. Suyama

Keyword(s):

Blast Output


A Novel Method of Predicting Protein Disordered Regions Based on Sequence Features

BioMed Research International ◽

10.1155/2013/414327 ◽

2013 ◽

Vol 2013 ◽

pp. 1-8 ◽

Cited By ~ 5

Author(s):

Tong-Hui Zhao ◽

Min Jiang ◽

Tao Huang ◽

Bi-Qing Li ◽

Ning Zhang ◽

...

Keyword(s):

Conservation Score ◽

Query Sequence ◽

Disordered Proteins ◽

Training Set ◽

Disordered Structures ◽

Feature List ◽

Novel Method ◽

Scoring Matrix ◽

Fold Cross Validation ◽

Disordered Regions

With a large number of disordered proteins and their important functions discovered, it is highly desirable to develop effective methods to computationally predict protein disordered regions. In this study, based on Random Forest (RF), Maximum Relevancy Minimum Redundancy (mRMR), and Incremental Feature Selection (IFS), we developed a new method to predict disordered regions in proteins. The mRMR criterion was used to rank the importance of all candidate features. The top 128 features were then selected from the ranked feature list to build the optimal model, including 92 Position Specific Scoring Matrix (PSSM) conservation score features and 36 secondary structure features. As a result, a Matthews correlation coefficient (MCC) of 0.3895 was achieved on the training set by 10-fold cross-validation. Based on the prediction results of the method for each query sequence, we used a scanning and modification strategy to improve the performance. The accuracy (ACC) and MCC were increased by 4% and almost 0.2%, respectively, compared with three other popular predictors: DISOPRED, DISOclust, and OnD-CRF. The selected features may shed some light on the formation mechanism of disordered structures, providing guidelines for experimental validation.
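
The two metrics quoted above have standard definitions in terms of a binary confusion matrix. The sketch below shows how they are computed; the function names and the example counts are illustrative only and are not data from the study.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts: residues predicted disordered vs. ordered.
    print(accuracy(tp=6, tn=3, fp=1, fn=2))
    print(mcc(tp=6, tn=3, fp=1, fn=2))
```

MCC ranges from -1 to 1 and stays informative when disordered residues are much rarer than ordered ones, which is one reason it is often reported alongside accuracy.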


Automated Retroviral Insertion Site Sequence Analysis and Mapping Tool Followed by Database Analysis.

Blood ◽

10.1182/blood.v106.11.5528.5528 ◽

2005 ◽

Vol 106 (11) ◽

pp. 5528-5528

Author(s):

Stephanie Laufs ◽

Frank A. Giordano ◽

Agnes Hotz-Wagenblatt ◽

Uwe Appelt ◽

Daniel Lauterborn ◽

...

Keyword(s):

Insertional Mutagenesis ◽

Integration Site ◽

Query Sequence ◽

Cpg Islands ◽

Insertion Site ◽

High Throughput Analysis ◽

Site Analysis ◽

Conventional Analysis ◽

Integration Site Analysis ◽

Insertion Sites

Abstract
The increasing use of retroviral vector-mediated gene transfer and recent reports of insertional mutagenesis in mice and humans have created intense interest in characterizing vector integrations at the genomic level. Techniques to determine insertion sites, mainly based on time-consuming manual data processing and compilation, are thus commonly applied in gene therapy laboratories. Since high variability in processing methods hampers data comparison, there is an urgent need to process the data arising from such analyses systematically. Sequences obtained from integration site analysis are judged to be authentic only if the matching part of the genomic query sequence is flanked by the 5′LTR sequence on one side and the adapter sequence on the other. We therefore developed the IntegrationSeq tool. For this task, different methods were implemented for converting ABI sequence trace files to high-quality sequences and for recognizing and deleting the LTR and adapter parts of the isolated clones. If neither a primer nor an LTR can be found, the sequence is discarded. If the LTR is found on the complementary strand, the integration sequence is reversed. The remaining sequence between the primer and LTR positions is taken as the integration sequence and written to a sequence output file. We validated IntegrationSeq using 259 trace files originating from integration site analysis (LM-PCR). Sequences can be trimmed by IntegrationSeq, leading to an increased yield of valid integration sequences; the tool has been shown to be more sensitive (100%) than conventional analysis (94.3%) and 15 times faster, while the specificities are equal (both 100%). Valid integration sequences are further processed with IntegrationMap for automatic genomic mapping. IntegrationMap runs 50 times faster than conventional methods and retrieves detailed information about whether integrations are located in or close to genes, the name of the gene, the exact localization in the transcriptional units, and further parameters such as the distance from the transcription start site to the integration. Further information, e.g. data about CpG islands, LINEs, or SINEs and their distances to the integration, is also displayed. Output files generated by the task were found to be 99.8% identical to results retrieved by conventional mapping with the Ensembl alignment tool. Using both tools, IntegrationSeq and IntegrationMap, a validated, fast, and standardized high-throughput analysis of insertion sites can be achieved for the first time.
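
The trimming rules described in the abstract (find the LTR, fall back to the complementary strand, keep what lies between the flanking sequences, discard otherwise) can be sketched as follows. This is not the IntegrationSeq code: the function names, the exact-match search, and the use of the adapter as the second flank are assumptions, and a real tool would need to tolerate sequencing errors.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_integration_sequence(
    read: str, ltr: str, adapter: str
) -> Optional[str]:
    """Return the sequence between the LTR and the adapter, or None if the
    read is discarded because the flanks cannot be found on either strand."""
    read = read.upper()
    # Try the read as given, then its reverse complement (LTR on the other strand).
    for candidate in (read, reverse_complement(read)):
        ltr_pos = candidate.find(ltr)
        if ltr_pos == -1:
            continue
        start = ltr_pos + len(ltr)
        adapter_pos = candidate.find(adapter, start)
        if adapter_pos == -1:
            continue
        return candidate[start:adapter_pos]
    return None


if __name__ == "__main__":
    # Toy flanks, not real LTR or adapter sequences.
    ltr, adapter = "GGGAAA", "CCCTTT"
    read = ltr + "ATGCATGA" + adapter
    print(extract_integration_sequence(read, ltr, adapter))
    print(extract_integration_sequence(reverse_complement(read), ltr, adapter))
```

Both calls recover the same genomic fragment, which mirrors the rule that a read with the LTR on the complementary strand is flipped before the fragment is extracted.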
