query sequence
Recently Published Documents

Construct a variable-length fragment library for de novo protein structure prediction

Although remarkable achievements, such as AlphaFold2, have been made in end-to-end structure prediction, fragment libraries remain essential for de novo protein structure prediction, which can help explore and understand the protein-folding mechanism. In this work, we developed a variable-length fragment library (VFlib). In VFlib, a master structure database was first constructed from the Protein Data Bank through sequence clustering. The hidden Markov model (HMM) profile of each protein in the master structure database was generated by HHsuite, and the secondary structure of each protein was calculated by DSSP. For the query sequence, the HMM profile was first constructed. Then, variable-length fragments were retrieved from the master structure database through dynamically variable-length profile-profile comparison. A complete method for chopping the query HMM profile during this process was proposed to obtain fragments with increased diversity. Finally, secondary structure information was used to further screen the retrieved fragments and generate the final fragment library for the specific query sequence. Experimental results on a set of 120 nonredundant proteins showed that the global precision and coverage of the fragment library generated by VFlib were 55.04% and 94.95%, respectively, at an RMSD cutoff of 1.5 Å. Compared with the benchmark method NNMake, the global precision of our fragment library increased by 62.89% with equivalent coverage. Furthermore, the fragments generated by VFlib and NNMake were used to predict structure models through fragment assembly. Controlled experiments demonstrated that the average TM-score of VFlib was 16.00% higher than that of NNMake.
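As a rough illustration of the two metrics reported above, the following sketch computes global precision and coverage for a fragment library at an RMSD cutoff. It assumes commonly used definitions that the abstract does not spell out: precision is the fraction of fragments at or below the cutoff, and coverage is the fraction of query positions spanned by at least one such fragment. The `Fragment` record and the function name are hypothetical and are not taken from VFlib.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical record: one fragment placed on the query sequence."""
    start: int    # 0-based start position on the query
    length: int   # number of residues (varies from fragment to fragment)
    rmsd: float   # RMSD to the native structure, in angstroms


def precision_and_coverage(fragments, query_length, cutoff=1.5):
    """Return (precision, coverage) under the assumed definitions."""
    if not fragments or query_length <= 0:
        return 0.0, 0.0
    # A fragment counts as "good" when its RMSD is at or below the cutoff.
    good = [f for f in fragments if f.rmsd <= cutoff]
    covered = set()
    for f in good:
        covered.update(range(f.start, min(f.start + f.length, query_length)))
    return len(good) / len(fragments), len(covered) / query_length


# Toy library: made-up numbers, for illustration only.
library = [
    Fragment(start=0, length=4, rmsd=0.9),
    Fragment(start=2, length=5, rmsd=2.3),
    Fragment(start=5, length=3, rmsd=1.2),
    Fragment(start=6, length=4, rmsd=1.5),
]
precision, coverage = precision_and_coverage(library, query_length=10)
```

In this toy library three of the four fragments pass the cutoff, and position 4 is spanned only by the fragment that fails it, so that position does not count as covered.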

The Design and Implementation of an Improved Lightweight BLASTP on CUDA GPU

Symmetry ◽ 10.3390/sym13122385 ◽ 2021 ◽ Vol 13 (12) ◽ pp. 2385

Author(s): Xue Sun ◽ Chao-Chin Wu ◽ Yan-Fang Liu

Keyword(s): Sequence Alignment ◽ Query Sequence ◽ Length Distribution ◽ Lookup Table ◽ Sequence Length ◽ Position Information ◽ Memory Space ◽ Table Entry ◽ The USA ◽ Index Table

In computational biology, sequence alignment is a fundamental methodology. BLAST, provided by the National Center for Biotechnology Information (NCBI) in the USA, is a widely used bioinformatics tool for sequence alignment. The BLAST server receives tens of thousands of queries every day on average. Among the procedures of BLAST, the hit detection process, whose core architecture is a lookup table, is the most time-consuming. In the latest work, a lightweight BLASTP on CUDA GPU with a hybrid query-index table was proposed for query sequences shorter than 512, which effectively improved query efficiency. According to the reported protein sequence length distribution, about 90% of sequences have a length of 1024 or less. In this paper, we propose an improved lightweight BLASTP to speed up hit detection for longer query sequences. The maximum supported query length is increased from 512 to 1024; as a result, one more bit is required to encode each sequence position. To meet this requirement, an extended hybrid query-index table (EHQIT) is proposed that accommodates three sequence positions in a four-byte table entry, so that a single memory access suffices to retrieve all the position information as long as the number of hits is three or fewer. Moreover, if there are more than three hits for a possible word, all the position information is stored in contiguous table entries, which eliminates branch divergence and reduces the memory space needed for pointers to the overflow buffer. A square symmetric scoring matrix, BLOSUM62, is used to determine the relative score of matching two characters in a sequence alignment. The experimental results show that, for queries shorter than 512, our improved lightweight BLASTP outperforms the original lightweight BLASTP with an average speedup of 1.2. When the number of hit overflows increases, the speedup can be as high as two. For queries shorter than 1024, our improved lightweight BLASTP provides speedups ranging from 1.56 to 3.08 over CUDA-BLAST. In short, the improved lightweight BLASTP can replace the original one because it supports longer query sequences and provides better performance.
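The arithmetic behind the table entry can be sketched as follows: positions below 1024 need 10 bits each (one more than the 9 bits that suffice for 512), so three positions occupy 30 of the 32 bits in a four-byte entry. The layout below, which uses the two spare bits as a hit counter, is an assumption made for illustration only; the abstract does not describe the actual EHQIT bit layout, and the function names are hypothetical. Python is used here for readability, although the original work targets CUDA.

```python
POSITION_BITS = 10                          # 2**10 = 1024 query positions
POSITION_MASK = (1 << POSITION_BITS) - 1
MAX_INLINE_HITS = 3                         # three positions per four-byte entry


def pack_entry(positions):
    """Pack up to three query positions into one 32-bit integer.

    Assumed layout (not taken from the paper): bits 30-31 hold the hit
    count and bits 0-29 hold three 10-bit positions.
    """
    if len(positions) > MAX_INLINE_HITS:
        raise ValueError("more than three hits need overflow handling")
    entry = len(positions) << (POSITION_BITS * MAX_INLINE_HITS)
    for slot, pos in enumerate(positions):
        if not 0 <= pos <= POSITION_MASK:
            raise ValueError(f"position {pos} does not fit in 10 bits")
        entry |= pos << (POSITION_BITS * slot)
    return entry


def unpack_entry(entry):
    """Recover the list of positions stored by pack_entry."""
    count = entry >> (POSITION_BITS * MAX_INLINE_HITS)
    return [(entry >> (POSITION_BITS * slot)) & POSITION_MASK
            for slot in range(count)]


entry = pack_entry([5, 700, 1023])
positions = unpack_entry(entry)
```

Under this assumed layout, a word with three or fewer hits is resolved by reading a single 32-bit value, which mirrors the single-memory-access property described in the abstract.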

Coherent Dialog Generation with Query Graph

ACM Transactions on Asian and Low-Resource Language Information Processing ◽ 10.1145/3462551 ◽ 2021 ◽ Vol 20 (6) ◽ pp. 1-23

Author(s): Jun Xu ◽ Zeyang Lei ◽ Haifeng Wang ◽ Zheng-Yu Niu ◽ Hua Wu ◽ ...

Keyword(s): Web Search ◽ Query Sequence ◽ Policy Learning ◽ Action Representation ◽ Open Domain ◽ Query Graph ◽ Policy Model ◽ Sequence Planning ◽ Planning Decisions ◽ High Level

Learning to generate coherent and informative dialogs is an enduring challenge for open-domain conversation generation. Previous work leverages knowledge graphs or documents to facilitate informative dialog generation, with little attention to dialog coherence. In this article, to enhance multi-turn open-domain dialog coherence, we propose to leverage a new knowledge source, web search session data, to facilitate hierarchical knowledge sequence planning, which determines a sketch of a multi-turn dialog. Specifically, we formulate knowledge sequence planning, or dialog policy learning, as a graph-grounded Reinforcement Learning (RL) problem. To this end, we first build a two-level query graph with queries as utterance-level vertices and their topics (entities in queries) as topic-level vertices. We then present a two-level dialog policy model that plans a high-level topic sequence and a low-level query sequence over the query graph to guide a knowledge-aware response generator. In particular, to foster forward-looking knowledge planning decisions for better dialog coherence, we devise a heterogeneous graph neural network to incorporate neighbouring vertex information, or possible future RL action information, into each vertex (as an RL action) representation. Experimental results on two benchmark dialog datasets demonstrate that our framework can outperform strong baselines in terms of dialog coherence, informativeness, and engagingness.

OptiFit: an improved method for fitting amplicon sequences to existing OTUs

10.1101/2021.11.09.468000 ◽ 2021

Author(s): Kelly L Sovacool ◽ Sarah L Westcott ◽ M Brodie Mumphrey ◽ Gabrielle A Dotson ◽ Patrick D. Schloss

Keyword(s): Processing Speed ◽ De Novo ◽ Sequence Data ◽ Query Sequence ◽ Reference Sequence ◽ Reference Database ◽ Clustering Methods ◽ Improved Method ◽ Operational Taxonomic Units ◽ External Reference

Assigning amplicon sequences to operational taxonomic units (OTUs) is often an important step in characterizing the composition of microbial communities across large datasets. OptiClust, a de novo OTU clustering method, has been shown to produce higher quality OTU assignments than other methods and at comparable or faster speeds. A notable difference between de novo clustering and database-dependent reference clustering methods is that OTU assignments from de novo methods may change when new sequences are added to a dataset. However, in some cases one may wish to incorporate new samples into a previously clustered dataset without performing clustering again on all sequences, such as when comparing across datasets or deploying machine learning models where OTUs are features. Existing reference-based clustering methods produce consistent OTUs, but they only consider the similarity of each query sequence to a single reference sequence in an OTU, thus resulting in OTU assignments that are significantly worse than those generated by de novo methods. To provide an efficient and robust method to fit amplicon sequence data to existing OTUs, we developed the OptiFit algorithm. Inspired by OptiClust, OptiFit considers the similarity of all pairs of reference and query sequences in an OTU to produce OTUs of the best possible quality. We tested OptiFit using four microbiome datasets with two different strategies: by clustering to an external reference database, or by splitting the dataset into a reference and query set and clustering the query sequences to the reference set after clustering it using OptiClust. The result is an improved implementation of closed- and open-reference clustering. OptiFit produces OTUs of similar quality to OptiClust and at faster speeds when using the split dataset strategy, although the OTU quality and processing speed depend on the database chosen when using the external database strategy. OptiFit provides a suitable option for users who require consistent OTU assignments at the same quality afforded by de novo clustering methods.

Protein sequence profile prediction using ProtAlbert transformer

10.1101/2021.09.23.461475 ◽ 2021

Author(s): Fatemeh Zare-Mirakabad ◽ Armin Behjati ◽ Seyed Shahriar Arab ◽ Abbas Nowzari-Dalini

Keyword(s): Amino Acids ◽ Protein Sequence ◽ Nearest Neighbor ◽ Tertiary Structure ◽ Query Sequence ◽ Protein Secondary Structure ◽ Protein Sequences ◽ Family Characteristics ◽ Sequence Profile ◽ Protein Sequence Profile

Protein sequences can be viewed as a language; therefore, we can benefit from models initially developed for natural languages, such as transformers. ProtAlbert is one of the best pre-trained transformers on protein sequences, and its efficiency enables us to run the model on longer sequences with less computation power while achieving performance comparable to that of the other pre-trained transformers. This paper includes two main parts: transformer analysis and profile prediction. In the first part, we propose five algorithms to assess the attention heads in different layers of ProtAlbert for five protein characteristics: nearest-neighbor interactions, type of amino acids, biochemical and biophysical properties of amino acids, protein secondary structure, and protein tertiary structure. These algorithms are applied to 55 proteins extracted from CASP13 and three case study proteins whose sequences, experimental tertiary structures, and HSSP profiles are available. This assessment shows that although the model is only pre-trained on protein sequences, attention heads in the layers of ProtAlbert are representative of some protein family characteristics. This conclusion leads to the second part of our work. We propose an algorithm called PA_SPP for protein sequence profile prediction by pre-trained ProtAlbert using masked-language modeling. The PA_SPP algorithm can help researchers predict an HSSP profile when the database contains no sequences similar to the query sequence from which to build the HSSP profile.

ProteInfer: deep networks for protein functional inference

10.1101/2021.09.20.461077 ◽ 2021

Author(s): Theo Sanderson ◽ Maxwell L Bileschi ◽ David Belanger ◽ Lucy Colwell

Keyword(s): Amino Acid ◽ Amino Acid Sequence ◽ Protein Function ◽ Protein Function Prediction ◽ Query Sequence ◽ Functional Space ◽ Amino Acid Sequences ◽ Deep Convolutional Neural Networks ◽ Software Interfaces ◽ Downstream Analysis

Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query sequence either to thousands of models of protein families or to large databases of individual protein sequences. Here we instead employ deep convolutional neural networks to predict a variety of protein functions -- EC numbers and GO terms -- directly from an unaligned amino acid sequence. This approach provides precise predictions which complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user's personal computer with no data uploaded to remote servers. Moreover, these models place full-length amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read the interactive version of this paper, visit https://google-research.github.io/proteinfer/

ProteinPrompt: a webserver for predicting protein-protein interactions

10.1101/2021.09.03.458859 ◽ 2021

Author(s): Sebastian Canzler ◽ David Ulbricht ◽ Markus Fischer ◽ Nikola Ristic ◽ Peter Werner Hildebrand ◽ ...

Keyword(s): Neural Network ◽ Random Forest ◽ Protein Interactions ◽ Query Sequence ◽ Input Sequence ◽ Machine Learning Algorithms ◽ Protein Protein Interactions ◽ Accuracy Rate ◽ Binding Partners ◽ Potential Binding

Motivation: Protein-protein interactions play an essential role in a great variety of cellular processes and are therefore of significant interest for the design of new therapeutic compounds as well as the identification of side effects due to unexpected binding. Here, we present ProteinPrompt, a webserver that uses machine-learning algorithms to calculate specific, currently unknown protein-protein interactions. Our tool is designed to quickly and reliably predict contacts based on an input sequence in order to scan large sequence libraries for potential binding partners, with the goal of accelerating and assuring the quality of the laborious process of drug target identification. Methods: We collected and thoroughly filtered a comprehensive database of known contacts from several sources, which is available for download. ProteinPrompt provides two complementary search methods of similar accuracy for comparison and consensus building. The default method is a random forest algorithm that uses the auto-correlations of seven amino acid scales. Alternatively, a graph neural network implementation can be selected. For each query sequence, potential binding partners are identified from a protein sequence database. The proteomes of several organisms are available and can be searched for contacts. Results: To evaluate the predictive power of the algorithms, we prepared a test dataset that was rigorously filtered for redundancy. No sequence pairs similar to the ones used for training were included in this dataset. With this challenging dataset, the random forest method achieved an accuracy rate of 0.88 and an area under the curve of 0.95. The graph neural network achieved an accuracy rate of 0.86 using the same dataset. Since the underlying learning approaches are unrelated, comparing the results of random forest and graph neural networks reduces the likelihood of errors. ProteinPrompt is available online at: http://proteinformatics.org/ProteinPrompt. The server makes it possible to scan the human proteome for potential binding partners of an input sequence within minutes. Conclusion: We offer a fast, accurate, easy-to-use online service for predicting binding partners from an input sequence.
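To make the "auto-correlations of amino acid scales" feature more concrete, here is a minimal sketch of one common way to compute such descriptors for a single scale. The abstract names neither the seven scales nor the exact formula, so both are assumptions here: the Kyte-Doolittle hydropathy scale stands in as an example, a mean-centered auto-covariance is used, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation_features(sequence, scale=KYTE_DOOLITTLE, max_lag=3):
    """Return one auto-covariance value per lag from 1 to max_lag.

    Each residue is mapped to its scale value, the values are centered on
    their mean, and products of residues `lag` positions apart are averaged.
    """
    values = [scale[aa] for aa in sequence.upper()]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


features = autocorrelation_features("MKTAYIAKQR")
alternating = autocorrelation_features("IRIRIR", max_lag=2)
```

Each sequence yields a fixed-length vector regardless of its length, which is what allows a classifier such as a random forest to take it as input; with seven scales, the per-scale vectors would presumably be concatenated.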

SHOOT: phylogenetic gene search and ortholog inference

10.1101/2021.09.01.458564 ◽ 2021

Author(s): David Emms ◽ Steven Kelly

Keyword(s): Phylogenetic Analysis ◽ Sequence Alignment ◽ Multiple Sequence Alignment ◽ Phylogenetic Trees ◽ Query Sequence ◽ Gene Tree ◽ Biological Research ◽ Gene Sequences ◽ Multiple Sequence ◽ Gene Search

Determining the evolutionary relationships between gene sequences is fundamental to comparative biological research. However, conducting such analyses requires a high degree of technical proficiency in several computational tools including gene family construction, multiple sequence alignment, and phylogenetic inference. Here we present SHOOT, an easy-to-use phylogenetic search engine for fast and accurate phylogenetic analysis of biological sequences. SHOOT searches a user-provided query sequence against a database of phylogenetic trees of gene sequences (gene trees) and returns a gene tree with the given query sequence correctly grafted within it. We show that SHOOT can perform this search and placement with comparable speed to a conventional BLAST search. We demonstrate that SHOOT phylogenetic placements are as accurate as conventional multiple sequence alignment and maximum likelihood tree inference approaches. We further show that SHOOT can be used to identify orthologs with equivalent accuracy to conventional orthology inference methods. In summary, SHOOT is an accurate and fast tool for complete phylogenetic analysis of novel query sequences. An easy-to-use webserver is available online at www.shoot.bio.

ACES: Analysis of Conservation with Expansive Species

10.1101/2021.06.16.448733 ◽ 2021

Author(s): Evin M. Padhi ◽ Elvisa Mehinovic ◽ Eleanor I. Sams ◽ Jeffrey K. Ng ◽ Tychele N. Turner

Keyword(s): Multiple Sequence Alignment ◽ Large Scale ◽ Query Sequence ◽ Specific Sequence ◽ Multiple Sequence ◽ Fragment Assembly ◽ Data Files ◽ Computational Workflow ◽ Reference Genomes ◽ Specific Sequences

Motivation: An abundance of new reference genomes are becoming available through large-scale sequencing efforts. While the reference FASTA for each genome is available, there is currently no automated mechanism to query a specific sequence across all new reference genomes. Results: We developed ACES (Analysis of Conservation with Expansive Species) as a computational workflow to query specific sequences of interest (e.g., enhancers, promoters, exons) against reference genomes with an available reference FASTA. This automated workflow generates BLAST hits against each of the reference genomes, a multiple sequence alignment file, a graphical fragment assembly file, and a phylogenetic tree file. These data files can then be used by the researcher in several ways to provide key insights into conservation of the query sequence. Availability: ACES is available at https://github.com/TNTurnerLab/ACES

In Silico Evaluation of the Structural Dynamics Beta-Amylase from Sweet Potato (Ipomoea batatas)

Asian Journal of Biotechnology and Bioresource Technology ◽ 10.9734/ajb2t/2021/v7i330100 ◽ 2021 ◽ pp. 1-10

Author(s): David Akintayo Obe ◽ Toluwase Hezekiah Fatoki

Keyword(s): Sweet Potato ◽ Structural Dynamics ◽ In Silico ◽ Ipomoea Batatas ◽ Molecular Mechanisms ◽ Query Sequence ◽ Industrial Applications ◽ Amino Acid Residues ◽ Catalytic Role ◽ Mean Square Fluctuation

Background: Sweet potato tubers are an invaluable crop that could serve both dietary and industrial purposes owing to their high β-amylase content. β-amylases play an essential role in plant carbohydrate metabolism as well as in many industrial applications such as the malting process in the brewing and distilling industries. Aim: This study aims at a better understanding of the evolutionary and molecular properties and the structural dynamics of sweet potato β-amylase using an in silico approach. Methodology: Sixteen of the 250 sequences with at least 69% identity to the query sequence (P10537) were manually selected from the UniProt database for further analysis. Result: The enzyme has a theoretical isoelectric point of 4.97 and a molecular weight of 56 kDa. The root-mean-square fluctuation (RMSF) of sweet potato β-amylase showed possible conservation of amino acid residues 105-130 and 260-345, with the highest fluctuation in the C-terminal loop (residues 443-498). The catalytic role of Glu187 and Thr344 in sweet potato β-amylase has been elucidated, providing the missing link in the previously available mechanisms, while Cys96 is essential for the inactivation of enzyme activity. Conclusion: Elucidation of the molecular mechanisms of expression and catalytic activity, together with an understanding of the physicochemical properties of β-amylase from sweet potato, will help in the development of useful applications of industrial importance.
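The identity threshold used in the methodology can be illustrated with a small sketch that computes percent identity between two pre-aligned sequences and filters hits against a cutoff. It assumes one particular convention, counting only columns where neither sequence has a gap; other tools use different denominators, and the abstract does not say which was applied. The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_query, aligned_hit, gap="-"):
    """Percent of identical residues over columns without a gap in either row."""
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(q, h) for q, h in zip(aligned_query, aligned_hit)
             if q != gap and h != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for q, h in pairs if q == h)
    return 100.0 * matches / len(pairs)


def filter_by_identity(aligned_query, aligned_hits, threshold=69.0):
    """Keep the hits whose identity to the query meets the threshold."""
    return {
        name: seq
        for name, seq in aligned_hits.items()
        if percent_identity(aligned_query, seq) >= threshold
    }


# Toy alignment rows, made up for illustration (not real beta-amylase data).
query = "MKT-AYIA"
hits = {"hit_a": "MKTQAYLA", "hit_b": "MRS-GYLV"}
selected = filter_by_identity(query, hits)
```

Here `hit_a` matches the query in six of seven comparable columns and is kept, while `hit_b` matches in only two of seven and is dropped.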
