biological sequence analysis Latest Research Papers

A Greedy Clustering Algorithm for Multiple Sequence Alignment

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001oa28 ◽

2021 ◽

Vol 15 (4) ◽

pp. 0-0

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Clustering Algorithm ◽

Optimization Procedure ◽

Search Space ◽

Divide And Conquer ◽

Biological Sequence ◽

Multiple Sequence ◽

Biological Sequence Analysis ◽

NP-Hard Problem

This paper presents a strategy to tackle the Multiple Sequence Alignment (MSA) problem, one of the most important tasks in biological sequence analysis. The goal of MSA is to align sequences in their entirety in order to derive relationships and common characteristics among a set of protein or nucleotide sequences. The MSA problem has been proven to be NP-hard. The proposed strategy builds on the well-known divide and conquer paradigm: it introduces a novel method of clustering sequences as a preliminary step to improve the final alignment. This decomposition can be used as an optimization procedure with any MSA aligner to explore promising alignments in the search space. The authors propose aligning the clusters in a parallel and distributed way in order to benefit from parallel architectures. The strategy was tested on classical benchmarks such as BAliBASE, Sabre, Prefab4 and Oxm, and the experimental results show that it performs well in comparison with other aligners.

A Greedy Clustering Algorithm for Multiple Sequence Alignment

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001.oa41 ◽

2021 ◽

Vol 15 (4) ◽

pp. 1-17

Author(s):

Rabah Lebsir ◽

Abdesslem Layeb ◽

Tahi Fariza

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Clustering Algorithm ◽

Optimization Procedure ◽

Search Space ◽

Divide And Conquer ◽

Biological Sequence ◽

Multiple Sequence ◽

Biological Sequence Analysis ◽

NP-Hard Problem

This paper presents a strategy to tackle the Multiple Sequence Alignment (MSA) problem, one of the most important tasks in biological sequence analysis. The goal of MSA is to align sequences in their entirety in order to derive relationships and common characteristics among a set of protein or nucleotide sequences. The MSA problem has been proven to be NP-hard. The proposed strategy builds on the well-known divide and conquer paradigm: it introduces a novel method of clustering sequences as a preliminary step to improve the final alignment. This decomposition can be used as an optimization procedure with any MSA aligner to explore promising alignments in the search space. The authors propose aligning the clusters in a parallel and distributed way in order to benefit from parallel architectures. The strategy was tested on classical benchmarks such as BAliBASE, Sabre, Prefab4 and Oxm, and the experimental results show that it performs well in comparison with other aligners.
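As a rough illustration of clustering sequences before alignment, here is a minimal greedy clustering sketch. It is not the authors' algorithm: the k-mer Jaccard similarity, the `threshold` value, and the function names are all assumptions chosen for illustration. Each resulting cluster could then be handed to any MSA aligner independently.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Greedy clustering (illustrative, hypothetical similarity measure).

    Sequences are visited longest first. Each one joins the first cluster
    whose representative is at least `threshold` similar; otherwise it
    becomes the representative of a new cluster.
    """
    clusters = []  # list of (representative k-mer set, member list)
    for s in sorted(seqs, key=len, reverse=True):
        ks = kmers(s, k)
        for rep, members in clusters:
            if jaccard(ks, rep) >= threshold:
                members.append(s)
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append((ks, [s]))
    return [members for _, members in clusters]


groups = greedy_cluster(["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"])
```

Because the clusters do not depend on one another, they can be aligned in parallel, which is the property the abstract relies on.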

Constructing benchmark test sets for biological sequence analysis using independent set algorithms

10.1101/2021.09.29.462285 ◽

2021 ◽

Author(s):

Samantha Petti ◽

Sean R Eddy

Keyword(s):

Sequence Analysis ◽

Sequence Data ◽

Independent Set ◽

Training Sequence ◽

Test Sequence ◽

Biological Sequence ◽

Biological Sequence Analysis ◽

Training Sequences ◽

Benchmark Datasets ◽

Test Sets

Statistical inference and machine learning methods are benchmarked on test data independent of the data used to train the method. Biological sequence families are highly non-independent because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
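The constraint described above (every test sequence is less than p% identical to every training sequence) can be sketched with a simple greedy pass. This is not either of the paper's two algorithms: the alternating assignment rule, the function names, and the assumption that sequences are already aligned to equal length are all illustrative choices.

```python
def percent_identity(a, b):
    """Percent of matching positions in two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_split(seqs, p):
    """Greedily split seqs so no test/train pair is >= p% identical.

    A sequence may join one side only if it is < p% identical to every
    sequence already on the other side. Sequences that fit neither side
    are discarded. Preference alternates to keep both sides populated.
    """
    train, test, discarded = [], [], []
    for i, s in enumerate(seqs):
        ok_train = all(percent_identity(s, t) < p for t in test)
        ok_test = all(percent_identity(s, t) < p for t in train)
        prefer_train = (i % 2 == 0)
        if prefer_train and ok_train:
            train.append(s)
        elif ok_test:
            test.append(s)
        elif ok_train:
            train.append(s)
        else:
            discarded.append(s)
    return train, test, discarded


train, test, discarded = greedy_split(["AAAA", "CCCC", "AAAT", "CCCG"], p=50)
```

Sequences within the same side may still be close to one another; only cross-set pairs are constrained in this sketch.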

BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models

Nucleic Acids Research ◽

10.1093/nar/gkab829 ◽

2021 ◽

Author(s):

Hong-Liang Li ◽

Yi-He Pang ◽

Bin Liu

Keyword(s):

Sequence Analysis ◽

Language Processing ◽

State Of The Art ◽

Sequence Data ◽

Language Models ◽

Biological Sequence ◽

Protein Sequence Analysis ◽

Processing Technologies ◽

Biological Sequence Analysis ◽

Important Field

To uncover the meaning of the ‘book of life’, this study discusses 155 different biological language models (BLMs) for DNA, RNA and protein sequence analysis, which are able to extract the linguistic properties of the ‘book of life’. We also extend the BLMs into a system called BioSeq-BLM for automatically representing and analyzing sequence data. Experimental results show that the predictors generated by BioSeq-BLM achieve comparable or even clearly better performance than the existing state-of-the-art predictors published in the literature, indicating that BioSeq-BLM will provide new approaches for biological sequence analysis based on natural language processing technologies and contribute to the development of this very important field. To help readers use BioSeq-BLM for their own experiments, the corresponding web server and stand-alone package have been established and released, and can be freely accessed at http://bliulab.net/BioSeq-BLM/.

SecProCT: In Silico Prediction of Human Secretory Proteins Based on Capsule Network and Transformer

International Journal of Molecular Sciences ◽

10.3390/ijms22169054 ◽

2021 ◽

Vol 22 (16) ◽

pp. 9054

Author(s):

Wei Du ◽

Xuan Zhao ◽

Yu Sun ◽

Lei Zheng ◽

Ying Li ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Amino Acid Sequences ◽

Secretory Proteins ◽

Biological Sequence ◽

Learning Methods ◽

Biological Sequence Analysis ◽

Proposed Model ◽

Conventional Machine ◽

Deep Learning Model

Identifying secretory proteins from blood, saliva or other body fluids has become an effective method of diagnosing diseases. Existing secretory protein prediction methods are mainly based on conventional machine learning algorithms and are highly dependent on the feature set from the protein. In this article, we propose a deep learning model based on the capsule network and transformer architecture, SecProCT, to predict secretory proteins using only amino acid sequences. The proposed model was validated using cross-validation and achieved 0.921 and 0.892 accuracy for predicting blood-secretory proteins and saliva-secretory proteins, respectively. Meanwhile, the proposed model was validated on an independent test set and achieved 0.917 and 0.905 accuracy for predicting blood-secretory proteins and saliva-secretory proteins, respectively, which are better than conventional machine learning methods and other deep learning methods for biological sequence analysis. The main contributions of this article are as follows: (1) a deep learning model based on a capsule network and transformer architecture is proposed for predicting secretory proteins. The results of this model are better than those of existing conventional machine learning methods and deep learning methods for biological sequence analysis; (2) only amino acid sequences are used in the proposed model, which overcomes the high dependence of existing methods on annotated protein features; (3) the proposed model can accurately predict most experimentally verified secretory proteins and cancer protein biomarkers in blood and saliva.

Representation learning applications in biological sequence analysis

Computational and Structural Biotechnology Journal ◽

10.1016/j.csbj.2021.05.039 ◽

2021 ◽

Author(s):

Hitoshi Iuchi ◽

Taro Matsutani ◽

Keisuke Yamada ◽

Natsuki Iwano ◽

Shunsuke Sumi ◽

...

Keyword(s):

Sequence Analysis ◽

Representation Learning ◽

Biological Sequence ◽

Biological Sequence Analysis

Representation learning applications in biological sequence analysis

10.1101/2021.02.26.433129 ◽

2021 ◽

Author(s):

Hitoshi Iuchi ◽

Taro Matsutani ◽

Keisuke Yamada ◽

Shunsuke Sumi ◽

Shion Hosoda ◽

...

Keyword(s):

Sequence Analysis ◽

Language Processing ◽

High Throughput Sequencing ◽

Probabilistic Models ◽

Representation Learning ◽

Biological Sequences ◽

Biological Sequence ◽

Structure Estimation ◽

Biological Sequence Analysis ◽

Essential Step

Remarkable advances in high-throughput sequencing have resulted in rapid data accumulation, and analyzing biological (DNA/RNA/protein) sequences to discover new insights in biology has become more critical and challenging. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention, because biological sequences are regarded as sentences and k-mers in these sequences as words. Embedding is an essential step in NLP, which converts words into vectors. This transformation is called representation learning and can be applied to biological sequences. Vectorized biological sequences can be used for function and structure estimation, or as inputs for other probabilistic models. Given the importance and growing trend in the application of representation learning in biology, here, we review the existing knowledge in representation learning for biological sequence analysis.
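To make the "k-mers as words" analogy concrete, the sketch below tokenizes a sequence into overlapping k-mers and builds a fixed-length count vector. This is only a count-based stand-in: it is not a learned embedding of the kind the review covers, and the function names and the DNA alphabet default are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_count_vector(seq, k=2, alphabet="ACGT"):
    """Count-based vector over all possible k-mers of the alphabet.

    The vocabulary is enumerated in lexicographic order, so every sequence
    maps to a vector of length len(alphabet) ** k.
    """
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(kmer_tokens(seq, k))
    return [counts[w] for w in vocab]


tokens = kmer_tokens("ACGTAC", k=3)
vector = kmer_count_vector("ACGT", k=2)
```

A learned representation would replace the raw counts with vectors trained on a large sequence corpus, but the tokenization step is the same.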

New Construction of Family of MLCS Algorithms

Journal of Healthcare Engineering ◽

10.1155/2021/6636710 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Haihe Shi ◽

Jun Wang

Keyword(s):

Sequence Analysis ◽

Text Processing ◽

Longest Common Subsequence ◽

Feature Model ◽

Generic Programming ◽

Biological Sequence ◽

Biological Sequence Analysis ◽

Character Sequences ◽

Common Subsequence ◽

Component Assembly

The multiple longest common subsequence (MLCS) problem involves finding all the longest common subsequences of multiple character sequences. This problem is encountered in a variety of areas, including data mining, text processing, and bioinformatics, and is particularly important for biological sequence analysis. Taking the MLCS problem and the algorithms for its solution as the research domain, this study analyzes the domain of multiple longest common subsequence algorithms, extracts the features that algorithms in the domain do and do not have in common, and creates a domain feature model for the MLCS problem by using generic programming, domain engineering, abstraction, and related technologies. A component library for the domain is designed based on the feature model for the MLCS problem, and the partition and recur (PAR) platform is used to ensure that highly reliable MLCS algorithms can be quickly assembled through component assembly. This study provides a valuable reference for obtaining rapid solutions to problems of biological sequence analysis and improves the reliability and flexibility of algorithm assembly.
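For orientation, the two-sequence special case of the problem can be solved with the classic dynamic programming recurrence. The sketch below returns one longest common subsequence of two strings; it does not enumerate all of them, does not handle more than two sequences, and is unrelated to the PAR platform components described in the abstract.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back from the bottom-right corner to recover one subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


result = lcs("AGGTAB", "GXTXAYB")
```

The table costs O(mn) time and space for two sequences; extending the same table to d sequences grows exponentially in d, which is why MLCS needs specialized algorithms.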

IEEE Access Special Section Editorial: Feature Representation and Learning Methods With Applications in Large-Scale Biological Sequence Analysis

IEEE Access ◽

10.1109/access.2021.3060612 ◽

2021 ◽

Vol 9 ◽

pp. 33110-33119

Author(s):

Feifei Cui ◽

Quan Zou ◽

Qin Ma ◽

Leyi Wei ◽

Jijun Tang ◽

...

Keyword(s):

Sequence Analysis ◽

Large Scale ◽

Special Section ◽

Feature Representation ◽

Biological Sequence ◽

Learning Methods ◽

Biological Sequence Analysis

Artificial Intelligence Techniques to Computational Proteomics, Genomics, and Biological Sequence Analysis

Current Protein and Peptide Science ◽

10.2174/138920372111201203091924 ◽

2020 ◽

Vol 21 (11) ◽

pp. 1042-1043

Author(s):

Wenzheng Bao

Keyword(s):

Artificial Intelligence ◽

Sequence Analysis ◽

Biological Sequence ◽

Computational Proteomics ◽

Artificial Intelligence Techniques ◽

Biological Sequence Analysis
