Constructing benchmark test sets for biological sequence analysis using independent set algorithms

Mapping Intimacies ◽

10.1101/2021.09.29.462285 ◽

2021 ◽

Author(s):

Samantha Petti ◽

Sean R Eddy

Keyword(s):

Sequence Analysis ◽

Sequence Data ◽

Independent Set ◽

Training Sequence ◽

Test Sequence ◽

Biological Sequence ◽

Biological Sequence Analysis ◽

Training Sequences ◽

Benchmark Datasets ◽

Test Sets

Statistical inference and machine learning methods are benchmarked on test data independent of the data used to train the method. Biological sequence families are highly non-independent because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
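The splitting constraint described in the abstract can be illustrated with a small sketch. This is not the authors' method: it is a simple greedy heuristic written for illustration, and the names `percent_identity` and `greedy_split`, the gap character `-`, and the assumption that sequences are already aligned to equal length are all hypothetical choices. Sequences are treated as graph vertices, with an edge between any pair that is at least p% identical, and the split is built so that no edge joins the training and test sets.

```python
from typing import Dict, List, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumes a and b are rows of the same alignment (equal length, '-' for gaps).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def greedy_split(seqs: Dict[str, str], p: float) -> Tuple[List[str], List[str]]:
    """Greedily split sequence names into (train, test).

    Invariant: no test sequence is >= p% identical to any training sequence.
    Sequences that are too similar to members of both sets are discarded.
    """
    train: List[str] = []
    test: List[str] = []
    for name, seq in seqs.items():
        near_train = any(percent_identity(seq, seqs[t]) >= p for t in train)
        near_test = any(percent_identity(seq, seqs[t]) >= p for t in test)
        if near_train and near_test:
            continue  # cannot be placed on either side without breaking the invariant
        if near_train:
            train.append(name)
        elif near_test:
            test.append(name)
        elif len(train) <= len(test):
            train.append(name)  # unconstrained: add to the smaller set
        else:
            test.append(name)
    return train, test


if __name__ == "__main__":
    family = {
        "a": "ACDEFGHIKL",
        "b": "ACDEFGHIKV",
        "c": "MNPQRSTVWY",
        "d": "MNPQRSTVWA",
    }
    train, test = greedy_split(family, p=50.0)
    print("train:", train)
    print("test:", test)
```

A greedy pass like this depends on input order and can discard many sequences; it only demonstrates the constraint that any valid split must satisfy.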


BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models

Nucleic Acids Research ◽

10.1093/nar/gkab829 ◽

2021 ◽

Author(s):

Hong-Liang Li ◽

Yi-He Pang ◽

Bin Liu

Keyword(s):

Sequence Analysis ◽

Language Processing ◽

State Of The Art ◽

Sequence Data ◽

Language Models ◽

Biological Sequence ◽

Protein Sequence Analysis ◽

Processing Technologies ◽

Biological Sequence Analysis ◽

Important Field

To uncover the meaning of the ‘book of life’, this study discusses 155 different biological language models (BLMs) for DNA, RNA and protein sequence analysis, which are able to extract the linguistic properties of the ‘book of life’. We also extend the BLMs into a system called BioSeq-BLM for automatically representing and analyzing sequence data. Experimental results show that the predictors generated by BioSeq-BLM achieve comparable or even clearly better performance than existing state-of-the-art predictors published in the literature, indicating that BioSeq-BLM will provide new approaches for biological sequence analysis based on natural language processing technologies and contribute to the development of this important field. To help readers use BioSeq-BLM for their own experiments, the corresponding web server and stand-alone package have been released and can be freely accessed at http://bliulab.net/BioSeq-BLM/.
