Semiotic thoughts on biological sequence representations

Author(s):  
Guillermo Restrepo

The deluge of biological sequences, ranging from proteins, DNA and RNA to whole genomes, has multiplied the models used to represent them, and these representations are in turn used to compare the sequences. Here we present a brief bibliometric description of the research area devoted to the representation of biological sequences and highlight the semiotic reach of this process. Finally, we argue that this area needs further work in line with the evolution of mathematical chemistry, and that its drawbacks must be overcome.

2020 ◽  
Author(s):  
Eli N. Weinstein ◽  
Debora S. Marks

Large-scale sequencing has revealed extraordinary diversity among biological sequences, produced over the course of evolution and within the lifetime of individual organisms. Existing methods for building statistical models of sequences often pre-process the data using multiple sequence alignment, an unreliable approach for many genetic elements (antibodies, disordered proteins, etc.) that is subject to fundamental statistical pathologies. Here we introduce a structured emission distribution (the MuE distribution) that accounts for mutational variability (substitutions and indels) and use it to construct generative and predictive hierarchical Bayesian models (H-MuE models). Our framework enables the application of arbitrary continuous-space vector models (e.g. linear regression, factor models, image neural networks) to unaligned sequence data. Theoretically, we show that the MuE generalizes classic probabilistic alignment models. Empirically, we show that H-MuE models can infer latent representations and features for immune repertoires, predict functional unobserved members of disordered protein families, and forecast the future evolution of pathogens.


Author(s):  
Ashesh Nandy

The exponential growth in the depositories of biological sequence data has generated an urgent need to store, retrieve and analyse the data efficiently and effectively, a need that the standard practice of using alignment procedures cannot meet because of its high demand on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies for analysing biological sequences: each basic unit of a sequence (the bases adenine, cytosine, guanine and thymine for DNA, with uracil in place of thymine for RNA, and the 20 amino acids for proteins) is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space, and the implied graph in higher dimensions, give a perception of the underlying information of the sequences through visual inspection; numerical analyses of the plots, in geometrical or matrix terms, provide a measure of comparison between sequences and thus enable the study of sequence hierarchies. The approach has also enabled comparisons of DNA sequences over many thousands of bases and provided new insights into the structure of the base composition of DNA sequences. In this article we briefly review the origins and applications of graphical representations and highlight future perspectives in this field.
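The plotting idea described above can be illustrated with a minimal sketch. The abstract does not state which base goes on which axis, so the assignment in `STEPS` below is an assumption made for illustration (other conventions exist), and the names `walk_2d` and `centroid` are hypothetical helpers, not part of any published tool.

```python
from typing import List, Tuple

# Assumed axis assignment for illustration only; other conventions exist.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}


def walk_2d(seq: str) -> List[Tuple[int, int]]:
    """Return the points of a 2D walk, starting at the origin, one unit step per base."""
    x, y = 0, 0
    points = [(x, y)]
    for base in seq.upper():
        if base not in STEPS:
            raise ValueError(f"unsupported base: {base!r}")
        dx, dy = STEPS[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def centroid(points: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Mean x and y of the walk: one simple numerical descriptor of the curve."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


if __name__ == "__main__":
    pts = walk_2d("ATGGCC")
    print(pts)
    print(centroid(pts))
```

Descriptors such as the centroid give one number pair per sequence, so two sequences can be compared by the distance between their descriptors without aligning them.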


2020 ◽  
Author(s):  
Chao Wei ◽  
Junying Zhang ◽  
Xiguo Yuan ◽  
Zongzhen He ◽  
Guojun Liu

Protein coding region prediction is a very important but overlooked subtask of problems such as the prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed those values into a classifier for the final prediction. However, the encoding scheme directly influences the classifier's ability to capture coding features, and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a protein coding region prediction method for transcript sequences based on a bidirectional recurrent neural network with non-overlapping k-mers, and achieved considerable improvement over existing methods, but there is still much room to improve performance. In fact, k-mer features that count the occurrence frequency of trinucleotides reflect only the local sequence order information between the most contiguous nucleotides, and lose almost all of the global sequence order information. In view of this, we present a deep learning framework with hybrid encoding for protein coding region prediction in biological sequences, which effectively exploits global sequence order information, non-overlapping k-mer features and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, our proposed method significantly outperforms existing state-of-the-art methods.
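As a rough illustration of the non-overlapping k-mer features mentioned above, the sketch below splits a nucleotide sequence into consecutive trinucleotides and turns them into a frequency vector. The function names, the `ACGT` alphabet and the choice to drop a trailing partial k-mer are assumptions for this example; the authors' actual encoding may differ.

```python
from collections import Counter
from itertools import product
from typing import List


def nonoverlapping_kmers(seq: str, k: int = 3) -> List[str]:
    """Split seq into consecutive, non-overlapping k-mers; a trailing partial k-mer is dropped."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def kmer_frequency_vector(seq: str, k: int = 3, alphabet: str = "ACGT") -> List[float]:
    """Relative frequency of every possible k-mer, in lexicographic order of the alphabet."""
    kmers = nonoverlapping_kmers(seq, k)
    counts = Counter(kmers)
    total = len(kmers)
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[km] / total if total else 0.0 for km in vocab]


if __name__ == "__main__":
    print(nonoverlapping_kmers("ATGGCCTA"))        # two full trinucleotides
    print(len(kmer_frequency_vector("ATGGCCTA")))  # 4**3 entries
```

Because the vector only counts how often each trinucleotide occurs, two sequences with the same k-mers in a different order get the same vector, which is the loss of global order information the abstract points out.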


2018 ◽  
Author(s):  
K S Naveenkumar ◽  
Babu R Mohammed Harun ◽  
R Vinayakumar ◽  
KP Soman

Protein classification is central to the analysis of biological sequences. We propose an approach to the classification of proteins using a deep learning algorithm. The algorithm focuses mainly on classifying sequences of protein vectors, which are used to represent the proteins. Selecting the type of protein representation is challenging, and the accuracy of the output depends on it. The protein representation used here is n-gram (specifically 3-gram) with a Keras embedding, as used for biological sequences such as proteins. In this paper we work on protein classification to show the strength of this representation of the biological sequences of proteins.
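The 3-gram representation can be sketched in a few lines. The abstract does not say whether the 3-grams overlap or how the vocabulary is built, so the sliding window, the reserved index 0 for padding and unknown 3-grams, and the helper names below are all assumptions for illustration. The resulting integer lists are the kind of input an embedding layer expects; no Keras call is shown here.

```python
from typing import Dict, List, Optional


def protein_ngrams(seq: str, n: int = 3) -> List[str]:
    """Overlapping n-grams of an amino-acid sequence (sliding window, step 1)."""
    seq = seq.upper()
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def build_vocab(sequences: List[str], n: int = 3) -> Dict[str, int]:
    """Assign integer ids to n-grams in order of first appearance; 0 is kept for padding/unknown."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for gram in protein_ngrams(seq, n):
            if gram not in vocab:
                vocab[gram] = len(vocab) + 1
    return vocab


def encode(seq: str, vocab: Dict[str, int], n: int = 3,
           max_len: Optional[int] = None) -> List[int]:
    """Map a sequence to n-gram ids, then truncate or zero-pad to max_len if given."""
    ids = [vocab.get(gram, 0) for gram in protein_ngrams(seq, n)]
    if max_len is not None:
        ids = ids[:max_len] + [0] * max(0, max_len - len(ids))
    return ids


if __name__ == "__main__":
    vocab = build_vocab(["MKVL", "KVLA"])
    print(vocab)
    print(encode("KVLA", vocab, max_len=4))
```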


Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 2090
Author(s):  
Yue Lu ◽  
Long Zhao ◽  
Zhao Li ◽  
Xiangjun Dong

Similarity analysis of DNA sequences can clarify the homology between sequences and predict the structure of, and relationship between, them. At the same time, the frequent patterns of biological sequences not only explain the genetic characteristics of the organism but also serve as relevant markers for certain events in biological sequences. However, most existing biological sequence similarity analysis methods target the entire sequential pattern and ignore missing gene fragments that may induce potential disease. The similarity analysis of sequences containing a missing gene item remains unexplored. Consequently, some sequences with missing bases are ignored or not effectively analyzed. This paper therefore presents a new method for DNA sequence similarity analysis. Using this method, we first mined not only positive sequential patterns but also sequential patterns that were missing some of the base terms (collectively referred to as negative sequential patterns). Subsequently, we used these frequent patterns for similarity analysis on a two-dimensional plane. Several experiments were conducted to verify the effectiveness of the algorithm. The experimental results demonstrated that the algorithm can obtain various results through the selection of frequent sequential patterns and that accuracy and time efficiency were improved.


2021 ◽  
pp. 1-18
Author(s):  
Hafiz Asadul Rehman ◽  
Kashif Zafar ◽  
Ayesha Khan ◽  
Abdullah Imtiaz

Discovering structural, functional and evolutionary information in biological sequences has been considered a core research area in bioinformatics. Multiple Sequence Alignment (MSA) tries to align all sequences in a given query set to ease the annotation of new sequences. Traditional methods for finding the optimal alignment are computationally expensive in real time. This research presents an enhanced version of the Bird Swarm Algorithm (BSA), based on bio-inspired optimization. The Enhanced Bird Swarm Align Algorithm (EBSAA) is proposed for the multiple sequence alignment problem to determine the optimal alignment among different sequences. Twenty-one different datasets have been used to compare the performance of EBSAA with the Genetic Algorithm (GA) and the Particle Swarm Align Algorithm (PSAA). The proposed technique results in better alignment than GA and PSAA in most cases.


2020 ◽  
Author(s):  
Robson P. Bonidia ◽  
Danilo S. Sanches ◽  
André C.P.L.F. de Carvalho

Machine learning algorithms have been very successfully applied to extract new and relevant knowledge from biological sequences. However, the predictive performance of these algorithms is largely affected by how the sequences are represented. Thus, the main challenge is how to represent a biological sequence as a numeric vector with an efficient mathematical expression. Several feature extraction techniques have been proposed for biological sequences, and most of them are available in feature extraction packages. However, there are relevant approaches that are not available in existing packages, namely techniques based on mathematical descriptors, e.g., Fourier, entropy, and graphs. Therefore, this paper presents a new package, named MathFeature, which implements mathematical descriptors able to extract relevant information from biological sequences. MathFeature provides 20 approaches based on several studies found in the literature, e.g., multiple numeric mappings, genomic signal processing, chaos game theory, entropy, and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. Availability and implementation: MathFeature is freely available at https://bonidia.github.io/MathFeature/ or https://github.com/Bonidia/[email protected], [email protected]
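To make the idea of an entropy-based descriptor concrete, here is a minimal sketch of Shannon entropy computed over the k-mer distribution of a sequence. It is written from the textbook definition and is not MathFeature's API; the function name `shannon_entropy` and its parameters are assumptions for this example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    # H = -sum(p * log2(p)) over the observed k-mers
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    for s in ("AAAAAAAA", "ACGTACGT", "ATGGCCTA"):
        print(s, round(shannon_entropy(s, k=1), 3), round(shannon_entropy(s, k=2), 3))
```

A single number like this can serve as one component of a numeric feature vector; computing it for several values of `k` yields a short vector per sequence.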


2021 ◽  
Author(s):  
Kaitlin M. Carey ◽  
Robert Hubley ◽  
George T. Lesica ◽  
Daniel Olson ◽  
Jack W. Roddy ◽  
...  

Annotation of a biological sequence is usually performed by aligning that sequence to a database of known sequence elements. When that database contains elements that are highly similar to each other, the proper annotation may be ambiguous, because several entries in the database produce high-scoring alignments. Typical annotation methods work by assigning a label based on the candidate annotation with the highest alignment score; this can overstate annotation certainty, mislabel boundaries, and fail to identify large-scale rearrangements or insertions within the annotated sequence. Here, we present a new software tool, PolyA, that adjudicates between competing alignment-based annotations by computing estimates of annotation confidence, identifying a trace with maximal confidence, and recursively splicing/stitching inserted elements. PolyA communicates annotation certainty, identifies large-scale rearrangements, and detects boundaries between neighboring elements.


2015 ◽  
Vol 1 (4) ◽  
pp. 404
Author(s):  
Nooruldeen Nasih Qader ◽  
Hussein K. Al-Khafaji

Biodata are rich in information. Knowing the properties of biological sequences can be valuable in analyzing data and drawing appropriate conclusions. This research applied a naturalistic methodology to investigate the structural properties of biological sequences (i.e., DNA). The research was carried out in the field of motif finding. Two new motif properties were discovered, named identical neighbors and adjacent neighbors. The analysis was done under different settings of background frequency and motif model, using distinct real data sets of varied size. The analysis demonstrated the strong existence of the properties. Exploiting these properties is a significant step towards developing powerful algorithms in molecular biology.


2019 ◽  
Vol 14 (4) ◽  
pp. 574-589
Author(s):  
Linyan Xue ◽  
Xiaoke Zhang ◽  
Fei Xie ◽  
Shuang Liu ◽  
Peng Lin

In bioinformatics applications, existing algorithms cannot directly and efficiently implement sequence pattern mining. Two fast and efficient biological sequence pattern mining algorithms, one for a single biological sequence and one for multiple sequences, are proposed in this paper. The concept of the basic pattern is proposed, and on the basis of mining frequent basic patterns, frequent patterns are mined by constructing prefix trees for the frequent basic patterns. The proposed algorithms implement rapid mining of frequent patterns of biological sequences based on pattern prefix trees. In the experiments, family sequence data from the Pfam protein database are used to verify the performance of the proposed algorithms. The results confirm that the proposed algorithms not only obtain mining results with effective biological significance, but also improve the running time efficiency of biological sequence pattern mining.

