The organization of biological sequences into constrained and unconstrained parts determines fundamental properties of genotype–phenotype maps

S. F. Greenbury; S. E. Ahnert

doi:10.1098/rsif.2015.0724

Journal of The Royal Society Interface, 2015, Vol 12 (113), pp. 20150724, doi:10.1098/rsif.2015.0724

Cited By ~ 17

Author(s): S. F. Greenbury, S. E. Ahnert

Keyword(s): Secondary Structure, RNA Secondary Structure, Shape Space, Biological Information, Biological Sequences, Protein Coding, Fundamental Properties, Phenotypic Robustness, Logarithmic Scaling, Biased Distribution

Biological information is stored in DNA, RNA and protein sequences, which can be understood as genotypes that are translated into phenotypes. The properties of genotype–phenotype (GP) maps have been studied in great detail for RNA secondary structure. These include a highly biased distribution of genotypes per phenotype, negative correlation of genotypic robustness and evolvability, positive correlation of phenotypic robustness and evolvability, shape-space covering, and a roughly logarithmic scaling of phenotypic robustness with phenotypic frequency. More recently, similar properties have been discovered in other GP maps, suggesting that they may be fundamental to biological GP maps in general, rather than specific to the RNA secondary structure map. Here we propose that the above properties arise from the fundamental organization of biological information into ‘constrained’ and ‘unconstrained’ sequences, in the broadest possible sense. We describe as ‘constrained’ those sequences that affect the phenotype more immediately and are therefore more sensitive to mutations, such as protein-coding DNA or the stems in RNA secondary structure. ‘Unconstrained’ sequences, on the other hand, can mutate more freely without affecting the phenotype, such as intronic or intergenic DNA or the loops in RNA secondary structure. To test our hypothesis, we consider a highly simplified GP map that has genotypes with ‘coding’ and ‘non-coding’ parts. We term this the Fibonacci GP map, as it is equivalent to the Fibonacci code in information theory. Despite its simplicity, the Fibonacci GP map exhibits all the above properties of much more complex and biologically realistic GP maps. These properties are therefore likely to be fundamental to many biological GP maps.
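To illustrate how a map with "coding" and "non-coding" parts can be explored, here is a minimal Python sketch. It is not the authors' implementation. It assumes binary genotypes in which the first occurrence of the motif `11` (the terminator used by the Fibonacci code) ends the coding part, so that every later position is non-coding; genotypes without the motif are assigned `None`. The names `STOP`, `phenotype`, `phenotype_frequencies` and `phenotypic_robustness` are hypothetical, and the paper's exact construction may differ.

```python
from collections import Counter
from itertools import product

STOP = "11"  # assumed stop motif, borrowed from the Fibonacci code


def phenotype(genotype):
    """Return the coding prefix up to and including the first stop motif.

    Genotypes that contain no stop motif map to None (assumption).
    """
    idx = genotype.find(STOP)
    return None if idx < 0 else genotype[: idx + len(STOP)]


def phenotype_frequencies(length):
    """Count how many binary genotypes of the given length map to each phenotype."""
    counts = Counter()
    for bits in product("01", repeat=length):
        counts[phenotype("".join(bits))] += 1
    return counts


def phenotypic_robustness(length):
    """Per phenotype, the mean fraction of one-point mutants that keep the phenotype."""
    neutral = Counter()
    total = Counter()
    for bits in product("01", repeat=length):
        genotype = "".join(bits)
        p = phenotype(genotype)
        for i in range(length):
            flipped = "1" if genotype[i] == "0" else "0"
            mutant = genotype[:i] + flipped + genotype[i + 1:]
            neutral[p] += phenotype(mutant) == p  # True counts as 1
        total[p] += length
    return {p: neutral[p] / total[p] for p in total}


if __name__ == "__main__":
    freqs = phenotype_frequencies(4)
    robust = phenotypic_robustness(4)
    for p, n in freqs.most_common():
        print(p, n, round(robust[p], 3))
```

Under these assumptions, for length-4 genotypes the phenotype `11` is realized by four genotypes and `0011` by only one. Mutations in the two trailing non-coding positions of `11` leave the phenotype unchanged, which makes the more frequent phenotype also the more robust one in this toy setting.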

RNA secondary structure as a reusable interface to biological information resources

Gene, 1997, Vol 190 (2), pp. GC59-GC70, doi:10.1016/s0378-1119(96)00855-4

Cited By ~ 7

Author(s): Ramon M. Felciano, Richard O. Chen, Russ B. Altman

Keyword(s): Secondary Structure, RNA Secondary Structure, Information Resources, Biological Information

Neutral components show a hierarchical community structure in the genotype–phenotype map of RNA secondary structure

Journal of The Royal Society Interface, 2020, Vol 17 (171), pp. 20200608, doi:10.1098/rsif.2020.0608

Author(s): Marcel Weiß, Sebastian E. Ahnert

Keyword(s): Community Structure, Secondary Structure, RNA Secondary Structure, Sampling Method, Point Mutations, Connected Components, Biological Sequences, Detection Algorithms, The Relationship, Sequence Constraints

Genotype–phenotype (GP) maps describe the relationship between biological sequences and structural or functional outcomes. They can be represented as networks in which genotypes are the nodes and one-point mutations between them are the edges. The genotypes that map to the same phenotype form subnetworks consisting of one or multiple disjoint connected components, the so-called neutral components (NCs). For the GP map of RNA secondary structure, the NCs have been found to exhibit distinctive network features that can affect the dynamical processes taking place on them. Here, we focus on the community structure of RNA secondary structure NCs. Building on previous findings, we introduce a method to reveal the hierarchical community structure solely from the sequence constraints and composition of the genotypes that form a given NC. Thereby, we obtain modularity values similar to those of common community detection algorithms, which are much more complex. From this knowledge, we endorse a sampling method that allows a fast exploration of the different communities of a given NC. Furthermore, we introduce a way to estimate the community structure from genotype samples, which is useful when an exhaustive analysis of the NC is not feasible, as is the case for longer sequence lengths.
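The network picture described in this abstract can be sketched in a few lines of Python. This is an illustrative breadth-first search under stated assumptions, not the community-detection or sampling method of the paper. It takes a set of genotypes that are already known to share a phenotype (no folding is performed here) and splits it into connected components under one-point mutations. The function names `one_point_mutants` and `neutral_components` and the default alphabet are hypothetical.

```python
from collections import deque


def one_point_mutants(genotype, alphabet):
    """Yield every sequence that differs from `genotype` at exactly one position."""
    for i, original in enumerate(genotype):
        for letter in alphabet:
            if letter != original:
                yield genotype[:i] + letter + genotype[i + 1:]


def neutral_components(neutral_set, alphabet="ACGU"):
    """Split a set of same-phenotype genotypes into connected components.

    Two genotypes are linked when they differ by a single point mutation.
    Returns a list of sets, one per neutral component.
    """
    unvisited = set(neutral_set)
    components = []
    while unvisited:
        seed = unvisited.pop()
        component = {seed}
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for mutant in one_point_mutants(current, alphabet):
                if mutant in unvisited:
                    unvisited.remove(mutant)
                    component.add(mutant)
                    queue.append(mutant)
        components.append(component)
    return components


if __name__ == "__main__":
    # Toy neutral set: AA and AU are one mutation apart, GG is isolated.
    for nc in neutral_components({"AA", "AU", "GG"}):
        print(sorted(nc))
```

Enumerating all mutants of each genotype costs time proportional to sequence length times alphabet size per node, so this exhaustive approach is only practical for short sequences, in line with the abstract's remark about longer sequence lengths.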

Research on RNA secondary structure predicting via bidirectional recurrent neural network

BMC Bioinformatics, 2021, Vol 22 (S3), doi:10.1186/s12859-021-04332-z

Author(s): Weizhong Lu, Yan Cao, Hongjie Wu, Yijie Ding, Zhengwei Song, ...

Keyword(s): Secondary Structure, Protein Sequence, RNA Secondary Structure, Secondary Structure Prediction, Weight Vector, Biological Information, Sequence Information, Local Optimum, Data Set, Before And After

Background: RNA secondary structure prediction is an important research topic in the field of biological information. Predicting RNA secondary structure with pseudoknots has been proven to be an NP-hard problem. Because of constraints in their own model structure, traditional machine learning methods cannot effectively use protein sequence information of varying lengths when predicting RNA secondary structure. In addition, the numbers of paired and unpaired bases in RNA sequences differ greatly, and this imbalance between positive and negative samples can easily cause the model to fall into a local optimum. To solve these problems, this paper proposes a variable-length dynamic bidirectional Gated Recurrent Unit (VLDB GRU) model. The model can accept sequences of different lengths through the introduction of a flag vector. It can also make full use of the base information before and after the predicted base, and avoids losing part of the information through truncation. A weight vector that dynamically adjusts the loss function of each base during prediction on the RNA training set solves the problem of sample imbalance.

Results: The proposed algorithm is compared with existing algorithms on five representative subsets of the RNA STRAND data set. The experimental results show that the accuracy and Matthews correlation coefficient of the method are improved by 4.7% and 11.4%, respectively.

Conclusions: The flag vector allows the model to use the information before and after the protein sequence effectively; the weight vector solves the problem of sample imbalance. Compared with other algorithms, the proposed VLDB GRU algorithm has the best detection results.
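The two ideas named in this abstract, a flag vector for variable-length input and a weight vector for imbalanced labels, can be illustrated with a small pure-Python loss function. This is a sketch under assumptions and not the VLDB GRU model: it treats the flag vector as a 0/1 mask over padded positions and uses fixed class weights, whereas the paper describes weights that are adjusted dynamically. The function name `masked_weighted_bce` and its parameters are hypothetical.

```python
import math


def masked_weighted_bce(probs, labels, flags, pos_weight, neg_weight=1.0):
    """Weighted binary cross-entropy averaged over flagged positions only.

    probs  : predicted probability that each base is paired
    labels : 1 for a paired base, 0 for an unpaired base
    flags  : 1 for a real base, 0 for padding (assumed meaning of the flag vector)
    pos_weight, neg_weight : assumed per-class weights for the loss
    """
    total = 0.0
    norm = 0.0
    for p, y, flag in zip(probs, labels, flags):
        if not flag:
            continue  # padded positions contribute nothing to the loss
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        weight = pos_weight if y == 1 else neg_weight
        total += -weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        norm += weight
    return total / norm if norm else 0.0


if __name__ == "__main__":
    # The third position is padding, so its prediction does not affect the result.
    loss = masked_weighted_bce([0.5, 0.5, 0.9], [1, 0, 1], [1, 1, 0], pos_weight=3.0)
    print(loss)
```

Normalizing by the sum of the weights keeps the loss on a comparable scale across sequences of different lengths and label compositions; other normalizations are equally plausible.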
