ICOR: Improving codon optimization with recurrent neural networks

2021 ◽  
Author(s):  
Rishab Jain ◽  
Aditya Jain ◽  
Elizabeth Mauro ◽  
Kevin LeShane ◽  
Douglas Densmore

In protein sequences—as there are 61 sense codons but only 20 standard amino acids—most amino acids are encoded by more than one codon. Although such synonymous codons do not alter the encoded amino acid sequence, their selection can dramatically affect the expression of the resulting protein. Codon optimization of synthetic DNA sequences is important for heterologous expression. However, existing solutions are primarily based on choosing high-frequency codons only, neglecting the important effects of rare codons. In this paper, we propose a novel recurrent-neural-network based codon optimization tool, ICOR, that aims to learn codon usage bias on a genomic dataset of Escherichia coli. We compile a dataset of over 7,000 non-redundant, high-expression, robust genes which are used for deep learning. The model uses a bidirectional long short-term memory-based architecture, allowing for the sequential context of codon usage in genes to be learned. Our tool can predict synonymous codons for synthetic genes toward optimal expression in Escherichia coli. We demonstrate that sequential context achieved via RNN may yield codon selection that is more similar to the host genome, therefore improving protein expression more than frequency-based approaches. ICOR is evaluated on 1,481 Escherichia coli genes as well as a benchmark set of 40 select DNA sequences whose heterologous expression has been previously characterized. ICOR's performance across five metrics is compared to that of five different codon optimization techniques. The codon adaptation index -- a metric indicative of high real-world expression -- was utilized as the primary benchmark in this study. ICOR is shown to improve the codon adaptation index by 41.69% and 17.25% compared to the original and Genscript's GenSmart-optimized sequences, respectively. Our tool is provided as an open-source software package that includes the benchmark set of sequences used in this study.
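The codon adaptation index used as the primary benchmark above is conventionally defined as the geometric mean of per-codon relative adaptiveness weights, where each weight is a codon's count in a reference gene set divided by the count of the most used synonymous codon. The following is a minimal sketch of that definition, not ICOR's own code: the two-family codon table, the toy reference set, and the 0.5 pseudocount for unseen codons are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy synonymous families (an assumption of this sketch); a real tool would
# cover the full genetic code.
FAMILIES = {"K": ["AAA", "AAG"], "E": ["GAA", "GAG"]}


def split_codons(seq):
    """Split a coding sequence into whole codons, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_genes):
    """Weight = codon count / count of the most used synonymous codon."""
    counts = Counter(c for g in reference_genes for c in split_codons(g))
    weights = {}
    for codons in FAMILIES.values():
        top = max(counts[c] for c in codons)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in codons:
            # Assumed pseudocount of 0.5 for unseen codons keeps log() finite.
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the weights of all scorable codons in seq."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference set: AAA and GAA are used twice, AAG and GAG once.
reference = ["".join(["AAA", "AAA", "AAG", "GAA", "GAA", "GAG"])]
weights = relative_adaptiveness(reference)
score_preferred = cai("AAAGAA", weights)  # only preferred codons
score_rare = cai("AAGGAG", weights)       # only less-used codons
```

A sequence built only from the preferred codons scores higher than one built from the less-used synonyms, which is the property that frequency-based optimizers exploit.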

Parasitology ◽  
2004 ◽  
Vol 128 (3) ◽  
pp. 245-251 ◽  
Author(s):  
L. PEIXOTO ◽  
V. FERNÁNDEZ ◽  
H. MUSTO

The usage of alternative synonymous codons in the completely sequenced, extremely A+T-rich parasite Plasmodium falciparum was studied. Confirming previous studies obtained with less than 3% of the total genes recently described, we found that A- and U-ending triplets predominate but translational selection increases the frequency of a subset of codons in highly expressed genes. However, some new results come from the analysis of the complete sequence. First, there is more variation in GC3 than previously described; second, the effect of natural selection acting at the level of translation has been analysed with real expression data at 4 different stages; and third, we found that highly expressed proteins increment the frequency of energetically less expensive amino acids. The implications of these results are discussed.
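GC3, mentioned above, is the fraction of codons whose third position is G or C. A minimal sketch of that calculation follows; the function name `gc3` and the example sequences are illustrative and not taken from the study.

```python
def gc3(seq):
    """Fraction of whole codons whose third base is G or C.

    Accepts DNA or RNA; trailing bases that do not form a codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    thirds = [seq[i + 2] for i in range(0, usable, 3)]
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


at_rich = gc3("AAATTTGGC")  # third bases: A, T, C
gc_rich = gc3("AUGGCC")     # third bases: G, C
```

In an A+T-rich genome such as the one described, most genes would be expected to give low values, so variation in this statistic between genes is the quantity of interest.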


2018 ◽  
Vol 115 (21) ◽  
pp. E4940-E4949 ◽  
Author(s):  
Idan Frumkin ◽  
Marc J. Lajoie ◽  
Christopher J. Gregg ◽  
Gil Hornung ◽  
George M. Church ◽  
...  

Although the genetic code is redundant, synonymous codons for the same amino acid are not used with equal frequencies in genomes, a phenomenon termed “codon usage bias.” Previous studies have demonstrated that synonymous changes in a coding sequence can exert significant cis effects on the gene’s expression level. However, whether the codon composition of a gene can also affect the translation efficiency of other genes has not been thoroughly explored. To study how codon usage bias influences the cellular economy of translation, we massively converted abundant codons to their rare synonymous counterpart in several highly expressed genes in Escherichia coli. This perturbation reduces both the cellular fitness and the translation efficiency of genes that have high initiation rates and are naturally enriched with the manipulated codon, in agreement with theoretical predictions. Interestingly, we could alleviate the observed phenotypes by increasing the supply of the tRNA for the highly demanded codon, thus demonstrating that the codon usage of highly expressed genes was selected in evolution to maintain the efficiency of global protein translation.


Genetics ◽  
1994 ◽  
Vol 138 (1) ◽  
pp. 227-234 ◽  
Author(s):  
D L Hartl ◽  
E N Moriyama ◽  
S A Sawyer

The patterns of nonrandom usage of synonymous codons (codon bias) in enteric bacteria were analyzed. Poisson random field (PRF) theory was used to derive the expected distribution of frequencies of nucleotides differing from the ancestral state at aligned sites in a set of DNA sequences. This distribution was applied to synonymous nucleotide polymorphisms and amino acid polymorphisms in the gnd and putP genes of Escherichia coli. For the gnd gene, the average intensity of selection against disfavored synonymous codons was estimated as approximately 7.3 × 10^-9; this value is significantly smaller than the estimated selection intensity against selectively disfavored amino acids in observed polymorphisms (2.0 × 10^-8), but it is approximately of the same order of magnitude. The selection coefficients for optimal synonymous codons estimated from PRF theory were consistent with independent estimates based on codon usage for threonine and glycine. Across 118 genes in E. coli and Salmonella typhimurium, the distribution of estimated selection coefficients, expressed as multiples of the effective population size, has a mean and standard deviation of 0.5 ± 0.4. No significant differences were found in the degree of codon bias between conserved positions and replacement positions, suggesting that translational misincorporation is not an important selective constraint among synonymous polymorphic codons in enteric bacteria. However, across the first 100 codons of the genes, conserved amino acids with identical codons have significantly greater codon bias than that of either synonymous or nonidentical codons, suggesting that there are unique selective constraints, perhaps including mRNA secondary structures, in this part of the coding region.


2019 ◽  
Vol 8 (1) ◽  
pp. 15-23
Author(s):  
Takashi Nakamura ◽  
Emi Takeda ◽  
Tomoko Kiryu ◽  
Kentaro Mori ◽  
Miyu Ohori ◽  
...  

Background: O-phospho-L-serine sulfhydrylase from the hyperthermophilic archaeon Aeropyrum pernix K1 (ApOPSS) is thermostable and tolerant to organic solvents. It can produce non-natural amino acids in addition to L-cysteine. Objective: We aimed to obtain higher amounts of ApOPSS compared to those reported with previous methods for the convenience of research and for industrial production of L-cysteine and non-natural amino acids. Method: We performed codon optimization of cysO, which encodes ApOPSS, for optimal expression in Escherichia coli. We then examined combinations of conditions such as the host strain, plasmid, culture medium, and isopropyl β-D-1-thiogalactopyranoside (IPTG) concentration to improve ApOPSS yield. Results and Discussion: E. coli strain Rosetta (DE3) harboring the expression plasmid pQE-80L with the codon-optimized cysO was cultured in Terrific broth with 0.01 mM IPTG at 37°C for 48 h to yield a 10-fold higher amount of purified ApOPSS (650 mg·L⁻¹) compared to that obtained by the conventional method (64 mg·L⁻¹). We found that the optimal culture conditions along with codon optimization were essential for the increased ApOPSS production. The expressed ApOPSS had a 6-histidine tag at the N-terminus, which did not affect its activity. This method may facilitate the industrial production of cysteine and non-natural amino acids using ApOPSS. Conclusion: We conclude that these results could be used in applied research on enzymatic production of L-cysteine in E. coli, large-scale production of non-natural amino acids, enzymatic reactions in organic solvent, and environmental remediation by sulfur removal.


2010 ◽  
Vol 6 ◽  
pp. EBO.S4608 ◽  
Author(s):  
Soohyun Lee ◽  
Seyeon Weon ◽  
Sooncheol Lee ◽  
Changwon Kang

2016 ◽  
Author(s):  
Bohdan B. Khomtchouk ◽  
Claes Wahlestedt ◽  
Wolfgang Nonner

Codon usage in 2730 genomes is analyzed for evolutionary patterns in the usage of synonymous codons and amino acids across prokaryotic and eukaryotic taxa. We group genomes together that have similar amounts of intra-genomic bias in their codon usage, and then compare how usage of particular different codons is diversified across each genome group, and how that usage varies from group to group. Inter-genomic diversity of codon usage increases with intra-genomic usage bias, following a universal pattern. The frequencies of the different codons vary in robust mutual correlation, and the implied synonymous codon and amino acid usages drift together. This kind of correlation indicates that the variation of codon usage across organisms is chiefly a consequence of lateral DNA transfer among diverse organisms. The group of genomes with the greatest intra-genomic bias comprises two distinct subgroups, with each one restricting its codon usage to essentially one unique half of the genetic code table. These organisms include eubacteria and archaea thought to be closest to the hypothesized last universal common ancestor (LUCA). Their codon usages imply genetic diversity near the hypothesized base of the tree of life. There is a continuous evolutionary progression across taxa from the two extremely diversified usages toward balanced usage of different codons (as approached, e.g., in mammals). In that progression, codon frequency variations are correlated as expected from a blending of the two extreme codon usages seen in prokaryotes.
Author summary: The redundancy intrinsic to the genetic code allows different amino acids to be encoded by up to six synonymous codons. Genomes of different organisms prefer different synonymous codons, a phenomenon known as ‘codon usage bias.’ The phenomenon of codon usage bias is of fundamental interest for evolutionary biology, and is important in a variety of applied settings (e.g., transgene expression).
The spectrum of codon usage biases seen in current organisms is commonly thought to have arisen by the combined actions of mutations and selective pressures. This view focuses on codon usage in specific genomes and the consequences of that usage for protein expression. Here we investigate an unresolved question of molecular genetics: are there global rules governing the usage of synonymous codons made by genomic DNA across organisms? To answer this question, we employed a data-driven approach to surveying 2730 species from all kingdoms of the ‘tree of life’ in order to classify their codon usage. A first major result was that the large majority of these organisms use codons rather uniformly on the genome-wide scale, without giving preference to particular codons among possible synonymous alternatives. A second major result was that two compartments of codon usage seem to co-exist and to be expressed in different proportions by different organisms. As such, we investigate how individual different codons are used in different organisms from all taxa. Whereas codon usage is generally believed to be the evolutionary result of both mutations and natural selection, our results suggest a different perspective: the usage of different codons (and amino acids) by different organisms follows a superposition of two distinct patterns of usage. One distinction locates to the third base pair of all different codons, which in one pattern is U or A, and in the other pattern is G or C.
This result has two major implications: (1) the variation of codon usage as seen across different organisms is best accounted for by lateral gene transfer among diverse organisms; (2) the organisms that are by protein homology grouped near the base of the ‘tree of life’ comprise two genetically distinct lineages. We find that, over evolutionary time, codon usages have converged from two distinct, non-overlapping usages (e.g., as evident in bacteria and archaea) to a near-uniform, balanced usage of synonymous codons (e.g., in mammals). This shows that the variations of codon (and amino acid) biases reveal a distinct evolutionary progression. We also find that codon usage in bacteria and archaea is most diverse between organisms thought to be closest to the hypothesized last universal common ancestor (LUCA). The dichotomy in codon (and amino acid) usages present near the origin of the current ‘tree of life’ might provide information about the evolutionary development of the genetic code.


2014 ◽  
Author(s):  
Hamzeh Alipour ◽  
Abbasali Raz ◽  
Navid Dinparast Djadid ◽  
Abbas Rami ◽  
Seyed Mohammad Amin Mahdian

A given amino acid sequence can be encoded by a huge number of different nucleic acid sequences. These sequences, however, prove not to be equally useful. The choice of sequence can significantly impact the expression of an encoded protein. As regards the importance of protein-coding sequence and promising industrial and medicinal applications of Clostridium histolyticum collagenase, this study examined the codon optimization of the Col H gene so as to enhance collagenase expression in Escherichia coli (E. coli). The coding region of mature Col H gene was optimized according to the codon usage of E. coli using Gene Designer software (DNA 2.0). The results revealed that relative frequency of codon usage in Col H gene was adapted to the most preferred triplets in E. coli in such a way that codon usage bias in E. coli was enhanced after codon optimization. Similarly, the higher level of collagenase expression was more likely the result of substituting rare codons with optimal codons. As has been reported elsewhere, the findings from this study suggest that codon optimization provides a theoretical improvement in Col H gene expression in E. coli. In spite of that, experimental research is needed to confirm the improvement.
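Substituting rare codons with the host's most preferred synonymous triplets, as described above, can be sketched as a table lookup. This is not the algorithm of the Gene Designer software, which the abstract does not describe; the `HOST_USAGE` table below holds hypothetical per-thousand frequencies for three amino acids only, chosen for illustration rather than measured from E. coli.

```python
# Hypothetical per-thousand codon frequencies for a host organism.
# Illustrative values only; a real table would cover all amino acids.
HOST_USAGE = {
    "L": {"CTG": 50.0, "CTA": 4.0, "TTA": 14.0},
    "R": {"CGT": 21.0, "AGA": 3.0, "AGG": 2.0},
    "M": {"ATG": 27.0},
}

# Reverse lookup from codon to the amino acid it encodes.
CODON_TO_AA = {codon: aa for aa, table in HOST_USAGE.items() for codon in table}


def optimize(seq):
    """Replace each known codon with the host's most frequent synonym.

    Codons missing from the toy table are left unchanged.
    """
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3].upper()
        aa = CODON_TO_AA.get(codon)
        if aa is None:
            out.append(codon)
        else:
            table = HOST_USAGE[aa]
            out.append(max(table, key=table.get))
    return "".join(out)


optimized = optimize("ATGCTAAGA")  # Met, Leu (rare), Arg (rare)
```

The encoded amino acids are unchanged by construction, since each codon is only ever swapped for a synonym from the same family.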


Viruses ◽  
2021 ◽  
Vol 13 (7) ◽  
pp. 1215
Author(s):  
Hasan Arsın ◽  
Andrius Jasilionis ◽  
Håkon Dahle ◽  
Ruth-Anne Sandaa ◽  
Runar Stokke ◽  
...  

Marine viral sequence space is immense and presents a promising resource for the discovery of new enzymes interesting for research and biotechnology. However, bottlenecks in the functional annotation of viral genes and soluble heterologous production of proteins hinder access to downstream characterization, subsequently impeding the discovery process. While commonly utilized for the heterologous expression of prokaryotic genes, codon adjustment approaches have not been fully explored for viral genes. Herein, the sequence-based identification of a putative prophage is reported from within the genome of Hypnocyclicus thermotrophus, a Gram-negative, moderately thermophilic bacterium isolated from the Seven Sisters hydrothermal vent field. A prophage-associated gene cluster, consisting of 46 protein coding genes, was identified and given the proposed name Hypnocyclicus thermotrophus phage H1 (HTH1). HTH1 was taxonomically assigned to the viral family Siphoviridae, by lowest common ancestor analysis of its genome and phylogeny analyses based on proteins predicted as holin and DNA polymerase. The gene neighbourhood around the HTH1 lytic cassette was found most similar to viruses infecting Gram-positive bacteria. In the HTH1 lytic cassette, an N-acetylmuramoyl-L-alanine amidase (Amidase_2) with a peptidoglycan binding motif (LysM) was identified. A total of nine genes coding for enzymes putatively related to lysis, nucleic acid modification and of unknown function were subjected to heterologous expression in Escherichia coli. Codon optimization and codon harmonization approaches were applied in parallel to compare their effects on produced proteins. Comparison of protein yields and thermostability demonstrated that codon optimization yielded higher levels of soluble protein, but codon harmonization led to proteins with higher thermostability, implying a higher folding quality. 
Altogether, our study suggests that both codon optimization and codon harmonization are valuable approaches for successful heterologous expression of viral genes in E. coli, but codon harmonization may be preferable in obtaining recombinant viral proteins of higher folding quality.
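The difference between the two approaches compared above can be illustrated with a simplified rank-matching sketch. Codon optimization picks the host's most used codon everywhere, while codon harmonization is assumed here to map each source codon to the host codon holding the same usage rank, so rare positions stay rare. The usage tables are hypothetical, and rank matching is a simplification of published harmonization procedures, not the method used in the study.

```python
# Hypothetical leucine codon usage (per thousand) in the native organism
# and in the expression host. Illustrative values only.
SOURCE_USAGE = {"L": {"TTA": 30.0, "CTT": 12.0, "CTG": 3.0}}
HOST_USAGE = {"L": {"CTG": 50.0, "TTA": 14.0, "CTT": 11.0}}


def ranked(table):
    """Codons ordered from most to least used."""
    return sorted(table, key=table.get, reverse=True)


def optimize_codon(aa):
    """Optimization: always take the host's most used codon."""
    return ranked(HOST_USAGE[aa])[0]


def harmonize_codon(codon, aa):
    """Harmonization (rank-matching simplification): keep the usage rank."""
    rank = ranked(SOURCE_USAGE[aa]).index(codon)
    host_ranked = ranked(HOST_USAGE[aa])
    # Clamp in case the host table has fewer synonyms than the source table.
    return host_ranked[min(rank, len(host_ranked) - 1)]


rare_harmonized = harmonize_codon("CTG", "L")  # rare in source
rare_optimized = optimize_codon("L")
```

Here the codon that is rare in the source organism is replaced by a rare host codon under harmonization but by the most common host codon under optimization, which is the contrast the study evaluates experimentally.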


Author(s):  
Darja Kanduc

Infectious diseases pose two main compelling issues. First, the identification of the molecular factors that allow chronic infections, that is, the often completely asymptomatic coexistence of infectious agents with the human host. Second, the definition of the mechanisms that allow the switch from pathogen dormancy to pathologic (re)activation. Furthering previous studies, the present work (1) analyzes the frequency of occurrence of synonymous codons in coding DNA, that is, codon usage, as a genetic tool that rules protein expression; (2) describes how human codon usage can inhibit protein expression of infectious agents during latency, so that pathogen genes the codon usage of which does not conform to the human codon usage cannot be translated; and (3) frames human codon usage among the front-line instruments of the innate immunity against infections. In parallel, it is shown that, while genetics can account for the molecular basis of pathogen latency, the changes of the quantitative relationship between codon frequencies and isoaccepting tRNAs during cell proliferation offer a biochemical mechanism that explains the pathogen switching to (re)activation. Immunologically, this study warns that using codon optimization methodologies can (re)activate, potentiate, and immortalize otherwise quiescent, asymptomatic pathogens, thus leading to uncontrollable pandemics.

