THE CONTRIBUTION OF STOP CODON FREQUENCY AND PURINE BIAS TO THE CLASSIFICATION OF CODING SEQUENCES

In this report, we revisit simple features that distinguish coding sequences (CDS) from non-coding DNA. The codon usage spectrum of our sequence sample is broad, which suggests that these features are universal. The features that we investigated combine (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of cytosine, guanine, and adenine probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, and (iv) the product of G and C probabilities in the 1st and 2nd positions of triplets. These features are a natural consequence of the physico-chemical properties of proteins, and their combination classifies CDS and non-coding DNA (introns) with a success rate >95% for sequences longer than 350 bp. The coding strand and coding frame are implicitly deduced when a sequence is classified as coding.
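
Features (i) to (iii) above can be illustrated with a minimal Python sketch. It assumes the standard genetic code (TAA, TAG, TGA as stop codons) and estimates each probability as a simple per-position frequency within one reading frame; the function names are hypothetical and this is not the authors' scoring procedure. Feature (iv) is left out because its exact definition is not clear from the abstract alone.

```python
from collections import Counter

# Assumption: standard genetic code stop codons.
STOPS = {"TAA", "TAG", "TGA"}


def codons(seq, frame):
    """Split seq into complete triplets starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def frame_features(seq, frame=0):
    """Return stop codon frequency and two position-specific probability products."""
    trip = codons(seq, frame)
    n = len(trip)
    if n == 0:
        raise ValueError("sequence too short for this frame")

    # One base counter per codon position (1st, 2nd, 3rd).
    pos = [Counter(c[k] for c in trip) for k in range(3)]

    def p(k, bases):
        # Frequency of any of `bases` at codon position k.
        return sum(pos[k][b] for b in bases) / n

    return {
        "stop_freq": sum(c in STOPS for c in trip) / n,            # feature (i)
        "purine_product": p(0, "AG") * p(1, "AG") * p(2, "AG"),    # feature (ii)
        "cga_product": p(0, "C") * p(1, "G") * p(2, "A"),          # feature (iii)
    }
```

Calling `frame_features(seq, frame)` for frames 0, 1 and 2 gives one feature set per frame, which is the kind of input a frame-aware classifier would compare.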

SPInDel Analysis of the Non-Coding Regions of cpDNA as a More Useful Tool for the Identification of Rye (Poaceae: Secale) Species

International Journal of Molecular Sciences ◽

10.3390/ijms21249421 ◽

2020 ◽

Vol 21 (24) ◽

pp. 9421

Author(s):

Lidia Skuza ◽

Ewa Filip ◽

Izabela Szućko ◽

Jan Bocianowski

Keyword(s):

Mitochondrial Dna ◽

Molecular Markers ◽

Related Species ◽

Economic Importance ◽

Closely Related Species ◽

Coding Sequences ◽

Coding Regions ◽

Genomic Regions ◽

Tribe Triticeae

Secale is a small but very diverse genus from the tribe Triticeae (family Poaceae), which includes annual, perennial, self-pollinating and open-pollinating, cultivated, weedy and wild species of various phenotypes. Despite its high economic importance, the classification of this genus, which comprises 3–8 species, is inconsistent. This has significantly slowed progress in the breeding of rye, which could be enriched with functional traits derived from wild rye species. Our previous research has suggested the utility of non-coding sequences of chloroplast and mitochondrial DNA in studies on closely related species of the genus Secale. Here we applied the SPInDel (Species Identification by Insertions/Deletions) approach, which targets hypervariable genomic regions containing multiple insertions/deletions (indels) and exhibiting extensive length variability. We analysed a total of 140 and 210 non-coding sequences from cpDNA and mtDNA, respectively. The resulting data highlight regions which may represent useful molecular markers for closely related species of the genus Secale; however, we found the chloroplast genome to be more informative. These molecular markers include the non-coding regions atpB-rbcL and trnT-trnL of chloroplast DNA and the non-coding regions nad1B-nad1C and rrn5/rrn18 of mitochondrial DNA. Our results demonstrate the utility of the SPInDel concept for the characterisation of Secale species.

Mitochondrial genome sequencing and phylogeny of Haemagogus albomaculatus, Haemagogus leucocelaenus, Haemagogus spegazzinii, and Haemagogus tropicalis (Diptera: Culicidae)

Scientific Reports ◽

10.1038/s41598-020-73790-x ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Fábio Silva da Silva ◽

Ana Cecília Ribeiro Cruz ◽

Daniele Barbosa de Almeida Medeiros ◽

Sandro Patroca da Silva ◽

Márcio Roberto Teixeira Nunes ◽

...

Keyword(s):

Evolutionary Biology ◽

Yellow Fever Virus ◽

Stop Codon ◽

Average Length ◽

Molecular Taxonomy ◽

Purifying Selection ◽

Start Codon ◽

Morphological Aspects ◽

Mitochondrial Sequences

The genus Haemagogus (Diptera: Culicidae) comprises species of great epidemiological relevance, involved in transmission cycles of the Yellow fever virus and other arboviruses in South America. So far, only Haemagogus janthinomys has complete mitochondrial sequences available. Given the lack of information on the evolutionary biology and molecular taxonomy of this genus, we report here the first sequencing of the mitogenomes of Haemagogus albomaculatus, Haemagogus leucocelaenus, Haemagogus spegazzinii, and Haemagogus tropicalis. The mitogenomes showed an average length of 15,038 bp, an average AT content of 79.3%, positive AT-skews, negative GC-skews, and comprised 37 functional subunits (13 PCGs, 22 tRNAs, and 2 rRNAs). The PCGs showed ATN as start codon, TAA as stop codon, and signs of purifying selection. The tRNAs had the typical cloverleaf structure, except tRNASer1. Phylogenetic analyses by Bayesian inference and Maximum Likelihood, based on concatenated sequences from all 13 PCGs, produced identical topologies, strongly supported the monophyletic relationship between the Haemagogus and Conopostegus subgenera, and corroborated the known taxonomic classification of the evaluated taxa, which is based on external morphological aspects. The information produced on the mitogenomes of the Haemagogus species evaluated here may be useful in carrying out future taxonomic and evolutionary studies of the genus.
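
The composition statistics reported above (AT content, AT-skew, GC-skew) can be computed as in the following minimal sketch. It uses the conventional definitions AT-skew = (A − T)/(A + T) and GC-skew = (G − C)/(G + C); the function name is hypothetical, and ambiguous bases such as N are simply ignored, which is an assumption rather than a detail taken from the study.

```python
def composition_stats(seq):
    """Return AT content, AT-skew and GC-skew for a nucleotide sequence."""
    s = seq.upper()
    a, t, g, c = (s.count(b) for b in "ATGC")
    total = a + t + g + c  # ambiguous symbols (e.g. N) are not counted
    if total == 0 or a + t == 0 or g + c == 0:
        raise ValueError("sequence lacks the bases needed for these ratios")
    return {
        "at_content": (a + t) / total,
        "at_skew": (a - t) / (a + t),   # positive when A outnumbers T
        "gc_skew": (g - c) / (g + c),   # negative when C outnumbers G
    }
```

A positive `at_skew` together with a negative `gc_skew` on the analysed strand corresponds to the pattern described for the four mitogenomes.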

Classifying Coding DNA with Nucleotide Statistics

Bioinformatics and Biology Insights ◽

10.4137/bbi.s3030 ◽

2009 ◽

Vol 3 ◽

pp. BBI.S3030 ◽

Cited By ~ 5

Author(s):

Nicolas Carels ◽

Diego Frías

Keyword(s):

Success Rate ◽

Homo Sapiens ◽

Stop Codon ◽

False Positive Rate ◽

Regression Line ◽

Compositional Bias ◽

Coding Sequences ◽

Automatic Translation ◽

Positive Rate ◽

Universal Correlation

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called the Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets, and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method releases coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
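
Factor (v) can be sketched as follows. The abstract gives neither the slope and intercept of the regression line nor the distance measure, so both line parameters are required arguments here, and the perpendicular distance with GC2 on the x-axis and GC3 on the y-axis is an assumption. The function names are hypothetical.

```python
import math


def gc_at_position(seq, frame, k):
    """GC fraction at codon position k (0 = 1st, 1 = 2nd, 2 = 3rd) in a given frame."""
    s = seq.upper()
    bases = [s[i + k] for i in range(frame, len(s) - 2, 3)]
    if not bases:
        raise ValueError("sequence too short for this frame")
    return sum(b in "GC" for b in bases) / len(bases)


def distance_to_line(x, y, slope, intercept):
    """Perpendicular distance from the point (x, y) to the line y = slope * x + intercept."""
    return abs(slope * x - y + intercept) / math.hypot(slope, 1.0)


def gc3_gc2_distance(seq, frame, slope, intercept):
    """Distance of a sequence's (GC2, GC3) point to a user-supplied regression line."""
    gc2 = gc_at_position(seq, frame, 1)
    gc3 = gc_at_position(seq, frame, 2)
    return distance_to_line(gc2, gc3, slope, intercept)
```

The slope and intercept would have to come from a regression fitted on reference coding sequences; no values are implied here.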

Region Required for Protein Expression from the Stop-Start Pentanucleotide in the M Gene of Influenza B Virus

Journal of Virology ◽

10.1128/jvi.00180-09 ◽

2009 ◽

Vol 83 (11) ◽

pp. 5939-5942 ◽

Cited By ~ 10

Author(s):

Masato Hatta ◽

Candice K. Kohlmeier ◽

Yasuko Hatta ◽

Makoto Ozawa ◽

Yoshihiro Kawaoka

Keyword(s):

Protein Expression ◽

Stop Codon ◽

Initiation Codon ◽

Influenza B Virus ◽

Influenza B ◽

Coding Region ◽

M Gene ◽

Coding Sequences ◽

B Virus ◽

Efficient Expression

Segment 7 of influenza B virus encodes two proteins, M1 and BM2. BM2 is expressed from a stop-start pentanucleotide, in which the BM2 initiation codon overlaps with the M1 stop codon. Here, we demonstrate that 45 nucleotides of the 3′ end of the M1 coding region, but not the 5′ end of the BM2 coding region, are sufficient for the efficient expression of the downstream protein. Placing these 45 nucleotides and the stop-start pentanucleotide in between the coding sequences induced the expression of at least three noninfluenza proteins, suggesting the utility of this system for expressing multiple proteins from one mRNA.

Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction (Edited by G. Von Heijne)

Journal of Molecular Biology ◽

10.1006/jmbi.1998.2451 ◽

1999 ◽

Vol 285 (5) ◽

pp. 1977-1991 ◽

Cited By ~ 22

Author(s):

Catherine Mathé ◽

Anatoly Peresetsky ◽

Patrice Déhais ◽

Marc Van Montagu ◽

Pierre Rouzé

Keyword(s):

Arabidopsis Thaliana ◽

Codon Usage ◽

Gene Prediction ◽

Gene Sequences ◽

Coding Sequences ◽

Arabidopsis Thaliana Gene Sequences

Function of 3′ non-coding sequences and stop codon usage in expression of the chloroplast psaB gene in Chlamydomonas reinhardtii

Plant Molecular Biology ◽

10.1007/bf00021794 ◽

1996 ◽

Vol 31 (2) ◽

pp. 337-354 ◽

Cited By ~ 25

Author(s):

Hyeonmoo Lee ◽

Scott E. Bingham ◽

Andrew N. Webber

Keyword(s):

Codon Usage ◽

Chlamydomonas Reinhardtii ◽

Stop Codon ◽

Coding Sequences

SINGLE‐PASS CLASSIFICATION OF ALL NON‐CODING SEQUENCES IN A BACTERIAL GENOME USING PHYLOGENETIC PROFILES

The FASEB Journal ◽

10.1096/fasebj.23.1_supplement.841.2 ◽

2009 ◽

Vol 23 (S1) ◽

Author(s):

Antonin Marchais ◽

Magali Naville ◽

Chantal Bohn ◽

Philippe Bouloc ◽

Daniel Gautheret

Keyword(s):

Bacterial Genome ◽

Coding Sequences ◽

Single Pass ◽

Phylogenetic Profiles

Evolution of the G+C Content Frontier in the Rat Cytomegalovirus Genome

Virology Research and Treatment ◽

10.4137/vrt.s1023 ◽

2008 ◽

Vol 1 ◽

pp. VRT.S1023

Author(s):

Derek Gatherer

Keyword(s):

Common Ancestor ◽

Stop Codon ◽

Markov Chain Model ◽

Selective Constraint ◽

Chain Model ◽

Mutation Pressure ◽

Coding Sequences ◽

The Common ◽

Whole Genome Alignment ◽

The Right

Within the 230138 bp of the rat cytomegalovirus (RCMV) genome, the G+C content changes abruptly at position 142644, constituting a G+C content frontier. To the left of this point, overall G+C content is 69.2%, and to the right it is only 47.6%. A region of extremely low G+C content (33.8%) is found in the 5 kb immediately to the right of the frontier, in which there are no predicted coding sequences. To the right of position 147501, the G+C content rises and predicted coding sequences reappear. However, these genes are much shorter (average 848 bp, 50% G+C) than those in the left two-thirds of the genome (average 1462 bp, 70% G+C). Whole genome alignment of several viruses indicates that the initial ultra-low G+C region appeared in the common ancestor of the genera Cytomegalovirus and Muromegalovirus, and that the lowering of G+C in the right third has been a subsequent process in the lineage leading to RCMV. The left two-thirds of RCMV has stop codon occurrences at 67.5% of their expected level, based on a modified Markov chain model of stop codon distribution, and the corresponding figure for the right third is 78%. Therefore, despite heavy mutation pressure, selective constraint has operated in the right third of the RCMV genome to maintain a degree of gene length unusual for such low G+C sequences.
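
Two of the quantities discussed above, local G+C content and the ratio of observed to expected stop codons, can be illustrated with a minimal sketch. The expectation here comes from a plain independent-base model built from overall base frequencies, which is a deliberately simpler stand-in for the modified Markov chain model used in the study. The function names, window size and step are hypothetical.

```python
STOPS = ("TAA", "TAG", "TGA")  # assumption: standard genetic code


def gc_windows(seq, window, step):
    """Return (start, GC fraction) for each full window along the sequence."""
    s = seq.upper()
    return [
        (i, sum(b in "GC" for b in s[i:i + window]) / window)
        for i in range(0, len(s) - window + 1, step)
    ]


def stop_obs_over_exp(seq, frame=0):
    """Observed in-frame stop codons divided by the count expected from base composition."""
    s = seq.upper()
    n = len(s)
    if n < 3:
        raise ValueError("sequence too short")
    p = {b: s.count(b) / n for b in "ACGT"}
    trip = [s[i:i + 3] for i in range(frame, n - 2, 3)]
    # Expected count = number of triplets x probability that a random triplet is a stop.
    expected = len(trip) * sum(p[c[0]] * p[c[1]] * p[c[2]] for c in STOPS)
    if expected == 0:
        raise ValueError("composition makes stop codons impossible under this model")
    return sum(c in STOPS for c in trip) / expected
```

A ratio below 1 in a long open reading frame means fewer stops than composition alone predicts. A sharp change between adjacent `gc_windows` values is the kind of signal that would mark a G+C content frontier.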

A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences

Bioinformatics and Biology Insights ◽

10.4137/bbi.s10053 ◽

2013 ◽

Vol 7 ◽

pp. BBI.S10053 ◽

Cited By ~ 8

Author(s):

Nicolas Carels ◽

Diego Frías

Keyword(s):

Success Rate ◽

Prior Knowledge ◽

Homo Sapiens ◽

Stop Codon ◽

Protein Sequences ◽

Data Bank ◽

Low Complexity ◽

Statistical Parameters ◽

Reading Frame

In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge. Considering the protein sequences of the Protein Data Bank (RCSB PDB or more simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. We first established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of higher eukaryote genes that encode proteins. Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of the genome phenotypes of plant genomes without prior knowledge. Additional care is necessary when addressing the human transcriptome due to the interference caused by large amounts of noisy sequences. UFM can be regarded as a low complexity tool for prior knowledge extraction concerning the coding fraction of the transcriptome of any eukaryote. Due to its low level of complexity, UFM is also very robust to variations of codon usage.
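
When the coding strand is unknown, all six reading frames have to be compared. The sketch below shows only that enumeration step and ranks frames by in-frame stop codon count, which is a simplified stand-in for the five-factor UFM score (the abstract does not give the scoring formula). It assumes the standard genetic code, and the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def stop_count(seq, frame):
    """Count stop codons among the complete triplets of one reading frame."""
    return sum(seq[i:i + 3] in STOPS for i in range(frame, len(seq) - 2, 3))


def best_frame(seq):
    """Return (stop count, strand, frame) for the frame with the fewest stops.

    Ties are resolved by tuple ordering: '+' before '-', then lower frame first.
    """
    strands = {"+": seq.upper(), "-": reverse_complement(seq)}
    candidates = [
        (stop_count(s, frame), strand, frame)
        for strand, s in strands.items()
        for frame in range(3)
    ]
    return min(candidates)
```

Stop count alone cannot separate frames that are all stop-free, which is one reason a composite score with additional compositional factors is useful.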
