STATISTICAL AND LINGUISTIC FEATURES OF DNA SEQUENCES

Fractals ◽  
1995 ◽  
Vol 03 (02) ◽  
pp. 269-284 ◽  
Author(s):  
S. HAVLIN ◽  
S.V. BULDYREV ◽  
A.L. GOLDBERGER ◽  
R.N. MANTEGNA ◽  
C.-K. PENG ◽  
...  

We present evidence supporting the idea that the DNA sequence in genes containing noncoding regions is correlated, and that the correlation is remarkably long range—indeed, base pairs thousands of base pairs distant are correlated. We do not find such a long-range correlation in the coding regions of the gene. We resolve the problem of the “non-stationarity” feature of the sequence of base pairs by applying a new algorithm called Detrended Fluctuation Analysis (DFA). We address the claim of Voss that there is no difference in the statistical properties of coding and noncoding regions of DNA by systematically applying the DFA algorithm, as well as standard FFT analysis, to all eukaryotic DNA sequences (33 301 coding and 29 453 noncoding) in the entire GenBank database. We describe a simple model to account for the presence of long-range power-law correlations which is based upon a generalization of the classic Lévy walk. Finally, we describe briefly some recent work showing that the noncoding sequences have certain statistical features in common with natural languages. Specifically, we adapt to DNA the Zipf approach to analyzing linguistic texts, and the Shannon approach to quantifying the “redundancy” of a linguistic text in terms of a measurable entropy function. We suggest that noncoding regions in plants and invertebrates may display a smaller entropy and larger redundancy than coding regions, further supporting the possibility that noncoding regions of DNA may carry biological information.
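As a rough illustration of the DFA procedure named in the abstract above, the sketch below maps a sequence to a "DNA walk", integrates it, removes a least-squares line from each box of size n, and fits the slope of log F(n) against log n to estimate the exponent α. The purine/pyrimidine step rule, the function names, and the box sizes are assumptions made for this example; this is a minimal sketch, not the authors' implementation.

```python
import math
import random


def dna_walk(seq):
    """Map bases to steps: pyrimidine (C, T) = +1, purine (A, G) = -1.

    The sign convention is an assumption; it does not affect the exponent.
    Characters other than A, C, G, T are skipped.
    """
    steps = []
    for base in seq.upper():
        if base in "CT":
            steps.append(1.0)
        elif base in "AG":
            steps.append(-1.0)
    return steps


def dfa_fluctuation(steps, n):
    """Root-mean-square fluctuation F(n) after linear detrending in boxes of size n."""
    # Integrated profile y(k) = sum of the first k steps.
    profile, total = [], 0.0
    for s in steps:
        total += s
        profile.append(total)

    n_boxes = len(profile) // n
    x_mean = (n - 1) / 2.0
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sq_sum = 0.0
    for b in range(n_boxes):
        box = profile[b * n:(b + 1) * n]
        y_mean = sum(box) / n
        # Least-squares line through the box, then squared residuals.
        slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(box)) / sxx
        sq_sum += sum((y - y_mean - slope * (x - x_mean)) ** 2
                      for x, y in enumerate(box))
    return math.sqrt(sq_sum / (n_boxes * n))


def dfa_exponent(steps, box_sizes):
    """Slope of log F(n) versus log n; box sizes should be at least 4."""
    log_n = [math.log(n) for n in box_sizes]
    log_f = [math.log(dfa_fluctuation(steps, n)) for n in box_sizes]
    mx = sum(log_n) / len(log_n)
    my = sum(log_f) / len(log_f)
    num = sum((a - mx) * (b - my) for a, b in zip(log_n, log_f))
    den = sum((a - mx) ** 2 for a in log_n)
    return num / den


if __name__ == "__main__":
    random.seed(0)
    # Synthetic, uncorrelated test sequence (not real GenBank data).
    seq = "".join(random.choice("ACGT") for _ in range(20000))
    print(dfa_exponent(dna_walk(seq), [8, 16, 32, 64, 128, 256]))
```

For an uncorrelated sequence the estimated exponent should lie near 0.5; long-range power-law correlations of the kind reported for noncoding regions would show up as a value above 0.5.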

2015 ◽  
Vol 08 (01) ◽  
pp. 1550004 ◽  
Author(s):  
Chengjie Tan ◽  
Shanshan Li ◽  
Ping Zhu

Graphical representation of DNA sequences is a key component in studying biological problems. To gain new insights into DNA sequences, this paper combines digitized encodings of single bases, base pairs, and base triplets with the number of times each base appears, and proposes a novel 4D graphical representation of DNA sequences. The mapping between an arbitrary DNA sequence and its 4D graphical representation is one-to-one, which avoids non-unique representations and overlapping lines. The method reflects the biological information in a DNA sequence more comprehensively and effectively, without loss. Based on the 4D graphical representation, we use its geometric center as the characteristic value of a DNA sequence, which preserves the original features of the data, and then compute the Euclidean distances and included angles between the terminal points of the resulting vectors to analyze the similarity of the first exon of the beta-globin gene among 11 species. Finally, we build a hierarchical cluster diagram of the 11 species to make the relationships between them easier to see. The results agree with biological taxonomy, which supports the rationality and effectiveness of the novel 4D graphical representation.


Fractals ◽  
1993 ◽  
Vol 01 (03) ◽  
pp. 283-301 ◽  
Author(s):  
H.E. STANLEY ◽  
S.V. BULDYREV ◽  
A.L. GOLDBERGER ◽  
S. HAVLIN ◽  
S.M. OSSADNIK ◽  
...  

The purpose of this opening talk is to describe an example of recent progress in applying fractal concepts to biological systems. We first briefly review several biological systems, and then focus on the fractal features characterized by the long-range correlations found recently in DNA sequences containing non-coding material. We also discuss the evidence supporting the finding that for sequences containing only coding regions, there are no long-range correlations. Finally, we discuss the finding that the exponent α characterizing the long-range correlations increases with evolution.


2014 ◽  
Vol 2014 ◽  
pp. 1-14 ◽  
Author(s):  
Guangchen Liu ◽  
Yihui Luan

The identification of protein coding regions (exons) plays a critical role in eukaryotic gene structure prediction. Many techniques have been introduced for discriminating between the exons and the introns in the eukaryotic DNA sequences, such as the discrete Fourier transform (DFT) based techniques, but these DFT-based methods rapidly lose their effectiveness in the case of short DNA sequences. In this paper, a novel integrated algorithm based on autoregressive spectrum analysis and wavelet packets transform is presented to improve the efficiency and accuracy of the coding regions identification. The experimental results show that the new algorithm outperforms the conventional DFT-based approaches in improving the prediction accuracy of protein coding regions distinctly by testing GENSCAN65, HMR195, and BG570 benchmark datasets.
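For context, the kind of conventional DFT-based baseline the abstract above compares against can be sketched as a sliding-window measure of period-3 spectral power, since protein-coding regions tend to show a three-base periodicity. The indicator-sequence formulation, the function names, and the window and step sizes below are assumptions chosen for illustration; this is not the autoregressive/wavelet-packet algorithm proposed in the paper.

```python
import cmath
import math


def period3_power(window):
    """Sum over the four bases of |DFT at frequency 1/3|^2.

    Each base gets a binary indicator sequence (1 where the base occurs).
    The window length should be a multiple of 3 so that frequency 1/3
    falls exactly on a DFT bin.
    """
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * math.pi * i / 3)
                    for i, b in enumerate(window) if b == base)
        total += abs(coeff) ** 2
    return total


def sliding_period3(seq, window=351, step=3):
    """Return (start, power) pairs for each window position along the sequence."""
    seq = seq.upper()
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    # Toy input: a strictly 3-periodic stretch followed by a homopolymer run.
    toy = "ATG" * 200 + "A" * 600
    for start, power in sliding_period3(toy, window=300, step=150):
        print(start, round(power, 3))
```

Peaks in the resulting profile mark candidate coding regions. Because the measure needs a reasonably long window to stand out from background noise, it degrades on short sequences, which is the limitation the abstract attributes to DFT-based methods.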


Genetics ◽  
2004 ◽  
Vol 166 (2) ◽  
pp. 661-668
Author(s):  
Mandy Kim ◽  
Erika Wolff ◽  
Tiffany Huang ◽  
Lilit Garibyan ◽  
Ashlee M Earl ◽  
...  

We have applied a genetic system for analyzing mutations in Escherichia coli to Deinococcus radiodurans, an extremophile with an astonishingly high resistance to UV- and ionizing-radiation-induced mutagenesis. Taking advantage of the conservation of the β-subunit of RNA polymerase among most prokaryotes, we derived again in D. radiodurans the rpoB/Rifʳ system that we developed in E. coli to monitor base substitutions, defining 33 base change substitutions at 22 different base pairs. We sequenced more than 250 mutations leading to Rifʳ in D. radiodurans derived spontaneously in wild-type and uvrD (mismatch-repair-deficient) backgrounds and after treatment with N-methyl-N′-nitro-N-nitrosoguanidine (NTG) and 5-azacytidine (5AZ). The specificities of NTG and 5AZ in D. radiodurans are the same as those found for E. coli and other organisms. There are prominent base substitution hotspots in rpoB in both D. radiodurans and E. coli. In several cases these are at different points in each organism, even though the DNA sequences surrounding the hotspots and their corresponding sites are very similar in both D. radiodurans and E. coli. In one case the hotspots occur at the same site in both organisms.


Genetics ◽  
1974 ◽  
Vol 77 (1) ◽  
pp. 95-104
Author(s):  
J E Sulston ◽  
S Brenner

Chemical analysis and a study of renaturation kinetics show that the nematode Caenorhabditis elegans has a haploid DNA content of 8 × 10^7 base pairs (20 times the genome of E. coli). Eighty-three percent of the DNA sequences are unique. The mean base composition is 36% GC; a small component, containing the rRNA cistrons, has a base composition of 51% GC. The haploid genome contains about 300 genes for 4S RNA, 110 for 5S RNA, and 55 for (18 + 28)S RNA.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Svetlana Kalmykova ◽  
Marina Kalinina ◽  
Stepan Denisov ◽  
Alexey Mironov ◽  
Dmitry Skvortsov ◽  
...  

The ability of nucleic acids to form double-stranded structures is essential for all living systems on Earth. Current knowledge on functional RNA structures is focused on locally-occurring base pairs. However, crosslinking and proximity ligation experiments demonstrated that long-range RNA structures are highly abundant. Here, we present the most complete to-date catalog of conserved complementary regions (PCCRs) in human protein-coding genes. PCCRs tend to occur within introns, suppress intervening exons, and obstruct cryptic and inactive splice sites. Double-stranded structure of PCCRs is supported by decreased icSHAPE nucleotide accessibility, high abundance of RNA editing sites, and frequent occurrence of forked eCLIP peaks. Introns with PCCRs show a distinct splicing pattern in response to RNAPII slowdown, suggesting that splicing is widely affected by co-transcriptional RNA folding. The enrichment of 3’-ends within PCCRs raises the intriguing hypothesis that coupling between RNA folding and splicing could mediate co-transcriptional suppression of premature pre-mRNA cleavage and polyadenylation.


1995 ◽  
Vol 51 (5) ◽  
pp. 5084-5091 ◽  
Author(s):  
S. V. Buldyrev ◽  
A. L. Goldberger ◽  
S. Havlin ◽  
R. N. Mantegna ◽  
M. E. Matsa ◽  
...  

1996 ◽  
Vol 222 (5) ◽  
pp. 354-360 ◽  
Author(s):  
V.R. Chechetkin ◽  
V.V. Lobzin

1993 ◽  
Vol 47 (5) ◽  
pp. 3730-3733 ◽  
Author(s):  
C.-K. Peng ◽  
S. V. Buldyrev ◽  
A. L. Goldberger ◽  
S. Havlin ◽  
M. Simons ◽  
...  

Author(s):  
Samapika Roy ◽  
Sukhada ◽  
Anil Kr. Singh ◽  
...  

News headlines (NHs) are among the most creative uses of natural language in media text. An NH is the frontline of a news article. Specific characteristics make NHs stand out: for instance, article omission, the use of active verbs, and dropping the copula to save space and draw the reader’s attention to the most significant words. Some research has been done on the linguistic analysis of British English NHs and Hindi-Urdu NHs, but hardly any work has been conducted on Indian English newspaper headlines (IndENH). This paper analyzes IndENH and aims to contribute to the accuracy of news headline parsing. The study determines the linguistic features of IndENH in order to improve the quality of the parsed output of NHs. The linguistic analysis covers sentence construction, tense, punctuation marks, metaphors, and related features.

