Information Theory and Multivariate Techniques for Analyzing DNA Sequence Data: An Example from Tomato Genes

DNA and amino acid sequences are strings of alphabetic symbols with no underlying metric. Information theory offers one solution to this sequence metric problem. The way DNA sequence complexity is reflected in phenotype stability might be useful for crop improvement. The Shannon-Weaver index (Shannon entropy, H') and the mutual information (MI) index were estimated from the DNA sequences of 22 tomato genes from two gene families, namely disease resistance and fruit quality. The main objective was to use information theory and multivariate techniques to understand diversity among genes and to relate sequence complexity to phenotypes. The normalized H' value ranged from 0.429 to 0.461. The highest diversity was observed in the gene Crtr-B (beta carotene hydroxylase). Two principal components, which accounted for 36.65% of the variation, placed these genes into four groups. Groupings of these genes by both principal component and cluster analyses clearly showed similarity at the phenotype level within clusters. Sequence similarity among genes was observed within a family. Diversity assessment of genes using information theory should help to relate sequence complexity to gene stability, for example the stability of resistance genes.

Key words: Diversity analysis; DNA sequences; principal component analysis; tomato genes

Nepal Journal of Biotechnology, 2011, Vol. 1, No. 1, pp. 1-9
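As an illustration of the Shannon index mentioned in this abstract, the following is a minimal sketch that computes the entropy of a sequence's base composition. The abstract does not say how H' was normalized, so dividing by the logarithm of the alphabet size is an assumption here, as are the function names and the choice of natural logarithms.

```python
import math
from collections import Counter


def shannon_entropy(seq, alphabet="ACGT"):
    """Shannon entropy (in nats) of the base composition of seq.

    Symbols outside `alphabet` (gaps, N, etc.) are ignored.
    """
    counts = Counter(b for b in seq.upper() if b in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def normalized_entropy(seq, alphabet="ACGT"):
    """Entropy divided by its maximum, log(|alphabet|).

    ASSUMPTION: this normalization is illustrative; the abstract does not
    state which normalization the authors applied.
    """
    return shannon_entropy(seq, alphabet) / math.log(len(alphabet))
```

Under this normalization a sequence with equal base frequencies scores 1 and a homopolymer scores 0; values obtained this way are not necessarily comparable with those reported in the abstract.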

Characterization of an unusually conserved AluI highly reiterated DNA sequence family from the honeybee, Apis mellifera.

Genetics ◽

10.1093/genetics/134.4.1195 ◽

1993 ◽

Vol 134 (4) ◽

pp. 1195-1204

Author(s):

S Tarès ◽

J M Cornuet ◽

P Abad

Keyword(s):

Apis Mellifera ◽

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Sequence Divergence ◽

Repeated Sequence ◽

Consensus Sequences ◽

Dna Sequence Data ◽

Repeat Class ◽

Honeybee Subspecies

An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to the other classes of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.
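The two summary statistics quoted in this abstract, A + T content and average pairwise sequence divergence, can be sketched as follows. This is a simplified illustration that assumes the monomers are already aligned to equal length and counts only mismatches; the function names are hypothetical and the abstract does not describe how the authors handled gaps.

```python
from itertools import combinations


def at_content(seq):
    """Fraction of positions in seq that are A or T."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def pairwise_divergence(a, b):
    """Proportion of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_divergence(seqs):
    """Average divergence over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_divergence(a, b) for a, b in pairs) / len(pairs)
```

For example, with the toy (hypothetical) sequences `["AAAA", "AAAT", "AATT"]` the three pairwise divergences are 0.25, 0.5 and 0.25, giving a mean of one third.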

Alignment of Amino Acid and DNA Sequences of Human Proline-rich Proteins

Critical Reviews in Oral Biology & Medicine ◽

10.1177/10454411930040030501 ◽

1993 ◽

Vol 4 (3) ◽

pp. 287-292 ◽

Cited By ~ 12

Author(s):

D.L. Kauffman ◽

P.J. Keller ◽

A. Bennick ◽

M. Blum

Keyword(s):

Amino Acid ◽

Dna Sequences ◽

Sequence Data ◽

Gel Filtration ◽

Exchange Chromatography ◽

Amino Acid Sequences ◽

Secreted Proteins ◽

Dna Encoding ◽

Protein Amino Acid ◽

Primary Gene

Human proline-rich proteins (PRPs) constitute a complex family of salivary proteins that are encoded by a small number of genes. The primary gene product is cleaved by proteases, thereby giving rise to about 20 secreted proteins. To determine the genes for the secreted PRPs, therefore, it is necessary to obtain sequences of both the secreted proteins and the DNA encoding these proteins. We have sequenced most PRPs from one donor (D.K.) and aligned the protein sequences with available DNA sequences from unrelated individuals. Partial sequence data have now been obtained for an additional PRP from D.K. named II-1. This protein was purified from parotid saliva by gel filtration and ion-exchange chromatography. Peptides were obtained by cleavage with trypsin, clostripain, and N-bromosuccinimide, followed by column chromatography. The peptides were sequenced on a gas-phase protein sequenator. Overlapping peptide sequences were obtained for most of II-1 and aligned with translated DNA sequences. The best fit was obtained with clones containing sequences for the allele PRB4" (Lyons et al., 1988). However, there was not complete identity of the protein amino acid sequence and the DNA-derived sequences, indicating that II-1 is not encoded by PRB4". Other PRPs isolated from D.K. also fail to conform to any DNA structure so far reported. This shows the need to obtain amino acid sequences and corresponding DNA sequences from the same person to assign genes for the PRPs and to determine the location of the postribosomal cleavage points in the primary translation product.

Association of larvae and adults of Mexican species of Macrelmis (Coleoptera: Elmidae): a preliminary analysis using DNA sequences

Zootaxa ◽

10.11646/zootaxa.3361.1.5 ◽

2012 ◽

Vol 3361 (1) ◽

pp. 56-62 ◽

Cited By ~ 7

Author(s):

JOSEFINA CURIEL ◽

JUAN J. MORRONE

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Preliminary Analysis ◽

Sequence Data ◽

Strong Association ◽

Life Stages ◽

Adult Morphology ◽

Dna Sequence Data ◽

Mexican Species ◽

Major Impediment

Insect life stages are known imperfectly in many cases, and classifications are usually based on adult morphology. This is unfortunate as information on other life stages may be useful for biomonitoring. The major impediment to using elmid (Coleoptera) larvae for freshwater biomonitoring is the lack of larval descriptions and illustrations. Reliable molecular protocols may be used to associate larvae and adults. After adults of seven species of Mexican Macrelmis were identified morphologically, seven larval specimens were associated to them based on two gene fragments: Cox1 and Cob. The phylogenetic analysis allowed identifying the larval specimens as Macrelmis leonilae, M. scutellaris, M. species 7, M. species 10, and M. species 11. Two species based on adults associated uncertainly with one larva, and one larva did not match with any adult. Adult/larval association in elmids using DNA sequence data seems to be promising in terms of speed and reliability.

Toward a Natural Classification of Botryosphaeriaceae: A Study of the Type Specimens of Botryosphaeria sensu lato

Frontiers in Microbiology ◽

10.3389/fmicb.2021.737541 ◽

2021 ◽

Vol 12 ◽

Author(s):

Ying Zhang ◽

Yupei Zhou ◽

Wei Sun ◽

Lili Zhao ◽

D. Pavlic-Zupanc ◽

...

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Morphological Characteristics ◽

Taxonomic Status ◽

Morphological Characters ◽

Phylogenetic Position ◽

Type Specimens ◽

Natural Classification ◽

Dna Sequence Data

The genus Botryosphaeria includes more than 200 epithets, but only the type species, Botryosphaeria dothidea, and a dozen or more other species have been identified based on DNA sequence data. The taxonomic status of the other species remains unconfirmed because they lack either morphological information or DNA sequence data. In this study, types or authentic specimens of 16 “Botryosphaeria” species are reassessed to clarify their identity and phylogenetic position. nuDNA sequences of four regions, ITS, LSU, tef1-α and tub2, are analyzed and considered in combination with morphological characteristics. Based on the multigene phylogeny and morphological characters, Botryosphaeria cruenta and Botryosphaeria hamamelidis are transferred to Neofusicoccum. The generic status of Botryosphaeria aterrima and Botryosphaeria mirabile is confirmed in Botryosphaeria. Botryosphaeria berengeriana var. weigeliae and B. berengeriana var. acerina are treated as synonyms of B. dothidea. Botryosphaeria mucosa is transferred to Neodeightonia as Neodeightonia mucosa, and Botryosphaeria ferruginea to Nothophoma as Nothophoma ferruginea. Botryosphaeria foliicola is reduced to synonymy with Phyllachorella micheliae. Botryosphaeria abuensis, Botryosphaeria aesculi, Botryosphaeria dasylirii, and Botryosphaeria wisteriae are tentatively kept in Botryosphaeria sensu stricto until further phylogenetic analysis is carried out on verified specimens. The ordinal status of Botryosphaeria apocyni, Botryosphaeria gaubae, and Botryosphaeria smilacinina cannot be determined, and these species are tentatively accommodated in Dothideomycetes incertae sedis. The study demonstrates the significance of a polyphasic approach in characterizing type specimens, including the importance of using DNA sequence data.

Another plea for ‘best practice’ in molecular approaches to trematode systematics: Diplostomum sp. clade Q identified as Diplostomum baeri Dubois, 1937 in Europe

Parasitology ◽

10.1017/s0031182021002092 ◽

2022 ◽

pp. 1-16

Author(s):

Anna Faltýnková ◽

Olena Kudlai ◽

Camila Pantoja ◽

Galina Yakovleva ◽

Daria Lebedeva

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Best Practice ◽

Sequence Data ◽

Integrative Taxonomy ◽

Systematic Parasitology ◽

Intermediate Hosts ◽

Accurate Identification ◽

Molecular Approaches ◽

Radix Auricularia

DNA sequence data have become an integral part of species characterization and identification. Still, specimens associated with a particular DNA sequence must be identified by traditional morphology-based analysis, and correct linking of sequence and identification must be ensured. Only a small part of the DNA sequences of the genus Diplostomum (Diplostomidae) is based on adult isolates, which are essential for accurate identification. In this study, we provide species identification with the aid of morphological and molecular (cox1, ITS-5.8S-ITS2 and 28S) characterization of adults of Diplostomum baeri Dubois, 1937 from naturally infected Larus canus Linnaeus in Karelia, Russia. Furthermore, we reveal that the DNA sequences of our isolates of D. baeri are identical with those of the lineage Diplostomum sp. clade Q, while other sequences labelled as the ‘D. baeri’ complex do not represent lineages of D. baeri. Our new material of cercariae from Radix balthica (Linnaeus) in Ireland is also linked to Diplostomum sp. clade Q. We reveal that D. baeri is widely distributed in Europe; lymnaeid snails (Radix auricularia (Linnaeus), R. balthica) are used as first intermediate hosts; metacercariae occur in the eye lens of cyprinid fishes. In light of the convoluted taxonomy of D. baeri and other Diplostomum spp., we extend the recommendations of Blasco-Costa et al. (2016, Systematic Parasitology 93, 295–306) for the ‘best practice’ in molecular approaches to trematode systematics. The current study is another step in elucidating the species spectrum of Diplostomum based on integrative taxonomy with well-described morphology of adults linked to sequences.

Characterization of a dispersed repetitive DNA sequence associated with the CCDD genome of wild rice

Genome ◽

10.1139/g95-086 ◽

1995 ◽

Vol 38 (4) ◽

pp. 681-688 ◽

Cited By ~ 6

Author(s):

M. C. Kiefer-Meyer ◽

A. S. Reddy ◽

M. Delseny

Keyword(s):

Nucleotide Sequence ◽

Dna Sequence ◽

Repetitive Dna ◽

Dna Sequences ◽

Wild Rice ◽

Repeated Sequence ◽

Amino Acid Sequences ◽

Open Reading Frames ◽

Repetitive Dna Sequence ◽

Wild Rice Species

A HindII repetitive fragment (pOD3) was isolated and cloned from the genomic DNA of an accession of Oryza latifolia, a wild rice species that possesses a tetraploid CCDD genome. Southern blot analysis using this clone as a probe demonstrated that this repetitive DNA sequence had a dispersed organization in the CCDD genome and seemed to be highly specific for this genome type. This fragment is the first CCDD-specific repeated DNA sequence to be described. The hybridization pattern is similar for most CCDD accessions tested, although a few showed no hybridization signal. The nucleotide sequence of the element cloned in pOD3 was determined and analysed. The 1783 base pair long repeated sequence shows no homology with other known nucleotide sequences. In addition, none of the amino acid sequences deduced from the potential open reading frames contained in the pOD3 repeat is homologous to any known protein. The nucleotide sequence presents several internal repeats, direct or inverted, but their significance remains unknown.

Key words: rice, dispersed repetitive DNA sequences, genome-specific sequences.

Accurate deep learning off-target prediction with novel sgRNA-DNA sequence encoding in CRISPR-Cas9 gene editing

Bioinformatics ◽

10.1093/bioinformatics/btab112 ◽

2021 ◽

Author(s):

Jeremy Charlier ◽

Robert Nadon ◽

Vladimir Makarenkov

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Dna Sequence ◽

Dna Sequences ◽

Gene Editing ◽

Sequence Data ◽

Target Prediction ◽

Feedforward Neural Networks ◽

Strong Impact ◽

Sequence Encoding

Motivation: Off-target predictions are crucial in gene editing research. Recently, significant progress has been made in the field of prediction of off-target mutations, particularly with CRISPR-Cas9 data, thanks to the use of deep learning. CRISPR-Cas9 is a gene editing technique which allows manipulation of DNA fragments. The sgRNA-DNA (single guide RNA-DNA) sequence encoding for deep neural networks, however, has a strong impact on the prediction accuracy. We propose a novel encoding of sgRNA-DNA sequences that aggregates sequence data with no loss of information.

Results: In our experiments, we compare the proposed sgRNA-DNA sequence encoding applied in a deep learning prediction framework with state-of-the-art encoding and prediction methods. We demonstrate the superior accuracy of our approach in a simulation study involving Feedforward Neural Networks (FNNs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) as well as the traditional Random Forest (RF), Naive Bayes (NB) and Logistic Regression (LR) classifiers. We highlight the quality of our results by building several FNNs, CNNs and RNNs with various layer depths and performing predictions on two popular CRISPOR and GUIDE-seq gene editing data sets. In all our experiments, the new encoding led to more accurate off-target prediction results, providing an improvement of the area under the Receiver Operating Characteristic (ROC) curve up to 35%.

Availability: The code and data used in this study are available at: https://github.com/dagrate/dl-offtarget

Hierarchical Models for Mitochondrial DNA Sequence Data

Austrian Journal of Statistics ◽

10.17713/ajs.v36i1.320 ◽

2016 ◽

Vol 36 (1) ◽

Author(s):

Paola Berchialla

Keyword(s):

Mitochondrial Dna ◽

Dna Sequence ◽

Dna Sequences ◽

Sequence Data ◽

Population History ◽

Parametric Models ◽

Italian Population ◽

Bayesian Hierarchical ◽

Dna Sequence Data ◽

Mitochondrial Dna Sequence

We introduce a Bayesian hierarchical model for mitochondrial DNA sequence data, which is fitted via acceptance-rejection algorithms. The model incorporates parametric models of population history explicitly as well as a mutational process allowing for a simultaneous parameter estimation whose importance has become increasingly clear in many recent studies. The model is applied to a sample of DNA sequences from the Italian population.

Restriction enzymes use a 24 dimensional coding space to recognize 6 base long DNA sequences

10.1101/538025 ◽

2019 ◽

Author(s):

Thomas D. Schneider ◽

Vishnu Jejjala

Keyword(s):

Information Theory ◽

Dna Sequence ◽

Dna Sequences ◽

Sphere Packing ◽

Restriction Enzymes ◽

High Dimensional ◽

Leech Lattice ◽

Communications Systems ◽

Single Protein ◽

Specific Sequences

Restriction enzymes recognize and bind to specific sequences on invading bacteriophage DNA. Like a key in a lock, these proteins require many contacts to specify the correct DNA sequence. Using information theory we develop an equation that defines the number of independent contacts, which is the dimensionality of the binding. We show that EcoRI, which binds to the sequence GAATTC, functions in 24 dimensions. Information theory represents messages as spheres in high dimensional spaces. Better sphere packing leads to better communications systems. The densest known packing of hyperspheres occurs on the Leech lattice in 24 dimensions. We suggest that the single protein EcoRI molecule employs a Leech lattice in its operation. Optimizing density of sphere packing explains why 6 base restriction enzymes are so common.
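The abstract does not give the equation for the dimensionality of binding, so it is not reproduced here. As background, the sketch below computes a standard information-theoretic quantity for a binding site: the information content in bits, taken per position as 2 minus the Shannon entropy of the base frequencies and summed over positions. The function name is hypothetical, and the small-sample correction used in practice is omitted as a simplifying assumption.

```python
import math
from collections import Counter


def site_information(sites):
    """Total information content (bits) of a set of aligned DNA binding sites.

    Each position contributes 2 - H bits, where H is the Shannon entropy
    (base 2) of the observed base frequencies at that position and 2 bits
    is the maximum for a four-letter alphabet.
    ASSUMPTION: no small-sample correction is applied.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = 0.0
    for column in zip(*sites):
        n = len(column)
        counts = Counter(column)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total
```

Under this measure, a fully conserved six-base site such as GAATTC carries 2 bits per position, or 12 bits in total, while a position where all four bases are equally frequent contributes nothing.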

Ecology and seasonality of Pseudo-nitzschia species (Bacillariophyceae) in the northwestern Adriatic Sea over a 30-years period (1988–2020)

Mediterranean Marine Science ◽

10.12681/mms.26021 ◽

2021 ◽

Vol 22 (3) ◽

pp. 505

Author(s):

SONIA GIULIETTI ◽

TIZIANA ROMAGNOLI ◽

ALESSANDRA CAMPANELLI ◽

CECILIA TOTTI ◽

STEFANO ACCORONI

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Species Complex ◽

Adriatic Sea ◽

Sequence Data ◽

Coastal Station ◽

Dna Sequence Data ◽

The Mean ◽

First Time ◽

North Western

The ecology and seasonality of Pseudo-nitzschia species and their contribution to the phytoplankton community were analysed for the first time at the coastal station of the LTER-Senigallia-Susak transect (north-western Adriatic Sea) from 1988 to 2020. Species composition was addressed using DNA sequence data obtained from 106 monoclonal strains isolated from January 2018 to January 2020. The mean annual cycle of total phytoplankton in the study period (Feb 1988–Jan 2020) showed maximum abundances in winter followed by other peaks in spring and autumn. Diatoms were the main contributors in terms of abundance during the winter and the spring blooms. The autumn peak was due to phytoflagellates and diatoms. In summer phytoflagellates dominated the community, followed by diatoms and dinoflagellates, which in this season reached their annual maximum. Pseudo-nitzschia spp. represented on average 0.4–17.6% of the diatom community, but during their blooms they could reach up to 90% of the total diatom abundances with 10⁶ cells l⁻¹. By LM, six different taxa were recognized: Pseudo-nitzschia cf. delicatissima and P. cf. pseudodelicatissima were the most abundant, followed by P. cf. fraudulenta, P. pungens, P. multistriata and P. cf. galaxiae. P. cf. fraudulenta and P. pungens were indicator taxa of winter. P. cf. delicatissima and P. cf. pseudodelicatissima were spring and summer taxa, respectively. P. galaxiae showed maximum abundances in autumn. DNA sequences revealed the presence of two species belonging to the ‘P. seriata group’ (i.e. P. fraudulenta and P. pungens) and four species belonging to the ‘P. delicatissima group’ (P. calliantha and P. mannii within the P. pseudodelicatissima species complex, and P. delicatissima and P. cf. arenysensis within the P. delicatissima species complex). The presence of several cryptic and pseudo-cryptic species highlights the need to combine LM observations with DNA sequence data when the ecology of Pseudo-nitzschia is investigated.
