Information Theory and Multivariate Techniques for Analyzing DNA Sequence Data: An Example from Tomato Genes

2011 ◽  
Vol 1 (1) ◽  
pp. 1-8 ◽  
Author(s):  
Bal K Joshi ◽  
Dilip R Panthee

DNA and amino acid sequences are strings of alphabetic symbols with no underlying metric. Information theory offers one solution to this sequence metric problem. How DNA sequence complexity is reflected in phenotype stability might be useful for crop improvement. The Shannon-Weaver index (Shannon entropy, H') and the mutual information (MI) index were estimated from the DNA sequences of 22 tomato genes belonging to two gene families, namely disease resistance and fruit quality. The main objective was to use information theory and multivariate techniques to understand diversity among genes and to relate sequence complexity to phenotypes. The normalized H' value ranged from 0.429 to 0.461. The highest diversity was observed in the gene Crtr-B (beta carotene hydroxylase). Two principal components, which accounted for 36.65% of the variation, placed these genes into four groups. Groupings of these genes by both principal component and cluster analyses clearly showed similarity at the phenotype level within clusters. Sequence similarity among genes was observed within a family. Diversity assessment of genes using information theory should help link sequence complexity to gene stability, for example the stability of resistance genes.

Key words: Diversity analysis; DNA sequences; principal component analysis; tomato genes

Nepal Journal of Biotechnology, 2011, Vol. 1, No. 1 pp.1-9
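As an illustration of the two indices named in this abstract, the following is a minimal Python sketch of Shannon entropy and mutual information computed from a nucleotide string. The function names, the `lag` parameter, and the normalization (dividing by the log of the alphabet size) are assumptions made for this example; the abstract does not state how H' was normalized, so this sketch should not be expected to reproduce the reported 0.429 to 0.461 range.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def normalized_entropy(seq: str, alphabet_size: int = 4) -> float:
    """Entropy divided by its maximum, log2(alphabet_size).

    Assumption: this is one common normalization, not necessarily
    the one used in the paper.
    """
    return shannon_entropy(seq) / math.log2(alphabet_size)


def mutual_information(seq: str, lag: int = 1) -> float:
    """Mutual information (bits) between symbols `lag` positions apart."""
    seq = seq.upper()
    pairs = list(zip(seq, seq[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)    # marginal of the first symbol
    right = Counter(y for _, y in pairs)   # marginal of the second symbol
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((left[x] / n) * (right[y] / n)))
    return mi


if __name__ == "__main__":
    toy = "ATGGCGTACGTTAGC"  # hypothetical toy sequence, not from the paper
    print(shannon_entropy(toy), normalized_entropy(toy), mutual_information(toy))
```

A sequence using all four bases equally has entropy 2 bits and normalized entropy 1; a homopolymer has entropy 0.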

Genetics ◽  
1993 ◽  
Vol 134 (4) ◽  
pp. 1195-1204
Author(s):  
S Tarès ◽  
J M Cornuet ◽  
P Abad

Abstract An AluI family of highly reiterated nontranscribed sequences has been found in the genome of the honeybee Apis mellifera. This repeated sequence is shown to be present at approximately 23,000 copies per haploid genome, constituting about 2% of the total genomic DNA. The nucleotide sequence of 10 monomers was determined. The consensus sequence is 176 nucleotides long and has an A + T content of 58%. There are clusters of both direct and inverted repeats. Internal subrepeating units ranging from 11 to 17 nucleotides are observed, suggesting that it could have evolved from a shorter sequence. DNA sequence data reveal that this repeat class is unusually homogeneous compared to other classes of invertebrate highly reiterated DNA sequences. The average pairwise sequence divergence between the repeats is 2.5%. In spite of this unusual homogeneity, divergence has been found in the repeated sequence hybridization ladder between four different honeybee subspecies. Therefore, the AluI highly reiterated sequences provide a new probe for fingerprinting in A. m. mellifera.
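The two summary statistics quoted in this abstract, A + T content and average pairwise sequence divergence, can be sketched in Python as follows. This is a minimal illustration that assumes the monomers are already aligned to equal length and counts only substitutions (no gaps); the function names and toy sequences are hypothetical and the authors' actual procedure is not described in the abstract.

```python
from itertools import combinations


def at_content(seq: str) -> float:
    """Fraction of positions that are A or T."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)


def pairwise_divergence(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a.upper(), b.upper())) / len(a)


def mean_pairwise_divergence(seqs: list[str]) -> float:
    """Average divergence over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_divergence(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy monomers, not data from the paper.
    monomers = ["AATTGCAT", "AATTGCAA", "AATAGCAT"]
    print(at_content(monomers[0]))
    print(mean_pairwise_divergence(monomers))
```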


1993 ◽  
Vol 4 (3) ◽  
pp. 287-292 ◽  
Author(s):  
D.L. Kauffman ◽  
P.J. Keller ◽  
A. Bennick ◽  
M. Blum

Human proline-rich proteins (PRPs) constitute a complex family of salivary proteins that are encoded by a small number of genes. The primary gene product is cleaved by proteases, thereby giving rise to about 20 secreted proteins. To determine the genes for the secreted PRPs, therefore, it is necessary to obtain sequences of both the secreted proteins and the DNA encoding these proteins. We have sequenced most PRPs from one donor (D.K.) and aligned the protein sequences with available DNA sequences from unrelated individuals. Partial sequence data have now been obtained for an additional PRP from D.K. named II-1. This protein was purified from parotid saliva by gel filtration and ion-exchange chromatography. Peptides were obtained by cleavage with trypsin, clostripain, and N-bromosuccinimide, followed by column chromatography. The peptides were sequenced on a gas-phase protein sequenator. Overlapping peptide sequences were obtained for most of II-1 and aligned with translated DNA sequences. The best fit was obtained with clones containing sequences for the allele PRB4" (Lyons et al., 1988). However, there was not complete identity of the protein amino acid sequence and the DNA-derived sequences, indicating that II-1 is not encoded by PRB4". Other PRPs isolated from D.K. also fail to conform to any DNA structure so far reported. This shows the need to obtain amino acid sequences and corresponding DNA sequences from the same person to assign genes for the PRPs and to determine the location of the postribosomal cleavage points in the primary translation product.


Zootaxa ◽  
2012 ◽  
Vol 3361 (1) ◽  
pp. 56-62 ◽  
Author(s):  
JOSEFINA CURIEL ◽  
JUAN J. MORRONE

Insect life stages are known imperfectly in many cases, and classifications are usually based on adult morphology. This is unfortunate as information on other life stages may be useful for biomonitoring. The major impediment to using elmid (Coleoptera) larvae for freshwater biomonitoring is the lack of larval descriptions and illustrations. Reliable molecular protocols may be used to associate larvae and adults. After adults of seven species of Mexican Macrelmis were identified morphologically, seven larval specimens were associated to them based on two gene fragments: Cox1 and Cob. The phylogenetic analysis allowed identifying the larval specimens as Macrelmis leonilae, M. scutellaris, M. species 7, M. species 10, and M. species 11. Two species based on adults associated uncertainly with one larva, and one larva did not match with any adult. Adult/larval association in elmids using DNA sequence data seems to be promising in terms of speed and reliability.


2021 ◽  
Vol 12 ◽  
Author(s):  
Ying Zhang ◽  
Yupei Zhou ◽  
Wei Sun ◽  
Lili Zhao ◽  
D. Pavlic-Zupanc ◽  
...  

The genus Botryosphaeria includes more than 200 epithets, but only the type species, Botryosphaeria dothidea, and a dozen or more other species have been identified based on DNA sequence data. The taxonomic status of the other species remains unconfirmed because they lack either morphological information or DNA sequence data. In this study, types or authentic specimens of 16 “Botryosphaeria” species are reassessed to clarify their identity and phylogenetic position. nuDNA sequences of four regions, ITS, LSU, tef1-α and tub2, are analyzed and considered in combination with morphological characteristics. Based on the multigene phylogeny and morphological characters, Botryosphaeria cruenta and Botryosphaeria hamamelidis are transferred to Neofusicoccum. The generic status of Botryosphaeria aterrima and Botryosphaeria mirabile is confirmed in Botryosphaeria. Botryosphaeria berengeriana var. weigeliae and B. berengeriana var. acerina are treated as synonyms of B. dothidea. Botryosphaeria mucosa is transferred to Neodeightonia as Neodeightonia mucosa, and Botryosphaeria ferruginea to Nothophoma as Nothophoma ferruginea. Botryosphaeria foliicola is reduced to synonymy with Phyllachorella micheliae. Botryosphaeria abuensis, Botryosphaeria aesculi, Botryosphaeria dasylirii, and Botryosphaeria wisteriae are tentatively kept in Botryosphaeria sensu stricto until further phylogenetic analysis is carried out on verified specimens. The ordinal status of Botryosphaeria apocyni, Botryosphaeria gaubae, and Botryosphaeria smilacinina cannot be determined, and these species are tentatively accommodated in Dothideomycetes incertae sedis. The study demonstrates the significance of a polyphasic approach in characterizing type specimens, including the importance of using DNA sequence data.


Parasitology ◽  
2022 ◽  
pp. 1-16
Author(s):  
Anna Faltýnková ◽  
Olena Kudlai ◽  
Camila Pantoja ◽  
Galina Yakovleva ◽  
Daria Lebedeva

Abstract DNA sequence data have become an integral part of species characterization and identification. Still, specimens associated with a particular DNA sequence must be identified by the use of traditional morphology-based analysis, and correct linking of sequence and identification must be ensured. Only a small part of the DNA sequences of the genus Diplostomum (Diplostomidae) is based on adult isolates, which are essential for accurate identification. In this study, we provide species identification with the aid of morphological and molecular (cox1, ITS-5.8S-ITS2 and 28S) characterization of adults of Diplostomum baeri Dubois, 1937 from naturally infected Larus canus Linnaeus in Karelia, Russia. Furthermore, we reveal that the DNA sequences of our isolates of D. baeri are identical with those of the lineage Diplostomum sp. clade Q, while other sequences labelled as the ‘D. baeri’ complex do not represent lineages of D. baeri. Our new material of cercariae from Radix balthica (Linnaeus) in Ireland is also linked to Diplostomum sp. clade Q. We reveal that D. baeri is widely distributed in Europe; lymnaeid snails (Radix auricularia (Linnaeus), R. balthica) are used as first intermediate hosts; metacercariae occur in the eye lens of cyprinid fishes. In light of the convoluted taxonomy of D. baeri and other Diplostomum spp., we extend the recommendations of Blasco-Costa et al. (2016, Systematic Parasitology 93, 295–306) for the ‘best practice’ in molecular approaches to trematode systematics. The current study is another step in elucidating the species spectrum of Diplostomum based on integrative taxonomy with well-described morphology of adults linked to sequences.


Genome ◽  
1995 ◽  
Vol 38 (4) ◽  
pp. 681-688 ◽  
Author(s):  
M. C. Kiefer-Meyer ◽  
A. S. Reddy ◽  
M. Delseny

A HindII repetitive fragment (pOD3) was isolated and cloned from the genomic DNA of an accession of Oryza latifolia, a wild rice species that possesses a tetraploid CCDD genome. Southern blot analysis using this clone as a probe demonstrated that this repetitive DNA sequence had a dispersed organization in the CCDD genome and seemed to be highly specific for this genome type. This fragment is the first CCDD-specific repeated DNA sequence to be described. The hybridization pattern is similar for most CCDD accessions tested, although a few showed no hybridization signal. The nucleotide sequence of the element cloned in pOD3 was determined and analysed. The 1783 base pair long repeated sequence shows no homology with other known nucleotide sequences. In addition, none of the amino acid sequences deduced from the potential open reading frames contained in the pOD3 repeat is homologous to any known protein. The nucleotide sequence presents several internal repeats, direct or inverted, but their significance remains unknown.

Key words: rice, dispersed repetitive DNA sequences, genome-specific sequences.


Author(s):  
Jeremy Charlier ◽  
Robert Nadon ◽  
Vladimir Makarenkov

Abstract Motivation: Off-target predictions are crucial in gene editing research. Recently, significant progress has been made in the field of prediction of off-target mutations, particularly with CRISPR-Cas9 data, thanks to the use of deep learning. CRISPR-Cas9 is a gene editing technique which allows manipulation of DNA fragments. The sgRNA-DNA (single guide RNA-DNA) sequence encoding for deep neural networks, however, has a strong impact on the prediction accuracy. We propose a novel encoding of sgRNA-DNA sequences that aggregates sequence data with no loss of information. Results: In our experiments, we compare the proposed sgRNA-DNA sequence encoding applied in a deep learning prediction framework with state-of-the-art encoding and prediction methods. We demonstrate the superior accuracy of our approach in a simulation study involving Feedforward Neural Networks (FNNs), Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) as well as the traditional Random Forest (RF), Naive Bayes (NB) and Logistic Regression (LR) classifiers. We highlight the quality of our results by building several FNNs, CNNs and RNNs with various layer depths and performing predictions on two popular CRISPOR and GUIDE-seq gene editing data sets. In all our experiments, the new encoding led to more accurate off-target prediction results, providing an improvement of the area under the Receiver Operating Characteristic (ROC) curve up to 35%. Availability: The code and data used in this study are available at: https://github.com/dagrate/dl-offtarget
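To make the phrase "no loss of information" concrete, the sketch below contrasts two generic ways of encoding an sgRNA-DNA pair as a binary matrix. The abstract does not specify the authors' encoding, so neither function here should be read as their method: `encode_or` and `encode_stacked` are hypothetical names for, respectively, an element-wise OR of two one-hot matrices (which cannot tell which strand carries which base at a mismatch) and a simple stacking of the two matrices (which can).

```python
BASES = "ACGT"


def one_hot(seq: str) -> list[list[int]]:
    """4 x len(seq) binary matrix with rows ordered A, C, G, T."""
    seq = seq.upper()
    return [[1 if base == b else 0 for base in seq] for b in BASES]


def encode_or(sgrna: str, dna: str) -> list[list[int]]:
    """4 x L matrix: element-wise OR of the two one-hot matrices.

    At a mismatch, two rows are set, but the strand of origin is lost.
    """
    if len(sgrna) != len(dna):
        raise ValueError("sequences must have the same length")
    g, d = one_hot(sgrna), one_hot(dna)
    return [[x | y for x, y in zip(row_g, row_d)] for row_g, row_d in zip(g, d)]


def encode_stacked(sgrna: str, dna: str) -> list[list[int]]:
    """8 x L matrix: sgRNA rows on top of DNA rows, so nothing is merged."""
    if len(sgrna) != len(dna):
        raise ValueError("sequences must have the same length")
    return one_hot(sgrna) + one_hot(dna)


if __name__ == "__main__":
    # Swapping the two strands is invisible to the OR encoding
    # but visible to the stacked encoding.
    print(encode_or("AC", "AG") == encode_or("AG", "AC"))
    print(encode_stacked("AC", "AG") == encode_stacked("AG", "AC"))
```

Uppercase DNA letters are used for both strands here purely for simplicity.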


2016 ◽  
Vol 36 (1) ◽  
Author(s):  
Paola Berchialla

We introduce a Bayesian hierarchical model for mitochondrial DNA sequence data, which is fitted via acceptance-rejection algorithms. The model incorporates parametric models of population history explicitly as well as a mutational process allowing for a simultaneous parameter estimation whose importance has become increasingly clear in many recent studies. The model is applied to a sample of DNA sequences from the Italian population.


2019 ◽  
Author(s):  
Thomas D. Schneider ◽  
Vishnu Jejjala

Abstract Restriction enzymes recognize and bind to specific sequences on invading bacteriophage DNA. Like a key in a lock, these proteins require many contacts to specify the correct DNA sequence. Using information theory we develop an equation that defines the number of independent contacts, which is the dimensionality of the binding. We show that EcoRI, which binds to the sequence GAATTC, functions in 24 dimensions. Information theory represents messages as spheres in high dimensional spaces. Better sphere packing leads to better communications systems. The densest known packing of hyperspheres occurs on the Leech lattice in 24 dimensions. We suggest that the single protein EcoRI molecule employs a Leech lattice in its operation. Optimizing density of sphere packing explains why 6 base restriction enzymes are so common.


2021 ◽  
Vol 22 (3) ◽  
pp. 505
Author(s):  
SONIA GIULIETTI ◽  
TIZIANA ROMAGNOLI ◽  
ALESSANDRA CAMPANELLI ◽  
CECILIA TOTTI ◽  
STEFANO ACCORONI

The ecology and seasonality of Pseudo-nitzschia species and their contribution to the phytoplankton community were analysed for the first time at the coastal station of the LTER-Senigallia-Susak transect (north-western Adriatic Sea) from 1988 to 2020. Species composition was addressed using DNA sequence data obtained from 106 monoclonal strains isolated from January 2018 to January 2020. The mean annual cycle of total phytoplankton in the study period (Feb 1988–Jan 2020) showed maximum abundances in winter followed by other peaks in spring and autumn. Diatoms were the main contributors in terms of abundance during the winter and the spring blooms. The autumn peak was due to phytoflagellates and diatoms. In summer phytoflagellates dominated the community, followed by diatoms and dinoflagellates, which in this season reached their annual maximum. Pseudo-nitzschia spp. represented on average 0.4–17.6% of the diatom community, but during their blooms they could reach up to 90% of the total diatom abundances with 10⁶ cells l⁻¹. By LM, six different taxa were recognized: Pseudo-nitzschia cf. delicatissima and P. cf. pseudodelicatissima were the most abundant, followed by P. cf. fraudulenta, P. pungens, P. multistriata and P. cf. galaxiae. P. cf. fraudulenta and P. pungens were indicator taxa of winter. P. cf. delicatissima and P. cf. pseudodelicatissima were spring and summer taxa, respectively. P. galaxiae showed maximum abundances in autumn. DNA sequences revealed the presence of two species belonging to the ‘P. seriata group’ (i.e. P. fraudulenta and P. pungens) and four species belonging to the ‘P. delicatissima group’ (P. calliantha and P. mannii within the P. pseudodelicatissima species complex, and P. delicatissima and P. cf. arenysensis within the P. delicatissima species complex). The presence of several cryptic and pseudo-cryptic species highlights the need to combine LM observations with DNA sequence data when the ecology of Pseudo-nitzschia is investigated.

