The number of spaced-word matches between two DNA sequences as a function of the underlying pattern weight

Abstract We study the number N_k of (spaced) word matches between pairs of evolutionarily related DNA sequences as a function of the word length or pattern weight k, respectively. We show that, under the Jukes-Cantor model, the number of substitutions per site that have occurred since two sequences evolved from their last common ancestor can be estimated from the slope of a certain function of N_k. Based on these considerations, we implemented a software program for alignment-free sequence comparison called Slope-SpaM. Test runs on simulated sequence data show that Slope-SpaM can estimate phylogenetic distances with high accuracy for up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous k-mers. Unlike previous methods that are based on the number of (spaced) word matches, our approach can deal with sequences that share only local homologies.
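The slope idea can be illustrated with a minimal sketch. It rests on assumptions the abstract does not spell out: that the expected number of contiguous k-mer matches is roughly N_k ≈ L·p^k + L1·L2·q^k, where L is the length of the homologous region, p is the per-position match probability, L1 and L2 are the sequence lengths, and q is the background match probability (1/4 for uniform base composition). Under that assumed model, ln(N_k − L1·L2·q^k) is linear in k with slope ln p, and p converts to a Jukes-Cantor distance. The function names are hypothetical and this is not the Slope-SpaM implementation.

```python
import math


def jukes_cantor_distance(p_match):
    """Substitutions per site under Jukes-Cantor, given the match probability."""
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * (1.0 - p_match))


def estimate_distance_from_counts(counts, len1, len2, q=0.25):
    """Estimate the Jukes-Cantor distance from word-match counts.

    counts: dict mapping word length k to the observed number of matches N_k.
    len1, len2: sequence lengths, used for the assumed background term.
    q: assumed background match probability per position.
    """
    ks, ys = [], []
    for k in sorted(counts):
        # Subtract the assumed number of random background matches.
        corrected = counts[k] - len1 * len2 * q ** k
        if corrected > 0:
            ks.append(k)
            ys.append(math.log(corrected))
    if len(ks) < 2:
        raise ValueError("need at least two usable values of k")

    # Ordinary least-squares slope of ln(corrected N_k) against k.
    mean_k = sum(ks) / len(ks)
    mean_y = sum(ys) / len(ys)
    numerator = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    denominator = sum((k - mean_k) ** 2 for k in ks)
    slope = numerator / denominator

    p_match = math.exp(slope)  # slope = ln p under the assumed model
    return jukes_cantor_distance(p_match)


if __name__ == "__main__":
    # Synthetic counts generated from the assumed model with p = 0.8.
    length, p = 100_000, 0.8
    synthetic = {k: length * p ** k + length * length * 0.25 ** k
                 for k in range(10, 20)}
    print(estimate_distance_from_counts(synthetic, length, length))
```

Because only the slope enters the estimate, the intercept (and hence the unknown length of the homologous region) drops out, which is consistent with the abstract's remark that the approach tolerates sequences sharing only local homologies.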


CRAFT: Compact genome Representation towards large-scale Alignment-Free daTabase

10.1101/2020.07.10.196741 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Computationally Efficient ◽

Sequencing Technologies ◽

Alignment Free

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing (HTS) data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10²–10⁴-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/[email protected]; [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
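To make the memory concern concrete, the following minimal sketch counts overlapping k-mers and computes the size of a dense frequency vector, which has 4^k entries for DNA. The function names and the 4-byte entry size are illustrative assumptions; this is not CRAFT's code, and it does not reproduce CRAFT's learned embedding.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any window with a non-ACGT symbol."""
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    return counts


def dense_vector_bytes(k, bytes_per_entry=4):
    """Storage for a dense k-mer frequency vector with 4**k entries.

    bytes_per_entry is an assumption (for example, a 32-bit counter).
    """
    return 4 ** k * bytes_per_entry


if __name__ == "__main__":
    print(dict(kmer_counts("ACGTAC", 2)))
    for k in (8, 12, 16):
        print(k, dense_vector_bytes(k))
```

The dense vector grows by a factor of four with every increment of k, independent of the sequence length, which is why a much smaller embedding space becomes attractive for large k.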


CRAFT: Compact genome Representation toward large-scale Alignment-Free daTabase

Bioinformatics ◽

10.1093/bioinformatics/btaa699 ◽

2020 ◽

Author(s):

Yang Young Lu ◽

Jiaxing Bai ◽

Yiwen Wang ◽

Ying Wang ◽

Fengzhu Sun

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Large Scale ◽

High Throughput Sequencing ◽

Sequence Data ◽

Practical Interest ◽

Supplementary Information ◽

Sequencing Data ◽

Computationally Efficient ◽

Alignment Free

Abstract Motivation: Rapid developments in sequencing technologies have boosted the generation of high volumes of sequence data. To archive and analyze those data, one primary step is sequence comparison. Alignment-free sequence comparison based on k-mer frequencies offers a computationally efficient solution, yet in practice, the k-mer frequency vectors for large k of practical interest lead to excessive memory and storage consumption. Results: We report CRAFT, a general genomic/metagenomic search engine to learn compact representations of sequences and perform fast comparison between DNA sequences. Specifically, given genome or high throughput sequencing data as input, CRAFT maps the data into a much smaller embedding space and locates the best matching genome in the archived massive sequence repositories. With a 10²–10⁴-fold reduction of storage space, CRAFT performs fast queries for gigabytes of data within seconds or minutes, achieving performance comparable to six state-of-the-art alignment-free measures. Availability and implementation: CRAFT offers a user-friendly graphical user interface with one-click installation on Windows and Linux operating systems, freely available at https://github.com/jiaxingbai/CRAFT. Supplementary information: Supplementary data are available at Bioinformatics online.


Insertions and deletions as phylogenetic signal in alignment-free sequence comparison

10.1101/2021.02.03.429685 ◽

2021 ◽

Author(s):

Niklas Birth ◽

Thomas Dencker ◽

Burkhard Morgenstern

Keyword(s):

Sequence Comparison ◽

Phylogenetic Trees ◽

Common Ancestor ◽

Phylogenetic Signal ◽

Last Common Ancestor ◽

Amino Acid Residues ◽

Tree Reconstruction ◽

Sequence Alignments ◽

Insertions And Deletions ◽

Alignment Free

Abstract Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies based on aligned nucleotide or amino-acid residues. Gaps in alignments are usually not used as phylogenetic signal, even though they can, in principle, provide valuable information. In this paper, we explore an alignment-free approach to utilize insertions and deletions for phylogeny inference. We use our previously developed approach, Multi-SpaM, to generate local gap-free four-way alignments, so-called quartet blocks. For pairs of quartet blocks involving the same four sequences, we consider the distances between these blocks in the four sequences to obtain hints about insertions or deletions that may have occurred since the four sequences evolved from their last common ancestor. In this way, a pair of quartet blocks can support one of the three possible quartet topologies for the four sequences involved. We use this information as input for Maximum-Parsimony and for the software program Quartet MaxCut to reconstruct phylogenetic trees that are based only on insertions and deletions.
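One way the block-distance idea could work is sketched below. The decision rule is an assumption made for illustration, since the abstract does not state the exact criterion: if two of the four sequences show one distance between the two blocks and the other two show a different, mutually equal distance, the pair of blocks is taken to support the corresponding split. The function name and the input layout are hypothetical.

```python
def supported_topology(pos_block1, pos_block2):
    """Return the quartet split supported by a pair of quartet blocks, or None.

    pos_block1, pos_block2: dicts mapping the same four sequence names to the
    start position of the first and second block in each sequence.
    The rule applied here is an illustrative assumption, not the published one.
    """
    names = sorted(pos_block1)
    if len(names) != 4 or set(names) != set(pos_block2):
        raise ValueError("both blocks must cover the same four sequences")

    # Distance between the two blocks within each sequence.
    dist = {name: pos_block2[name] - pos_block1[name] for name in names}

    a, b, c, d = names
    splits = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
    for left, right in splits:
        left_equal = dist[left[0]] == dist[left[1]]
        right_equal = dist[right[0]] == dist[right[1]]
        differs = dist[left[0]] != dist[right[0]]
        if left_equal and right_equal and differs:
            return (left, right)
    return None  # no indel signal, or a pattern that fits no single split


if __name__ == "__main__":
    first = {"s1": 100, "s2": 200, "s3": 50, "s4": 80}
    second = {"s1": 150, "s2": 250, "s3": 103, "s4": 133}
    print(supported_topology(first, second))
```

Pairs of blocks for which all four distances agree carry no insertion or deletion signal under this rule and are simply skipped.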


Reconstructing the ancestral phenotypes of great apes and humans (Homininae) using subspecies-level phylogenies

Biological Journal of the Linnean Society ◽

10.1093/biolinnean/blz140 ◽

2019 ◽

Author(s):

Keaghan J Yaxley ◽

Robert A Foley

Keyword(s):

Dna Sequences ◽

Phenotypic Variation ◽

Common Ancestor ◽

Phylogenetic Signal ◽

Great Apes ◽

Last Common Ancestor ◽

Ancestral State ◽

Great Ape ◽

Subspecies Level ◽

African Great Apes

Abstract Owing to their close affinity, the African great apes are of interest in the study of human evolution. Although numerous researchers have described the ancestors we share with these species with reference to extant great apes, few have done so with phylogenetic comparative methods. One obstacle to the application of these techniques is the within-species phenotypic variation found in this group. Here, we leverage this variation, modelling common ancestors using ancestral state reconstructions (ASRs) with reference to subspecies-level trait data. A subspecies-level phylogeny of the African great apes and humans was estimated from full-genome mitochondrial DNA sequences and used to implement ASRs for 14 continuous traits known to vary between great ape subspecies. Although the inclusion of within-species phenotypic variation increased the phylogenetic signal for our traits and improved the performance of our ASRs, whether this was done through the inclusion of subspecies phylogeny or through the use of existing methods made little difference. Our ASRs corroborate previous findings that the last common ancestor of humans, chimpanzees and bonobos was a chimp-like animal, but also suggest that the last common ancestor of humans, chimpanzees, bonobos and gorillas was an animal unlike any extant African great ape.


Speciation and Domestication in Maize and Its Wild Relatives: Evidence From the Globulin-1 Gene

Genetics ◽

10.1093/genetics/150.2.863 ◽

1998 ◽

Vol 150 (2) ◽

pp. 863-872 ◽

Cited By ~ 16

Author(s):

Holly Hilton ◽

Brandon S Gaut

Keyword(s):

Dna Sequences ◽

Sequence Variation ◽

Common Ancestor ◽

Seed Storage Protein ◽

Sequence Data ◽

Seed Storage ◽

Neutral Evolution ◽

Recent Event ◽

Intermediate Size ◽

Founder Event

Abstract The grass genus Zea contains the domesticate maize and several wild taxa indigenous to Central and South America. Here we study the genetic consequences of speciation and domestication in this group by sampling DNA sequences from four taxa—maize (Zea mays ssp. mays), its wild progenitor (Z. mays ssp. parviglumis), a more distant species within the genus (Z. luxurians), and a representative of the sister genus (Tripsacum dactyloides). We sampled a total of 26 sequences from the glb1 locus, which encodes a nonessential seed storage protein. Within the Zea taxa sampled, the progenitor to maize contains the most sequence diversity. Maize contains 60% of the level of genetic diversity of its progenitor, and Z. luxurians contains even less diversity (32% of the level of diversity of Z. mays ssp. parviglumis). Sequence variation within the glb1 locus is consistent with neutral evolution in all four taxa. The glb1 data were combined with adh1 data from a previous study to make inferences about the population genetic histories of these taxa. Comparisons of sequence data between the two morphologically similar wild Zea taxa indicate that the species diverged ∼700,000 years ago from a common ancestor of intermediate size to their present populations. Conversely, the domestication of maize was a recent event that could have been based on a very small number of founding individuals. Maize retained a substantial proportion of the genetic variation of its progenitor through this founder event, but diverged rapidly in morphology.


SAINT: automatic taxonomy embedding and categorization by Siamese triplet network

10.1101/2021.01.20.426920 ◽

2021 ◽

Author(s):

Yang Young Lu ◽

Yiwen Wang ◽

Fang Zhang ◽

Jiaxing Bai ◽

Ying Wang

Keyword(s):

Sequence Analysis ◽

Sequence Comparison ◽

Large Scale ◽

Sequence Data ◽

Comparison Method ◽

Supplementary Information ◽

Data Alignment ◽

Alignment Free ◽

Comparison Methods ◽

Real World Datasets

Abstract Motivation: Understanding the phylogenetic relationships among organisms is key in contemporary evolutionary study, and sequence analysis is the workhorse towards this goal. Conventional approaches to sequence analysis are based on sequence alignment, which is neither scalable to large-scale datasets, due to computational inefficiency, nor adaptive to next-generation sequencing (NGS) data. Alignment-free approaches are typically used as computationally effective alternatives, yet they still suffer from high memory consumption. A desirable sequence comparison method at large scale requires succinctly organized sequence data management, as well as prompt sequence retrieval given a never-before-seen sequence as query. Results: In this paper, we propose a novel approach, referred to as SAINT, for efficient and accurate alignment-free sequence comparison. Compared to existing alignment-free sequence comparison methods, SAINT offers advantages in two aspects: (1) SAINT is a weakly-supervised learning method where the embedding function is learned automatically from easily acquired data; (2) SAINT utilizes a non-linear deep learning-based model which potentially better captures the complicated relationships among genome sequences. We have applied SAINT to real-world datasets to demonstrate its empirical utility, both qualitatively and quantitatively. Considering the extensive applicability of alignment-free sequence comparison methods, we expect SAINT to motivate a more extensive set of applications in sequence comparison at large scale. Availability: The open source, Apache licensed, python-implemented code will be available upon acceptance. Supplementary information: Supplementary data are available at Bioinformatics online.


Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison – A Review

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207324666210811101437 ◽

2021 ◽

Vol 24 ◽

Author(s):

Natarajan Ramanathan ◽

Jayalakshmi Ramamurthy ◽

Ganapathy Natarajan

Keyword(s):

Dna Sequences ◽

Sequence Comparison ◽

Graphical Representation ◽

Dimensional Space ◽

Building Blocks ◽

Chaos Game Representation ◽

Alignment Free ◽

Comparison Methods ◽

Numerical Characterization

Background: Biological macromolecules, namely DNA, RNA, and protein, have their building blocks organized in a particular sequence, and the sequential arrangement encodes the evolutionary history of the organism (species). Hence, biological sequences have been used for studying evolutionary relationships among species. This is usually carried out by multiple sequence alignment (MSA). Due to certain limitations of MSA, alignment-free sequence comparison methods were developed. The present review is on alignment-free sequence comparison methods carried out using numerical characterization of DNA sequences. Discussion: The graphical representation of DNA sequences by chaos game representation and other 2-dimensional and 3-dimensional methods is discussed. The evolution of numerical characterization from the various graphical representations and the application of the DNA invariants thus computed in phylogenetic analysis are presented. The extension of computing molecular descriptors in chemometrics to the calculation of a new set of DNA invariants and their use in alignment-free sequence comparison in an N-dimensional space and construction of phylogenetic trees is also reviewed. Conclusion: The phylogenetic trees constructed by the alignment-free sequence comparison methods using DNA invariants were found to be better than those constructed using alignment-based tools such as PHYLIP and ClustalW. One of the graphical representation methods is now extended to study viral sequences of infectious diseases for the identification of conserved regions to design peptide-based vaccines by combining numerical characterization and graphical representation.
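For readers unfamiliar with chaos game representation, a minimal sketch follows. It uses the common construction in which each nucleotide is assigned a corner of the unit square and every new point lies halfway between the previous point and the corner of the current nucleotide. The particular corner assignment, the starting point, and the function name are assumptions for illustration; the review may describe other variants.

```python
# Assumed corner assignment; other conventions exist.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq, start=(0.5, 0.5)):
    """Map a DNA sequence to chaos game representation coordinates.

    Symbols other than A, C, G, T are skipped without moving the point.
    """
    x, y = start
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point towards the nucleotide's corner.
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Numerical descriptors can then be derived from the resulting point set, for example by counting points in a grid of cells, which is one route from a graphical representation to a numerical characterization.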


Characterization of squalene synthase gene from Gymnema sylvestre R. Br.

Beni-Suef University Journal of Basic and Applied Sciences ◽

10.1186/s43088-020-00094-4 ◽

2021 ◽

Vol 10 (1) ◽

Author(s):

Kuldeepsingh A. Kalariya ◽

Ram Prasnna Meena ◽

Lipi Poojara ◽

Deepa Shahi ◽

Sandip Patel

Keyword(s):

Dna Sequences ◽

Genomic Dna ◽

Competitive Inhibition ◽

Sequence Data ◽

Homology Model ◽

Squalene Synthase ◽

Gymnema Sylvestre ◽

Gardenia Jasminoides ◽

Ramachandran Plots ◽

Flanking Regions

Abstract Background: Squalene synthase (SQS) is a rate-limiting enzyme necessary to produce pentacyclic triterpenes in plants. It is an important enzyme producing squalene molecules required to run steroidal and triterpenoid biosynthesis pathways working in competitive inhibition mode. Reports are available on the SQS gene in several plants, but detailed information on the SQS gene in Gymnema sylvestre R. Br. is not available. G. sylvestre is a priceless rare vine of the central eco-region known for its medicinally important triterpenoids. Our work aims to characterize the GS-SQS gene in this high-value medicinal plant. Results: A coding DNA sequence (CDS) of 1245 bp representing the GS-SQS gene, predicted from transcriptome data in G. sylvestre, was used for further characterization. The SWISS protein structure modeled for the GS-SQS amino acid sequence data had a MolProbity Score of 1.44 and a Clash Score of 3.86. The quality estimates and statistical score of the Ramachandran plot analysis indicated that the homology model was reliable. For full-length amplification of the gene, primers designed from flanking regions of the CDS encoding GS-SQS were used with genomic DNA as template, which resulted in a single-band product of approximately 6.2 kb. Sequencing of this product through NGS generated 2.32 Gb of data and 3347 scaffolds with an N50 value of 457 bp. These scaffolds were compared to identify similarity with other SQS genes as well as the GS-SQSs of the transcriptome. Scaffold_3347, representing the GS-SQS gene, harbored two introns of 101 and 164 bp. Both intronic regions were validated by primers designed from the regions adjoining the introns on the scaffold representing the GS-SQS gene. Amplification took place when the template was genomic DNA and failed when the template was cDNA, confirming the presence of two introns in the GS-SQS gene in Gymnema sylvestre R. Br. Conclusion: This study shows that the GS-SQS gene is very closely related to that of Coffea arabica and Gardenia jasminoides and that this gene harbors two introns of 101 and 164 bp.


Experimental evidence for yawn contagion in orangutans (Pongo pygmaeus)

Scientific Reports ◽

10.1038/s41598-020-79160-x ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Evy van Berlo ◽

Alejandra P. Díaz-Loyo ◽

Oscar E. Juárez-Mora ◽

Mariska E. Kret ◽

Jorg J. M. Massen

Keyword(s):

Experimental Evidence ◽

Common Ancestor ◽

Great Apes ◽

Pongo Pygmaeus ◽

Last Common Ancestor ◽

Contagious Yawning ◽

Proximate Mechanism ◽

Social Species ◽

Ultimate Causation

Abstract Yawning is highly contagious, yet both its proximate mechanism(s) and its ultimate causation remain poorly understood. Scholars have suggested a link between contagious yawning (CY) and sociality due to its appearance in mostly social species. Nevertheless, as findings are inconsistent, CY’s function and evolution remain heavily debated. One way to understand the evolution of CY is by studying it in hominids. Although CY has been found in chimpanzees and bonobos but is absent in gorillas, data on orangutans are missing despite them being the least social hominid. Orangutans are thus interesting for understanding CY’s phylogeny. Here, we experimentally tested whether orangutans yawn contagiously in response to videos of conspecifics yawning. Furthermore, we investigated whether CY was affected by familiarity with the yawning individual (i.e. a familiar or unfamiliar conspecific and a 3D orangutan avatar). In 700 trials across 8 individuals, we found that orangutans are more likely to yawn in response to yawn videos compared to control videos of conspecifics, but not to yawn videos of the avatar. Interestingly, CY occurred regardless of whether a conspecific was familiar or unfamiliar. We conclude that CY was likely already present in the last common ancestor of humans and great apes, though more converging evidence is needed.


A Chromosome-Based Model for Estimating the Number of Conserved Segments Between Pairs of Species From Comparative Genetic Maps

Genetics ◽

10.1093/genetics/154.1.323 ◽

2000 ◽

Vol 154 (1) ◽

pp. 323-332

Author(s):

David Waddington ◽

Anthea J Springbett ◽

David W Burt

Keyword(s):

Maximum Likelihood ◽

Dna Sequences ◽

Common Ancestor ◽

Length Distribution ◽

Genetic Maps ◽

Segment Length ◽

Syntenic Block ◽

Comparative Maps

Abstract Comparative genetic maps of two species allow insights into the rearrangements of their genomes since divergence from a common ancestor. When the map details the positions of genes (or any set of orthologous DNA sequences) on chromosomes, syntenic blocks of one or more genes may be identified and used, with appropriate models, to estimate the number of chromosomal segments with conserved content between species. We propose a model for the distribution of the lengths of unobserved segments on each chromosome that allows for widely differing chromosome lengths. The model uses as data either the counts of genes in a syntenic block or the distance between extreme members of a block, or both. The parameters of the proposed segment length distribution, estimated by maximum likelihood, give predictions of the number of conserved segments per chromosome. The model is applied to data from two comparative maps for the chicken, one with human and one with mouse.
