Effective sequence similarity detection with strobemers

Genome Research ◽

10.1101/gr.275648.121 ◽

2021 ◽

Author(s):

Kristoffer Sahlin

Keyword(s):

Sequence Comparison ◽

Sequence Similarity ◽

Simulated Data ◽

Mutation Rates ◽

Single Mutation ◽

Sequence Matching ◽

Sequencing Data ◽

Sequence Comparisons ◽

Oxford Nanopore ◽

Or Groups

k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
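The idea of linking shorter k-mers through a hash function can be illustrated with a minimal sketch. It assumes a simplified order-2 construction: the first strobe is the k-mer at position i, and the second is the k-mer in a downstream window that minimizes a hash of the concatenated pair. The function names, the BLAKE2 hash, and the parameters `w_min` and `w_max` are illustrative assumptions, not taken from the paper or from StrobeMap.

```python
import hashlib

def _hash(s: str) -> int:
    """Deterministic 64-bit hash of a string (an illustrative choice)."""
    digest = hashlib.blake2b(s.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def strobemers_order2(seq: str, k: int, w_min: int, w_max: int):
    """Return (start1, start2, strobe1 + strobe2) for each valid position.

    Assumes w_min >= k so that the two strobes do not overlap.
    """
    out = []
    last_start = len(seq) - k          # last position where a k-mer fits
    for i in range(last_start + 1):
        lo = i + w_min
        hi = min(i + w_max, last_start)
        if lo > hi:                    # no room left for a second strobe
            break
        first = seq[i:i + k]
        # The hash decides which downstream k-mer is linked to the first strobe.
        j = min(range(lo, hi + 1), key=lambda p: _hash(first + seq[p:p + k]))
        out.append((i, j, first + seq[j:j + k]))
    return out

seq = "ACGTTGCATGACCTGAAGTCGATCGGATCA"
result = strobemers_order2(seq, k=3, w_min=3, w_max=8)
```

Because the second strobe's offset varies with the hash, an indel between the two strobes does not necessarily destroy the match, which is the intuition the abstract describes.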


Strobemers: an alternative to k-mers for sequence comparison

10.1101/2021.01.28.428549 ◽

2021 ◽

Author(s):

Kristoffer Sahlin

Keyword(s):

Error Correction ◽

Sequence Alignment ◽

Sequence Comparison ◽

Mutation Rates ◽

Second Step ◽

Single Mutation ◽

Sequence Comparisons ◽

Selection Technique ◽

Or Groups ◽

Reference Implementation

K-mer-based methods are widely used in bioinformatics for various types of sequence comparison. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, e.g., spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches due to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of linked minimizers. We show that under a certain minimizer selection technique, strobemers provide more evenly distributed sequence matches than k-mers and are less sensitive to different mutation rates and distributions. Strobemers also produce a higher total match coverage across sequences. Strobemers are a useful alternative to k-mers for performing sequence comparisons as commonly used in sequence alignment, clustering, classification, and error-correction. A reference implementation with code for analyses is available at https://github.com/ksahlin/strobemers.
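The claim that a single mutation affects k consecutive k-mers can be checked with a small sketch. The helper names and the toy sequence below are illustrative assumptions and are unrelated to the reference implementation.

```python
def kmers(seq: str, k: int):
    """All overlapping k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def changed_kmer_positions(a: str, b: str, k: int):
    """Start positions at which the k-mers of two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(kmers(a, k), kmers(b, k))) if x != y]

original = "ACGTACGTACGTACGTACGT"
# Introduce one substitution at position 10 (G -> T).
mutated = original[:10] + "T" + original[11:]
changed = changed_kmer_positions(original, mutated, k=5)
```

With k = 5, the substitution at position 10 alters exactly the five k-mers starting at positions 6 through 10, so a region with dense mutations can lose all of its k-mer matches.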


Partial recN gene sequencing: a new tool for identification and phylogeny within the genus Streptococcus

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.018176-0 ◽

2010 ◽

Vol 60 (9) ◽

pp. 2140-2148 ◽

Cited By ~ 43

Author(s):

Olga O. Glazunova ◽

Didier Raoult ◽

Véronique Roux

Keyword(s):

Genetic Diversity ◽

Sequence Comparison ◽

Gene Sequence ◽

Sequence Similarity ◽

Rrna Gene ◽

Sequence Comparisons ◽

High Genetic Diversity ◽

Repair Protein ◽

The Mean ◽

The 16S Rrna Gene

Partial sequences of the recN gene (1249 bp), which encodes a recombination and repair protein, were analysed to determine the phylogenetic relationship and identification of streptococci. The partial sequences presented interspecies nucleotide similarity of 56.4–98.2 % and intersubspecies similarity of 89.8–98 %. The mean DNA sequence similarity of recN gene sequences (66.6 %) was found to be lower than those of the 16S rRNA gene (94.1 %), rpoB (84.6 %), sodA (74.8 %), groEL (78.1 %) and gyrB (73.2 %). Phylogenetically derived trees revealed six statistically supported groups: Streptococcus salivarius, S. equinus, S. hyovaginalis/S. pluranimalium/S. thoraltensis, S. pyogenes, S. mutans and S. suis. The ‘mitis’ group was not supported by a significant bootstrap value, but three statistically supported subgroups were noted: Streptococcus sanguinis/S. cristatus/S. sinensis, S. anginosus/S. intermedius/S. constellatus (the ‘anginosus’ subgroup) and S. mitis/S. infantis/S. peroris/S. oralis/S. oligofermentans/S. pneumoniae/S. pseudopneumoniae. The partial recN gene sequence comparison highlighted a high percentage of divergence between Streptococcus dysgalactiae subsp. dysgalactiae and S. dysgalactiae subsp. equisimilis. This observation is confirmed by other gene sequence comparisons (groEL, gyrB, rpoB and sodA). A high percentage of similarity was found between S. intermedius and S. constellatus after sequence comparison of the recN gene. To study the genetic diversity among the ‘anginosus’ subgroup, recN, groEL, sodA, gyrB and rpoB sequences were determined for 36 clinical isolates. The results that were obtained confirmed the high genetic diversity within this group of streptococci.


A Bayesian framework for inferring the influence of sequence context on single base modifications

10.1101/571646 ◽

2019 ◽

Author(s):

Guy Ling ◽

Danielle Miller ◽

Rasmus Nielsen ◽

Adi Stern

Keyword(s):

Context Effects ◽

Genetic Material ◽

Simulated Data ◽

Mutation Rates ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Sequence Context ◽

Specific Sequence ◽

Single Base ◽

Base Modifications

Abstract The probability of single base modifications (mutations and DNA/RNA modifications) is expected to be highly influenced by the flanking nucleotides that surround them, known as the sequence context. This phenomenon may be mainly attributed to the enzyme that modifies or mutates the genetic material, since most enzymes tend to have specific sequence contexts that dictate their activity. Thus, identification of context effects may lead to the discovery of additional editing sites or unknown enzymatic factors. Here, we develop a statistical model that allows for the detection and evaluation of the effects of different sequence contexts on mutation rates from deep population sequencing data. This task is computationally challenging, as the complexity of the model increases exponentially as the context size increases. We established our novel Bayesian method based on sparse model selection methods, with the leading assumption that the number of actual sequence contexts that directly influence mutation rates is minuscule compared to the number of possible sequence contexts. We show that our method is highly accurate on simulated data using pentanucleotide contexts, even when accounting for noisy data. We next analyze empirical population sequencing data from polioviruses and detect a significant enrichment in sequence contexts associated with deamination by the cellular deaminases ADAR 1/2. In the current era, where next generation sequencing data is highly abundant, our approach can be used on any population sequencing data to reveal context-dependent base alterations, and may assist in the discovery of novel mutable sites or editing sites.


Chromosome level assembly and comparative genome analysis confirm lager-brewing yeasts originated from a single hybridization

BMC Genomics ◽

10.1186/s12864-019-6263-3 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 5

Author(s):

Alex N. Salazar ◽

Arthur R. Gorter de Vries ◽

Marcel van den Broek ◽

Nick Brouwers ◽

Pilar de la Torre Cortès ◽

...

Keyword(s):

Evolutionary History ◽

Sequence Similarity ◽

Sequencing Data ◽

Brewing Yeast ◽

Linear Evolution ◽

Oxford Nanopore ◽

Pure Cultures ◽

Group 2 ◽

Chromosome Level ◽

Group 1

Background: The lager brewing yeast, S. pastorianus, is a hybrid between S. cerevisiae and S. eubayanus with extensive chromosome aneuploidy. S. pastorianus is subdivided into Group 1 and Group 2 strains, where Group 2 strains have higher copy number and a larger degree of heterozygosity for S. cerevisiae chromosomes. As a result, Group 2 strains were hypothesized to have emerged from a hybridization event distinct from Group 1 strains. Current genome assemblies of S. pastorianus strains are incomplete and highly fragmented, limiting our ability to investigate their evolutionary history. Results: To fill this gap, we generated a chromosome-level genome assembly of the S. pastorianus strain CBS 1483 from Oxford Nanopore MinION DNA sequencing data and analysed the newly assembled subtelomeric regions and chromosome heterozygosity. To analyse the evolutionary history of S. pastorianus strains, we developed Alpaca: a method to compute sequence similarity between genomes without assuming linear evolution. Alpaca revealed high similarities between the S. cerevisiae subgenomes of Group 1 and 2 strains, and marked differences from sequenced S. cerevisiae strains. Conclusions: Our findings suggest that Group 1 and Group 2 strains originated from a single hybridization involving a heterozygous S. cerevisiae strain, followed by different evolutionary trajectories. The clear differences between both groups may originate from a severe population bottleneck caused by the isolation of the first pure cultures. Alpaca provides a computationally inexpensive method to analyse evolutionary relationships while considering non-linear evolution such as horizontal gene transfer and sexual reproduction, providing a complementary viewpoint beyond traditional phylogenetic approaches.


A Bayesian Framework for Inferring the Influence of Sequence Context on Point Mutations

Molecular Biology and Evolution ◽

10.1093/molbev/msz248 ◽

2019 ◽

Vol 37 (3) ◽

pp. 893-903 ◽

Cited By ~ 1

Author(s):

Guy Ling ◽

Danielle Miller ◽

Rasmus Nielsen ◽

Adi Stern

Keyword(s):

Genetic Material ◽

Point Mutations ◽

Simulated Data ◽

Mutation Rates ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Sequence Context ◽

Specific Sequence ◽

Significant Enrichment ◽

Actual Sequence

Abstract The probability of point mutations is expected to be highly influenced by the flanking nucleotides that surround them, known as the sequence context. This phenomenon may be mainly attributed to the enzyme that modifies or mutates the genetic material, because most enzymes tend to have specific sequence contexts that dictate their activity. Here, we develop a statistical model that allows for the detection and evaluation of the effects of different sequence contexts on mutation rates from deep population sequencing data. This task is computationally challenging, as the complexity of the model increases exponentially as the context size increases. We established our novel Bayesian method based on sparse model selection methods, with the leading assumption that the number of actual sequence contexts that directly influence mutation rates is minuscule compared with the number of possible sequence contexts. We show that our method is highly accurate on simulated data using pentanucleotide contexts, even when accounting for noisy data. We next analyze empirical population sequencing data from polioviruses and HIV-1 and detect a significant enrichment in sequence contexts associated with deamination by the cellular deaminases ADAR 1/2 and APOBEC3G, respectively. In the current era, where next-generation sequencing data are highly abundant, our approach can be used on any population sequencing data to reveal context-dependent base alterations and may assist in the discovery of novel mutable sites or editing sites.
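To make the notion of a pentanucleotide context concrete, the sketch below tabulates the 5-base window centred on each mutated site. This is only a counting step that could plausibly precede any modelling; it is not the Bayesian method described in the abstract, and the function name, the `flank` parameter, and the toy data are assumptions for illustration.

```python
from collections import Counter

def context_counts(ref: str, positions, flank: int = 2):
    """Count the (2 * flank + 1)-base contexts centred on mutated positions.

    Positions too close to either end of the reference are skipped.
    """
    counts = Counter()
    for p in positions:
        if p - flank >= 0 and p + flank < len(ref):
            counts[ref[p - flank:p + flank + 1]] += 1
    return counts

reference = "AACGTTACGTA"
mutated_positions = [2, 7, 0]   # position 0 has no full left flank
counts = context_counts(reference, mutated_positions)
```

With 4 possible bases there are 4^5 = 1024 pentanucleotide contexts, which hints at why the abstract describes the model space as growing exponentially with context size.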


IsoDetect: Detection of splice isoforms from third generation long reads based on short feature sequences

Current Bioinformatics ◽

10.2174/1574893615666200316101205 ◽

2020 ◽

Vol 15 ◽

Author(s):

Hongdong Li ◽

Wenjing Zhang ◽

Yuwen Luo ◽

Jianxin Wang

Keyword(s):

Sequence Similarity ◽

Detection Methods ◽

Sequence Information ◽

Third Generation ◽

Sequencing Data ◽

Splice Isoforms ◽

Third Generation Sequencing ◽

Long Reads ◽

Feature Sequence ◽

Generation Sequencing

Aims: Accurately detect isoforms from third generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms such as humans is far from complete, due partly to the challenge in the identification of isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms due to their long length, which exceeds the length of most isoforms. One limitation of current TGS read-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as the “short feature sequence”, which is used to distinguish different splice isoforms. Second, we aligned these feature sequences to long reads and divided long reads into groups that contain the same set of feature sequences, thereby avoiding the pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.
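The second step, grouping long reads by the set of feature sequences they contain, can be sketched as follows. The sketch substitutes exact substring matching for the alignment step mentioned in the abstract, and the function name and toy data are hypothetical; it is not the IsoDetect implementation.

```python
from collections import defaultdict

def group_reads_by_features(reads: dict, features: dict) -> dict:
    """Group read names by the set of feature IDs found in each read.

    Exact substring search stands in for alignment (a simplifying assumption).
    """
    groups = defaultdict(list)
    for name, seq in reads.items():
        key = frozenset(f_id for f_id, f_seq in features.items() if f_seq in seq)
        groups[key].append(name)
    return dict(groups)

features = {"j1": "ACGT", "j2": "TTGA"}          # toy junction sequences
reads = {
    "r1": "GGACGTCC",
    "r2": "ACGTTTGA",
    "r3": "CCCC",
    "r4": "TTACGTAA",
}
groups = group_reads_by_features(reads, features)
```

Reads that share a key would then be clustered together, while reads under the empty key correspond to the case in the abstract where no short feature sequence is found.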


Sequence and phylogentic analysis of MERS-CoV in Saudi Arabia, 2012–2019

Virology Journal ◽

10.1186/s12985-021-01563-7 ◽

2021 ◽

Vol 18 (1) ◽

Author(s):

Mohamed A. Farrag ◽

Haitham M. Amer ◽

Rauf Bhat ◽

Fahad N. Almajhdi

Keyword(s):

Middle East ◽

Gene Sequence ◽

Selective Pressure ◽

Mutation Rates ◽

Receptor Binding Domain ◽

Sequence Comparisons ◽

S Gene ◽

Clade B ◽

Human Coronaviruses ◽

Early Phases

Background: The Middle East Respiratory Syndrome-related Coronavirus (MERS-CoV) continues to exist in the Middle East sporadically. Thorough investigations of the evolution of human coronaviruses (HCoVs) are urgently required. In the current study, we studied amplified fragments of ORF1a/b, Spike (S) gene, ORF3/4a, and ORF4b of four human MERS-CoV strains for tracking the evolution of MERS-CoV over time. Methods: RNA isolated from nasopharyngeal aspirate, sputum, and tracheal swabs/aspirates from hospitalized patients with suspected MERS-CoV infection was analyzed for amplification of nine variable genomic fragments. Sequence comparisons were done using different bioinformatics tools available. Results: Several mutations were identified in ORF1a/b, ORF3/4a and ORF4b, with the highest mutation rates in the S gene. Five codons, 4 in ORF1a and 1 in the S gene, were found to be under selective pressure. Characteristic amino acid changes, potentially host and year specific, were defined across the S protein and in the receptor-binding domain. Phylogenetic analysis using S gene sequences revealed clustering of MERS-CoV strains into three main clades, A, B and C, with subdivision of clade B into B1 to B4. Conclusions: MERS-CoV appears to continuously evolve. It is recommended that the molecular and pathobiological characteristics of future MERS-CoV strains should be analyzed on a regular basis to prevent potential future outbreaks at early phases.


The Complete Chloroplast Genome of the Vulnerable Oreocharis esquirolii (Gesneriaceae): Structural Features, Comparative and Phylogenetic Analysis

Plants ◽

10.3390/plants9121692 ◽

2020 ◽

Vol 9 (12) ◽

pp. 1692

Author(s):

Li Gu ◽

Ting Su ◽

Ming-Tai An ◽

Guo-Xiong Hu

Keyword(s):

Phylogenetic Analysis ◽

Sequence Similarity ◽

Single Copy ◽

Structural Features ◽

Rrna Genes ◽

Trna Genes ◽

Sequencing Data ◽

High Sequence Similarity ◽

Plastid Genomes ◽

Cp Genome

Oreocharis esquirolii, a member of Gesneriaceae, is also known as Thamnocharis esquirolii, which has been regarded as a synonym of the former. The species is endemic to Guizhou, southwestern China, and is evaluated as vulnerable (VU) under the International Union for Conservation of Nature (IUCN) criteria. Until now, the sequence and genome information of O. esquirolii has remained unknown. In this study, we assembled and characterized the complete chloroplast (cp) genome of O. esquirolii using Illumina sequencing data for the first time. The total length of the cp genome was 154,069 bp with a typical quadripartite structure consisting of a pair of inverted repeats (IRs) of 25,392 bp separated by a large single copy region (LSC) of 85,156 bp and a small single copy region (SSC) of 18,129 bp. The genome comprised 114 unique genes with 80 protein-coding genes, 30 tRNA genes, and four rRNA genes. Thirty-one repeat sequences and 74 simple sequence repeats (SSRs) were identified. Genome alignment across five plastid genomes of Gesneriaceae indicated a high sequence similarity. Four highly variable sites (rps16-trnQ, trnS-trnG, ndhF-rpl32, and ycf1) were identified. Phylogenetic analysis indicated that O. esquirolii grouped together with O. mileensis, supporting resurrection of the name Oreocharis esquirolii from Thamnocharis esquirolii. The complete cp genome sequence will contribute to further studies in molecular identification, genetic diversity, and phylogeny.
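The reported region lengths and gene counts can be cross-checked with simple arithmetic, assuming the usual quadripartite layout of one LSC, one SSC, and two IR copies. The variable names are illustrative.

```python
# Region lengths in bp, as stated in the abstract.
lsc, ssc, ir = 85_156, 18_129, 25_392

# Quadripartite structure: LSC + SSC + two inverted-repeat copies.
total = lsc + ssc + 2 * ir

# Unique genes: protein-coding + tRNA + rRNA.
genes = 80 + 30 + 4
```

Both sums agree with the abstract: 154,069 bp for the genome and 114 unique genes.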


Sphingopyxis italica sp. nov., isolated from Roman catacombs

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.046573-0 ◽

2013 ◽

Vol 63 (Pt 7) ◽

pp. 2565-2569 ◽

Cited By ~ 11

Author(s):

Cynthia Alias-Villegas ◽

Valme Jurado ◽

Leonila Laiz ◽

Cesareo Saiz-Jimenez

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Type Species ◽

Sequence Similarity ◽

Rrna Gene ◽

Sequence Comparisons ◽

Content Type ◽

Link Type ◽

Dna Base ◽

Physiological Tests

A Gram-stain-negative, aerobic, motile, rod-shaped bacterium, strain SC13E-S71T, was isolated from tuff, a volcanic rock, in which the Roman catacombs of Saint Callixtus in Rome, Italy, were excavated. Analysis of 16S rRNA gene sequences revealed that strain SC13E-S71T belongs to the genus Sphingopyxis, and that it shows the greatest sequence similarity with Sphingopyxis chilensis DSM 14889T (98.72 %), Sphingopyxis taejonensis DSM 15583T (98.65 %), Sphingopyxis ginsengisoli LMG 23390T (98.16 %), Sphingopyxis panaciterrae KCTC 12580T (98.09 %), Sphingopyxis alaskensis DSM 13593T (98.09 %), Sphingopyxis witflariensis DSM 14551T (98.09 %), Sphingopyxis bauzanensis DSM 22271T (98.02 %), Sphingopyxis granuli KCTC 12209T (97.73 %), Sphingopyxis macrogoltabida KACC 10927T (97.49 %), Sphingopyxis ummariensis DSM 24316T (97.37 %) and Sphingopyxis panaciterrulae KCTC 22112T (97.09 %). The predominant fatty acids were C18 : 1ω7c, summed feature 3 (iso-C15 : 0 2-OH and/or C16 : 1ω7c), C14 : 0 2-OH and C16 : 0. The predominant menaquinone was MK-10. The major polar lipids were diphosphatidylglycerol, phosphatidylethanolamine, phosphatidylglycerol, phosphatidylcholine and sphingoglycolipid. These chemotaxonomic data are common to members of the genus Sphingopyxis. However, a polyphasic approach using physiological tests, DNA base ratios, DNA–DNA hybridization and 16S rRNA gene sequence comparisons showed that the isolate SC13E-S71T belongs to a novel species within the genus Sphingopyxis, for which the name Sphingopyxis italica sp. nov. is proposed. The type strain is SC13E-S71T (=DSM 25229T =CECT 8016T).


Litoribacter ruber gen. nov., sp. nov., an alkaliphilic, halotolerant bacterium isolated from a soda lake sediment

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.021626-0 ◽

2010 ◽

Vol 60 (12) ◽

pp. 2996-3001 ◽

Cited By ~ 24

Author(s):

Shi-Ping Tian ◽

Yong-Xia Wang ◽

Bin Hu ◽

Xiao-Xia Zhang ◽

Wei Xiao ◽

...

Keyword(s):

Sequence Similarity ◽

Soda Lake ◽

Optimal Growth ◽

Growth Conditions ◽

Rrna Gene ◽

Strictly Aerobic ◽

Sequence Comparisons ◽

Respiratory Quinone ◽

Major Respiratory Quinone ◽

Respective Type

A novel alkaliphilic, halotolerant, rod-shaped bacterium, designated strain YIM CH208T, was isolated from a soda lake in Yunnan, south-west China. The taxonomy of strain YIM CH208T was investigated by a polyphasic approach. Strain YIM CH208T was Gram-negative, strictly aerobic and non-motile and formed red colonies. Optimal growth conditions were 28 °C, pH 8.5 and 0.5–2.5 % NaCl. Phylogenetic analysis based on 16S rRNA gene sequence comparisons showed that the isolate formed a distinct line within a clade containing the genus Echinicola in the phylum Bacteroidetes and was related to the species Echinicola pacifica and Rhodonellum psychrophilum, with sequence similarity of 91.7 and 91.6 % to the respective type strains. The DNA G+C content was 45.1 mol%. The major respiratory quinone was menaquinone-7 (MK-7). The predominant cellular fatty acids were iso-C17 : 1 ω9c (19.9 %), C15 : 0 3-OH (12.1 %), iso-C17 : 0 3-OH (11.3 %), summed feature 3 (iso-C15 : 0 2-OH and/or C16 : 1 ω7c; 10.7 %) and C17 : 1 ω6c (8.7 %). On the basis of the phenotypic, chemotaxonomic and phylogenetic data, strain YIM CH208T represents a novel species of a new genus, for which the name Litoribacter ruber gen. nov., sp. nov. is proposed. The type strain of Litoribacter ruber is YIM CH208T (=ACCC 05414T =KCTC 22899T).
