A Bayesian framework for inferring the influence of sequence context on single base modifications

2019
Author(s): Guy Ling, Danielle Miller, Rasmus Nielsen, Adi Stern

Abstract The probability of single base modifications (mutations and DNA/RNA modifications) is expected to be highly influenced by the flanking nucleotides that surround them, known as the sequence context. This phenomenon may be mainly attributed to the enzyme that modifies or mutates the genetic material, since most enzymes tend to have specific sequence contexts that dictate their activity. Thus, identification of context effects may lead to the discovery of additional editing sites or unknown enzymatic factors. Here, we develop a statistical model that allows for the detection and evaluation of the effects of different sequence contexts on mutation rates from deep population sequencing data. This task is computationally challenging, as the complexity of the model increases exponentially as the context size increases. We established our novel Bayesian method based on sparse model selection methods, with the leading assumption that the number of actual sequence contexts that directly influence mutation rates is minuscule compared to the number of possible sequence contexts. We show that our method is highly accurate on simulated data using pentanucleotide contexts, even when accounting for noisy data. We next analyze empirical population sequencing data from polioviruses and detect a significant enrichment in sequence contexts associated with deamination by the cellular deaminases ADAR 1/2. In the current era, where next-generation sequencing data are highly abundant, our approach can be used on any population sequencing data to reveal context-dependent base alterations, and may assist in the discovery of novel mutable sites or editing sites.

2019
Vol 37 (3)
pp. 893-903
Author(s): Guy Ling, Danielle Miller, Rasmus Nielsen, Adi Stern

Abstract The probability of point mutations is expected to be highly influenced by the flanking nucleotides that surround them, known as the sequence context. This phenomenon may be mainly attributed to the enzyme that modifies or mutates the genetic material, because most enzymes tend to have specific sequence contexts that dictate their activity. Here, we develop a statistical model that allows for the detection and evaluation of the effects of different sequence contexts on mutation rates from deep population sequencing data. This task is computationally challenging, as the complexity of the model increases exponentially as the context size increases. We established our novel Bayesian method based on sparse model selection methods, with the leading assumption that the number of actual sequence contexts that directly influence mutation rates is minuscule compared with the number of possible sequence contexts. We show that our method is highly accurate on simulated data using pentanucleotide contexts, even when accounting for noisy data. We next analyze empirical population sequencing data from polioviruses and HIV-1 and detect a significant enrichment in sequence contexts associated with deamination by the cellular deaminases ADAR 1/2 and APOBEC3G, respectively. In the current era, where next-generation sequencing data are highly abundant, our approach can be used on any population sequencing data to reveal context-dependent base alterations and may assist in the discovery of novel mutable sites or editing sites.
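The exponential growth in model complexity noted in the abstract can be made concrete with a small sketch. It assumes a four-letter DNA alphabet and counts a context as the full window including the central base. The names `n_contexts`, `all_contexts`, and `context_at` are illustrative only and are not taken from the authors' software.

```python
from itertools import product

BASES = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def n_contexts(width):
    """Number of distinct sequence contexts for a window of `width` bases."""
    return len(BASES) ** width


def all_contexts(width):
    """Enumerate every context of the given width (practical only for small widths)."""
    return ["".join(p) for p in product(BASES, repeat=width)]


def context_at(seq, pos, flank):
    """Return the window with `flank` bases on each side of `pos`, or None near the ends."""
    if pos - flank < 0 or pos + flank >= len(seq):
        return None
    return seq[pos - flank: pos + flank + 1]


# The number of contexts grows as 4 ** width.
for width in (1, 3, 5, 7):
    print(width, n_contexts(width))
```

A pentanucleotide window already gives 1,024 possible contexts, which illustrates why an assumption that only a few contexts matter is helpful.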


2018
Author(s): Jordan M Singer, Darwin Y Fu, Jacob J Hughey

Simulated data are invaluable for assessing a computational method's ability to distinguish signal from noise. Although many biological systems show rhythmicity, there is no general-purpose tool to simulate large-scale, rhythmic data. Here we present Simphony, an R package for simulating data from experiments in which the abundances of rhythmic and non-rhythmic features (e.g., genes) are measured at multiple time points in multiple conditions. Simphony has parameters for specifying experimental design and each feature's rhythmic properties (e.g., shape, amplitude, and phase). In addition, Simphony can sample measurements from Gaussian and negative binomial distributions, the latter of which approximates read counts from next-generation sequencing data. We show an example of using Simphony to benchmark a method for detecting rhythms. Our results suggest that Simphony can aid experimental design and computational method development. Simphony is thoroughly documented and freely available at https://github.com/hugheylab/simphony.
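As a rough illustration of the kind of data described, the sketch below simulates one rhythmic feature as a cosine with Gaussian noise. It is written in Python with the standard library only and does not use or reproduce Simphony's R interface. The function name `simulate_rhythmic`, its parameters, and the 24-hour default period are assumptions made for this example.

```python
import math
import random


def simulate_rhythmic(timepoints, amplitude, phase, period=24.0,
                      baseline=0.0, sd=1.0, seed=0):
    """Sample one measurement per time point from a noisy cosine curve.

    The expected value at time t is baseline + amplitude * cos(2*pi*(t - phase)/period);
    Gaussian noise with standard deviation `sd` is added to each measurement.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    values = []
    for t in timepoints:
        mean = baseline + amplitude * math.cos(2 * math.pi * (t - phase) / period)
        values.append(rng.gauss(mean, sd))
    return values


# Two days sampled every 4 hours; amplitude 0 would give a non-rhythmic feature.
times = list(range(0, 48, 4))
print(simulate_rhythmic(times, amplitude=2.0, phase=6.0, sd=0.5))
```

Setting `sd=0.0` returns the noise-free curve, which is a convenient check when benchmarking a rhythm-detection method against known ground truth.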


2017
Vol 35 (4_suppl)
pp. 622-622
Author(s): Christopher E. Jensen, Arturo Loaiza-Bonilla, Jonathan Yap Villanueva, Jennifer J. Morrissette

Background: Recent reports have demonstrated inferior outcomes for patients with right-sided colorectal cancer (CRC) compared to patients with left-sided disease (Schrag et al 2016, JCO 34 (suppl; abstr 3505)) as well as differences in treatment response based on disease sidedness (Venook et al 2016, JCO 34 (suppl; abstr 3504)). However, the biological and genetic underpinnings of these clinical differences are incompletely understood. Methods: We compared mutation rates among 38 genes in a retrospective review of next-generation sequencing data of CRC samples obtained in routine clinical practice at a single academic medical center. Primary location was identified via chart review. Right = cecum to transverse colon; Left = descending colon to rectum. Results: Among 293 samples (167 left-sided, 103 right-sided, 23 synchronous or without clear primary), BRAF and CTNNB1 mutations were more prevalent in right-sided CRC. BRAF was mutated in 15.5% of right-sided CRC (95% CI: 8.5-22.5%) compared to 4.8% of left-sided CRC (CI: 1.6-8.0%) (p=0.003). CTNNB1 was mutated in 3.9% of right-sided CRC (CI: 0.2-7.6%) compared to no instances of CTNNB1 mutations in left-sided disease (p=0.01). Among right-sided CRC, there was a trend toward more KRAS mutations at 57.3% (CI: 47.7-66.8%) versus 44.9% (CI: 37.4-52.5%) and more PIK3CA mutations at 26.2% (CI: 17.7-34.7%) versus 17.4% (CI: 11.6-23.1%), though these differences did not reach statistical significance. The overall rate of BRAF mutations in our sample (8.9%, CI: 5.5-12.3%) was consistent with the rate of BRAF mutations among large intestine adenocarcinomas in the COSMIC database (10.6%, CI: 10.4-10.8%), lending support to the external validity of these data. Conclusions: These differing mutation rates may implicate these genetic pathways in the mechanisms underlying the discrepant outcomes and treatment responses between right- and left-sided disease described in prior studies. Further work is needed to more clearly elucidate genetic differences between right- and left-sided CRC, as well as the mechanistic relationship between these mutations and prognosis.
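The abstract does not state how its confidence intervals were computed. The sketch below assumes a normal-approximation (Wald) interval for a proportion, which reproduces the reported BRAF intervals to the stated precision; the function name `wald_ci` is illustrative.

```python
import math


def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    p_hat is the observed proportion, n the sample size, and z the normal
    quantile (1.96 for an approximate 95% interval).
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Reported BRAF mutation rates and group sizes from the abstract.
for label, p_hat, n in (("right-sided", 0.155, 103), ("left-sided", 0.048, 167)):
    lo, hi = wald_ci(p_hat, n)
    print(f"{label}: {100 * lo:.1f}% to {100 * hi:.1f}%")
```

With 15.5% of 103 right-sided samples this gives roughly 8.5% to 22.5%, and with 4.8% of 167 left-sided samples roughly 1.6% to 8.0%. The Wald interval can behave poorly for proportions near zero, so other methods may be preferable for rare mutations such as CTNNB1.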


Author(s): Gökalp Çelik, Timur Tuncali

Runs of long homozygous stretches (ROHs) are considered to be the result of consanguinity and usually contain recessive, deleterious, disease-causing mutations (Szpiech et al., 2013). Several algorithms have been developed to detect ROHs. Here, we developed a simple, alternative strategy that examines the non-pseudoautosomal region of the X chromosome to detect ROHs from next-generation sequencing data, using genotype probabilities and a hidden Markov model; the resulting tool is named ROHMM. It is implemented purely in Java and provides both a command-line and a graphical user interface. We tested ROHMM on simulated data as well as real population data from the 1000G Project and a clinical sample. Our results show that ROHMM performs robustly, producing highly accurate homozygosity estimates under all tested conditions, thereby matching and even exceeding the performance of its natural competitors.


2019
Author(s): Joanna E Handzlik, Spyros Tastsoglou, Ioannis S Vlachos, Artemis G Hatzigeorgiou

Abstract Small non-coding RNAs (sncRNAs) play important roles in health and disease. Next Generation Sequencing technologies are considered the most powerful and versatile methodologies to explore small RNA (sRNA) transcriptomes in diverse experimental and clinical studies. Small RNA-Seq data analysis has proved challenging owing to the non-unique genomic origin, short length and abundant post-transcriptional modifications of sRNA species. Here we present Manatee, an algorithm for quantification of sRNA classes and detection of uncharacterized expressed non-coding loci. Manatee adopts a novel approach for abundance estimation of genomic reads that combines sRNA annotation with reliable alignment density information and extensive reads salvation. Comparison of Manatee against state-of-the-art implementations using real and simulated data sets demonstrates its superior accuracy in quantification of diverse sRNA classes, while also providing insights into unannotated expressed loci. It is user-friendly, easily embeddable in pipelines and provides a simplified output suitable for direct use in downstream analyses and functional studies.


2021
Author(s): Kristoffer Sahlin

k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
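Two ideas from this abstract can be sketched in a few lines: a single substitution changes up to k consecutive k-mers, and a strobemer links shorter k-mers chosen by a hash function. The second function is a simplified illustration of that idea, not the construction defined in the paper. The names `changed_kmers` and `strobemers_order2`, the window parameters `w_min` and `w_max`, and the use of CRC32 as the hash are all assumptions made for this example.

```python
import zlib


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def changed_kmers(ref, alt, k):
    """Count k-mer positions that differ between two equal-length sequences."""
    return sum(a != b for a, b in zip(kmers(ref, k), kmers(alt, k)))


def strobemers_order2(seq, k, w_min, w_max):
    """Link each k-mer to a second k-mer chosen in a downstream window.

    The second strobe starts between w_min and w_max bases after the first and
    is the candidate minimising a hash of the (first, candidate) pair.
    Choosing w_min >= k keeps the two strobes from overlapping.
    """
    out = []
    for i in range(len(seq) - w_max - k + 1):
        first = seq[i:i + k]
        candidates = [seq[j:j + k] for j in range(i + w_min, i + w_max + 1)]
        # CRC32 is used because it is deterministic across runs, unlike hash().
        second = min(candidates, key=lambda c: zlib.crc32((first + c).encode()))
        out.append(first + second)
    return out


ref = "ACGTACGTTGCAAGCTTAGC"
alt = ref[:10] + "T" + ref[11:]          # one substitution at position 10
print(changed_kmers(ref, alt, 5))         # number of 5-mers affected
print(strobemers_order2(ref, 3, 3, 6)[:3])
```

Because the second strobe can sit at a variable offset, a mutation or indel between the two strobes need not destroy the match, which is the intuition behind the reduced sensitivity described above.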


2018
Author(s): A Iacoangeli, A Al Khleifat, W Sproviero, A Shatunov, AR Jones, ...

Abstract The generation of DNA Next Generation Sequencing (NGS) data is a commonly applied approach for studying the genetic basis of biological processes, including diseases, and underpins the aspirations of precision medicine. However, there are significant challenges when dealing with NGS data. A huge number of bioinformatics tools exist, which makes it challenging to design an analysis pipeline. NGS analysis is also computationally intensive and requires expensive infrastructure, which can be problematic given that many medical and research centres do not have adequate high performance computing facilities, and the use of cloud computing facilities is not always possible due to privacy and ownership issues. We have therefore developed a fast and efficient bioinformatics pipeline that allows for the analysis of DNA sequencing data, while requiring little computational effort and memory usage. We achieved this by exploiting state-of-the-art bioinformatics tools. DNAscan can analyse raw, 40x whole genome NGS data in 8 hours, using as few as 8 threads and 16 GB of RAM, while guaranteeing a high performance. DNAscan can look for SNVs, small indels, SVs, repeat expansions and genetic material from viruses (or any other organism). Its results are annotated using a customisable variety of databases including ClinVar, ExAC and dbSNP, and a local deployment of the gene.iobio platform is available for on-the-fly result visualisation.


2021
Vol 15 (1)
Author(s): Zeeshan Ahmed, Eduard Gibert Renart, Saman Zeeshan, XinQi Dong

Abstract Background Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. 
The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.

