Personalized and graph genomes reveal missing signal in epigenomic data

2018 ◽  
Author(s):  
Cristian Groza ◽  
Tony Kwan ◽  
Nicole Soranzo ◽  
Tomi Pastinen ◽  
Guillaume Bourque

Abstract Background: Epigenomic studies that use next-generation sequencing experiments typically rely on the alignment of reads to a reference sequence. However, because of genetic diversity and the diploid nature of the human genome, we hypothesized that using a generic reference could lead to incorrectly mapped reads and bias downstream results. Results: We show that accounting for genetic variation using a modified reference genome (MPG) or a de novo assembled genome (DPG) can alter histone H3K4me1 and H3K27ac ChIP-seq peak calls by either creating new personal peaks or by the loss of reference peaks. MPGs are found to alter approximately 1% of peak calls while DPGs alter up to 5% of peaks. We also show statistically significant differences in the number of reads observed in regions associated with the new, altered and unchanged peaks. We report that short insertions and deletions (indels), followed by single nucleotide variants (SNVs), have the highest probability of modifying peak calls. A counter-balancing factor is peak width, with wider calls being less likely to be altered. Next, because high-quality DPGs remain hard to obtain, we show that using a graph personalized genome (GPG) represents a reasonable compromise between MPGs and DPGs and alters about 2.5% of peak calls. Finally, we demonstrate that altered peaks have a genomic distribution typical of other peaks. For instance, for H3K4me1, 518 personal-only peaks were replicated using at least two of three approaches, 394 of which were inside or within 10 kb of a gene. Conclusions: Analysing epigenomic datasets with personalized and graph genomes allows the recovery of new peaks enriched for indels and SNVs. These altered peaks are more likely to differ between individuals and, as such, could be relevant in the study of various human phenotypes.


2021 ◽  
Author(s):  
Juliana D Siqueira ◽  
Livia R Goes ◽  
Brunna M Alves ◽  
Pedro S de Carvalho ◽  
Claudia Cicala ◽  
...  

Abstract Numerous factors have been identified to influence susceptibility to SARS-CoV-2 infection and disease severity. Cancer patients are more prone to clinically evolve to more severe COVID-19 conditions, but the determinants of such a more severe outcome remain largely unknown. We have determined the full-length SARS-CoV-2 genomic sequences of cancer patients and healthcare workers (non-cancer controls) by deep sequencing and investigated the within-host viral population of each infection, quantifying intrahost genetic diversity. Naso- and oropharyngeal SARS-CoV-2+ swabs from 57 cancer patients and 14 healthcare workers from the Brazilian National Cancer Institute were collected in April–May 2020. Complete genome amplification using ARTIC network V3 multiplex primers was performed followed by next-generation sequencing. Assemblies were conducted in Geneious R11, where consensus sequences were extracted and intrahost single nucleotide variants were identified. Maximum likelihood phylogenetic analysis was performed using PhyMLv.3.0 and lineages were classified using Pangolin and CoV-GLUE. Phylogenetic analysis showed that all but one strain belonged to clade B1.1. Four genetically linked mutations known as the globally dominant SARS-CoV-2 haplotype (C241T, C3037T, C14408T and A23403G) were found in the majority of consensus sequences. SNV signatures of previously characterized Brazilian genomes were also observed in most samples. Another 85 SNVs were found at a lower frequency (1.4-19.7%) among the consensus sequences. Cancer patients displayed a significantly higher intrahost viral genetic diversity compared to healthcare workers. This difference was independent of SARS-CoV-2 Ct values obtained at the diagnostic tests, which did not differ between the two groups. The most common nucleotide changes of intrahost SNVs in both groups were consistent with APOBEC and ADAR activities. 
Intrahost genetic diversity in cancer patients was not associated with disease severity, use of corticosteroids, or use of antivirals, characteristics that could influence viral diversity. Moreover, the presence of metastasis, either in general or specifically in the lung, was not associated with intrahost diversity among cancer patients. Cancer patients carried significantly higher numbers of minor variants compared to non-cancer counterparts. Further studies on SARS-CoV-2 diversity in especially vulnerable patients will shed light on the basis of the different COVID-19 outcomes in humans.



2020 ◽  
Author(s):  
Daniel Shriner ◽  
Adebowale Adeyemo ◽  
Charles Rotimi

In clinical genomics, variant calling from short-read sequencing data typically relies on a pan-genomic, universal human reference sequence. A major limitation of this approach is that the number of reads that incorrectly map or fail to map increases as the reads diverge from the reference sequence. In the context of genome sequencing of genetically diverse Africans, we investigate the advantages and disadvantages of using a de novo assembly of the read data as the reference sequence in single sample calling. Conditional on sufficient read depth, the alignment-based and assembly-based approaches yielded comparable sensitivity and false discovery rates for single nucleotide variants when benchmarked against a gold standard call set. The alignment-based approach yielded coverage of an additional 270.8 Mb over which sensitivity was lower and the false discovery rate was higher. Although both approaches detected and missed clinically relevant variants, the assembly-based approach identified more such variants than the alignment-based approach. Of particular relevance to individuals of African descent, the assembly-based approach identified four heterozygous genotypes containing the sickle allele whereas the alignment-based approach identified no occurrences of the sickle allele. Variant annotation using dbSNP and gnomAD identified systematic biases in these databases due to underrepresentation of Africans. Using the counts of homozygous alternate genotypes from the alignment-based approach as a measure of genetic distance to the reference sequence GRCh38.p12, we found that the numbers of misassemblies, total variant sites, potentially novel single nucleotide variants (SNVs), and certain variant classes (e.g., splice acceptor variants, stop loss variants, missense variants, synonymous variants, and variants absent from gnomAD) were significantly correlated with genetic distance.
In contrast, genomic coverage and other variant classes (e.g., ClinVar pathogenic or likely pathogenic variants, start loss variants, stop gain variants, splice donor variants, incomplete terminal codons, variants with CADD score ≥20) were not correlated with genetic distance. With improvement in coverage, the assembly-based approach can offer a viable alternative to the alignment-based approach, with the advantage that it can obviate the need to generate diverse human reference sequences or collections of alternate scaffolds.
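
As a rough illustration of the benchmarking idea in the abstract above, sensitivity and false discovery rate can be computed by comparing a call set with a gold standard call set. The sketch below is a minimal Python example under stated assumptions: the `benchmark` helper and the toy variant tuples are hypothetical and are not taken from the study, and real benchmarking tools also normalise variant representation, which is ignored here.

```python
def benchmark(calls, truth):
    """Compare called variants with a gold standard set.

    Both arguments are sets of (chrom, pos, ref, alt) tuples.
    Returns (sensitivity, false_discovery_rate).
    """
    true_pos = len(calls & truth)    # called and in the gold standard
    false_pos = len(calls - truth)   # called but not in the gold standard
    false_neg = len(truth - calls)   # in the gold standard but missed
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    fdr = false_pos / (true_pos + false_pos) if calls else float("nan")
    return sensitivity, fdr


# Toy data (hypothetical, for illustration only)
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "A"), ("chr3", 40, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 75, "G", "A"), ("chr4", 10, "A", "T")}

sensitivity, fdr = benchmark(calls, truth)
print(f"sensitivity={sensitivity:.2f}, FDR={fdr:.2f}")
```

With these toy sets, three of the four gold standard variants are recovered and one call is spurious, so sensitivity is 0.75 and the false discovery rate is 0.25.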



2021 ◽  
Author(s):  
Jie Wang ◽  
Shiming Li ◽  
Lei Lan ◽  
Mushan Xie ◽  
Shu Cheng ◽  
...  

Abstract Background: Setaria italica is the second-most widely planted species of millets in the world and an important model grain crop for the research of C4 photosynthesis and abiotic stress tolerance. The three previous genome assembly and annotation efforts were all based on next-generation sequencing technology, which limited genome continuity. Results: Here we report a high-quality whole-genome assembly of the new cultivar Huagu11, using single-molecule real-time sequencing and high-throughput chromosome conformation capture (Hi-C) mapping technologies. The total assembly size of the Huagu11 genome was 408.37 Mb with a scaffold N50 size of 45.89 Mb. Compared with the other three reported millet genomes based on next-generation sequencing technology, the Huagu11 genome had the highest genomic continuity. Intraspecies comparison showed that about 94.97% and 94.66% of the Yugu1 and Huagu11 genomes, respectively, could be aligned as one-to-one blocks with four chromosomal inversions. The Huagu11 genome contained approximately 19.43 Mb of presence/absence variation (PAV) with 627 protein-coding transcripts, while the Yugu1 genome had 20.53 Mb of PAV sequences encoding 737 proteins. Overall, 969,596 single-nucleotide polymorphisms (SNPs) and 156,282 insertions/deletions (InDels) were identified between these two genomes. The genome comparison between Huagu11 and Yugu1 should reflect the genetic identity and variation between the cultivars of foxtail millet to a certain extent. The Ser-626-Aln substitution in acetohydroxy acid synthase (AHAS) was found to be related to imazethapyr tolerance in Huagu11. Conclusions: A new improved high-quality reference genome sequence of Setaria italica was assembled, and intraspecies genome comparison determined the genetic identity and variation between the cultivars of foxtail millet. Based on the genome sequence, it was found that the Ser-626-Aln substitution in AHAS was responsible for the imazethapyr tolerance in Huagu11.
The new improved reference genome of Setaria italica will promote the genic and genomic studies of this species and be beneficial for cultivar improvement.
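
The scaffold N50 reported above is a standard continuity measure: the length of the scaffold at which the cumulative length, counted from the longest scaffold down, first reaches half of the total assembly size. A minimal Python sketch follows; the `n50` helper and the toy lengths are illustrative assumptions and are not the Huagu11 scaffolds.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half
        # of the assembly size defines the N50.
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, for illustration only)
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

For the toy lengths the total is 150, so the cumulative sum reaches the halfway point of 75 at the second scaffold and the N50 is 40.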



Genes ◽  
2019 ◽  
Vol 10 (9) ◽  
pp. 671 ◽  
Author(s):  
Pucker ◽  
Rückert ◽  
Stracke ◽  
Viehöver ◽  
Kalinowski ◽  
...  

Arabidopsis thaliana is one of the best studied plant model organisms. Besides cultivation in greenhouses, cells of this plant can also be propagated in suspension cell culture. At7 is one such cell line that was established about 25 years ago. Here, we report the sequencing and the analysis of the At7 genome. Large scale duplications and deletions compared to the Columbia-0 (Col-0) reference sequence were detected. The number of deletions exceeds the number of insertions, thus indicating that a haploid genome size reduction is ongoing. Patterns of small sequence variants differ from the ones observed between A. thaliana accessions, e.g., the number of single nucleotide variants matches the number of insertions/deletions. RNA-Seq analysis reveals that disrupted alleles are less frequent in the transcriptome than the native ones.





2016 ◽  
Vol 140 (10) ◽  
pp. 1085-1091 ◽  
Author(s):  
Eric J. Duncavage ◽  
Haley J. Abel ◽  
Jason D. Merker ◽  
John B. Bodner ◽  
Qin Zhao ◽  
...  

Context.—Most current proficiency testing challenges for next-generation sequencing assays are methods-based proficiency testing surveys that use DNA from characterized reference samples to test both the wet-bench and bioinformatics/dry-bench aspects of the tests. Methods-based proficiency testing surveys are limited by the number and types of mutations that either are naturally present or can be introduced into a single DNA sample. Objective.—To address these limitations by exploring a model of in silico proficiency testing in which sequence data from a single well-characterized specimen are manipulated electronically. Design.—DNA from the College of American Pathologists reference genome was enriched using the Illumina TruSeq and Life Technologies AmpliSeq panels and sequenced on the MiSeq and Ion Torrent platforms, respectively. The resulting data were mutagenized in silico and 26 variants, including single-nucleotide variants, deletions, and dinucleotide substitutions, were added at variant allele fractions (VAFs) from 10% to 50%. Participating clinical laboratories downloaded these files and analyzed them using their clinical bioinformatics pipelines. Results.—Laboratories using the AmpliSeq/Ion Torrent and/or the TruSeq/MiSeq participated in the 2 surveys. On average, laboratories identified 24.6 of 26 variants (95%) overall and 21.4 of 22 variants (97%) with VAFs greater than 15%. No false-positive calls were reported. The most frequently missed variants were single-nucleotide variants with VAFs less than 15%. Across both challenges, reported VAF concordance was excellent, with less than 1% median absolute difference between the simulated VAF and mean reported VAF. Conclusions.—The results indicate that in silico proficiency testing is a feasible approach for methods-based proficiency testing, and demonstrate that the sensitivity and specificity of current next-generation sequencing bioinformatics across clinical laboratories are high.
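
To make the quantities in this abstract concrete, a variant allele fraction (VAF) is the share of reads at a site that carry the variant allele, and the detection figures are simple ratios of variants found to variants spiked in. The Python sketch below is a minimal illustration; the helper names and the read counts are hypothetical, while the 24.6 of 26 and 21.4 of 22 figures come from the abstract.

```python
def variant_allele_fraction(alt_reads, total_reads):
    """Fraction of reads at a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def detection_rate(found, expected):
    """Share of simulated variants that a laboratory reported."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return found / expected


# Hypothetical read counts: 30 variant-supporting reads out of 200
print(variant_allele_fraction(30, 200))

# Average detection figures quoted in the abstract
print(f"{detection_rate(24.6, 26):.1%}")   # all 26 simulated variants
print(f"{detection_rate(21.4, 22):.1%}")   # variants with VAF above 15%
```

The two ratios evaluate to about 94.6% and 97.3%, which the abstract reports rounded as 95% and 97%.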



2020 ◽  
Author(s):  
Luciano Calderón ◽  
Nuria Mauri ◽  
Claudio Muñoz ◽  
Pablo Carbonell-Bejerano ◽  
Laura Bree ◽  
...  

Abstract Grapevine (Vitis vinifera L.) cultivars are clonally propagated to preserve their varietal attributes. However, novel genetic variation still accumulates due to somatic mutations. Aiming to study the potential impact of clonal propagation history on grapevine intra-cultivar genetic diversity, we have focused on ‘Malbec’. This cultivar is appreciated for red wine production; it originated in Southwestern France and was introduced into Argentina during the 1850s. Here, we generated whole-genome resequencing data for four ‘Malbec’ clones with different historical backgrounds. A stringent variant calling procedure was established to identify reliable clonal polymorphisms, additionally corroborated by Sanger sequencing. This analysis retrieved 941 single nucleotide variants (SNVs) occurring among the analyzed clones. Based on a set of validated SNVs, a genotyping experiment was custom-designed to survey ‘Malbec’ genetic diversity. We successfully genotyped 214 samples and identified 14 different clonal genotypes that clustered into two genetically divergent groups. Group-Ar was driven by clones with a long history of clonal propagation in Argentina, while Group-Fr was driven by clones that have remained longer in Europe. Findings show the ability of such approaches to identify clonal genotypes in grapevines. In particular, we provide evidence on how human actions may have shaped the extant genetic diversity pattern of ‘Malbec’.



F1000Research ◽  
2014 ◽  
Vol 2 ◽  
pp. 217 ◽  
Author(s):  
Guillermo Barturen ◽  
Antonio Rueda ◽  
José L. Oliver ◽  
Michael Hackenberg

Whole genome methylation profiling at a single cytosine resolution is now feasible due to the advent of high-throughput sequencing techniques together with bisulfite treatment of the DNA. To obtain the methylation value of each individual cytosine, the bisulfite-treated sequence reads are first aligned to a reference genome, and then the profiling of the methylation levels is done from the alignments. A huge effort has been made to quickly and correctly align the reads and many different algorithms and programs to do this have been created. However, the second step is just as crucial and non-trivial, but much less attention has been paid to the final inference of the methylation states. Important error sources do exist, such as sequencing errors, bisulfite failure, clonal reads, and single nucleotide variants. We developed MethylExtract, a user-friendly tool to: i) generate high quality, whole genome methylation maps and ii) detect sequence variation within the same sample preparation. The program is implemented as a single script and takes into account all major error sources. MethylExtract detects variation (SNVs – Single Nucleotide Variants) in a similar way to VarScan, a very sensitive method extensively used in SNV and genotype calling based on non-bisulfite-treated reads. The usefulness of MethylExtract is shown by means of extensive benchmarking based on artificial bisulfite-treated reads and a comparison to a recently published method, called Bis-SNP. MethylExtract is able to detect SNVs within high-throughput sequencing experiments of bisulfite-treated DNA at the same time as it generates high quality methylation maps. This simultaneous detection of DNA methylation and sequence variation is crucial for many downstream analyses, for example when deciphering the impact of SNVs on differential methylation. An exclusive feature of MethylExtract, in comparison with existing software, is the possibility to assess the bisulfite failure in a statistical way.
The source code, tutorial and artificial bisulfite datasets are available at http://bioinfo2.ugr.es/MethylExtract/ and http://sourceforge.net/projects/methylextract/, and also permanently accessible from 10.5281/zenodo.7144.
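
The second step described in this abstract, inferring a methylation value for each cytosine from aligned bisulfite reads, rests on a simple ratio: bisulfite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are still read as C. The Python sketch below illustrates only that basic ratio. It is not MethylExtract's implementation; the `methylation_level` helper and the read counts are hypothetical, and the error sources the abstract lists (sequencing errors, bisulfite failure, clonal reads, SNVs) are deliberately ignored.

```python
def methylation_level(c_reads, t_reads):
    """Estimate methylation at one cytosine from bisulfite read counts.

    c_reads: reads showing C at the site (interpreted as methylated)
    t_reads: reads showing T at the site (interpreted as unmethylated)
    Returns None when no informative read covers the site.
    """
    if c_reads < 0 or t_reads < 0:
        raise ValueError("read counts must be non-negative")
    covered = c_reads + t_reads
    if covered == 0:
        return None
    return c_reads / covered


# Hypothetical counts: 8 reads show C and 2 reads show T
print(methylation_level(8, 2))
```

A genuine C-to-T variant at the site would produce the same T reads as an unmethylated cytosine, which is one reason the abstract argues for detecting SNVs alongside methylation.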



2019 ◽  
Author(s):  
Boas Pucker ◽  
Christian Rückert ◽  
Ralf Stracke ◽  
Prisca Viehöver ◽  
Jörn Kalinowski ◽  
...  

Abstract Arabidopsis thaliana is one of the best studied plant model organisms. Besides cultivation in greenhouses, cells of this plant can also be propagated in suspension cell culture. At7 is one such cell line that was established about 25 years ago. Here we report the sequencing and the analysis of the At7 genome. Large scale duplications and deletions compared to the Col-0 reference sequence were detected. The number of deletions exceeds the number of insertions, thus indicating that a haploid genome size reduction is ongoing. Patterns of small sequence variants differ from the ones observed between A. thaliana accessions, e.g., the number of single nucleotide variants matches the number of insertions/deletions. RNA-Seq analysis reveals that disrupted alleles are less frequent in the transcriptome than the native ones.


