Comparative analysis of somatic variant calling on matched FF and FFPE WGS samples

2020 ◽  
Author(s):  
Louise de Schaetzen van Brienen ◽  
Maarten Larmuseau ◽  
Kim Van der Eecken ◽  
Frederic De Ryck ◽  
Pauline Robbe ◽  
...  

Abstract Background. Research grade Fresh Frozen (FF) DNA material is not yet routinely collected in clinical practice. Many hospitals, however, collect and store Formalin Fixed Paraffin Embedded (FFPE) tumor samples. Consequently, the sample size of whole genome cancer cohort studies could be increased tremendously by including FFPE samples, although the presence of artefacts might obfuscate the variant calling. To assess whether FFPE material can be used for cohort studies, we performed an in-depth comparison of somatic SNVs called on matching FF and FFPE Whole Genome Sequence (WGS) samples extracted from the same tumor. Results. We first compared the calls between an FF and an FFPE sample from a metastatic prostate tumor, showing that on average 50% of the calls in the FF sample are recovered in the FFPE sample, with notable differences between variant callers. Combining the variants of the different callers using a simple heuristic increases both the precision and the sensitivity of the variant calling. Validating the heuristic on nine additional matched FF-FFPE samples resulted in an average F1-score of 0.58, outperforming each of the individual callers. In addition, we could show that part of the discrepancy between the FF and the FFPE samples can be attributed to intra-tumor heterogeneity (ITH). Conclusion. This study illustrates that when using the correct variant calling strategy, the majority of clonal SNVs can be recovered in an FFPE sample with high precision and sensitivity. These results suggest that somatic variants derived from WGS of FFPE material can be used in cohort studies.



2019 ◽  
Author(s):  
Louise de Schaetzen van Brienen ◽  
Maarten Larmuseau ◽  
Kim Van der Eecken ◽  
Jan Fostier ◽  
Piet Ost ◽  
...  

Abstract Background. Research grade Fresh Frozen (FF) DNA material is not yet routinely collected in clinical practice. Many hospitals, however, do collect and store Formalin Fixed Paraffin Embedded (FFPE) tumor samples. Consequently, the sample size of whole genome cancer cohort studies could be increased tremendously by including FFPE samples, although the presence of artifacts might obfuscate the variant calling. To assess whether FFPE material can be used for cohort studies, we performed an in-depth comparison of somatic SNVs called on matching FF and FFPE Whole Genome Sequence (WGS) samples extracted from the same metastatic prostate tumor. Results. We first compared the calls between FF and FFPE, showing that on average 50% of the calls in FF are recovered in FFPE, with notable differences between variant callers. Remarkably, this overlap was better than the overlap between different variant callers on the same sample. Inspecting the Variant Allele Frequency (VAF), we observed that many of the calls common to FF and FFPE belonged to the same clonal subpopulation but were detected at a lower VAF in FFPE. We also demonstrated that these calls receive higher significance scores and are often identified by more than one variant caller. Based on this observation, we propose a simple heuristic to perform reliable variant calling in FFPE samples. Our heuristic identified 3684 common calls at an F1-score of 0.83. Conclusion. This study illustrates that when using the correct variant calling strategy, the overlap between the FF and FFPE sample in somatic SNVs increases to such an extent that a large fraction of the calls detected in the FFPE sample are contained in the FF sample and the number of variants unique to each sample remains restricted. These results suggest that somatic variants derived from WGS of FFPE material can be used in cohort studies.
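
The abstract does not spell out the heuristic itself, so the following is only a minimal sketch of the general idea it motivates: keep FFPE calls reported by more than one caller, then score them against the FF calls with precision, recall, and F1. The function names, the `min_callers` threshold, the caller names, and the tuple representation of a variant are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to a set of variants, where each
    variant is a hashable key such as (chrom, pos, ref, alt).
    The threshold of 2 is an assumption, not a value from the abstract.
    """
    counts = Counter(v for calls in calls_by_caller.values() for v in set(calls))
    return {v for v, n in counts.items() if n >= min_callers}


def precision_recall_f1(predicted, reference):
    """Score a predicted call set against a reference call set."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical FFPE calls from three hypothetical callers.
ffpe_calls = {
    "caller_a": {("chr1", 100, "C", "T"), ("chr1", 200, "G", "A")},
    "caller_b": {("chr1", 100, "C", "T")},
    "caller_c": {("chr1", 200, "G", "A"), ("chr2", 5, "A", "G")},
}
# Hypothetical FF calls, treated here as the reference set.
ff_calls = {("chr1", 100, "C", "T"), ("chr3", 7, "T", "C")}

consensus = consensus_calls(ffpe_calls)
precision, recall, f1 = precision_recall_f1(consensus, ff_calls)
```

Treating the FF calls as the reference is itself a simplification, since intra-tumor heterogeneity means some FFPE-only calls may be real.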


2017 ◽  
Author(s):  
Roberto Lozano ◽  
Dunia Pino del Carpio ◽  
Teddy Amuge ◽  
Ismail Siraj Kayondo ◽  
Alfred Ozimati Adebo ◽  
...  

Abstract Background. Genomic prediction models were, in principle, developed to include all the available marker information; with this approach, these models have shown moderate to high predictive accuracies in various crops. Previous studies in cassava have demonstrated that, even with relatively small training populations and low-density GBS markers, prediction models are feasible for genomic selection. In the present study, we prioritized SNPs in close proximity to genome regions with biological importance for a given trait. We used a number of strategies to select variants that were then included in single and multiple kernel GBLUP models. Specifically, our sources of information were transcriptomics, GWAS, and immunity-related genes, with the ultimate goal to increase predictive accuracies for Cassava Brown Streak Disease (CBSD) severity. Results. We used single and multi-kernel GBLUP models with markers imputed to whole genome sequence level to accommodate various sources of biological information; fitting more than one kinship matrix allowed for differential weighting of the individual marker relationships. We applied these GBLUP approaches to CBSD phenotypes (i.e., root infection and leaf severity three and six months after planting) in a Ugandan Breeding Population (n = 955). Three means of exploiting an established RNAseq experiment of CBSD-infected cassava plants were used. Compared to the biology-agnostic GBLUP model, the informed multi-kernel models increased the prediction accuracy only marginally (1.78% to 2.52%). Conclusions. Our results show that markers imputed to whole genome sequence level do not provide enhanced prediction accuracies compared to using standard GBS marker data in cassava. The use of transcriptomics data and other sources of biological information resulted in prediction accuracies that were nominally superior to those obtained from traditional prediction models.


2019 ◽  
Author(s):  
Aditya Vijay Bhagwate ◽  
Yuanhang Liu ◽  
Stacey J. Winham ◽  
Samantha J. McDonough ◽  
Melody L. Stallings-Mann ◽  
...  

Abstract Background Archived formalin fixed paraffin embedded (FFPE) samples are valuable clinical resources to examine clinically relevant morphology features and also to study genetic changes. However, DNA quality and quantity of FFPE samples are often sub-optimal, and resulting NGS-based genetics variant detections are prone to false positives. Evaluations of wet-lab and bioinformatics approaches are needed to optimize variant detection from FFPE samples. Results As a pilot study, we designed within-subject triplicate samples of DNA derived from paired FFPE and fresh frozen breast tissues to highlight FFPE-specific artifacts. For FFPE samples, we tested two FFPE DNA extraction methods to determine impact of wet-lab procedures on variant calling: QIAGEN QIAamp DNA Mini Kit ("QA"), and QIAGEN GeneRead DNA FFPE Kit ("QGR"). We also used negative-control (NA12891) and positive control samples (Horizon Discovery Reference Standard FFPE). All DNA sample libraries were prepared for NGS according to the QIAseq Human Breast Cancer Targeted DNA Panel protocol and sequenced on the HiSeq 4000. Variant calling and filtering were performed using QIAGEN Gene Globe Data Portal. Detailed variant concordance comparisons and mutational signature analysis were performed to investigate effects of FFPE samples compared to paired fresh frozen samples, along with different library preparations. In this study, we found that five times or more variants were called with FFPE samples, compared to their paired fresh-frozen tissue samples even after applying molecular barcoding error-correction and default bioinformatics filtering recommended by the vendor. We also found that QGR as an optimized FFPE-DNA extraction approach leads to much fewer discordant variants between paired fresh frozen and FFPE samples. 
Approximately 92% of the uniquely called FFPE variants were of low allelic frequency range (<5%), and collectively shared a "C>T|G>A" mutational signature known to be representative of FFPE artifacts resulting from cytosine deamination. Based on control samples and FFPE-frozen replicates, we derived an effective filtering strategy with associated empirical false-discovery estimates. Conclusions Through this study, we demonstrated feasibility of calling and filtering genetic variants from FFPE tissue samples using a combined strategy with molecular barcodes, optimized DNA extraction, and bioinformatics methods incorporating genomics context such as mutational signature and variant allelic frequency.
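
The abstract does not give the filtering rules, so the sketch below only illustrates the two features it names: a C>T or G>A substitution combined with a low allelic frequency. The function name, the dictionary layout of a variant, and the use of 5% as a hard cutoff are assumptions for illustration; the study's actual filter and its false-discovery estimates are not reproduced here.

```python
# Substitution classes the abstract associates with cytosine deamination.
DEAMINATION_CHANGES = {("C", "T"), ("G", "A")}


def is_likely_ffpe_artifact(ref, alt, vaf, vaf_cutoff=0.05):
    """Flag a SNV that is C>T or G>A and below `vaf_cutoff`.

    `vaf` is the variant allele frequency as a fraction (0.0 to 1.0).
    The 0.05 default mirrors the "<5%" range in the abstract but is an
    assumed cutoff, not the published filter.
    """
    return (ref.upper(), alt.upper()) in DEAMINATION_CHANGES and vaf < vaf_cutoff


# Hypothetical variants for illustration only.
variants = [
    {"ref": "C", "alt": "T", "vaf": 0.02},  # deamination-like, low VAF
    {"ref": "C", "alt": "T", "vaf": 0.35},  # deamination-like, high VAF
    {"ref": "A", "alt": "G", "vaf": 0.01},  # low VAF, other substitution
]
kept = [v for v in variants
        if not is_likely_ffpe_artifact(v["ref"], v["alt"], v["vaf"])]
```

A flag like this would normally be one input among several, since real low-frequency C>T variants exist and would be discarded by a hard cutoff.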


2021 ◽  
Vol 12 ◽  
Author(s):  
Hongwei Li ◽  
Bo Zhu ◽  
Ling Xu ◽  
Zezhao Wang ◽  
Lei Xu ◽  
...  

A haplotype is defined as a combination of alleles at adjacent loci belonging to the same chromosome that can be transmitted as a unit. In this study, we used both the Illumina BovineHD chip (HD chip) and imputed whole-genome sequence (WGS) data to explore haploblocks and assess haplotype effects, and the haploblocks were defined based on the different LD thresholds. The accuracies of genomic prediction (GP) for dressing percentage (DP), meat percentage (MP), and rib eye roll weight (RERW) based on haplotype were investigated and compared for both data sets in Chinese Simmental beef cattle. The accuracies of GP using the entire imputed WGS data were lower than those using the HD chip data in all cases. For DP and MP, the accuracy of GP using haploblock approaches outperformed the individual single nucleotide polymorphism (SNP) approach (GBLUP_In_Block) at specific LD levels. Hotelling’s test confirmed that GP using LD-based haplotypes from WGS data can significantly increase the accuracies of GP for RERW, compared with the individual SNP approach (∼1.4 and 1.9% for GHBLUP and GHBLUP+GBLUP, respectively). We found that the accuracies using haploblock approach varied with different LD thresholds. The LD thresholds (r2 ≥ 0.5) were optimal for most scenarios. Our results suggested that LD-based haploblock approach can improve accuracy of genomic prediction for carcass traits using both HD chip and imputed WGS data under the optimal LD thresholds in Chinese Simmental beef cattle.


2019 ◽  
Vol 37 (15_suppl) ◽  
pp. e13016-e13016
Author(s):  
Shannon Terrell Bailey ◽  
Belynda Hicks ◽  
Bin Zhu ◽  
Nan Hu ◽  
Phil R. Taylor ◽  
...  

e13016 Background: Whole-genome sequencing (WGS) of formalin-fixed, paraffin-embedded (FFPE) samples could enable novel insights from archival sample collections, yet robust FFPE WGS is challenged by fragmented DNA, uneven genomic coverage & sequencing artifacts attributed to FFPE fixation. We report our proprietary extraction & library preparation methodology (SeqPlus) with high quality, uniform WGS sequencing performance comparable to that from fresh-frozen samples. Methods: We analyzed 20 paired esophageal carcinoma (EC) samples, i.e., primary tumors & matched germline samples, to assess SeqPlus performance on 10-15-year-old FFPE tissues, measure variant concordance between WGS and a high-depth sequencing panel (269 genes, 400x coverage) & identify novel genomic features. Results: At a targeted 70x WGS tumor sequencing depth, 93% of the genome was covered by ≥ 20 reads, 99% of bases had 10x coverage & average duplicate reads were 31%. We noted similar transition/transversion ratios & mutational spectra as from fresh-frozen EC specimens, suggesting that extraction & library preparation contributes to prior FFPE artifacts. Concordance of tumor-specific SNVs & indels derived from WGS & targeted panel was high at 86%. All 76 targeted panel-detected variants above the WGS limit of detection (mutant allele frequency [MAF] > 10%) were detected by WGS, 2 variants (2 tumors) were detected only by WGS, and 12 variants at MAF ≤ 6% (9 tumors) were only detected by the targeted panel. Tumor WGS yielded SNV, indels & CNV findings beyond variants detected by targeted sequencing. WGS enabled detection of 10.4 putative cancer variants per tumor compared to 12 variants per patient from frozen specimens and a median of 7 (up to 16) cancer-associated variants in genes outside the targeted panel. WGS copy number analysis revealed CCND1, EGFR, TP63, and SOX2 amplification, CDKN2A/B deletion and additional unrecognized genomic aberrations. 
Conclusions: Our study reinforces the utility of high-quality, uniform WGS sequencing of archival FFPE cancer samples with SeqPlus and unlocks the potential for massive-scale retrospective genomic analysis of archived pathology samples with associated clinical & outcomes data.


Author(s):  
Shatha Alosaimi ◽  
Noëlle van Biljon ◽  
Denis Awany ◽  
Prisca K Thami ◽  
Joel Defo ◽  
...  

Abstract Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as African populations. When applied to these genetically diverse populations, VC tools may produce false positive and false negative results, which may lead to misleading conclusions in the prioritization of mutations and in the clinical relevance and actionability of genes. The most prominent question is which tool or pipeline has high sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of these data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profile of African and European subjects at different coverage levels (high/low), were generated to assess the performance of nine different VC tools on these contrasting datasets. The performance of these tools was assessed in terms of false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best when using African population data on high and low coverage data. Overall, current VC tools produce high false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variation and low linkage disequilibrium.
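
For readers unfamiliar with the two metrics quoted above, here is a short sketch of the standard definitions of PPV and MCC computed from confusion-matrix counts. The function names and the example counts are illustrative assumptions; how the study derived its counts (in particular the true negatives) is not stated in the abstract.

```python
import math


def ppv(tp, fp):
    """Positive predictive value (precision): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one caller on one simulated sample.
example_ppv = ppv(tp=9, fp=1)
example_mcc = mcc(tp=50, tn=50, fp=0, fn=0)
```

MCC uses all four cells of the confusion matrix, which is why a caller can show a PPV near 1 and still have a noticeably lower MCC when it misses many true variants.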


2019 ◽  
Author(s):  
Roger Ros-Freixedes ◽  
Andrew Whalen ◽  
Ching-Yi Chen ◽  
Gregor Gorjanc ◽  
William O Herring ◽  
...  

Abstract Background. We demonstrate high accuracy of whole-genome sequence imputation in large livestock populations where only a small fraction of individuals (2%) had been sequenced, mostly at low coverage. Methods. We used data from four pig populations of different sizes (18,349 to 107,815 individuals) that were broadly genotyped at densities between 15,000 and 75,000 markers genome-wide. Around 2% of the individuals in each population were sequenced (most at 1x or 2x and a small fraction at 30x; average coverage per individual: 4x). We imputed whole-genome sequence with hybrid peeling. We evaluated the imputation accuracy by removing the sequence data of a total of 284 individuals that had been sequenced at high coverage, using a leave-one-out design. We complemented these results with simulated data that mimicked the sequencing strategy used in the real populations to quantify the factors that affected the individual-wise and variant-wise imputation accuracies using regression trees. Results. Imputation accuracy was high for the majority of individuals in all four populations (median individual-wise correlation was 0.97). Individuals in the earliest generations of each population had lower accuracy than the rest, likely due to the lack of marker array data for themselves and their ancestors. The main factors that determined the individual-wise imputation accuracy were the genotyping status of the individual, the availability of marker array data for immediate ancestors, and the degree of connectedness of an individual to the rest of the population, but sequencing coverage had no effect. The main factors that determined variant-wise imputation accuracy were the minor allele frequency and the number of individuals with sequencing coverage at each variant site. These results were validated with the empirical observations. Conclusions. The coupling of an appropriate sequencing strategy and imputation method, such as described and validated here, is a powerful strategy for generating whole-genome sequence data in large pedigreed populations with high accuracy. This is a critical step for the successful implementation of whole-genome sequence data for genomic predictions and fine-mapping of causal variants.
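
As a rough illustration of what an individual-wise correlation measures, the sketch below computes a plain Pearson correlation between one individual's true genotypes and imputed dosages across variants, then takes the median over individuals. This is an assumed, simplified formulation: the abstract does not say whether the study corrected genotypes for allele frequency or used another variant of the statistic, and all names and data here are hypothetical.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN when either sequence has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: genotypes coded 0/1/2 and imputed dosages in [0, 2],
# one list per individual, one value per variant.
true_genotypes = {"ind1": [0, 1, 2, 1], "ind2": [2, 2, 0, 1]}
imputed_dosages = {"ind1": [0.1, 0.9, 1.8, 1.2], "ind2": [1.9, 2.0, 0.2, 0.7]}

per_individual = {
    ind: pearson(true_genotypes[ind], imputed_dosages[ind])
    for ind in true_genotypes
}
median_accuracy = statistics.median(per_individual.values())
```

Computing the same correlation per variant across individuals, rather than per individual across variants, would give the variant-wise accuracy the abstract also mentions.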


2020 ◽  
Author(s):  
Anuraj Nayarisseri ◽  
Sanjeev Kumar Singh

Abstract We announce the complete genome sequence of Bacillus tequilensis, a biosurfactant-producing bacterium isolated from Chilika lake, Odisha, India (latitude and longitude: 19.8450 N 85.4788 E). The genome sequence is 4.47 Mb, consisting of 4,478,749 base pairs forming a circular chromosome with 528 scaffolds, 4492 protein-encoding genes (ORFs), 81 tRNA genes, and 114 ribosomal RNA transcription units. The total number of raw reads was 4209415 and the number of processed reads was 4058238, with 4492 predicted genes. The whole genome obtained from the present investigation was used for genome annotation, variant calling, variant annotation and comparative genome analysis with other existing Bacillus species. In this study we constructed a pathway which describes the biosurfactant metabolism of Bacillus tequilensis and identified genes such as SrfAD, SrfAC, and SrfAA, which are involved in biosurfactant synthesis. Their sequences were deposited in the GenBank database under accessions MUG02427.1, MUG02428.1, MUG02429.1, and MUG03515.1. The whole-genome sequence was submitted to GenBank under accession RMVO00000000, and the raw reads can be obtained from the NCBI SRA repository using accession SRX5023292.


2016 ◽  
Vol 48 (12) ◽  
pp. 922-927 ◽  
Author(s):  
Kari Branham ◽  
Hiroko Matsui ◽  
Pooja Biswas ◽  
Aditya A. Guru ◽  
Michael Hicks ◽  
...  

While more than 250 genes are known to cause inherited retinal degenerations (IRD), the genetic basis of disease remains unknown in nearly 40–50% of families. In this study we sought to identify the underlying cause of IRD in a family by whole genome sequence (WGS) analysis. Clinical characterization including standard ophthalmic examination, fundus photography, visual field testing, electroretinography, and review of medical and family history was performed. WGS was performed on affected and unaffected family members using Illumina HiSeq X10. Sequence reads were aligned to hg19 using BWA-MEM and variant calling was performed with Genome Analysis Toolkit. The called variants were annotated with SnpEff v4.11, PolyPhen v2.2.2, and CADD v1.3. Copy number variations were called using Genome STRiP (svtoolkit 2.00.1611) and SpeedSeq software. Variants were filtered to detect rare potentially deleterious variants segregating with disease. Candidate variants were validated by dideoxy sequencing. Clinical evaluation revealed typical adolescent-onset recessive retinitis pigmentosa (arRP) in affected members. WGS identified about 4 million variants in each individual. Two rare and potentially deleterious compound heterozygous variants, p.Arg281Cys and p.Arg487*, were identified in the gene ATP/GTP binding protein like 5 (AGBL5) as likely causal variants. No additional variants in IRD genes that segregated with disease were identified. Mutation analysis confirmed the segregation of these variants with the IRD in the pedigree. Homology models indicated destabilization of AGBL5 due to the p.Arg281Cys change. Our findings establish the involvement of mutations in AGBL5 in RP and validate the WGS variant filtering pipeline we designed.

