Measurements of Intrahost Viral Diversity Are Extremely Sensitive to Systematic Errors in Variant Calling

2016 ◽  
Vol 90 (15) ◽  
pp. 6884-6895 ◽  
Author(s):  
John T. McCrone ◽  
Adam S. Lauring

ABSTRACT With next-generation sequencing technologies, it is now feasible to efficiently sequence patient-derived virus populations at a depth of coverage sufficient to detect rare variants. However, each sequencing platform has characteristic error profiles, and sample collection, target amplification, and library preparation are additional processes whereby errors are introduced and propagated. Many studies account for these errors by using ad hoc quality thresholds and/or previously published statistical algorithms. Despite common usage, the majority of these approaches have not been validated under conditions that characterize many studies of intrahost diversity. Here, we use defined populations of influenza virus to mimic the diversity and titer typically found in patient-derived samples. We identified single-nucleotide variants using two commonly employed variant callers, DeepSNV and LoFreq. We found that the accuracy of these variant callers was lower than expected and exquisitely sensitive to the input titer. Small reductions in specificity had a significant impact on the number of minority variants identified and subsequent measures of diversity. We were able to increase the specificity of DeepSNV to >99.95% by applying an empirically validated set of quality thresholds. When applied to a set of influenza virus samples from a household-based cohort study, these changes resulted in a 10-fold reduction in measurements of viral diversity. We have made our sequence data and analysis code available so that others may improve on our work and use our data set to benchmark their own bioinformatics pipelines. Our work demonstrates that inadequate quality control and validation can lead to significant overestimation of intrahost diversity. IMPORTANCE Advances in sequencing technology have made it feasible to sequence patient-derived viral samples at a level sufficient for detection of rare mutations.
These high-throughput, cost-effective methods are revolutionizing the study of within-host viral diversity. However, the techniques are error prone, and the methods commonly used to control for these errors have not been validated under the conditions that characterize patient-derived samples. Here, we show that these conditions affect measurements of viral diversity. We found that the accuracy of previously benchmarked analysis pipelines was greatly reduced under patient-derived conditions. By carefully validating our sequencing analysis using known control samples, we were able to identify biases in our method and to improve our accuracy to acceptable levels. Application of our modified pipeline to a set of influenza virus samples from a cohort study provided a realistic picture of intrahost diversity and suggested the need for rigorous quality control in such studies.
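
The abstract's point that small reductions in specificity inflate the number of minority variants can be illustrated with a rough back-of-the-envelope sketch. The genome length of about 13,500 nucleotides, the assumption that each site can yield up to three false alternate-base calls, and the function name are illustrative assumptions, not values or methods taken from the paper.

```python
def expected_false_positives(n_sites: int, specificity: float,
                             alt_alleles_per_site: int = 3) -> float:
    """Expected false-positive calls if every site/alternate-base pair
    is a true negative (i.e. no real variants are present)."""
    n_true_negatives = n_sites * alt_alleles_per_site
    return n_true_negatives * (1.0 - specificity)


# Assumed approximate influenza A genome length (illustrative only).
GENOME_LENGTH = 13_500

for spec in (0.9995, 0.999, 0.99):
    fp = expected_false_positives(GENOME_LENGTH, spec)
    print(f"specificity {spec:.4f}: about {fp:.0f} expected false positives per sample")
```

Under these assumptions, even a specificity above 99.9% leaves a number of spurious calls per sample that can rival or exceed the handful of true minority variants expected in an acute infection, which is why diversity estimates respond so strongly to the threshold.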

mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Gabriel A. Al-Ghalith ◽  
Benjamin Hillmann ◽  
Kaiwei Ang ◽  
Robin Shields-Cutler ◽  
Dan Knights

ABSTRACT Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads together into longer or higher-quality contigs. Many tools for each step exist, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the relevant information about how their data were generated, such as the expected overlap for paired-end sequences or type of adaptors used to make informed choices. This increasing complexity and nuance demand a pipeline that combines existing steps together in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced “shizen”), which aims to simplify quality control of short-read data for the end user by predicting presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control. IMPORTANCE Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, paired-endedness, etc.). 
Quality control protocols typically require applying this background knowledge to selecting and executing numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Alexia L. Weeks ◽  
Richard W. Francis ◽  
Joao I. C. F. Neri ◽  
Nathaly M. C. Costa ◽  
Nivea M. R. Arrais ◽  
...  

Abstract Exome sequencing is widely used in the diagnosis of rare genetic diseases and provides useful variant data for analysis of complex diseases. There is not always adequate population-specific reference data to assist in assigning a diagnostic variant to a specific clinical condition. Here we provide a catalogue of variants called after sequencing the exomes of 45 babies from Rio Grande do Norte in Brazil. Sequence data were processed using an ‘intersect-then-combine’ (ITC) approach, using GATK and SAMtools to call variants. A total of 612,761 variants were identified in at least one individual in this Brazilian cohort, including 559,448 single nucleotide variants (SNVs) and 53,313 insertion/deletions. Of these, 58,111 overlapped with nonsynonymous (nsSNVs) or splice site (ssSNVs) SNVs in dbNSFP. As an aid to clinical diagnosis of rare diseases, we used the American College of Medical Genetics and Genomics (ACMG) guidelines to assign pathogenic/likely pathogenic status to 185 (0.32%) of the 58,111 nsSNVs and ssSNVs. Our data set provides a useful reference point for diagnosis of rare diseases in Brazil.
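
The abstract does not spell out the steps of the 'intersect-then-combine' approach, so the following is only one plausible reading, sketched with plain Python sets: keep, per individual, the variants that both callers agree on, then take the union across individuals. The variant tuple layout, the function name, and the toy call sets are hypothetical and do not come from GATK or SAMtools output.

```python
from typing import Dict, Set, Tuple

# A variant is keyed here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]
CallSet = Dict[str, Set[Variant]]  # sample ID -> variants called in that sample


def intersect_then_combine(caller_a: CallSet, caller_b: CallSet) -> Set[Variant]:
    """Intersect the two callers' results within each sample, then
    combine (union) the agreed variants across all shared samples."""
    combined: Set[Variant] = set()
    for sample in caller_a.keys() & caller_b.keys():
        combined |= caller_a[sample] & caller_b[sample]
    return combined


# Toy call sets for two samples (made-up data for illustration).
calls_a: CallSet = {
    "s1": {("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")},
    "s2": {("chr3", 300, "G", "A")},
}
calls_b: CallSet = {
    "s1": {("chr1", 100, "A", "G")},
    "s2": {("chr3", 300, "G", "A"), ("chr1", 100, "A", "G")},
}

catalogue = intersect_then_combine(calls_a, calls_b)
print(sorted(catalogue))
```

In this toy case the variant on chr2 is dropped because only one caller reports it in sample s1, while the two concordant variants end up in the combined catalogue.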


Blood ◽  
2015 ◽  
Vol 126 (23) ◽  
pp. 4207-4207
Author(s):  
Brian S White ◽  
Irena Lanc ◽  
Daniel Auclair ◽  
Robert Fulton ◽  
Mark A Fiala ◽  
...  

Abstract Background: Multiple myeloma (MM) is a hematologic cancer characterized by a diversity of genetic lesions: translocations, copy number alterations (CNAs), and single nucleotide variants (SNVs). The prognostic value of translocations and of CNAs has been well established. Determining the clinical significance of SNVs, which are recurrently mutated at much lower frequencies, and how this significance is impacted by translocations and CNAs requires additional, large-scale correlative studies. Such studies can be facilitated by cost-effective targeted sequencing approaches. Hence, we designed a single-platform targeted sequencing approach capable of detecting all three variant types. Methods: We designed oligonucleotide probes complementary to the coding regions of 467 genes and to the IgH and MYC loci, allowing a probe to closely match at most 5 regions within the genome. Genes were selected if they were expressed in an independent RNA-seq MM data set and harbored germline SNP-filtered variants that: (1) occurred with frequency >3%, (2) were clustered in hotspots, (3) occurred in recurrently mutated "cancer genes" (as annotated in COSMIC or MutSig), or (4) occurred in genes involved in DNA repair and/or B-cell biology. IgH and MYC tiling was unbiased (with respect to annotated features within the loci) and spanned from 50 kilobasepairs (kbps) upstream of both regions to 50 kbps downstream of IgH and 100 kbps downstream of MYC. Results: We performed targeted sequencing of 96 CD138-enriched samples derived from MM patients, as well as matched peripheral blood leukocyte normal controls. Sequencing depth (mean 107X) was commensurate with that of available exome sequencing data from these samples (mean 71X). Samples harbored a mean of 25 non-silent variants, including those in known MM-associated genes: NRAS (24%), KRAS (22%), FAM46C (17%), TP53 (10%), DIS3 (8%), and BRAF (3%). Variants detected by both platforms showed a strong correlation (r^2 = 0.8).
The capture array detected activating, oncogenic variants in NRAS Q61K (n=3 patients) and KRAS G12C/D/R/V (n=5) that were not detected in exome data. Additionally, we found non-silent, capture-specific variants in MTOR (3%) and in two transcription-related genes that have been previously implicated in cancer: ZFHX4 (5%) and CHD3 (5%). To assess the potential role of deep subclonal variants and our ability to detect them, we performed additional sequencing (mean 565X) on six of the tumor/normal pairs. This revealed 14 manually-reviewed, non-silent variants that were not detected by the initial targeted sequencing. These had a mean variant allele frequency of 2.8% and included mutations in DNMT3A and FAM46C. At least one of these 14 variants occurred in five of the six re-sequenced samples. This highlights the importance of this additional depth, which will be used in future studies. Our approach successfully detected CNAs near expected frequencies, including hyperdiploidy (52%), del(13) (43%), and gain of 1q (35%). Similarly, it inferred IgH translocations at expected frequencies: t(4;14) (14%), t(6;14) (3%), t(11;14) (15%), and t(14;20) (1%). As expected, translocations occur predominantly within the IgH constant region, but also frequently 5' (i.e., telomeric) of the IGHM switch region, and occasionally within the V and D regions. We detected MYC-associated translocations, whose frequencies have been the subject of debate, at 10% (n=9 patients), with five involving IgH, three having both partners in or near MYC, and one having both types. Finally, our platform detected novel IgH translocations with partners near DERL3 (n=2), MYCN (n=1), and FLT3 (n=1). Additional evidence suggests that DERL3 and MYCN may be targets of IgH-induced overexpression: of 84 RNA-seq patient samples, six exhibited outlying expression of DERL3, including one sample in which we detected the translocation in corresponding DNA, and one exhibited outlying expression of MYCN.
Conclusion: Our MM-specific targeted sequencing strategy is capable of detecting deeply subclonal SNVs, in addition to CNAs and IgH and MYC translocations. Though additional validation is required, particularly with respect to translocation detection, we anticipate that such technology will soon enable clinical testing on a single sequencing platform. Disclosures Vij: Celgene, Onyx, Takeda, Novartis, BMS, Sanofi, Janssen, Merck: Consultancy; Takeda, Onyx: Research Funding.
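
Why the deeper re-sequencing (mean 565X versus 107X) recovers variants at about 2.8% allele frequency can be sketched with a simple binomial model. This is a minimal illustration that ignores sequencing error and coverage variation; the requirement of at least 5 supporting reads and the function name are hypothetical choices, not thresholds reported in the abstract.

```python
from math import comb


def prob_detect(depth: int, vaf: float, min_alt_reads: int) -> float:
    """Probability of observing at least `min_alt_reads` variant-supporting
    reads at a site, modelling the alt-read count as Binomial(depth, vaf)."""
    p_below = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


VAF = 0.028          # mean variant allele frequency quoted in the abstract
MIN_ALT_READS = 5    # hypothetical calling threshold

for depth in (107, 565):
    p = prob_detect(depth, VAF, MIN_ALT_READS)
    print(f"depth {depth}X: P(at least {MIN_ALT_READS} alt reads) = {p:.3f}")
```

Under these assumptions the expected number of supporting reads is about 3 at 107X and about 16 at 565X, so a fixed read-count threshold is rarely met at the shallower depth and almost always met at the deeper one.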


eLife ◽  
2018 ◽  
Vol 7 ◽  
Author(s):  
John T McCrone ◽  
Robert J Woods ◽  
Emily T Martin ◽  
Ryan E Malosh ◽  
Arnold S Monto ◽  
...  

The evolutionary dynamics of influenza virus ultimately derive from processes that take place within and between infected individuals. Here we define influenza virus dynamics in human hosts through sequencing of 249 specimens from 200 individuals collected over 6290 person-seasons of observation. Because these viruses were collected from individuals in a prospective community-based cohort, they are broadly representative of natural infections with seasonal viruses. Consistent with a neutral model of evolution, sequence data from 49 serially sampled individuals illustrated the dynamic turnover of synonymous and nonsynonymous single nucleotide variants and provided little evidence for positive selection of antigenic variants. We also identified 43 genetically-validated transmission pairs in this cohort. Maximum likelihood optimization of multiple transmission models estimated an effective transmission bottleneck of 1–2 genomes. Our data suggest that positive selection is inefficient at the level of the individual host and that stochastic processes dominate the host-level evolution of influenza viruses.


2021 ◽  
Vol 53 (1) ◽  
Author(s):  
Maulana M. Naji ◽  
Yuri T. Utsunomiya ◽  
Johann Sölkner ◽  
Benjamin D. Rosen ◽  
Gábor Mészáros

Abstract Background Reference genomes are essential in the analysis of genomic data. As the cost of sequencing decreases, multiple reference genomes are being produced within species to alleviate problems such as low mapping accuracy and reference allele bias in variant calling that can be associated with the alignment of divergent samples to a single reference individual. The latest reference sequence adopted by the scientific community for the analysis of cattle data is ARS_UCD1.2, built from the DNA of a Hereford cow (Bos taurus taurus—B. taurus). A complementary genome assembly, UOA_Brahman_1, was recently built to represent the other cattle subspecies (Bos taurus indicus—B. indicus) from a Brahman cow haplotype to further support analysis of B. indicus data. In this study, we aligned the sequence data of 15 B. taurus and B. indicus breeds to each of these references. Results The alignment of B. taurus individuals against UOA_Brahman_1 detected up to five million more single-nucleotide variants (SNVs) compared to that against ARS_UCD1.2. Similarly, the alignment of B. indicus individuals against ARS_UCD1.2 resulted in one and a half million more SNVs than that against UOA_Brahman_1. The number of SNVs with nearly fixed alternative alleles also increased in the alignments with cross-subspecies. Interestingly, the alignment of B. taurus cattle against UOA_Brahman_1 revealed regions with a smaller than expected number of counts of SNVs with nearly fixed alternative alleles. Since B. taurus introgression represents on average 10% of the genome of Brahman cattle, we suggest that these regions comprise taurine DNA as opposed to indicine DNA in the UOA_Brahman_1 reference genome. Principal component and admixture analyses using genotypes inferred from this region support these taurine-introgressed loci. Overall, the flagged taurine segments represent 13.7% of the UOA_Brahman_1 assembly. 
The genes located within these segments were previously reported to be under positive selection in Brahman cattle, and include functional candidate genes implicated in feed efficiency, development and immunity. Conclusions We report a list of taurine segments that are in the UOA_Brahman_1 assembly, which will be useful for the interpretation of interesting genomic features (e.g., signatures of selection, runs of homozygosity, increased mutation rate, etc.) that could appear in future re-sequencing analysis of indicine cattle.


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
César Augusto Diniz Xavier ◽  
Margaret Louise Allen ◽  
Anna Elizabeth Whitfield

Abstract Background Advances in sequencing and analysis tools have facilitated discovery of many new viruses from invertebrates, including ants. Solenopsis invicta is an invasive ant that has quickly spread worldwide, causing significant ecological and economic impacts. Its virome has begun to be characterized with a view to the potential use of viruses as natural enemies. Although the S. invicta virome is the best characterized among ants, most studies have been performed in its native range, with less information from invaded areas. Methods Using a metatranscriptome approach, we further identified and molecularly characterized virus sequences associated with S. invicta in two introduced areas, the U.S. and Taiwan. The data set used here was obtained from different stages (larvae, pupae, and adults) of the S. invicta life cycle. Publicly available RNA sequences from GenBank’s Sequence Read Archive were downloaded and de novo assembled using CLC Genomics Workbench 20.0.1. Contigs were compared against the non-redundant protein sequences and those showing similarity to viral sequences were further analyzed. Results We characterized five putative new viruses associated with S. invicta transcriptomes. Sequence comparisons revealed extensive divergence across ORFs and genomic regions, with most of them sharing less than 40% amino acid identity with their closest previously characterized homologous sequences. The first negative-sense single-stranded RNA virus genomic sequences included in the orders Bunyavirales and Mononegavirales are reported. In addition, two positive-sense single-stranded virus genome sequences and one single-stranded DNA virus genome sequence were also identified. While the presence of a putative tenuivirus associated with S. invicta was previously suggested to be a contamination, here we characterize it and present strong evidence that Solenopsis invicta virus 14 (SINV-14) is a tenui-like virus that has a long-term association with the ant.
Furthermore, based on virus sequence abundance compared to housekeeping genes, phylogenetic relationships, and completeness of viral coding sequences, our results suggest that four of the five virus sequences reported (SINV-14, SINV-15, SINV-16 and SINV-17) may be associated with viruses actively replicating in the ant S. invicta. Conclusions The present study expands our knowledge of the viral diversity associated with S. invicta in introduced areas, including viruses with potential for use as biological control agents, which will require further biological characterization.


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Xunliang Tong ◽  
Xiaomao Xu ◽  
Guoyue Lv ◽  
He Wang ◽  
Anqi Cheng ◽  
...  

Abstract Background Coronavirus disease 2019 (COVID-19) is an emerging infectious disease that has spread rapidly worldwide, and co-infection with COVID-19 and influenza may occur in some cases. We aimed to describe the clinical features and outcomes of severe COVID-19 patients with influenza virus co-infection. Methods A retrospective cohort study was performed, and a total of 140 patients with severe COVID-19 were enrolled in designated wards of the Sino-French New City Branch of Tongji Hospital between Feb 8th and March 15th in Wuhan city, Hubei province, China. The demographics, clinical features, laboratory indices, treatment and outcomes of these patients were collected. Results Of 140 hospitalized patients with severe COVID-19, 73 (52.14%), with a median age of 62 years, were influenza virus IgM-positive and 67 (47.86%), with a median age of 66 years, were influenza virus IgM-negative. 76 (54.4%) of the severe COVID-19 patients were male. Chronic comorbidities consisted mainly of hypertension (45.3%), diabetes (15.8%), chronic respiratory disease (7.2%), cardiovascular disease (5.8%), malignancy (4.3%) and chronic kidney disease (2.2%). Clinical features, including fever (≥38 °C), chills, cough, chest pain, dyspnea, diarrhea and fatigue or myalgia, were collected. Fatigue or myalgia was less common in influenza virus IgM-positive COVID-19 patients (33.3% vs 50.7%, P = 0.0375). A higher proportion of prolonged activated partial thromboplastin time (APTT > 42 s) was observed in influenza virus IgM-negative COVID-19 patients (43.8% vs 23.6%, P = 0.0127). Severe COVID-19 patients who were influenza virus IgM-positive had a higher cumulative survival rate than those who were IgM-negative (log-rank P = 0.0308). Because age is a potential confounding variable, the difference in age between the influenza virus IgM status groups was adjusted for; the HR was 0.29 (95% CI, 0.081–1.100).
Similarly, after adjustment for gender, the HR was 0.262 (95% CI, 0.072–0.952) in the Cox regression model. Conclusions Influenza virus IgM positivity may be associated with decreased in-hospital death.
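
The two adjusted hazard ratios can be compared on a common footing by backing out an approximate standard error from each reported interval. This sketch assumes the intervals are symmetric Wald intervals on the log-hazard scale, which the abstract does not state; the resulting z and P values are therefore approximations, not figures reported by the authors, and the function name is hypothetical.

```python
from math import erf, log, sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_summary(hr: float, ci_low: float, ci_high: float):
    """Approximate SE of log(HR), z statistic and two-sided P value,
    assuming a Wald interval that is symmetric on the log scale."""
    se = (log(ci_high) - log(ci_low)) / (2.0 * Z_95)
    z = log(hr) / se
    normal_cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
    p_value = 2.0 * (1.0 - normal_cdf)
    return se, z, p_value


# Hazard ratios and intervals as quoted in the abstract.
age_adjusted = wald_summary(0.29, 0.081, 1.100)
gender_adjusted = wald_summary(0.262, 0.072, 0.952)

for label, (se, z, p) in (("age-adjusted", age_adjusted),
                          ("gender-adjusted", gender_adjusted)):
    print(f"{label}: SE(log HR) = {se:.3f}, z = {z:.2f}, approx. P = {p:.3f}")
```

Consistent with the quoted intervals, the age-adjusted interval includes 1 while the gender-adjusted interval narrowly excludes it, so the evidence for an association is borderline under this approximation.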


2020 ◽  
Vol 148 ◽  
Author(s):  
B. E. Young ◽  
T. M. Mak ◽  
L. W. Ang ◽  
S. Sadarangani ◽  
H. J. Ho ◽  
...  

Abstract Influenza vaccine effectiveness (VE) wanes over the course of a temperate climate winter season but little data are available from tropical countries with year-round influenza virus activity. In Singapore, a retrospective cohort study of adults vaccinated from 2013 to 2017 was conducted. Influenza vaccine failure was defined as hospital admission with polymerase chain reaction-confirmed influenza infection 2–49 weeks after vaccination. Relative VE was calculated by splitting the follow-up period into 8-week episodes (Lexis expansion) and comparing the odds of influenza infection in the first 8-week period after vaccination (weeks 2–9) with the odds in subsequent 8-week periods, using multivariable logistic regression adjusting for patient factors and influenza virus activity. Records of 19 298 influenza vaccinations were analysed with 617 (3.2%) influenza infections. Relative VE was stable for the first 26 weeks post-vaccination, but then declined for all three influenza types/subtypes to 69% at weeks 42–49 (95% confidence interval (CI) 52–92%, P = 0.011). VE declined fastest in older adults, in individuals with chronic pulmonary disease and in those who had been previously vaccinated within the last 2 years. Vaccine failure was significantly associated with a change in recommended vaccine strains between vaccination and observation period (adjusted odds ratio 1.26, 95% CI 1.06–1.50, P = 0.010).


Genetics ◽  
2003 ◽  
Vol 165 (3) ◽  
pp. 1385-1395
Author(s):  
Claus Vogl ◽  
Aparup Das ◽  
Mark Beaumont ◽  
Sujata Mohanty ◽  
Wolfgang Stephan

Abstract Population subdivision complicates analysis of molecular variation. Even if neutrality is assumed, three evolutionary forces need to be considered: migration, mutation, and drift. Simplification can be achieved by assuming that the process of migration among and drift within subpopulations is occurring fast compared to mutation and drift in the entire population. This allows a two-step approach in the analysis: (i) analysis of population subdivision and (ii) analysis of molecular variation in the migrant pool. We model population subdivision using an infinite island model, where we allow the migration/drift parameter Θ to vary among populations. Thus, central and peripheral populations can be differentiated. For inference of Θ, we use a coalescence approach, implemented via a Markov chain Monte Carlo (MCMC) integration method that allows estimation of allele frequencies in the migrant pool. The second step of this approach (analysis of molecular variation in the migrant pool) uses the estimated allele frequencies in the migrant pool for the study of molecular variation. We apply this method to a Drosophila ananassae sequence data set. We find little indication of isolation by distance, but large differences in the migration parameter among populations. The population as a whole seems to be expanding. A population from Bogor (Java, Indonesia) shows the highest variation and seems closest to the species center.

