One Size Doesn't Fit All - RefEditor: Building Personalized Diploid Reference Genome to Improve Read Mapping and Genotype Calling in Next Generation Sequencing Studies

ABSTRACT: A fundamental challenge in analyzing next-generation sequencing data is to determine an individual’s genotype correctly, as the accuracy of the inferred genotype is essential to downstream analyses. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic, as a threshold that is too high may lose data while one that is too low may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The algorithm, which we call PhredEM, uses the Expectation-Maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. We also develop a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be non-monomorphic require application of the EM algorithm. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project. The results demonstrate that PhredEM is an improved, robust and widely applicable genotype-calling approach for next-generation sequencing studies. The relevant software is freely available.
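For readers unfamiliar with the quantities named in the abstract, the following minimal sketch shows how a caller that uses phred scores directly could turn them into genotype likelihoods at a biallelic site. It relies on the standard phred definition (error probability 10^(-Q/10)) and a simple symmetric error model. It is not the PhredEM algorithm, which instead estimates error rates through logistic regression inside an EM procedure. All function names and example reads are hypothetical, and phred scores are assumed to be greater than zero.

```python
import math


def phred_to_error(q):
    """Convert a phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def genotype_log_likelihoods(reads):
    """Return log-likelihoods for genotypes carrying 0, 1, or 2 alternate alleles.

    reads: list of (is_alt, phred) pairs, one per read base covering the site.
    Assumption: an error flips reference to alternate (or back) with
    probability e, and every phred score is > 0 so no log(0) occurs.
    """
    logliks = []
    for n_alt in (0, 1, 2):
        frac_alt = n_alt / 2  # expected share of reads drawn from the alt allele
        ll = 0.0
        for is_alt, q in reads:
            e = phred_to_error(q)
            # Alt is observed if an alt base is read correctly
            # or a ref base is misread.
            p_alt = frac_alt * (1 - e) + (1 - frac_alt) * e
            ll += math.log(p_alt if is_alt else 1 - p_alt)
        logliks.append(ll)
    return logliks


def call_genotype(reads):
    """Return the alt-allele count (0, 1, or 2) with the highest likelihood."""
    ll = genotype_log_likelihoods(reads)
    return max(range(3), key=lambda g: ll[g])


if __name__ == "__main__":
    # Hypothetical site: five alternate and five reference reads, all at Q30.
    example = [(True, 30)] * 5 + [(False, 30)] * 5
    print(call_genotype(example))
```

Because every read contributes to the likelihood with a weight set by its own error probability, no hard phred threshold is needed in this sketch; low-quality reads simply carry less information.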


SeqEM: an adaptive genotype-calling approach for next-generation sequencing studies

Bioinformatics ◽

10.1093/bioinformatics/btq526 ◽

2010 ◽

Vol 26 (22) ◽

pp. 2803-2810 ◽

Cited By ~ 57

Author(s):

E. R. Martin ◽

D. D. Kinnamon ◽

M. A. Schmidt ◽

E. H. Powell ◽

S. Zuchner ◽

...

Keyword(s):

Next Generation Sequencing ◽

Next Generation ◽

Genotype Calling ◽

Sequencing Studies ◽

Generation Sequencing


Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper

10.1101/2020.03.02.973750 ◽

2020 ◽

Author(s):

Phillip A. Richmond ◽

Alice M. Kaye ◽

Godfrain Jacques Kounkou ◽

Tamar V. Av-Shalom ◽

Wyeth W. Wasserman

Keyword(s):

Next Generation Sequencing ◽

Reference Genome ◽

Next Generation Sequencing Data ◽

RNA-Seq ◽

Next Generation ◽

Sequencing Data ◽

Read Mapping ◽

Mapping Approach ◽

Reverse Mapping ◽

Generation Sequencing

Abstract: Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long-term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach.
FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.
Author Summary: In the past 15 years, next generation sequencing technology has revolutionized our capacity to process and analyze DNA sequencing data. From agriculture to medicine, this technology is enabling a deeper understanding of the blueprint of life. Next generation sequencing data is composed of short sequences of DNA, referred to as “reads”, which are often shorter than 200 base pairs, making them many orders of magnitude smaller than the entirety of a human genome. Gaining insights from this data has typically leveraged a reference-guided mapping approach, where the reads are aligned to a reference genome and then post-processed to gain actionable information such as presence or absence of genomic sequence, or variation between the reference genome and the sequenced sample. Many experts in the field of genomics have concluded that selecting a single, linear reference genome for mapping reads against is limiting, and several current research endeavors are focused on exploring options for improved analysis methods to unlock the full utility of sequencing data. Among these improvements are the usage of sex-matched genomes, population-specific reference genomes, and emergent graph-based reference pan-genomes. However, advanced methods that use raw DNA sequencing data to inform the choice of reference genome and guide the alignment of reads to enriched reference genomes are needed. Here we develop a method termed FlexTyper, which creates a searchable index of the short read data and enables flexible, user-guided queries to provide valuable insights without the need for reference-guided mapping.
We demonstrate the utility of our method by identifying sample ancestry and sex in human whole genome sequencing data, detecting viral pathogen reads in RNA-seq data, African-enriched genome regions absent from the global reference, and HLA alleles that are complex to discern using standard read mapping. We anticipate early adoption of FlexTyper within analysis pipelines as a pre-mapping component, and further envision the bioinformatics and genomics community will leverage the tool for creative uses of sequence queries from unmapped data.
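The "reverse mapping" idea described above, querying k-mers against an index of the reads rather than aligning reads to a reference, can be illustrated with a deliberately naive sketch. It swaps in a plain in-memory k-mer counter for the FM-index that FlexTyper uses, so it shows only the query logic, not the tool's implementation or performance. The function names, the `min_count` parameter, and the example reads are all hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_reads(reads, k):
    """Count every k-mer occurring in the reads (naive stand-in for an FM-index)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def count_query(index, kmer):
    """Count occurrences of a query k-mer on either strand."""
    rc = reverse_complement(kmer)
    if rc == kmer:  # palindromic k-mer: avoid counting it twice
        return index[kmer]
    return index[kmer] + index[rc]


def genotype_snp(index, ref_kmer, alt_kmer, min_count=1):
    """Call a SNP from support for k-mers spanning the ref and alt alleles."""
    ref_seen = count_query(index, ref_kmer) >= min_count
    alt_seen = count_query(index, alt_kmer) >= min_count
    if ref_seen and alt_seen:
        return "het"
    if ref_seen:
        return "hom_ref"
    if alt_seen:
        return "hom_alt"
    return "no_call"


if __name__ == "__main__":
    # Hypothetical reads differing at one position (A vs T).
    reads = ["ACGTACGA", "ACGTTCGA"]
    idx = index_reads(reads, k=5)
    print(genotype_snp(idx, ref_kmer="GTACG", alt_kmer="GTTCG"))
```

A real workflow would also have to handle sequencing errors and k-mers that are not unique in the genome, which is one reason a count threshold such as the hypothetical `min_count` would normally be set above one.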


PhredEM: a phred-score-informed genotype-calling approach for next-generation sequencing studies

Genetic Epidemiology ◽

10.1002/gepi.22048 ◽

2017 ◽

Vol 41 (5) ◽

pp. 375-387 ◽

Cited By ~ 13

Author(s):

Peizhou Liao ◽

Glen A. Satten ◽

Yi-Juan Hu

Keyword(s):

Next Generation Sequencing ◽

Next Generation ◽

Genotype Calling ◽

Phred Score ◽

Sequencing Studies ◽

Generation Sequencing


Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data

GigaScience ◽

10.1093/gigascience/giz074 ◽

2019 ◽

Vol 8 (7) ◽

Cited By ~ 13

Author(s):

Timothy H Webster ◽

Madeline Couse ◽

Bruno M Grande ◽

Eric Karlins ◽

Tanya N Phung ◽

...

Keyword(s):

Next Generation Sequencing ◽

Sex Chromosomes ◽

Sequence Homology ◽

Reference Genome ◽

Variant Calling ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Read Mapping ◽

Generation Sequencing

Abstract. Background: Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference. Results: Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3). Conclusions: Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.
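One way to picture how sequencing depth can inform sex chromosome complement, as the abstract above describes, is to compare mean depth on X and Y with mean autosomal depth. The sketch below is an illustration under assumed inputs only: it is not XYalign's code, the thresholds are arbitrary placeholder values rather than recommended settings, and the function names and depth lists are hypothetical.

```python
from statistics import mean


def normalized_depth(chrom_depths, autosome_depths):
    """Mean depth of a chromosome relative to mean autosomal depth."""
    return mean(chrom_depths) / mean(autosome_depths)


def infer_sex_complement(x_depths, y_depths, autosome_depths,
                         x_threshold=0.75, y_threshold=0.25):
    """Guess XX or XY from normalized X and Y depth.

    Thresholds are illustrative assumptions: roughly, two X copies give an
    X ratio near 1 with little Y signal, while one X and one Y give ratios
    near 0.5 for both.
    """
    x_ratio = normalized_depth(x_depths, autosome_depths)
    y_ratio = normalized_depth(y_depths, autosome_depths)
    if x_ratio >= x_threshold and y_ratio < y_threshold:
        return "XX"
    if x_ratio < x_threshold and y_ratio >= y_threshold:
        return "XY"
    # Covers other complements, contamination, or mismapping artifacts.
    return "ambiguous"


if __name__ == "__main__":
    # Hypothetical windowed depths for one sample.
    autosomes = [30, 30, 30]
    print(infer_sex_complement([30, 29, 31], [0, 1, 0], autosomes))
```

Residual Y depth in XX samples, of the kind produced by the homology-driven mismapping the abstract discusses, is the reason the Y threshold in this sketch sits above zero.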


Next-Generation Sequencing Studies: Optimal Design and Analysis, Missing Heritability and Rare Variants

Current Epidemiology Reports ◽

10.1007/s40471-014-0022-4 ◽

2014 ◽

Vol 1 (4) ◽

pp. 213-219 ◽

Cited By ~ 2

Author(s):

Paul Marjoram ◽

Duncan C. Thomas

Keyword(s):

Next Generation Sequencing ◽

Optimal Design ◽

Rare Variants ◽

Next Generation ◽

Missing Heritability ◽

Sequencing Studies ◽

Generation Sequencing


Gene mutations in gastric cancer: a review of recent next-generation sequencing studies

Tumor Biology ◽

10.1007/s13277-015-4002-1 ◽

2015 ◽

Vol 36 (10) ◽

pp. 7385-7394 ◽

Cited By ~ 33

Author(s):

Y. Lin ◽

Z. Wu ◽

W. Guo ◽

J. Li

Keyword(s):

Gastric Cancer ◽

Next Generation Sequencing ◽

Gene Mutations ◽

Next Generation ◽

Sequencing Studies ◽

Generation Sequencing


Co-extraction of genomic DNA & total RNA from recalcitrant woody tissues for next-generation sequencing studies

Future Science OA ◽

10.4155/fsoa-2018-0026 ◽

2018 ◽

Vol 4 (6) ◽

pp. FSO309

Author(s):

Zhen Zeng ◽

Tommaso Raffaello ◽

Meng-Xia Liu ◽

Fred O Asiegbu

Keyword(s):

Next Generation Sequencing ◽

Genomic DNA ◽

Next Generation ◽

Total RNA ◽

Sequencing Studies ◽

Woody Tissues ◽

Generation Sequencing


HDAM: a resource of human disease associated mutations from next generation sequencing studies

BMC Medical Genomics ◽

10.1186/1755-8794-6-s1-s16 ◽

2013 ◽

Vol 6 (S1) ◽

Author(s):

Meiwen Jia ◽

Yanli Liu ◽

Zhongchao Shen ◽

Chen Zhao ◽

Meixia Zhang ◽

...

Keyword(s):

Next Generation Sequencing ◽

Human Disease ◽

Next Generation ◽

Sequencing Studies ◽

Generation Sequencing


Pharmacogenomics variants are associated with BMI differences between individuals with bipolar and other psychiatric disorders

Pharmacogenomics ◽

10.2217/pgs-2021-0012 ◽

2021 ◽

Vol 22 (12) ◽

pp. 749-760

Author(s):

Aggeliki Charalampidi ◽

Zoe Kordou ◽

Evangelia-Eirini Tsermpini ◽

Panagiotis Bosganas ◽

Wasun Chantratita ◽

...

Keyword(s):

Next Generation Sequencing ◽

Psychiatric Disorder ◽

Psychiatric Disorders ◽

Statistical Methods ◽

Potential Effect ◽

Next Generation ◽

Potential Influence ◽

Sequencing Studies ◽

The Mean ◽

Generation Sequencing

Aim: Despite the plethora of next-generation sequencing studies in the field of pharmacogenomics (PGx), the potential effect of covariates on PGx response within deeply phenotyped cohorts remains unexplored. Materials & methods: Using advanced statistical methods, we explored the potential influence of BMI, as a covariate, on PGx response in a Greek cohort with psychiatric disorders. Results: Nine PGx variants within UGT1A6, SLC22A4, GSTP1, CYP4B1, CES1, SLC29A3 and DPYD were associated with altered BMI in different psychiatric disorder groups. Carriers of rs2070959 (UGT1A6), rs199861210 (SLC29A3) and rs2297595 (DPYD) were also characterized by significant changes in the mean BMI, depending on the presence of psychiatric disorders. Conclusion: Specific PGx variants are significantly associated with BMI in a Greek cohort with psychiatric disorders.
