Evaluation of Vicinity-based Hidden Markov Models for Genotype Imputation

2021 ◽  
Author(s):  
Su Wang ◽  
Miran Kim ◽  
Xiaoqian Jiang ◽  
Arif Ozgun Harmanci

The decreasing cost of DNA sequencing has led to a great increase in our knowledge of genetic variation. While population-scale projects bring important insight into genotype-phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping-by-arrays is a cost-effective and accurate alternative for genotyping of common and uncommon variants. Imputation methods compare the genotypes of the typed variants with large population-specific reference panels and estimate the genotypes of untyped variants by making use of linkage disequilibrium patterns. The most accurate imputation methods are based on the Li-Stephens hidden Markov model (HMM), which treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel. Here we assess the accuracy of local HMMs, where each untyped variant is imputed using the typed variants in a small window around itself (as small as 1 centimorgan). Locality-based imputation has recently been used by machine-learning-based genotype imputation approaches. We assess how the parameters of the local HMMs impact imputation accuracy in a comprehensive set of benchmarks and show that local HMMs can accurately impute common and uncommon variants and can be relaxed to impute rare variants as well. The source code for the local HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer.
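The windowed Li-Stephens model described above can be sketched in a few lines: hidden states index which reference haplotype the target chromosome is copying, a small switch probability allows template changes between sites, and the untyped variant is imputed as the posterior-weighted reference allele. This is a minimal haploid illustration, not the LoHaMMer implementation; the parameter names and values (`switch`, `err`) are hypothetical.

```python
import numpy as np

def impute_local_hmm(ref, typed_idx, typed_obs, target_idx,
                     switch=0.01, err=0.001):
    """Impute one untyped variant from typed variants in a local window.

    ref        : (H, L) 0/1 reference haplotypes over the window
    typed_idx  : window positions that were genotyped
    typed_obs  : observed 0/1 alleles at those positions
    target_idx : window position of the untyped variant
    """
    H, _ = ref.shape
    obs = dict(zip(typed_idx, typed_obs))
    sites = sorted(set(typed_idx) | {target_idx})

    def emission(s):
        if s in obs:
            return np.where(ref[:, s] == obs[s], 1.0 - err, err)
        return np.ones(H)  # untyped site: uninformative emission

    def transition(v):
        # Li-Stephens move: stay on the same template haplotype,
        # or switch to a uniformly chosen one with prob. `switch`.
        return (1.0 - switch) * v + switch * v.sum() / H

    # Forward pass: alphas[i] ~ P(obs up to site i, state at site i).
    fwd = np.full(H, 1.0 / H)
    alphas = []
    for s in sites:
        fwd = fwd * emission(s)
        alphas.append(fwd.copy())
        fwd = transition(fwd)

    # Backward pass: betas[i] ~ P(obs after site i | state at site i).
    # The transition kernel is symmetric, so the same update applies.
    bwd = np.ones(H)
    betas = [None] * len(sites)
    for i in range(len(sites) - 1, -1, -1):
        betas[i] = bwd.copy()
        bwd = transition(emission(sites[i]) * bwd)

    i = sites.index(target_idx)
    post = alphas[i] * betas[i]
    post /= post.sum()
    # Imputed dosage: posterior-weighted allele at the target variant.
    return float(post @ ref[:, target_idx])
```

With two reference haplotypes `[0,0,0]` and `[1,1,1]` and observed alleles `1` at the flanking typed sites, the posterior concentrates on the second template and the imputed dosage approaches 1.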

2019 ◽  
Vol 35 (21) ◽  
pp. 4321-4326
Author(s):  
Mark Abney ◽  
Aisha ElSherbiny

Abstract Motivation Genotype imputation, though generally accurate, often results in many genotypes being poorly imputed, particularly in studies where the individuals are not well represented by standard reference panels. When individuals in the study share regions of the genome identical by descent (IBD), it is possible to use this information in combination with a study-specific reference panel (SSRP) to improve the imputation results. Kinpute uses IBD information—due to recent, familial relatedness or distant, unknown ancestors—in conjunction with the output from linkage disequilibrium (LD) based imputation methods to compute more accurate genotype probabilities. Kinpute uses a novel method for IBD imputation, which works even in the absence of a pedigree, and results in substantially improved imputation quality. Results Given initial estimates of average IBD between subjects in the study sample, Kinpute uses a novel algorithm to select an optimal set of individuals to sequence and use as an SSRP. Kinpute is designed to use as input both this SSRP and the genotype probabilities output from other LD-based imputation software, and uses a new method to combine the LD-imputed genotype probabilities with IBD configurations to substantially improve imputation. We tested Kinpute on a human population isolate where 98 individuals have been sequenced. In half of this sample, whose sequence data were masked, we used Impute2 to perform LD-based imputation and Kinpute was used to obtain higher-accuracy genotype probabilities. Measures of imputation accuracy improved significantly, particularly for those genotypes that Impute2 imputed with low certainty. Availability and implementation Kinpute is an open-source and freely available C++ software package that can be downloaded from https://github.com/markabney/Kinpute/releases. Supplementary information Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract Background: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data throughout the human genome. This naturally entails the management of missing values, which can complicate the analysis of the data. Several general and specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performance under different missing-data mechanisms (completely at random or not) and different representations of DNA methylation levels (β-value and M-value). Results: We present an extensive analysis of the performance of seven imputation methods on simulated missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) methylation data. We further consider imputation performance on the two popular representations of methylation levels, the β-value and the M-value. Overall, β-values enable better imputation performance than M-values. Imputation accuracy is lower for mid-range β-values, while it is generally higher for values at the extremes of the β-value range. The distribution of MAR values is, on average, denser in the mid-range than the expected β-value distribution; as a consequence, MAR values are on average harder to impute. Conclusions: The results of the analysis provide guidelines for the most suitable imputation approaches for DNA methylation data under different representations of DNA methylation levels and different missing-data mechanisms.
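The two representations compared above are related by a logit transform: M = log2(β / (1 − β)), with inverse β = 2^M / (2^M + 1). A small self-contained sketch of the conversion; the boundary guard `eps` is a common practical choice, not something prescribed by the study:

```python
import math

def beta_to_m(beta, eps=1e-6):
    """M-value from a beta-value; eps guards the 0 and 1 boundaries
    where the logit would be undefined."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform: beta-value from an M-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A beta-value of 0.5 (half-methylated) maps to M = 0, and the extremes of the beta range map to large-magnitude M-values, which is why the two scales behave differently under imputation.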


2020 ◽  
Author(s):  
Benjamin B. Chu ◽  
Eric M. Sobel ◽  
Rory Wasiolek ◽  
Janet S. Sinsheimer ◽  
Hua Zhou ◽  
...  

Abstract Current methods for genotype imputation and phasing exploit the sheer volume of data in haplotype reference panels and rely on hidden Markov models. Existing programs all have essentially the same imputation accuracy, are computationally intensive, and generally require pre-phasing the typed markers. We propose a novel data-mining method for genotype imputation and phasing that substitutes highly efficient linear algebra routines for hidden Markov model calculations. This strategy, embodied in our Julia program MendelImpute.jl, avoids explicit assumptions about recombination and population structure while delivering similar prediction accuracy, better memory usage, and an order of magnitude or better run times compared to the fastest competing method. MendelImpute operates on both dosage data and unphased genotype data and simultaneously imputes missing genotypes and phase at both the typed and untyped SNPs. Finally, MendelImpute naturally extends to global and local ancestry estimation and lends itself to new strategies for data compression and hence faster data transport and sharing.
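The idea of replacing HMM calculations with linear algebra can be illustrated by the core search step: over a genomic window, find the pair of reference haplotypes whose sum best matches the observed genotype vector in the least-squares sense, with all pairwise terms obtained from two matrix products. This is an illustrative sketch of the general approach, not MendelImpute.jl's actual code (which is written in Julia and heavily optimized):

```python
import numpy as np

def best_haplotype_pair(x, H):
    """Least-squares haplotype-pair search over one window.

    x : (p,) observed genotype dosages (0/1/2, NaN where untyped)
    H : (d, p) reference haplotypes (0/1)
    Returns indices (i, j) minimizing ||x - H[i] - H[j]||^2 over the
    observed entries.
    """
    mask = ~np.isnan(x)
    xo, Ho = x[mask], H[:, mask]
    # Expand ||x - hi - hj||^2 = ||x||^2 - 2 x'(hi + hj) + ||hi + hj||^2.
    # Dropping the constant ||x||^2, every remaining term comes from
    # two matrix products: q = H x and G = H H'.
    q = Ho @ xo                # (d,)   x' hi
    G = Ho @ Ho.T              # (d, d) hi' hj
    n = np.diag(G)             # ||hi||^2
    cost = n[:, None] + n[None, :] + 2 * G - 2 * (q[:, None] + q[None, :])
    i, j = np.unravel_index(np.argmin(cost), cost.shape)
    return int(i), int(j)
```

Once the best pair is found, missing entries of `x` can be filled with `H[i] + H[j]`, which also yields phase; this avoids any per-site forward-backward recursion.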



2021 ◽  
Vol 53 (1) ◽  
Author(s):  
Gerardo A. Fernandes Júnior ◽  
Roberto Carvalheiro ◽  
Henrique N. de Oliveira ◽  
Mehdi Sargolzaei ◽  
Roy Costilla ◽  
...  

Abstract Background A cost-effective strategy to explore the complete DNA sequence in animals for genetic evaluation purposes is to sequence key ancestors of a population, followed by imputation mechanisms to infer marker genotypes that were not originally reported in a target population of animals genotyped with single nucleotide polymorphism (SNP) panels. The feasibility of this process relies on the accuracy of the genotype imputation in that population, particularly for potential causal mutations, which may be at low frequency and either within genes or regulatory regions. The objective of the present study was to investigate the imputation accuracy to the sequence level in a Nellore beef cattle population, including that for variants in annotation classes that are more likely to be functional. Methods Information from 151 key sequenced Nellore sires was used to assess the imputation accuracy from the bovine HD BeadChip SNP panel (~ 777 k) to whole-genome sequence. The sires were chosen to optimize the imputation accuracy of a genotypic database comprising about 10,000 genotyped Nellore animals. Genotype imputation was performed using two computational approaches: FImpute3 and Minimac4 (after phasing with Eagle). The accuracy of the imputation was evaluated using a fivefold cross-validation scheme and measured by the squared correlation between observed and imputed genotypes, calculated per individual and per SNP. SNPs were classified into a range of annotations, and imputation accuracy within each annotation class was also evaluated. Results High average imputation accuracies per animal were achieved using both FImpute3 (0.94) and Minimac4 (0.95). On average, common variants (minor allele frequency (MAF) > 0.03) were more accurately imputed by Minimac4, and low-frequency variants (MAF ≤ 0.03) were more accurately imputed by FImpute3. The inherent Minimac4 Rsq imputation quality statistic appears to be a good indicator of the empirical Minimac4 imputation accuracy. Both software packages provided high average SNP-wise imputation accuracy across all classes of biological annotations. Conclusions Our results indicate that imputation to the whole-genome sequence level is feasible in Nellore beef cattle, since high imputation accuracies per individual are expected. SNP-wise imputation accuracy is software-dependent, especially for rare variants. The accuracy of imputation appears to be relatively independent of annotation class.
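The accuracy metric used in this study, the squared Pearson correlation between observed and imputed genotypes computed per individual and per SNP, can be sketched as follows (a generic implementation, not the authors' code):

```python
import numpy as np

def imputation_r2(observed, imputed, axis=0):
    """Squared Pearson correlation between observed and imputed
    genotype matrices (individuals x SNPs).

    axis=0 gives one accuracy value per SNP (correlating across
    individuals); axis=1 gives one value per individual.
    """
    o = observed - observed.mean(axis=axis, keepdims=True)
    m = imputed - imputed.mean(axis=axis, keepdims=True)
    num = (o * m).sum(axis=axis) ** 2
    den = (o ** 2).sum(axis=axis) * (m ** 2).sum(axis=axis)
    return num / den
```

Perfectly imputed genotypes yield r² = 1 for every SNP, and any added noise pulls the per-SNP values below 1; note that the metric is undefined for monomorphic SNPs (zero variance), which must be filtered beforehand.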


2018 ◽  
Author(s):  
Mark Abney ◽  
Aisha El Sherbiny

Abstract Motivation Genotype imputation, though generally accurate, often results in many genotypes being poorly imputed, particularly in studies where the individuals are not well represented by standard reference panels. When individuals in the study share regions of the genome identical by descent (IBD), it is possible to use this information in combination with a study-specific reference panel (SSRP) to improve the imputation results. Kinpute uses IBD information, due to either recent, familial relatedness or distant, unknown ancestors, in conjunction with the output from linkage disequilibrium (LD) based imputation methods to compute more accurate genotype probabilities. Kinpute uses a novel method for IBD imputation, which works even in the absence of a pedigree, and results in substantially improved imputation quality. Results Given initial estimates of average IBD between subjects in the study sample, Kinpute uses a novel algorithm to select an optimal set of individuals to sequence and use as an SSRP. Kinpute is designed to use as input both this SSRP and the genotype probabilities output from other LD-based imputation software, and uses a new method to combine the LD-imputed genotype probabilities with IBD configurations to substantially improve imputation. We tested Kinpute on a human population isolate where 98 individuals have been sequenced. In half of this sample, whose sequence data were masked, we used Impute2 to perform LD-based imputation and Kinpute was used to obtain higher-accuracy genotype probabilities. Measures of imputation accuracy improved significantly, particularly for those genotypes that Impute2 imputed with low certainty. Availability Kinpute is an open-source and freely available C++ software package that can be downloaded from https://github.com/markabney/Kinpute/releases.


2019 ◽  
Vol 77 (6) ◽  
pp. 2157-2170
Author(s):  
Manuel Vieira ◽  
M Clara P Amorim ◽  
Andreas Sundelöf ◽  
Nuno Prista ◽  
Paulo J Fonseca

Abstract Passive acoustic monitoring (PAM) is emerging as a cost-effective, non-intrusive method to monitor the health and biodiversity of marine habitats, including the impacts of anthropogenic noise on marine organisms. When long PAM recordings are to be analysed, automatic recognition and identification processes are invaluable tools for extracting the relevant information. We propose a pattern recognition methodology based on hidden Markov models (HMMs) for the detection and recognition of acoustic signals from marine vessel passages and test it in two different regions, the Tagus estuary in Portugal and the Öresund strait in the Baltic Sea. Results show that the combination of HMMs with PAM provides a powerful tool to monitor the presence of marine vessels and discriminate between different vessels such as small boats, ferries, and large ships. Improvements to enhance the capability to discriminate between different types of small recreational boats are discussed.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Jacob R Heldenbrand ◽  
Saurabh Baheti ◽  
Matthew A Bockol ◽  
Travis M Drucker ◽  
Steven N Hart ◽  
...  

Abstract Background Use of the Genome Analysis Toolkit (GATK) continues to be standard practice in genomic variant calling in both research and the clinic. The toolkit has been evolving rapidly in recent years. Significant computational performance improvements were introduced in GATK3.8 through a collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed rewrites in the code base, as a stepping stone toward a Spark implementation. As the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements to help the community stay abreast of changes in performance. Results We re-evaluated multiple options, such as threading, parallel garbage collection, I/O options and data-level parallelization. Additionally, we considered the trade-offs of using GATK3.8 and GATK4. We found optimized parameter values that reduce the time of executing the best-practices variant-calling procedure by 29.3% for GATK3.8 and 16.9% for GATK4. Further speedups can be accomplished by splitting data for parallel analysis, resulting in a run time of only a few hours on a whole human genome sequenced to a depth of 20X, for both versions of GATK. Nonetheless, GATK4 is already much more cost-effective than GATK3.8. Thanks to significant rewrites of the algorithms, the same analysis can be run largely in a single-threaded fashion, allowing users to process multiple samples on the same CPU. Conclusions In time-sensitive situations, when a patient has a critical or rapidly developing condition, it is useful to minimize the time to process a single sample. In such cases we recommend using GATK3.8 by splitting the sample into chunks and computing across multiple nodes. The resultant walltime will be nnn.4 hours at a cost of $41.60 on 4 c5.18xlarge instances of Amazon Cloud.
For cost-effectiveness of routine analyses or for large population studies, it is useful to maximize the number of samples processed per unit time. Thus we recommend GATK4, running multiple samples on one node. The total walltime will be ∼34.1 hours for 40 samples, with 1.18 samples processed per hour at a cost of $2.60 per sample on a c5.18xlarge instance of Amazon Cloud.


2018 ◽  
Author(s):  
Leonid Rozenberg ◽  
Jeff Hammerbacher

Abstract HLA typing from sequencing data is treated as a classical probabilistic inference problem, and Profile Hidden Markov Models (PHMMs) are motivated for the likelihood calculation. Their generative property makes them a natural and highly discernible method, at the cost of considerable computation. We discuss ways to ameliorate this burden and present an implementation at https://github.com/hammerlab/prohlatype.


2019 ◽  
Vol 10 (2) ◽  
pp. 581-590 ◽  
Author(s):  
Smaragda Tsairidou ◽  
Alastair Hamilton ◽  
Diego Robledo ◽  
James E. Bron ◽  
Ross D. Houston

Genomic selection enables cumulative genetic gains in key production traits such as disease resistance, playing an important role in the economic and environmental sustainability of aquaculture production. However, it requires genome-wide genetic marker data on large populations, which can be prohibitively expensive. Genotype imputation is a cost-effective method for obtaining high-density genotypes, but its value in aquaculture breeding programs, which are characterized by large full-sibling families, has yet to be fully assessed. The aim of this study was to optimize the use of low-density genotypes and evaluate genotype imputation strategies for cost-effective genomic prediction. Phenotypes and genotypes (78,362 SNPs) were obtained for 610 individuals from a Scottish Atlantic salmon breeding program population (Landcatch, UK) challenged with sea lice, Lepeophtheirus salmonis. Genomic prediction accuracy was calculated using GBLUP approaches and compared across SNP panels of varying densities and compositions, with and without imputation. Imputation was tested with parents genotyped for the optimal SNP panel and offspring genotyped for a range of lower-density imputation panels. Reducing SNP density had little impact on prediction accuracy until 5,000 SNPs, below which the accuracy dropped. Imputation accuracy increased with increasing imputation panel density. Genomic prediction accuracy when offspring were genotyped for just 200 SNPs, and parents for 5,000 SNPs, was 0.53. This accuracy was similar to that of the full high-density and optimal-density datasets, and markedly higher than using 200 SNPs without imputation. These results suggest that imputation from very low to medium density can be a cost-effective tool for genomic selection in Atlantic salmon breeding programs.
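A minimal sketch of the GBLUP predictor referenced above, assuming a VanRaden genomic relationship matrix and a known heritability. All names and the default `h2` are illustrative choices, not the study's pipeline:

```python
import numpy as np

def gblup(M, y, train, h2=0.3):
    """Predict breeding values for all individuals from genotypes.

    M     : (n, m) genotype matrix coded 0/1/2
    y     : (n,) phenotypes, used only where `train` is True
    train : (n,) boolean mask of phenotyped (training) individuals
    """
    p = M.mean(axis=0) / 2                      # allele frequencies
    Z = M - 2 * p                               # centered genotypes
    G = Z @ Z.T / (2 * (p * (1 - p)).sum())     # VanRaden G matrix
    lam = (1 - h2) / h2                         # residual/genetic variance ratio
    yt = y[train] - y[train].mean()             # centered training phenotypes
    # Mixed-model solution in kernel form: solve on the training block,
    # then project to everyone through their genomic relationships.
    K = G[np.ix_(train, train)] + lam * np.eye(int(train.sum()))
    alpha = np.linalg.solve(K, yt)
    return G[:, train] @ alpha                  # predicted breeding values
```

The same function covers the imputation comparison in the study: building `M` from imputed versus true high-density genotypes and correlating the resulting predictions with phenotypes gives the prediction-accuracy numbers being compared.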

