Genotype imputation using the Positional Burrows Wheeler Transform

PLoS Genetics ◽  
2020 ◽  
Vol 16 (11) ◽  
pp. e1009049
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100-fold. Increasing reference panel size improves accuracy at markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method that accuracy is optimized by using a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30× faster than MINIMAC4 and up to 3× faster than BEAGLE5.1, and uses less memory than both of these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.
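
As a rough illustration of the haplotype-selection idea, the sketch below builds PBWT positional prefix arrays for a small 0/1 haplotype matrix; haplotypes that are adjacent in the array at a site share the longest match ending there, which is the property used to pick locally best-matching conditioning haplotypes. This is a minimal reconstruction of the published PBWT construction, not IMPUTE5's implementation; the function name and data layout are assumptions.

```python
def build_pbwt(haplotypes):
    """Build positional prefix arrays for a 0/1 haplotype matrix.

    haplotypes: list of equal-length lists of 0/1 alleles, one per haplotype.
    Returns one prefix array per site boundary: prefix_arrays[k] orders the
    haplotype indices by their reversed prefix ending just before site k, so
    haplotypes adjacent in the array share the longest suffix match there.
    """
    n_hap = len(haplotypes)
    n_sites = len(haplotypes[0])
    order = list(range(n_hap))            # prefix array before the first site
    prefix_arrays = [order[:]]
    for k in range(n_sites):
        zeros, ones = [], []
        for h in order:                   # stable partition by allele at site k
            (zeros if haplotypes[h][k] == 0 else ones).append(h)
        order = zeros + ones
        prefix_arrays.append(order[:])
    return prefix_arrays
```

In this representation, scanning the neighbours of a target haplotype's position in each prefix array gives the locally best-matching reference haplotypes at that marker, without comparing the target against the whole panel.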

2019 ◽  
Author(s):  
Simone Rubinacci ◽  
Olivier Delaneau ◽  
Jonathan Marchini

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100-fold. Increasing reference panel size improves accuracy at markers with low minor allele frequencies but poses ever-increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method that accuracy is optimized by using a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best-matching haplotypes and long identical-by-state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ~65,000 haplotypes, we show that IMPUTE5 is up to 30× faster than MINIMAC4 and up to 3× faster than BEAGLE5.1, and uses less memory than both of these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size, IMPUTE5 is able to use a smaller number of reference haplotypes, thus reducing computational cost.

Author summary: Genome-wide association studies (GWAS) typically use microarray technology to measure genotypes at several hundred thousand positions in the genome. However, reference panels of genetic variation consist of haplotype data at more than 100-fold more positions in the genome. Genotype imputation makes genotype predictions at all the reference panel sites using the GWAS data. Reference panels are continuing to grow in size, and this improves the accuracy of the predictions; however, methods need to be able to scale to the increased size. We have developed a new version of the popular IMPUTE software that can handle reference panels with millions of haplotypes, and it has better performance than other published approaches. A notable property of the new method is that it scales sub-linearly with reference panel size. Keeping the number of imputed markers constant, a 100-fold increase in reference panel size requires less than twice the computation time.
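
The conditioning step referenced above, using the selected haplotypes as states within the IMPUTE model, is a Li-Stephens haplotype-copying HMM. The sketch below is a minimal haploid forward-backward pass over a small set of conditioning haplotypes; the function name and the constant switch and mismatch rates (rho, theta) are illustrative assumptions, not IMPUTE5's genetic-map-driven parameters.

```python
import numpy as np

def li_stephens_posteriors(target_alleles, ref_haps, rho=0.01, theta=0.001):
    """Forward-backward over a haploid Li-Stephens copying model.

    target_alleles: 0/1 alleles of the target haplotype at genotyped sites.
    ref_haps: (K, M) array of conditioning haplotypes at the same sites.
    Returns an (M, K) matrix of copying-state posteriors; alleles at untyped
    sites can then be predicted by weighting the reference alleles there.
    """
    K, M = ref_haps.shape
    emit = np.where(ref_haps.T == np.asarray(target_alleles)[:, None],
                    1.0 - theta, theta)            # (M, K) emission probs
    fwd = np.zeros((M, K))
    fwd[0] = emit[0] / K
    fwd[0] /= fwd[0].sum()
    for m in range(1, M):                          # forward recursion
        stay = (1.0 - rho) * fwd[m - 1]
        switch = rho * fwd[m - 1].sum() / K
        fwd[m] = emit[m] * (stay + switch)
        fwd[m] /= fwd[m].sum()                     # rescale for stability
    bwd = np.ones((M, K))
    for m in range(M - 2, -1, -1):                 # backward recursion
        msg = emit[m + 1] * bwd[m + 1]
        bwd[m] = (1.0 - rho) * msg + rho * msg.sum() / K
        bwd[m] /= bwd[m].sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)
```

The cost of this recursion grows with the number of conditioning states K, which is why selecting a small, well-matched subset of haplotypes with the PBWT keeps computation roughly flat as the full panel grows.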


2018 ◽  
Author(s):  
Brian L. Browning ◽  
Ying Zhou ◽  
Sharon R. Browning

Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1000 phased target samples at a cost of less than one US cent per sample. Beagle 5.0 is freely available from https://faculty.washington.edu/browning/beagle/beagle.html.


2017 ◽  
Author(s):  
Sina Rüeger ◽  
Aaron McDaid ◽  
Zoltán Kutalik

Motivation: Summary statistics imputation can be used to infer association summary statistics of an already conducted, genotype-based meta-analysis at higher genomic resolution. This is typically needed when genotype imputation is not feasible for some cohorts. Oftentimes, the cohorts of such a meta-analysis are variable in terms of (country of) origin or ancestry. This violates the assumption of current methods that an external LD matrix and the covariance of the Z-statistics are identical.
Results: To address this issue, we present variance matching, an extension to the existing summary statistics imputation method, which manipulates the LD matrix needed for summary statistics imputation. Based on simulations using real data, we find that accounting for ancestry admixture yields a noticeable improvement only when the total reference panel size is > 1000. We show that for population-specific variants this effect is more pronounced with increasing FST.
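
For context, the baseline operation that variance matching builds on is the standard conditional-expectation imputation of Z-statistics at untyped variants from typed ones using an LD (correlation) matrix. The sketch below shows that baseline step only; the variance-matching adjustment of the LD matrix is not reproduced here, and the function name and ridge term are illustrative assumptions.

```python
import numpy as np

def impute_zscores(z_typed, ld_tt, ld_ut, lam=0.1):
    """Impute Z-statistics at untyped variants from typed ones.

    z_typed: Z-scores at typed variants (length T).
    ld_tt:   (T, T) LD matrix among typed variants.
    ld_ut:   (U, T) LD matrix between untyped and typed variants.
    lam:     ridge regularisation on the diagonal for numerical stability.
    """
    T = len(z_typed)
    return ld_ut @ np.linalg.solve(ld_tt + lam * np.eye(T),
                                   np.asarray(z_typed))
```

The ridge term is needed in practice because LD matrices estimated from a finite reference panel are often ill-conditioned; mismatch between the panel's ancestry mix and the cohorts' is exactly the issue the variance-matching extension targets.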


Geophysics ◽  
2018 ◽  
Vol 83 (2) ◽  
pp. V99-V113 ◽  
Author(s):  
Zhong-Xiao Li ◽  
Zhen-Chun Li

After multiple prediction, adaptive multiple subtraction is essential for the success of multiple removal. The 3D blind separation of convolved mixtures (3D BSCM) method, which is effective in conducting adaptive multiple subtraction, needs to solve an optimization problem containing L1-norm minimization constraints on primaries by the iterative reweighted least-squares (IRLS) algorithm. The 3D BSCM method can better separate primaries and multiples than the 1D/2D BSCM method and the method with energy minimization constraints on primaries. However, the 3D BSCM method has high computational cost because the IRLS algorithm achieves nonquadratic optimization with an LS optimization problem solved in each iteration. A faster 3D BSCM method is therefore desirable. To improve the adaptability to field data processing, the fast iterative shrinkage-thresholding algorithm (FISTA) is introduced into the 3D BSCM method. The proximity operator of FISTA can solve the L1-norm minimization problem efficiently. We demonstrate that our FISTA-based 3D BSCM method achieves accuracy in estimating primaries similar to that of the reference IRLS-based 3D BSCM method. Furthermore, our FISTA-based 3D BSCM method reduces computation time by approximately 60% compared with the reference IRLS-based 3D BSCM method in the synthetic and field data examples.
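
As an illustration of the algorithmic ingredient named above, the sketch below applies FISTA with the L1 proximity operator (soft thresholding) to a generic problem of the form min_x 0.5*||Ax - b||^2 + lam*||x||_1. It is a minimal sketch of the algorithm class, assuming a dense matrix operator; it is not the 3D BSCM separation operator, which acts on convolved mixtures of seismic traces.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximity operator of the L1 norm (element-wise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_l1(A, b, lam, n_iter=200):
    """FISTA for min_x 0.5*||A x - b||^2 + lam*||x||_1 with a dense matrix A."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        x_new = soft_threshold(y - grad / L, lam / L)   # proximal step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x
```

Because each iteration needs only a gradient evaluation and a cheap soft-thresholding, this class of solver avoids the per-iteration least-squares solve that makes IRLS expensive, which is the source of the reported speedup.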


2021 ◽  
Author(s):  
Jaekwang Shin ◽  
Ankush Bansal ◽  
Randy Cheng ◽  
Alan Taub ◽  
Mihaela Banu

Accurate prediction of the defects occurring in incrementally formed parts has been gaining attention in recent years. This interest stems from the fact that accurate predictions can remove a key barrier to industrial-scale implementation of incremental forming, which has been held back by the cost and development time of trial-and-error methods. The finite element method has been widely utilized to predict the defects in the formed part, e.g., the bulge. However, the computation time of running these models and their mesh-size dependency in predicting the forming defects represent barriers to adopting these models as part of CAD-FEM-CAE platforms. Thus, robust analytical and data-driven algorithms must be developed for a cost-effective design of complex parts. In this paper, a new analytical model is proposed to predict the bulge location and geometry in two-point incremental forming of an aerospace aluminum alloy AA7075-O for a 67° truncated cone. First, the algorithm calculates the region of interest based on the part geometry. A novel shape function and a weighted summation method are then utilized to calculate the amplitude of the instability produced by material accumulation during forming, which leads to a bulge on the unformed portion of the sample. It was found that the geometric profile of the part influences the shape function, which is constructed to incorporate the effects of the process parameters and boundary conditions. The calculated profiles in each direction are then combined into a single three-dimensional profile and compared with experimental results for validation. The proposed model predicts the bulge profile with 95% accuracy relative to experiments, at less than 5% of the computational cost of FEM modeling.
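
To make the weighted-summation idea concrete, the sketch below sums hypothetical Gaussian shape functions, scaled by weights, over a one-dimensional region of interest. The Gaussian form, the function name, and all parameters are assumptions for illustration only; the paper's shape function is derived from the part geometry, process parameters, and boundary conditions and is not reproduced here.

```python
import numpy as np

def bulge_profile(region, sources, weights, width):
    """Hypothetical weighted summation of shape functions along one direction.

    region:  1-D coordinates (mm) of the unformed region of interest.
    sources: positions where accumulated material acts on the region.
    weights: contribution of each source (stand-in for the effects of process
             parameters and boundary conditions in the paper's model).
    width:   spread of the assumed Gaussian shape function.
    """
    region = np.asarray(region, dtype=float)[:, None]
    sources = np.asarray(sources, dtype=float)[None, :]
    shapes = np.exp(-((region - sources) ** 2) / (2.0 * width ** 2))
    return shapes @ np.asarray(weights, dtype=float)   # bulge amplitude profile
```

Profiles computed this way in each direction would then be merged into a single 3D surface for comparison with the measured part, mirroring the final step described above.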


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pooja Rani ◽  
Rajneesh Kumar ◽  
Anurag Jain

Purpose: Decision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently.
Design/methodology/approach: The proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean, and mode imputation methods in an optimum way. The performance of HIOC has been compared with the MICE, KNN, mean, and mode methods. Four classifiers, support vector machine (SVM), naive Bayes (NB), random forest (RF), and decision tree (DT), have been used to evaluate the performance of the imputation methods.
Findings: The results show that HIOC performed efficiently even with a high rate of missing values. It reduced the root mean square error (RMSE) by up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases, increasing classification accuracy by up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset.
Originality/value: The proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.
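
The general idea of letting a classifier arbitrate between MICE-style, KNN, mean, and mode imputation can be sketched with scikit-learn as below; choosing the imputer by cross-validated downstream accuracy is an assumption for illustration and is not the HIOC combination rule itself.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def pick_imputer(X, y):
    """Choose among MICE-style, KNN, mean and mode imputation by the
    cross-validated accuracy of a downstream classifier.

    X: feature matrix with NaN for missing entries; y: class labels.
    """
    candidates = {
        "mice": IterativeImputer(random_state=0),
        "knn": KNNImputer(n_neighbors=5),
        "mean": SimpleImputer(strategy="mean"),
        "mode": SimpleImputer(strategy="most_frequent"),
    }
    scores = {}
    for name, imputer in candidates.items():
        X_imp = imputer.fit_transform(X)                 # impute missing values
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores[name] = cross_val_score(clf, X_imp, y, cv=5).mean()
    best = max(scores, key=scores.get)                   # highest mean accuracy
    return best, scores
```

For brevity the imputers here are fitted on the full matrix; a rigorous comparison would refit them inside each cross-validation fold to avoid leakage.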


Author(s):  
Anil Kakarla ◽  
Sanjeev Agarwal ◽  
Sanjay Kumar Madria

Information processing and collaborative computing using agents over a distributed network of heterogeneous platforms are important for many defense and civil applications. In this chapter, a mobile-agent-based collaborative and distributed computing framework for network-centric information processing is presented using a military application. In this environment, the challenge is to continue processing efficiently while satisfying multiple constraints such as computational cost, communication bandwidth, and energy in a distributed network. The authors use mobile agent technology for distributed computing to speed up data processing using the available system resources in the network. The proposed framework provides a mechanism to bridge the gap between computation resources and dispersed data sources under variable bandwidth constraints. For every computation task raised in the network, a viable system that has the resources and data to compute the task is identified, and the task is sent to that system for completion. An experimental evaluation on a real platform is reported. It shows that, despite an increase in communication load compared with other solutions, the proposed framework reduces computation time.
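
A toy version of that viability check might look like the sketch below: pick a node that already hosts the required data and has spare CPU, preferring higher available bandwidth. The Node fields, thresholds, and selection criterion are hypothetical; the chapter's actual cost model is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_free: float          # fraction of CPU currently available
    bandwidth: float         # available link bandwidth, Mbit/s
    datasets: set            # data sources hosted locally

def select_viable_node(nodes, task_data, min_cpu=0.2):
    """Pick a node that already holds the task's data set (task_data is a set
    of required data sources) and has spare CPU, preferring higher CPU headroom
    and bandwidth. Returns None if no node is viable.
    """
    viable = [n for n in nodes
              if task_data <= n.datasets and n.cpu_free >= min_cpu]
    if not viable:
        return None
    return max(viable, key=lambda n: (n.cpu_free, n.bandwidth))
```

Dispatching the task (or the mobile agent carrying it) to a node that is co-located with the data is what trades extra coordination traffic for the reduction in computation time reported above.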


2010 ◽  
Vol 6 (3) ◽  
pp. 1-10 ◽  
Author(s):  
Shichao Zhang

In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). Even with little prior knowledge about the datasets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantial amount of useful information. This information includes occurrences involving missing values, and the method captures the real dataset distribution more easily than parametric or nonparametric imputation techniques. Experimental results show that the author's imputation methods outperform the existing methods in terms of imputation accuracy, in particular in situations with a high missing ratio.
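
To illustrate the iterative kernel-regression idea in its simplest form, the sketch below repeatedly re-imputes missing target values with a Nadaraya-Watson (Gaussian kernel) estimate over a single covariate until the imputed values stop changing, re-using the current imputations, including those at missing positions, in each round. It is a single-covariate toy, assuming a Gaussian kernel and fixed bandwidth, not the SIIA's semi-parametric model.

```python
import numpy as np

def kernel_impute(x, y, bandwidth=1.0, n_iter=20, tol=1e-6):
    """Iteratively impute missing target values y (NaN) from covariate x
    with Nadaraya-Watson kernel regression until the values converge.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)
    y[missing] = np.nanmean(y)                 # initial fill with the mean
    for _ in range(n_iter):
        old = y[missing].copy()
        # Gaussian kernel weights between missing points and all points
        w = np.exp(-0.5 * ((x[missing, None] - x[None, :]) / bandwidth) ** 2)
        y[missing] = (w @ y) / w.sum(axis=1)   # kernel-weighted average
        if np.max(np.abs(y[missing] - old)) < tol:
            break                              # imputed values have converged
    return y
```

Because each round smooths over both observed and currently imputed values, the iteration gradually propagates information from the observed distribution into the missing entries, which is the behaviour the abstract describes.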


Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1792
Author(s):  
Shu-Fen Huang ◽  
Ching-Hsue Cheng

Medical data usually have missing values; hence, imputation methods have become an important issue. In previous studies, many imputation methods, such as expectation-maximization and regression-based imputation, assumed that the variables followed a multivariate normal distribution. These assumptions may lead to deviations in the results, which sometimes create a bottleneck. In addition, directly deleting instances with missing values may cause several problems, such as losing important data, producing invalid research samples, and leading to research deviations. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared deleting instances with missing values against the imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used kNN imputation, multiple imputation, and the proposed imputation to impute the missing data and then applied four attribute selection methods to select the important attributes. Then, we used the decision tree (C4.5), random forest, REP tree, and LMT classifiers to generate the rules, accuracy, and AUC for comparison. Because there were four datasets with imbalanced classes (asymmetric classes), the AUC was an important criterion. In the experiment, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation is better than the other listed imputation methods, and imputing offers better results than directly deleting instances with missing values in terms of the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.
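
The core comparison described above, imputation versus listwise deletion judged by the AUC of a downstream classifier, can be sketched generically as below. This is an evaluation sketch only, using kNN imputation and a random forest as stand-ins; it is not the paper's safe-region method and omits its attribute-selection step.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_impute_vs_delete(X, y):
    """Compare kNN imputation against listwise deletion by cross-validated AUC.

    X: NumPy feature matrix with NaN for missing entries; y: binary labels.
    Returns (auc_with_imputation, auc_with_deletion).
    """
    clf = RandomForestClassifier(n_estimators=200, random_state=0)

    X_imp = KNNImputer(n_neighbors=5).fit_transform(X)      # impute, keep all rows
    auc_impute = cross_val_score(clf, X_imp, y, cv=5,
                                 scoring="roc_auc").mean()

    keep = ~np.isnan(X).any(axis=1)                          # listwise deletion
    auc_delete = cross_val_score(clf, X[keep], y[keep], cv=5,
                                 scoring="roc_auc").mean()
    return auc_impute, auc_delete
```

Using AUC as the criterion matters here because, as noted above, the medical datasets have imbalanced classes, where plain accuracy can be misleading.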

