A scalable pipeline for local ancestry inference using tens of thousands of reference haplotypes

2021 ◽  
Author(s):  
Eric Y. Durand ◽  
Chuong B. Do ◽  
Peter R. Wilton ◽  
Joanna L. Mountain ◽  
Adam Auton ◽  
...  

Abstract Ancestry deconvolution is the task of identifying the ancestral origins of chromosomal segments of admixed individuals. It has important applications, from mapping disease genes to identifying loci potentially under natural selection. However, most existing methods are limited to a small number of ancestral populations and are unsuitable for large-scale applications. In this article, we describe Ancestry Composition, a modular pipeline for accurate and efficient ancestry deconvolution. In the first stage, a string-kernel support-vector-machine classifier assigns provisional ancestry labels to short, statistically phased genomic segments. In the second stage, an autoregressive pair hidden Markov model corrects phasing errors, smooths local ancestry estimates, and computes confidence scores. Using publicly available datasets and more than 12,000 individuals from the customer database of the personal genetics company 23andMe, Inc., we have constructed a reference panel containing more than 14,000 unrelated individuals of unadmixed ancestry. We used principal components analysis (PCA) and uniform manifold approximation and projection (UMAP) to identify genetic clusters and define 45 distinct reference populations on which to train our method. In cross-validation experiments, Ancestry Composition achieves high precision and recall.
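A minimal sketch of the two-stage idea described above, not the authors' implementation: windows of a phased haplotype are classified with an SVM over k-mer counts (a simple stand-in for the paper's string kernel), and the per-window labels are then smoothed with a Viterbi pass over a "sticky" transition matrix (a crude stand-in for the autoregressive pair HMM). All data below are simulated, and the two toy populations, window length, and transition prior are assumptions for illustration only.

```python
# Stage 1: per-window SVM classification; Stage 2: HMM-style smoothing.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def kmer_counts(haplotype, k=3):
    """Count k-mers in a 0/1 haplotype window; used as SVM features."""
    counts = np.zeros(2 ** k)
    for i in range(len(haplotype) - k + 1):
        counts[int("".join(map(str, haplotype[i:i + k])), 2)] += 1
    return counts

def simulate(pop, n, length=60):
    """Toy reference haplotypes: two 'populations' with different allele frequencies."""
    p = 0.25 if pop == 0 else 0.75
    return rng.binomial(1, p, size=(n, length))

X_train = np.vstack([simulate(0, 200), simulate(1, 200)])
y_train = np.array([0] * 200 + [1] * 200)
F_train = np.array([kmer_counts(h) for h in X_train])

# Stage 1: window classifier with probability estimates.
svm = SVC(kernel="rbf", probability=True).fit(F_train, y_train)

# A toy admixed chromosome: first half from pop 0, second half from pop 1.
windows = np.vstack([simulate(0, 25), simulate(1, 25)])
probs = svm.predict_proba(np.array([kmer_counts(h) for h in windows]))

# Stage 2: Viterbi smoothing with a strong self-transition prior.
def viterbi(emissions, stay=0.98):
    """Most likely label path under a sticky transition model."""
    n, k = emissions.shape
    A = np.full((k, k), (1.0 - stay) / (k - 1))
    np.fill_diagonal(A, stay)
    logA, logE = np.log(A), np.log(emissions + 1e-12)
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = np.log(1.0 / k) + logE[0]
    for t in range(1, n):
        trans = score[t - 1][:, None] + logA
        back[t] = trans.argmax(axis=0)
        score[t] = trans.max(axis=0) + logE[t]
    path = np.zeros(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

print(viterbi(probs))  # smoothed per-window ancestry labels
```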

2014 ◽  
Author(s):  
Eric Y Durand ◽  
Chuong B Do ◽  
Joanna L Mountain ◽  
J. Michael Macpherson

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important applications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution have been limited to two or three ancestral populations and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country of origin from the member database of the personal genetics company 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
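A minimal sketch of the third-stage idea, confidence recalibration with isotonic regression, using scikit-learn. The calibration data, variable names, and the toy over-confidence model below are illustrative assumptions, not the authors' code or data.

```python
# Fit a monotone map from raw classifier confidence to empirical accuracy on
# held-out segments, then apply it to new confidence scores.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Held-out calibration data: raw confidences and whether the assigned
# ancestry label was actually correct (1) or not (0).
raw_conf = rng.uniform(0.5, 1.0, size=1000)
was_correct = rng.binomial(1, np.clip(raw_conf - 0.1, 0.0, 1.0))  # toy over-confident model

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_conf, was_correct)

# Recalibrated confidence for new segment-level scores.
new_scores = np.array([0.55, 0.7, 0.9, 0.99])
print(iso.predict(new_scores))
```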


2021 ◽  
Author(s):  
M. Tanveer ◽  
A. Tiwari ◽  
R. Choudhary ◽  
M. A. Ganaie

2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Ersen Yılmaz

A two-stage expert system is proposed for cardiac arrhythmia diagnosis. In the first stage, the Fisher score is used for feature selection to reduce the dimensionality of the data set's feature space. The second stage is the classification stage, in which a least-squares support vector machine classifier is applied to the feature subset selected in the first stage to diagnose cardiac arrhythmia. The performance of the proposed expert system is evaluated on an arrhythmia data set taken from the UCI Machine Learning Repository.
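A hedged sketch of the two-stage idea: Fisher-score feature ranking followed by an SVM on the selected features. scikit-learn has no least-squares SVM, so a standard SVC stands in for the paper's LS-SVM classifier, and synthetic data stand in for the UCI arrhythmia set; the number of kept features is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def fisher_score(X, y):
    """Fisher score per feature: between-class scatter / within-class scatter."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

X, y = make_classification(n_samples=400, n_features=50, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: keep the top-k features by Fisher score.
k = 10
top = np.argsort(fisher_score(X_tr, y_tr))[::-1][:k]

# Stage 2: train the classifier on the reduced feature set.
clf = SVC(kernel="rbf").fit(X_tr[:, top], y_tr)
print("test accuracy:", clf.score(X_te[:, top], y_te))
```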


Author(s):  
Stanislaw Osowski ◽  
Tomasz Markiewicz

This chapter presents an automatic system for white blood cell recognition in myelogenous leukaemia based on images of bone-marrow smears. It addresses the fundamental problems of this task: extraction of individual cell images from the smear, generation of features describing each cell, selection of the best features, and final recognition using an efficient classifier network based on support vector machines. The chapter proposes a complete system that solves all of these problems: cell extraction using the watershed algorithm; generation of features based on texture, geometry, morphology, and the statistical description of image intensity; feature selection using linear support vector machines; and final classification using Gaussian-kernel support vector machines. The results of numerical experiments on the recognition of up to 17 classes of blood cells of myelogenous leukaemia show that the proposed system is quite accurate and may find practical application in hospitals in the diagnosis of patients suffering from leukaemia.
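A hedged sketch of the classification back-end only: feature selection driven by a sparse linear SVM followed by recognition with a Gaussian-kernel SVM, as the chapter describes. Cell segmentation (watershed) and the texture/geometry feature extraction are out of scope here, so synthetic feature vectors, class counts, and regularization settings below are stand-in assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

# Synthetic stand-in for per-cell feature vectors (texture, geometry, ...).
X, y = make_classification(n_samples=600, n_features=80, n_informative=15,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    # Feature selection via an L1-regularized linear SVM.
    SelectFromModel(LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)),
    # Final recognition with a Gaussian (RBF) kernel SVM.
    SVC(kernel="rbf", gamma="scale"),
)
pipeline.fit(X_tr, y_tr)
print("selected features:", pipeline[1].get_support().sum())
print("test accuracy:", pipeline.score(X_te, y_te))
```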


Author(s):  
Denali Molitor ◽  
Deanna Needell

Abstract In today’s data-driven world, storing, processing and gleaning insights from large-scale data are major challenges. Data compression is often required in order to store large amounts of high-dimensional data, and thus, efficient inference methods for analyzing compressed data are necessary. Building on a recently designed simple framework for classification using binary data, we demonstrate that one can improve the classification accuracy of this approach through iterative applications, where the output of one application serves as the input to the next. As a side consequence, we show that the original framework can be used as a data-preprocessing step to improve the performance of other methods, such as support vector machines. For several simple settings, we showcase the ability to obtain theoretical guarantees for the accuracy of the iterative classification method. The simplicity of the underlying classification framework makes it amenable to theoretical analysis.
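A hedged illustration of one point in the abstract: compressing data to binary measurements and then training a standard SVM on those bits. The sign-of-random-projection compression below is a generic one-bit scheme assumed for illustration, not the specific classification framework analyzed in the paper, and the dataset and dimensions are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=200, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One-bit compression: keep only the signs of m random projections.
m = 64
A = rng.standard_normal((X.shape[1], m))
B_tr = np.sign(X_tr @ A)
B_te = np.sign(X_te @ A)

# Train a standard SVM directly on the compressed binary features.
clf = LinearSVC(max_iter=5000).fit(B_tr, y_tr)
print("accuracy on 1-bit features:", clf.score(B_te, y_te))
```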

