A scalable pipeline for local ancestry inference using tens of thousands of reference haplotypes

2021 ◽  
Author(s):  
Eric Y. Durand ◽  
Chuong B. Do ◽  
Peter R. Wilton ◽  
Joanna L. Mountain ◽  
Adam Auton ◽  
...  

Abstract Ancestry deconvolution is the task of identifying the ancestral origins of chromosomal segments of admixed individuals. It has important applications, from mapping disease genes to identifying loci potentially under natural selection. However, most existing methods are limited to a small number of ancestral populations and are unsuitable for large-scale applications. In this article, we describe Ancestry Composition, a modular pipeline for accurate and efficient ancestry deconvolution. In the first stage, a string-kernel support-vector-machine classifier assigns provisional ancestry labels to short, statistically phased genomic segments. In the second stage, an autoregressive pair hidden Markov model corrects phasing errors, smooths local ancestry estimates, and computes confidence scores. Using publicly available datasets and more than 12,000 individuals from the customer database of the personal genetics company 23andMe, Inc., we have constructed a reference panel containing more than 14,000 unrelated individuals of unadmixed ancestry. We used principal components analysis (PCA) and uniform manifold approximation and projection (UMAP) to identify genetic clusters and define 45 distinct reference populations on which to train our method. In cross-validation experiments, Ancestry Composition achieves high precision and recall.
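A minimal sketch of the two-stage idea described above, not the authors' implementation: windows of a phased haplotype are classified with an SVM over k-mer counts (a simple stand-in for the paper's string kernel), and the per-window labels are then smoothed with a Viterbi pass over a "sticky" transition matrix (a crude stand-in for the autoregressive pair HMM). All data below are simulated, and the two toy populations, window length, and transition prior are assumptions for illustration only.

```python
# Stage 1: per-window SVM classification; Stage 2: HMM-style smoothing.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def kmer_counts(haplotype, k=3):
    """Count k-mers in a 0/1 haplotype window; used as SVM features."""
    counts = np.zeros(2 ** k)
    for i in range(len(haplotype) - k + 1):
        counts[int("".join(map(str, haplotype[i:i + k])), 2)] += 1
    return counts

def simulate(pop, n, length=60):
    """Toy reference haplotypes: two 'populations' with different allele frequencies."""
    p = 0.25 if pop == 0 else 0.75
    return rng.binomial(1, p, size=(n, length))

X_train = np.vstack([simulate(0, 200), simulate(1, 200)])
y_train = np.array([0] * 200 + [1] * 200)
F_train = np.array([kmer_counts(h) for h in X_train])

# Stage 1: window classifier with probability estimates.
svm = SVC(kernel="rbf", probability=True).fit(F_train, y_train)

# A toy admixed chromosome: first half from pop 0, second half from pop 1.
windows = np.vstack([simulate(0, 25), simulate(1, 25)])
probs = svm.predict_proba(np.array([kmer_counts(h) for h in windows]))

# Stage 2: Viterbi smoothing with a strong self-transition prior.
def viterbi(emissions, stay=0.98):
    """Most likely label path under a sticky transition model."""
    n, k = emissions.shape
    A = np.full((k, k), (1.0 - stay) / (k - 1))
    np.fill_diagonal(A, stay)
    logA, logE = np.log(A), np.log(emissions + 1e-12)
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = np.log(1.0 / k) + logE[0]
    for t in range(1, n):
        trans = score[t - 1][:, None] + logA
        back[t] = trans.argmax(axis=0)
        score[t] = trans.max(axis=0) + logE[t]
    path = np.zeros(n, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

print(viterbi(probs))  # smoothed per-window ancestry labels
```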

2014 ◽  
Author(s):  
Eric Y Durand ◽  
Chuong B Do ◽  
Joanna L Mountain ◽  
J. Michael Macpherson

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important applications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution have been limited to two or three ancestral populations and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country of origin from the member database of the personal genetics company 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
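A minimal sketch of the third-stage idea, confidence recalibration with isotonic regression, using scikit-learn. The calibration data, variable names, and the toy over-confidence model below are illustrative assumptions, not the authors' code or data.

```python
# Fit a monotone map from raw classifier confidence to empirical accuracy on
# held-out segments, then apply it to new confidence scores.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Held-out calibration data: raw confidences and whether the assigned
# ancestry label was actually correct (1) or not (0).
raw_conf = rng.uniform(0.5, 1.0, size=1000)
was_correct = rng.binomial(1, np.clip(raw_conf - 0.1, 0.0, 1.0))  # toy over-confident model

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_conf, was_correct)

# Recalibrated confidence for new segment-level scores.
new_scores = np.array([0.55, 0.7, 0.9, 0.99])
print(iso.predict(new_scores))
```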


2021 ◽  
Author(s):  
M. Tanveer ◽  
A. Tiwari ◽  
R. Choudhary ◽  
M. A. Ganaie

2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Ersen Yılmaz

A two-stage expert system is proposed for cardiac arrhythmia diagnosis. In the first stage, the Fisher score is used for feature selection to reduce the dimensionality of the data set's feature space. The second stage is the classification stage, in which a least-squares support vector machine classifier is applied to the feature subset selected in the first stage to diagnose cardiac arrhythmia. The performance of the proposed expert system is evaluated on an arrhythmia data set taken from the UCI Machine Learning Repository.
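A hedged sketch of the two-stage idea: Fisher-score feature ranking followed by an SVM on the selected features. scikit-learn has no least-squares SVM, so a standard SVC stands in for the paper's LS-SVM classifier, and synthetic data stand in for the UCI arrhythmia set; the number of kept features is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def fisher_score(X, y):
    """Fisher score per feature: between-class scatter / within-class scatter."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

X, y = make_classification(n_samples=400, n_features=50, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1: keep the top-k features by Fisher score.
k = 10
top = np.argsort(fisher_score(X_tr, y_tr))[::-1][:k]

# Stage 2: train the classifier on the reduced feature set.
clf = SVC(kernel="rbf").fit(X_tr[:, top], y_tr)
print("test accuracy:", clf.score(X_te[:, top], y_te))
```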


Author(s):  
Stanislaw Osowski ◽  
Tomasz Markiewicz

This chapter presents an automatic system for white blood cell recognition in myelogenous leukaemia based on images of bone-marrow smears. It addresses the fundamental problems of this task: extraction of individual cell images from the smear, generation of features describing each cell, selection of the best features, and final recognition using an efficient classifier network based on support vector machines. The chapter proposes a complete system that solves all of these problems: cell extraction using the watershed algorithm; generation of features based on texture, geometry, morphology, and the statistical description of image intensity; feature selection using linear support vector machines; and final classification using Gaussian-kernel support vector machines. The results of numerical experiments on the recognition of up to 17 classes of blood cells of myelogenous leukaemia show that the proposed system is quite accurate and may find practical application in hospitals in the diagnosis of patients suffering from leukaemia.
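A hedged sketch of the classification back-end only: feature selection driven by a sparse linear SVM followed by recognition with a Gaussian-kernel SVM, as the chapter describes. Cell segmentation (watershed) and the texture/geometry feature extraction are out of scope here, so synthetic feature vectors, class counts, and regularization settings below are stand-in assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

# Synthetic stand-in for per-cell feature vectors (texture, geometry, ...).
X, y = make_classification(n_samples=600, n_features=80, n_informative=15,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    # Feature selection via an L1-regularized linear SVM.
    SelectFromModel(LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)),
    # Final recognition with a Gaussian (RBF) kernel SVM.
    SVC(kernel="rbf", gamma="scale"),
)
pipeline.fit(X_tr, y_tr)
print("selected features:", pipeline[1].get_support().sum())
print("test accuracy:", pipeline.score(X_te, y_te))
```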


Author(s):  
Denali Molitor ◽  
Deanna Needell

Abstract In today’s data-driven world, storing, processing and gleaning insights from large-scale data are major challenges. Data compression is often required in order to store large amounts of high-dimensional data, and thus, efficient inference methods for analyzing compressed data are necessary. Building on a recently designed simple framework for classification using binary data, we demonstrate that one can improve the classification accuracy of this approach through iterative applications, where the output of one application serves as the input to the next. As a side consequence, we show that the original framework can be used as a data-preprocessing step to improve the performance of other methods, such as support vector machines. For several simple settings, we showcase the ability to obtain theoretical guarantees for the accuracy of the iterative classification method. The simplicity of the underlying classification framework makes it amenable to theoretical analysis.
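A hedged illustration of one point in the abstract: compressing data to binary measurements and then training a standard SVM on those bits. The sign-of-random-projection compression below is a generic one-bit scheme assumed for illustration, not the specific classification framework analyzed in the paper, and the dataset and dimensions are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=200, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One-bit compression: keep only the signs of m random projections.
m = 64
A = rng.standard_normal((X.shape[1], m))
B_tr = np.sign(X_tr @ A)
B_te = np.sign(X_te @ A)

# Train a standard SVM directly on the compressed binary features.
clf = LinearSVC(max_iter=5000).fit(B_tr, y_tr)
print("accuracy on 1-bit features:", clf.score(B_te, y_te))
```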

