Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform

2020 ◽  
Author(s):  
William A. Freyman ◽  
Kimberly F. McManus ◽  
Suyash S. Shringarpure ◽  
Ethan M. Jewett ◽  
Katarzyna Bryc ◽  
...  

Abstract Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computations against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.
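The TPBWT extends Durbin's positional Burrows-Wheeler transform (PBWT). As a minimal sketch of the underlying data structure only (not the authors' template-based, error-tolerant extension), the Python below builds the PBWT's positional prefix arrays, which keep haplotypes sharing long suffix matches adjacent at each site; the divergence arrays needed to report match lengths, and all error handling, are omitted. Function and variable names are illustrative.

```python
import numpy as np

def pbwt_orderings(haplotypes):
    """Positional prefix arrays for a 0/1 haplotype matrix (rows = haplotypes,
    columns = sites), following Durbin's PBWT. After processing site k, the
    ordering places haplotypes that share a long suffix match ending at k
    next to each other, so candidate IBD segments can be scanned among
    adjacent haplotypes (divergence arrays omitted for brevity)."""
    n_haps, n_sites = haplotypes.shape
    order = list(range(n_haps))
    orderings = [order[:]]
    for k in range(n_sites):
        zeros = [h for h in order if haplotypes[h, k] == 0]
        ones = [h for h in order if haplotypes[h, k] == 1]
        order = zeros + ones
        orderings.append(order[:])
    return orderings

# Tiny example: haplotypes 0 and 2 share the longest suffix at the last site
# and end up adjacent in the final ordering.
haps = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 1],
                 [1, 0, 1, 0]])
print(pbwt_orderings(haps)[-1])
```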


2019 ◽  
Vol 49 (1) ◽  
pp. 1-57 ◽  
Author(s):  
Han Zhang ◽  
Jennifer Pan

Protest event analysis is an important method for the study of collective action and social movements and typically draws on traditional media reports as the data source. We introduce collective action from social media (CASM)—a system that uses convolutional neural networks on image data and recurrent neural networks with long short-term memory on text data in a two-stage classifier to identify social media posts about offline collective action. We implement CASM on Chinese social media data and identify more than 100,000 collective action events from 2010 to 2017 (CASM-China). We evaluate the performance of CASM through cross-validation, out-of-sample validation, and comparisons with other protest data sets. We assess the effect of online censorship and find it does not substantially limit our identification of events. Compared to other protest data sets, CASM-China identifies relatively more rural, land-related protests and relatively few collective action events related to ethnic and religious conflict.
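As a loose illustration of the recurrent text branch of such a two-stage pipeline (not CASM itself, whose architecture and training data are specific to the paper), a minimal LSTM classifier over tokenized post text might look like the PyTorch sketch below; the class name and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class TextLSTMClassifier(nn.Module):
    """Minimal LSTM classifier: embeds token ids, runs a single-layer LSTM,
    and maps the final hidden state to a probability that a post describes
    offline collective action."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                  # (batch, seq_len)
        embedded = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)          # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_n[-1]))    # (batch, 1) probabilities

# Dummy usage on random token ids.
model = TextLSTMClassifier(vocab_size=10_000)
probabilities = model(torch.randint(1, 10_000, (4, 50)))
```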


2015 ◽  
Vol 105 (5) ◽  
pp. 481-485 ◽  
Author(s):  
Patrick Bajari ◽  
Denis Nekipelov ◽  
Stephen P. Ryan ◽  
Miaoyu Yang

We survey and apply several techniques from the statistical and computer science literature to the problem of demand estimation. To improve out-of-sample prediction accuracy, we propose a method of combining the underlying models via linear regression. Our method is robust to a large number of regressors; scales easily to very large data sets; combines model selection and estimation; and can flexibly approximate arbitrary non-linear functions. We illustrate our method using a standard scanner panel data set and find that our estimates are considerably more accurate in out-of-sample predictions of demand than some commonly used alternatives.
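A minimal sketch of combining base models via a linear regression on their predictions (model stacking) is shown below, on synthetic data rather than the scanner panel used in the paper; the base learners and the in-sample combination step are simplifications for illustration, not the authors' exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a demand-estimation design matrix and outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 20))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=2_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit heterogeneous base models, then combine them by regressing the outcome
# on their predictions; the fitted coefficients act as model weights.
# (A careful implementation would form these predictions out of fold.)
base_models = [RandomForestRegressor(n_estimators=100, random_state=0),
               Lasso(alpha=0.01)]
for m in base_models:
    m.fit(X_train, y_train)

train_preds = np.column_stack([m.predict(X_train) for m in base_models])
combiner = LinearRegression().fit(train_preds, y_train)

test_preds = np.column_stack([m.predict(X_test) for m in base_models])
combined_forecast = combiner.predict(test_preds)
```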


2009 ◽  
Vol 30 (3) ◽  
pp. 341-354 ◽  
Author(s):  
Oliver Rivero-Arias ◽  
Melissa Ouellet ◽  
Alastair Gray ◽  
Jane Wolstenholme ◽  
Peter M. Rothwell ◽  
...  

Background. Mapping disease-specific instruments into generic health outcomes or utility values is an expanding field of interest in health economics. This article constructs an algorithm to translate the modified Rankin scale (mRS) into EQ-5D utility values. Methods. mRS and EQ-5D information was derived from stroke or transient ischemic attack (TIA) patients identified as part of the Oxford Vascular study (OXVASC). Ordinary least squares (OLS) regression was used to predict UK EQ-5D tariffs from mRS scores. An alternative method, using multinomial logistic regression with a Monte Carlo simulation approach (MLogit) to predict responses to each EQ-5D question, was also explored. The performance of the models was compared according to the magnitude of their predicted-to-actual mean EQ-5D tariff difference, their mean absolute and mean squared errors (MAE and MSE), and associated 95% confidence intervals (CIs). Out-of-sample validation was carried out in a subset of coronary disease and peripheral vascular disease (PVD) patients also identified as part of OXVASC but not used in the original estimation. Results. The OLS and MLogit yielded similar MAE and MSE in the internal and external validation data sets. Both approaches also underestimated the uncertainty around the actual mean EQ-5D tariff producing tighter 95% CIs in both data sets. Conclusions. The choice of algorithm will be dependent on the study aim. Individuals outside the United Kingdom may find it more useful to use the multinomial results, which can be used with different country-specific tariff valuations. However, these algorithms should not replace prospective collection of utility data.
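As a hedged illustration of the simpler of the two approaches, the sketch below fits an OLS model of EQ-5D utilities on mRS grade using synthetic data and statsmodels; the coefficients and data are purely illustrative and do not reproduce the published mapping algorithm or its MLogit/Monte Carlo variant.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data: mRS grade (0-5) and an observed EQ-5D utility.
rng = np.random.default_rng(1)
mrs = rng.integers(0, 6, size=500)
eq5d = np.clip(0.95 - 0.15 * mrs + rng.normal(scale=0.1, size=500), -0.3, 1.0)
df = pd.DataFrame({"mrs": mrs, "eq5d": eq5d})

# OLS with mRS entered as a categorical predictor: the fitted coefficients
# give a mean EQ-5D tariff per mRS grade.
ols_fit = smf.ols("eq5d ~ C(mrs)", data=df).fit()
predicted_tariffs = ols_fit.predict(df)
print(ols_fit.params)
```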


2018 ◽  
Vol 14 (02) ◽  
pp. 117 ◽  
Author(s):  
Radoslava Stankova Kraleva ◽  
Velin Spasov Kralev ◽  
Nina Sinyagina ◽  
Petia Koprinkova-Hristova ◽  
Nadejda Bocheva

In this paper, the results of a comparative analysis between different approaches to experimental data storage and processing are presented. Several studies related to the problem and some methods for solving it are discussed. Different types of databases, ways of using them, and the areas of their application are analyzed. For the purposes of the study, a relational database for storing and analyzing specific data from behavioral experiments was designed. The methodology and conditions for conducting the experiments are described. Three indicators were analyzed: memory required to store the data, time to load the data from an external file into computer memory, and iteration time across all records in one cycle. The obtained results show that for storing a large number of records (on the order of tens of millions of rows), either dynamic arrays (stored on external media in binary file format) or an approach based on a local or remote database management system can be used. Regarding data loading time, the fastest approach was the one using dynamic arrays; it significantly outperforms the approaches based on a local or remote database. The results also show that the dynamic array and local database approaches iterated much faster across all data records than the remote database approach. The paper concludes with a proposal for further development toward the use of web services.
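The kind of comparison described (load time and full-record iteration for a binary-file array versus a database) can be sketched roughly as below, using NumPy and SQLite as stand-ins; the schema, record layout, and timing harness are illustrative assumptions, not the paper's benchmark code.

```python
import sqlite3
import time

import numpy as np

# Synthetic experimental records: (trial_id, reaction_time) pairs.
records = np.column_stack([np.arange(1_000_000, dtype=np.float64),
                           np.random.default_rng(2).random(1_000_000)])

# Approach 1: dynamic array persisted as a binary file.
np.save("records.npy", records)
t0 = time.perf_counter()
loaded = np.load("records.npy")
total_binary = loaded[:, 1].sum()            # one pass over every record
t_binary = time.perf_counter() - t0

# Approach 2: local relational database (SQLite as a stand-in DBMS).
con = sqlite3.connect("records.db")
con.execute("CREATE TABLE IF NOT EXISTS trials (trial_id REAL, rt REAL)")
con.executemany("INSERT INTO trials VALUES (?, ?)", records.tolist())
con.commit()
t0 = time.perf_counter()
total_db = con.execute("SELECT SUM(rt) FROM trials").fetchone()[0]
t_db = time.perf_counter() - t0

print(f"binary file: {t_binary:.3f}s  local database: {t_db:.3f}s")
```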


Author(s):  
Cole M Williams ◽  
Brooke A Scelza ◽  
Michelle Daya ◽  
Ethan M Lange ◽  
Christopher R Gignoux ◽  
...  

Abstract Accurate reconstruction of pedigrees from genetic data remains a challenging problem. Pedigree inference algorithms are often trained only on urban European-descent families, which are relatively ‘outbred’ compared to many other global populations. Relationship categories can be difficult to distinguish (e.g. half-sibships versus avuncular) without external information. Furthermore, published software cannot accommodate endogamous populations where there may be reticulations within a pedigree or elevated haplotype sharing. We design a simple, rapid algorithm which initially uses only high-confidence first-degree relationships to seed a machine learning step based on the number of identical-by-descent segments. Additionally, we define a new statistic to polarize individuals to the ancestor versus descendant generation. We test our approach in a sample of 700 individuals from northern Namibia, sampled from an endogamous population. Due to a culture of concurrent relationships in this population, there is a high proportion of half-sibships. We accurately identify first- through third-degree relationships for all categories, including half-sibships, half-avuncular relationships, etc. We further validate our approach in the Barbados Asthma Genetics Study (BAGS) dataset. Accurate reconstruction of pedigrees holds promise for tracing allele frequency trajectories, improved phasing, and other population genomic questions.
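As a rough sketch of the machine learning step described (classifying relationship degree from IBD segment summaries), the code below trains a random forest on simulated segment counts and total shared length; the feature values, labels, and classifier choice are assumptions for illustration and do not reproduce the authors' seeding or polarization statistics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)

# Hypothetical per-pair features: number of shared IBD segments and total
# shared length (cM), both roughly halving with each additional degree.
degrees = np.repeat([1, 2, 3], 200)
seg_counts = rng.poisson(lam=np.where(degrees == 1, 40,
                         np.where(degrees == 2, 22, 10)))
total_cm = rng.normal(loc=3400 / 2 ** degrees, scale=150)
X = np.column_stack([seg_counts, total_cm])

# Random forest over IBD summary statistics; in the paper's setting the
# training labels would be seeded from high-confidence first-degree pairs.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, degrees)
predicted_degree = clf.predict([[30, 1700]])
```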


2014 ◽  
Author(s):  
Daehwan Kim ◽  
Ben Langmead ◽  
Steven Salzberg

HISAT is a new, highly efficient system for alignment of sequences from RNA sequencing experiments that achieves dramatically faster performance than previous methods. HISAT uses a new indexing scheme, hierarchical indexing, which is based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index. Hierarchical indexing employs two types of indexes for alignment: (1) a whole-genome FM index to anchor each alignment, and (2) numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp. The algorithm includes several customized alignment strategies specifically designed for mapping RNA-seq reads across multiple exons. In tests on a variety of real and simulated data sets, we show that HISAT is the fastest system currently available, approximately 50 times faster than TopHat2 and 12 times faster than GSNAP, with equal or better accuracy than any other method. Despite its very large number of indexes, HISAT requires only 4.3 Gigabytes of memory to align reads to the human genome. HISAT supports genomes of any size, including those larger than 4 billion bases. HISAT is available as free, open-source software from http://www.ccb.jhu.edu/software/hisat.
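A toy sketch of the anchor-then-extend idea behind hierarchical indexing follows; it substitutes a simple k-mer dictionary and a fixed-size window for HISAT's global and local FM indexes, so it illustrates the control flow only, not the actual index structures or spliced-alignment logic.

```python
from collections import defaultdict

def build_kmer_index(genome, k=16):
    """Global index stand-in: exact k-mer positions across the genome."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def anchor_and_extend(read, genome, index, k=16):
    """Anchor the read with its first k-mer, then check the local window
    around each anchor (the role HISAT's local FM indexes play)."""
    hits = []
    for pos in index.get(read[:k], []):
        window = genome[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        hits.append((pos, mismatches))
    return hits

genome = "ACGT" * 5_000
index = build_kmer_index(genome)
print(anchor_and_extend("ACGTACGTACGTACGTACGT", genome, index)[:3])
```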


Genetics ◽  
1993 ◽  
Vol 135 (4) ◽  
pp. 1209-1220 ◽  
Author(s):  
J E Neigel ◽  
J C Avise

Abstract In rapidly evolving molecules, such as animal mitochondrial DNA, mutations that delineate specific lineages may not be dispersed at sufficient rates to attain an equilibrium between genetic drift and gene flow. Here we predict conditions that lead to nonequilibrium geographic distributions of mtDNA lineages, test the robustness of these predictions and examine mtDNA data sets for consistency with our model. Under a simple isolation by distance model, the variance of an mtDNA lineage's geographic distribution is expected to be proportional to its age. Simulation results indicated that this relationship is fairly robust. Analysis of mtDNA data from natural populations revealed three qualitative distributional patterns: (1) significant departure of lineage structure from equilibrium geographic distributions, a pattern exhibited in three rodent species with limited dispersal; (2) nonsignificant departure from equilibrium expectations, exhibited by two avian and two marine fish species with potentials for relatively long-distance dispersal; and (3) a progression from nonequilibrium distributions for younger lineages to equilibrium distributions for older lineages, a condition displayed by one surveyed avian species. These results demonstrate the advantages of considering mutation and genealogy in the interpretation of mtDNA geographic variation.
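The expectation that a lineage's geographic variance grows in proportion to its age can be illustrated with a minimal one-dimensional dispersal simulation, sketched below; this is a caricature of the isolation-by-distance model, not the simulations reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def lineage_position_variance(age_generations, n_descendants=500, step_sd=1.0):
    """Each descendant's location is the sum of `age_generations` independent
    dispersal steps, so the variance of locations grows linearly with age."""
    steps = rng.normal(scale=step_sd, size=(n_descendants, age_generations))
    return steps.sum(axis=1).var()

for age in (10, 50, 100, 200):
    print(age, round(lineage_position_variance(age), 1))
```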


10.2196/23938 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e23938
Author(s):  
Ruairi O'Driscoll ◽  
Jake Turicchi ◽  
Mark Hopkins ◽  
Cristiana Duarte ◽  
Graham W Horgan ◽  
...  

Background Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise in research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age=31.9 years, weight=70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure using indirect calorimetry. Three regression algorithms were used to predict metabolic equivalents (METs; ie, random forest, gradient boosting, and neural networks), and five classification algorithms (ie, k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used for physical activity intensity classification as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boost applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification. Errors tended to increase in out-of-sample validations with the SenseWear neural network achieving RMSE values of 1.22 METs in the regression tasks and the SenseWear gradient boost and random forest achieving an accuracy of 80% in classification tasks. Conclusions Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.
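A minimal sketch of the leave-one-subject-out evaluation described, using a gradient boosting regressor on synthetic feature and MET data, is shown below; the features, sample sizes, and hyperparameters are placeholders rather than the study's processing pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(5)

# Synthetic stand-in: sensor features, MET targets, and a subject id per
# observation (the grouping variable for leave-one-subject-out validation).
X = rng.normal(size=(600, 8))
mets = 2.0 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)
subjects = np.repeat(np.arange(30), 20)

rmses = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, mets, groups=subjects):
    model = GradientBoostingRegressor().fit(X[train_idx], mets[train_idx])
    pred = model.predict(X[test_idx])
    rmses.append(np.sqrt(mean_squared_error(mets[test_idx], pred)))

print(f"leave-one-subject-out RMSE: {np.mean(rmses):.2f} METs")
```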


JAMIA Open ◽  
2021 ◽  
Vol 4 (3) ◽  
Author(s):  
Anthony Finch ◽  
Alexander Crowell ◽  
Yung-Chieh Chang ◽  
Pooja Parameshwarappa ◽  
Jose Martinez ◽  
...  

Abstract Objective Attention networks learn an intelligent weighted averaging mechanism over a series of entities, providing increases to both performance and interpretability. In this article, we propose a novel time-aware transformer-based network and compare it to another leading model with similar characteristics. We also decompose model performance along several critical axes and examine which features contribute most to our model’s performance. Materials and methods Using data sets representing patient records obtained between 2017 and 2019 by the Kaiser Permanente Mid-Atlantic States medical system, we construct four attentional models with varying levels of complexity on two targets (patient mortality and hospitalization). We examine how incorporating transfer learning and demographic features contributes to model success. We also test the performance of a model proposed in recent medical modeling literature. We compare these models with out-of-sample data using the area under the receiver-operator characteristic (AUROC) curve and average precision as measures of performance. We also analyze the attentional weights assigned by these models to patient diagnoses. Results We found that our model significantly outperformed the alternative on a mortality prediction task (91.96% AUROC against 73.82% AUROC). Our model also outperformed on the hospitalization task, although the models were significantly more competitive in that space (82.41% AUROC against 80.33% AUROC). Furthermore, we found that demographic features and transfer learning features, which are frequently omitted from new models proposed in the EMR modeling space, contributed significantly to the success of our model. Discussion We proposed an original construction of deep learning electronic medical record models which achieved very strong performance. We found that our unique model construction outperformed on several tasks in comparison to a leading literature alternative, even when input data was held constant between them. We obtained further improvements by incorporating several methods that are frequently overlooked in new model proposals, suggesting that it will be useful to explore these options further in the future.
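As a small illustration of the attention-weighted-averaging idea the abstract refers to (not the authors' time-aware transformer), the PyTorch sketch below pools a sequence of embedded events with learned weights that can be inspected for interpretability; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned weighted average over a sequence of embedded events
    (e.g. diagnosis codes); the weights are returned for inspection."""
    def __init__(self, embed_dim):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, embedded_events):            # (batch, seq, embed_dim)
        weights = torch.softmax(self.score(embedded_events), dim=1)
        pooled = (weights * embedded_events).sum(dim=1)
        return pooled, weights.squeeze(-1)          # patient vector, weights

# Dummy usage: 4 patients, 20 coded events each, 32-dimensional embeddings.
pooling = AttentionPooling(embed_dim=32)
patient_vector, attn_weights = pooling(torch.randn(4, 20, 32))
```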

