Fast principal components analysis reveals convergent evolution of ADH1B gene in Europe and East Asia

2015 ◽  
Author(s):  
Kevin J Galinsky ◽  
Gaurav Bhatia ◽  
Po-Ru Loh ◽  
Stoyan Georgiev ◽  
Sayan Mukherjee ◽  
...  

Searching for genetic variants with unusual differentiation between subpopulations is an established approach for identifying signals of natural selection. However, existing methods generally require discrete subpopulations. We introduce a method that infers selection using principal components (PCs) by identifying variants whose differentiation along top PCs is significantly greater than the null distribution of genetic drift. To enable the application of this method to large data sets, we developed the FastPCA software, which employs recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using the PC-based test for natural selection, we replicate previously known selected loci and identify three new genome-wide significant signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984*T has previously been associated with a decreased risk of alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents. We also detect new selection signals at IGFBP3 and IGH, which have also previously been associated with human disease.
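To make the computational idea concrete, below is a minimal sketch assuming a toy genotype matrix and scikit-learn's randomized SVD rather than the authors' FastPCA implementation: approximate the top PCs, then score each variant by its squared correlation with a PC; unusually large scores flag candidates for selection relative to a drift null.

```python
# Minimal sketch (not the authors' FastPCA code): approximate top PCs of a toy
# genotype matrix with randomized SVD, then score SNP differentiation along PC1.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(1000, 5000)).astype(float)  # individuals x SNPs, genotypes in {0,1,2}

# Standard genotype normalization: mean-center and scale by sqrt(p(1-p))
p = G.mean(axis=0) / 2.0
X = (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

# Randomized SVD approximates the top PCs; for fixed k its cost grows
# roughly linearly with the number of individuals rather than quadratically
U, S, Vt = randomized_svd(X, n_components=4, random_state=0)
pcs = U * S  # per-individual PC scores

# Score each SNP by n * (correlation with PC1)^2; large values are candidates
# for unusual differentiation (the paper's calibration against drift is more careful)
pc1 = pcs[:, 0]
n = X.shape[0]
r = (X.T @ (pc1 - pc1.mean())) / (n * X.std(axis=0) * pc1.std())
stat = n * r ** 2
print(stat.max(), stat.mean())
```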


2020 ◽  
Vol 37 (8) ◽  
pp. 2450-2460 ◽  
Author(s):  
Daniel J Wilson ◽  
Derrick W Crook ◽  
Timothy E A Peto ◽  
A Sarah Walker ◽  
Sarah J Hoosdally ◽  
...  

The dN/dS ratio provides evidence of adaptation or functional constraint in protein-coding genes by quantifying the relative excess or deficit of amino acid-replacing versus silent nucleotide variation. Inexpensive sequencing promises a better understanding of parameters such as dN/dS, but analyzing very large data sets poses a major statistical challenge. Here, I introduce genomegaMap for estimating within-species genome-wide variation in dN/dS, and I apply it to 3,979 genes across 10,209 tuberculosis genomes to characterize the selection pressures shaping this global pathogen. GenomegaMap is a phylogeny-free method that addresses two major problems with existing approaches: 1) it is fast no matter how large the sample size, and 2) it is robust to recombination, which causes phylogenetic methods to report artefactual signals of adaptation. GenomegaMap uses population genetics theory to approximate the distribution of allele frequencies under general, parent-dependent mutation models. Coalescent simulations show that substitution parameters are well estimated even when genomegaMap’s simplifying assumption of independence among sites is violated. I demonstrate the ability of genomegaMap to detect genuine signatures of selection at antimicrobial resistance-conferring substitutions in Mycobacterium tuberculosis and describe a novel signature of selection in the cold-shock DEAD-box protein A gene deaD/csdA. The genomegaMap approach helps accelerate the exploitation of big data for gaining new insights into evolution within species.
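As a point of reference for what dN/dS measures, here is a much-simplified, counting-based sketch in the spirit of Nei and Gojobori; it is not genomegaMap's population-genetic model, and the toy sequences and the lack of a multiple-hit correction are illustrative assumptions only.

```python
# Crude pN/pS from two aligned coding sequences: count synonymous and
# nonsynonymous sites and differences codon by codon (no multi-hit correction).
from itertools import product

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AA)}  # standard genetic code

def syn_sites(codon):
    """Expected number of synonymous sites in a codon (out of 3)."""
    s = 0.0
    for pos in range(3):
        changes = [codon[:pos] + b + codon[pos + 1:] for b in BASES if b != codon[pos]]
        s += sum(CODE[c] == CODE[codon] for c in changes) / 3.0
    return s

def dn_ds(seq1, seq2):
    """Rough pN/pS for two aligned coding sequences of equal length."""
    S = N = Sd = Nd = 0.0
    for i in range(0, len(seq1) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        s = (syn_sites(c1) + syn_sites(c2)) / 2.0
        S += s
        N += 3.0 - s
        for pos in range(3):  # classify each mismatched position independently
            if c1[pos] != c2[pos]:
                mut = c1[:pos] + c2[pos] + c1[pos + 1:]
                if CODE[mut] == CODE[c1]:
                    Sd += 1
                else:
                    Nd += 1
    return (Nd / N) / (Sd / S) if Sd > 0 else float("inf")

print(dn_ds("ATGAAACGTTTT", "ATGAAGCGGTTC"))  # toy example: all differences are silent
```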



2019 ◽  
Vol 88 (1) ◽  
pp. 247-280 ◽  
Author(s):  
Gary J. Doherty ◽  
Michele Petruzzelli ◽  
Emma Beddowes ◽  
Saif S. Ahmad ◽  
Carlos Caldas ◽  
...  

The complexity of human cancer underlies its devastating clinical consequences. Drugs designed to target the genetic alterations that drive cancer have improved the outcome for many patients, but not for the majority of them. Here, we review the genomic landscape of cancer, how genomic data can provide much more than the sum of its parts, and the approaches developed to identify and validate genomic alterations with potential therapeutic value. We highlight notable successes and pitfalls in predicting the value of potential therapeutic targets and discuss the use of multi-omic data to better understand cancer dependencies and drug sensitivity. We discuss how integrated approaches to collecting, curating, and sharing these large data sets might improve the identification and prioritization of cancer vulnerabilities as well as patient stratification within clinical trials. Finally, we outline how future approaches might improve the efficiency and speed of translating genomic data into clinically effective therapies and how the use of unbiased genome-wide information can identify novel predictive biomarkers that can be either simple or complex.



Paleobiology ◽  
1982 ◽  
Vol 8 (2) ◽  
pp. 143-150 ◽  
Author(s):  
Martin A. Buzas ◽  
Carl F. Koch ◽  
Stephen J. Culver ◽  
Norman F. Sohl

The distribution of species abundance (number of individuals per species) is well documented. The distribution of species occurrence (number of localities per species), however, has received little attention. This study investigates the distribution of species occurrence for five large data sets. For modern benthic foraminifera, species occurrence is examined from the Atlantic continental margin of North America, where 875 species were recorded 10,017 times at 542 localities; the Gulf of Mexico, where 848 species were recorded 18,007 times at 426 localities; and the Caribbean, where 1149 species were recorded 6684 times at 268 localities. For Late Cretaceous molluscs, species occurrence is examined from the Gulf Coast, where 716 species were recorded 6236 times at 166 localities, and a subset of these data consisting of 643 species recorded 3851 times at 86 localities. Logseries and lognormal distributions were fitted to these data sets. In most instances the logseries best predicts the distribution of species occurrence. The lognormal, however, also fits the data fairly well and, in one instance, better. The use of these distributions allows the prediction of the number of species occurring once, twice, …, n times. Species abundance data are also available for the molluscan data sets. They indicate that the most abundant species (greatest number of individuals) usually occur most frequently. In all data sets approximately half the species occur four or fewer times. The probability of noting the presence of rarely occurring species is small, and, consequently, such species must be used with extreme caution in studies requiring knowledge of the distribution of species in space and time.
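The prediction step can be illustrated with a small sketch, assuming SciPy's log-series distribution and simulated occurrence counts rather than the foraminiferal or molluscan data: fit the parameter by maximum likelihood, then read off the expected number of species occurring once, twice, and so on.

```python
# Fit a log-series distribution to toy species-occurrence counts and predict
# how many species should occur exactly 1, 2, ..., 10 times.
import numpy as np
from scipy.stats import logser
from scipy.optimize import minimize_scalar

# Toy occurrence counts: most species are found at only a few localities
occurrences = logser.rvs(0.9, size=800, random_state=1)

def neg_loglik(p):
    return -logser.logpmf(occurrences, p).sum()

fit = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
p_hat = fit.x

# Expected number of species occurring exactly 1, 2, ..., 10 times
expected = len(occurrences) * logser.pmf(np.arange(1, 11), p_hat)
print(np.round(expected, 1))
```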



2013 ◽  
Vol 7 (1) ◽  
pp. 19-24
Author(s):  
Kevin Blighe

Elaborate downstream methods are required to analyze large microarray data sets. At times, when the end goal is to look for relationships between (or patterns within) different subgroups or even just individual samples, large data sets must first be filtered using statistical thresholds in order to reduce their overall volume. As an example, in anthropological microarray studies, such ‘dimension reduction’ techniques are essential to elucidate any links between polymorphisms and phenotypes for given populations. In such large data sets, a subset can first be taken to represent the larger data set. For example, polling results taken during elections are used to infer the opinions of the population at large. However, what is the best and easiest method of capturing a subset of variation in a data set that can represent the overall portrait of variation? In this article, principal components analysis (PCA) is discussed in detail, including its history, the mathematics behind the process, and the ways in which it can be applied to modern large-scale biological data sets. New methods of analysis using PCA are also suggested, with tentative results outlined.
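As a generic illustration of the dimension-reduction step (not code from the article), the sketch below applies PCA to a toy data matrix and keeps only as many components as are needed to explain 90% of the total variance.

```python
# PCA as dimension reduction: retain the smallest set of components that
# captures a chosen fraction of the variance in a samples-by-variables matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))            # samples x variables (toy microarray-like matrix)
X[:, :5] += rng.normal(size=(200, 1)) * 3   # inject some shared structure into a few variables

pca = PCA(n_components=0.90, svd_solver="full")  # keep components explaining 90% of variance
scores = pca.fit_transform(X)

print(scores.shape)                      # reduced representation of the samples
print(pca.explained_variance_ratio_[:5]) # variance captured by the leading components
```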



2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Nada A. Alqahtani ◽  
Zakiah I. Kalantan

Data scientists use various machine learning algorithms to discover patterns in large data sets that can lead to actionable insights. In general, high-dimensional data are reduced by obtaining a set of principal components so as to highlight similarities and differences. In this work, we model the reduced data with a bivariate Gaussian mixture model. We discuss a heuristic for detecting important components by choosing the initial values of the location parameters using two different techniques: cluster means obtained from k-means and hierarchical clustering, and the default values in the “mixtools” R package. The parameters of the model are obtained via an expectation-maximization algorithm. Model selection criteria from a Bayesian point of view are evaluated for both techniques, demonstrating that both are efficient with respect to computational capacity. The effectiveness of the discussed techniques is demonstrated through a simulation study and using real data sets from different fields.
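The workflow can be sketched as follows, using scikit-learn in place of the “mixtools” R package discussed in the article; the simulated two-cluster data and the choice of BIC as the Bayesian criterion are assumptions made for illustration.

```python
# Fit a two-component bivariate Gaussian mixture by EM, comparing k-means
# initialization of the means against random initialization via BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(300, 2)),
               rng.normal([4, 4], 1.5, size=(300, 2))])  # two bivariate clusters

gmm_kmeans = GaussianMixture(n_components=2, init_params="kmeans", random_state=0).fit(X)
gmm_random = GaussianMixture(n_components=2, init_params="random", random_state=0).fit(X)

print("BIC (k-means init):", gmm_kmeans.bic(X))
print("BIC (random init): ", gmm_random.bic(X))
print("Estimated component means:\n", gmm_kmeans.means_)
```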



2018 ◽  
Vol 2 (3) ◽  
pp. 324-335 ◽  
Author(s):  
Johannes Kvam ◽  
Lars Erik Gangsei ◽  
Jørgen Kongsro ◽  
Anne H Schistad Solberg

Computed tomography (CT) scanning of pigs has been shown to produce detailed phenotypes useful in pig breeding. Due to the large number of individuals scanned and the correspondingly large data sets, there is a need for automatic tools for analyzing these data. In this paper, the feasibility of deep learning for fully automatic segmentation of the skeleton of pigs from CT volumes is explored. To maximize performance given the training data available, a series of problem simplifications is applied. The deep-learning approach can replace our currently used semiautomatic solution, with increased robustness and little or no need for manual control. Accuracy was highly affected by training data, and expanding the training set can further increase performance, making this approach especially promising.
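A minimal, generic sketch of the kind of model involved (not the authors' architecture or training pipeline) is a small fully convolutional network that maps a CT slice to a per-pixel bone/background mask; the layer sizes and dummy data below are illustrative assumptions.

```python
# Tiny fully convolutional network for binary segmentation of 2D CT slices.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),  # per-pixel logit for the "bone" class
        )

    def forward(self, x):
        return self.net(x)

model = TinySegNet()
slice_batch = torch.randn(4, 1, 128, 128)               # 4 single-channel CT slices
target = (torch.rand(4, 1, 128, 128) > 0.5).float()     # dummy bone/background masks

loss = nn.BCEWithLogitsLoss()(model(slice_batch), target)
loss.backward()
print(float(loss))
```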



2005 ◽  
Vol 20 (5) ◽  
pp. 603-620 ◽  
Author(s):  
Carlo A. Favero ◽  
Massimiliano Marcellino ◽  
Francesca Neglia


2015 ◽  
Vol 14 ◽  
pp. CIN.S31363 ◽  
Author(s):  
Bjarne Johannessen ◽  
Anita Sveen ◽  
Rolf I. Skotheim

Alternative splicing is a key regulatory mechanism for gene expression, vital for the proper functioning of eukaryotic cells. Disruption of normal pre-mRNA splicing has the potential to cause and reinforce human disease. Owing to rapid advances in high-throughput technologies, it is now possible to identify novel mRNA isoforms and detect aberrant splicing patterns on a genome scale, across large data sets. Analogous to the genomic types of instability describing cancer genomes (e.g., chromosomal instability and microsatellite instability), transcriptome instability (TIN) has recently been proposed as a splicing-related genome-wide characteristic of certain solid cancers. We present the R package TIN, available from Bioconductor, which implements a set of methods for TIN analysis based on exon-level microarray expression profiles. TIN provides tools for estimating aberrant exon usage across samples and for analyzing correlation patterns between TIN and splicing factor expression levels.
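A rough sketch of the underlying idea, not the TIN package itself: score aberrant exon usage by comparing each exon's expression, normalized to its gene's overall level, against the rest of the cohort; the toy data and the z-score threshold below are illustrative assumptions.

```python
# Toy aberrant-exon-usage scoring: normalize exon expression to the gene level,
# z-score each exon across samples, and count strong deviations per sample.
import numpy as np

rng = np.random.default_rng(0)
exon_expr = rng.lognormal(mean=2.0, sigma=0.3, size=(20, 50))  # 20 exons x 50 samples

gene_expr = exon_expr.mean(axis=0)     # per-sample gene-level summary
rel_usage = exon_expr / gene_expr      # exon usage relative to the gene
z = (rel_usage - rel_usage.mean(axis=1, keepdims=True)) / rel_usage.std(axis=1, keepdims=True)

# One possible per-sample instability score: number of strongly deviating exons
tin_score = (np.abs(z) > 2).sum(axis=0)
print(tin_score)
```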


