unbalanced sampling
Recently Published Documents


2019, Vol 19 (1)
Author(s): Travis J Lawrence, Katherine CH Amrine, Wesley D Swingley, David H Ardell

Abstract

Background: Eukaryotes acquired the trait of oxygenic photosynthesis through endosymbiosis of the cyanobacterial progenitor of plastid organelles. Despite recent advances in the phylogenomics of Cyanobacteria, the phylogenetic root of plastids remains controversial. Although a single origin of plastids by endosymbiosis is broadly supported, recent phylogenomic studies are contradictory on whether plastids branch early or late within Cyanobacteria. One underlying cause may be poor fit of evolutionary models to complex phylogenomic data.

Results: Using Posterior Predictive Analysis, we show that recently applied evolutionary models poorly fit three phylogenomic datasets curated from cyanobacteria and plastid genomes because of heterogeneities in both substitution processes across sites and of compositions across lineages. To circumvent these sources of bias, we developed CYANO-MLP, a machine learning algorithm that consistently and accurately phylogenetically classifies (“phyloclassifies”) cyanobacterial genomes to their clade of origin based on bioinformatically predicted function-informative features in tRNA gene complements. Classification of cyanobacterial genomes with CYANO-MLP is accurate and robust to deletion of clades, unbalanced sampling, and compositional heterogeneity in input tRNA data. CYANO-MLP consistently classifies plastid genomes into a late-branching cyanobacterial sub-clade containing single-cell, starch-producing, nitrogen-fixing ecotypes, consistent with metabolic and gene transfer data.

Conclusions: Phylogenomic data of cyanobacteria and plastids exhibit both site-process heterogeneities and compositional heterogeneities across lineages. These aspects of the data require careful modeling to avoid bias in phylogenomic estimation. Furthermore, we show that amino acid recoding strategies may be insufficient to mitigate bias from compositional heterogeneities. However, the combination of our novel tRNA-specific strategy with machine learning in CYANO-MLP appears robust to these sources of bias with high accuracy in phyloclassification of cyanobacterial genomes. CYANO-MLP consistently classifies plastids as late-branching Cyanobacteria, consistent with independent evidence from signature-based approaches and some previous phylogenetic studies.


2019, Vol 7 (2), pp. 28
Author(s): Marek Gruszczyński

The paper discusses methodological topics of bankruptcy prediction modelling—unbalanced sampling, sample bias, and unbiased predictions of bankruptcy. Bankruptcy models are typically estimated with the use of non-random samples, which creates sample choice biases. We consider two types of unbalanced samples: (a) when bankrupt and non-bankrupt companies enter the sample in unequal numbers; and (b) when sample composition allows for different ratios of bankrupt and non-bankrupt companies than those in the population. An imbalance of type (b), being more general, is examined in several sections of the paper. We offer an extended view of the relationship between the biased and unbiased estimated probabilities of bankruptcy—probability of default (PD). A common error in applications is neglecting the possibility of calibrating the PD obtained from a bankruptcy model to the unbiased PD that is population adjusted. We show that Skogsviks’ formula of 2013 coincides with prior correction known for the logit model. This, together with solutions for other binomial models, serves as practical advice for obtaining the calibration of unbiased PDs from popular bankruptcy models. In the final section, we explore sample bias effects on classification.
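As a rough illustration of the calibration this abstract refers to, the sketch below applies the standard prior correction for the logit model: the PD estimated on an unbalanced sample is rescaled in odds space by the ratio of the population odds of bankruptcy to the sample odds. The function name `calibrate_pd` and the example figures (a 50% bankrupt share in the sample, 2% in the population) are assumptions made for illustration; they are not taken from the paper, and the paper's own notation may differ.

```python
def calibrate_pd(pd_sample: float, sample_rate: float, population_rate: float) -> float:
    """Adjust a logit-model PD from an unbalanced sample to the population.

    pd_sample       -- PD produced by the model fitted on the unbalanced sample
    sample_rate     -- share of bankrupt firms in the estimation sample
    population_rate -- share of bankrupt firms in the population
    (All names are hypothetical; all inputs must lie strictly between 0 and 1.)
    """
    for name, value in (("pd_sample", pd_sample),
                        ("sample_rate", sample_rate),
                        ("population_rate", population_rate)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")

    # Prior correction shifts the logit intercept by the log of this ratio,
    # which is the same as multiplying the predicted odds by it.
    sample_odds = sample_rate / (1.0 - sample_rate)
    population_odds = population_rate / (1.0 - population_rate)
    adjusted_odds = pd_sample / (1.0 - pd_sample) * population_odds / sample_odds

    return adjusted_odds / (1.0 + adjusted_odds)


# A firm scored at 0.5 on a 50/50 matched sample, with an assumed 2% population rate.
print(calibrate_pd(0.5, sample_rate=0.5, population_rate=0.02))  # about 0.02
```

When the sample and population rates are equal, the function returns the input PD unchanged, which is a quick sanity check on the correction.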


2018
Author(s): Alex Diaz-Papkovich, Luke Anderson-Trocmé, Simon Gravel

Abstract

Genetic structure in large cohorts results from technical, sampling and demographic variation. Visualisation is therefore a first step in most genomic analyses. However, existing data exploration methods struggle with unbalanced sampling and the many scales of population structure. We investigate an approach to dimension reduction of genomic data that combines principal components analysis (PCA) with uniform manifold approximation and projection (UMAP) to succinctly illustrate population structure in large cohorts and capture their relationships on local and global scales. Using data from large-scale genomic datasets, we demonstrate that PCA-UMAP effectively clusters closely related individuals while placing them in a global continuum of genetic variation. This approach reveals previously overlooked subpopulations within the American Hispanic population and fine-scale relationships between geography, genotypes, and phenotypes in the UK population. This opens new lines of investigation for demographic research and statistical genetics. Given its small computational cost, PCA-UMAP also provides a general-purpose approach to exploratory analysis in population-scale datasets.

Author summary

Because of geographic isolation, individuals tend to be more genetically related to people living nearby than to people living far. This is an example of population structure, a situation where a large population contains subgroups that share more than the average amount of DNA. This structure can tell us about human history, and it can also have a large effect on medical studies. We use a newly developed method (UMAP) to visualize population structure from three genomic datasets. Using genotype data alone, we reveal numerous subgroups related to ancestry and correlated with traits such as white blood cell count, height, and FEV1, a measure used to detect airway obstruction. We demonstrate that UMAP reveals previously unobserved patterns and fine-scale structure. We show that visualizations work especially well in large datasets containing populations with diverse backgrounds, which are rapidly becoming more common, and that unlike other visualization methods, we can preserve intuitive connections between populations that reflect their shared ancestries. The combination of these results and the effectiveness of the strategy on large and diverse datasets make this an important approach for exploratory analysis for geneticists studying ancestral events and phenotype distributions.
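The two-stage idea in this abstract (PCA first, then UMAP on the leading components) might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `pca_reduce` and `pca_umap`, the choice of 10 components, and the random genotype matrix are all hypothetical, and the UMAP step assumes the third-party `umap-learn` package is installed.

```python
import numpy as np


def pca_reduce(genotypes: np.ndarray, n_components: int) -> np.ndarray:
    """Project samples (rows) onto the top principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)  # centre each variant column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # PC scores per sample


def pca_umap(genotypes: np.ndarray, n_pcs: int = 10, **umap_kwargs) -> np.ndarray:
    """Run UMAP on the leading PCs to obtain a 2-D embedding for plotting."""
    import umap  # third-party package "umap-learn"; assumed to be installed

    pcs = pca_reduce(genotypes, n_pcs)
    return umap.UMAP(n_components=2, **umap_kwargs).fit_transform(pcs)


# Hypothetical input: 50 individuals x 200 variants, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 200)).astype(float)
pcs = pca_reduce(genotypes, n_components=10)
print(pcs.shape)  # (50, 10)
```

Feeding UMAP a handful of principal components rather than the raw genotype matrix keeps the neighbour search cheap, which is in line with the small computational cost the abstract mentions; the number of components to keep is a tuning choice.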

