Why most Principal Component Analyses (PCA) in population genetic studies are wrong

2021 ◽  
Author(s):  
Eran Elhaik

Principal Component Analysis (PCA) is a multivariate method that reduces the complexity of datasets while preserving their covariance structure, allowing the information to be visualized on colorful scatterplots with, ideally, only minimal loss of information. PCA is used extensively as a foremost analysis in population genetics and related fields (e.g., animal, plant, or medical genetics) and is implemented in well-cited packages such as EIGENSOFT and PLINK. PCA outcomes are used to shape study design, identify and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, whereabouts, and relatedness. The replicability crisis in science prompted us to evaluate whether PCA results are reliable, robust, and replicable. We employed an intuitive color-based model alongside human population data for eleven common test cases. We demonstrate that PCA results are artifacts of the data and can be easily manipulated to generate desired outcomes; PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the literature of population genetics and related fields that place a disproportionate reliance on PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations. An alternative mixed-admixture population genetic model is discussed.
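To make the core operation concrete, the snippet below is a minimal sketch (not the authors' code or data) of running PCA on a simulated genotype matrix, assuming NumPy and scikit-learn; the sample size, SNP count, and 0/1/2 genotype coding are illustrative assumptions.

```python
# Minimal sketch: PCA on a simulated genotype matrix (toy data, not the study's).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 individuals x 1,000 SNPs, genotypes coded as allele counts 0/1/2
genotypes = rng.integers(0, 3, size=(200, 1000)).astype(float)

# Centre each SNP column before the decomposition (scaling is optional)
genotypes -= genotypes.mean(axis=0)

pca = PCA(n_components=2)
scores = pca.fit_transform(genotypes)     # PC1/PC2 coordinates per individual
print(pca.explained_variance_ratio_)      # variance fraction captured per PC
```

With structure-free random genotypes like these, the leading PCs capture only a small fraction of the variance; the scatterplot patterns that studies interpret depend entirely on which samples and markers enter the matrix.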

2017 ◽  
Vol 107 (9) ◽  
pp. 1000-1010 ◽  
Author(s):  
N. J. Grünwald ◽  
S. E. Everhart ◽  
B. J. Knaus ◽  
Z. N. Kamvar

Population genetic analysis is a powerful tool for understanding how pathogens emerge and adapt. However, determining the genetic structure of populations requires a range of subtle skills that are often not stated explicitly in book chapters or review articles on population genetics. What is a good sampling strategy? How many isolates should I sample? How do I include positive and negative controls in my molecular assays? What marker system should I use? This review attempts to address many of these practical questions, which are often not readily answered by books or reviews on the topic but instead emerge from discussions with colleagues and from practical experience. A further complication for microbial or pathogen populations is the frequent observation of clonality or partial clonality. Clonality makes analyses of population data difficult because many of the assumptions underlying the theory from which the analysis methods were derived are violated. This review provides practical guidance on how to navigate the complex web of analyses for pathogen data that may violate typical population genetics assumptions. We also provide resources and examples for analysis in the R programming environment.
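As an illustration of one practical issue raised here, the sketch below shows a simple clone-correction step (keeping one representative per multilocus genotype within each population) before counting alleles; it is plain Python with invented toy data, not the R resources the review itself provides.

```python
# Hypothetical sketch: clone-correct a multilocus genotype table, then count
# alleles at one locus. Toy data; not the review's R-based examples.
from collections import Counter

# Each record: (population, (locus1 genotype, locus2 genotype))
samples = [
    ("popA", ("101/103", "220/224")),
    ("popA", ("101/103", "220/224")),   # clonal copy of the first isolate
    ("popA", ("101/105", "220/220")),
    ("popB", ("103/103", "224/226")),
]

# Keep one representative per unique (population, genotype) combination
clone_corrected = list(dict.fromkeys(samples))

# Allele counts at locus 1, per population, after clone correction
allele_counts = Counter()
for pop, loci in clone_corrected:
    for allele in loci[0].split("/"):
        allele_counts[(pop, allele)] += 1

print(allele_counts)
```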


2010 ◽  
Vol 42 (5) ◽  
pp. 499-519 ◽  
Author(s):  
Silke WERTH

Abstract Population genetics investigates the distribution of genetic variation in natural populations and the genetic differentiation among populations. Lichen-forming fungi are exciting subjects for population genetic studies due to their obligate symbiosis with a green-algal and/or cyanobacterial photobiont, and because their different reproductive strategies could influence fungal genetic structures in various ways. In this review, first, I briefly summarize the results from studies of chemotype variation in populations of lichen-forming fungi. Second, I compare and evaluate the DNA-based molecular tools available for population genetics of lichen-forming fungi. Third, I review the literature available on the genetic structure of lichen fungi to show general trends. I discuss some fascinating examples, and point out directions for future research.


2011 ◽  
Vol 63 (1) ◽  
pp. 55-58
Author(s):  
Dragana Puzovic ◽  
D. Dunjic ◽  
Branka Popovic ◽  
O. Stojkovic ◽  
Ivana Novakovic ◽  
...  

Dentin provides a protective enclosure for genomic and mitochondrial DNA. In the present study, DNA was obtained from pulverized or ground teeth. The quality of DNA extracted from the teeth of 70 unrelated individuals was assessed by estimating the allelic and genotypic frequencies of the autosomal loci D19S216, D20S502 and D20S842, and by calculating a number of parameters of population genetic and forensic interest. This study illustrates that teeth can be a convenient tissue from which to extract DNA from large numbers of individuals, both for population genetic studies and for forensic casework.
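For context, the sketch below shows how allele frequencies and heterozygosity are typically computed for a single STR locus from diploid genotype calls; the genotypes are invented for illustration and are not the values reported for D19S216, D20S502 or D20S842.

```python
# Hypothetical sketch: allele frequencies and heterozygosity at one STR locus.
# Genotypes are toy values, not data from the study.
from collections import Counter

genotypes = [(12, 14), (12, 12), (14, 15), (12, 15), (15, 15)]

alleles = [a for pair in genotypes for a in pair]
freqs = {a: n / len(alleles) for a, n in Counter(alleles).items()}

observed_het = sum(a != b for a, b in genotypes) / len(genotypes)
expected_het = 1 - sum(f ** 2 for f in freqs.values())  # no small-sample correction

print(freqs)
print(observed_het, expected_het)
```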


2008 ◽  
Vol 57 (1-6) ◽  
pp. 41-44 ◽  
Author(s):  
F. Maghuly ◽  
K. Burg ◽  
W. Pinsker ◽  
F. Nittinger ◽  
W. Praznik ◽  
...  

Abstract Norway spruce is an important commercial tree species in northern and central Europe. Pure mitochondrial DNA isolated from tissue-culture material grown in the dark was used to construct a partial mitochondrial library. One hundred clones were randomly selected and 19 markers were isolated. Three of these markers proved to be polymorphic, and two showed maternal inheritance in controlled crosses. These markers will be useful for population genetic studies in P. abies.


Author(s):  
Adriana Fresneda Rodríguez ◽  
Luis Chasqui Velasco ◽  
David Alonso Carvajal

Microsatellites are molecular markers frequently used in population genetic studies despite the high cost and long time involved in developing them, a consequence of their high species specificity. One way to save money and time is cross-amplification, in which DNA of the target species is amplified using primers developed for a different species. Using cross-amplification, the suitability of 15 microsatellite loci developed for Litopenaeus setiferus and L. vannamei to amplify microsatellite regions of L. schmitti and L. occidentalis was evaluated. Five primers showed consistent amplification and were polymorphic in L. schmitti, and four in L. occidentalis. These results demonstrate the usefulness of cross-amplification with these primers for population genetic studies of both species.


Author(s):  
Adrien Oliva ◽  
Raymond Tobler ◽  
Alan Cooper ◽  
Bastien Llamas ◽  
Yassine Souilmi

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA 'reads') against a pre-constructed reference genome. Mapping vast numbers of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have remained effectively unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four read-mapping software packages (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects for which reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
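As one concrete example of the post-mapping filtering mentioned above, the sketch below drops reads with low mapping quality from a BAM file using pysam; the file names and the MAPQ threshold of 25 are assumptions for illustration, not the authors' benchmarked pipeline settings.

```python
# Minimal sketch: remove low-MAPQ reads after mapping (illustrative threshold
# and file names; not the study's exact pipeline). Requires pysam.
import pysam

MIN_MAPQ = 25  # assumed cut-off

with pysam.AlignmentFile("ancient_sample.bam", "rb") as bam_in, \
     pysam.AlignmentFile("ancient_sample.filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in.fetch(until_eof=True):
        if not read.is_unmapped and read.mapping_quality >= MIN_MAPQ:
            bam_out.write(read)
```

Filtering on mapping quality discards ambiguously placed reads, which is one of the levers the benchmark shows can reduce, though not eliminate, reference bias.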


2021 ◽  
pp. 000370282098784
Author(s):  
James Renwick Beattie ◽  
Francis Esmonde-White

Spectroscopy rapidly captures large amounts of data that are not directly interpretable. Principal Components Analysis (PCA) is widely used to distill complex spectral datasets into comprehensible information by identifying recurring patterns in the data with minimal loss of information. The linear algebra underpinning PCA is not well understood by many of the applied analytical scientists and spectroscopists who use it, and the meaning of features identified through PCA is often unclear. This manuscript traces the journey of the spectra themselves through the operations behind PCA, with each step illustrated by simulated spectra. PCA relies solely on the information within the spectra; consequently, the mathematical model depends on the nature of the data itself. The direct links between model and spectra allow a concrete spectroscopic explanation of PCA, such as the scores representing 'concentration' or 'weights'. The principal components (loadings) are, by definition, hidden, repeated and uncorrelated spectral shapes that combine linearly to generate the observed spectra. They can be visualized as subtraction spectra between extreme differences within the dataset. Each PC is shown to be a successive refinement of the estimated spectra, improving the fit between the PC-reconstructed data and the original data. Understanding the data-led development of a PCA model shows how to interpret the application-specific chemical meaning of the PCA loadings and how to analyze the scores. A critical benefit of PCA is its simplicity and the succinctness of its description of a dataset, which make it powerful and flexible.
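To mirror that successive-refinement view, the sketch below simulates mixture spectra, decomposes them with an SVD (the computational core of PCA), and reports how the reconstruction error falls as components are added; the peak positions, noise level, and dataset sizes are illustrative assumptions, not the manuscript's simulations.

```python
# Illustrative sketch: PCA via SVD on simulated mixture spectra, showing each
# added component as a successive refinement of the reconstructed data.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 500)
pure = np.vstack([np.exp(-((x - c) ** 2) / 0.002) for c in (0.3, 0.5, 0.7)])
conc = rng.random((40, 3))                      # 40 mixtures, 3 'concentrations'
spectra = conc @ pure + rng.normal(0.0, 0.01, (40, 500))

mean_spectrum = spectra.mean(axis=0)
U, s, Vt = np.linalg.svd(spectra - mean_spectrum, full_matrices=False)

for k in (1, 2, 3):
    approx = (U[:, :k] * s[:k]) @ Vt[:k] + mean_spectrum   # k-component model
    rmse = np.sqrt(((spectra - approx) ** 2).mean())
    print(f"{k} PCs: reconstruction RMSE = {rmse:.4f}")
```

Here the rows of Vt play the role of the loadings (hidden spectral shapes) and U scaled by s the scores, matching the 'concentration'/'weights' reading described above.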

