Genomic prediction with the additive-dominant model by dimensionality reduction methods

Author(s):  
Jaquicele Aparecida da Costa ◽  
Camila Ferreira Azevedo ◽  
Moysés Nascimento ◽  
Fabyano Fonseca e Silva ◽  
Marcos Deon Vilela de Resende ◽  
...  

Abstract: The objective of this work was to evaluate the application of different dimensionality reduction methods in the additive-dominant model and to compare them with the genomic best linear unbiased prediction (G-BLUP) method. The dimensionality reduction methods evaluated were: principal components regression (PCR), partial least squares (PLS), and independent components regression (ICR). A simulated data set composed of 1,000 individuals and 2,000 single-nucleotide polymorphisms was used and analyzed in four scenarios: two heritability levels × two genetic architectures. To help choose the number of components, the results were evaluated with respect to additive, dominance, and total genomic information. In general, PCR showed higher accuracy values than the other methods. However, none of the methodologies is able to recover the true genomic heritabilities, and all of them yield biased estimates, under- or overestimating the genomic genetic values. For the simultaneous estimation of additive and dominance marker effects, the best alternative is to choose the number of components that maximizes the accuracy of the dominance genomic value.
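As a rough illustration of the principal components regression route described above, the sketch below builds additive (0/1/2) and dominance (heterozygote indicator) incidence matrices from a simulated SNP matrix, regresses a placeholder phenotype on a chosen number of components, and back-transforms to marker effects. The coding, component count, and data are assumptions made for illustration, not the authors' pipeline.

```python
# Minimal PCR sketch for an additive-dominant marker model (simulated data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 2000)).astype(float)  # SNPs coded 0/1/2
y = rng.normal(size=1000)                                 # placeholder phenotype

W = X - X.mean(axis=0)                 # centred additive incidence matrix
D = (X == 1).astype(float)             # dominance coding: 1 for heterozygotes
D = D - D.mean(axis=0)
M = np.hstack([W, D])                  # joint additive + dominance design

n_comp = 50                            # e.g. chosen by cross-validating dominance accuracy
pca = PCA(n_components=n_comp).fit(M)
T = pca.transform(M)                   # component scores
beta = LinearRegression().fit(T, y).coef_

# Back-transform component coefficients to marker effects,
# then split them into additive and dominance parts.
marker_effects = pca.components_.T @ beta
a_hat, d_hat = marker_effects[:2000], marker_effects[2000:]
gebv = W @ a_hat + D @ d_hat           # total genomic values
```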

2021 ◽  
Vol 4 (1) ◽  
pp. 251524592095492
Author(s):  
Marco Del Giudice ◽  
Steven W. Gangestad

Decisions made by researchers while analyzing data (e.g., how to measure variables, how to handle outliers) are sometimes arbitrary, without an objective justification for choosing one alternative over another. Multiverse-style methods (e.g., specification curve, vibration of effects) estimate an effect across an entire set of possible specifications to expose the impact of hidden degrees of freedom and/or obtain robust, less biased estimates of the effect of interest. However, if specifications are not truly arbitrary, multiverse-style analyses can produce misleading results, potentially hiding meaningful effects within a mass of poorly justified alternatives. So far, a key question has received scant attention: How does one decide whether alternatives are arbitrary? We offer a framework and conceptual tools for doing so. We discuss three kinds of a priori nonequivalence among alternatives—measurement nonequivalence, effect nonequivalence, and power/precision nonequivalence. The criteria we review lead to three decision scenarios: Type E decisions (principled equivalence), Type N decisions (principled nonequivalence), and Type U decisions (uncertainty). In uncertain scenarios, multiverse-style analysis should be conducted in a deliberately exploratory fashion. The framework is discussed with reference to published examples and illustrated with the help of a simulated data set. Our framework will help researchers reap the benefits of multiverse-style methods while avoiding their pitfalls.
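As a hedged illustration of the multiverse-style analyses discussed above, the sketch below runs a tiny specification grid (outlier rule × covariate set) and collects the focal estimate from each specification. The data, rules, and formulas are hypothetical; in the authors' terms, only Type U (genuinely uncertain) alternatives would belong in such a grid.

```python
# Toy specification-curve grid over arbitrary analysis choices.
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"y": rng.normal(size=300),
                   "x": rng.normal(size=300),
                   "age": rng.normal(size=300)})

outlier_rules = {"none": lambda d: d,
                 "z<3": lambda d: d[np.abs((d.y - d.y.mean()) / d.y.std()) < 3]}
covariate_sets = {"bare": "y ~ x", "age": "y ~ x + age"}

results = []
for (o_name, rule), (c_name, formula) in itertools.product(
        outlier_rules.items(), covariate_sets.items()):
    fit = smf.ols(formula, data=rule(df)).fit()
    results.append({"outliers": o_name, "covariates": c_name,
                    "estimate": fit.params["x"], "p": fit.pvalues["x"]})

spec_curve = pd.DataFrame(results).sort_values("estimate")
print(spec_curve)  # the distribution of estimates across specifications
```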


2022 ◽  
pp. 17-25
Author(s):  
Nancy Jan Sliper

Experimenters today frequently quantify millions or even billions of characteristics (measurements) per sample to address critical biological questions, in the hope that machine learning tools will be able to make correct data-driven judgments. An efficient analysis requires a low-dimensional representation that preserves the differentiating features in data whose size and complexity are orders of magnitude apart (e.g., whether a certain ailment is present in a person's body). While there are several methods that can handle millions of variables and still offer strong empirical and conceptual guarantees, few of them can be clearly understood. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending Principal Component Analysis (PCA) by including category moment estimates in low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, includes the class-conditional means. We show that LOLR projections and their extensions improve data representations for subsequent classification while retaining computational flexibility and reliability, using both experimental and simulated data benchmarks. In terms of accuracy, LOLR prediction outperforms other modular linear dimension reduction methods that require much longer computation times on conventional computers. LOLR handles more than 150 million attributes in brain image processing datasets, and many genome sequencing datasets have more than half a million attributes.
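A rough sketch of the idea behind LOLR-style projections follows: augment the leading principal directions with the difference of class-conditional means so that the low-dimensional embedding retains class-discriminative information. The function name, shapes, and data are illustrative assumptions, not the authors' implementation.

```python
# Class-aware PCA-style projection: class-mean difference + top PCA directions.
import numpy as np
from sklearn.decomposition import PCA

def lol_like_projection(X, y, n_components=10):
    """Project X (n_samples, n_features) to n_components dimensions."""
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    delta = (mu1 - mu0).reshape(-1, 1)            # class-mean difference direction
    pca = PCA(n_components=n_components - 1).fit(X - X.mean(axis=0))
    basis = np.hstack([delta, pca.components_.T])
    q, _ = np.linalg.qr(basis)                    # orthonormalise the combined basis
    return (X - X.mean(axis=0)) @ q

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5000))                  # wide, high-dimensional data
y = rng.integers(0, 2, size=200)                  # binary class labels
Z = lol_like_projection(X, y, n_components=10)    # (200, 10) embedding
```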


2019 ◽  
Author(s):  
Ronald J. Nowling ◽  
Krystal R. Manke ◽  
Scott J. Emrich

Abstract: Chromosomal inversions are associated with reproductive isolation and adaptation in insects such as Drosophila melanogaster and the malaria vectors Anopheles gambiae and Anopheles coluzzii. While methods based on read alignment have been useful in humans for detecting inversions, these methods are less successful in insects due to long repeated sequences at the breakpoints. Alternatively, inversions can be detected using principal component analysis (PCA) of single nucleotide polymorphisms (SNPs). We apply PCA-based inversion detection to a simulated data set and to real data from multiple insect species, which vary in complexity from a single inversion in samples drawn from a single population to multiple overlapping inversions occurring in closely related species, with samples generated from multiple geographic locations. We show empirically that proper analysis of these data can be challenging when multiple inversions or populations are present, and that our alternative framework is more robust in these more difficult scenarios.
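The sketch below shows the basic PCA-on-SNPs idea in its simplest setting: genotypes are simulated in a single genomic window so that inversion karyotypes separate into three clusters along the leading principal component. The simulation parameters are arbitrary placeholders, and the single-window, single-population case is exactly the easy scenario the abstract contrasts with the harder multi-inversion, multi-population ones.

```python
# PCA of SNPs in one genomic window; karyotypes form three clusters along PC1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
karyotype = rng.integers(0, 3, size=300)               # 0, 1 or 2 inverted copies
freq_std, freq_inv = 0.1, 0.9                          # allele frequency differs by arrangement
p = (freq_std * (2 - karyotype[:, None]) + freq_inv * karyotype[:, None]) / 2
G = rng.binomial(2, p, size=(300, 500)).astype(float)  # genotypes in the window

pcs = PCA(n_components=2).fit_transform(G - G.mean(axis=0))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs[:, :1])
# Three well-separated clusters along PC1 suggest a segregating inversion.
```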


2020 ◽  
Vol 636 ◽  
pp. 19-33 ◽  
Author(s):  
AM Edwards ◽  
JPW Robinson ◽  
JL Blanchard ◽  
JK Baum ◽  
MJ Plank

Size spectra are recommended tools for detecting the response of marine communities to fishing or to management measures. A size spectrum succinctly describes how a property, such as abundance or biomass, varies with body size in a community. Required data are often collected in binned form, such as numbers of individuals in 1 cm length bins. Numerous methods have been employed to fit size spectra, but most give biased estimates when tested on simulated data, and none account for the data’s bin structure (breakpoints of bins). Here, we used 8 methods to fit an annual size-spectrum exponent, b, to an example data set (30 yr of the North Sea International Bottom Trawl Survey). The methods gave conflicting conclusions regarding b declining (the size spectrum steepening) through time, and so any resulting advice to ecosystem managers will be highly dependent upon the method used. Using simulated data, we showed that ignoring the bin structure gives biased estimates of b, even for high-resolution data. However, our extended likelihood method, which explicitly accounts for the bin structure, accurately estimated b and its confidence intervals, even for coarsely collected data. We developed a novel visualisation method that accounts for the bin structure and associated uncertainty, provide recommendations concerning different data types and have created an R package (sizeSpectra) to reproduce all results and encourage use of our methods. This work is also relevant to wider applications where a power-law distribution (the underlying distribution for a size spectrum) is fitted to binned data.
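A hedged sketch of the key idea in the likelihood method follows: treat each 1 cm bin as an interval and maximise the binned bounded power-law likelihood over the exponent b, using the bin breakpoints rather than midpoints. The breakpoints and counts below are simulated placeholders; the authors' full method is implemented in their R package sizeSpectra.

```python
# Maximum-likelihood fit of the exponent b to binned power-law counts.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
breaks = np.arange(1.0, 31.0)                 # 1 cm length-bin breakpoints
lo, hi = breaks[:-1], breaks[1:]
true_b = -2.0
counts = rng.poisson(1000 * (lo**(true_b + 1) - hi**(true_b + 1)))  # fake bin counts

def neg_log_lik(b):
    # Probability of each bin under a bounded power law with exponent b (b != -1),
    # computed from the bin breakpoints.
    xmin, xmax = breaks[0], breaks[-1]
    p = (hi**(b + 1) - lo**(b + 1)) / (xmax**(b + 1) - xmin**(b + 1))
    return -np.sum(counts * np.log(p))

res = minimize_scalar(neg_log_lik, bounds=(-4.0, -1.01), method="bounded")
print("estimated exponent b:", res.x)
```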


Author(s):  
S.M. Shaharudin ◽  
N. Ahmad ◽  
N.H. Zainuddin ◽  
N.S. Mohamed

A robust dimension reduction method in Principal Component Analysis (PCA) was used to rectify the issue of unbalanced clusters in rainfall patterns caused by the skewed nature of rainfall data. A robust measure in PCA using Tukey's biweight correlation to downweight observations was introduced, and the optimum breakdown point for extracting the number of components in PCA with this approach is proposed. A simulated data matrix that mimicked the real data set was used to determine an appropriate breakdown point for robust PCA and to compare the performance of both approaches. The simulated data indicated that a breakdown point of 70% cumulative percentage of variance gave a good balance in extracting the number of components. The results showed a significant and substantial improvement with the robust PCA over PCA based on Pearson correlation, in terms of the average number of clusters obtained and their cluster quality.
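The sketch below is a simplified stand-in for the robust PCA described above: observations are downweighted with Tukey's biweight function applied to robust (median/MAD) z-scores, a weighted correlation matrix is built, and components are retained up to the chosen breakdown point (here 70% cumulative variance). It is a hedged approximation, not the authors' exact Tukey's biweight correlation estimator.

```python
# Simplified robust PCA via Tukey-biweight downweighting of observations.
import numpy as np

def tukey_biweight_weights(x, c=4.685):
    med = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - med))   # MAD-based robust scale
    u = (x - med) / (c * scale)
    w = (1.0 - u**2) ** 2
    w[np.abs(u) >= 1] = 0.0                       # zero weight beyond the cutoff
    return w

def robust_pca_scores(X, cum_var=0.70):
    # Per-observation weight: worst variable-wise biweight weight.
    W = np.column_stack([tukey_biweight_weights(X[:, j]) for j in range(X.shape[1])])
    w_obs = W.min(axis=1)
    mean = np.average(X, axis=0, weights=w_obs)
    Xc = (X - mean) * np.sqrt(w_obs)[:, None]
    corr = np.corrcoef(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(corr)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    k = np.searchsorted(np.cumsum(vals) / vals.sum(), cum_var) + 1
    return (X - mean) @ vecs[:, :k]               # scores on the retained k components

rng = np.random.default_rng(5)
X = rng.gamma(shape=2.0, scale=3.0, size=(200, 12))   # skewed, rainfall-like data
scores = robust_pca_scores(X)
```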


2009 ◽  
Vol 91 (6) ◽  
pp. 427-436 ◽  
Author(s):  
M. GRAZIANO USAI ◽  
MIKE E. GODDARD ◽  
BEN J. HAYES

Summary: We used a least absolute shrinkage and selection operator (LASSO) approach to estimate marker effects for genomic selection. The least angle regression (LARS) algorithm and cross-validation were used to define the best subset of markers to include in the model. The LASSO–LARS approach was tested on two data sets: a simulated data set with 5865 individuals and 6000 single nucleotide polymorphisms (SNPs); and a mouse data set with 1885 individuals genotyped for 10 656 SNPs and phenotyped for a number of quantitative traits. In the simulated data, three approaches were used to split the reference population into training and validation subsets for cross-validation: random splitting across the whole population, and random sampling of the validation set from the last generation only, either within or across families. The highest accuracy was obtained by random splitting across the whole population. The accuracy of genomic estimated breeding values (GEBVs) in the candidate population obtained by LASSO–LARS was 0·89 with 156 explanatory SNPs. This value was higher than those obtained by Best Linear Unbiased Prediction (BLUP) and a Bayesian method (BayesA), which were 0·75 and 0·84, respectively. In the mouse data, 1600 individuals were randomly allocated to the reference population. The GEBVs for the remaining 285 individuals estimated by LASSO–LARS were more accurate than those obtained by BLUP and BayesA for weight at six weeks, and slightly less accurate for growth rate and body length. It was concluded that the LASSO–LARS approach is a good alternative method for estimating marker effects for genomic selection, particularly when the cost of genotyping can be reduced by using a limited subset of markers.
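A minimal sketch of the LASSO–LARS idea with scikit-learn follows: the LARS path gives the order in which SNPs enter the model, and cross-validation picks the penalty, and hence the subset of markers with non-zero effects. The genotypes, phenotypes, and problem sizes below are simulated placeholders, smaller than the data sets in the paper.

```python
# LASSO fitted along the LARS path with cross-validated penalty selection.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(6)
X = rng.integers(0, 3, size=(1000, 3000)).astype(float)     # SNPs coded 0/1/2
true_effects = np.zeros(3000)
true_effects[rng.choice(3000, 150, replace=False)] = rng.normal(size=150)
y = X @ true_effects + rng.normal(scale=5.0, size=1000)      # simulated phenotype

model = LassoLarsCV(cv=5).fit(X, y)        # LARS path + CV choice of penalty
selected = np.flatnonzero(model.coef_)     # SNPs retained with non-zero effects
gebv = model.predict(X)                    # genomic estimated breeding values
print(f"{selected.size} SNPs with non-zero effects")
```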


2011 ◽  
Vol 2 (1) ◽  
Author(s):  
Anna-Sapfo Malaspinas ◽  
Caroline Uhler

Rapid research progress in genotyping techniques has allowed large genome-wide association studies. Existing methods often focus on determining associations between single loci and a specific phenotype. However, a particular phenotype is usually the result of complex relationships between multiple loci and the environment. In this paper, we describe a two-stage method for detecting epistasis by combining the traditionally used single-locus search with a search for multiway interactions. Our method is based on an extended version of Fisher's exact test. To perform this test, a Markov chain is constructed on the space of multidimensional contingency tables using the elements of a Markov basis as moves. We test our method on simulated data and compare it to a two-stage logistic regression method and to a fully Bayesian method, showing that we are able to detect the interacting loci when other methods fail to do so. Finally, we apply our method to a genome-wide data set consisting of 685 dogs and identify epistasis associated with canine hair length for four pairs of single nucleotide polymorphisms (SNPs).
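The sketch below is a very rough two-stage analysis in the spirit of the method: stage one ranks SNPs by a single-locus association test, stage two tests the joint genotype distribution for pairs of top-ranked SNPs. A chi-square contingency test stands in for the extended Fisher's exact test computed via Markov bases in the paper, and all data are simulated.

```python
# Two-stage epistasis screen: single-locus ranking, then pairwise tests.
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
geno = rng.integers(0, 3, size=(685, 100))       # 0/1/2 genotypes (simulated)
pheno = rng.integers(0, 2, size=685)             # binary phenotype (simulated)

def assoc_p(classes, y, n_classes):
    # Genotype-class x phenotype contingency table, chi-square p-value.
    table = np.array([[np.sum((classes == k) & (y == c)) for k in range(n_classes)]
                      for c in range(2)])
    return chi2_contingency(table)[1]

# Stage 1: rank SNPs by single-locus association and keep the top few.
pvals = np.array([assoc_p(geno[:, j], pheno, 3) for j in range(geno.shape[1])])
top = np.argsort(pvals)[:10]

# Stage 2: test the joint genotype distribution (9 classes) for each top pair.
pair_pvals = {(i, j): assoc_p(geno[:, i] * 3 + geno[:, j], pheno, 9)
              for i, j in combinations(top, 2)}
```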


2021 ◽  
Vol 2008 (1) ◽  
pp. 012009
Author(s):  
R Cavalieri ◽  
P Bertemes-Filho

Abstract: Electrical impedance spectroscopy combined with neural networks can be a powerful combination for identifying biological materials. This paper uses a data set containing two biological samples taken from different species and applies the most popular dimensionality reduction methods, in order to find out which method minimizes computational demand and maximizes accuracy in the classification test. The paper concludes that the classic PCA method is the fastest and the most accurate under the configurations used.
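A sketch of this kind of comparison follows: reduce impedance spectra with PCA and feed the components to a small neural-network classifier, recording accuracy and runtime for each component count. The spectra, labels, and classifier settings are synthetic stand-ins, not the paper's configuration.

```python
# PCA + neural-network pipeline compared across component counts.
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
spectra = rng.normal(size=(400, 128))           # impedance over 128 frequencies (synthetic)
labels = rng.integers(0, 2, size=400)           # species A / species B

for n in (2, 5, 10):
    clf = make_pipeline(PCA(n_components=n), MLPClassifier(max_iter=500))
    t0 = time.perf_counter()
    acc = cross_val_score(clf, spectra, labels, cv=5).mean()
    print(f"PCA({n}): accuracy={acc:.2f}, time={time.perf_counter() - t0:.2f}s")
```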


Sensors ◽  
2022 ◽  
Vol 22 (1) ◽  
pp. 367
Author(s):  
Janez Lapajne ◽  
Matej Knapič ◽  
Uroš Žibrat

Hyperspectral imaging is a popular tool for non-invasive plant disease detection. Data acquired with it usually consist of many correlated features; hence most of the acquired information is redundant. Dimensionality reduction methods are used to transform the data sets from high-dimensional to low-dimensional (in this study, to one or a few features). We chose six dimensionality reduction methods (partial least squares, linear discriminant analysis, principal component analysis, RandomForest, ReliefF, and extreme gradient boosting) and tested their efficacy on a hyperspectral data set of potato tubers. The extracted or selected features were pipelined to a support vector machine classifier and evaluated. Tubers were divided into two groups, healthy and infested with Meloidogyne luci. The results show that all dimensionality reduction methods enabled successful identification of inoculated tubers. The best and most consistent results were obtained using linear discriminant analysis, with 100% accuracy in both potato tuber inside and outside images. Classification success was generally higher in the outside data set than in the inside one. Nevertheless, accuracy was in all cases above 0.6.
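The sketch below shows the general shape of the best-performing setting reported above: reduce the spectral bands with linear discriminant analysis and classify with a support vector machine. The per-tuber spectra, band count, and labels are simulated placeholders for the potato data set.

```python
# LDA feature extraction pipelined to an SVM classifier.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.normal(size=(120, 150))                  # mean spectrum per tuber, 150 bands
y = rng.integers(0, 2, size=120)                 # 0 = healthy, 1 = M. luci infested

# With two classes, LDA yields a single discriminant feature, matching the
# "one or a few features" reduction described in the abstract.
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=1), SVC(kernel="linear"))
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```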


2012 ◽  
Vol 12 (1) ◽  
pp. 44-64 ◽  
Author(s):  
Sara Johansson Fernstad ◽  
Jane Shaw ◽  
Jimmy Johansson

High-dimensional data sets containing hundreds of variables are difficult to explore, as traditional visualization methods often are unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account and few provide an overview of structures existing in the full high-dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high-dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness is obtained, based on several quality metrics, that is usable as a threshold for the most interesting variables. An interactive environment is presented in which the user is provided with many possibilities to explore and gain understanding of the high-dimensional data set. Guided by this, the analyst can explore the high-dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.
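A small sketch of the ranking idea follows: score each variable with several quality metrics, combine the per-metric ranks into a single interestingness value, and keep the variables above a threshold as the candidate subset for dimensionality reduction. The metrics, threshold, and data below are illustrative choices, not the paper's own quality metrics.

```python
# Rank variables by combined "interestingness" across several quality metrics.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 200))                   # 200 variables (placeholder data)

metrics = {
    "variance": X.var(axis=0),
    "skewness": np.abs(skew(X, axis=0)),
    "max_abs_corr": np.abs(np.corrcoef(X, rowvar=False) - np.eye(200)).max(axis=0),
}

# Rank per metric (higher = more interesting), then average the ranks.
ranks = np.mean([np.argsort(np.argsort(v)) for v in metrics.values()], axis=0)
interestingness = ranks / (X.shape[1] - 1)        # normalise to [0, 1]
selected = np.flatnonzero(interestingness > 0.8)  # candidate subset for reduction
```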

