Optimal dimensionality selection for independent component analysis of transcriptomic data

2021 ◽  
Author(s):  
John Luke McConn ◽  
Cameron R Lamoureux ◽  
Saugat Poudel ◽  
Bernhard O Palsson ◽  
Anand V Sastry

Independent Component Analysis (ICA) is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, ICA effectively reveals the source signals of the transcriptome as groups of co-regulated genes and their corresponding activities across diverse growth conditions. Two major variables that affect the output of ICA are the diversity and scope of the underlying data, and the user-defined number of independent components, or dimensionality, to compute. Availability of high-quality transcriptomic datasets has grown exponentially as high-throughput technologies have advanced; however, optimal dimensionality selection remains an open question. Here, we introduce a new method, called OptICA, for effectively finding the optimal dimensionality that consistently maximizes the number of biologically relevant components revealed while minimizing the potential for over-decomposition. We show that OptICA outperforms two previously proposed methods for selecting the number of independent components across four transcriptomic databases of varying sizes. OptICA avoids both over-decomposition and under-decomposition of transcriptomic datasets resulting in the best representation of the organism's underlying transcriptional regulatory network.

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
John Luke McConn ◽  
Cameron R. Lamoureux ◽  
Saugat Poudel ◽  
Bernhard O. Palsson ◽  
Anand V. Sastry

Abstract

Background: Independent component analysis is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, independent component analysis effectively reveals both the source signals of the transcriptome as co-regulated gene sets, and the activity levels of the underlying regulators across diverse experimental conditions. Two major variables that affect the final gene sets are the diversity of the expression profiles contained in the underlying data, and the user-defined number of independent components, or dimensionality, to compute. Availability of high-quality transcriptomic datasets has grown exponentially as high-throughput technologies have advanced; however, optimal dimensionality selection remains an open question.

Methods: We computed independent components across a range of dimensionalities for four gene expression datasets with varying dimensions (both in terms of number of genes and number of samples). We computed the correlation between independent components across different dimensionalities to understand how the overall structure evolves as the number of user-defined components increases. We then measured how well the resulting gene clusters reflected known regulatory mechanisms, and developed a set of metrics to assess the accuracy of the decomposition at a given dimension.

Results: We found that over-decomposition results in many independent components dominated by a single gene, whereas under-decomposition results in independent components that poorly capture the known regulatory structure. From these results, we developed a new method, called OptICA, for finding the optimal dimensionality that controls for both over- and under-decomposition. Specifically, OptICA selects the highest dimension that produces a low number of components that are dominated by a single gene. We show that OptICA outperforms two previously proposed methods for selecting the number of independent components across four transcriptomic databases of varying sizes.

Conclusions: OptICA avoids both over-decomposition and under-decomposition of transcriptomic datasets resulting in the best representation of the organism’s underlying transcriptional regulatory network.
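
The selection rule stated in the Results (take the highest dimension that still yields only a few single-gene components) can be illustrated with a small sketch. This is not the authors' implementation: the dominance criterion (one gene holding more than half of a component's squared weight), the budget `max_single_gene`, and all function names are assumptions introduced here for illustration, and the counts passed in at the end are made-up numbers.

```python
import numpy as np


def is_single_gene_component(weights, dominance=0.5):
    """True if one gene holds more than `dominance` of the squared weight.

    Both the criterion and the 0.5 default are illustrative assumptions.
    """
    w2 = np.asarray(weights, dtype=float) ** 2
    total = w2.sum()
    return bool(total > 0 and w2.max() / total > dominance)


def count_single_gene_components(M, dominance=0.5):
    """Count dominated columns of M, shaped (n_genes, n_components)."""
    M = np.asarray(M, dtype=float)
    return sum(is_single_gene_component(M[:, k], dominance)
               for k in range(M.shape[1]))


def select_dimension(single_gene_counts, max_single_gene=2):
    """Highest dimension whose single-gene count is within the budget.

    `single_gene_counts` maps dimension -> count; `max_single_gene`
    is an assumed cut-off for "a low number".
    """
    ok = [d for d, c in single_gene_counts.items() if c <= max_single_gene]
    if not ok:
        raise ValueError("no dimension satisfies the single-gene budget")
    return max(ok)


# Toy gene-weight matrix: component 0 is dominated by gene 0.
M_toy = np.array([[10.0, 1.0],
                  [0.1, 1.0],
                  [0.1, 1.0]])
n_dominated = count_single_gene_components(M_toy)

# Hypothetical counts from decompositions at four dimensionalities.
chosen = select_dimension({20: 0, 40: 1, 60: 2, 80: 7})
```

In practice the counts would come from running ICA at each candidate dimensionality and applying `count_single_gene_components` to the resulting gene-weight matrix; how "low" is defined is a choice the abstract does not specify.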


2016 ◽  
Vol 37 (1) ◽  
Author(s):  
Klaus Nordhausen ◽  
Hannu Oja ◽  
Esa Ollila

Oja, Sirkiä, and Eriksson (2006) and Ollila, Oja, and Koivunen (2007) showed that, under general assumptions, any two scatter matrices with the so-called independent components property can be used to estimate the unmixing matrix for independent component analysis (ICA). The method is a generalization of Cardoso’s (Cardoso, 1989) FOBI estimate, which uses the regular covariance matrix and a scatter matrix based on fourth moments. Different choices of the two scatter matrices are compared in a simulation study. Based on the study, we recommend always using two robust scatter matrices. For possibly asymmetric independent components, symmetrized versions of the scatter matrix estimates should be used.
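
As a rough illustration of the two-scatter idea, the sketch below implements only the FOBI special case mentioned above: whiten with the regular covariance matrix, then diagonalize a fourth-moment scatter matrix of the whitened data. It does not use the robust or symmetrized scatter matrices the abstract recommends. The function name `fobi`, the mixing matrix, and the simulated sources are assumptions made for the example, and FOBI additionally requires the sources to have distinct kurtosis values.

```python
import numpy as np


def fobi(X):
    """FOBI unmixing estimate for X of shape (n_samples, n_features).

    Returns W such that W @ (x - mean) estimates the sources
    (up to order, sign and scale).
    """
    Xc = X - X.mean(axis=0)
    # First scatter matrix: the regular covariance, used for whitening.
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric cov^(-1/2)
    Z = Xc @ whitener
    # Second scatter matrix: fourth moments, E[||z||^2 z z^T].
    r2 = np.sum(Z ** 2, axis=1)
    cov4 = (Z * r2[:, None]).T @ Z / Z.shape[0]
    # Its eigenvectors give the rotation that completes the unmixing.
    _, U = np.linalg.eigh(cov4)
    return U.T @ whitener


rng = np.random.default_rng(0)
n = 50000
# Two independent unit-variance sources with clearly different kurtosis:
# uniform (sub-Gaussian) and Laplace (super-Gaussian).
S = np.column_stack([
    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),
    rng.laplace(0.0, 1.0 / np.sqrt(2.0), n),
])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])  # arbitrary mixing matrix for the demo
X = S @ A.T

W = fobi(X)
G = W @ A  # close to a signed permutation matrix if unmixing worked
```

Swapping either scatter matrix for a robust alternative would follow the same pattern (whiten with the first, diagonalize the second), which is the generalization the abstract describes.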


2019 ◽  
Vol 7 (3) ◽  
pp. SE19-SE42 ◽  
Author(s):  
David Lubo-Robles ◽  
Kurt J. Marfurt

During the past two decades, the number of volumetric seismic attributes has increased to the point at which interpreters are overwhelmed and cannot analyze all of the information that is available. Principal component analysis (PCA) is one of the best-known multivariate analysis techniques that decompose the input data into second-order statistics by maximizing the variance, thus obtaining mathematically uncorrelated components. Unfortunately, projecting the information in the multiple input data volumes onto an orthogonal basis often mixes rather than separates geologic features of interest. To address this issue, we have implemented and evaluated a relatively new unsupervised multiattribute analysis technique called independent component analysis (ICA), which is based on higher order statistics. We evaluate our algorithm to study the internal architecture of turbiditic channel complexes present in the Moki A sands Formation, Taranaki Basin, New Zealand. We input 12 spectral magnitude components ranging from 25 to 80 Hz into the ICA algorithm and we plot 3 of the resulting independent components against a red-green-blue color scheme to generate a single volume in which the colored independent components correspond to different seismic facies. The results obtained using ICA proved to be superior to those obtained using PCA. Specifically, ICA provides improved resolution and separates geologic features from noise. Moreover, with ICA, we can geologically analyze the different seismic facies and relate them to sand- and mud-prone seismic facies associated with axial and off-axis deposition and cut-and-fill architectures.
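
The red-green-blue display step can be sketched as follows. This is a minimal illustration that assumes simple min-max rescaling of each component; the abstract does not state which scaling or which three components were used, and the function name and the random placeholder volume are hypothetical.

```python
import numpy as np


def components_to_rgb(ic_r, ic_g, ic_b):
    """Stack three component volumes into an RGB array with values in [0, 1]."""
    def rescale(c):
        c = np.asarray(c, dtype=float)
        span = c.max() - c.min()
        # A constant component carries no contrast; map it to zero.
        return (c - c.min()) / span if span > 0 else np.zeros_like(c)

    return np.stack([rescale(ic_r), rescale(ic_g), rescale(ic_b)], axis=-1)


rng = np.random.default_rng(3)
ics = rng.standard_normal((3, 4, 5, 6))  # placeholder data, not seismic
rgb = components_to_rgb(ics[0], ics[1], ics[2])
```

The result has one extra trailing axis of length 3, so any slice through the volume can be passed directly to an image display routine that accepts RGB arrays.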


2014 ◽  
Vol 553 ◽  
pp. 564-569
Author(s):  
Yaseen Unnisa ◽  
Danh Tran ◽  
Fu Chun Huang

Independent Component Analysis (ICA) is a recent method of blind source separation that has been employed in medical image processing and structural damage detection. It can extract source signals and the unmixing matrix of the system using mixture signals only. This novel method relies on the assumption that source signals are statistically independent. This paper looks at various measures of statistical independence (SI) employed in ICA, the measures proposed by Bakirov and his associates, and the effects of levels of SI of source signals on the output of ICA. Firstly, two statistically independent signals in the form of uniform random signals and a mixing matrix were used to simulate mixture signals to be analysed by the fastICA package. Secondly, noise was added onto the signals to investigate the effects of levels of SI on the output of ICA in the form of source signals and the mixing and unmixing matrices. It was found that the p-value given by Bakirov’s SI statistical test of the null hypothesis H0 is a good indication of the SI between two variables, and that for p-values larger than 0.05, fastICA performs satisfactorily.
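
The data-generation step of such a simulation might look like the sketch below. Several details are assumptions: the mixing matrix is made up, the noise is added to the mixtures (one possible reading of "added onto the signals"), and plain correlation is used only as a crude stand-in for dependence. It is not Bakirov's test, and the fastICA step itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Two statistically independent uniform source signals.
S = rng.uniform(-1.0, 1.0, size=(n, 2))

# Hypothetical mixing matrix; the abstract does not give the one used.
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])


def mix(S, A, noise_sd=0.0, rng=None):
    """Return mixtures X = S A^T, optionally with additive Gaussian noise."""
    X = S @ A.T
    if noise_sd > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        X = X + rng.normal(0.0, noise_sd, size=X.shape)
    return X


X_clean = mix(S, A)
X_noisy = mix(S, A, noise_sd=0.1, rng=rng)

# Correlation is only a crude proxy for dependence, not Bakirov's test.
corr_sources = np.corrcoef(S, rowvar=False)[0, 1]
corr_clean = np.corrcoef(X_clean, rowvar=False)[0, 1]
corr_noisy = np.corrcoef(X_noisy, rowvar=False)[0, 1]
```

The sources come out nearly uncorrelated while the mixtures are clearly correlated, which is the structure an ICA routine is then asked to undo.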


1996 ◽  
Vol 07 (06) ◽  
pp. 671-687 ◽  
Author(s):  
AAPO HYVÄRINEN ◽  
ERKKI OJA

Recently, several neural algorithms have been introduced for Independent Component Analysis. Here we approach the problem from the point of view of a single neuron. First, simple Hebbian-like learning rules are introduced for estimating one of the independent components from sphered data. Some of the learning rules can be used to estimate an independent component which has a negative kurtosis, and the others estimate a component of positive kurtosis. Next, a two-unit system is introduced to estimate an independent component of any kurtosis. The results are then generalized to estimate independent components from non-sphered (raw) mixtures. To separate several independent components, a system of several neurons with linear negative feedback is used. The convergence of the learning rules is rigorously proven without any unnecessary hypotheses on the distributions of the independent components.
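
A one-unit rule of this kind can be sketched in batch form. The code below is not the paper's exact online learning rule: it assumes a cubic nonlinearity, averages the Hebbian-like term over all samples, and renormalizes the weight vector after each step. The `sign` argument selects whether the neuron seeks a component of positive or negative kurtosis; the function names, step size, and simulated sources are assumptions for the example.

```python
import numpy as np


def sphere(X):
    """Center and whiten X (n_samples, n_features) to identity covariance."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ evecs @ np.diag(evals ** -0.5) @ evecs.T


def one_unit_rule(Z, sign, lr=0.1, n_iter=500, seed=0):
    """Batch update w <- w + sign * lr * E[z (w.z)^3], then renormalize.

    sign=+1 seeks positive kurtosis, sign=-1 seeks negative kurtosis.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = Z @ w                                    # neuron output
        hebb = (Z * (y ** 3)[:, None]).mean(axis=0)  # input times nonlinearity
        w = w + sign * lr * hebb
        w /= np.linalg.norm(w)                       # stay on the unit sphere
    return w


rng = np.random.default_rng(2)
n = 20000
# Unit-variance sources: uniform (negative kurtosis), Laplace (positive).
S = np.column_stack([
    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), n),
    rng.laplace(0.0, 1.0 / np.sqrt(2.0), n),
])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])  # arbitrary mixing matrix for the demo
Z = sphere(S @ A.T)

y_neg = Z @ one_unit_rule(Z, sign=-1)  # should track the uniform source
y_pos = Z @ one_unit_rule(Z, sign=+1)  # should track the Laplace source
```

Each run recovers its source only up to sign, so agreement with the true source is best checked through the absolute correlation.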


2014 ◽  
Vol 664 ◽  
pp. 148-152
Author(s):  
Shuang Xi Jing ◽  
Song Tao Guo ◽  
Jun Fa Leng ◽  
Xing Yu Zhao

Constrained independent component analysis (cICA) is a new theory and method derived from independent component analysis (ICA). It can extract the desired independent components (ICs) from the data based on some prior information, thus overcoming the uncertainty of traditional ICA. Early gearbox fault signals are often very weak and are characterized by non-Gaussianity and a low signal-to-noise ratio (SNR), which restricts the application of existing diagnosis methods to early fault diagnosis. In this paper, the cICA algorithm is applied to gear fault diagnosis. Case studies verify the feasibility of using this method to extract the desired ICs, indicating its applicability and effectiveness.


2009 ◽  
Vol 10 (2) ◽  
pp. 85-115 ◽  
Author(s):  
M. P. S. Chawla

Independent component analysis (ICA) is a new technique suitable for separating independent components from complex electrocardiogram (ECG) signals. The basic idea of multidimensional independent component analysis (MICA) is to find stable higher-dimensional source signal subspaces and to decompose each rotation into elementary rotations within all two-dimensional planes spanned by the coordinate axes, which is useful for extracting diagnostic information about the heart. In this paper, ICA is used to parameterize ECG signals in order to reduce the amount of redundant ECG data. This work aims at finding an independent subspace analysis (ISA) model for ECG analysis that is applicable to any random vectors available in an ECG data set. For the common standards for electrocardiography (CSE) based ECG data sets, the joint approximate diagonalization of eigen matrices (Jade) algorithm is used to find smaller subspaces. The extracted independent components are further cleaned by statistical measures. In this study, it is also observed that the kurtosis coefficients of the independent components that represent the noise component can be further reduced using the parameterized multidimensional ICA (PMICA) technique. Any indeterminacies present in the ECG data are also analysed using a modified version of the Jade algorithm with PMICA and parameterized standard ICA (PsICA) for comparative studies. These indeterminacies are reduced more effectively by PMICA than by PsICA. The simulation results indicate that ICA improves the signal-to-noise ratio (SNR), as do other higher-order digital filtering methods such as Kalman and Butterworth filters, with minimum reconstruction errors. It is also confirmed that re-parameterization of the standard ICA model using the MICA technique results in a ‘component model’ that is geometric in spirit and free of the indeterminacies existing in the standard ICA (sICA) model.

