Optimal dimensionality selection for independent component analysis of transcriptomic data

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
John Luke McConn ◽  
Cameron R. Lamoureux ◽  
Saugat Poudel ◽  
Bernhard O. Palsson ◽  
Anand V. Sastry

Abstract

Background: Independent component analysis is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, independent component analysis effectively reveals both the source signals of the transcriptome as co-regulated gene sets, and the activity levels of the underlying regulators across diverse experimental conditions. Two major variables that affect the final gene sets are the diversity of the expression profiles contained in the underlying data, and the user-defined number of independent components, or dimensionality, to compute. Availability of high-quality transcriptomic datasets has grown exponentially as high-throughput technologies have advanced; however, optimal dimensionality selection remains an open question.

Methods: We computed independent components across a range of dimensionalities for four gene expression datasets with varying dimensions (both in terms of number of genes and number of samples). We computed the correlation between independent components across different dimensionalities to understand how the overall structure evolves as the number of user-defined components increases. We then measured how well the resulting gene clusters reflected known regulatory mechanisms, and developed a set of metrics to assess the accuracy of the decomposition at a given dimension.

Results: We found that over-decomposition results in many independent components dominated by a single gene, whereas under-decomposition results in independent components that poorly capture the known regulatory structure. From these results, we developed a new method, called OptICA, for finding the optimal dimensionality that controls for both over- and under-decomposition. Specifically, OptICA selects the highest dimension that produces a low number of components that are dominated by a single gene. We show that OptICA outperforms two previously proposed methods for selecting the number of independent components across four transcriptomic databases of varying sizes.

Conclusions: OptICA avoids both over-decomposition and under-decomposition of transcriptomic datasets resulting in the best representation of the organism’s underlying transcriptional regulatory network.
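
The selection rule described in the abstract (take the highest dimension that still yields few single-gene components) can be illustrated with a small sketch. This is not the published OptICA implementation: the function names, the `dominance` share used to call a component "single-gene", and the `max_single_gene` tolerance are all hypothetical choices made for illustration, and the toy matrices stand in for real ICA output.

```python
import numpy as np

def count_single_gene_components(M, dominance=0.5):
    """Count columns of M (genes x components) dominated by one gene.

    Hypothetical criterion: a component counts as 'single-gene' when its
    largest squared weight exceeds `dominance` of the column's total
    squared weight.
    """
    sq = np.asarray(M, dtype=float) ** 2
    totals = sq.sum(axis=0)
    totals[totals == 0] = 1.0  # guard against all-zero components
    share = sq.max(axis=0) / totals
    return int((share > dominance).sum())

def select_dimension(decompositions, max_single_gene=1, dominance=0.5):
    """Return the highest dimensionality with an acceptable single-gene count.

    `decompositions` maps dimensionality -> component matrix (genes x components).
    Returns None when no dimensionality satisfies the tolerance.
    """
    acceptable = [d for d, M in decompositions.items()
                  if count_single_gene_components(M, dominance) <= max_single_gene]
    return max(acceptable) if acceptable else None

def toy(n_genes, n_broad, n_single):
    """Toy component matrix: broad components plus one-hot 'single-gene' ones."""
    broad = np.ones((n_genes, n_broad))
    single = np.eye(n_genes)[:, :n_single]
    return np.hstack([broad, single])

# Illustrative stand-ins for decompositions computed at dimensions 2, 3 and 4.
decomps = {2: toy(6, 2, 0), 3: toy(6, 2, 1), 4: toy(6, 2, 2)}
best = select_dimension(decomps, max_single_gene=1)
print(best)
```

In practice the matrices would come from running ICA at each candidate dimensionality, and both thresholds would need to be tuned against the dataset.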

2021 ◽  
Author(s):  
John Luke McConn ◽  
Cameron R Lamoureux ◽  
Saugat Poudel ◽  
Bernhard O Palsson ◽  
Anand V Sastry

Independent Component Analysis (ICA) is an unsupervised machine learning algorithm that separates a set of mixed signals into a set of statistically independent source signals. Applied to high-quality gene expression datasets, ICA effectively reveals the source signals of the transcriptome as groups of co-regulated genes and their corresponding activities across diverse growth conditions. Two major variables that affect the output of ICA are the diversity and scope of the underlying data, and the user-defined number of independent components, or dimensionality, to compute. Availability of high-quality transcriptomic datasets has grown exponentially as high-throughput technologies have advanced; however, optimal dimensionality selection remains an open question. Here, we introduce a new method, called OptICA, for effectively finding the optimal dimensionality that consistently maximizes the number of biologically relevant components revealed while minimizing the potential for over-decomposition. We show that OptICA outperforms two previously proposed methods for selecting the number of independent components across four transcriptomic databases of varying sizes. OptICA avoids both over-decomposition and under-decomposition of transcriptomic datasets resulting in the best representation of the organism's underlying transcriptional regulatory network.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Carlos G. Urzúa-Traslaviña ◽  
Vincent C. Leeuwenburgh ◽  
Arkajyoti Bhattacharya ◽  
Stefan Loipfinger ◽  
Marcel A. T. M. van Vugt ◽  
...  

Abstract: The interpretation of high-throughput sequencing data is limited by our incomplete functional understanding of coding and non-coding transcripts. Reliably predicting the function of such transcripts can overcome this limitation. Here we report the use of a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprising over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles. We show that, compared to using Principal Component Analysis, Independent Component Analysis-derived transcriptional components enable more confident functionality predictions, improve predictions when new members are added to the gene sets, and are less affected by gene multi-functionality. Predictions generated using human or mouse transcriptomic data are made available for exploration in a publicly available web portal.


2016 ◽  
Vol 37 (1) ◽  
Author(s):  
Klaus Nordhausen ◽  
Hannu Oja ◽  
Esa Ollila

Oja, Sirkiä, and Eriksson (2006) and Ollila, Oja, and Koivunen (2007) showed that, under general assumptions, any two scatter matrices with the so-called independent components property can be used to estimate the unmixing matrix for independent component analysis (ICA). The method is a generalization of Cardoso’s (Cardoso, 1989) FOBI estimate, which uses the regular covariance matrix and a scatter matrix based on fourth moments. Different choices of the two scatter matrices are compared in a simulation study. Based on the study, we recommend always using two robust scatter matrices. For possibly asymmetric independent components, symmetrized versions of the scatter matrix estimates should be used.
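
The two-scatter idea can be made concrete with the FOBI special case, which pairs the covariance matrix with a fourth-moment scatter matrix. The following is a minimal sketch, assuming NumPy and sources with distinct kurtoses (FOBI cannot separate sources whose kurtoses coincide); the function name `fobi`, the mixing matrix, and the toy sources are hypothetical and not taken from the paper, and no robust scatter matrices are used here.

```python
import numpy as np

def fobi(X):
    """FOBI-style unmixing. X has shape (n_samples, n_features).

    Returns (W, S_est) where S_est = (X - mean) @ W.T.
    """
    Xc = X - X.mean(axis=0)

    # First scatter matrix: the regular covariance, used to whiten the data.
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ cov_inv_sqrt

    # Second scatter matrix: fourth-moment scatter E[||z||^2 z z^T] of the
    # whitened data.
    r2 = (Z ** 2).sum(axis=1)
    cov4 = (Z * r2[:, None]).T @ Z / len(Z)

    # Its eigenvectors give the rotation that completes the unmixing.
    _, U = np.linalg.eigh(cov4)
    W = U.T @ cov_inv_sqrt
    return W, Xc @ W.T

# Toy check: one uniform and one Laplace source (clearly different kurtoses).
rng = np.random.default_rng(0)
n = 20000
S = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
A = np.array([[1.0, 0.5], [0.3, 1.0]])  # hypothetical mixing matrix
X = S @ A.T

W, S_est = fobi(X)
# Absolute correlations between true sources (rows) and estimates (columns).
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(np.round(corr, 3))
```

Each true source should correlate strongly with exactly one estimated component, up to sign and ordering. Swapping either scatter matrix for a robust estimate, as the abstract recommends, would change only the two scatter computations.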


2019 ◽  
Vol 7 (3) ◽  
pp. SE19-SE42 ◽  
Author(s):  
David Lubo-Robles ◽  
Kurt J. Marfurt

During the past two decades, the number of volumetric seismic attributes has increased to the point at which interpreters are overwhelmed and cannot analyze all of the information that is available. Principal component analysis (PCA) is one of the best-known multivariate analysis techniques that decompose the input data into second-order statistics by maximizing the variance, thus obtaining mathematically uncorrelated components. Unfortunately, projecting the information in the multiple input data volumes onto an orthogonal basis often mixes rather than separates geologic features of interest. To address this issue, we have implemented and evaluated a relatively new unsupervised multiattribute analysis technique called independent component analysis (ICA), which is based on higher order statistics. We evaluate our algorithm to study the internal architecture of turbiditic channel complexes present in the Moki A sands Formation, Taranaki Basin, New Zealand. We input 12 spectral magnitude components ranging from 25 to 80 Hz into the ICA algorithm and we plot 3 of the resulting independent components against a red-green-blue color scheme to generate a single volume in which the colored independent components correspond to different seismic facies. The results obtained using ICA proved to be superior to those obtained using PCA. Specifically, ICA provides improved resolution and separates geologic features from noise. Moreover, with ICA, we can geologically analyze the different seismic facies and relate them to sand- and mud-prone seismic facies associated with axial and off-axis deposition and cut-and-fill architectures.


2014 ◽  
Vol 553 ◽  
pp. 564-569
Author(s):  
Yaseen Unnisa ◽  
Danh Tran ◽  
Fu Chun Huang

Independent Component Analysis (ICA) is a recent method of blind source separation that has been employed in medical image processing and structural damage detection. It can extract the source signals and the unmixing matrix of the system using the mixture signals only. This novel method relies on the assumption that the source signals are statistically independent. This paper looks at various measures of statistical independence (SI) employed in ICA, the measures proposed by Bakirov and his associates, and the effects of the level of SI of the source signals on the output of ICA. First, two statistically independent signals in the form of uniform random signals and a mixing matrix were used to simulate mixture signals to be analysed by the fastICA package. Second, noise was added to the signals to investigate the effects of the level of SI on the output of ICA in the form of source signals and the mixing and unmixing matrices. It was found that the p-value given by Bakirov’s SI statistical test of the null hypothesis H0 is a good indication of the SI between two variables, and that for p-values larger than 0.05, fastICA performs satisfactorily.
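
The simulation described above can be approximated with a short NumPy sketch. Several things here are assumptions rather than details from the paper: a minimal kurtosis-based symmetric fixed-point iteration is swapped in for the fastICA package, independence is reduced by adding a shared uniform noise term to both sources, the mixing matrix is arbitrary, and recovery is scored by correlation with the true sources rather than by Bakirov's test.

```python
import numpy as np

def whiten(X):
    """Center X and rescale to uncorrelated, unit-variance columns."""
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ evecs / np.sqrt(evals)

def sym_decorrelate(W):
    """Symmetric orthogonalization: W <- (W W^T)^(-1/2) W."""
    evals, evecs = np.linalg.eigh(W @ W.T)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T @ W

def fastica_kurtosis(X, n_iter=200, tol=1e-8, seed=0):
    """Minimal symmetric fixed-point ICA with the cubic (kurtosis) nonlinearity."""
    Z = whiten(X)
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(p, p)))
    for _ in range(n_iter):
        Y = Z @ W.T  # current source estimates, shape (n, p)
        W_new = sym_decorrelate((Y ** 3).T @ Z / n - 3 * W)
        # Converged when each row of W only changes by a sign flip.
        done = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1)) < tol
        W = W_new
        if done:
            break
    return Z @ W.T

def recovery_quality(S_true, S_est):
    """Worst-case (over sources) best absolute correlation with an estimate."""
    p = S_true.shape[1]
    c = np.abs(np.corrcoef(S_true.T, S_est.T)[:p, p:])
    return float(c.max(axis=1).min())

rng = np.random.default_rng(0)
n = 5000
S = rng.uniform(-1, 1, size=(n, 2))      # two independent uniform sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # hypothetical mixing matrix

# Case 1: independent sources.
q_indep = recovery_quality(S, fastica_kurtosis(S @ A.T))

# Case 2: a shared noise term makes the two signals statistically dependent.
shared = 2.0 * rng.uniform(-1, 1, size=(n, 1))
S_dep = S + shared
q_dep = recovery_quality(S, fastica_kurtosis(S_dep @ A.T))

print(round(q_indep, 3), round(q_dep, 3))
```

With independent sources the original signals should be recovered almost exactly, while the shared term degrades recovery; the size of that degradation depends on the noise scale chosen here.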

