Tuning parameters of dimensionality reduction methods for single-cell RNA-seq analysis

2020 ◽  
Author(s):  
Felix Raimundo ◽  
Celine Vallot ◽  
Jean Philippe Vert

Abstract Background Many computational methods have been developed recently to analyze single-cell RNA-seq (scRNA-seq) data. Several benchmark studies have compared these methods on their ability to perform dimensionality reduction, clustering or differential analysis, often relying on default parameters. Yet given the biological diversity of scRNA-seq datasets, parameter tuning might be essential for the optimal usage of methods, and determining how to tune parameters remains an unmet need. Results Here, we propose a benchmark to assess the performance of five methods, systematically varying their tunable parameters, for dimension reduction of scRNA-seq data, a common first step to many downstream applications such as cell type identification or trajectory inference. We run a total of 1.5 million experiments to assess the influence of parameter changes on the performance of each method, and propose two strategies to automatically tune parameters for methods that need it. Conclusions We find that principal component analysis (PCA)-based methods like scran and Seurat are competitive with default parameters but do not benefit much from parameter tuning, while more complex models like ZinbWave, DCA and scVI can reach better performance, but only after parameter tuning.
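As a rough illustration of what "systematically varying tunable parameters" involves, the sketch below enumerates every combination in a small parameter grid and ranks the combinations by score. It is not the authors' benchmark code: the parameter names `n_latent` and `learning_rate`, the grid values, and the scoring function `toy_score` are all hypothetical stand-ins for a real method and a real evaluation metric.

```python
from itertools import product


def sweep(param_grid, evaluate):
    """Score every parameter combination and return (score, params) pairs, best first."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    # Sort on the score only; dicts are not orderable
    results.sort(key=lambda result: result[0], reverse=True)
    return results


def toy_score(params):
    # Hypothetical score that peaks at n_latent=10, learning_rate=0.001.
    # A real benchmark would fit the method and measure clustering quality here.
    return -abs(params["n_latent"] - 10) - 1000 * abs(params["learning_rate"] - 0.001)


# Hypothetical grid: 3 x 3 = 9 experiments
grid = {"n_latent": [2, 10, 50], "learning_rate": [0.0001, 0.001, 0.01]}
ranked = sweep(grid, toy_score)
best_score, best_params = ranked[0]
print(best_params)
```

The number of experiments grows multiplicatively with each added parameter, which is how a benchmark over several methods and datasets can reach the order of magnitude reported above.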

2020 ◽  
Vol 21 (S16) ◽  
Author(s):  
Ruiyu Xiao ◽  
Guoshan Lu ◽  
Wanqian Guo ◽  
Shuilin Jin

Abstract Background Single-cell RNA sequencing can be used to fairly determine cell types, which is beneficial to the medical field, especially in the many recent studies on COVID-19. Generally, single-cell RNA data analysis pipelines include data normalization, dimensionality reduction, and unsupervised clustering. However, different normalization and dimensionality reduction methods will significantly affect the results of clustering and cell type enrichment analysis. The choice of preprocessing path is crucial in scRNA-Seq data mining, because a proper preprocessing path can extract more important information from complex raw data and lead to more accurate clustering results. Results We proposed a method called NDRindex (Normalization and Dimensionality Reduction index) to evaluate the data quality of the outcomes of normalization and dimensionality reduction methods. The method includes a function to calculate the degree of data aggregation, which is the key to measuring data quality before clustering. For the five single-cell RNA sequence datasets we tested, the results demonstrated the efficacy and accuracy of our index. Conclusions The method we introduce fills a gap in the selection of preprocessing paths, and the results demonstrate its effectiveness and accuracy. Our research provides useful indicators for the evaluation of RNA-Seq data.


2020 ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

Abstract RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy, for comparison with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
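The classification step described above (KNN applied to features that have already been reduced) can be sketched in a few lines. This is a minimal illustration only: it does not include the genetic algorithm, PCA or ICA stages, and the two-dimensional points and the labels `class_a` and `class_b` are made-up values standing in for reduced expression profiles.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points (Euclidean)."""
    ranked = sorted(zip(train_points, train_labels), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical samples after reduction to two components
train_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.3)]
train_labels = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

prediction = knn_predict(train_points, train_labels, (0.1, 0.1))
print(prediction)
```

Because KNN relies on distances, it tends to work better in a low-dimensional reduced space than on thousands of raw gene-level features, which is one motivation for reducing the data first.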


2021 ◽  
Vol 50 (9) ◽  
pp. 2579-2589
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi

RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for extracting relevant information from given data. In this study, a novel optimized dimensionality reduction algorithm is proposed, combining an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses a decision tree on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A feature ranking and earlier experience are used. The performance of the model is evaluated and validated using classification accuracy, for comparison with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

Abstract RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy, for comparison with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is capable of adding to prevailing machine learning methods.


2016 ◽  
Author(s):  
Peijie Lin ◽  
Michael Troup ◽  
Joshua W. K. Ho

Most existing dimensionality reduction and clustering packages for single-cell RNA-Seq (scRNA-Seq) data deal with dropouts by heavy modelling and computational machinery. Here we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm which uses a novel yet very simple ‘implicit imputation’ approach to alleviate the impact of dropouts in scRNA-Seq data in a principled manner. Using a range of simulated and real data, we have shown that CIDR improves on standard principal component analysis and outperforms the state-of-the-art methods, namely t-SNE, ZIFA and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds for a data set of hundreds of cells, and within minutes for a data set of thousands of cells. CIDR can be downloaded at https://github.com/VCCRI/CIDR.


2015 ◽  
Author(s):  
Christopher Yau ◽  
Emma Pierson

Single cell RNA-seq data allows insight into normal cellular function and diseases including cancer through the molecular characterisation of cellular state at the single-cell level. Dimensionality reduction of such high-dimensional datasets is essential for visualization and analysis, but single-cell RNA-seq data is challenging for classical dimensionality reduction methods because of the prevalence of dropout events leading to zero-inflated data. Here we develop a dimensionality reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves performance on simulated and biological datasets.
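To make "dropout events leading to zero-inflated data" concrete, the sketch below zeroes out simulated expression values with a probability that falls as expression rises. The functional form exp(-decay * x^2) is an assumption recalled from the ZIFA paper rather than something stated in this abstract, and the `decay` value and expression levels are arbitrary illustrative choices. The sketch covers only the dropout mechanism, not the factor analysis model itself.

```python
import math
import random


def dropout_probability(expression, decay):
    """Assumed dropout model: P(dropout) = exp(-decay * expression**2)."""
    return math.exp(-decay * expression ** 2)


def apply_dropout(values, decay, rng):
    """Replace each value by 0.0 with its expression-dependent dropout probability."""
    return [0.0 if rng.random() < dropout_probability(v, decay) else v for v in values]


rng = random.Random(0)  # fixed seed so repeated runs give the same draw
true_expression = [0.0, 0.5, 1.0, 2.0, 4.0]  # hypothetical log-scale expression levels
observed = apply_dropout(true_expression, decay=0.5, rng=rng)

for level in true_expression:
    print(level, round(dropout_probability(level, 0.5), 4))
```

Under this assumed model, lowly expressed genes are dropped far more often than highly expressed ones, so the observed zeros are not missing at random. That is the property that classical methods such as PCA do not account for.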


2021 ◽  
Vol 12 ◽  
Author(s):  
Ruizhi Xiang ◽  
Wencan Wang ◽  
Lei Yang ◽  
Shiyuan Wang ◽  
Chaohan Xu ◽  
...  

Single-cell RNA sequencing (scRNA-seq) is a high-throughput sequencing technology performed at the level of an individual cell, which has the potential to reveal cellular heterogeneity. However, scRNA-seq data are high-dimensional, noisy, and sparse. Dimension reduction is therefore an important step in the downstream analysis of scRNA-seq data, and several dimension reduction methods have been developed. We developed a strategy to evaluate the stability, accuracy, and computing cost of 10 dimensionality reduction methods using 30 simulation datasets and five real datasets. Additionally, we investigated the sensitivity of all the methods to hyperparameter tuning and gave users appropriate suggestions. We found that t-distributed stochastic neighbor embedding (t-SNE) yielded the best overall performance, with the highest accuracy but also the highest computing cost. Meanwhile, uniform manifold approximation and projection (UMAP) exhibited the highest stability, as well as moderate accuracy and the second highest computing cost. UMAP preserves the original cohesion and separation of cell populations well. In addition, it is worth noting that users need to set the hyperparameters according to the specific situation before using dimensionality reduction methods based on non-linear models and neural networks.
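One common way to quantify "cohesion and separation" in a reduced space is the silhouette coefficient, sketched below. Whether the authors used this exact measure is not stated in the abstract, so this is an illustrative assumption; the four two-dimensional points and both labelings are made up.

```python
from math import dist
from statistics import mean


def silhouette_scores(points, labels):
    """Per-point silhouette coefficient in [-1, 1]; higher means tighter, better separated."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by that point's cluster label
        distances = {}
        for j, q in enumerate(points):
            if i != j:
                distances.setdefault(labels[j], []).append(dist(p, q))
        own = labels[i]
        others = [mean(d) for label, d in distances.items() if label != own]
        if own not in distances or not others:
            scores.append(0.0)  # convention for singleton clusters or a single cluster
            continue
        a = mean(distances[own])  # cohesion: mean distance within own cluster
        b = min(others)           # separation: mean distance to nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return scores


# Hypothetical embedding: two tight groups far apart
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean(silhouette_scores(points, [0, 0, 1, 1]))  # labels match the groups
bad = mean(silhouette_scores(points, [0, 1, 0, 1]))   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Comparing such a score before and after dimensionality reduction gives one indication of how well an embedding preserves the structure of the cell populations.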


2021 ◽  
Author(s):  
Chengang Guo ◽  
Zhimin Wei ◽  
Wei Lyu ◽  
Yanlou Geng

Abstract Quinoa saponins have complex, diverse and evident physiologic activities. However, the key regulatory genes for quinoa saponin metabolism are not yet well studied. The purpose of this study was to explore genes closely related to quinoa saponin metabolism. In this study, the significantly differentially expressed genes in yellow quinoa were first screened using RNA-seq technology. Then, the key genes for saponin metabolism were selected by gene set enrichment analysis (GSEA) and principal component analysis (PCA). Finally, the specificity of the key genes was verified by hierarchical clustering. The differential analysis yielded 1654 differentially expressed genes after removal of pseudogenes. Of these, 142 were long non-coding genes and 1512 were protein-coding genes. Based on GSEA, 116 key candidate genes were found to be significantly correlated with quinoa saponin metabolism. Through PCA dimension reduction analysis, 57 key genes were finally obtained. Hierarchical cluster analysis further demonstrated that these key genes can clearly separate the four groups of samples. The present results could provide references for the breeding of sweet quinoa and would be helpful for the rational utilization of quinoa saponins.

