Tuning parameters of dimensionality reduction methods for single-cell RNA-seq analysis

Abstract Background Many computational methods have been developed recently to analyze single-cell RNA-seq (scRNA-seq) data. Several benchmark studies have compared these methods on their ability for dimensionality reduction, clustering or differential analysis, often relying on default parameters. Yet given the biological diversity of scRNA-seq datasets, parameter tuning might be essential for the optimal usage of methods, and determining how to tune parameters remains an unmet need. Results Here, we propose a benchmark to assess the performance of five methods, systematically varying their tunable parameters, for dimension reduction of scRNA-seq data, a common first step to many downstream applications such as cell type identification or trajectory inference. We run a total of 1.5 million experiments to assess the influence of parameter changes on the performance of each method, and propose two strategies to automatically tune parameters for methods that need it. Conclusions We find that principal component analysis (PCA)-based methods like scran and Seurat are competitive with default parameters but do not benefit much from parameter tuning, while more complex models like ZinbWave, DCA and scVI can reach better performance but after parameter tuning.
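The idea of systematically varying tunable parameters can be illustrated with a small exhaustive grid search. This is only a sketch under stated assumptions, not the benchmark's actual procedure: `grid_search`, `toy_fit`, `toy_score` and the parameter names `n_components` and `learning_rate` are hypothetical stand-ins, and the score is a placeholder for whatever embedding-quality metric is chosen.

```python
from itertools import product


def grid_search(fit, score, grid):
    """Evaluate every parameter combination in `grid`.

    fit(**params) is assumed to return an embedding and score(embedding)
    a number where higher is better. Both are supplied by the caller.
    Returns (best_params, best_score, all_results).
    """
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, score(fit(**params))))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


# Hypothetical stand-ins: a real run would fit a dimension reduction
# method here and score the embedding against known cell labels.
def toy_fit(n_components, learning_rate):
    return (n_components, learning_rate)


def toy_score(embedding):
    n_components, learning_rate = embedding
    # Placeholder score that peaks at n_components=10, learning_rate=0.01.
    return -abs(n_components - 10) - 100 * abs(learning_rate - 0.01)


best_params, best_score, results = grid_search(
    toy_fit,
    toy_score,
    {"n_components": [2, 10, 50], "learning_rate": [0.001, 0.01, 0.1]},
)
print(best_params)
```

The number of runs grows multiplicatively with each parameter, which is one reason a full benchmark of this kind reaches a very large number of experiments.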

NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data

BMC Bioinformatics ◽

10.1186/s12859-020-03883-x ◽

2020 ◽

Vol 21 (S16) ◽

Author(s):

Ruiyu Xiao ◽

Guoshan Lu ◽

Wanqian Guo ◽

Shuilin Jin

Keyword(s):

Dimensionality Reduction ◽

Data Quality ◽

Single Cell ◽

Enrichment Analysis ◽

Cell Types ◽

Size Reduction ◽

Rna Seq ◽

Reduction Methods ◽

The Many

Abstract Background Single-cell RNA sequencing can be used to fairly determine cell types, which is beneficial to the medical field, as seen in the many recent studies on COVID-19. Generally, single-cell RNA data analysis pipelines include data normalization, size reduction, and unsupervised clustering. However, different normalization and size reduction methods significantly affect the results of clustering and cell type enrichment analysis. The choice of preprocessing path is crucial in scRNA-Seq data mining, because a proper preprocessing path can extract more important information from complex raw data and lead to more accurate clustering results. Results We proposed a method called NDRindex (Normalization and Dimensionality Reduction index) to evaluate the data quality of the outcomes of normalization and dimensionality reduction methods. The method includes a function to calculate the degree of data aggregation, which is the key to measuring data quality before clustering. For the five single-cell RNA sequence datasets we tested, the results proved the efficacy and accuracy of our index. Conclusions The method we introduce focuses on filling the gap in the selection of preprocessing paths, and the results prove its effectiveness and accuracy. Our research provides useful indicators for the evaluation of RNA-Seq data.

Optimized Hybrid Heuristic Based Dimensionality Reduction Methods for Malaria Vector Using KNN Classifier

10.21203/rs.3.rs-107396/v1 ◽

2020 ◽

Author(s):

Micheal Olaolu Arowolo ◽

Marion Olubunmi Adebiyi ◽

Ayodele Ariyo Adebiyi ◽

Oludayo Olugbara

Keyword(s):

Gene Expression ◽

Dimensionality Reduction ◽

Principal Component ◽

Feature Space ◽

Component Analysis ◽

Rna Seq ◽

Knn Classifier ◽

Data Dimensionality Reduction ◽

Reduction Methods ◽

Mosquito Anopheles Gambiae

Abstract RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
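The final classification step, KNN applied to already reduced features, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the genetic algorithm and the PCA/ICA stages are omitted, `knn_predict` is a hypothetical helper, and the two-dimensional points and the labels "A" and "B" are made-up toy data rather than values from the Anopheles gambiae dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    Distances are Euclidean, computed in the reduced feature space.
    """
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Toy data: six samples already projected onto two components (assumed).
train_X = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1),
           (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (0.15, 0.15)))
print(knn_predict(train_X, train_y, (2.05, 2.0)))
```

Because KNN relies on distances, it tends to benefit from a lower-dimensional input, which is one motivation for reducing the feature space before classification.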

Enhanced Dimensionality Reduction Methods for Classifying Malaria Vector Dataset using Decision Tree

Sains Malaysiana ◽

10.17576/jsm-2021-5009-07 ◽

2021 ◽

Vol 50 (9) ◽

pp. 2579-2589

Author(s):

Micheal Olaolu Arowolo ◽

Marion Olubunmi Adebiyi ◽

Ayodele Ariyo Adebiyi

Keyword(s):

Gene Expression ◽

Decision Tree ◽

Dimensionality Reduction ◽

Principal Component ◽

Feature Space ◽

Relevant Information ◽

Component Analysis ◽

Rna Seq ◽

Reduction Methods ◽

Mosquito Anopheles Gambiae

RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for fetching relevant information from given data. In this study, a novel optimized dimensionality reduction algorithm is proposed, combining an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses a decision tree on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A feature ranking and earlier experience are used. The performance of the model is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis and indicate that the approach is a capable addition to prevailing data mining techniques.

Tuning parameters of dimensionality reduction methods for single-cell RNA-seq analysis

Genome Biology ◽

10.1186/s13059-020-02128-7 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Felix Raimundo ◽

Celine Vallot ◽

Jean-Philippe Vert

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

Rna Seq ◽

Tuning Parameters ◽

Reduction Methods

Optimized hybrid investigative based dimensionality reduction methods for malaria vector using KNN classifier

Journal Of Big Data ◽

10.1186/s40537-021-00415-z ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Micheal Olaolu Arowolo ◽

Marion Olubunmi Adebiyi ◽

Ayodele Ariyo Adebiyi ◽

Oludayo Olugbara

Keyword(s):

Gene Expression ◽

Dimensionality Reduction ◽

Principal Component ◽

Feature Space ◽

Component Analysis ◽

Rna Seq ◽

Knn Classifier ◽

Data Dimensionality Reduction ◽

Reduction Methods ◽

Mosquito Anopheles Gambiae

Abstract RNA-Seq data are utilized for biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed for the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset to enhance the accuracy and scalability of the gene expression analysis. The proposed algorithm is used to fetch relevant features from the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy, in comparison with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is capable of adding to prevailing machine learning methods.

CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-Seq data

10.1101/068775 ◽

2016 ◽

Cited By ~ 2

Author(s):

Peijie Lin ◽

Michael Troup ◽

Joshua W. K. Ho

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

State Of The Art ◽

Principal Component ◽

Real Data ◽

Rna Seq ◽

Data Set ◽

Standard Principal Component Analysis ◽

The Impact ◽

Imputation Approach

Most existing dimensionality reduction and clustering packages for single-cell RNA-Seq (scRNA-Seq) data deal with dropouts by heavy modelling and computational machinery. Here we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm which uses a novel yet very simple ‘implicit imputation’ approach to alleviate the impact of dropouts in scRNA-Seq data in a principled manner. Using a range of simulated and real data, we have shown that CIDR improves the standard principal component analysis and outperforms the state-of-the-art methods, namely t-SNE, ZIFA and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds for processing a data set of hundreds of cells, and minutes for a data set of thousands of cells. CIDR can be downloaded at https://github.com/VCCRI/CIDR.

ZIFA: Dimensionality reduction for zero-inflated single cell gene expression analysis

10.1101/019141 ◽

2015 ◽

Cited By ~ 1

Author(s):

Christopher Yau ◽

Emma Pierson

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

Rna Seq ◽

Reduction Methods ◽

Cell Gene Expression ◽

High Dimensional Datasets ◽

Cell Gene ◽

Normal Cellular Function ◽

Insight Into ◽

Dimensionality Reduction Method

Single cell RNA-seq data allows insight into normal cellular function and diseases including cancer through the molecular characterisation of cellular state at the single-cell level. Dimensionality reduction of such high-dimensional datasets is essential for visualization and analysis, but single-cell RNA-seq data is challenging for classical dimensionality reduction methods because of the prevalence of dropout events leading to zero-inflated data. Here we develop a dimensionality reduction method, (Z)ero (I)nflated (F)actor (A)nalysis (ZIFA), which explicitly models the dropout characteristics, and show that it improves performance on simulated and biological datasets.
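To make the notion of dropout-driven zero inflation concrete, the sketch below simulates dropouts on a list of expression values. The abstract does not spell out the dropout model, so this is an assumption: the probability of a dropout is taken to decay as exp(-decay * x^2) with the expression level x, which is how the ZIFA model is commonly described. The function name `simulate_dropout`, the `decay` value and the toy inputs are illustrative only.

```python
import math
import random


def simulate_dropout(expression, decay=0.1, seed=0):
    """Return a copy of `expression` with some values replaced by 0.0.

    Each value x is dropped with probability exp(-decay * x * x), so
    weakly expressed genes are zeroed far more often than strong ones.
    The exact functional form is an assumption of this sketch.
    """
    rng = random.Random(seed)
    observed = []
    for x in expression:
        p_drop = math.exp(-decay * x * x)
        observed.append(0.0 if rng.random() < p_drop else x)
    return observed


# Compare how often a low and a high expression level are dropped.
low = simulate_dropout([1.0] * 1000)
high = simulate_dropout([5.0] * 1000)
print(low.count(0.0), high.count(0.0))
```

A method that ignores this mechanism treats the excess zeros as true low expression, which is the difficulty for classical dimensionality reduction that the abstract points to.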

A Comparison for Dimensionality Reduction Methods of Single-Cell RNA-seq Data

Frontiers in Genetics ◽

10.3389/fgene.2021.646936 ◽

2021 ◽

Vol 12 ◽

Author(s):

Ruizhi Xiang ◽

Wencan Wang ◽

Lei Yang ◽

Shiyuan Wang ◽

Chaohan Xu ◽

...

Keyword(s):

Dimensionality Reduction ◽

Dimension Reduction ◽

Single Cell ◽

High Throughput Sequencing ◽

Cellular Heterogeneity ◽

Specific Situation ◽

Rna Seq ◽

Reduction Methods ◽

The Stability ◽

Downstream Analysis

Single-cell RNA sequencing (scRNA-seq) is a high-throughput sequencing technology performed at the level of an individual cell, which has the potential to reveal cellular heterogeneity. However, scRNA-seq data are high-dimensional, noisy, and sparse. Dimension reduction is an important step in downstream analysis of scRNA-seq, and several dimension reduction methods have therefore been developed. We developed a strategy to evaluate the stability, accuracy, and computing cost of 10 dimensionality reduction methods using 30 simulation datasets and five real datasets. Additionally, we investigated the sensitivity of all the methods to hyperparameter tuning and gave users appropriate suggestions. We found that t-distributed stochastic neighbor embedding (t-SNE) yielded the best overall performance with the highest accuracy and computing cost. Meanwhile, uniform manifold approximation and projection (UMAP) exhibited the highest stability, as well as moderate accuracy and the second highest computing cost. UMAP preserves the original cohesion and separation of cell populations well. In addition, it is worth noting that users need to set the hyperparameters according to the specific situation before using dimensionality reduction methods based on non-linear models and neural networks.

Bioinformatics-based Screening of Key Genes for Saponin Metabolism in Quinoa

10.21203/rs.3.rs-139481/v1 ◽

2021 ◽

Author(s):

Chengang Guo ◽

Zhimin Wei ◽

Wei Lyu ◽

Yanlou Geng

Keyword(s):

Differentially Expressed Genes ◽

Principal Component ◽

Enrichment Analysis ◽

Hierarchical Cluster ◽

Gene Set Enrichment Analysis ◽

Differentially Expressed ◽

Differential Analysis ◽

Rna Seq ◽

Protein Coding ◽

Key Genes

Abstract Quinoa saponins have complex, diverse and evident physiologic activities. However, the key regulatory genes for quinoa saponin metabolism are not yet well studied. The purpose of this study was to explore genes closely related to quinoa saponin metabolism. In this study, the significantly differentially expressed genes in yellow quinoa were first screened based on RNA-seq technology. Then, the key genes for saponin metabolism were selected by gene set enrichment analysis (GSEA) and principal component analysis (PCA). Finally, the specificity of the key genes was verified by hierarchical clustering. The differential analysis yielded 1654 differentially expressed genes after pseudogene deletion. Of these, 142 were long non-coding genes and 1512 were protein-coding genes. Based on GSEA, 116 key candidate genes were found to be significantly correlated with quinoa saponin metabolism. Through PCA dimension reduction analysis, 57 key genes were finally obtained. Hierarchical cluster analysis further demonstrated that these key genes can clearly separate the four groups of samples. The present results could provide references for the breeding of sweet quinoa and would be helpful for the rational utilization of quinoa saponins.
