Dimensionality reduction by UMAP reinforces sample heterogeneity analysis in bulk transcriptomic data

2021
Author(s): Yang Yang, Hongjian Sun, Yu Zhang, Tiefu Zhang, Jialei Gong, ...

Abstract: Transcriptome profiling and differential gene expression analysis are ubiquitous tools in biomedical research and clinical application. Linear dimensionality reduction methods, especially principal component analysis (PCA), are widely used to detect sample-to-sample heterogeneity in bulk transcriptomic datasets so that appropriate analytic methods can be applied to correct batch effects, remove outliers and distinguish subgroups. Non-linear dimensionality reduction methods were developed in response to the challenge of analysing transcriptomic datasets with large sample sizes, such as single-cell RNA-sequencing (scRNA-seq). t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP) have the advantage of preserving local information among samples and enable effective identification of heterogeneity and efficient organisation of clusters in scRNA-seq analysis. However, the utility of t-SNE and UMAP in bulk transcriptomic analysis has not been carefully examined. We therefore compared major dimensionality reduction methods (linear: PCA; non-linear: multidimensional scaling (MDS), t-SNE and UMAP) on 71 bulk transcriptomic datasets with large sample sizes. UMAP was superior in preserving sample-level neighbourhood information and maintaining clustering accuracy, thus conspicuously differentiating batch effects, identifying pre-defined biological groups and revealing in-depth clustering structures. We further verified that new clustering structures visualised by UMAP were associated with biological features and clinical meaning. We therefore recommend adopting UMAP for visualising and analysing sizable bulk transcriptomic datasets.
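The comparison described above can be illustrated with a minimal sketch (not the authors' pipeline): a samples-by-genes matrix with simulated subgroups is embedded with PCA and with UMAP via the umap-learn package. The simulated data, preprocessing and parameter values are illustrative assumptions only.

```python
# Minimal sketch: PCA vs UMAP on a simulated bulk expression matrix
# (samples x genes). Data and parameters are illustrative, not from the paper.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import umap  # umap-learn

# 300 samples, 2,000 "genes", 4 hidden subgroups standing in for batches/biological groups
X, groups = make_blobs(n_samples=300, n_features=2000, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Linear embedding: first two principal components
pca_emb = PCA(n_components=2, random_state=0).fit_transform(X)

# Non-linear embedding: UMAP, which favours local neighbourhood preservation
umap_emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

print("PCA embedding:", pca_emb.shape, "UMAP embedding:", umap_emb.shape)
```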

2019
Vol. 8 (2), pp. 4800-4807

Recently, engineers have concentrated on designing effective prediction models for student admission rates in order to support the educational growth of the nation. Predicting student admission to higher education is a challenging task for any educational organisation, and admission to higher education faces a highly visible crisis. The student admission rate is a major risk to the educational community worldwide: it strongly affects the economic, social, academic, financial and cultural growth of the nation. The admission rate also depends on the admission procedures and policies of educational institutions, and the chance of admission depends on the feedback given by all stakeholders in the education sector. Forecasting student admission is therefore a major task for any educational institution seeking to protect the profit and wealth of the organisation. This paper analyses the performance of student admission prediction using machine-learning dimensionality reduction algorithms. The Admission Predict dataset from the Kaggle machine learning repository is used for the prediction analysis, and its features are reduced with feature reduction methods. The prediction of the chance of admission proceeds in four steps, sketched in the code below. First, the correlations between the dataset attributes are computed and depicted as a histogram. Second, the most highly correlated features, which contribute directly to predicting the chance of admission, are identified. Third, the Admission Predict dataset is subjected to dimensionality reduction methods, namely principal component analysis (PCA), Sparse PCA, Incremental PCA, Kernel PCA and Mini-Batch Sparse PCA. Fourth, the dimensionality-reduced dataset is used to compute and compare the mean squared error (MSE), mean absolute error (MAE) and R2 score of each method. The implementation is done in Python in the Anaconda Spyder integrated development environment. Experimental results show that CGPA, GRE score and TOEFL score are the most highly correlated features for predicting the chance of admission, and that Incremental PCA achieves the most effective prediction, with a minimum MSE of 0.09, an MAE of 0.24 and a reasonable R2 score of 0.26.
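Steps three and four can be sketched roughly as follows. This is not the paper's code: it assumes a local copy of the Kaggle file, the column names may differ slightly, and the linear regressor is an assumption since the abstract does not name the prediction model.

```python
# Illustrative sketch of steps 3-4: compare scikit-learn PCA variants on an
# admissions-style dataset. File path, column names and the regressor are assumptions.
import pandas as pd
from sklearn.decomposition import (PCA, SparsePCA, IncrementalPCA,
                                   KernelPCA, MiniBatchSparsePCA)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

df = pd.read_csv("Admission_Predict.csv")        # assumed local copy of the Kaggle file
df.columns = df.columns.str.strip()              # guard against stray spaces in headers
y = df["Chance of Admit"]                        # target column (name assumed)
X = df.drop(columns=["Serial No.", "Chance of Admit"])

reducers = {
    "PCA": PCA(n_components=3),
    "SparsePCA": SparsePCA(n_components=3, random_state=0),
    "IncrementalPCA": IncrementalPCA(n_components=3),
    "KernelPCA": KernelPCA(n_components=3, kernel="rbf"),
    "MiniBatchSparsePCA": MiniBatchSparsePCA(n_components=3, random_state=0),
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for name, reducer in reducers.items():
    Z_tr, Z_te = reducer.fit_transform(X_tr), reducer.transform(X_te)
    pred = LinearRegression().fit(Z_tr, y_tr).predict(Z_te)
    print(f"{name:20s} MSE={mean_squared_error(y_te, pred):.3f} "
          f"MAE={mean_absolute_error(y_te, pred):.3f} R2={r2_score(y_te, pred):.3f}")
```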


2022
pp. 17-25
Author(s): Nancy Jan Sliper

Experimenters today frequently quantify millions or even billions of characteristics (measurements) per sample to address critical biological questions, in the hope that machine learning tools can make correct data-driven judgements. An efficient analysis requires a low-dimensional representation that preserves the differentiating features in data whose size and complexity are orders of magnitude apart (e.g., whether a certain ailment is present in a person's body). While several methods can handle millions of variables and still offer strong empirical and conceptual guarantees, few of them are easy to interpret. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending principal component analysis (PCA) by including class-conditional moment estimates in low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, includes the class-conditional means. We show, on both experimental and simulated benchmark data, that LOLR projections and their extensions improve data representations for subsequent classification while retaining computational flexibility and reliability. In terms of accuracy, LOLR prediction outperforms other modular linear dimension reduction methods that require much longer computation times on conventional computers. LOLR scales to more than 150 million attributes in brain-image processing datasets and to genome sequencing datasets with more than half a million attributes.
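A rough sketch of the idea behind such LOLR-style projections is given below: the class-conditional mean differences are stacked with the top principal directions and the result is orthonormalised into a low-rank projection. This is an illustrative reconstruction under stated assumptions, not the authors' implementation, and the synthetic data and component count are placeholders.

```python
# Sketch of an LOLR-style supervised projection: combine class-mean differences
# with leading principal directions, then orthonormalise.
# Illustrative reconstruction only, not the authors' implementation.
import numpy as np
from sklearn.datasets import make_classification

def lol_like_projection(X, y, n_components=5):
    classes = np.unique(y)
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    delta = means[1:] - means[0]                    # class-mean differences (k-1 directions)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions of centred data
    directions = np.vstack([delta, Vt[:n_components]])
    Q, _ = np.linalg.qr(directions.T)               # orthonormalise the stacked directions
    return Q[:, :n_components]                      # projection matrix (features x components)

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           n_classes=2, random_state=0)
W = lol_like_projection(X, y, n_components=5)
Z = X @ W                                           # low-dimensional representation
print(Z.shape)  # (200, 5)
```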


2020
Author(s): Micheal Olaolu Arowolo, Marion Olubunmi Adebiyi, Ayodele Ariyo Adebiyi, Oludayo Olugbara

Abstract: RNA-Seq data are used in biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for transforming these data. In this study, a novel optimised hybrid investigative approach is proposed. It combines an optimised genetic algorithm with principal component analysis and independent component analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimal feature subset and latent correlated features, respectively. A k-nearest neighbours (KNN) classifier is then applied to the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm fetches relevant features from the high-dimensional input feature space, and a fast feature-ranking algorithm is used to select relevant features. The performance of the model is evaluated and validated using classification accuracy and compared with existing approaches in the literature. The experimental results are promising for selecting relevant genes and classifying gene expression data, indicating that the approach is a capable addition to prevailing machine learning methods.
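A condensed sketch of the kind of pipeline described above follows: a simple genetic algorithm selects a gene subset, PCA reduces the selected genes, and a KNN classifier is scored by cross-validation. The GA settings and the synthetic stand-in for the Anopheles gambiae expression matrix are assumptions, not the authors' configuration.

```python
# Condensed sketch: genetic-algorithm gene selection -> PCA -> KNN,
# scored by cross-validated accuracy. Synthetic data stands in for the
# Anopheles gambiae expression matrix; GA settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=120, n_features=400, n_informative=15,
                           random_state=0)

def fitness(mask):
    if mask.sum() < 5:                               # need enough genes for PCA
        return 0.0
    model = make_pipeline(PCA(n_components=5), KNeighborsClassifier(n_neighbors=3))
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

pop = rng.random((20, X.shape[1])) < 0.1             # 20 random gene subsets (~10% of genes each)
for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]           # keep the 10 fittest subsets
    cut = X.shape[1] // 2
    children = np.vstack([np.concatenate([a[:cut], b[cut:]])  # one-point crossover
                          for a, b in zip(parents, parents[::-1])])
    children ^= rng.random(children.shape) < 0.01      # bit-flip mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected genes:", int(best.sum()), "CV accuracy:", round(fitness(best), 3))
```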


2021
Vol. 50 (9), pp. 2579-2589
Author(s): Micheal Olaolu Arowolo, Marion Olubunmi Adebiyi, Ayodele Ariyo Adebiyi

RNA-Seq data are used in biological applications and decision making for the classification of genes. Much recent work has focused on reducing the dimension of RNA-Seq data, and dimensionality reduction approaches have been proposed for extracting relevant information from the data. In this study, a novel optimised dimensionality reduction algorithm is proposed, combining an optimised genetic algorithm with principal component analysis and independent component analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimal feature subset and latent correlated features, respectively. A decision tree classifier is then applied to the reduced mosquito Anopheles gambiae dataset to enhance accuracy and scalability in gene expression analysis. The proposed algorithm fetches relevant features from the high-dimensional input feature space; feature ranking and prior experience are used to select them. The performance of the model is evaluated and validated using classification accuracy and compared with existing approaches in the literature. The experimental results are promising for feature selection and classification in gene expression data analysis, indicating that the approach is a capable addition to prevailing data mining techniques.
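The classification step here mirrors the previous sketch, with a decision tree in place of KNN. Again a brief hedged illustration with placeholder data and parameters, not the authors' code.

```python
# Sketch of the classification step with a decision tree on dimensionality-reduced
# data (placeholder data and parameters; the GA-based reduction is not reproduced here).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=120, n_features=400, n_informative=15, random_state=0)
model = make_pipeline(PCA(n_components=5), DecisionTreeClassifier(random_state=0))
print("CV accuracy:", cross_val_score(model, X, y, cv=3).mean().round(3))
```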


2020
Author(s): Felix Raimundo, Celine Vallot, Jean Philippe Vert

Abstract
Background: Many computational methods have been developed recently to analyse single-cell RNA-seq (scRNA-seq) data. Several benchmark studies have compared these methods on their ability to perform dimensionality reduction, clustering or differential analysis, often relying on default parameters. Yet given the biological diversity of scRNA-seq datasets, parameter tuning might be essential for the optimal use of these methods, and how to tune parameters remains an unmet need.
Results: Here, we propose a benchmark to assess the performance of five methods, systematically varying their tunable parameters, for dimension reduction of scRNA-seq data, a common first step for many downstream applications such as cell-type identification or trajectory inference. We run a total of 1.5 million experiments to assess the influence of parameter changes on the performance of each method, and propose two strategies to automatically tune parameters for methods that need it.
Conclusions: We find that principal component analysis (PCA)-based methods such as scran and Seurat are competitive with default parameters and do not benefit much from parameter tuning, while more complex models such as ZinbWave, DCA and scVI can reach better performance, but only after parameter tuning.
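The kind of parameter sweep such a benchmark performs can be illustrated with a small sketch (not the authors' benchmark or methods): vary the number of principal components, cluster the embedding, and score it against known labels with the adjusted Rand index. The data, parameter grid and scoring choice are assumptions.

```python
# Small illustration of a parameter sweep for dimension reduction: vary the number
# of principal components, cluster the embedding, and score against known labels
# with the adjusted Rand index. Not the paper's benchmark.
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, labels = make_blobs(n_samples=500, n_features=200, centers=6, random_state=0)

for n_comps in (2, 5, 10, 30, 50):
    embedding = PCA(n_components=n_comps, random_state=0).fit_transform(X)
    clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(embedding)
    print(f"n_comps={n_comps:3d}  ARI={adjusted_rand_score(labels, clusters):.3f}")
```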

