Dimensionality Reduction Methods Used in Machine Learning

AbstractIn most cases, a dataset obtained through observation, measurement, etc. cannot be directly used for the training of a machine learning based system due to the unavoidable existence of missing data, inconsistencies and high dimensional feature space. Additionally, the individual features can contain quite different data types and ranges. For this reason, a data preprocessing step is nearly always necessary before the data can be used. This paper gives a short review of the typical methods applicable in the preprocessing and dimensionality reduction of raw data.

Download Full-text

Supervised dimensionality reduction for big data

Nature Communications ◽

10.1038/s41467-021-23102-2 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Joshua T. Vogelstein ◽

Eric W. Bridgeford ◽

Minh Tang ◽

Da Zheng ◽

Christopher Douville ◽

...

Keyword(s):

Dimensionality Reduction ◽

Data Science ◽

Real Data ◽

Low Rank ◽

Conditional Moment ◽

Desktop Computer ◽

Reduction Techniques ◽

Reduction Methods ◽

The Individual ◽

Low Dimensional

AbstractTo solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.

Download Full-text

Identifying Diseases and Diagnosis using Machine Learning

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1297.0886s219 ◽

2019 ◽

Vol 8 (6S2) ◽

pp. 978-981

Keyword(s):

Machine Learning ◽

Mathematical Model ◽

Dimensionality Reduction ◽

Disease Classification ◽

High Dimensional ◽

Classification Algorithms ◽

Major Work ◽

The Mathematical Model ◽

Best Fit ◽

Learning By Using

The method that is use to optimize the criterion efficiency that depend on the previous experience is known as machine learning. By using the statistics theory it creates the mathematical model, and its major work is to surmise from the examples gave. To take the data straightforwardly from the information the approach uses computational methods. For recognize and identify the disease correctly a pattern is very necessary in Diagnosis recognition of disease. for creating the different models machine learning is used, this model can use for prediction of output and this output is depend on the input that is related to the data which previously used. For curing any disease it is very important to identify and detect that disease. For classify the disease classification algorithms are used. It uses are many dimensionality reduction algorithms and classification algorithms. Without externally modified the computer can learn with the help of the machine learning. For taking the best fit from the observation set the hypothesis is selected. Multi-dimensional and high dimensional are used in machine learning. By using machine learning automatic and classy algorithms can build.

Download Full-text

Unsupervised Text Feature Learning via Deep Variational Auto-encoder

Information Technology And Control ◽

10.5755/j01.itc.49.3.25918 ◽

2020 ◽

Vol 49 (3) ◽

pp. 421-437

Author(s):

Genggeng Liu ◽

Lin Xie ◽

Chi-Hua Chen

Keyword(s):

Dimensionality Reduction ◽

High Dimensional Data ◽

Image Data ◽

Original Data ◽

Feature Representation ◽

High Dimensional ◽

Learning To Learn ◽

Text Feature ◽

Reduction Methods ◽

Low Dimensional

Dimensionality reduction plays an important role in the data processing of machine learning and data mining, which makes the processing of high-dimensional data more efficient. Dimensionality reduction can extract the low-dimensional feature representation of high-dimensional data, and an effective dimensionality reduction method can not only extract most of the useful information of the original data, but also realize the function of removing useless noise. The dimensionality reduction methods can be applied to all types of data, especially image data. Although the supervised learning method has achieved good results in the application of dimensionality reduction, its performance depends on the number of labeled training samples. With the growing of information from internet, marking the data requires more resources and is more difficult. Therefore, using unsupervised learning to learn the feature of data has extremely important research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied in the text data, so that the high-dimensional feature to the low-dimensional feature becomes efficient and the low-dimensional feature can retain mainly information as much as possible. Low-dimensional feature obtained by different dimensionality reduction methods are used to compare with the dimensionality reduction results of variational auto-encoder (VAE), and the method can be significantly improved over other comparison methods.

Download Full-text

Machine Learning Dimensionality Reduction Showed Marginal Performance Benefit Over Stepwise Regression for Risk Stratification of Chest Pain Patients in the Emergency Department

10.21203/rs.3.rs-43703/v1 ◽

2020 ◽

Author(s):

Nan Liu ◽

Marcel Lucas Chee ◽

Zhi Xiong Koh ◽

Su Li Leow ◽

Andrew Fu Wah Ho ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Chest Pain ◽

Risk Stratification ◽

Dimensionality Reduction ◽

Prediction Models ◽

Stepwise Logistic Regression ◽

Stepwise Variable Selection ◽

Reduction Methods ◽

Pain Patients

Abstract Background: Chest pain is among the most common presenting complaints in the emergency department (ED). Swift and accurate risk stratification of chest pain patients in the ED may improve patient outcomes and reduce unnecessary costs. Traditional logistic regression with stepwise variable selection has been used to build risk prediction models for ED chest pain patients. In this study, we aimed to investigate if machine learning dimensionality reduction methods can achieve superior performance than the stepwise approach in deriving risk stratification models. Methods: A retrospective analysis was conducted on the data of patients >20 years old who presented to the ED of Singapore General Hospital with chest pain between September 2010 and July 2015. Variables used included demographics, medical history, laboratory findings, heart rate variability (HRV), and HRnV parameters calculated from five to six-minute electrocardiograms (ECGs). The primary outcome was 30-day major adverse cardiac events (MACE), which included death, acute myocardial infarction, and revascularization. Candidate variables identified using univariable analysis were then used to generate the stepwise logistic regression model and eight machine learning dimensionality reduction prediction models. A separate set of models was derived by excluding troponin. Receiver operating characteristic (ROC) and calibration analysis was used to compare model performance.Results: 795 patients were included in the analysis, of which 247 (31%) met the primary outcome of 30-day MACE. Patients with MACE were older and more likely to be male. All eight dimensionality reduction methods marginally but non-significantly outperformed stepwise variable selection; The multidimensional scaling algorithm performed the best with an area under the curve (AUC) of 0.901. All HRnV-based models generated in this study outperformed several existing clinical scores in ROC analysis.Conclusions: HRnV-based models using stepwise logistic regression performed better than existing chest pain scores for predicting MACE, with only marginal improvements using machine learning dimensionality reduction. Moreover, traditional stepwise approach benefits from model transparency and interpretability; in comparison, machine learning dimensionality reduction models are black boxes, making them difficult to explain in clinical practice.

Download Full-text

Uncovering High-dimensional Structures of Projections from Dimensionality Reduction Methods

MethodsX ◽

10.1016/j.mex.2020.101093 ◽

2020 ◽

Vol 7 ◽

pp. 101093 ◽

Cited By ~ 1

Author(s):

Michael C. Thrun ◽

Alfred Ultsch

Keyword(s):

Dimensionality Reduction ◽

High Dimensional ◽

Reduction Methods

Download Full-text

Feature Snatching and Performance Assessment for Connoting the Admittance Likelihood of student using Principal Component Reduction

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2286.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 4800-4807

Keyword(s):

Higher Education ◽

Machine Learning ◽

Dimensionality Reduction ◽

Admission Rate ◽

Principal Component ◽

Feature Reduction ◽

Experimental Result ◽

Sparse Pca ◽

Reduction Methods ◽

Incremental Pca

Recently, engineers are concentrating on designing an effective prediction model for finding the rate of student admission in order to raise the educational growth of the nation. The method to predict the student admission towards the higher education is a challenging task for any educational organization. There is a high visibility of crisis towards admission in the higher education. The admission rate of the student is the major risk to the educational society in the world. The student admission greatly affects the economic, social, academic, profit and cultural growth of the nation. The student admission rate also depends on the admission procedures and policies of the educational institutions. The chance of student admission also depends on the feedback given by all the stake holders of the educational sectors. The forecasting of the student admission is a major task for any educational institution to protect the profit and wealth of the organization. This paper attempts to analyze the performance of the student admission prediction by using machine learning dimensionality reduction algorithms. The Admission Predict dataset from Kaggle machine learning Repository is used for prediction analysis and the features are reduced by feature reduction methods. The prediction of the chance of Admit is achieved in four ways. Firstly, the correlation between each of the dataset attributes are found and depicted as a histogram. Secondly, the top most high correlated features are identified which are directly contributing to the prediction of chance of admit. Thirdly, the Admission Predict dataset is subjected to dimensionality reduction methods like principal component analysis (PCA), Sparse PCA, Incremental PCA , Kernel PCA and Mini Batch Sparse PCA. Fourth, the optimized dimensionality reduced dataset is then executed to analyze and compare the mean squared error, Mean Absolute Error and R2 Score of each method. The implementation is done by python in Anaconda Spyder Navigator Integrated Development Environment. Experimental Result shows that the CGPA, GRE Score and TOEFL Score are highly correlated features in predicting the chance of admit. The execution of performance analysis shows that Incremental PCA have achieved the effective prediction of chance of admit with minimum MSE of 0.09, MAE of 0.24 and reasonable R2 Score of 0.26.

Download Full-text

A Study of Feature Selection and Dimensionality Reduction Methods for Classification-Based Phishing Detection System

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2021010101 ◽

2021 ◽

Vol 11 (1) ◽

pp. 1-35

Author(s):

Amit Singh ◽

Abhishek Tiwari

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Dimensionality Reduction ◽

Detection System ◽

Parallel Evolution ◽

Feature Selection Method ◽

Sensitive Information ◽

Selection Methods ◽

Survey Paper ◽

Reduction Methods

Phishing was introduced in 1996, and now phishing is the biggest cybercrime challenge. Phishing is an abstract way to deceive users over the internet. Purpose of phishers is to extract the sensitive information of the user. Researchers have been working on solutions of phishing problem, but the parallel evolution of cybercrime techniques have made it a tough nut to crack. Recently, machine learning-based solutions are widely adopted to tackle the menace of phishing. This survey paper studies various feature selection method and dimensionality reduction methods and sees how they perform with machine learning-based classifier. The selection of features is vital for developing a good performance machine learning model. This work is comparing three broad categories of feature selection methods, namely filter, wrapper, and embedded feature selection methods, to reduce the dimensionality of data. The effectiveness of these methods has been assessed on several machine learning classifiers using k-fold cross-validation score, accuracy, precision, recall, and time.

Download Full-text

Estimation of missing prices in real-estate market agent-based simulations with machine learning and dimensionality reduction methods

Neural Computing and Applications ◽

10.1007/s00521-018-3938-7 ◽

2019 ◽

Vol 32 (7) ◽

pp. 2665-2682

Author(s):

Iván García-Magariño ◽

Carlos Medrano ◽

Jorge Delgado

Keyword(s):

Machine Learning ◽

Real Estate ◽

Dimensionality Reduction ◽

Real Estate Market ◽

Agent Based ◽

Agent Based Simulations ◽

Reduction Methods

Download Full-text

Comparison of Matrix Dimensionality Reduction Methods in Uncovering Latent Structures in the Data

Journal of Information & Knowledge Management ◽

10.1142/s0219649210002498 ◽

2010 ◽

Vol 09 (01) ◽

pp. 81-92 ◽

Cited By ~ 3

Author(s):

Ch. Aswani Kumar ◽

Ramaraj Palanisamy

Keyword(s):

Dimensionality Reduction ◽

Time Series Data ◽

Matrix Decomposition ◽

Decomposition Methods ◽

Singular Value ◽

Series Data ◽

High Dimensional ◽

Reduction Methods ◽

Latent Structures ◽

Value Decomposition

Matrix decomposition methods: Singular Value Decomposition (SVD) and Semi Discrete Decomposition (SDD) are proved to be successful in dimensionality reduction. However, to the best of our knowledge, no empirical results are presented and no comparison between these methods is done to uncover latent structures in the data. In this paper, we present how these methods can be used to identify and visualise latent structures in the time series data. Results on a high dimensional dataset demonstrate that SVD is more successful in uncovering the latent structures.

Download Full-text