scholarly journals Behavior of Linear and Nonlinear Dimensionality Reduction for Collective Variable Identi�cation of Small Molecule Solution-Phase Reactions

Author(s):  
Hung Le ◽  
Sushant Kumar ◽  
Nathan May ◽  
Ernesto Martinez-Baez ◽  
Ravishankar Sundararaman ◽  
...  

Identifying collective variables for chemical reactions is essential to reduce the 3$N$ dimensional energy landscape into lower dimensional basins and barriers of interest. However in condensed phase processes, the non-meaningful motions of bulk solvent often overpower the ability of dimensionality reduction methods to identify correlated motions that underpin collective variables. Yet solvent can play important indirect or direct roles in reactivity and much can be lost through treatments that remove or dampen solvent motion. This has been amply demonstrated within principal component analysis, although less is known about the behavior of nonlinear dimensionality reduction methods, e.g., UMAP, that have become more popular recently. The latter presents an interesting alternative to linear methods though often at the expense of interpretability. This work presents distance attenuated projection methods of atomic coordinates that facilitate the application of both PCA and UMAP to identify collective variables in solution, and further the specific identity of solvent molecules that participate in chemical reactions. The performance of both methods is examined in detail for two reactions where the explicit solvent plays very different roles within the collective variables. The first reaction consists of the dynamic exchange of a cation about a polyhydroxy anion that is facilitated by waters of solvation, while the second reaction consists of a nucleophilic attack of water upon ethylene to initiate cis/trans isomerization. When applied to raw data, both PCA and UMAP representations are dominated by bulk solvent motions. On the other hand, when applied to data preprocessed by our attenuated projection methods, both PCA and UMAP identify the appropriate collective variables in solution.

2017 ◽  
Vol 18 (1) ◽  
Author(s):  
Jiaoyun Yang ◽  
Haipeng Wang ◽  
Huitong Ding ◽  
Ning An ◽  
Gil Alterovitz

2019 ◽  
Vol 8 (2) ◽  
pp. 4800-4807

Recently, engineers are concentrating on designing an effective prediction model for finding the rate of student admission in order to raise the educational growth of the nation. The method to predict the student admission towards the higher education is a challenging task for any educational organization. There is a high visibility of crisis towards admission in the higher education. The admission rate of the student is the major risk to the educational society in the world. The student admission greatly affects the economic, social, academic, profit and cultural growth of the nation. The student admission rate also depends on the admission procedures and policies of the educational institutions. The chance of student admission also depends on the feedback given by all the stake holders of the educational sectors. The forecasting of the student admission is a major task for any educational institution to protect the profit and wealth of the organization. This paper attempts to analyze the performance of the student admission prediction by using machine learning dimensionality reduction algorithms. The Admission Predict dataset from Kaggle machine learning Repository is used for prediction analysis and the features are reduced by feature reduction methods. The prediction of the chance of Admit is achieved in four ways. Firstly, the correlation between each of the dataset attributes are found and depicted as a histogram. Secondly, the top most high correlated features are identified which are directly contributing to the prediction of chance of admit. Thirdly, the Admission Predict dataset is subjected to dimensionality reduction methods like principal component analysis (PCA), Sparse PCA, Incremental PCA , Kernel PCA and Mini Batch Sparse PCA. Fourth, the optimized dimensionality reduced dataset is then executed to analyze and compare the mean squared error, Mean Absolute Error and R2 Score of each method. The implementation is done by python in Anaconda Spyder Navigator Integrated Development Environment. Experimental Result shows that the CGPA, GRE Score and TOEFL Score are highly correlated features in predicting the chance of admit. The execution of performance analysis shows that Incremental PCA have achieved the effective prediction of chance of admit with minimum MSE of 0.09, MAE of 0.24 and reasonable R2 Score of 0.26.


Author(s):  
Amir Hossein Karimi ◽  
Mohammad Javad Shafiee ◽  
Ali Ghodsi ◽  
Alexander Wong

Dimensionality reduction methods are widely used in informationprocessing systems to better understand the underlying structuresof datasets, and to improve the efficiency of algorithms for bigdata applications. Methods such as linear random projections haveproven to be simple and highly efficient in this regard, however,there is limited theoretical and experimental analysis for nonlinearrandom projections. In this study, we review the theoretical frameworkfor random projections and nonlinear rectified random projections,and introduce ensemble of nonlinear maximum random projections.We empirically evaluate the embedding performance on 3commonly used natural datasets and compare with linear randomprojections and traditional techniques such as PCA, highlightingthe superior generalization performance and stable embedding ofthe proposed method.


2022 ◽  
pp. 17-25
Author(s):  
Nancy Jan Sliper

Experimenters today frequently quantify millions or even billions of characteristics (measurements) each sample to address critical biological issues, in the hopes that machine learning tools would be able to make correct data-driven judgments. An efficient analysis requires a low-dimensional representation that preserves the differentiating features in data whose size and complexity are orders of magnitude apart (e.g., if a certain ailment is present in the person's body). While there are several systems that can handle millions of variables and yet have strong empirical and conceptual guarantees, there are few that can be clearly understood. This research presents an evaluation of supervised dimensionality reduction for large scale data. We provide a methodology for expanding Principal Component Analysis (PCA) by including category moment estimations in low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, includes the class-conditional means. We show that LOLR projections and its extensions enhance representations of data for future classifications while retaining computing flexibility and reliability using both experimental and simulated data benchmark. When it comes to accuracy, LOLR prediction outperforms other modular linear dimension reduction methods that require much longer computation times on conventional computers. LOLR uses more than 150 million attributes in brain image processing datasets, and many genome sequencing datasets have more than half a million attributes.


2019 ◽  
Vol 2019 ◽  
pp. 1-8
Author(s):  
Hui Xu ◽  
Yongguo Yang ◽  
Xin Wang ◽  
Mingming Liu ◽  
Hongxia Xie ◽  
...  

Traditional supervised multiple kernel learning (MKL) for dimensionality reduction is generally an extension of kernel discriminant analysis (KDA), which has some restrictive assumptions. In addition, they generally are based on graph embedding framework. A more general multiple kernel-based dimensionality reduction algorithm, called multiple kernel marginal Fisher analysis (MKL-MFA), is presented for supervised nonlinear dimensionality reduction combined with ratio-race optimization problem. MKL-MFA aims at relaxing the restrictive assumption that the data of each class is of a Gaussian distribution and finding an appropriate convex combination of several base kernels. To improve the efficiency of multiple kernel dimensionality reduction, the spectral regression frameworks are incorporated into the optimization model. Furthermore, the optimal weights of predefined base kernels can be obtained by solving a different convex optimization. Experimental results on benchmark datasets demonstrate that MKL-MFA outperforms the state-of-the-art supervised multiple kernel dimensionality reduction methods.


2020 ◽  
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi ◽  
Oludayo Olugbara

Abstract RNA-Seq data are utilized for biological applications and decision making for the classification of genes. A lot of works in recent time are focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed in the transformation of these data. In this study, a novel optimized hybrid investigative approach is proposed. It combines an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses KNN on the reduced mosquito Anopheles gambiae dataset, to enhance the accuracy and scalability in the gene expression analysis. The proposed algorithm is used to fetch relevant features based on the high-dimensional input feature space. A fast algorithm for feature ranking is used to select relevant features. The performances of the model are evaluated and validated using the classification accuracy to compare existing approaches in the literature. The achieved experimental results prove to be promising for selecting relevant genes and classifying pertinent gene expression data analysis by indicating that the approach is a capable addition to prevailing machine learning methods.


2021 ◽  
Vol 50 (9) ◽  
pp. 2579-2589
Author(s):  
Micheal Olaolu Arowolo ◽  
Marion Olubunmi Adebiyi ◽  
Ayodele Ariyo Adebiyi

RNA-Seq data are utilized for biological applications and decision making for classification of genes. Lots of work in recent time are focused on reducing the dimension of RNA-Seq data. Dimensionality reduction approaches have been proposed in fetching relevant information in a given data. In this study, a novel optimized dimensionality reduction algorithm is proposed, by combining an optimized genetic algorithm with Principal Component Analysis and Independent Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum subset and latent correlated features, respectively. The classifier uses Decision tree on the reduced mosquito anopheles gambiae dataset to enhance the accuracy and scalability in the gene expression analysis. The proposed algorithm is used to fetch relevant features based from the high-dimensional input feature space. A feature ranking and earlier experience are used. The performances of the model are evaluated and validated using the classification accuracy to compare existing approaches in the literature. The achieved experimental results prove to be promising for feature selection and classification in gene expression data analysis and specify that the approach is a capable accumulation to prevailing data mining techniques.


Sign in / Sign up

Export Citation Format

Share Document