Virtual Screening for COX-2 Inhibitors with Random Forest Algorithm and Feature Selection

Author(s):  
Shangjie Ai ◽  
Yong Bai ◽  
Xiande Liu
Author(s):  
Indu Yekkala ◽  
Sunanda Dixit

Data is generated by the medical industry. Often this data is of very complex nature—electronic records, handwritten scripts, etc.—since it is generated from multiple sources. Due to the Complexity and sheer volume of this data necessitates techniques that can extract insight from this data in a quick and efficient way. These insights not only diagnose the diseases but also predict and can prevent disease. One such use of these techniques is cardiovascular diseases. Heart disease or coronary artery disease (CAD) is one of the major causes of death all over the world. Comprehensive research using single data mining techniques have not resulted in an acceptable accuracy. Further research is being carried out on the effectiveness of hybridizing more than one technique for increasing accuracy in the diagnosis of heart disease. In this article, the authors worked on heart stalog dataset collected from the UCI repository, used the Random Forest algorithm and Feature Selection using rough sets to accurately predict the occurrence of heart disease


Author(s):  
Mohammad Almseidin ◽  
AlMaha Abu Zuraiq ◽  
Mouhammd Al-kasassbeh ◽  
Nidal Alnidami

With increasing technology developments, the Internet has become everywhere and accessible by everyone. There are a considerable number of web-pages with different benefits. Despite this enormous number, not all of these sites are legitimate. There are so-called phishing sites that deceive users into serving their interests. This paper dealt with this problem using machine learning algorithms in addition to employing a novel dataset that related to phishing detection, which contains 5000 legitimate web-pages and 5000 phishing ones. In order to obtain the best results, various machine learning algorithms were tested. Then J48, Random forest, and Multilayer perceptron were chosen. Different feature selection tools were employed to the dataset in order to improve the efficiency of the models. The best result of the experiment achieved by utilizing 20 features out of 48 features and applying it to Random forest algorithm. The accuracy was 98.11%.


2020 ◽  
Vol 23 (4) ◽  
pp. 304-312
Author(s):  
ShaoPeng Wang ◽  
JiaRui Li ◽  
Xijun Sun ◽  
Yu-Hang Zhang ◽  
Tao Huang ◽  
...  

Background: As a newly uncovered post-translational modification on the ε-amino group of lysine residue, protein malonylation was found to be involved in metabolic pathways and certain diseases. Apart from experimental approaches, several computational methods based on machine learning algorithms were recently proposed to predict malonylation sites. However, previous methods failed to address imbalanced data sizes between positive and negative samples. Objective: In this study, we identified the significant features of malonylation sites in a novel computational method which applied machine learning algorithms and balanced data sizes by applying synthetic minority over-sampling technique. Method: Four types of features, namely, amino acid (AA) composition, position-specific scoring matrix (PSSM), AA factor, and disorder were used to encode residues in protein segments. Then, a two-step feature selection procedure including maximum relevance minimum redundancy and incremental feature selection, together with random forest algorithm, was performed on the constructed hybrid feature vector. Results: An optimal classifier was built from the optimal feature subset, which featured an F1-measure of 0.356. Feature analysis was performed on several selected important features. Conclusion: Results showed that certain types of PSSM and disorder features may be closely associated with malonylation of lysine residues. Our study contributes to the development of computational approaches for predicting malonyllysine and provides insights into molecular mechanism of malonylation.


Sign in / Sign up

Export Citation Format

Share Document