Computational Method for Identifying Malonylation Sites by Using Random Forest Algorithm

2020 ◽  
Vol 23 (4) ◽  
pp. 304-312
Author(s):  
ShaoPeng Wang ◽  
JiaRui Li ◽  
Xijun Sun ◽  
Yu-Hang Zhang ◽  
Tao Huang ◽  
...  

Background: As a newly uncovered post-translational modification on the ε-amino group of lysine residues, protein malonylation has been found to be involved in metabolic pathways and certain diseases. Apart from experimental approaches, several computational methods based on machine learning algorithms were recently proposed to predict malonylation sites. However, previous methods failed to address the imbalance in data size between positive and negative samples. Objective: In this study, we identified the significant features of malonylation sites with a novel computational method that applies machine learning algorithms and balances data sizes using the synthetic minority over-sampling technique (SMOTE). Method: Four types of features, namely, amino acid (AA) composition, position-specific scoring matrix (PSSM), AA factor, and disorder, were used to encode residues in protein segments. Then, a two-step feature selection procedure comprising maximum relevance minimum redundancy and incremental feature selection, together with the random forest algorithm, was applied to the constructed hybrid feature vector. Results: An optimal classifier was built from the optimal feature subset and achieved an F1-measure of 0.356. Feature analysis was performed on several of the selected important features. Conclusion: The results showed that certain types of PSSM and disorder features may be closely associated with malonylation of lysine residues. Our study contributes to the development of computational approaches for predicting malonyllysine and provides insights into the molecular mechanism of malonylation.
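As an illustration of the balancing step, the following is a minimal sketch of the general idea behind the synthetic minority over-sampling technique: new minority samples are interpolated between an existing minority sample and one of its nearest minority neighbours. The function name `smote`, the neighbour count `k`, and the toy data are assumptions made for illustration; the abstract does not state the settings or the implementation the authors used.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors (corners of the unit square).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.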

Author(s):  
Mohammad Almseidin ◽  
AlMaha Abu Zuraiq ◽  
Mouhammd Al-kasassbeh ◽  
Nidal Alnidami

With ongoing technological development, the Internet has become ubiquitous and accessible to everyone. There are a considerable number of web pages offering different benefits, but not all of these sites are legitimate. So-called phishing sites deceive users in order to serve their own interests. This paper addresses the problem using machine learning algorithms and a novel phishing-detection dataset containing 5000 legitimate web pages and 5000 phishing ones. To obtain the best results, various machine learning algorithms were tested, and J48, random forest, and multilayer perceptron were then chosen. Different feature selection tools were applied to the dataset to improve the efficiency of the models. The best result was achieved by using 20 of the 48 features with the random forest algorithm, giving an accuracy of 98.11%.


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are ineffective when dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
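For readers unfamiliar with the area under the curve metric reported above, the sketch below computes the ROC AUC from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the labels and scores are hypothetical; they are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 1 = malicious, 0 = benign.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 1.0 means every malicious sample outscores every benign one, which is why values such as 0.9998 indicate near-perfect ranking.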


2019 ◽  
Vol 20 (S2) ◽  
Author(s):  
Varun Khanna ◽  
Lei Li ◽  
Johnson Fung ◽  
Shoba Ranganathan ◽  
Nikolai Petrovsky

Abstract Background: Toll-like receptor 9 is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Due to the considerable number of rotatable bonds in ODNs, high-throughput in silico screening for potential TLR9 activity via traditional structure-based virtual screening approaches of CpG ODNs is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including count and position of motifs, the distance between the motifs and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling. Results: Using in-house experimental TLR9 activity data we found that the random forest algorithm outperformed other algorithms for our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with the maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed common sequence motifs including ‘CC’, ‘GG’, ‘AG’, ‘CCCG’ and ‘CGGC’ were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked and the top 100 ODNs were synthesized and experimentally tested for activity in a mTLR9 reporter cell assay, with 91 of the 100 selected ODNs showing high activity, confirming the accuracy of the model in predicting mTLR9 activity.
Conclusion: We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
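The two evaluation metrics quoted in this abstract are both designed for imbalanced data. The sketch below shows how the Matthews correlation coefficient and balanced accuracy follow from the counts of a binary confusion matrix. The function names and the toy labels are assumptions for illustration only and do not reproduce the study's data.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Hypothetical imbalanced test set: 3 active, 7 inactive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
counts = confusion(y_true, y_pred)
mcc_value = mcc(*counts)
bal_acc = balanced_accuracy(*counts)
```

Plain accuracy on this toy set would be 80%, while balanced accuracy weights the minority class equally and so reflects the missed active sample more strongly.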


2019 ◽  
Vol 11 (24) ◽  
pp. 2925 ◽  
Author(s):  
Lucas Prado Osco ◽  
Ana Paula Marques Ramos ◽  
Danilo Roberto Pereira ◽  
Érika Akemi Saito Moriya ◽  
Nilton Nobuhiro Imai ◽  
...  

The traditional method of measuring nitrogen content in plants is a time-consuming and labor-intensive task. Spectral vegetation indices extracted from unmanned aerial vehicle (UAV) images and machine learning algorithms have been proved effective in assisting nutritional analysis in plants. Still, this analysis has not considered the combination of spectral indices and machine learning algorithms to predict nitrogen in tree-canopy structures. This paper proposes a new framework to infer the nitrogen content of citrus trees at the canopy level using spectral vegetation indices processed with the random forest algorithm. A total of 33 spectral indices were estimated from multispectral images acquired with a UAV-based sensor. Leaf samples were gathered from different planting fields, and the leaf nitrogen content (LNC) was measured in the laboratory and later converted into the canopy nitrogen content (CNC). To evaluate the robustness of the proposed framework, we compared it with other machine learning algorithms. We used 33,600 citrus trees to evaluate the performance of the machine learning models. The random forest algorithm had higher performance in predicting CNC than all other models tested, reaching an R² of 0.90, MAE of 0.341 g·kg−1 and MSE of 0.307 g·kg−1. We demonstrated that our approach can reduce the need for chemical analysis of the leaf tissue and optimize citrus orchard CNC monitoring.
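The three regression metrics reported above can be computed from paired measured and predicted values as in the sketch below. The function name `regression_metrics` and the sample values are hypothetical and are not taken from the study.

```python
def regression_metrics(y_true, y_pred):
    """Return (R^2, MAE, MSE) for paired observations and predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, mse


# Hypothetical canopy nitrogen values in g/kg: measured vs. predicted.
y_true = [20.0, 22.0, 24.0, 26.0]
y_pred = [21.0, 22.0, 23.0, 26.0]
r2, mae, mse = regression_metrics(y_true, y_pred)
```

MAE is expressed in the units of the target variable, while MSE is expressed in those units squared, so the two are not directly comparable in magnitude.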


2019 ◽  
Vol 13 ◽  
Author(s):  
Nandhini Abirami R. ◽  
Durai Raj Vincent

Background: Diagnosing diseases is an intricate job in the medical field. Machine learning, when applied to health care, is capable of early detection of disease, which helps to provide early medical intervention. In heart disease prediction, machine learning techniques have played a significant role. Analysis of disease has become vital in the health care sector. The massive data collected by the health care sector are preprocessed and analyzed to discover the underlying information for effective decision making and proper medical intervention. The success of machine learning in the medical industry lies in its capability to analyze the huge amount of data gathered by the health sector and its effectiveness in decision making. Since the medical field involves many manual processes, it has become necessary to automate these procedures. Remarkable advancements in electronic medical records have made this possible. Objective: The objective of this research is to design a robust machine learning algorithm to predict heart disease. The prediction is performed using an ensemble of machine learning algorithms in order to boost the accuracy achieved by individual algorithms. Method: A Heart Disease Prediction System is developed in which the user can input patient details and a prediction for that patient is made using the developed model. The model predicts the output to be either normal or risky. Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), Support Vector Machines (SVM), K-Nearest Neighbors (KNN) and Naïve Bayes classifier are used as base learners. These algorithms are combined using random forest as the meta classifier. Results: The predictions of the base classifiers are combined using the random forest algorithm. The accuracy rose from 85.53% to 87.64%, an improvement of 2.11 percentage points.
Conclusion: Various techniques were adopted to preprocess the data to suit the requirements of the analysis. Feature selection was performed to optimize the performance of the machine learning algorithms. Ensemble prediction gave better accuracy when the random forest algorithm was used as the combiner. Better feature selection techniques can be applied to further improve the accuracy.


2020 ◽  
Vol 11 (4) ◽  
pp. 1-22
Author(s):  
Adriaan Jacobus Prins ◽  
Adriaan van Niekerk

This study evaluates the use of LiDAR data and machine learning algorithms for mapping vineyards. Vineyards are planted in rows spaced at various distances, which can cause spectral mixing within individual pixels and complicate image classification. Four resolutions were used for generating normalized digital surface model and intensity derivatives from the LiDAR data. In addition, texture measures with window sizes of 3x3 and 5x5 were generated from the LiDAR derivatives. The different combinations of the resolutions and window sizes resulted in eight data sets that were used as input to 11 machine learning algorithms. A larger window size was found to improve the overall accuracy for all the classifier–resolution combinations. The results showed that random forest with texture measures generated at a 5x5 window size outperformed the other experiments, regardless of the resolution used. The authors conclude that the random forest algorithm used on LiDAR derivatives with a resolution of 1.5 m and a window size of 5x5 is the recommended configuration for vineyard mapping using LiDAR data.
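To make the role of the window size concrete, the sketch below computes a simple moving-window texture measure over a small raster. Local variance is used here only as a stand-in, since the abstract does not name the texture measures that were generated; the function name `window_variance`, the edge handling, and the toy grid are likewise assumptions.

```python
def window_variance(grid, size):
    """Local variance in a size x size window centred on each cell.

    Cells near the border use only the part of the window inside the grid.
    """
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [
                grid[i][j]
                for i in range(max(0, r - half), min(rows, r + half + 1))
                for j in range(max(0, c - half), min(cols, c + half + 1))
            ]
            mean = sum(vals) / len(vals)
            out[r][c] = sum((v - mean) ** 2 for v in vals) / len(vals)
    return out


# Hypothetical 5x5 height raster with alternating columns standing in for vine rows.
grid = [[1.0 if c % 2 == 0 else 0.0 for c in range(5)] for _ in range(5)]
tex3 = window_variance(grid, 3)
tex5 = window_variance(grid, 5)
```

A larger window summarises more of the row pattern around each cell, which is one plausible reason a 5x5 window could separate row-structured vineyards from other cover better than a 3x3 window.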


Agronomy ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 145
Author(s):  
Zeinab Akhavan ◽  
Mahdi Hasanlou ◽  
Mehdi Hosseini ◽  
Heather McNairn

Polarimetric decomposition extracts scattering features that are indicative of the physical characteristics of the target. In this study, three polarimetric decomposition methods were tested for soil moisture estimation over agricultural fields using machine learning algorithms. Features extracted from the model-based Freeman–Durden, the eigenvalue- and eigenvector-based H/A/α, and the Van Zyl decompositions were used as inputs to random forest and neural network regression algorithms. These algorithms were applied to retrieve soil moisture over soybean, wheat, and corn fields. A time series of polarimetric Uninhabited Aerial Vehicle Synthetic Aperture Radar (UAVSAR) data acquired during the Soil Moisture Active Passive Experiment 2012 (SMAPVEX12) field campaign was used for the training and validation of the algorithms. Three feature selection methods were tested to determine the best input features for the machine learning algorithms. The most accurate soil moisture estimates were derived from the random forest regression algorithm for soybeans, with a coefficient of determination (R²) of 0.86, root mean square error (RMSE) of 0.041 m³ m⁻³ and mean absolute error (MAE) of 0.030 m³ m⁻³. Feature selection also impacted results. Some features, such as anisotropy, Horizontal transmit and Horizontal receive (HH), and surface roughness parameters (correlation length and RMS-H), improved the performance of all algorithms, as these parameters directly affect the backscattered signal.


2019 ◽  
Vol 7 (3) ◽  
pp. SE69-SE79 ◽  
Author(s):  
Dong Li ◽  
Suping Peng ◽  
Yongxu Lu ◽  
Yinling Guo ◽  
Xiaoqin Cui

Interpretation of geologic structures entails ambiguity and uncertainties. It usually requires interpreter judgment and is time consuming. Deep exploitation of resources challenges the accuracy and efficiency of geologic structure interpretation. The application of machine-learning algorithms to seismic interpretation can effectively solve these problems. We analyzed the theory and applicability of five machine-learning algorithms. Seismic forward modeling is a key connection between the model and seismic response, and it can obtain seismic data of known geologic structures. Based on the modeling data, we first optimized the seismic attributes sensitive to the target geologic structure and then we verified the accuracy of the five machine-learning algorithms by the cross-checking method. In this case, the random forest algorithm had the highest accuracy. So we examined the structural interpretation method based on a random forest using the 3D seismic reflection data from coalfield exploration. The prediction effect of this interpretation workflow is verified by comparison with known geologic structures on the plane and profile. The results suggest that the random forest algorithm is feasible to indicate geologic structure interpretations in the case of collapsed column and fault structures and it can effectively improve the efficiency of seismic interpretation and its accuracy. The machine-learning-based workflow provides a new technique for seismic structure interpretation in coal mining.


2021 ◽  
Vol 25 (4) ◽  
pp. 973-991
Author(s):  
Yanben Wang ◽  
Jurong Bai

In the microblog network, users’ forwarding behavior is widespread and the propagation range is difficult to predict quantitatively. To solve this problem, machine learning algorithms are used to quantitatively predict the propagation breadth and depth of microblog users’ forwarding behavior. The dataset is preprocessed, and the extracted features are divided into three types: user features, microblog features and social features. Then the dataset is analyzed in detail; machine learning algorithms are used to predict the propagation breadth and depth of users’ forwarding behavior; and the influence of the three types of features on prediction precision is studied. The experimental results show that the prediction precision of the improved random forest algorithm fluctuates less and is not sensitive to changes in the various features. The improved random forest algorithm has higher precision and better generalization ability than the other algorithms, which shows that the prediction results have high reference value. Social features have the greatest influence on the prediction precision for each prediction algorithm. User features have a similar influence to microblog features on the prediction precision.

