Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features

BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Zhixun Zhao ◽  
Xiaocai Zhang ◽  
Fang Chen ◽  
Liang Fang ◽  
Jinyan Li

Abstract. Background: DNA N4-methylcytosine (4mC) is a critical epigenetic modification with various roles in the restriction-modification system. Because experimental detection in the laboratory is costly, computational methods that combine sequence characteristics with machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because they lack effective sequence features and choose learning algorithms in an ad hoc way. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. Results: The feature importance score distributions in datasets of six species are first reported and analyzed. The impact of feature selection on model performance is then evaluated by independent testing on benchmark datasets, where ACC and MCC after feature selection increase by 2.3% to 9.7% and 0.05 to 0.19, respectively. The proposed method is compared with three state-of-the-art predictors using an independent test and 10-fold cross-validation, and it outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. Two detailed case studies confirmed the excellent overall performance of the proposed method, which correctly identified 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene. Conclusions: The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies demonstrate the effectiveness of the method in practical situations.
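For readers unfamiliar with the two reported measures, the following is a minimal sketch of how ACC and MCC are computed from confusion-matrix counts. The function names and the example counts are illustrative assumptions, not taken from the paper. The case studies report only how many true sites were recovered, so only sensitivity can be derived from those figures.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct (ACC)."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

def sensitivity(tp, fn):
    """Fraction of true sites that were recovered."""
    return tp / (tp + fn)

# Hypothetical confusion-matrix counts, for illustration only.
example_acc = accuracy(tp=40, tn=45, fp=5, fn=10)
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)

# Sensitivity implied by the two case studies in the abstract.
c_elegans_recall = sensitivity(tp=24, fn=2)          # 24 of 26 sites
d_melanogaster_recall = sensitivity(tp=126, fn=11)   # 126 of 137 sites
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside ACC when classes may be imbalanced.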

2021 ◽  
pp. 1-15
Author(s):  
Mohammed Ayub ◽  
El-Sayed M. El-Alfy

Web technology has become an indispensable part of human life for almost all activities. At the same time, cyberattacks are on the rise in today's Web-driven world. Effective countermeasures for the analysis and detection of malicious websites are therefore crucial to combat the rising threats to cybersecurity. In this paper, we systematically reviewed the state-of-the-art techniques and identified about 230 features of malicious websites, which are classified as internal and external features. We also developed a toolkit for the analysis and modeling of malicious websites. The toolkit implements several types of feature extraction methods and machine learning algorithms, which can be used to analyze and compare different approaches to detecting malicious URLs. It also incorporates options such as feature selection and imbalanced learning, and it can be extended with more functionality and generalization capabilities. Finally, some use cases are demonstrated on different datasets.
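As a rough illustration of what URL feature extraction can look like, the sketch below derives a handful of lexical features from a URL string using only the Python standard library. The feature names and the choice of features are assumptions made for this example; they are not the toolkit's API and are not drawn from the roughly 230 features the paper catalogues.

```python
import ipaddress
from urllib.parse import urlsplit

def url_lexical_features(url):
    """Return a few simple lexical features of a URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "host_is_ip": host_is_ip,
        "uses_https": parts.scheme == "https",
        "has_at_symbol": "@" in url,
    }

# 192.0.2.10 is in a range reserved for documentation examples.
features = url_lexical_features("http://192.0.2.10/login.php?id=1")
```

A dictionary of named features like this can be turned into a numeric vector and passed to any classifier, which is the general pattern the abstract describes.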


2020 ◽  
Vol 34 (04) ◽  
pp. 3537-3544
Author(s):  
Xu Chen ◽  
Brett Wujek

Automated machine learning (AutoML) strives to establish an appropriate machine learning model for any dataset automatically with minimal human intervention. Although extensive research has been conducted on AutoML, most of it has focused on supervised learning. Research on automated semi-supervised learning and active learning algorithms is still limited. Implementation becomes more challenging when the algorithm is designed for a distributed computing environment. With this as motivation, we propose a novel automated learning system for distributed active learning (AutoDAL) to address these challenges. First, automated graph-based semi-supervised learning is conducted by aggregating the proposed cost functions from different compute nodes in a distributed manner. Subsequently, automated active learning is addressed by jointly optimizing hyperparameters in both the classification and query selection stages, leveraging graph loss minimization and entropy regularization. Moreover, we propose an efficient distributed active learning algorithm that scales to big data by first partitioning the unlabeled data and replicating the labeled data to different worker nodes in the classification stage, and then aggregating the data in the controller in the query selection stage. The proposed AutoDAL algorithm is applied to multiple benchmark datasets and a real-world electrocardiogram (ECG) dataset for classification. We demonstrate that the proposed AutoDAL algorithm achieves significantly better performance than several state-of-the-art AutoML approaches and active learning algorithms.
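To make the idea of a query selection stage concrete, here is a minimal sketch of generic entropy-based uncertainty sampling, a common active learning heuristic. It is not the AutoDAL algorithm: the abstract does not give its selection rule, and the function names and probabilities below are assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(predicted_probs, k):
    """Return indices of the k unlabeled samples with the highest entropy."""
    ranked = sorted(
        range(len(predicted_probs)),
        key=lambda i: entropy(predicted_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical class probabilities for four unlabeled samples.
pool = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
chosen = select_queries(pool, k=2)
```

The samples the classifier is least sure about are sent for labeling first, so each new label is expected to be more informative than a randomly chosen one.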


Forests ◽  
2021 ◽  
Vol 12 (2) ◽  
pp. 216
Author(s):  
Mi Luo ◽  
Yifu Wang ◽  
Yunhong Xie ◽  
Lai Zhou ◽  
Jingjing Qiao ◽  
...  

Increasing numbers of explanatory variables tend to result in information redundancy and “dimensional disaster” in the quantitative remote sensing of forest aboveground biomass (AGB). Feature selection of model factors is an effective method for improving the accuracy of AGB estimates. Machine learning algorithms are also widely used in AGB estimation, although little research has addressed the use of the categorical boosting algorithm (CatBoost) for AGB estimation. Both feature selection and regression for AGB estimation models are typically performed with the same machine learning algorithm, but there is no evidence to suggest that this is the best method. Therefore, the present study focuses on evaluating the performance of the CatBoost algorithm for AGB estimation and comparing the performance of different combinations of feature selection methods and machine learning algorithms. AGB estimation models of four forest types were developed based on Landsat OLI data using three feature selection methods (recursive feature elimination (RFE), variable selection using random forests (VSURF), and least absolute shrinkage and selection operator (LASSO)) and three machine learning algorithms (random forest regression (RFR), extreme gradient boosting (XGBoost), and categorical boosting (CatBoost)). Feature selection had a significant influence on AGB estimation. RFE preserved the most informative features for AGB estimation and was superior to VSURF and LASSO. In addition, CatBoost improved the accuracy of the AGB estimation models compared with RFR and XGBoost. AGB estimation models using RFE for feature selection and CatBoost as the regression algorithm achieved the highest accuracy, with root mean square errors (RMSEs) of 26.54 Mg/ha for coniferous forest, 24.67 Mg/ha for broad-leaved forest, 22.62 Mg/ha for mixed forests, and 25.77 Mg/ha for all forests. 
The combination of RFE and CatBoost performed better than the VSURF–RFR combination, in which random forests were used for both feature selection and regression, indicating that feature selection and regression performed by a single machine learning algorithm may not always ensure optimal AGB estimation. Extending the application of new machine learning algorithms and feature selection methods is a promising way to improve the accuracy of AGB estimates.
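The accuracy figures above are root mean square errors. As a reminder of how that measure is computed, the sketch below evaluates RMSE for a set of observed and predicted AGB values. The plot values are invented for illustration and are not data from the study.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical AGB values in Mg/ha for three plots.
observed_agb = [100.0, 150.0, 200.0]
predicted_agb = [110.0, 140.0, 200.0]
example_rmse = rmse(observed_agb, predicted_agb)
```

Because RMSE carries the unit of the target variable, the reported values can be read directly as typical estimation errors in Mg/ha.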


2020 ◽  
pp. 1-11
Author(s):  
Jie Liu ◽  
Lin Lin ◽  
Xiufang Liang

Online English teaching systems place certain requirements on intelligent scoring, and the most difficult stage of intelligent scoring in an English test is scoring the English composition with an intelligent model. To improve the intelligence of English composition scoring, this study combines machine learning algorithms with intelligent image recognition technology, and proposes an improved MSER-based character candidate region extraction algorithm and a convolutional neural network-based pseudo-character region filtering algorithm. In addition, to verify whether the proposed algorithm model meets the requirements of the group text, that is, to verify the feasibility of the algorithm, the performance of the model is analyzed through designed experiments. Moreover, the basic conditions for composition scoring are input into the model as a constraint model. The results show that the proposed algorithm has a certain practical effect and can be applied to English assessment systems and to online homework evaluation systems.


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 656
Author(s):  
Xavier Larriva-Novo ◽  
Víctor A. Villagrá ◽  
Mario Vega-Barbas ◽  
Diego Rivera ◽  
Mario Sanz Rodrigo

Security in IoT networks is currently mandatory because of the large amount of data that has to be handled. These systems are vulnerable to several cybersecurity attacks, which are increasing in number and sophistication. For this reason, new intrusion detection techniques that are as accurate as possible for these scenarios have to be developed. Intrusion detection systems based on machine learning algorithms have already shown high performance in terms of accuracy. This research proposes the study and evaluation of several preprocessing techniques based on traffic categorization for a machine learning neural network algorithm. For its evaluation, this research uses two benchmark datasets, UGR16 and UNSW-NB15, and one of the most used datasets, KDD99. The preprocessing techniques were evaluated in accordance with scalar and normalization functions. All of these preprocessing models were applied through different sets of characteristics based on a categorization composed of four groups of features: basic connection features, content characteristics, statistical characteristics, and a group composed of traffic-based features and connection direction-based traffic characteristics. The objective of this research is to evaluate this categorization by using various data preprocessing techniques to obtain the most accurate model. Our proposal shows that, by applying the categorization of network traffic and several preprocessing techniques, the accuracy can be enhanced by up to 45%. Preprocessing a specific group of characteristics allows for greater accuracy, enabling the machine learning algorithm to correctly classify these parameters related to possible attacks.
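The abstract does not say which scaling and normalization functions were evaluated. As an assumption, the sketch below shows two widely used ones, min-max scaling and z-score standardization, applied to a single numeric traffic feature. The feature name and values are hypothetical.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical connection durations in seconds.
durations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = min_max_scale(durations)
standardized = z_score(durations)
```

Neural networks are sensitive to feature scale, so applying a transformation like these per feature group is one plausible reading of how preprocessing choices could change accuracy.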


2021 ◽  
Author(s):  
Yingxian Liu ◽  
Cunliang Chen ◽  
Hanqing Zhao ◽  
Yu Wang ◽  
Xiaodong Han

Abstract. Fluid properties are key factors in single-well productivity prediction, well test interpretation, and oilfield recovery prediction, and they directly affect the success of ODP program design. The most accurate and direct way to acquire them is underground sampling. However, not every well has samples, for technical reasons such as excessive well deviation or because of high cost during the exploration stage. Analogies or empirical formulas therefore have to be adopted in many cases, but a large number of oilfield developments have shown that the errors caused by these methods are very large. Obtaining fluid physical properties quickly and accurately is therefore of great significance. In recent years, with the development and improvement of artificial intelligence and machine learning algorithms, their applications in oilfields have become more and more extensive. This paper proposes a method for predicting crude oil physical properties based on machine learning algorithms. The method uses PVT data from nearly 100 wells in the Bohai Oilfield: 75% of the data is used for training to obtain the prediction model, and the remaining 25% is used for testing. Practice shows that the prediction results of the machine learning algorithm are very close to the actual data, with a very small error. Finally, the method was applied to the preliminary plan design of the BZ29 oilfield, a new oilfield, in particular to predict fluid physical properties for the unsampled sand bodies. The study also compares the influence of the analogy method on the scheme, which provides potential and risk analysis for scheme design. The method will be applied to more oil fields in the Bohai Sea in the future and has important promotion value.
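The 75%/25% division described above can be sketched as a shuffled hold-out split. The abstract does not say how the wells were assigned to each part, so the random shuffle, the seed, and the well identifiers below are assumptions for illustration.

```python
import random

def train_test_split(records, train_fraction=0.75, seed=0):
    """Shuffle records reproducibly and split them into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical identifiers standing in for the PVT records of 100 wells.
wells = [f"well_{i:03d}" for i in range(100)]
train, test = train_test_split(wells)
```

Keeping the test wells out of training is what lets the reported error be read as an estimate of performance on unsampled wells.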

