Estimation of relative effectiveness of phylogenetic programs by machine learning

2014 ◽  
Vol 12 (02) ◽  
pp. 1441004
Author(s):  
Mikhail Krivozubov ◽  
Florian Goebels ◽  
Sergei Spirin

Reconstruction of the phylogeny of a protein family from a sequence alignment can produce results of different quality. Our goal is to predict the quality of phylogeny reconstruction based on features that can be extracted from the input alignment. We used the Fitch–Margoliash (FM) method of phylogeny reconstruction and a random forest as the predictor. For training and testing the predictor, alignments of orthologous series (OS) were used, for which the result of phylogeny reconstruction can be evaluated by comparison with the trees of the corresponding organisms. Our results show that the quality of phylogeny reconstruction can be predicted with more than 80% precision. We also tried to predict which phylogeny reconstruction method, FM or UPGMA, is better for a particular alignment. With the feature set used, among the alignments for which the obtained predictor predicts better performance of UPGMA, 56% indeed give a better result with UPGMA. Taking into account that UPGMA performs better for only 34% of alignments in our testing set, this result shows that it is in principle possible to predict the better phylogeny reconstruction method from features of a sequence alignment.
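
A minimal sketch of the kind of predictor described above: a random forest classifier trained on alignment-level features to predict whether reconstruction quality will be acceptable. The feature set (number of sequences, alignment length, mean pairwise identity) and the labels are illustrative assumptions, not the study's actual data.

```python
# Sketch: predicting phylogeny-reconstruction quality from alignment features.
# Feature names and labels are illustrative placeholders, not the study's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(4, 200, n),        # number of sequences in the alignment
    rng.integers(100, 2000, n),     # alignment length
    rng.uniform(0.2, 0.95, n),      # mean pairwise identity
])
y = (X[:, 2] > 0.5).astype(int)     # placeholder label: 1 = "good" reconstruction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("precision:", precision_score(y_te, clf.predict(X_te)))
```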

Author(s):  
A.V. Kozina ◽  
Yu.S. Belov

Automatically assessing the quality of machine translation is an important yet challenging task for machine translation research. Translation quality estimation is understood as predicting the quality of a translation without a reference translation. Translation quality depends on the specific machine translation system and often requires post-editing. Manual editing is a long and expensive process. Since the need to quickly determine translation quality is growing, its automation is required. In this paper, we propose a quality assessment method based on ensemble supervised machine learning methods. The bilingual corpus WMT 2019 for the English–Russian language pair was used as data. The text data volume is 17,089 sentences; 85% of the data was used for training and 15% for testing the model. Linguistic features extracted from the text in the source and target languages were used as features for training the system, since it is these characteristics that can most accurately characterize the translation in terms of quality. The following tools were used for feature extraction: a free language modeling tool based on SRILM and the Stanford POS Tagger part-of-speech tagger. Before training the system, the text was preprocessed. The model was trained using three regression methods: Bagging, Extra Trees, and Random Forest. The algorithms were implemented in the Python programming language using the scikit-learn library. The parameters of the random forest method were optimized using a grid search. The performance of the model was assessed by the mean absolute error (MAE) and the root mean square error (RMSE), as well as by the Pearson coefficient, which measures the correlation with human judgment. Testing was carried out using three machine translation systems: the Google and Bing neural systems and the Moses statistical machine translation system, both phrase-based and syntax-based. Based on the results of the work, the Extra Trees method performed best. In addition, for all categories of indicators under consideration, the best results are achieved using the Google machine translation system. The developed method showed good results close to human judgment. The system can be used for further research in the task of assessing the quality of translation.
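
A minimal sketch of the ensemble-regression setup described above: Bagging, Extra Trees, and a grid-searched Random Forest scored with MAE, RMSE, and Pearson correlation. The feature matrix, quality scores, and the hyperparameter grid are assumptions for illustration; the real linguistic features would come from SRILM and the Stanford POS Tagger.

```python
# Sketch: ensemble regression for translation-quality estimation.
# X (linguistic features per sentence pair) and y (quality scores) are
# synthetic placeholders; the hyperparameter grid is illustrative.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 17))                          # placeholder linguistic features
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=1000)     # placeholder quality scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "Bagging": BaggingRegressor(random_state=0),
    "ExtraTrees": ExtraTreesRegressor(random_state=0),
    "RandomForest": GridSearchCV(                        # grid search over RF parameters
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=3,
    ),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    r, _ = pearsonr(y_te, pred)
    print(f"{name}: MAE={mae:.3f} RMSE={rmse:.3f} Pearson={r:.3f}")
```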


Water ◽  
2021 ◽  
Vol 13 (18) ◽  
pp. 2457
Author(s):  
Manel Naloufi ◽  
Françoise S. Lucas ◽  
Sami Souihi ◽  
Pierre Servais ◽  
Aurélie Janne ◽  
...  

Exposure to contaminated water during aquatic recreational activities can lead to gastrointestinal diseases. In order to decrease the exposure risk, the fecal indicator bacterium Escherichia coli is routinely monitored, which is time-consuming, labor-intensive, and costly. To assist stakeholders in the daily management of bathing sites, models have been developed to predict the microbiological quality. However, model performance is highly dependent on the quality of the input data, which are usually scarce. In our study, we propose a conceptual framework for optimizing the selection of the most suitable model and for enriching the training dataset. This framework was successfully applied to the prediction of Escherichia coli concentrations in the Marne River (Paris Area, France). We compared the performance of six machine learning (ML)-based models: K-nearest neighbors, Decision Tree, Support Vector Machines, Bagging, Random Forest, and Adaptive boosting. Based on several statistical metrics, the Random Forest model presented the best accuracy compared to the other models. However, 53.2 ± 3.5% of the predicted E. coli densities were inaccurately estimated according to the mean absolute percentage error (MAPE). Four parameters (temperature, conductivity, 24-h cumulative rainfall of the day preceding sampling, and the river flow) were identified as key variables to be monitored for optimization of the ML model. The set of values to be optimized will feed an alert system for monitoring the microbiological quality of the water through a combined strategy of in situ manual sampling and the deployment of a network of sensors. Based on these results, we propose a guideline for ML model selection and sampling optimization.
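
A minimal sketch of the model-comparison step described above: the six regressor families scored with MAPE under cross-validation. The four input variables and the target are synthetic placeholders; the scorer name requires scikit-learn 0.24 or later.

```python
# Sketch: comparing several regressors for E. coli concentration prediction
# and scoring them with MAPE. Inputs (temperature, conductivity, 24-h rainfall,
# river flow) and the target densities are synthetic placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                    # temperature, conductivity, rainfall, flow
y = np.exp(X[:, 0]) + rng.gamma(2.0, size=400)   # placeholder E. coli densities (positive)

models = {
    "kNN": KNeighborsRegressor(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "SVM": SVR(),
    "Bagging": BaggingRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}
for name, model in models.items():
    # scikit-learn returns negative MAPE from this scorer; flip the sign to report MAPE
    mape = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_absolute_percentage_error").mean()
    print(f"{name}: MAPE={mape:.2%}")
```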


2022 ◽  
Vol 355 ◽  
pp. 03008
Author(s):  
Yang Zhang ◽  
Lei Zhang ◽  
Yabin Ma ◽  
Jinsen Guan ◽  
Zhaoxia Liu ◽  
...  

In this study, an electronic nose model composed of seven kinds of metal oxide semiconductor sensors was developed to distinguish the milk source (the dairy farm to which the milk belongs), estimate the milk fat and protein content, identify the authenticity of milk and evaluate its quality. The developed electronic nose is a low-cost, non-destructive testing device. (1) For the identification of milk sources, this paper combines the electronic nose odor characteristics of milk with its component characteristics to distinguish different milk sources, uses Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for dimensionality reduction, and finally uses three machine learning algorithms, Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF), to build milk source (dairy farm) identification models and to evaluate and compare their classification performance. The experimental results show that the SVM-LDA model based on the electronic nose odor characteristics outperforms the other single-feature models, reaching a test set accuracy of 91.5%. The RF-LDA and SVM-LDA models based on the fusion of the two feature types perform best, with a test set accuracy as high as 96%. (2) Three algorithms, Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost) and Random Forest (RF), were used to build models that estimate milk fat and protein content from the electronic nose odor data. The results show that the RF model has the best estimation performance (R2 = 0.9399 for milk fat; R2 = 0.9301 for milk protein). This demonstrates that the method proposed in this study can improve the estimation accuracy of milk fat and protein, which provides a technical basis for predicting the quality of dairy products.
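
A minimal sketch of the "SVM-LDA" configuration mentioned above: LDA dimensionality reduction followed by an SVM classifier in one pipeline. The sensor readings, number of farms, and labels are simulated assumptions, not the study's data.

```python
# Sketch: LDA dimensionality reduction followed by an SVM classifier ("SVM-LDA").
# Sensor readings and farm labels are simulated placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_farms, per_farm, n_sensors = 4, 100, 7
X = np.vstack([rng.normal(loc=i, size=(per_farm, n_sensors)) for i in range(n_farms)])
y = np.repeat(np.arange(n_farms), per_farm)      # milk source (farm) labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
svm_lda = make_pipeline(LinearDiscriminantAnalysis(n_components=3), SVC())
svm_lda.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, svm_lda.predict(X_te)))
```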


ADMET & DMPK ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 29-77 ◽  
Author(s):  
Alex Avdeef

The accurate prediction of the solubility of drugs is still problematic. It was thought for a long time that the shortfalls were due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of those suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility have at best hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky's general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of the data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what is missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors which can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
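
For reference, Yalkowsky's general solubility equation (one of the three methods compared above) is simple enough to write out directly. The sketch below is a plain function of melting point and log P; the example values are illustrative, not drawn from the database described in the study.

```python
# Sketch: Yalkowsky's general solubility equation (GSE).
# Inputs: melting point (deg C) and log P (octanol-water partition coefficient).
# Output: log10 of the intrinsic aqueous solubility S0 (mol/L).
def gse_log_s0(melting_point_c: float, log_p: float) -> float:
    """General solubility equation: log S0 = 0.5 - 0.01*(MP - 25) - log P."""
    return 0.5 - 0.01 * (melting_point_c - 25.0) - log_p

# Example: a hypothetical compound melting at 150 deg C with log P = 2.5
print(gse_log_s0(150.0, 2.5))   # -> -3.25
```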


2021 ◽  
Vol 4 (2(112)) ◽  
pp. 58-72
Author(s):  
Chingiz Kenshimov ◽  
Zholdas Buribayev ◽  
Yedilkhan Amirgaliyev ◽  
Aisulyu Ataniyazova ◽  
Askhat Aitimov

In the course of our research work, the American, Russian and Turkish sign languages were analyzed. A program for recognizing the Kazakh dactylic (fingerspelling) sign language using machine learning methods was implemented. A dataset of 5000 images was formed for each gesture, and gesture recognition algorithms such as Random Forest, Support Vector Machine and Extreme Gradient Boosting were applied, while two data types were combined into one database, which required a change in the architecture of the system as a whole. The quality of the algorithms was also evaluated. The research was carried out because scientific work on developing a recognition system for the Kazakh dactyl sign language is currently insufficient for a complete representation of the language. The Kazakh language contains specific letters, and because of these spelling peculiarities, problems arise when developing recognition systems for the Kazakh sign language. The results showed that the Support Vector Machine and Extreme Gradient Boosting algorithms are superior in real-time performance, while the Random Forest algorithm has higher recognition accuracy. The classification accuracies were 98.86% for Random Forest, 98.68% for Support Vector Machine and 98.54% for Extreme Gradient Boosting. The evaluation of the quality of the classical algorithms also shows high scores. The practical significance of this work lies in the fact that scientific research on gesture recognition with the updated alphabet of the Kazakh language has not yet been conducted, and the results of this work can be used by other researchers for further research on the recognition of the Kazakh dactyl sign language, as well as by researchers engaged in the development of the international sign language.
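
A minimal sketch of the algorithm comparison described above, assuming gesture feature vectors (for example, hand-landmark coordinates) have already been extracted from the images. The features and labels are random placeholders, and scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free.

```python
# Sketch: comparing Random Forest, SVM, and gradient-boosting classifiers on
# gesture feature vectors. Data are synthetic placeholders; GradientBoosting
# substitutes for XGBoost here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_classes, per_class, n_features = 10, 200, 42     # e.g., 21 landmarks x (x, y)
X = np.vstack([rng.normal(loc=i, scale=2.0, size=(per_class, n_features))
               for i in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC()),
                  ("GradientBoosting", GradientBoostingClassifier(random_state=0))]:
    acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: accuracy={acc:.4f}")
```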


Electronics ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 577 ◽  
Author(s):  
Anahita Golrang ◽  
Alale Mohammadi Golrang ◽  
Sule Yildirim Yayilgan ◽  
Ogerta Elezaj

Machine-learning techniques have gained popularity in intrusion-detection systems in recent years. Moreover, the quality of datasets plays a crucial role in the development of a proper machine-learning approach. Therefore, an appropriate feature-selection method can be considered an influential factor in improving the quality of datasets, which leads to high-performance intrusion-detection systems. In this paper, a hybrid multi-objective approach is proposed to detect attacks in a network efficiently. Initially, a multi-objective genetic method (NSGA-II), as well as an artificial neural network (ANN), are run simultaneously to extract feature subsets. We modified the NSGA-II approach to maintain diversity control in this evolutionary algorithm. Next, a Random Forest approach, as an ensemble method, is used to evaluate the efficiency of the feature subsets. Results of the experiments show that the proposed framework leads to better outcomes, which can be considered promising compared to the solutions found in the literature.
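
A minimal sketch of the Random Forest evaluation step described above: scoring a candidate feature subset, encoded as a binary mask the way an NSGA-II individual would encode it, on two objectives (classification error and subset size). The NSGA-II search loop itself and the ANN component are omitted; the dataset is a synthetic placeholder.

```python
# Sketch: evaluating a candidate feature subset with a Random Forest, returning
# the two objectives a multi-objective search could minimize. The NSGA-II loop
# (e.g., from an evolutionary-computation library) is not shown; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))                   # placeholder network-traffic features
y = (X[:, 0] + X[:, 3] > 0).astype(int)          # placeholder attack / normal labels

def objectives(mask: np.ndarray) -> tuple[float, int]:
    """Return (1 - CV accuracy, number of selected features) for a 0/1 mask."""
    if mask.sum() == 0:
        return 1.0, 0                            # empty subsets get the worst score
    acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    return 1.0 - acc, int(mask.sum())

# Example: score one random candidate subset
print(objectives(rng.integers(0, 2, size=X.shape[1])))
```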


2020 ◽  
Vol 9 (9) ◽  
pp. 504
Author(s):  
Quy Truong ◽  
Guillaume Touya ◽  
Cyril Runz

Though Volunteered Geographic Information (VGI) has the advantage of providing free open spatial data, it is prone to vandalism, which may heavily decrease the quality of these data. Therefore, detecting vandalism in VGI may constitute a first way of assessing the data in order to improve their quality. This article explores the ability of supervised machine learning approaches to detect vandalism in OpenStreetMap (OSM) in an automated way. For this purpose, our work includes the construction of a corpus of vandalism data, given that no OSM vandalism corpus is available so far. Then, we investigate the ability of random forest methods to detect vandalism on the created corpus. Experimental results show that random forest classifiers perform well in detecting vandalism in the same geographical regions that were used for training the model, but have more issues with vandalism detection in “unfamiliar regions”.
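
A minimal sketch of the setup described above: a Random Forest vandalism classifier trained on contributions from one region and evaluated on another, to mimic the "unfamiliar region" test. The feature names (contributor edit count, tags modified, deleted-geometry fraction) and the data are illustrative assumptions, not the article's corpus.

```python
# Sketch: Random Forest vandalism detection with a cross-region train/test split.
# Features, labels, and the regional shift are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

def make_region(n, shift=0.0):
    X = np.column_stack([
        rng.poisson(50, n) + shift,     # contributor's past edit count
        rng.poisson(3, n),              # number of tags modified
        rng.uniform(0, 1, n),           # fraction of deleted geometry
    ])
    y = (X[:, 2] > 0.8).astype(int)     # placeholder "vandalism" label
    return X, y

X_a, y_a = make_region(2000)            # training region
X_b, y_b = make_region(500, shift=20)   # "unfamiliar" test region
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_a, y_a)
print(classification_report(y_b, clf.predict(X_b)))
```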


In society the population is increasing at a high rate, and people are not aware of advances in technology. Machine learning can be used to increase the crop yield and the quality of crops in the agriculture sector. In this project we propose a machine learning based solution for the analysis of important soil properties; based on these, we deal with the grading of the soil and the prediction of crops suited to the land. The various soil nutrients, such as EC (Electrical Conductivity), pH (Power of Hydrogen), OC (Organic Carbon), etc., are the feature variables, whereas the grade of the particular soil based on its nutrient content is the target variable. The dataset is preprocessed, a regression algorithm is applied, and the RMSE (Root Mean Square Error) is calculated for predicting the rank of the soil. We applied various classification algorithms for crop recommendation and found that Random Forest has the highest accuracy score.
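
A minimal sketch of the two tasks described above: a regression model for the soil grade scored with RMSE, and a Random Forest classifier for crop recommendation. The nutrient features (EC, pH, OC), the grading rule, and the crop labels are placeholders; the real dataset and grading scheme are not reproduced.

```python
# Sketch: soil-grade regression (RMSE) and crop-recommendation classification.
# Features, soil grades, and crop labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0.1, 2.0, 800),   # EC (dS/m)
                     rng.uniform(4.5, 9.0, 800),   # pH
                     rng.uniform(0.1, 1.5, 800)])  # OC (%)
grade = 10 - np.abs(X[:, 1] - 6.5) * 2 + rng.normal(scale=0.3, size=800)  # placeholder grade
crop = (X[:, 1] > 6.5).astype(int)                  # placeholder crop label

X_tr, X_te, g_tr, g_te, c_tr, c_te = train_test_split(X, grade, crop, test_size=0.25,
                                                      random_state=0)
reg = RandomForestRegressor(random_state=0).fit(X_tr, g_tr)
rmse = mean_squared_error(g_te, reg.predict(X_te)) ** 0.5
clf = RandomForestClassifier(random_state=0).fit(X_tr, c_tr)
print(f"soil-grade RMSE: {rmse:.3f}, crop accuracy: {accuracy_score(c_te, clf.predict(X_te)):.3f}")
```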

