Estimation of relative effectiveness of phylogenetic programs by machine learning

2014 ◽  
Vol 12 (02) ◽  
pp. 1441004
Author(s):  
Mikhail Krivozubov ◽  
Florian Goebels ◽  
Sergei Spirin

Reconstruction of the phylogeny of a protein family from a sequence alignment can produce results of different quality. Our goal is to predict the quality of phylogeny reconstruction based on features that can be extracted from the input alignment. We used the Fitch–Margoliash (FM) method of phylogeny reconstruction and a random forest as the predictor. For training and testing the predictor, alignments of orthologous series (OS) were used, for which the result of phylogeny reconstruction can be evaluated by comparison with the trees of the corresponding organisms. Our results show that the quality of phylogeny reconstruction can be predicted with more than 80% precision. We also tried to predict which phylogeny reconstruction method, FM or UPGMA, is better for a particular alignment. With the feature set used, among the alignments for which the obtained predictor predicts better performance of UPGMA, 56% indeed give a better result with UPGMA. Taking into account that UPGMA performs better for only 34% of alignments in our testing set, this result shows that it is in principle possible to predict the better phylogeny reconstruction method from features of a sequence alignment.
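
A minimal sketch of the kind of predictor described above: a random forest classifier trained on alignment-level features to predict whether reconstruction quality will be acceptable. The feature set (number of sequences, alignment length, mean pairwise identity) and the labels are illustrative assumptions, not the study's actual data.

```python
# Sketch: predicting phylogeny-reconstruction quality from alignment features.
# Feature names and labels are illustrative placeholders, not the study's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(4, 200, n),        # number of sequences in the alignment
    rng.integers(100, 2000, n),     # alignment length
    rng.uniform(0.2, 0.95, n),      # mean pairwise identity
])
y = (X[:, 2] > 0.5).astype(int)     # placeholder label: 1 = "good" reconstruction

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("precision:", precision_score(y_te, clf.predict(X_te)))
```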

Author(s):  
A.V. Kozina ◽  
Yu.S. Belov

Automatically assessing the quality of machine translation is an important yet challenging task for machine translation research. Translation quality estimation is understood as predicting the quality of a translation without a reference translation. Translation quality depends on the specific machine translation system and often requires post-editing. Manual editing is a long and expensive process. Since the need to quickly determine translation quality is growing, its automation is required. In this paper, we propose a quality assessment method based on ensemble supervised machine learning methods. The bilingual corpus WMT 2019 for the English–Russian language pair was used as data. The text data volume is 17,089 sentences; 85% of the data was used for training and 15% for testing the model. Linguistic features extracted from the text in the source and target languages were used as features for training the system, since it is these characteristics that can most accurately characterize the translation in terms of quality. The following tools were used for feature extraction: a free language modeling tool based on SRILM and the Stanford POS Tagger part-of-speech tagger. Before training the system, the text was preprocessed. The model was trained using three regression methods: Bagging, Extra Trees, and Random Forest. The algorithms were implemented in the Python programming language using the scikit-learn library. The parameters of the random forest method were optimized using a grid search. The performance of the model was assessed by the mean absolute error (MAE) and the root mean square error (RMSE), as well as by the Pearson coefficient, which measures the correlation with human judgment. Testing was carried out using three machine translation systems: the Google and Bing neural systems and the Moses statistical machine translation system, both phrase-based and syntax-based. Based on the results of the work, the Extra Trees method performed best. In addition, for all categories of indicators under consideration, the best results are achieved using the Google machine translation system. The developed method showed good results close to human judgment. The system can be used for further research in the task of assessing the quality of translation.
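
A minimal sketch of the ensemble-regression setup described above: Bagging, Extra Trees, and a grid-searched Random Forest scored with MAE, RMSE, and Pearson correlation. The feature matrix, quality scores, and the hyperparameter grid are assumptions for illustration; the real linguistic features would come from SRILM and the Stanford POS Tagger.

```python
# Sketch: ensemble regression for translation-quality estimation.
# X (linguistic features per sentence pair) and y (quality scores) are
# synthetic placeholders; the hyperparameter grid is illustrative.
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 17))                          # placeholder linguistic features
y = X[:, 0] * 0.5 + rng.normal(scale=0.1, size=1000)     # placeholder quality scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "Bagging": BaggingRegressor(random_state=0),
    "ExtraTrees": ExtraTreesRegressor(random_state=0),
    "RandomForest": GridSearchCV(                        # grid search over RF parameters
        RandomForestRegressor(random_state=0),
        {"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=3,
    ),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    r, _ = pearsonr(y_te, pred)
    print(f"{name}: MAE={mae:.3f} RMSE={rmse:.3f} Pearson={r:.3f}")
```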


Water ◽  
2021 ◽  
Vol 13 (18) ◽  
pp. 2457
Author(s):  
Manel Naloufi ◽  
Françoise S. Lucas ◽  
Sami Souihi ◽  
Pierre Servais ◽  
Aurélie Janne ◽  
...  

Exposure to contaminated water during aquatic recreational activities can lead to gastrointestinal diseases. In order to decrease the exposure risk, the fecal indicator bacterium Escherichia coli is routinely monitored, which is time-consuming, labor-intensive, and costly. To assist stakeholders in the daily management of bathing sites, models have been developed to predict the microbiological quality. However, model performance is highly dependent on the quality of the input data, which are usually scarce. In our study, we propose a conceptual framework for optimizing the selection of the most suitable model and for enriching the training dataset. This framework was successfully applied to the prediction of Escherichia coli concentrations in the Marne River (Paris Area, France). We compared the performance of six machine learning (ML)-based models: K-nearest neighbors, Decision Tree, Support Vector Machines, Bagging, Random Forest, and Adaptive boosting. Based on several statistical metrics, the Random Forest model presented the best accuracy compared to the other models. However, 53.2 ± 3.5% of the predicted E. coli densities were inaccurately estimated according to the mean absolute percentage error (MAPE). Four parameters (temperature, conductivity, 24-h cumulative rainfall of the day preceding sampling, and the river flow) were identified as key variables to be monitored for optimization of the ML model. The set of values to be optimized will feed an alert system for monitoring the microbiological quality of the water through a combined strategy of in situ manual sampling and the deployment of a network of sensors. Based on these results, we propose a guideline for ML model selection and sampling optimization.
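
A minimal sketch of the model-comparison step described above: the six regressor families scored with MAPE under cross-validation. The four input variables and the target are synthetic placeholders; the scorer name requires scikit-learn 0.24 or later.

```python
# Sketch: comparing several regressors for E. coli concentration prediction
# and scoring them with MAPE. Inputs (temperature, conductivity, 24-h rainfall,
# river flow) and the target densities are synthetic placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))                    # temperature, conductivity, rainfall, flow
y = np.exp(X[:, 0]) + rng.gamma(2.0, size=400)   # placeholder E. coli densities (positive)

models = {
    "kNN": KNeighborsRegressor(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "SVM": SVR(),
    "Bagging": BaggingRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(random_state=0),
    "AdaBoost": AdaBoostRegressor(random_state=0),
}
for name, model in models.items():
    # scikit-learn returns negative MAPE from this scorer; flip the sign to report MAPE
    mape = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_absolute_percentage_error").mean()
    print(f"{name}: MAPE={mape:.2%}")
```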


2022 ◽  
Vol 355 ◽  
pp. 03008
Author(s):  
Yang Zhang ◽  
Lei Zhang ◽  
Yabin Ma ◽  
Jinsen Guan ◽  
Zhaoxia Liu ◽  
...  

In this study, an electronic nose model composed of seven kinds of metal oxide semiconductor sensors was developed to distinguish the milk source (the dairy farm to which the milk belongs), estimate the milk fat and protein content, identify the authenticity of milk and evaluate its quality. The developed electronic nose is a low-cost, non-destructive testing device. (1) For the identification of milk sources, this paper combines the electronic nose odor characteristics of milk with its component characteristics to distinguish different milk sources, uses Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) for dimensionality reduction, and finally uses three machine learning algorithms, Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF), to build milk source (dairy farm) identification models and to evaluate and compare their classification performance. The experimental results show that the SVM-LDA model based on the electronic nose odor characteristics outperforms the other single-feature models, reaching a test set accuracy of 91.5%. The RF-LDA and SVM-LDA models based on the fusion of the two feature types perform best, with a test set accuracy as high as 96%. (2) Three algorithms, Gradient Boosting Decision Tree (GBDT), Extreme Gradient Boosting (XGBoost) and Random Forest (RF), were used to build models that estimate milk fat and protein content from the electronic nose odor data. The results show that the RF model has the best estimation performance (R2 = 0.9399 for milk fat; R2 = 0.9301 for milk protein). This demonstrates that the method proposed in this study can improve the estimation accuracy of milk fat and protein, which provides a technical basis for predicting the quality of dairy products.
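
A minimal sketch of the "SVM-LDA" configuration mentioned above: LDA dimensionality reduction followed by an SVM classifier in one pipeline. The sensor readings, number of farms, and labels are simulated assumptions, not the study's data.

```python
# Sketch: LDA dimensionality reduction followed by an SVM classifier ("SVM-LDA").
# Sensor readings and farm labels are simulated placeholders.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_farms, per_farm, n_sensors = 4, 100, 7
X = np.vstack([rng.normal(loc=i, size=(per_farm, n_sensors)) for i in range(n_farms)])
y = np.repeat(np.arange(n_farms), per_farm)      # milk source (farm) labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
svm_lda = make_pipeline(LinearDiscriminantAnalysis(n_components=3), SVC())
svm_lda.fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, svm_lda.predict(X_te)))
```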


ADMET & DMPK ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 29-77 ◽  
Author(s):  
Alex Avdeef

The accurate prediction of the solubility of drugs is still problematic. It was thought for a long time that the shortfalls were due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of those suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility have at best hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky's general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of the data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what is missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors which can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
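
For reference, Yalkowsky's general solubility equation (one of the three methods compared above) is simple enough to write out directly. The sketch below is a plain function of melting point and log P; the example values are illustrative, not drawn from the database described in the study.

```python
# Sketch: Yalkowsky's general solubility equation (GSE).
# Inputs: melting point (deg C) and log P (octanol-water partition coefficient).
# Output: log10 of the intrinsic aqueous solubility S0 (mol/L).
def gse_log_s0(melting_point_c: float, log_p: float) -> float:
    """General solubility equation: log S0 = 0.5 - 0.01*(MP - 25) - log P."""
    return 0.5 - 0.01 * (melting_point_c - 25.0) - log_p

# Example: a hypothetical compound melting at 150 deg C with log P = 2.5
print(gse_log_s0(150.0, 2.5))   # -> -3.25
```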


2021 ◽  
Vol 4 (2(112)) ◽  
pp. 58-72
Author(s):  
Chingiz Kenshimov ◽  
Zholdas Buribayev ◽  
Yedilkhan Amirgaliyev ◽  
Aisulyu Ataniyazova ◽  
Askhat Aitimov

In the course of our research work, the American, Russian and Turkish sign languages were analyzed. A program for recognizing the Kazakh dactylic (fingerspelling) sign language using machine learning methods was implemented. A dataset of 5000 images was formed for each gesture, and gesture recognition algorithms such as Random Forest, Support Vector Machine and Extreme Gradient Boosting were applied, while two data types were combined into one database, which required a change in the architecture of the system as a whole. The quality of the algorithms was also evaluated. The research was carried out because scientific work on developing a recognition system for the Kazakh dactyl sign language is currently insufficient for a complete representation of the language. The Kazakh language contains specific letters, and because of these spelling peculiarities, problems arise when developing recognition systems for the Kazakh sign language. The results showed that the Support Vector Machine and Extreme Gradient Boosting algorithms are superior in real-time performance, while the Random Forest algorithm has higher recognition accuracy. The classification accuracies were 98.86% for Random Forest, 98.68% for Support Vector Machine and 98.54% for Extreme Gradient Boosting. The evaluation of the quality of the classical algorithms also shows high scores. The practical significance of this work lies in the fact that scientific research on gesture recognition with the updated alphabet of the Kazakh language has not yet been conducted, and the results of this work can be used by other researchers for further research on the recognition of the Kazakh dactyl sign language, as well as by researchers engaged in the development of the international sign language.
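
A minimal sketch of the algorithm comparison described above, assuming gesture feature vectors (for example, hand-landmark coordinates) have already been extracted from the images. The features and labels are random placeholders, and scikit-learn's GradientBoostingClassifier stands in for XGBoost to keep the example dependency-free.

```python
# Sketch: comparing Random Forest, SVM, and gradient-boosting classifiers on
# gesture feature vectors. Data are synthetic placeholders; GradientBoosting
# substitutes for XGBoost here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_classes, per_class, n_features = 10, 200, 42     # e.g., 21 landmarks x (x, y)
X = np.vstack([rng.normal(loc=i, scale=2.0, size=(per_class, n_features))
               for i in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC()),
                  ("GradientBoosting", GradientBoostingClassifier(random_state=0))]:
    acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
    print(f"{name}: accuracy={acc:.4f}")
```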


Electronics ◽  
2020 ◽  
Vol 9 (4) ◽  
pp. 577 ◽  
Author(s):  
Anahita Golrang ◽  
Alale Mohammadi Golrang ◽  
Sule Yildirim Yayilgan ◽  
Ogerta Elezaj

Machine-learning techniques have gained popularity in intrusion-detection systems in recent years. Moreover, the quality of datasets plays a crucial role in the development of a proper machine-learning approach. Therefore, an appropriate feature-selection method can be considered an influential factor in improving the quality of datasets, which leads to high-performance intrusion-detection systems. In this paper, a hybrid multi-objective approach is proposed to detect attacks in a network efficiently. Initially, a multi-objective genetic method (NSGA-II), as well as an artificial neural network (ANN), are run simultaneously to extract feature subsets. We modified the NSGA-II approach to maintain diversity control in this evolutionary algorithm. Next, a Random Forest approach, as an ensemble method, is used to evaluate the efficiency of the feature subsets. Results of the experiments show that the proposed framework leads to better outcomes, which can be considered promising compared to the solutions found in the literature.
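
A minimal sketch of the Random Forest evaluation step described above: scoring a candidate feature subset, encoded as a binary mask the way an NSGA-II individual would encode it, on two objectives (classification error and subset size). The NSGA-II search loop itself and the ANN component are omitted; the dataset is a synthetic placeholder.

```python
# Sketch: evaluating a candidate feature subset with a Random Forest, returning
# the two objectives a multi-objective search could minimize. The NSGA-II loop
# (e.g., from an evolutionary-computation library) is not shown; data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 30))                   # placeholder network-traffic features
y = (X[:, 0] + X[:, 3] > 0).astype(int)          # placeholder attack / normal labels

def objectives(mask: np.ndarray) -> tuple[float, int]:
    """Return (1 - CV accuracy, number of selected features) for a 0/1 mask."""
    if mask.sum() == 0:
        return 1.0, 0                            # empty subsets get the worst score
    acc = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    return 1.0 - acc, int(mask.sum())

# Example: score one random candidate subset
print(objectives(rng.integers(0, 2, size=X.shape[1])))
```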


2020 ◽  
Vol 9 (9) ◽  
pp. 504
Author(s):  
Quy Truong ◽  
Guillaume Touya ◽  
Cyril Runz

Though Volunteered Geographic Information (VGI) has the advantage of providing free open spatial data, it is prone to vandalism, which may heavily decrease the quality of these data. Therefore, detecting vandalism in VGI may constitute a first way of assessing the data in order to improve their quality. This article explores the ability of supervised machine learning approaches to detect vandalism in OpenStreetMap (OSM) in an automated way. For this purpose, our work includes the construction of a corpus of vandalism data, given that no OSM vandalism corpus is available so far. Then, we investigate the ability of random forest methods to detect vandalism on the created corpus. Experimental results show that random forest classifiers perform well in detecting vandalism in the same geographical regions that were used for training the model, but have more issues with vandalism detection in “unfamiliar regions”.
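
A minimal sketch of the setup described above: a Random Forest vandalism classifier trained on contributions from one region and evaluated on another, to mimic the "unfamiliar region" test. The feature names (contributor edit count, tags modified, deleted-geometry fraction) and the data are illustrative assumptions, not the article's corpus.

```python
# Sketch: Random Forest vandalism detection with a cross-region train/test split.
# Features, labels, and the regional shift are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)

def make_region(n, shift=0.0):
    X = np.column_stack([
        rng.poisson(50, n) + shift,     # contributor's past edit count
        rng.poisson(3, n),              # number of tags modified
        rng.uniform(0, 1, n),           # fraction of deleted geometry
    ])
    y = (X[:, 2] > 0.8).astype(int)     # placeholder "vandalism" label
    return X, y

X_a, y_a = make_region(2000)            # training region
X_b, y_b = make_region(500, shift=20)   # "unfamiliar" test region
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_a, y_a)
print(classification_report(y_b, clf.predict(X_b)))
```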


In society the population is increasing at a high rate, and people are not aware of advances in technology. Machine learning can be used to increase the crop yield and the quality of crops in the agriculture sector. In this project we propose a machine learning based solution for the analysis of important soil properties; based on these, we deal with the grading of the soil and the prediction of crops suited to the land. The various soil nutrients, such as EC (Electrical Conductivity), pH (Power of Hydrogen), OC (Organic Carbon), etc., are the feature variables, whereas the grade of the particular soil based on its nutrient content is the target variable. The dataset is preprocessed, a regression algorithm is applied, and the RMSE (Root Mean Square Error) is calculated for predicting the rank of the soil. We applied various classification algorithms for crop recommendation and found that Random Forest has the highest accuracy score.
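
A minimal sketch of the two tasks described above: a regression model for the soil grade scored with RMSE, and a Random Forest classifier for crop recommendation. The nutrient features (EC, pH, OC), the grading rule, and the crop labels are placeholders; the real dataset and grading scheme are not reproduced.

```python
# Sketch: soil-grade regression (RMSE) and crop-recommendation classification.
# Features, soil grades, and crop labels are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0.1, 2.0, 800),   # EC (dS/m)
                     rng.uniform(4.5, 9.0, 800),   # pH
                     rng.uniform(0.1, 1.5, 800)])  # OC (%)
grade = 10 - np.abs(X[:, 1] - 6.5) * 2 + rng.normal(scale=0.3, size=800)  # placeholder grade
crop = (X[:, 1] > 6.5).astype(int)                  # placeholder crop label

X_tr, X_te, g_tr, g_te, c_tr, c_te = train_test_split(X, grade, crop, test_size=0.25,
                                                      random_state=0)
reg = RandomForestRegressor(random_state=0).fit(X_tr, g_tr)
rmse = mean_squared_error(g_te, reg.predict(X_te)) ** 0.5
clf = RandomForestClassifier(random_state=0).fit(X_tr, c_tr)
print(f"soil-grade RMSE: {rmse:.3f}, crop accuracy: {accuracy_score(c_te, clf.predict(X_te)):.3f}")
```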

