Detection of adaptive divergence in populations of the stream mayfly Ephemera strigata with machine learning

Mapping Intimacies ◽

10.1101/424085 ◽

2018 ◽

Author(s):

Bin Li ◽

Sakiko Yaegashi ◽

Thaddeus M Carvajal ◽

Maribet Gamboa ◽

Kozo Watanabe

Keyword(s):

Machine Learning ◽

Random Forest ◽

River Basin ◽

Evolutionary Biology ◽

Natural Populations ◽

Dispersal Ability ◽

Adaptive Divergence ◽

Genome Scanning ◽

Northeastern Japan ◽

A Genome

Abstract Adaptive divergence is a key mechanism shaping the genetic variation of natural populations. A central question linking ecology with evolutionary biology concerns the role of environmental heterogeneity in determining adaptive divergence among local populations within a species. In this study, we examined the adaptive divergence among populations of the stream mayfly Ephemera strigata in the Natori River Basin in northeastern Japan. We used a genome scanning approach to detect candidate loci under selection and then applied a machine learning method (i.e. Random Forest) and traditional distance-based redundancy analysis (dbRDA) to examine relationships between environmental factors and adaptive divergence at non-neutral loci. We also assessed spatial autocorrelation at neutral loci to quantify the dispersal ability of E. strigata. Our main findings were as follows: 1) random forest shows a higher resolution than traditional statistical analysis for detecting adaptive divergence; 2) separating markers into neutral and non-neutral loci provides insights into genetic diversity, local adaptation and dispersal ability; and 3) E. strigata shows altitudinal adaptive divergence among the populations in the Natori River Basin.

Assessing the soil quality of Bansloi river basin, eastern India using soil-quality indices (SQIs) and Random Forest machine learning technique

Ecological Indicators ◽

10.1016/j.ecolind.2020.106804 ◽

2020 ◽

Vol 118 ◽

pp. 106804

Author(s):

Gopal Chandra Paul ◽

Sunil Saha ◽

Krishna Gopal Ghosh

Keyword(s):

Machine Learning ◽

Random Forest ◽

Soil Quality ◽

River Basin ◽

Eastern India ◽

Quality Indices ◽

Machine Learning Technique ◽

Learning Technique

Modelling hydrological responses under climate change using machine learning algorithms – semi-arid river basin of peninsular India

H2Open Journal ◽

10.2166/h2oj.2020.034 ◽

2020 ◽

Vol 3 (1) ◽

pp. 481-498

Author(s):

G. Sireesha Naidu ◽

M. Pratik ◽

S. Rehana

Keyword(s):

Climate Change ◽

Machine Learning ◽

Random Forest ◽

River Basin ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Peninsular India ◽

Hydrological Responses ◽

Semi Arid

Abstract Catchment-scale conceptual hydrological models apply calibration parameters entirely based on observed historical data in the climate change impact assessment. The study used the most advanced machine learning algorithms based on Ensemble Regression and Random Forest models to develop dynamically calibrated factors which can form a basis for the analysis of hydrological responses under climate change. The Random Forest algorithm was identified as a robust method to model the calibration factors with limited data for training and testing with precipitation, evapotranspiration and uncalibrated runoff based on various performance measures. The developed model was further used to study the runoff response under climate change variability of precipitation and temperatures. A statistical downscaling model based on K-means clustering, Classification and Regression Trees and Support Vector Regression was used to develop the precipitation and temperature projections based on MIROC GCM outputs with the RCP 4.5 scenario. The proposed modelling framework has been demonstrated on a semi-arid river basin of peninsular India, Krishna River Basin (KRB). The basin outlet runoff was predicted to decrease (13.26%) for future scenarios under climate change due to an increase in temperature (0.6 °C), compared to a precipitation increase (13.12%), resulting in an overall reduction in water availability over KRB.

Development and Validation of an Insulin Resistance Predicting Model Using a Machine-Learning Approach in a Population-Based Cohort in Korea

Diagnostics ◽

10.3390/diagnostics12010212 ◽

2022 ◽

Vol 12 (1) ◽

pp. 212

Author(s):

Sunmin Park ◽

Chaeyeon Kim ◽

Xuangao Wu

Keyword(s):

Machine Learning ◽

Insulin Resistance ◽

Metabolic Syndrome ◽

Logistic Regression ◽

Random Forest ◽

Roc Curve ◽

Genome Wide Association Study ◽

Prediction Models ◽

Risk Scores ◽

A Genome

Background: Insulin resistance is a common etiology of metabolic syndrome, but receiver operating characteristic (ROC) curve analysis shows a weak association in Koreans. Using a machine learning (ML) approach, we aimed to generate the best model for predicting insulin resistance in Korean adults aged > 40 of the Ansan/Ansung cohort. Methods: The demographic, anthropometric, biochemical, genetic, nutrient, and lifestyle variables of 8842 participants were included. The polygenetic risk scores (PRS) generated by a genome-wide association study were added to represent the genetic impact of insulin resistance. The participants were divided randomly into the training (n = 7037) and test (n = 1769) sets. Potentially important features were selected in the highest area under the curve (AUC) of the ROC curve from 99 features using seven different ML algorithms. The AUC target was ≥0.85 for the best prediction of insulin resistance with the lowest number of features. Results: The cutoff of insulin resistance defined with HOMA-IR was 2.31 using logistic regression before conducting ML. XGBoost and logistic regression algorithms generated the highest AUC (0.86) of the prediction models using 99 features, while the random forest algorithm generated a model with 0.82 AUC. These models showed high accuracy and k-fold values (>0.85). The prediction model containing 15 features had the highest AUC of the ROC curve in XGBoost and random forest algorithms. PRS was one of 15 features. The final prediction models for insulin resistance were generated with the same nine features in the XGBoost (AUC = 0.86), random forest (AUC = 0.84), and artificial neural network (AUC = 0.86) algorithms. The model included the fasting serum glucose, ALT, total bilirubin, HDL concentrations, waist circumference, body fat, pulse, season to enroll in the study, and gender. Conclusion: The liver function, regular pulse checking, and seasonal variation in addition to metabolic syndrome components should be considered to predict insulin resistance in Koreans aged over 40 years.
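The abstract labels insulin resistance with a HOMA-IR cutoff of 2.31 but does not spell out the formula. The following is a minimal sketch, assuming the commonly used HOMA-IR estimate (fasting insulin in µU/mL times fasting glucose in mg/dL, divided by 405). The function names and the choice of "greater than or equal to" at the cutoff are illustrative assumptions, not the authors' code.

```python
def homa_ir(fasting_insulin_uU_ml: float, fasting_glucose_mg_dl: float) -> float:
    """Commonly used HOMA-IR estimate (assumed; the abstract gives no formula)."""
    return fasting_insulin_uU_ml * fasting_glucose_mg_dl / 405.0


def is_insulin_resistant(fasting_insulin_uU_ml: float,
                         fasting_glucose_mg_dl: float,
                         cutoff: float = 2.31) -> bool:
    """Binary label using the 2.31 cutoff reported in the abstract.

    Treating values equal to the cutoff as resistant is an assumption.
    """
    return homa_ir(fasting_insulin_uU_ml, fasting_glucose_mg_dl) >= cutoff
```

A label produced this way would serve as the target variable that classifiers such as XGBoost or random forest are then trained to predict.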

Catchment scale prediction of soil moisture trends from Cosmic Ray Neutron Rover Surveys using machine learning

10.5194/egusphere-egu2020-3049 ◽

2020 ◽

Author(s):

Erik Nixdorf ◽

Marco Hannemann ◽

Uta Ködel ◽

Martin Schrön ◽

Thomas Kalbacher

Keyword(s):

Machine Learning ◽

Soil Moisture ◽

Random Forest ◽

River Basin ◽

Hydrological Model ◽

Cosmic Ray ◽

Moisture Distribution ◽

Catchment Scale ◽

Forest Model ◽

Soil Moisture Distribution

Soil moisture is a critical hydrological component for determining hydrological state conditions and a crucial variable in controlling land-atmosphere interaction including evapotranspiration, infiltration and groundwater recharge. At the catchment scale, the soil moisture distribution is highly variable in space and time due to the influence of various factors such as soil heterogeneity, climate conditions, vegetation and geomorphology. Among the various existing soil moisture monitoring techniques, the application of vehicle-mounted Cosmic Ray Neutron Sensors (CRNS) allows monitoring soil moisture noninvasively by surveying larger regions within a reasonable time. However, measured data and their corresponding footprints are often allocated along the existing road network, leaving inaccessible parts of a catchment unobserved, and surveying larger areas in short intervals is often hindered by limited manpower. In this study, data from more than 200 000 CRNS rover readings measured over different regions of Germany within the last 4 years have been employed to characterize the trends of soil moisture distribution in the 209 km² Mueglitz River Basin in Eastern Germany. Subsets of the data have been used to train three different supervised machine learning algorithms (multiple linear regression, random forest and artificial neural network) based on 85 independent relevant dynamic and stationary features derived from public databases. The Random Forest model outperforms the other models (R² ≈ 0.8), relying on day-of-year, altitude, air temperature, humidity, soil organic carbon content and soil temperature as the five most influencing predictors. After training and testing the models, CRNS records for each day of the last decade are predicted on a 250 × 250 m grid of the Mueglitz River Basin using the same type of features. Derived CRNS record distributions are compared with both spatial soil moisture estimates from a hydrological model and point estimates from a sensor network operated during spring 2019. After variable standardization, preliminary results show that the applied Random Forest model is able to resemble the spatio-temporal trends estimated by the hydrological model and the point measurements. These findings demonstrate that training machine learning models on domain-unspecific large datasets of CRNS records using spatio-temporally available predictors has the potential to fill measurement gaps and to improve soil moisture dynamics predictions on a catchment scale.
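The abstract compares the three models by R² without stating how it was computed. As a rough sketch, assuming the usual coefficient of determination (one minus the ratio of residual to total sum of squares), the score for a set of held-out readings could be obtained as below. The function name is hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error of the model predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviation of observations from their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observations are identical")
    return 1.0 - ss_res / ss_tot
```

A value of 1 means the predictions match the observations exactly, while 0 means the model does no better than always predicting the observed mean.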

Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The results of accuracy comparison from RF models based on the scrambled, uniform, and original GARF potential clearly showed that the peak positions in the GARF potential are important while the well depths are not.

A Study on Host Tropism Determinants of Influenza Virus Using Machine Learning

Current Bioinformatics ◽

10.2174/1574893614666191104160927 ◽

2020 ◽

Vol 15 (2) ◽

pp. 121-134 ◽

Cited By ~ 2

Author(s):

Eunmi Kwon ◽

Myeongji Cho ◽

Hayeon Kim ◽

Hyeon S. Son

Keyword(s):

Machine Learning ◽

Amino Acids ◽

Influenza Virus ◽

Random Forest ◽

Physicochemical Properties ◽

Protein Sequences ◽

Influenza Viruses ◽

Host Tropism ◽

Post Hoc ◽

Ha Protein

Background: The host tropism determinants of influenza virus, which cause changes in the host range and increase the likelihood of interaction with specific hosts, are critical for understanding the infection and propagation of the virus in diverse host species. Methods: Six types of protein sequences of influenza viral strains isolated from three classes of hosts (avian, human, and swine) were obtained. Random forest, naïve Bayes classification, and k-nearest neighbor algorithms were used for host classification. The Java language was used for sequence analysis programming and identifying host-specific position markers. Results: A machine learning technique was explored to derive the physicochemical properties of amino acids used in host classification and prediction. HA protein was found to play the most important role in determining host tropism of the influenza virus, and the random forest method yielded the highest accuracy in host prediction. Conserved amino acids that exhibited host-specific differences were also selected and verified, and they were found to be useful position markers for host classification. Finally, ANOVA and post-hoc testing revealed that the physicochemical properties of amino acids, comprising protein sequences combined with position markers, differed significantly among hosts. Conclusion: The host tropism determinants and position markers described in this study can be used in related research to classify, identify, and predict the hosts of influenza viruses that are currently susceptible or likely to be infected in the future.
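The idea of host-specific position markers, alignment positions whose residue is conserved within each host class but differs between classes, can be illustrated with a short sketch. The study itself used Java and does not describe its selection rule in the abstract, so the function name, the `min_conservation` threshold and the rule below are assumptions for illustration only.

```python
from collections import Counter


def host_specific_positions(aligned_by_host, min_conservation=1.0):
    """Return {position: {host: residue}} for candidate marker columns.

    A column qualifies when every host has a residue present in at least
    `min_conservation` of its sequences and the hosts do not all share
    the same residue. Sequences must be pre-aligned to equal length.
    """
    if any(not seqs for seqs in aligned_by_host.values()):
        raise ValueError("every host needs at least one sequence")
    lengths = {len(seq) for seqs in aligned_by_host.values() for seq in seqs}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    markers = {}
    for pos in range(lengths.pop()):
        consensus = {}
        for host, seqs in aligned_by_host.items():
            residue, count = Counter(seq[pos] for seq in seqs).most_common(1)[0]
            if count / len(seqs) < min_conservation:
                break  # column is not conserved within this host
            consensus[host] = residue
        else:
            # Reached only when every host passed the conservation check
            if len(set(consensus.values())) > 1:
                markers[pos] = consensus
    return markers
```

Called on a dictionary that maps each host class to its aligned sequences, the function returns zero-based column indices together with the conserved residue per host, which could then be used as categorical features for a classifier.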

Development of Prediction Models Using Machine Learning Algorithms for Girls with Suspected Central Precocious Puberty: Retrospective Study (Preprint)

10.2196/preprints.11728 ◽

2018 ◽

Author(s):

Liyan Pan ◽

Guangjian Liu ◽

Xiaojian Mao ◽

Huixian Li ◽

Jiexin Zhang ◽

...

Keyword(s):

Machine Learning ◽

Retrospective Study ◽

Random Forest ◽

Precocious Puberty ◽

Prediction Models ◽

Central Precocious Puberty ◽

Machine Learning Algorithms ◽

Stimulation Test ◽

Gnrh Analogue ◽

Prediction Probability

BACKGROUND Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis, the gonadotropin-releasing hormone (GnRH) stimulation test or GnRH analogue (GnRHa) stimulation test, is expensive and makes patients uncomfortable due to the need for repeated blood sampling. OBJECTIVE We aimed to combine multiple CPP-related features and construct machine learning models to predict response to the GnRHa-stimulation test. METHODS In this retrospective study, we analyzed clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for prediction of response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models. RESULTS Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable models of LIME, the abovementioned variables made high contributions to the prediction probability. CONCLUSIONS The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa-stimulation test.
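The three reported metrics can be computed from true labels, predicted labels and predicted scores as sketched below. This is an illustrative stdlib-only version with hypothetical function names, not the authors' evaluation code. The AUC is computed in its rank form: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC loop is quadratic in the number of cases, which is acceptable for a cohort of a few thousand patients but would be replaced by a sort-based computation for larger data.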

Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Kinetik: Game Technology, Information System, Computer Network, Computing, Electronics, and Control ◽

10.22219/kinetik.v5i3.1066 ◽

2020 ◽

pp. 235-242

Author(s):

Farrikh Alzami ◽

Erika Devi Udayanti ◽

Dwi Puji Prabowo ◽

Rama Aria Megantara

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Random Forest ◽

Sentiment Analysis ◽

Classification Performance ◽

Document Preparation ◽

Learning Models ◽

Polarity Classification ◽

Negative Sentiment ◽

Machine Learning Models

Sentiment analysis in terms of polarity classification is very important in everyday life: knowing the polarity of a document lets people find out whether it carries positive or negative sentiment, which helps them choose and make decisions. Sentiment analysis is usually done manually; therefore, an automatic sentiment classification process is needed. However, studies that discuss which feature extraction methods and learning models suit unstructured sentiment analysis, with Amazon food reviews as the case, are rare. This research explores feature extraction methods such as Bag of Words, TF-IDF, Word2Vec, and a combination of TF-IDF and Word2Vec, paired with machine learning models such as Random Forest, SVM, KNN and Naïve Bayes, to find a combination of feature extraction and learning model that can help add variety to the analysis of polarity sentiments. With document preparation that removes HTML tags, punctuation and special characters and applies Snowball stemming, TF-IDF combined with SVM is suitable for polarity classification in unstructured sentiment analysis for the Amazon food review case, with a performance of 87.3 percent.
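The preprocessing and TF-IDF weighting steps can be sketched in a few lines. This is a simplified stdlib-only illustration, not the authors' pipeline: it uses one common TF-IDF variant (relative term frequency times the natural log of the inverse document frequency), it omits the Snowball stemming step mentioned in the abstract, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Strip HTML tags, lowercase, and drop punctuation, digits and special characters."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * ln(N / document frequency)
    """
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Terms that occur in every review receive a weight of zero under this variant, so the vectors passed to a classifier such as an SVM emphasize words that distinguish one review from another.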

The Genetic Architecture of Ovariole Number in Drosophila melanogaster: Genes with Major, Quantitative, and Pleiotropic Effects

G3 Genes|Genome|Genetics ◽

10.1534/g3.117.042390 ◽

2017 ◽

Vol 7 (7) ◽

pp. 2391-2403 ◽

Cited By ~ 11

Author(s):

Amanda S Lobell ◽

Rachel R Kaspari ◽

Yazmin L Serrano Negron ◽

Susan T Harbison

Keyword(s):

Candidate Genes ◽

Genome Wide Association Study ◽

Natural Populations ◽

Direct Role ◽

Genome Wide ◽

A Genome ◽

Fitness Trait ◽

Sleep Parameters ◽

Activity Behavior ◽

Ovariole Number

Abstract Ovariole number has a direct role in the number of eggs produced by an insect, suggesting that it is a key morphological fitness trait. Many studies have documented the variability of ovariole number and its relationship to other fitness and life-history traits in natural populations of Drosophila. However, the genes contributing to this variability are largely unknown. Here, we conducted a genome-wide association study of ovariole number in a natural population of flies. Using mutations and RNAi-mediated knockdown, we confirmed the effects of 24 candidate genes on ovariole number, including a novel gene, anneboleyn (formerly CG32000), that impacts both ovariole morphology and numbers of offspring produced. We also identified pleiotropic genes between ovariole number traits and sleep and activity behavior. While few polymorphisms overlapped between sleep parameters and ovariole number, 39 candidate genes were nevertheless in common. We verified the effects of seven genes on both ovariole number and sleep: bin3, blot, CG42389, kirre, slim, VAChT, and zfh1. Linkage disequilibrium among the polymorphisms in these common genes was low, suggesting that these polymorphisms may evolve independently.

Transformer Oil Quality Assessment Using Random Forest with Feature Engineering

Energies ◽

10.3390/en14071809 ◽

2021 ◽

Vol 14 (7) ◽

pp. 1809

Author(s):

Mohammed El Amine Senoussaoui ◽

Mostefa Brahami ◽

Issouf Fofana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Oil Quality ◽

Principal Component ◽

Condition Assessment ◽

Classification Performance ◽

Transformer Oil ◽

Classification Model ◽

Insulation Degradation ◽

Transformer Oils

Machine learning is widely used as a panacea in many engineering applications including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: the oils that can be maintained in service, the oils that should be reconditioned or filtered, the oils that should be reclaimed, and the oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved through not only the application of the proposed algorithm but also the approach of data preprocessing. Before feeding the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were again transformed by conducting principal component analysis and were passed through the CFsSubset filter. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the decrease in the number of datasets required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
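Correlation-based feature selection generally scores a candidate feature subset with a merit heuristic that rewards features correlated with the class and penalizes features correlated with each other. The abstract does not give the details of its filter, so the sketch below assumes the standard merit formula, merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The function name and the use of precomputed correlations as inputs are illustrative assumptions.

```python
import math


def cfs_merit(feature_class_corrs, mean_feature_feature_corr):
    """Merit of a feature subset under the assumed CFS heuristic.

    feature_class_corrs: correlation of each feature in the subset with the class
    mean_feature_feature_corr: average pairwise correlation among those features
    """
    k = len(feature_class_corrs)
    if k == 0:
        raise ValueError("the subset must contain at least one feature")
    mean_r_cf = sum(abs(r) for r in feature_class_corrs) / k
    r_ff = abs(mean_feature_feature_corr)
    return k * mean_r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

Under this formula two equally predictive but fully redundant features score no better than one of them alone, which is why a filter of this kind tends to shrink the feature set before classification.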