River Water Salinity Prediction Using Hybrid Machine Learning Models

Water ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 2951 ◽  
Author(s):  
Assefa M. Melesse ◽  
Khabat Khosravi ◽  
John P. Tiefenbacher ◽  
Salim Heddam ◽  
Sungwon Kim ◽  
...  

Electrical conductivity (EC), one of the most widely used indices for water quality assessment, has been applied to predict the salinity of the Babol-Rood River, the largest source of irrigation water in northern Iran. This study uses two individual algorithms—M5 Prime (M5P) and random forest (RF)—and eight novel hybrid algorithms—bagging-M5P, bagging-RF, random subspace (RS)-M5P, RS-RF, random committee (RC)-M5P, RC-RF, additive regression (AR)-M5P, and AR-RF—to predict EC. Thirty-six years of observations collected by the Mazandaran Regional Water Authority were randomly divided into two sets: 70% of the data, from 1980 to 2008, was used for model training, and 30%, from 2009 to 2016, was used as testing data to validate the models. Several water quality variables—pH, HCO3−, Cl−, SO42−, Na+, Mg2+, Ca2+, river discharge (Q), and total dissolved solids (TDS)—were used as modeling inputs. Using EC and the correlation coefficients (CC) of the water quality variables, a set of nine input combinations was established. TDS, the most effective input variable, had the highest EC-CC (r = 0.91) and was also determined to be the most important input variable among the input combinations. All models were trained, and each model’s prediction power was evaluated with the testing data. Several quantitative criteria and visual comparisons were used to evaluate modeling capabilities. Results indicate that, in most cases, hybrid algorithms enhance the individual algorithms’ predictive powers. The AR algorithm improved the M5P and RF predictions more than bagging, RS, and RC did. M5P performed better than RF. Further, AR-M5P outperformed all other algorithms (R2 = 0.995, RMSE = 8.90 μS/cm, MAE = 6.20 μS/cm, NSE = 0.994, and PBIAS = −0.042). The hybridization of machine learning methods significantly improved the models’ ability to capture maximum salinity values, which is essential in water resource management.
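The bagging hybridization applied here to M5P and RF can be illustrated with a dependency-free sketch: fit one base model per bootstrap resample of the training data, then average the base models' predictions. The 1-nearest-neighbour base learner and the TDS-to-EC toy data below are illustrative stand-ins, not the paper's models or measurements.

```python
import random

def knn1_fit(X, y):
    """Trivial base learner: memorize the data, predict the nearest point's target."""
    return list(zip(X, y))

def knn1_predict(model, x):
    return min(model, key=lambda p: abs(p[0] - x))[1]

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Bagging: fit one base model per bootstrap resample of (X, y)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(knn1_fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Ensemble prediction: average the base models' predictions."""
    preds = [knn1_predict(m, x) for m in models]
    return sum(preds) / len(preds)

# Toy data: EC roughly proportional to TDS (the strongest predictor, r = 0.91).
tds = [100, 200, 300, 400, 500]
ec = [150, 310, 455, 620, 760]
ens = bagging_fit(tds, ec)
print(round(bagging_predict(ens, 350), 1))
```

Averaging over bootstrap resamples reduces the variance of an unstable base learner, which is why bagging often improves tree-based regressors like M5P and RF.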

Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2503
Author(s):  
Taro Suzuki ◽  
Yoshiharu Amano

This paper proposes a method for detecting non-line-of-sight (NLOS) multipath, which causes large positioning errors in a global navigation satellite system (GNSS). We use GNSS signal correlation output, the most primitive GNSS signal processing output, to detect NLOS multipath based on machine learning. The shape of the multi-correlator outputs is distorted by NLOS multipath, and features describing this shape are used to discriminate NLOS multipath. We implement two supervised learning methods, a support vector machine (SVM) and a neural network (NN), and compare their performance. In addition, we propose an automated method of collecting LOS and NLOS training data for machine learning. The evaluation of the proposed NLOS detection method in an urban environment confirmed that the NN performed better than the SVM, and 97.7% of NLOS signals were correctly discriminated.
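The idea of shape features can be sketched without any GNSS tooling: for a clean line-of-sight signal the multi-correlator output approximates a symmetric triangle around the prompt tap, while a delayed NLOS reflection inflates the late taps. The two features and the fixed threshold below are illustrative assumptions; the paper instead feeds such features to a trained SVM or NN.

```python
def correlator_features(taps):
    """Shape features from multi-correlator outputs (ordered early -> late).
    A clean LOS correlation triangle is symmetric about the peak."""
    peak = max(taps)
    mid = len(taps) // 2
    early, late = sum(taps[:mid]), sum(taps[mid + 1:])
    asymmetry = (late - early) / peak          # NLOS delay skews energy late
    flatness = sum(taps) / (len(taps) * peak)  # multipath widens the triangle
    return asymmetry, flatness

def is_nlos(taps, asym_threshold=0.2):
    """Toy threshold rule; the paper trains an SVM/NN on features like these."""
    asymmetry, _ = correlator_features(taps)
    return abs(asymmetry) > asym_threshold

los = [0.25, 0.5, 1.0, 0.5, 0.25]    # symmetric triangle
nlos = [0.25, 0.5, 1.0, 0.8, 0.6]    # delayed reflection inflates late taps
print(is_nlos(los), is_nlos(nlos))
```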


2020 ◽  
Vol 142 (5) ◽  
Author(s):  
J. V. Taylor ◽  
B. Conduit ◽  
A. Dickens ◽  
C. Hall ◽  
M. Hillel ◽  
...  

Abstract The application of machine learning to aerospace problems faces a particular challenge. For successful learning, a large amount of good-quality training data is required, typically tens of thousands of cases. However, due to the time and cost of experimental aerospace testing, these data are scarce. This paper shows that successful learning is possible with two novel techniques. The first technique is rapid testing: over the last 5 years, the Whittle Laboratory has developed a capability where rebuild and test times of a compressor stage now take 15 min instead of weeks. The second technique is to base machine learning on physical parameters, derived from engineering wisdom developed in industry over many decades. The method is applied to the important industry problem of predicting the effect of blade damage on compressor operability. The current approach has high uncertainty because it is based on human judgement and correlation of a handful of experimental test cases. It is shown, using 100 training cases and 25 test cases, that the new method is able to predict the operability of damaged compressor stages to an accuracy of 2% at 95% confidence, far better than is possible by even the most experienced compressor designers. Use of the method is also shown to generate new physical understanding, previously unknown by any of the experts involved in this work. Using this method in the future offers an exciting opportunity to generate understanding of previously intractable problems in aerospace.


2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Lihong Huang ◽  
Canqiang Xu ◽  
Wenxian Yang ◽  
Rongshan Yu

Abstract Background Studies on metagenomic data of environmental microbial samples have found that microbial communities seem to be geolocation-specific, and that the microbiome abundance profile can be a differentiating feature for identifying samples’ geolocations. In this paper, we present a machine learning framework to determine geolocations from the metagenomic profiles of microbial samples. Results Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints. First, we extracted features from the metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set; prediction performance was evaluated on the validation set. Using logistic regression with L2 normalization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits of the training and validation datasets. The testing data consist of samples from cities that do not occur in the training data. To predict these “mystery” cities, which had not been sampled before, we first defined biological coordinates for the sampled cities based on the similarity of microbial samples from them. We then performed an affine transformation of the map so that the distance between cities measures their biological difference rather than their geographical distance. After that, we derived the probability that a given testing sample originates from an unsampled city from its predicted probabilities on sampled cities, using Kriging interpolation. Results show that this method can successfully assign high probabilities to the true cities of origin of testing samples.
Conclusion Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict the geolocations of samples from locations that are not in the training dataset.
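The split-and-average evaluation protocol can be sketched in plain Python. The single-feature data and the nearest-centroid classifier below are toy stand-ins for the abundance-profile features and the logistic regression used in the paper.

```python
import random

def nearest_centroid(train, x):
    """Toy stand-in for the paper's logistic regression:
    assign x to the class with the nearest mean feature value."""
    groups = {}
    for xi, yi in train:
        groups.setdefault(yi, []).append(xi)
    return min(groups, key=lambda c: abs(sum(groups[c]) / len(groups[c]) - x))

def mean_accuracy(data, n_splits=100, train_frac=0.7, seed=0):
    """Average validation accuracy over repeated random train/validation splits,
    mirroring the paper's 100-split evaluation protocol."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_splits):
        d = data[:]
        rng.shuffle(d)
        cut = int(train_frac * len(d))
        train, val = d[:cut], d[cut:]
        total += sum(nearest_centroid(train, x) == y for x, y in val) / len(val)
    return total / n_splits

# Toy one-feature "abundance profiles" for two well-separated cities.
data = [(x, "cityA") for x in (0.05, 0.1, 0.12, 0.15, 0.18, 0.2)] + \
       [(x, "cityB") for x in (0.8, 0.82, 0.85, 0.88, 0.9, 0.95)]
acc = mean_accuracy(data)
print(acc)
```

Averaging over many random splits, rather than reporting a single split, reduces the variance of the accuracy estimate when the dataset is small.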


Author(s):  
Brett J. Borghetti ◽  
Joseph J. Giametta ◽  
Christina F. Rusnock

Objective: We aimed to predict operator workload from neurological data using statistical learning methods to fit neurological-to-state-assessment models. Background: Adaptive systems require real-time mental workload assessment to perform dynamic task allocations or operator augmentation as workload issues arise. Neuroergonomic measures have great potential for informing adaptive systems, but their interpretation is inherently ambiguous; we reduce this ambiguity by combining these measures with models of task demand and with information about critical events and performance. Method: We use machine learning algorithms on electroencephalogram (EEG) input to infer operator workload based upon Improved Performance Research Integration Tool workload model estimates. Results: Cross-participant models predict the workload of other participants, statistically distinguishing between 62% of the workload changes. Machine learning models trained from Monte Carlo resampled workload profiles can be used in place of deterministic workload profiles for cross-participant modeling without a significant decrease in machine learning model performance, suggesting that stochastic models can be used when limited training data are available. Conclusion: We employed a novel temporary scaffold of simulation-generated workload profile truth data during the model-fitting process. A continuous workload profile serves as the target to train our statistical machine learning models. Once trained, the workload profile scaffolding is removed and the trained model is used directly on neurophysiological data in future operator state assessments. Application: These modeling techniques demonstrate how to use neuroergonomic methods to develop operator state assessments, which can be employed in adaptive systems.
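The idea of Monte Carlo resampled workload profiles can be sketched as drawing stochastic variants of a simulation-generated profile to serve as training targets. The Gaussian perturbation and its magnitude below are assumptions for illustration, not the paper's resampling scheme.

```python
import random

def resample_profiles(profile, n, noise_sd=0.05, seed=1):
    """Monte Carlo variants of a deterministic workload profile:
    each draw perturbs every time step with Gaussian noise,
    clipped at zero since workload cannot be negative."""
    rng = random.Random(seed)
    return [[max(0.0, w + rng.gauss(0, noise_sd)) for w in profile]
            for _ in range(n)]

baseline = [0.2, 0.4, 0.8, 0.6]      # simulation-generated workload over time
variants = resample_profiles(baseline, n=100)
mean_profile = [sum(v[i] for v in variants) / len(variants)
                for i in range(len(baseline))]
print([round(m, 2) for m in mean_profile])
```

By construction the variants scatter around the deterministic profile, so a model trained on them targets the same underlying workload while seeing more diverse training data.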


Science ◽  
2021 ◽  
Vol 371 (6535) ◽  
pp. eabe8628
Author(s):  
Marshall Burke ◽  
Anne Driscoll ◽  
David B. Lobell ◽  
Stefano Ermon

Accurate and comprehensive measurements of a range of sustainable development outcomes are fundamental inputs into both research and policy. We synthesize the growing literature that uses satellite imagery to understand these outcomes, with a focus on approaches that combine imagery with machine learning. We quantify the paucity of ground data on key human-related outcomes and the growing abundance and improving resolution (spatial, temporal, and spectral) of satellite imagery. We then review recent machine learning approaches to model-building in the context of scarce and noisy training data, highlighting how this noise often leads to incorrect assessment of model performance. We quantify recent model performance across multiple sustainable development domains, discuss research and policy applications, explore constraints to future progress, and highlight research directions for the field.


2021 ◽  
Author(s):  
Jan Wolff ◽  
Ansgar Klimke ◽  
Michael Marschollek ◽  
Tim Kacprowski

Introduction The COVID-19 pandemic has strong effects on most health care systems and individual service providers. Forecasting admissions can support the efficient organisation of hospital care. We aimed to forecast the number of admissions to psychiatric hospitals before and during the COVID-19 pandemic, and we compared the performance of machine learning models and time series models. This would eventually support timely resource allocation for the optimal treatment of patients. Methods We used admission data from 9 psychiatric hospitals in Germany between 2017 and 2020. We compared machine learning models with time series models in weekly, monthly and yearly forecasting before and during the COVID-19 pandemic. Our models were trained and validated with data from the first two years and tested in prospectively sliding time windows in the last two years. Results A total of 90,686 admissions were analysed. The models explained up to 90% of the variance in hospital admissions in 2019 and 75% in 2020, under the effects of the COVID-19 pandemic. The best models substantially outperformed a one-step seasonal naive forecast (seasonal mean absolute scaled error (sMASE) 2019: 0.59, 2020: 0.76). The best model in 2019 was a machine learning model (elastic net, mean absolute error (MAE): 7.25). The best model in 2020 was a time series model (exponential smoothing state space model with Box-Cox transformation, ARMA errors, and trend and seasonal components; MAE: 10.44), which adjusted more quickly to the shock effects of the COVID-19 pandemic. Models forecasting admissions one week in advance did not perform better than monthly and yearly models in 2019, but they did in 2020. The most important features for the machine learning models were calendrical variables. Conclusion Model performance did not vary much between modelling approaches before the COVID-19 pandemic, and established forecasts were substantially better than one-step seasonal naive forecasts. However, weekly time series models adjusted more quickly to the COVID-19-related shock effects. In practice, different forecast horizons could be used simultaneously to allow both early planning and quick adjustments to external effects.
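The sMASE headline metric can be reproduced from its common definition: the forecast's mean absolute error scaled by the in-sample MAE of a one-step seasonal naive forecast. The season length and toy series below are illustrative (weekly admissions data would use m = 52).

```python
def smase(actual, forecast, train, m=52):
    """Seasonal mean absolute scaled error: the forecast's MAE divided by the
    MAE of a one-step seasonal naive forecast (y_t = y_{t-m}) on the training
    data. Values below 1 beat the seasonal naive benchmark."""
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive = sum(abs(train[t] - train[t - m])
                for t in range(m, len(train))) / (len(train) - m)
    return mae / naive

# Toy series with season length m=4 for brevity and a mild upward trend.
train = [10, 12, 14, 12, 11, 13, 15, 13, 12, 14, 16, 14]
actual = [11, 13, 15, 13]
forecast = [10, 12, 14, 12]
print(smase(actual, forecast, train, m=4))
```

On this toy data the forecast errs by exactly as much as the seasonal naive benchmark, so sMASE is 1; the paper's best models score well below 1 (0.59 in 2019).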


Author(s):  
Daram Vishnu

Sentiment analysis means classifying a text into different emotional classes. Most sentiment analysis techniques divide text into a binary or ternary classification; in this paper we classify movie reviews into 5 classes. Multi-class sentiment analysis is a technique that reveals the exact sentiment of a review, not just the polarity of a given textual statement from positive to negative, so that one can know the precise sentiment of a review. Multi-class sentiment analysis has always been a challenging task, as natural languages are difficult to represent mathematically. The number of features is also generally large, which requires huge computational power, so to reduce the number of features we use part-of-speech tagging with TextBlob to extract the important features. Sentiment analysis is done using machine learning, which requires training data and testing data to train a model. Various models are trained and tested, and one model is finally selected based on its accuracy and confusion matrix. It is important to analyze reviews in textual form because a large number of reviews is available all over the web. Analyzing textual reviews can help firms find out how their products are received in the market. In this paper, sentiment analysis is demonstrated by analyzing movie reviews taken from the IMDb website.
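The part-of-speech filtering step can be sketched as keeping only the word classes that carry most sentiment signal. With TextBlob one would iterate over the (word, tag) pairs returned by `TextBlob(review).tags`; the toy lexicon below stands in for the tagger so the sketch stays dependency-free.

```python
# Toy POS lexicon standing in for TextBlob's part-of-speech tagger.
POS = {"movie": "NN", "plot": "NN", "was": "VBD", "utterly": "RB",
       "brilliant": "JJ", "boring": "JJ", "the": "DT", "and": "CC"}

def sentiment_features(review, keep={"JJ", "RB"}):
    """Keep only adjectives (JJ) and adverbs (RB): the POS classes that carry
    most sentiment signal, shrinking the feature space for the classifier."""
    words = review.lower().split()
    return [w for w in words if POS.get(w) in keep]

print(sentiment_features("The movie was utterly brilliant and the plot was boring"))
```

Dropping determiners, nouns, and auxiliary verbs before vectorization is what reduces the feature count, and hence the computational cost, that the paper mentions.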


2018 ◽  
Vol 13 (3) ◽  
pp. 408-428 ◽  
Author(s):  
Phu Vo Ngoc

We have surveyed many significant approaches over many years because the sentiment classification has many crucial applications in everyday life, such as in political activities, commodity production, and commercial activities. We propose a novel model using Latent Semantic Analysis (LSA) and a Dennis Coefficient (DNC) for big-data sentiment classification in English. Many LSA vectors (LSAVs) have successfully been transformed by using the DNC. We use the DNC and the LSAVs to classify the 11,000,000 documents of our testing data set against the 5,000,000 documents of our training data set in English. This novel model uses many sentiment lexicons of our basis English sentiment dictionary (bESD). We have tested the proposed model in both a sequential environment and a distributed network system. The results of the sequential system are not as good as those of the parallel environment. We achieved 88.76% accuracy on the testing data set, which is better than the accuracies of many previous semantic-analysis models. We have also compared the novel model with previous models; the results of our proposed model are better than those of the previous models. Many different fields can widely use the results of the novel model in commercial applications and surveys of sentiment classification.
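For binary presence/absence vectors, one common form of the Dennis coefficient can be computed directly from the 2x2 contingency counts; the formula below is that textbook form and is an assumption about the exact variant the paper uses.

```python
import math

def dennis(u, v):
    """Dennis coefficient (DNC) for two binary vectors, one common form:
    (a*d - b*c) / sqrt(n * (a+b) * (a+c)), where a, b, c, d count the
    1-1, 1-0, 0-1, and 0-0 positions and n is the vector length."""
    a = sum(x and y for x, y in zip(u, v))
    b = sum(x and not y for x, y in zip(u, v))
    c = sum((not x) and y for x, y in zip(u, v))
    d = sum((not x) and (not y) for x, y in zip(u, v))
    n = len(u)
    return (a * d - b * c) / math.sqrt(n * (a + b) * (a + c))

print(dennis([1, 1, 0, 0], [1, 1, 0, 0]))
```

Identical vectors score 1, while vectors that disagree everywhere score negatively, which is the property that makes the coefficient usable as a document similarity measure.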


2021 ◽  
Author(s):  
Prageeth R. Wijewardhane ◽  
Krupal P. Jethava ◽  
Jonathan A Fine ◽  
Gaurav Chopra

The Programmed Cell Death Protein 1/Programmed Death-Ligand 1 (PD-1/PD-L1) interaction is an immune checkpoint utilized by cancer cells to enhance immune suppression. There is a pressing need to develop small-molecule drugs that are fast acting, cost effective, and readily bioavailable compared to antibodies. Unfortunately, synthesizing and validating large libraries of small molecules to inhibit the PD-1/PD-L1 interaction in a blind manner is both time-consuming and expensive. To improve this drug discovery pipeline, we have developed a machine learning methodology trained on patent data to identify, synthesize, and validate PD-1/PD-L1 small-molecule inhibitors. Our model incorporates two features: docking scores to represent the energy of binding (E) as a global feature, and sub-graph features, through a graph neural network (GNN) of molecular topology, to represent local features. This interaction energy-based Graph Neural Network (EGNN) model outperforms traditional machine learning methods and a simple GNN, with an F1 score of 0.9524 and a Cohen’s kappa score of 0.8861 on the held-out test set, suggesting that the topology of the small molecule, the structural interaction in the binding pocket, and the chemical diversity of the training data are all important considerations for enhancing model performance. A bootstrapped EGNN model was used to select compounds, with predicted high and low potency, for synthesis and experimental validation of PD-1/PD-L1 inhibition. The potent inhibitor, (4-((3-(2,3-dihydrobenzo[b][1,4]dioxin-6-yl)-2-methylbenzyl)oxy)-2,6-dimethoxybenzyl)-D-serine, is a hybrid of two known bioactive scaffolds, with an IC50 of 339.9 nM, better than that of the known bioactive compound. We conclude that our bootstrapped EGNN model will be useful to identify target-specific, high-potency molecules designed by scaffold hopping, a well-known medicinal chemistry technique.
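The core EGNN idea, combining pooled local (per-atom) graph features with a global docking-energy feature before scoring, can be sketched in plain Python. The mean pooling, fixed weights, and sigmoid scorer below are illustrative simplifications, not the paper's trained architecture.

```python
import math

def egnn_score(node_feats, docking_energy, w_local, w_global, bias=0.0):
    """Toy version of the global+local feature combination: mean-pool per-atom
    feature vectors from the molecular graph, append the docking score as a
    global energy feature, and score the combination with a linear layer plus
    sigmoid. The weights here are arbitrary, not trained parameters."""
    pooled = [sum(col) / len(node_feats) for col in zip(*node_feats)]
    z = sum(w * f for w, f in zip(w_local, pooled)) + w_global * docking_energy + bias
    return 1.0 / (1.0 + math.exp(-z))  # probability-like inhibitor score

atoms = [[0.2, 1.0], [0.4, 0.0], [0.6, 1.0]]   # per-atom feature vectors
p = egnn_score(atoms, docking_energy=-9.1, w_local=[1.0, 0.5], w_global=-0.3)
print(0.0 < p < 1.0)
```

A favourable (more negative) docking energy pushes the score up through the negative global weight, which is the sense in which binding energy acts as a global feature alongside the graph's local topology features.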


2014 ◽  
Vol 21 (1) ◽  
pp. 67-74 ◽  
Author(s):  
Mohamed Marzouk ◽  
Mohamed Alaraby

This paper presents a fuzzy subtractive modelling technique to predict the weight of telecommunication towers, which is used to estimate their respective costs. It is implemented by utilizing data from previously installed telecommunication towers, considering four input parameters: a) tower height; b) allowed tilt or deflection; c) antenna subjected area loading; and d) wind load. Telecommunication towers are classified according to designated code (TIA-222-F and TIA-222-G standards) and structure type (Self-Supporting Tower (SST) and Roof Top (RT)). As such, four fuzzy subtractive models are developed to represent the four classes. To build the fuzzy models, 90% of the data are utilized and fed to Matlab software as training data. The remaining 10% of the data are utilized to test model performance. A first-order Sugeno-type model is used to optimize model performance in predicting tower weights. Errors are estimated using Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) for both training and testing data sets. A sensitivity analysis is carried out to validate the model and observe the effect of the clusters’ radius on model performance.
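The subtractive part of fuzzy subtractive modelling can be sketched via the standard potential (mountain) function: each data point's potential is a sum of Gaussian contributions from all points, and the highest-potential point becomes the first cluster centre. The alpha = 4/ra^2 choice is the textbook default, and the toy tower records are invented for illustration.

```python
import math

def potentials(points, ra=1.0):
    """Subtractive clustering potentials: each point's potential sums Gaussian
    contributions from every point, with alpha = 4 / ra^2 controlling how far
    a point's influence reaches (ra is the cluster radius)."""
    alpha = 4.0 / ra ** 2
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return [sum(math.exp(-alpha * d2(p, q)) for q in points) for p in points]

# Toy tower records: (height, wind load) pairs, scaled to [0, 1].
data = [(0.1, 0.1), (0.15, 0.1), (0.1, 0.15), (0.9, 0.9)]
pot = potentials(data, ra=0.5)
print(pot.index(max(pot)))   # the densest region wins the first cluster centre
```

A larger cluster radius ra flattens the potential surface and yields fewer clusters, which is why the paper's sensitivity analysis varies the clusters' radius.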

