Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data

2020, Vol. 171, pp. 115454
Author(s): Kangyang Chen, Hexia Chen, Chuanlong Zhou, Yichao Huang, Xiangyang Qi, ...
Author(s): Sankhadeep Chatterjee, Sarbartha Sarkar, Nilanjan Dey, Amira S. Ashour, Soumya Sen

Water pollution from industrial and domestic sources severely degrades water quality. In both developing and developed countries, it has become a major cause of waterborne diseases, and the resulting harm to public health imposes an additional economic burden for deploying precautionary measures. Recent research has therefore turned toward more sustainable solutions to this problem. It has been shown that a good-quality water supply not only improves public health but also accelerates the economic growth of a region. Water quality prediction using machine learning methods is still at a primitive stage, and most studies have not followed any national or international standard for water quality assessment. The current work addresses both problems. First, an advanced machine learning method, an Artificial Neural Network (ANN) trained with the well-known multi-objective optimization algorithm Non-dominated Sorting Genetic Algorithm-II (NSGA-II), is used to classify water samples into two classes. Second, the Indian national standard for drinking water (IS 10500:2012) is adopted for this classification task. The hybrid NN-NSGA-II model is compared with two other well-known metaheuristic-trained ANN classifiers, an ANN trained by a Genetic Algorithm (NN-GA) and one trained by Particle Swarm Optimization (NN-PSO); a support vector machine (SVM) is also included in the comparative study. Beyond evaluating the models with several performance measures, the statistical significance of the NN-NSGA-II results is assessed with a Wilcoxon rank-sum test at the 5% significance level. The results indicate the superiority of the proposed NN-NSGA-II model over the other classifiers under study.
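
The abstract gives no code, so the Python sketch below illustrates only the statistical comparison it describes: per-run accuracies of two classifiers compared with a Wilcoxon rank-sum test at the 5% significance level. The synthetic dataset, the plain scikit-learn MLP standing in for the metaheuristic-trained ANNs, the SVM baseline configuration, and the ten-run protocol are all illustrative assumptions, not the authors' setup.

```python
# Sketch (not the authors' code): compare per-run accuracies of two classifiers
# with a Wilcoxon rank-sum test, as the abstract describes.
import numpy as np
from scipy.stats import ranksums
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the water-sample feature matrix (binary labels).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

def run_accuracies(model_factory, n_runs=10):
    """Accuracy over repeated random train/test splits (assumed protocol)."""
    accs = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        model = model_factory()
        model.fit(X_tr, y_tr)
        accs.append(model.score(X_te, y_te))
    return np.array(accs)

ann_accs = run_accuracies(lambda: MLPClassifier(hidden_layer_sizes=(16,),
                                                max_iter=500))
svm_accs = run_accuracies(lambda: SVC())

# Null hypothesis: both accuracy samples come from the same distribution;
# reject at the 5% significance level if p < 0.05.
stat, p = ranksums(ann_accs, svm_accs)
print(f"Wilcoxon rank-sum: statistic={stat:.3f}, p={p:.4f}")
```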


Sentiment analysis of SNS data can capture the opinions of a wide range of people. An analysis based on SNS can therefore anticipate social problems and identify their complex causes more accurately. In addition, big data technology can process SNS information as it is generated in real time, so broad public opinion can be understood without delay, complementing traditional opinion surveys. The incumbent government mainly uses SNS to promote its policies, but measures are needed to actively reflect SNS feedback in the policy process. This paper therefore develops a sentiment classifier that identifies public feelings about climate change on SNS. To that end, based on a dictionary built around the theme of climate change, we collected climate change SNS data for training and tagged it with seven sentiments. Using the training data, sentiment classifiers were built with several machine learning models. The analysis showed that the Bi-LSTM model outperformed the shallow models: it achieved the highest accuracy (85.10%) across the seven sentiment classes, exceeding traditional machine learning baselines (Naive Bayes and SVM) by approximately 34.53 and 7.14 percentage points, respectively. These findings substantiate the applicability of the proposed Bi-LSTM-based sentiment classifier to the analysis of sentiments on diverse climate change issues.
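
As a rough illustration of the kind of model the paper evaluates, the sketch below builds a seven-class Bi-LSTM sentiment classifier in Keras. The vocabulary size, sequence length, and embedding/LSTM dimensions are assumptions; the paper's exact architecture, tokenizer, and sentiment dictionary are not described in the abstract.

```python
# Sketch of a seven-class Bi-LSTM sentiment classifier (illustrative sizes).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed tokenizer vocabulary size
MAX_LEN = 100        # assumed padded sequence length
NUM_CLASSES = 7      # the seven tagged sentiments

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),          # integer token IDs per SNS post
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),   # reads each post in both directions
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # one probability per sentiment
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer class labels
              metrics=["accuracy"])
model.summary()
# Training on tokenized, padded SNS posts would then look like:
# model.fit(x_train, y_train, validation_split=0.1, epochs=5)
```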


2021, Vol. 23 (1)
Author(s): Seulkee Lee, Seonyoung Kang, Yeonghee Eun, Hong-Hee Won, Hyungjin Kim, ...

Abstract Background Few studies on rheumatoid arthritis (RA) have built machine learning models to predict responses to biologic disease-modifying antirheumatic drugs (bDMARDs), and those that have offered insufficient analysis of important features. Moreover, machine learning has not yet been used to predict bDMARD responses in ankylosing spondylitis (AS). In this study, machine learning was therefore used to predict such responses in RA and AS patients. Methods Data were retrieved from the Korean College of Rheumatology Biologics therapy (KOBIO) registry. The training dataset comprised 625 RA and 611 AS patients. We prepared independent test datasets that were not used at any stage of model development. Baseline clinical characteristics served as input features. Responders were defined as those who met the ACR 20% improvement response criteria (ACR20) in RA and the ASAS 20% improvement response criteria (ASAS20) in AS at the first follow-up. Multiple machine learning methods, including random forest (RF), were used to build models predicting bDMARD responses, and these were compared with a logistic regression model. Results On the independent RA test dataset, the RF model outperformed the logistic regression model (accuracy: 0.726 [95% confidence interval (CI) 0.725–0.730] vs. 0.689 [0.606–0.717]; area under the receiver operating characteristic curve (ROC AUC): 0.638 [0.576–0.658] vs. 0.565 [0.493–0.605]; F1 score: 0.841 [0.837–0.843] vs. 0.803 [0.732–0.828]; area under the precision-recall curve: 0.808 [0.763–0.829] vs. 0.754 [0.714–0.789]). In AS patients, however, machine learning and logistic regression showed similar prediction performance. Furthermore, patient self-reporting scales, namely the patient global assessment of disease activity (PtGA) in RA and the Bath Ankylosing Spondylitis Functional Index (BASFI) in AS, emerged as the most important features in both diseases. Conclusions The RF model predicted bDMARD responses better than a conventional statistical method, logistic regression, in RA patients. In contrast, despite a comparable dataset size, machine learning did not outperform logistic regression in AS patients. According to the feature importance analysis, the most important features in both diseases were patient self-reporting scales.
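
The comparison in the Results can be reproduced in outline with scikit-learn: fit a random forest and a logistic regression on a training split and score both on a held-out set with the same four metrics. The synthetic data below stands in for the KOBIO registry, which is not public, and the model hyperparameters are library defaults, not the study's.

```python
# Sketch: RF vs. logistic regression on a held-out test set, reporting the
# four metrics from the study (accuracy, ROC AUC, F1, PR AUC).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: baseline clinical features, label 1 = responder.
X, y = make_classification(n_samples=800, n_features=20,
                           weights=[0.3, 0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

for name, model in [("RF", RandomForestClassifier(random_state=0)),
                    ("LR", LogisticRegression(max_iter=1000))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]  # probability of "responder"
    print(name,
          f"acc={accuracy_score(y_te, pred):.3f}",
          f"roc_auc={roc_auc_score(y_te, prob):.3f}",
          f"f1={f1_score(y_te, pred):.3f}",
          f"pr_auc={average_precision_score(y_te, prob):.3f}")
```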


2021, Vol. 26 (6)
Author(s): Christoph Laaber, Mikael Basmaci, Pasquale Salza

Abstract Software benchmarks are only as good as the performance measurements they yield. Unstable benchmarks show high variability among repeated measurements, which causes uncertainty about the actual performance and complicates reliable change assessment. However, whether a benchmark is stable or unstable only becomes evident after it has been executed and its results are available. In this paper, we introduce a machine-learning-based approach to predict a benchmark's stability without having to execute it. Our approach relies on 58 statically computed source code features, extracted for benchmark code and code called by a benchmark, related to (1) meta information, e.g., lines of code (LOC), (2) programming language elements, e.g., conditionals or loops, and (3) potentially performance-impacting standard library calls, e.g., file and network input/output (I/O). To assess our approach's effectiveness, we perform a large-scale experiment on 4,461 Go benchmarks from 230 open-source software (OSS) projects. First, we assess the prediction performance of our machine learning models using 11 binary classification algorithms. We find that Random Forest performs best, with good prediction performance of 0.79 to 0.90 AUC and 0.43 to 0.68 MCC. Second, we perform feature importance analyses for individual features and feature categories. We find that seven features related to meta-information, slice usage, nested loops, and synchronization application programming interfaces (APIs) are individually important for good predictions, and that the combination of all features of the called source code is paramount for our model, while the combination of features of the benchmark itself is less important. Our results show that although benchmark stability is affected by more than just the source code, machine learning models can effectively predict ahead of execution whether a benchmark will be stable. This enables spending precious testing time on reliable benchmarks, supports developers in identifying unstable benchmarks during development, allows unstable benchmarks to be repeated more often, provides stability estimates in scenarios where repeated benchmark execution is infeasible or impossible, and warns developers if new benchmarks, or existing benchmarks executed in new environments, will be unstable.
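
A minimal sketch of the prediction task the paper describes, assuming a synthetic stand-in for the 58 static source-code features and the stable/unstable labels: train a Random Forest (the best-performing of the 11 algorithms), score it with AUC and MCC, and add a simple impurity-based feature-importance ranking analogous to the paper's analysis.

```python
# Sketch: predict benchmark (in)stability from static features,
# scored with the paper's two metrics, AUC and MCC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in feature matrix: one row per benchmark, 58 static features
# (LOC, loop counts, I/O calls, ...); label 1 = unstable, 0 = stable.
X, y = make_classification(n_samples=4461, n_features=58,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
print(f"AUC={roc_auc_score(y_te, prob):.3f}  "
      f"MCC={matthews_corrcoef(y_te, pred):.3f}")

# Feature-importance ranking, analogous to the paper's analysis.
top = np.argsort(clf.feature_importances_)[::-1][:7]
print("Top-7 feature indices:", top)
```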

