Comparing Thresholding with Machine Learning Classifiers for Mapping Complex Water

Small reservoirs play an important role in mining, industries, and agriculture, but storage levels or stage changes are very dynamic. Accurate and up-to-date maps of surface water storage and distribution are invaluable for informing decisions relating to water security, flood monitoring, and water resources management. Satellite remote sensing is an effective way of monitoring the dynamics of surface waterbodies over large areas. The European Space Agency (ESA) has recently launched constellations of Sentinel-1 (S1) and Sentinel-2 (S2) satellites carrying C-band synthetic aperture radar (SAR) and a multispectral imaging radiometer, respectively. The constellations improve global coverage of remotely sensed imagery and enable the development of near real-time operational products. This unprecedented data availability leads to an urgent need for the application of fully automatic, feasible, and accurate retrieval methods for mapping and monitoring waterbodies. The mapping of waterbodies can take advantage of the synthesis of SAR and multispectral remote sensing data in order to increase classification accuracy. This study compares automatic thresholding to machine learning, when applied to delineate waterbodies with diverse spectral and spatial characteristics. Automatic thresholding was applied to near-concurrent normalized difference water index (NDWI) (generated from S2 optical imagery) and VH backscatter features (generated from S1 SAR data). Machine learning was applied to a comprehensive set of features derived from S1 and S2 data. During our field surveys, we observed that the waterbodies visited had different sizes and varying levels of turbidity, sedimentation, and eutrophication. Five machine learning algorithms (MLAs), namely decision tree (DT), k-nearest neighbour (k-NN), random forest (RF), and two implementations of the support vector machine (SVM) were considered. Several experiments were carried out to better understand the complexities involved in mapping spectrally and spatially complex waterbodies. It was found that the combination of multispectral indices with SAR data is highly beneficial for classifying complex waterbodies and that the proposed thresholding approach classified waterbodies with an overall classification accuracy of 89.3%. However, the varying concentrations of suspended sediments (turbidity), dissolved particles, and aquatic plants negatively affected the classification accuracies of the proposed method, whereas the MLAs (SVM in particular) were less sensitive to such variations. The main disadvantage of using MLAs for operational waterbody mapping is the requirement for suitable training samples, representing both water and non-water land covers. The dynamic nature of reservoirs (many reservoirs are depleted at least once a year) makes the re-use of training data unfeasible. The study found that aggregating (combining) the thresholding results of two SAR and multispectral features, namely the S1 VH polarisation and the S2 NDWI, respectively, provided better overall accuracies than when thresholding was applied to any of the individual features considered. The accuracies of this dual thresholding technique were comparable to those of machine learning and may thus offer a viable solution for automatic mapping of waterbodies.

Download Full-text

Remote sensing inversion of water quality in coastal sea area based on machine learning: a case study of Shenzhen bay, China

10.5194/egusphere-egu21-1972 ◽

2021 ◽

Author(s):

Xiaotong Zhu ◽

Jinhui Jeanne Huang

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Water Quality ◽

Predictive Accuracy ◽

Water Environment ◽

Quality Parameters ◽

Machine Learning Algorithms ◽

Dynamic Monitoring ◽

Support Vector ◽

Seawater Quality

Remote sensing monitoring has the characteristics of wide monitoring range, celerity, low cost for long-term dynamic monitoring of water environment. With the flourish of artificial intelligence, machine learning has enabled remote sensing inversion of seawater quality to achieve higher prediction accuracy. However, due to the physicochemical property of the water quality parameters, the performance of algorithms differs a lot. In order to improve the predictive accuracy of seawater quality parameters, we proposed a technical framework to identify the optimal machine learning algorithms using Sentinel-2 satellite and in-situ seawater sample data. In the study, we select three algorithms, i.e. support vector regression (SVR), XGBoost and deep learning (DL), and four seawater quality parameters, i.e. dissolved oxygen (DO), total dissolved solids (TDS), turbidity(TUR) and chlorophyll-a (Chla). The results show that SVR is a more precise algorithm to inverse DO (R2 = 0.81). XGBoost has the best accuracy for Chla and Tur inversion (R2 = 0.75 and 0.78 respectively) while DL performs better in TDS (R2 =0.789). Overall, this research provides a theoretical support for high precision remote sensing inversion of offshore seawater quality parameters based on machine learning.

Download Full-text

Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms

Auditing A Journal of Practice & Theory ◽

10.2308/ajpt-50009 ◽

2011 ◽

Vol 30 (2) ◽

pp. 19-50 ◽

Cited By ~ 71

Author(s):

Johan Perols

Keyword(s):

Machine Learning ◽

Analyst Forecasts ◽

Machine Learning Algorithms ◽

Data Availability ◽

Financial Statement ◽

Support Vector ◽

Classification Algorithms ◽

Employee Productivity ◽

Financial Statement Fraud ◽

Fraud Risk

SUMMARY This study compares the performance of six popular statistical and machine learning models in detecting financial statement fraud under different assumptions of misclassification costs and ratios of fraud firms to nonfraud firms. The results show, somewhat surprisingly, that logistic regression and support vector machines perform well relative to an artificial neural network, bagging, C4.5, and stacking. The results also reveal some diversity in predictors used across the classification algorithms. Out of 42 predictors examined, only six are consistently selected and used by different classification algorithms: auditor turnover, total discretionary accruals, Big 4 auditor, accounts receivable, meeting or beating analyst forecasts, and unexpected employee productivity. These findings extend financial statement fraud research and can be used by practitioners and regulators to improve fraud risk models. Data Availability: A list of fraud companies used in this study is available from the author upon request. All other data sources are described in the text.

Download Full-text

A Hybrid Feature Selection Method for Improve the Accuracy of Medical Classification Process

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.a9624.1111121 ◽

2021 ◽

Vol 11 (1) ◽

pp. 50-55

Author(s):

Maria Mohammad Yousef ◽

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Dimensionality Reduction ◽

Classification Accuracy ◽

Fitness Function ◽

Machine Learning Algorithms ◽

Feature Subset Selection ◽

High Dimensionality ◽

Support Vector ◽

Feature Subset

Generally, medical dataset classification has become one of the biggest problems in data mining research. Every database has a given number of features but it is observed that some of these features can be redundant and can be harmful as well as disrupt the process of classification and this problem is known as a high dimensionality problem. Dimensionality reduction in data preprocessing is critical for increasing the performance of machine learning algorithms. Besides the contribution of feature subset selection in dimensionality reduction gives a significant improvement in classification accuracy. In this paper, we proposed a new hybrid feature selection approach based on (GA assisted by KNN) to deal with issues of high dimensionality in biomedical data classification. The proposed method first applies the combination between GA and KNN for feature selection to find the optimal subset of features where the classification accuracy of the k-Nearest Neighbor (kNN) method is used as the fitness function for GA. After selecting the best-suggested subset of features, Support Vector Machine (SVM) are used as the classifiers. The proposed method experiments on five medical datasets of the UCI Machine Learning Repository. It is noted that the suggested technique performs admirably on these databases, achieving higher classification accuracy while using fewer features.

Download Full-text

Comparative Analysis of Machine Learning Algorithms in Automatic Identification and Extraction of Water Boundaries

Applied Sciences ◽

10.3390/app112110062 ◽

2021 ◽

Vol 11 (21) ◽

pp. 10062

Author(s):

Aimin Li ◽

Meng Fan ◽

Guangduo Qin ◽

Youcheng Xu ◽

Hailong Wang

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Decision Tree ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Water Bodies ◽

Support Vector ◽

Landsat 8 ◽

Transfer Performance ◽

Remote Sensing Images

Monitoring open water bodies accurately is important for assessing the role of ecosystem services in the context of human survival and climate change. There are many methods available for water body extraction based on remote sensing images, such as the normalized difference water index (NDWI), modified NDWI (MNDWI), and machine learning algorithms. Based on Landsat-8 remote sensing images, this study focuses on the effects of six machine learning algorithms and three threshold methods used to extract water bodies, evaluates the transfer performance of models applied to remote sensing images in different periods, and compares the differences among these models. The results are as follows. (1) Various algorithms require different numbers of samples to reach their optimal consequence. The logistic regression algorithm requires a minimum of 110 samples. As the number of samples increases, the order of the optimal model is support vector machine, neural network, random forest, decision tree, and XGBoost. (2) The accuracy evaluation performance of each machine learning on the test set cannot represent the local area performance. (3) When these models are directly applied to remote sensing images in different periods, the AUC indicators of each machine learning algorithm for three regions all show a significant decline, with a decrease range of 0.33–66.52%, and the differences among the different algorithm performances in the three areas are obvious. Generally, the decision tree algorithm has good transfer performance among the machine learning algorithms with area under curve (AUC) indexes of 0.790, 0.518, and 0.697 in the three areas, respectively, and the average value is 0.668. The Otsu threshold algorithm is the optimal among threshold methods, with AUC indexes of 0.970, 0.617, and 0.908 in the three regions respectively and an average AUC of 0.832.

Download Full-text

crops planting area identification and analysis based on multi-source high resolution remote sensing data

10.5194/egusphere-egu2020-13227 ◽

2020 ◽

Author(s):

Lei Wang ◽

Haoran Sun ◽

Wenjun Li ◽

Liang Zhou

Keyword(s):

Remote Sensing ◽

Random Forest ◽

Classification Accuracy ◽

Kappa Coefficient ◽

Optical Data ◽

Accurate Estimation ◽

Support Vector ◽

Random Forest Method ◽

Multi Temporal ◽

Sar Data

Crop planting structure is of great significance to the quantitative management of agricultural water and the accurate estimation of crop yield. With the increasing spatial and temporal resolution of remote sensing optical and SAR(Synthetic Aperture Radar) images, &#160;efficient crop mapping in large area becomes possible and the accuracy is improved. In this study, Qingyijiang Irrigation District in southwest of China is selected for crop identification methods comparison, which has heterogeneous terrain and complex crop structure . Multi-temporal optical (Sentinel-2) and SAR (Sentinel-1) data were used to calculate NDVI and backscattering coefficient as the main classification indexes. The multi-spectral and SAR data showed significant change in different stages of the whole crop growth period and varied with different crop types. Spatial distribution and texture analysis was also made. Classification using different combinations of indexes were performed using neural network, support vector machine and random forest method. The results showed that, the use of multi-temporal optical data and SAR data in the key growing periods of main crops can both provide satisfactory classification accuracy. The overall classification accuracy was greater than 82% and Kappa coefficient was greater than 0.8. SAR data has high accuracy and much potential in rice identification. However optical data had more accuracy in upland crops classification. In addition, the classification accuracy can be effectively improved by combination of classification indexes from optical and SAR data, the overall accuracy was up to 91.47%. The random forest method was superior to the other two methods in terms of the overall accuracy and the kappa coefficient.

Download Full-text

Machine Learning-Based Slum Mapping in Support of Slum Upgrading Programs: The Case of Bandung City, Indonesia

Remote Sensing ◽

10.3390/rs10101522 ◽

2018 ◽

Vol 10 (10) ◽

pp. 1522 ◽

Cited By ~ 14

Author(s):

Gina Leonita ◽

Monika Kuffer ◽

Richard Sliuzas ◽

Claudio Persello

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Slum Upgrading ◽

Sequential Feature Selection ◽

Local Acceptance ◽

Consistent Information

The survey-based slum mapping (SBSM) program conducted by the Indonesian government to reach the national target of “cities without slums” by 2019 shows mapping inconsistencies due to several reasons, e.g., the dependency on the surveyor’s experiences and the complexity of the slum indicators set. By relying on such inconsistent maps, it will be difficult to monitor the national slum upgrading program’s progress. Remote sensing imagery combined with machine learning algorithms could support the reduction of these inconsistencies. This study evaluates the performance of two machine learning algorithms, i.e., support vector machine (SVM) and random forest (RF), for slum mapping in support of the slum mapping campaign in Bandung, Indonesia. Recognizing the complexity in differentiating slum and formal areas in Indonesia, the study used a combination of spectral, contextual, and morphological features. In addition, sequential feature selection (SFS) combined with the Hilbert–Schmidt independence criterion (HSIC) was used to select significant features for classifying slums. Overall, the highest accuracy (88.5%) was achieved by the SVM with SFS using contextual, morphological, and spectral features, which is higher than the estimated accuracy of the SBSM. To evaluate the potential of machine learning-based slum mapping (MLBSM) in support of slum upgrading programs, interviews were conducted with several local and national stakeholders. Results show that local acceptance for a remote sensing-based slum mapping approach varies among stakeholder groups. Therefore, a locally adapted framework is required to combine ground surveys with robust and consistent machine learning methods, for being able to deal with big data, and to allow the rapid extraction of consistent information on the dynamics of slums at a large scale.

Download Full-text

A Comparison of Human against Machine-Classification of Spatial Audio Scenes in Binaural Recordings of Music

Applied Sciences ◽

10.3390/app10175956 ◽

2020 ◽

Vol 10 (17) ◽

pp. 5956

Author(s):

Sławomir K. Zieliński ◽

Hyunkook Lee ◽

Paweł Antoniuk ◽

Oskar Dadan

Keyword(s):

Machine Learning ◽

Classification Accuracy ◽

Screening Test ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Support Vector ◽

Spatial Audio ◽

Extreme Gradient Boosting ◽

Music Ensemble

The purpose of this paper is to compare the performance of human listeners against the selected machine learning algorithms in the task of the classification of spatial audio scenes in binaural recordings of music under practical conditions. The three scenes were subject to classification: (1) music ensemble (a group of musical sources) located in the front, (2) music ensemble located at the back, and (3) music ensemble distributed around a listener. In the listening test, undertaken remotely over the Internet, human listeners reached the classification accuracy of 42.5%. For the listeners who passed the post-screening test, the accuracy was greater, approaching 60%. The above classification task was also undertaken automatically using four machine learning algorithms: convolutional neural network, support vector machines, extreme gradient boosting framework, and logistic regression. The machine learning algorithms substantially outperformed human listeners, with the classification accuracy reaching 84%, when tested under the binaural-room-impulse-response (BRIR) matched conditions. However, when the algorithms were tested under the BRIR mismatched scenario, the accuracy obtained by the algorithms was comparable to that exhibited by the listeners who passed the post-screening test, implying that the machine learning algorithms capability to perform in unknown electro-acoustic conditions needs to be further improved.

Download Full-text

Fusion of Multispectral Aerial Imagery and Vegetation Indices for Machine Learning-Based Ground Classification

Remote Sensing ◽

10.3390/rs13081411 ◽

2021 ◽

Vol 13 (8) ◽

pp. 1411

Author(s):

Yanchao Zhang ◽

Wen Yang ◽

Ying Sun ◽

Christine Chang ◽

Jiya Yu ◽

...

Keyword(s):

Machine Learning ◽

Classification Accuracy ◽

Vegetation Indices ◽

Machine Learning Algorithms ◽

Support Vector ◽

Multispectral Images ◽

Learning Methods ◽

Machine Learning Methods ◽

Spectral Bands ◽

Vegetation Indexes

Unmanned Aerial Vehicles (UAVs) are emerging and promising platforms for carrying different types of cameras for remote sensing. The application of multispectral vegetation indices for ground cover classification has been widely adopted and has proved its reliability. However, the fusion of spectral bands and vegetation indices for machine learning-based land surface investigation has hardly been studied. In this paper, we studied the fusion of spectral bands information from UAV multispectral images and derived vegetation indices for almond plantation classification using several machine learning methods. We acquired multispectral images over an almond plantation using a UAV. First, a multispectral orthoimage was generated from the acquired multispectral images using SfM (Structure from Motion) photogrammetry methods. Eleven types of vegetation indexes were proposed based on the multispectral orthoimage. Then, 593 data points that contained multispectral bands and vegetation indexes were randomly collected and prepared for this study. After comparing six machine learning algorithms (Support Vector Machine, K-Nearest Neighbor, Linear Discrimination Analysis, Decision Tree, Random Forest, and Gradient Boosting), we selected three (SVM, KNN, and LDA) to study the fusion of multi-spectral bands information and derived vegetation index for classification. With the vegetation indexes increased, the model classification accuracy of all three selected machine learning methods gradually increased, then dropped. Our results revealed that that: (1) spectral information from multispectral images can be used for machine learning-based ground classification, and among all methods, SVM had the best performance; (2) combination of multispectral bands and vegetation indexes can improve the classification accuracy comparing to only spectral bands among all three selected methods; (3) among all VIs, NDEGE, NDVIG, and NDVGE had consistent performance in improving classification accuracies, and others may reduce the accuracy. Machine learning methods (SVM, KNN, and LDA) can be used for classifying almond plantation using multispectral orthoimages, and fusion of multispectral bands with vegetation indexes can improve machine learning-based classification accuracy if the vegetation indexes are properly selected.

Download Full-text

Sentiment Analysis Using Hybrid Feature Selection Techniques

UHD Journal of Science and Technology ◽

10.21928/uhdjst.v4n1y2020.pp29-40 ◽

2020 ◽

Vol 4 (1) ◽

pp. 29

Author(s):

Sasan Sarbast Abdulkhaliq ◽

Aso Mohammad Darwesh

Keyword(s):

Machine Learning ◽

Social Media ◽

Feature Selection ◽

Classification Accuracy ◽

Feature Selection Method ◽

Selection Method ◽

Machine Learning Algorithms ◽

Feature Subset Selection ◽

Support Vector ◽

Feature Subset

Nowadays, people from every part of the world use social media and social networks to express their feelings toward different topics and aspects. One of the trendiest social media is Twitter, which is a microblogging website that provides a platform for its users to share their views and feelings about products, services, events, etc., in public. Which makes Twitter one of the most valuable sources for collecting and analyzing data by researchers and developers to reveal people sentiment about different topics and services, such as products of commercial companies, services, well-known people such as politicians and athletes, through classifying those sentiments into positive and negative. Classification of people sentiment could be automated through using machine learning algorithms and could be enhanced through using appropriate feature selection methods. We collected most recent tweets about (Amazon, Trump, Chelsea FC, CR7) using Twitter-Application Programming Interface and assigned sentiment score using lexicon rule-based approach, then proposed a machine learning model to improve classification accuracy through using hybrid feature selection method, namely, filter-based feature selection method Chi-square (Chi-2) plus wrapper-based binary coordinate ascent (Chi-2 + BCA) to select optimal subset of features from term frequency-inverse document frequency (TF-IDF) generated features for classification through support vector machine (SVM), and Bag of words generated features for logistic regression (LR) classifiers using different n-gram ranges. After comparing the hybrid (Chi-2+BCA) method with (Chi-2) selected features, and also with the classifiers without feature subset selection, results show that the hybrid feature selection method increases classification accuracy in all cases. The maximum attained accuracy with LR is 86.55% using (1 + 2 + 3-g) range, with SVM is 85.575% using the unigram range, both in the CR7 dataset.

Download Full-text

Performance Evaluation Of Machine Learning Techniques On Rpas Remote Sensing Images

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1197.0886s19 ◽

2019 ◽

Vol 8 (6S) ◽

pp. 1035-1039

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Random Forest ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Support Vector ◽

Learning Approaches ◽

Kappa Index ◽

Remote Sensing Images ◽

Accuracy Measure

Recent advancements in remote sensing platforms from satellites to close-range Remotely Piloted Aircraft System (RPAS), is principal to a growing demand for innovative image processing and classification tools. Where, Machine learning approaches are very prevailing group of data driven implication tools that provide a broader scope when applied to remote sensed data. In this paper, applying different machine learning approaches on the remote sensing images with open source packages in R, to find out which algorithm is more efficient for obtaining better accuracy. We carried out a rigorous comparison of four machine learning algorithms-Support vector machine, Random forest, regression tree, Classification and Naive Bayes. These algorithms are evaluated by Classification accurateness, Kappa index and curve area as accuracy metrics. Ten runs are done to obtain the variance in the results on the training set. Using k-fold cross validation the validation is carried out. This theme identifies Random forest approach as the best method based on the accuracy measure under different conditions. Random forest is used to train efficient and highly stable with respect to variations in classification representation parameter values and significantly more accurate than other machine learning approaches trailed

Download Full-text