A Scalable Machine Learning Pipeline for Paddy Rice Classification Using Multi-Temporal Sentinel Data

The demand for rice production in Asia is expected to increase by 70% in the next 30 years, which makes evident the need for a balanced productivity and effective food security management at a national and continental level. Consequently, the timely and accurate mapping of paddy rice extent and its productivity assessment is of utmost significance. In turn, this requires continuous area monitoring and large scale mapping, at the parcel level, through the processing of big satellite data of high spatial resolution. This work designs and implements a paddy rice mapping pipeline in South Korea that is based on a time-series of Sentinel-1 and Sentinel-2 data for the year of 2018. There are two challenges that we address; the first one is the ability of our model to manage big satellite data and scale for a nationwide application. The second one is the algorithm’s capacity to cope with scarce labeled data to train supervised machine learning algorithms. Specifically, we implement an approach that combines unsupervised and supervised learning. First, we generate pseudo-labels for rice classification from a single site (Seosan-Dangjin) by using a dynamic k-means clustering approach. The pseudo-labels are then used to train a Random Forest (RF) classifier that is fine-tuned to generalize in two other sites (Haenam and Cheorwon). The optimized model was then tested against 40 labeled plots, evenly distributed across the country. The paddy rice mapping pipeline is scalable as it has been deployed in a High Performance Data Analytics (HPDA) environment using distributed implementations for both k-means and RF classifiers. When tested across the country, our model provided an overall accuracy of 96.69% and a kappa coefficient 0.87. Even more, the accurate paddy rice area mapping was returned early in the year (late July), which is key for timely decision-making. Finally, the performance of the generalized paddy rice classification model, when applied in the sites of Haenam and Cheorwon, was compared to the performance of two equivalent models that were trained with locally sampled labels. The results were comparable and highlighted the success of the model’s generalization and its applicability to other regions.

Download Full-text

Distinguishing Focal Cortical Dysplasia From Glioneuronal Tumors in Patients With Epilepsy by Machine Learning

Frontiers in Neurology ◽

10.3389/fneur.2020.548305 ◽

2020 ◽

Vol 11 ◽

Author(s):

Yi Guo ◽

Yushan Liu ◽

Wenjie Ming ◽

Zhongjin Wang ◽

Junming Zhu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Focal Cortical Dysplasia ◽

Cortical Dysplasia ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Seizure Onset ◽

Glioneuronal Tumors ◽

Patients With Epilepsy

Purpose: We are aiming to build a supervised machine learning-based classifier, in order to preoperatively distinguish focal cortical dysplasia (FCD) from glioneuronal tumors (GNTs) in patients with epilepsy.Methods: This retrospective study was comprised of 96 patients who underwent epilepsy surgery, with the final neuropathologic diagnosis of either an FCD or GNTs. Seven classical machine learning algorithms (i.e., Random Forest, SVM, Decision Tree, Logistic Regression, XGBoost, LightGBM, and CatBoost) were employed and trained by our dataset to get the classification model. Ten features [i.e., Gender, Past history, Age at seizure onset, Course of disease, Seizure type, Seizure frequency, Scalp EEG biomarkers, MRI features, Lesion location, Number of antiepileptic drug (AEDs)] were analyzed in our study.Results: We enrolled 56 patients with FCD and 40 patients with GNTs, which included 29 with gangliogliomas (GGs) and 11 with dysembryoplasic neuroepithelial tumors (DNTs). Our study demonstrated that the Random Forest-based machine learning model offered the best predictive performance on distinguishing the diagnosis of FCD from GNTs, with an F1-score of 0.9180 and AUC value of 0.9340. Furthermore, the most discriminative factor between FCD and GNTs was the feature “age at seizure onset” with the Chi-square value of 1,213.0, suggesting that patients who had a younger age at seizure onset were more likely to be diagnosed as FCD.Conclusion: The Random Forest-based machine learning classifier can accurately differentiate FCD from GNTs in patients with epilepsy before surgery. This might lead to improved clinician confidence in appropriate surgical planning and treatment outcomes.

Download Full-text

A Self-Supervised Machine Learning Approach for Objective Live Cell Segmentation and Analysis

10.21203/rs.3.rs-147010/v1 ◽

2021 ◽

Author(s):

Marc Raphael ◽

Michael Robitaille ◽

Jeff Byers ◽

Joseph Christodoulides

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Label Cell ◽

Live Cell ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Cell Segmentation ◽

Data Set

Abstract Machine learning algorithms hold the promise of greatly improving live cell image analysis by way of (1) analyzing far more imagery than can be achieved by more traditional manual approaches and (2) by eliminating the subjective nature of researchers and diagnosticians selecting the cells or cell features to be included in the analyzed data set. Currently, however, even the most sophisticated model based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm’s initial training steps and then repeatedly applied to the imagery. To address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly from the live cell imagery data, thus providing objective segmentation and quantification. The approach incorporates an optical flow algorithm component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. Because it is self-trained, the software has no user-adjustable parameters and does not require curated training imagery. The algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities - fluorescence, phase contrast, differential interference contrast (DIC), transmitted light and interference reflection microscopy (IRM). The approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery’s optical modality, magnification or cell type.

Download Full-text

A Self-Supervised Machine Learning Approach for Objective Live Cell Segmentation and Analysis

10.1101/2021.01.07.425773 ◽

2021 ◽

Author(s):

Michael C. Robitaille ◽

Jeff M. Byers ◽

Joseph A. Christodoulides ◽

Marc P. Raphael

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Label Cell ◽

Live Cell ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Cell Segmentation ◽

Data Set

Machine learning algorithms hold the promise of greatly improving live cell image analysis by way of (1) analyzing far more imagery than can be achieved by more traditional manual approaches and (2) by eliminating the subjective nature of researchers and diagnosticians selecting the cells or cell features to be included in the analyzed data set. Currently, however, even the most sophisticated model based or machine learning algorithms require user supervision, meaning the subjectivity problem is not removed but rather incorporated into the algorithm's initial training steps and then repeatedly applied to the imagery. To address this roadblock, we have developed a self-supervised machine learning algorithm that recursively trains itself directly from the live cell imagery data, thus providing objective segmentation and quantification. The approach incorporates an optical flow algorithm component to self-label cell and background pixels for training, followed by the extraction of additional feature vectors for the automated generation of a cell/background classification model. Because it is self-trained, the software has no user-adjustable parameters and does not require curated training imagery. The algorithm was applied to automatically segment cells from their background for a variety of cell types and five commonly used imaging modalities - fluorescence, phase contrast, differential interference contrast (DIC), transmitted light and interference reflection microscopy (IRM). The approach is broadly applicable in that it enables completely automated cell segmentation for long-term live cell phenotyping applications, regardless of the input imagery's optical modality, magnification or cell type.

Download Full-text

Mapping paddy rice fields by applying machine learning algorithms to multi-temporal Sentinel-1A and Landsat data

International Journal of Remote Sensing ◽

10.1080/01431161.2017.1395969 ◽

2017 ◽

Vol 39 (4) ◽

pp. 1042-1067 ◽

Cited By ~ 40

Author(s):

Alex O. Onojeghuo ◽

George A. Blackburn ◽

Qunming Wang ◽

Peter M. Atkinson ◽

Daniel Kindred ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Paddy Rice ◽

Rice Fields ◽

Machine Learning Algorithms ◽

Landsat Data ◽

Multi Temporal

Download Full-text

Streamlining Quality Review of Mass Spectrometry Data in the Clinical Laboratory by Use of Machine Learning

Archives of Pathology & Laboratory Medicine ◽

10.5858/arpa.2018-0238-oa ◽

2019 ◽

Vol 143 (8) ◽

pp. 990-998 ◽

Cited By ~ 2

Author(s):

Min Yu ◽

Lindsay A. L. Bazydlo ◽

David E. Bruns ◽

James H. Harrison

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Turnaround Time ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Training Dataset ◽

Support Vector ◽

Test Dataset ◽

Manual Review

Context.— Turnaround time and productivity of clinical mass spectrometric (MS) testing are hampered by time-consuming manual review of the analytical quality of MS data before release of patient results. Objective.— To determine whether a classification model created by using standard machine learning algorithms can verify analytically acceptable MS results and thereby reduce manual review requirements. Design.— We obtained retrospective data from gas chromatography–MS analyses of 11-nor-9-carboxy-delta-9-tetrahydrocannabinol (THC-COOH) in 1267 urine samples. The data for each sample had been labeled previously as either analytically unacceptable or acceptable by manual review. The dataset was randomly split into training and test sets (848 and 419 samples, respectively), maintaining equal proportions of acceptable (90%) and unacceptable (10%) results in each set. We used stratified 10-fold cross-validation in assessing the abilities of 6 supervised machine learning algorithms to distinguish unacceptable from acceptable assay results in the training dataset. The classifier with the highest recall was used to build a final model, and its performance was evaluated against the test dataset. Results.— In comparison testing of the 6 classifiers, a model based on the Support Vector Machines algorithm yielded the highest recall and acceptable precision. After optimization, this model correctly identified all unacceptable results in the test dataset (100% recall) with a precision of 81%. Conclusions.— Automated data review identified all analytically unacceptable assays in the test dataset, while reducing the manual review requirement by about 87%. This automation strategy can focus manual review only on assays likely to be problematic, allowing improved throughput and turnaround time without reducing quality.

Download Full-text

A Support Vector Machine and Decision Tree Based Breast Cancer Prediction

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1752.029320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 2972-2976

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Decision Tree ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Misclassification Rate ◽

Support Vector

The first step in diagnosis of a breast cancer is the identification of the disease. Early detection of the breast cancer is significant to reduce the mortality rate due to breast cancer. Machine learning algorithms can be used in identification of the breast cancer. The supervised machine learning algorithms such as Support Vector Machine (SVM) and the Decision Tree are widely used in classification problems, such as the identification of breast cancer. In this study, a machine learning model is proposed by employing learning algorithms namely, the support vector machine and decision tree. The kaggle data repository consisting of 569 observations of malignant and benign observations is used to develop the proposed model. Finally, the model is evaluated using accuracy, confusion matrix precision and recall as metrics for evaluation of performance on the test set. The analysis result showed that, the support vector machine (SVM) has better accuracy and less number of misclassification rate and better precision than the decision tree algorithm. The average accuracy of the support vector machine (SVM) is 91.92 % and that of the decision tree classification model is 87.12 %.

Download Full-text

Text Polarity Detection using Multiple Supervised Machine Learning Algorithms

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c8449.019320 ◽

2020 ◽

Vol 9 (3) ◽

pp. 1612-1618

Keyword(s):

Machine Learning ◽

Social Media ◽

Sentiment Analysis ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

The Public ◽

Social Media Platforms ◽

Day By Day

Sentiment analysis is the classifying of a review, opinion or a statement into categories, which brings clarity about specific sentiments of customers or the concerned group to businesses and developers. These categorized data are very critical to the development of businesses and understanding the public opinion. The need for accurate opinion and large-scale sentiment analysis on social media platforms is growing day by day. In this paper, a number of machine learning algorithms are trained and applied on twitter datasets and their respective accuracies are determined separately on different polarities of data, thereby giving a glimpse to which algorithm works best and which works worst..

Download Full-text

High Performance Classification Model to Identify Ransomware Payments for Heterogeneous Bitcoin Networks

Electronics ◽

10.3390/electronics10172113 ◽

2021 ◽

Vol 10 (17) ◽

pp. 2113

Author(s):

Qasem Abu Al-Haija ◽

Abdulaziz A. Alsulami

Keyword(s):

Machine Learning ◽

Decision Trees ◽

High Performance ◽

High Efficiency ◽

Performance Indicator ◽

Confusion Matrix ◽

Classification Model ◽

Supervised Machine Learning ◽

Experimental Result ◽

Prediction Errors

The Bitcoin cryptocurrency is a worldwide prevalent virtualized digital currency conceptualized in 2008 as a distributed transactions system. Bitcoin transactions make use of peer-to-peer network nodes without a third-party intermediary, and the transactions can be verified by the node. Although Bitcoin networks have exhibited high efficiency in the financial transaction systems, their payment transactions are vulnerable to several ransomware attacks. For that reason, investigators have been working on developing ransomware payment identification techniques for bitcoin transactions’ networks to prevent such harmful cyberattacks. In this paper, we propose a high performance Bitcoin transaction predictive system that investigates the Bitcoin payment transactions to learn data patterns that can recognize and classify ransomware payments for heterogeneous bitcoin networks. Specifically, our system makes use of two supervised machine learning methods to learn the distinguishing patterns in Bitcoin payment transactions, namely, shallow neural networks (SNN) and optimizable decision trees (ODT). To validate the effectiveness of our solution approach, we evaluate our machine learning based predictive models on a recent Bitcoin transactions dataset in terms of classification accuracy as a key performance indicator and other key evaluation metrics such as the confusion matrix, positive predictive value, true positive rate, and the corresponding prediction errors. As a result, our superlative experimental result was registered to the model-based decision trees scoring 99.9% and 99.4% classification detection (two-class classifier) and accuracy (multiclass classifier), respectively. Hence, the obtained model accuracy results are superior as they surpassed many state-of-the-art models developed to identify ransomware payments in bitcoin transactions.

Download Full-text

Discovering News Frames

International Journal of Multimedia Data Engineering and Management ◽

10.4018/ijmdem.2016100103 ◽

2016 ◽

Vol 7 (4) ◽

pp. 45-62 ◽

Cited By ~ 2

Author(s):

Loretta H. Cheeks ◽

Tracy L. Stepien ◽

Dara M. Wald ◽

Ashraf Gaffar

Keyword(s):

Machine Learning ◽

Data Mining ◽

Large Scale ◽

Online News ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Small Scale ◽

Water Insecurity ◽

News Content ◽

Mathematical Techniques

The Internet is a major source of online news content. Current efforts to evaluate online news content including text, storyline, and sources is limited by the use of small-scale manual techniques that are time consuming and dependent on human judgments. This article explores the use of machine learning algorithms and mathematical techniques for Internet-scale data mining and semantic discovery of news content that will enable researchers to mine, analyze, and visualize large-scale datasets. This research has the potential to inform the integration and application of data mining to address real-world socio-environmental issues, including water insecurity in the Southwestern United States. This paper establishes a formal definition of framing and proposes an approach for the discovery of distinct patterns that characterize prominent frames. The authors' experimental evaluation shows the proposed process is an effective approach for advancing semi-supervised machine learning and may assist in advancing tools for making sense of unstructured text.

Download Full-text

Machine Learning-Based Student Modeling Methodology for Intelligent Tutoring Systems

Journal of Educational Computing Research ◽

10.1177/0735633120986256 ◽

2021 ◽

pp. 073563312098625

Author(s):

Chunsheng Yang ◽

Feng-Kuang Chiang ◽

Qiangqiang Cheng ◽

Jun Ji

Keyword(s):

Machine Learning ◽

High Performance ◽

Large Scale ◽

Machine Learning Algorithms ◽

Educational Systems ◽

Student Modeling ◽

Education Systems ◽

Tutoring Systems ◽

Modeling Methodology ◽

Student Models

Machine learning-based modeling technology has recently become a powerful technique and tool for developing models for explaining, predicting, and describing system/human behaviors. In developing intelligent education systems or technologies, some research has focused on applying unique machine learning algorithms to build the ad-hoc student models for specific educational systems. However, systematically developing the data-driven student models from the educational data collected over prior educational experiences remains a challenge. We proposed a systematic and comprehensive machine learning-based modeling methodology to develop high-performance predictive student models from the historical educational data to address this issue. This methodology addresses the fundamental modeling issues, from data processing, to modeling, to model deployment. The said methodology can help developing student models for intelligent educational systems. After a detailed description of the proposed machine learning-based methodology, we introduce its application to an intelligent navigation tutoring system. Using the historical data collected in intelligent navigation tutoring systems, we conduct large-scale experiments to build the student models for training systems. The preliminary results proved that the proposed methodology is useful and feasible in developing the high-performance models for various intelligent education systems.

Download Full-text