Supervised binary classification methods for strawberry ripeness discrimination from bioimpedance data

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Pietro Ibba ◽  
Christian Tronstad ◽  
Roberto Moscetti ◽  
Tanja Mimmo ◽  
Giuseppe Cantarella ◽  
...  

Abstract Strawberry is one of the most popular fruits on the market. To meet demanding consumer and market quality standards, there is a strong need for an on-site, accurate and reliable grading system during the whole harvesting process. In this work, a total of 923 strawberry fruit were measured directly on-plant at different ripening stages by means of bioimpedance data, collected at frequencies between 20 Hz and 300 kHz. The fruit batch was then split into two classes (i.e. ripe and unripe) based on surface color data. Starting from these data, six of the most commonly used supervised machine learning classification techniques, i.e. Logistic Regression (LR), Binary Decision Trees (DT), Naive Bayes Classifiers (NBC), K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Multi-Layer Perceptron Networks (MLP), were employed, optimized, tested and compared in view of their performance in predicting the strawberry fruit ripening stage. These models were trained to develop a complete feature selection and optimization pipeline, not yet available for bioimpedance data analysis of fruit. The classification results highlighted that, among all the tested methods, MLP networks performed best on the test set, with 0.72, 0.82 and 0.73 for the F1, F0.5 and F2 scores, respectively, and improved on the training results, showing good generalization and adapting well to new, previously unseen data. Consequently, MLP models trained with bioimpedance data are a promising alternative for real-time, in-field estimation of strawberry ripeness, which could help farmers and producers manage harvest timing.
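
A minimal scikit-learn sketch of the six-classifier comparison and the F1/F0.5/F2 evaluation described above. The synthetic feature matrix stands in for the bioimpedance spectra (the number of frequency features and all hyperparameters are illustrative assumptions, not the authors' tuned pipeline).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import fbeta_score

# Placeholder for the bioimpedance data: one row per fruit, one column per
# measured frequency between 20 Hz and 300 kHz; y = 1 for ripe, 0 for unripe.
X, y = make_classification(n_samples=923, n_features=12, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "DT":  DecisionTreeClassifier(),
    "NBC": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # F-beta with beta = 1, 0.5 and 2 reproduces the three scores reported above.
    print(name, {f"F{beta:g}": round(fbeta_score(y_te, pred, beta=beta), 2)
                 for beta in (1, 0.5, 2)})
```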



Author(s):  
Ruslan Babudzhan ◽  
Konstantyn Isaienkov ◽  
Danilo Krasiy ◽  
Oleksii Vodka ◽  
Ivan Zadorozhny ◽  
...  

The paper investigates the relationship between the vibration acceleration of bearings and their operational state. To determine these dependencies, a test bench was built and 112 experiments were carried out with different bearings: 100 bearings that developed an internal defect during operation and 12 bearings without a defect. From the obtained records, a dataset was formed and used to build classifiers; the dataset is freely available. A method for classifying new and used bearings was proposed, which consists of searching for dependencies and regularities in the signal using descriptive functions: statistical measures, entropy, fractal dimensions and others. In addition to processing the signal itself, the frequency domain of the bearing operation signal was also used to complement the feature space. The paper considered the possibility of generalizing the classification so it can be applied to signals that were not obtained in the course of laboratory experiments. An extraneous dataset was found in the public domain and used to determine how accurate a classifier is when trained and tested on significantly different signals. Training and validation were carried out using the bootstrapping method to eradicate the effect of randomness, given the small amount of training data available. To estimate the quality of the classifiers, the F1-measure was used as the main metric due to the imbalance of the data sets. The following supervised machine learning methods were chosen as classifier models: logistic regression, support vector machine, random forest, and K-nearest neighbors. The results are presented in the form of density distribution plots and diagrams.
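
An illustrative sketch of the feature-extraction and bootstrapped-evaluation idea: descriptive time- and frequency-domain statistics per vibration record, then F1 scores of the four classifier families over bootstrap resamples. The synthetic signals, feature list and resample count are assumptions standing in for the real accelerometer records.

```python
import numpy as np
from scipy.stats import kurtosis, skew, entropy
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.utils import resample

def describe(signal):
    """Descriptive features of one vibration record (time + frequency domain)."""
    spectrum = np.abs(np.fft.rfft(signal))
    p = spectrum / spectrum.sum()                     # normalised spectrum
    return [signal.mean(), signal.std(), kurtosis(signal), skew(signal),
            np.sqrt(np.mean(signal ** 2)),            # RMS
            entropy(p),                               # spectral entropy
            spectrum.argmax()]                        # dominant frequency bin

rng = np.random.default_rng(0)
signals = rng.normal(size=(112, 4096))                # placeholder vibration records
labels = np.array([0] * 12 + [1] * 100)               # 12 intact, 100 defective bearings
X = np.array([describe(s) for s in signals])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=0)
models = {"LR": LogisticRegression(max_iter=1000), "SVM": SVC(),
          "RF": RandomForestClassifier(), "KNN": KNeighborsClassifier()}
for name, model in models.items():
    scores = []
    for _ in range(100):                              # bootstrap resamples of the training set
        Xb, yb = resample(X_tr, y_tr, stratify=y_tr)
        scores.append(f1_score(y_te, model.fit(Xb, yb).predict(X_te)))
    print(name, round(np.mean(scores), 2), "+/-", round(np.std(scores), 2))
```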


The increased usage of the Internet and social networks has enabled people to express their views, which has generated increasing attention lately. Sentiment Analysis (SA) techniques are used to determine the polarity of information, either positive or negative, toward a given topic, including opinions. In this research, we introduce a machine learning approach based on Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF) classifiers to find and classify extreme opinions in Arabic reviews. To achieve this, a dataset of 1500 Arabic reviews was collected from the Google Play Store. In addition, a two-stage classification process was applied. In the first stage, we built a binary classifier to sort positive from negative reviews. In the second stage, we applied a binary classification mechanism based on a set of proposed rules that distinguishes extreme positive from positive reviews, and extreme negative from negative reviews. Four major experiments were conducted, with a total of 10 different sub-experiments, to fulfill the two-stage process using different cross-validation schemas and the Term Frequency-Inverse Document Frequency (TF-IDF) feature selection method. The results indicate that SVM was the best during the first-stage classification with 30% testing data, and NB was the best with 20% testing data. The second-stage results indicate that SVM scored better in identifying extreme positive reviews on the positive dataset, with an overall accuracy of 68.7%, while NB showed better accuracy in identifying extreme negative reviews on the negative dataset, with an overall accuracy of 72.8%.
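
A hedged sketch of the first-stage setup only (not the authors' exact pipeline, and the second-stage extremity rules are omitted): TF-IDF features over review text with SVM, Naive Bayes and Random Forest, compared at the two test-set sizes mentioned above. The English placeholder texts stand in for the 1500 Arabic reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder reviews standing in for the Arabic Google Play data.
reviews = ["works perfectly, love it", "crashes all the time", "best app I have used",
           "terrible, waste of money", "very smooth and useful", "ads everywhere, awful",
           "support replied quickly, great", "battery drain is unacceptable"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]                    # 1 = positive, 0 = negative

models = {"SVM": LinearSVC(), "NB": MultinomialNB(), "RF": RandomForestClassifier()}
for test_size in (0.3, 0.2):                         # the two test splits compared above
    X_tr, X_te, y_tr, y_te = train_test_split(reviews, labels, test_size=test_size,
                                              stratify=labels, random_state=0)
    for name, model in models.items():
        clf = make_pipeline(TfidfVectorizer(), model).fit(X_tr, y_tr)
        print(f"test={test_size:.0%} {name}: acc={accuracy_score(y_te, clf.predict(X_te)):.2f}")
```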


2020 ◽  
Vol 14 ◽  

Breast Cancer (BC) is among the most common and leading causes of death in women throughout the world. Recently, classification and data analysis tools have been widely used in the medical field for diagnosis, prognosis and decision making, to help lower the risk of people dying from or suffering with disease. Advanced machine learning methods have given hope to patients by helping doctors detect potentially fatal diseases such as breast cancer early, while providing accurate outcomes. However, the results depend heavily on the techniques used for feature selection and classification, which determine the strength of the machine learning model. In this paper, a performance comparison is conducted using four classifiers, Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest, on the Wisconsin Breast Cancer dataset to identify the most effective predictors. The main goal is to apply the best machine learning classification methods to predict breast cancer as benign or malignant, evaluated in terms of accuracy, F-measure, precision and recall. Experimental results show that Random Forest achieves the highest accuracy of 99.26% on this dataset and feature set, while SVM and KNN reach 97.78% and 97.04% accuracy respectively; MLP shows the lowest accuracy of 94.07%. All experiments are conducted using RStudio as the data mining platform.
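
A scikit-learn sketch of the same four-way comparison (the study itself was run in RStudio, so the split and settings here are illustrative assumptions); load_breast_cancer ships the Wisconsin Diagnostic Breast Cancer dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)            # benign vs malignant
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF":  RandomForestClassifier(),
}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model).fit(X_tr, y_tr)
    print(name)
    # Accuracy, precision, recall and F-measure per class.
    print(classification_report(y_te, clf.predict(X_te), digits=3))
```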


2019 ◽  
Vol 58 (06) ◽  
pp. 205-212
Author(s):  
Cirruse Salehnasab ◽  
Abbas Hajifathali ◽  
Farkhondeh Asadi ◽  
Elham Roshandel ◽  
Alireza Kazemi ◽  
...  

Abstract Background The acute graft-versus-host disease (aGvHD) is the most important cause of mortality in patients receiving allogeneic hematopoietic stem cell transplantation. Given that it occurs at the stage of severe tissue damage, its diagnosis is late. With the advancement of machine learning (ML), promising real-time models to predict aGvHD have emerged. Objective This article aims to synthesize the literature on ML classification algorithms for predicting aGvHD, highlighting the algorithms and important predictor variables used. Methods A systematic review of ML classification algorithms used to predict aGvHD was performed using a search of the PubMed, Embase, Web of Science, Scopus, Springer, and IEEE Xplore databases undertaken up to April 2019, based on the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement. Studies focused on using ML classification algorithms to predict aGvHD were considered. Results After applying the inclusion and exclusion criteria, 14 studies were selected for evaluation. The results of the current analysis showed that the algorithms used were Artificial Neural Network (79%), Support Vector Machine (50%), Naive Bayes (43%), k-Nearest Neighbors (29%), Regression (29%), and Decision Trees (14%). Many predictor variables were used in these studies, which we divided into more abstract categories, including biomarkers, demographics, infections, clinical variables, genes, transplants, drugs, and other variables. Conclusion Each of these ML algorithms has particular characteristics and different proposed predictors. Therefore, these ML algorithms appear to have a high potential for predicting aGvHD if the modeling process is performed correctly.


Atmosphere ◽  
2020 ◽  
Vol 11 (7) ◽  
pp. 701
Author(s):  
Bong-Chul Seo

This study describes a framework that provides qualitative weather information on winter precipitation types using a data-driven approach. The framework incorporates the data retrieved from weather radars and the numerical weather prediction (NWP) model to account for relevant precipitation microphysics. To enable multimodel-based ensemble classification, we selected six supervised machine learning models: k-nearest neighbors, logistic regression, support vector machine, decision tree, random forest, and multi-layer perceptron. Our model training and cross-validation results based on Monte Carlo Simulation (MCS) showed that all the models performed better than our baseline method, which applies two thresholds (surface temperature and atmospheric layer thickness) for binary classification (i.e., rain/snow). Among all six models, random forest presented the best classification results for the basic classes (rain, freezing rain, and snow) and the further refinement of the snow classes (light, moderate, and heavy). Our model evaluation, which uses an independent dataset not associated with model development and learning, led to classification performance consistent with that from the MCS analysis. Based on the visual inspection of the classification maps generated for an individual radar domain, we confirmed the improved classification capability of the developed models (e.g., random forest) compared to the baseline one in representing both spatial variability and continuity.
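
An illustrative sketch of the multimodel comparison under Monte Carlo-style cross-validation (repeated random splits). The features are placeholders for the radar and NWP-derived predictors, and classes 0/1/2 stand for rain, freezing rain and snow; split counts and model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Stand-in predictors for the radar/NWP features; three precipitation classes.
X, y = make_classification(n_samples=3000, n_features=8, n_informative=6,
                           n_classes=3, random_state=0)

models = {
    "KNN": KNeighborsClassifier(), "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(), "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(), "MLP": MLPClassifier(max_iter=1000),
}
mc_cv = ShuffleSplit(n_splits=20, test_size=0.3, random_state=0)   # Monte Carlo splits
for name, model in models.items():
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=mc_cv)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```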


2018 ◽  
Vol 25 (7) ◽  
pp. 855-861 ◽  
Author(s):  
Halil Kilicoglu ◽  
Graciela Rosemblat ◽  
Mario Malički ◽  
Gerben ter Riet

Abstract Objective To automatically recognize self-acknowledged limitations in clinical research publications to support efforts in improving research transparency. Methods To develop our recognition methods, we used a set of 8431 sentences from 1197 PubMed Central articles. A subset of these sentences was manually annotated for training/testing, and inter-annotator agreement was calculated. We cast the recognition problem as a binary classification task, in which we determine whether a given sentence from a publication discusses self-acknowledged limitations or not. We experimented with three methods: a rule-based approach based on document structure, supervised machine learning, and a semi-supervised method that uses self-training to expand the training set in order to improve classification performance. The machine learning algorithms used were logistic regression (LR) and support vector machines (SVM). Results Annotators had good agreement in labeling limitation sentences (Krippendorff’s α = 0.781). Of the three methods used, the rule-based method yielded the best performance with 91.5% accuracy (95% CI [90.1-92.9]), while self-training with SVM led to a small improvement over fully supervised learning (89.9%, 95% CI [88.4-91.4] vs 89.6%, 95% CI [88.1-91.1]). Conclusions The approach presented can be incorporated into the workflows of stakeholders focusing on research transparency to improve reporting of limitations in clinical studies.
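
A hedged sketch of the semi-supervised idea: TF-IDF sentence features, an SVM base classifier, and scikit-learn's SelfTrainingClassifier expanding the labelled set from unlabelled sentences (marked -1). The toy sentences are placeholders for the annotated PubMed Central data, and this is a generic self-training setup, not the authors' exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Toy sentences; 1 = self-acknowledged limitation, 0 = other, -1 = unlabelled.
sentences = [
    "Our study has several limitations.",
    "A limitation of this work is the small sample size.",
    "These findings should be interpreted with caution due to selection bias.",
    "The single-center design limits generalizability.",
    "The intervention reduced systolic blood pressure.",
    "Baseline characteristics were similar across groups.",
    "Participants were recruited from three outpatient clinics.",
    "Adverse events were recorded at each follow-up visit.",
    "Recall bias may have affected the self-reported outcomes.",
    "Follow-up visits occurred every three months.",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0, -1, -1])

X = TfidfVectorizer().fit_transform(sentences)
clf = SelfTrainingClassifier(SVC(probability=True))   # self-training around an SVM base
clf.fit(X, labels)
print(clf.predict(X[-2:]))          # predicted labels for the two unlabelled sentences
```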


2021 ◽  
Vol 10 (6) ◽  
pp. 3369-3376
Author(s):  
Saima Afrin ◽  
F. M. Javed Mehedi Shamrat ◽  
Tafsirul Islam Nibir ◽  
Mst. Fahmida Muntasim ◽  
Md. Shakil Moharram ◽  
...  

In this contemporary era, machine learning techniques are being used increasingly in medical science to detect various diseases such as liver disease (LD). Around the globe, a large number of people die because of this deadly disease. By diagnosing the disease at an early stage, early treatment can help cure the patient. In this research paper, a method is proposed to diagnose LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting and support vector machine (SVM). We also deployed a least absolute shrinkage and selection operator (LASSO) feature selection technique on the dataset to identify the attributes most highly correlated with LD. The predictions made by the algorithms with 10-fold cross-validation (CV) are evaluated in terms of accuracy, sensitivity, precision and F1-score. It is observed that the decision tree algorithm has the best performance, with accuracy, precision, sensitivity and F1-score values of 94.295%, 92%, 99% and 96% respectively, with the inclusion of LASSO. Furthermore, a comparison with recent studies is presented to demonstrate the significance of the proposed system.
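
An illustrative sketch of the pipeline shape described above: LASSO-based feature selection feeding a classifier, evaluated with 10-fold cross-validation on accuracy, precision, recall (sensitivity) and F1. The synthetic table, the LASSO alpha and the subset of classifiers shown are assumptions, not the paper's data or tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate, StratifiedKFold

# Synthetic stand-in for the liver-disease records (feature count and class
# balance are placeholders).
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           weights=[0.7], random_state=0)

models = {"Logistic regression": LogisticRegression(max_iter=1000),
          "Decision tree": DecisionTreeClassifier(),
          "Random forest": RandomForestClassifier(),
          "SVM": SVC()}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(),
                         SelectFromModel(Lasso(alpha=0.01)),   # LASSO keeps the strongest attributes
                         model)
    res = cross_validate(pipe, X, y, cv=cv,
                         scoring=["accuracy", "precision", "recall", "f1"])
    print(name, {k[5:]: round(v.mean(), 3) for k, v in res.items() if k.startswith("test_")})
```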


2018 ◽  
Vol 28 (12) ◽  
pp. 3667-3682 ◽  
Author(s):  
Theodora S Brisimi ◽  
Tingting Xu ◽  
Taiyao Wang ◽  
Wuyang Dai ◽  
Ioannis Ch Paschalidis

Objective: To derive a predictive model to identify patients likely to be hospitalized during the following year due to complications attributed to Type II diabetes. Methods: A variety of supervised machine learning classification methods were tested and a new method that discovers hidden patient clusters in the positive class (hospitalized) was developed while, at the same time, sparse linear support vector machine classifiers were derived to separate positive samples from the negative ones (non-hospitalized). The convergence of the new method was established and theoretical guarantees were proved on how the classifiers it produces generalize to a test set not seen during training. Results: The methods were tested on a large set of patients from the Boston Medical Center, the largest safety-net hospital in New England. Our new joint clustering/classification method achieves an accuracy of 89% (measured in terms of area under the ROC curve) and yields informative clusters which can help interpret the classification results, thus increasing physicians' trust in the algorithmic output and providing some guidance toward preventive measures. While it is possible to increase accuracy to 92% with other methods, this comes with increased computational cost and lack of interpretability. The analysis shows that even a modest probability of preventive actions being effective (more than 19%) suffices to generate significant hospital care savings. Conclusions: Predictive models are proposed that can help avert hospitalizations, improve health outcomes and drastically reduce hospital expenditures. The scope for savings is significant, as it has been estimated that in the USA alone about $5.8 billion is spent each year on diabetes-related hospitalizations that could be prevented.
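
A loose, simplified sketch of the cluster-then-classify idea only, not the authors' joint optimization or its convergence guarantees: positive (hospitalized) samples are grouped with k-means, one sparse (L1-penalized) linear SVM per cluster is trained against the negatives, and a sample is scored by its most confident per-cluster classifier. All data and settings are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Stand-in patient features; class 1 = hospitalized (minority class).
X, y = make_classification(n_samples=4000, n_features=30, n_informative=10,
                           weights=[0.85], random_state=0)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

k = 3
X_pos = X_tr[y_tr == 1]
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pos)
classifiers = []
for c in range(k):
    # One sparse linear SVM separating this positive cluster from all negatives.
    Xc = np.vstack([X_pos[clusters == c], X_tr[y_tr == 0]])
    yc = np.r_[np.ones((clusters == c).sum()), np.zeros((y_tr == 0).sum())]
    classifiers.append(LinearSVC(penalty="l1", dual=False, C=0.1).fit(Xc, yc))

# Score each test sample by its most confident per-cluster decision value; AUC as metric.
scores = np.max([clf.decision_function(X_te) for clf in classifiers], axis=0)
print("AUC:", round(roc_auc_score(y_te, scores), 3))
```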


2020 ◽  
Vol 7 (1) ◽  
pp. 28-32
Author(s):  
Andre Rusli ◽  
Alethea Suryadibrata ◽  
Samiaji Bintang Nusantara ◽  
Julio Christian Young

The advancement of machine learning and natural language processing techniques holds essential opportunities to improve existing software engineering activities, including requirements engineering. Instead of manually reading all submitted user feedback to understand the evolving requirements of their product, developers could use an automatic text classification program to reduce the required effort. Many supervised machine learning approaches have already been used in many fields of text classification and show promising results in terms of performance. This paper aims to implement NLP techniques for basic text preprocessing, followed by traditional (non-deep learning) machine learning classification algorithms: Logistic Regression, Decision Tree, Multinomial Naïve Bayes, K-Nearest Neighbors, Linear SVC, and Random Forest classifiers. Finally, the performance of each algorithm in classifying the feedback in our dataset into several categories is evaluated using three F1 score metrics: the macro-, micro-, and weighted-average F1 score. Results show that, in general, Logistic Regression is the most suitable classifier in most cases, followed by Linear SVC. However, the performance gap is not large, and with different configurations and requirements, other classifiers could perform equally well or even better.
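
A minimal sketch of the evaluation setup described above: TF-IDF features after basic preprocessing, the six traditional classifiers, and the three F1 averages. The toy feedback items, category labels and preprocessing choices are placeholders, not the paper's dataset or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy feedback items and categories standing in for the real dataset.
feedback = ["app crashes when I open settings", "login fails after the last update",
            "screen freezes on startup", "cannot reset my password",
            "please add a dark mode", "would love an export to pdf option",
            "offline mode would be useful", "add support for multiple accounts",
            "great app, very easy to use", "thanks for the quick fixes",
            "love the clean interface", "works flawlessly on my phone"]
labels = ["bug"] * 4 + ["feature"] * 4 + ["praise"] * 4

X_tr, X_te, y_tr, y_te = train_test_split(feedback, labels, test_size=0.25,
                                          stratify=labels, random_state=0)
models = {"LR": LogisticRegression(max_iter=1000), "DT": DecisionTreeClassifier(),
          "MNB": MultinomialNB(), "KNN": KNeighborsClassifier(n_neighbors=3),
          "LinearSVC": LinearSVC(), "RF": RandomForestClassifier()}
for name, model in models.items():
    clf = make_pipeline(TfidfVectorizer(stop_words="english"), model).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # Macro-, micro- and weighted-average F1, as in the evaluation above.
    print(name, {avg: round(f1_score(y_te, pred, average=avg), 2)
                 for avg in ("macro", "micro", "weighted")})
```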

