Development of a random forest cloud regime classification model based on surface radiation and cloud products

Journal of Applied Meteorology and Climatology ◽

10.1175/jamc-d-20-0153.1 ◽

2021 ◽

Author(s):

Joseph Sedlar ◽

Laura D. Riihimaki ◽

Kathleen Lantz ◽

David D. Turner

Keyword(s):

Random Forest ◽

Great Plains ◽

Digital Processing ◽

Classification Performance ◽

Surface Radiation ◽

Classification Model ◽

Cloud Type ◽

Model Classification ◽

Advantages And Disadvantages ◽

Cloud Regime

Various methods have been developed to characterize cloud type, otherwise referred to as cloud regime. These include manual sky observations, combining radiative and cloud vertical properties observed from satellite, surface-based remote sensing, and digital processing of sky imagers. While each methodology has inherent advantages and disadvantages, none of these cloud typing methods actually includes measurements of surface shortwave or longwave radiative fluxes. Here, a methodology that relies upon detailed, surface-based radiation and cloud measurements and derived data products to train a random forest machine learning cloud classification model is introduced. Measurements from five years of data from the ARM Southern Great Plains site were compiled to train and independently evaluate the model classification performance. A cloud type accuracy of approximately 80% using the random forest classifier reveals the model is well suited to predict climatological cloud properties. Furthermore, an analysis of the cloud type misclassifications is performed. While physical cloud types may be misreported, the shortwave radiative signatures are similar between misclassified cloud types. From this, we assert the cloud regime model has the capacity to successfully differentiate clouds with comparable cloud-radiative interactions. Therefore, we conclude the model can provide useful cloud property information for fundamental cloud studies, inform renewable energy studies, and serve as a tool for numerical model evaluation and parameterization improvement, among many other applications.


Transformer Oil Quality Assessment Using Random Forest with Feature Engineering

Energies ◽

10.3390/en14071809 ◽

2021 ◽

Vol 14 (7) ◽

pp. 1809

Author(s):

Mohammed El Amine Senoussaoui ◽

Mostefa Brahami ◽

Issouf Fofana

Keyword(s):

Machine Learning ◽

Random Forest ◽

Oil Quality ◽

Principal Component ◽

Condition Assessment ◽

Classification Performance ◽

Transformer Oil ◽

Classification Model ◽

Insulation Degradation ◽

Transformer Oils

Machine learning is widely used as a panacea in many engineering applications including the condition assessment of power transformers. Most statistics attribute the main cause of transformer failure to insulation degradation. Thus, a new, simple, and effective machine-learning approach was proposed to monitor the condition of transformer oils based on some aging indicators. The proposed approach was used to compare the performance of two machine-learning classifiers: J48 decision tree and random forest. The service-aged transformer oils were classified into four groups: the oils that can be maintained in service, the oils that should be reconditioned or filtered, the oils that should be reclaimed, and the oils that must be discarded. Of the two algorithms, random forest exhibited better performance and high accuracy with only a small amount of data. Good performance was achieved through not only the application of the proposed algorithm but also the approach to data preprocessing. Before being fed to the classification model, the available data were transformed using the simple k-means method. Subsequently, the obtained data were filtered through correlation-based feature selection (CFsSubset). The resulting features were then transformed again using principal component analysis and passed through the CFsSubset filter. The transformation and filtration of the data improved the classification performance of the adopted algorithms, especially random forest. Another advantage of the proposed method is the decrease in the number of the datasets required for the condition assessment of transformer oils, which is valuable for transformer condition monitoring.
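The correlation-based feature selection step mentioned in this abstract can be illustrated with the standard CFS merit score, which rewards feature subsets that correlate with the class while penalizing redundancy among the features. The sketch below is not the authors' code: it assumes Pearson correlation as the correlation measure, and the function names and toy columns are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cfs_merit(features, target):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    r_cf is the mean absolute feature-target correlation and r_ff is the
    mean absolute feature-feature correlation over all feature pairs.
    """
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy columns (made up): one informative indicator and one unrelated one.
target = [0, 1, 2, 3]
informative = [0, 1, 2, 3]
unrelated = [1, -1, -1, 1]

merit_single = cfs_merit([informative], target)
merit_with_noise = cfs_merit([informative, unrelated], target)
print(merit_single, merit_with_noise)
```

Adding the unrelated column lowers the merit, which is the behavior a subset filter of this kind relies on when discarding uninformative features.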


Real-Time Heart Arrhythmia Detection Using Apache Spark Structured Streaming

Journal of Healthcare Engineering ◽

10.1155/2021/6624829 ◽

2021 ◽

Vol 2021 ◽

pp. 1-13

Author(s):

Sadegh Ilbeigipour ◽

Amir Albadvi ◽

Elham Akhondzadeh Noughabi

Keyword(s):

Random Forest ◽

Real Time ◽

Cardiac Arrhythmias ◽

Performance Metrics ◽

Classification Performance ◽

Apache Spark ◽

Classification Model ◽

Arrhythmia Detection ◽

Class Labels ◽

The Impact

One of the major causes of death in the world is cardiac arrhythmias. In the field of healthcare, physicians use the patient’s electrocardiogram (ECG) records, which indicate the electrical activity of the patient’s heart, to detect arrhythmias. The problem is that the symptoms do not always appear and the physician may be mistaken in the diagnosis. Therefore, patients need continuous monitoring through real-time ECG analysis to detect arrhythmias in a timely manner and prevent an eventual incident that threatens the patient’s life. In this research, we used the Structured Streaming module built on top of the open-source Apache Spark platform for the first time to implement a machine learning pipeline for real-time cardiac arrhythmia detection and to evaluate the impact of using this new module on classification performance metrics and the rate of delay in arrhythmia detection. The ECG data were collected from the MIT/BIH database for the detection of three class labels: normal beats, RBBB, and atrial fibrillation arrhythmias. We also developed three multiclass classifiers for data classification (decision tree, random forest, and logistic regression), of which the random forest classifier showed better classification performance than the other two. The results show previous results in performance metrics of the classification model and a significant decrease in pipeline runtime by using more class labels compared to previous studies.


Machine Learning-Based Hourly Frost-Prediction System Optimized for Orchards Using Automatic Weather Station and Digital Camera Image Data

Atmosphere ◽

10.3390/atmos12070846 ◽

2021 ◽

Vol 12 (7) ◽

pp. 846

Author(s):

Ilseok Noh ◽

Hae-Won Doh ◽

Soo-Ock Kim ◽

Su-Hyun Kim ◽

Seoleun Shin ◽

...

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Digital Camera ◽

Image Data ◽

Classification Performance ◽

Classification Model ◽

Support Vector ◽

Observation Data ◽

Freezing Resistance

Spring frosts damage crops that have weakened freezing resistance after germination. We developed a machine learning (ML)-based frost-classification model and optimized it for orchard farming environments. First, logistic regression, decision tree, random forest, and support vector machine models were trained using balanced Korea Meteorological Administration (KMA) Automated Synoptic Observing System (ASOS) frost observation data for March from the last 10 years (2008–2017). Random forest and support vector machine models showed good classification performance and were selected as the main techniques, which were optimized for orchard fields based on initial frost occurrence times. The training period was then extended to March–April for 20 years (2000–2019). Finally, the model was applied to the KMA ASOS frost observation data from March to April 2020, which were not used in the previous steps, and RGB data were extracted by digital cameras installed in an orchard in Gyeonggi-do. The developed model successfully classified 117 of 139 frost observation cases from the domestic ASOS data and 35 of 37 orchard camera observations. The assumption of the initial frost occurrence time for training helped the most in improving the frost-classification model. These results clearly indicate that the frost-classification model using ML has applicable accuracy in orchard farming.
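The case counts reported in this abstract translate into fractions of correctly classified cases as follows. This is only a worked arithmetic sketch using the counts quoted above; the function name is hypothetical, and the pooled figure is a simple combination of the two test sets rather than a number stated in the abstract.

```python
def fraction_correct(correct, total):
    """Share of cases that were classified correctly."""
    return correct / total


asos = fraction_correct(117, 139)      # domestic ASOS observations
camera = fraction_correct(35, 37)      # orchard camera observations
pooled = fraction_correct(117 + 35, 139 + 37)

print(f"ASOS:   {asos:.1%}")    # 84.2%
print(f"Camera: {camera:.1%}")  # 94.6%
print(f"Pooled: {pooled:.1%}")  # 86.4%
```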


Feature-Level Fusion of Polarized SAR and Optical Images Based on Random Forest and Conditional Random Fields

Remote Sensing ◽

10.3390/rs13071323 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1323

Author(s):

Yingying Kong ◽

Biyuan Yan ◽

Yanjuan Liu ◽

Henry Leung ◽

Xiangyang Peng

Keyword(s):

Random Forest ◽

Random Fields ◽

Conditional Random Fields ◽

Classification Performance ◽

Image Features ◽

Classification Model ◽

Feature Identification ◽

Optical Images ◽

Feature Level Fusion ◽

Level Fusion

In terms of land cover classification, optical images have been proven to have good classification performance. Synthetic Aperture Radar (SAR) can operate at all times and in all weather. It has significant advantages over optical images for the recognition of some scenes, such as water bodies. One of the current challenges is how to fuse the benefits of both to obtain more powerful classification capabilities. This study proposes a classification model based on random forest with conditional random fields (CRF) for feature-level fusion classification using features extracted from polarized SAR and optical images. In this paper, feature importance is introduced as a weight in the pairwise potential function of the CRF to improve the correction rate of misclassified points. The results show that the dataset combining the two provides significant improvements in feature identification when compared to the dataset using optical or polarized SAR image features alone. Among the four classification models used, the random forest-importance_conditional random fields (RF-Im_CRF) model developed in this paper obtained the best overall accuracy (OA) and Kappa coefficient, validating the effectiveness of the method.


Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v5i3.1066 ◽

2020 ◽

pp. 235-242

Author(s):

Farrikh Alzami ◽

Erika Devi Udayanti ◽

Dwi Puji Prabowo ◽

Rama Aria Megantara

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Random Forest ◽

Sentiment Analysis ◽

Classification Performance ◽

Document Preparation ◽

Learning Models ◽

Polarity Classification ◽

Negative Sentiment ◽

Machine Learning Models

Sentiment analysis in terms of polarity classification is very important in everyday life: with polarity, people can find out whether a given document has positive or negative sentiment, which helps them choose and make decisions. Sentiment analysis is usually done manually. Therefore, an automatic sentiment analysis classification process is needed. However, it is rare to find studies that discuss which feature extraction methods and which learning models are suitable for unstructured sentiment analysis in the Amazon food review case. This research explores several feature extraction methods, such as Bag of Words, TF-IDF, Word2Vector, and a combination of TF-IDF and Word2Vector, with several machine learning models, such as Random Forest, SVM, KNN and Naïve Bayes, to find a combination of feature extraction and learning model that can help add variety to the analysis of polarity sentiments. With document preparation that handles HTML tags, punctuation, and special characters and applies snowball stemming, TF-IDF combined with SVM is suitable for polarity classification in unstructured sentiment analysis for the Amazon food review case, with a performance result of 87.3 percent.
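To make the TF-IDF weighting named in this abstract concrete, here is a minimal sketch of one common variant: term frequency normalized by document length, multiplied by an unsmoothed logarithmic inverse document frequency. The abstract does not say which variant or library the authors used, so the formula choice, the function name, and the toy reviews are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy reviews (made up), tokenized on whitespace.
reviews = [r.split() for r in ["great tasty snack", "bad stale snack", "great price"]]
weights = tf_idf(reviews)
print(weights[1])
```

In the second review, "stale" occurs in only one document and therefore receives a larger weight than "snack", which occurs in two. A term present in every document gets weight zero under this variant.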


Extraction of Arecanut Planting Distribution Based on the Feature Space Optimization of PlanetScope Imagery

Agriculture ◽

10.3390/agriculture11040371 ◽

2021 ◽

Vol 11 (4) ◽

pp. 371

Author(s):

Yu Jin ◽

Jiawei Guo ◽

Huichun Ye ◽

Jinling Zhao ◽

Wenjiang Huang ◽

...

Keyword(s):

Random Forest ◽

Satellite Imagery ◽

Feature Space ◽

Kappa Coefficient ◽

Classification Model ◽

Support Vector ◽

Textural Feature ◽

Monitoring Accuracy ◽

Areca Catechu ◽

High Level

The remote sensing extraction of large areas of arecanut (Areca catechu L.) planting plays an important role in investigating the distribution of arecanut planting area and the subsequent adjustment and optimization of regional planting structures. Satellite imagery has previously been used to investigate and monitor the agricultural and forestry vegetation in Hainan. However, the monitoring accuracy is affected by the cloudy and rainy climate of this region, as well as the high level of land fragmentation. In this paper, we used PlanetScope imagery at a 3 m spatial resolution over the Hainan arecanut planting area to investigate the high-precision extraction of the arecanut planting distribution based on feature space optimization. First, spectral and textural feature variables were selected to form the initial feature space, followed by the implementation of the random forest algorithm to optimize the feature space. Arecanut planting area extraction models based on the support vector machine (SVM), BP neural network (BPNN), and random forest (RF) classification algorithms were then constructed. The overall classification accuracies of the SVM, BPNN, and RF models optimized by the RF features were determined as 74.82%, 83.67%, and 88.30%, with Kappa coefficients of 0.680, 0.795, and 0.853, respectively. The RF model with optimized features exhibited the highest overall classification accuracy and kappa coefficient. The overall accuracy of the SVM, BPNN, and RF models following feature optimization was improved by 3.90%, 7.77%, and 7.45%, respectively, compared with the corresponding unoptimized classification model. The kappa coefficient also improved. The results demonstrate the ability of PlanetScope satellite imagery to extract the planting distribution of arecanut. Furthermore, the RF is proven to effectively optimize the initial feature space, composed of spectral and textural feature variables, further improving the extraction accuracy of the arecanut planting distribution. This work can act as a theoretical and technical reference for the agricultural and forestry industries.
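The two scores reported in this abstract, overall accuracy and the Kappa coefficient, are both derived from a confusion matrix. The sketch below shows the standard definitions (Cohen's kappa compares observed agreement with the agreement expected by chance). The matrix is a made-up two-class example, not data from the paper, and the function names are hypothetical.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (overall accuracy); p_e is the chance
    agreement computed from the row and column totals.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    p_e = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)
    ) / total ** 2
    return (p_o - p_e) / (1 - p_e)


# Made-up matrix: rows are reference classes, columns are predicted classes.
toy_cm = [
    [45, 5],   # reference "arecanut"
    [10, 40],  # reference "other"
]
oa = overall_accuracy(toy_cm)
kappa = cohen_kappa(toy_cm)
print(oa, kappa)
```

For this toy matrix the overall accuracy is 0.85 and kappa is 0.70, which illustrates why kappa is typically lower than the accuracy it accompanies: it discounts the agreement that would arise by chance.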


Integrating Genetic Algorithm with Random Forest for Improving the Classification Performance of Web Log Data

2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC) ◽

10.1109/pdgc50313.2020.9315807 ◽

2020 ◽

Author(s):

Ruchi Mittal ◽

Varun Malik ◽

Vikram Singh ◽

Jaiteg Singh ◽

Amandeep Kaur

Keyword(s):

Genetic Algorithm ◽

Random Forest ◽

Classification Performance ◽

Log Data ◽

Web Log


Improving Medication Regimen Recommendation for Parkinson’s Disease Using Sensor Technology

Sensors ◽

10.3390/s21103553 ◽

2021 ◽

Vol 21 (10) ◽

pp. 3553

Author(s):

Jeremy Watts ◽

Anahita Khojandi ◽

Rama Vasudevan ◽

Fatta B. Nahab ◽

Ritesh A. Ramdhani

Keyword(s):

Parkinson’s Disease ◽

Time Series ◽

Random Forest ◽

Treatment Planning ◽

Time Series Data ◽

Classification Model ◽

Series Data ◽

Demographic Information ◽

Subjective Data

Parkinson’s disease medication treatment planning is generally based on subjective data obtained through clinical, physician-patient interactions. The Personal KinetiGraph™ (PKG) and similar wearable sensors have shown promise in enabling objective, continuous remote health monitoring for Parkinson’s patients. In this proof-of-concept study, we propose to use objective sensor data from the PKG and apply machine learning to cluster patients based on levodopa regimens and response. The resulting clusters are then used to enhance treatment planning by providing improved initial treatment estimates to supplement a physician’s initial assessment. We apply k-means clustering to a dataset of within-subject Parkinson’s medication changes, clinically assessed by the MDS-Unified Parkinson’s Disease Rating Scale-III (MDS-UPDRS-III) and the PKG sensor for movement staging. A random forest classification model was then used to predict patients’ cluster allocation based on their respective demographic information, MDS-UPDRS-III scores, and PKG time-series data. Clinically relevant clusters were partitioned by levodopa dose, medication administration frequency, and total levodopa equivalent daily dose, with the PKG providing similar symptomatic assessments to physician MDS-UPDRS-III scores. A random forest classifier trained on demographic information, MDS-UPDRS-III scores, and PKG time-series data was able to accurately classify subjects of the two most demographically similar clusters with an accuracy of 86.9%, an F1 score of 90.7%, and an AUC of 0.871. A model that relied solely on demographic information and PKG time-series data provided the next best performance with an accuracy of 83.8%, an F1 score of 88.5%, and an AUC of 0.831, hence further enabling fully remote assessments. These computational methods demonstrate the feasibility of using sensor-based data to cluster patients based on their medication responses with further potential to assist with medication recommendations.
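The F1 score and AUC reported in this abstract can be computed from predictions as sketched below. The definitions are the standard ones (F1 as the harmonic mean of precision and recall; AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one). The counts and scores are invented for illustration and are unrelated to the study's data; the function names are hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(pos_scores, neg_scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Invented example values.
toy_f1 = f1_score(tp=8, fp=2, fn=2)
toy_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3])
print(toy_f1, toy_auc)
```

In the toy example, five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. The pairwise form is quadratic in the number of cases, which is fine for a sketch but not for large datasets.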


Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces

International Journal of Data Warehousing and Mining ◽

10.4018/jdwm.2012040103 ◽

2012 ◽

Vol 8 (2) ◽

pp. 44-63 ◽

Cited By ~ 30

Author(s):

Baoxun Xu ◽

Joshua Zhexue Huang ◽

Graham Williams ◽

Qiang Wang ◽

Yunming Ye

Keyword(s):

Random Forest ◽

High Dimensional Data ◽

Real Life ◽

Classification Performance ◽

Feature Weighting ◽

Random Forest Model ◽

High Dimensional ◽

Forest Model ◽

Forest Models ◽

Random Forest Models

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for the subspace is not suitable for high dimensional data consisting of thousands of features, because such data often contain many features which are uninformative for classification, and the random sampling often does not include informative features in the selected subspaces. Consequently, classification performance of the random forest model is significantly affected. In this paper, the authors propose an improved random forest method which uses a novel feature weighting method for subspace selection and therefore enhances classification performance over high-dimensional data. A series of experiments on 9 real life high dimensional datasets demonstrated that using a subspace size of features where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
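The contrast this abstract draws between uniform and weighted subspace selection can be sketched as follows. The abstract does not describe how the feature weights are computed, so the weights here are simply assumed to be given, and the sampling routine (weighted draws without replacement) is an illustrative stand-in rather than the authors' method. All names and numbers are hypothetical.

```python
import random


def sample_weighted_subspace(weights, size, rng):
    """Draw `size` distinct feature indices, each draw proportional to weight."""
    remaining = dict(enumerate(weights))
    chosen = []
    for _ in range(size):
        indices = list(remaining)
        pick = rng.choices(indices, weights=[remaining[i] for i in indices], k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # without replacement
    return chosen


# Hypothetical setup: 1000 features, of which only the first 5 are informative.
n_features, subspace_size, trials = 1000, 10, 200
informative = set(range(5))
weights = [1.0 if i in informative else 0.001 for i in range(n_features)]

rng = random.Random(0)
weighted_hits = sum(
    bool(informative & set(sample_weighted_subspace(weights, subspace_size, rng)))
    for _ in range(trials)
)
uniform_hits = sum(
    bool(informative & set(rng.sample(range(n_features), subspace_size)))
    for _ in range(trials)
)
print(weighted_hits, uniform_hits)
```

With uniform sampling, a subspace of 10 out of 1000 features rarely contains any of the 5 informative ones, whereas weighted sampling includes at least one in nearly every subspace. This is the failure mode of plain random subspaces on high-dimensional data that the abstract describes.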


Machine learning, transcriptome, and genotyping chip analyses provide insights into SNP markers identifying flower color in Platycodon grandiflorus

Scientific Reports ◽

10.1038/s41598-021-87281-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Go-Eun Yu ◽

Younhee Shin ◽

Sathiyamoorthy Subramaniyam ◽

Sang-Ho Kang ◽

Si-Myung Lee ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Feature Selection Method ◽

Flower Color ◽

Classification Performance ◽

Snp Markers ◽

Rna Seq ◽

Color Classification ◽

Dna Pcr ◽

Selection Of

Bellflower is an edible ornamental gardening plant in Asia. For predicting the flower color in bellflower plants, a transcriptome-wide approach based on machine learning, transcriptome, and genotyping chip analyses was used to identify SNP markers. Six machine learning methods were deployed to explore the classification potential of the selected SNPs as features in two datasets, namely training (60 RNA-Seq samples) and validation (480 Fluidigm chip samples). SNP selection was performed in sequential order. First, 96 SNPs were selected from the transcriptome-wide SNPs using principal component analysis (PCA). Then, 9 of the 96 SNPs were identified using the random forest based feature selection method from the Fluidigm chip dataset. Among the six machine learning methods, the random forest (RF) model produced higher classification performance than the other models. The 9 SNP marker candidates selected for flower color classification were verified using genomic DNA PCR with Sanger sequencing. Our results suggest that this methodology could be used for future selection of breeding traits even though the plant accessions are highly heterogeneous.
