Oropharyngeal Cancer Patient Stratification Using Random Forest Based-learning Over High-dimensional Radiomic Features

Abstract. OBJECTIVE: To improve risk prediction for oropharyngeal cancer (OPC) patients using cluster analysis on the radiomic features extracted from pre-treatment computed tomography (CT) scans. MATERIALS AND METHODS: OPC patients were classified into 2 or 3 risk groups by applying hierarchical clustering over the co-occurrence matrix obtained from a random survival forest (RSF) trained over 301 radiomic features. The cluster label was included together with other clinical data to train an ensemble model using five predictive models (Cox, random forest, RSF, logistic regression, and logistic-elastic net). Ensemble performance was evaluated over an independent test set for both recurrence-free survival (RFS) and overall survival (OS). RESULTS: The Kaplan-Meier curves for OS stratified by cluster label show significant differences for both training (p < 0.0001) and testing (p = 0.005). Including the cluster label outperforms clinical data alone, improving AUC from 0.60 to 0.76 for OS and from 0.63 to 0.75 for RFS. CONCLUSION: The extraction of a single feature, namely a cluster label, to represent the high-dimensional radiomic feature space reduces the dimensionality and sparsity of the data. Moreover, inclusion of the cluster label improves model performance compared to clinical data alone and performs comparably to the raw radiomic features.


Oropharyngeal cancer patient stratification using random forest based-learning over high-dimensional radiomic features

Scientific Reports ◽

10.1038/s41598-021-92072-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Harsh Patel ◽

David M. Vock ◽

G. Elisabeta Marai ◽

Clifton D. Fuller ◽

Abdallah S. R. Mohamed ◽

...

Keyword(s):

Random Forest ◽

Clinical Data ◽

Test Performance ◽

Oropharyngeal Cancer ◽

Model Performance ◽

Feature Space ◽

Risk Groups ◽

High Dimensional ◽

Cluster Label ◽

Occurrence Matrix

Abstract: This study aims to improve risk prediction for oropharyngeal cancer (OPC) patients using cluster analysis on the radiomic features extracted from pre-treatment computed tomography (CT) scans. A total of 553 OPC patients, randomly split into training (80%) and validation (20%) sets, were classified into 2 or 3 risk groups by applying hierarchical clustering over the co-occurrence matrix obtained from a random survival forest (RSF) trained over 301 radiomic features. The cluster label was included together with other clinical data to train an ensemble model using five predictive models (Cox, random forest, RSF, logistic regression, and logistic-elastic net). Ensemble performance was evaluated over the independent test set for both recurrence-free survival (RFS) and overall survival (OS). The Kaplan–Meier curves for OS stratified by cluster label show significant differences for both training and testing (p < 0.0001). Compared to the models trained using clinical data only, the inclusion of the cluster label improves AUC test performance from 0.62 to 0.79 for OS and from 0.66 to 0.80 for RFS. The extraction of a single feature, namely a cluster label, to represent the high-dimensional radiomic feature space reduces the dimensionality and sparsity of the data. Moreover, inclusion of the cluster label improves model performance compared to clinical data only and offers comparable performance to the models including raw radiomic features.
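
The clustering step described above can be illustrated with a minimal sketch. It assumes the forest has already been trained and that we only have, for each tree, the leaf each patient falls into; the `leaves` values below are hypothetical toy data, not output from the study's RSF. The average-linkage rule and the distance `1 - co-occurrence` are also assumptions, since the abstract does not state which linkage was used.

```python
from itertools import combinations

def co_occurrence(leaves):
    """Fraction of trees in which each pair of samples lands in the same leaf."""
    n_trees, n = len(leaves), len(leaves[0])
    counts = [[0] * n for _ in range(n)]
    for tree in leaves:
        for i in range(n):
            for j in range(n):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]

def agglomerate(co, k):
    """Average-linkage clustering on distance 1 - co-occurrence, stopped at k clusters."""
    clusters = [[i] for i in range(len(co))]

    def dist(a, b):
        return sum(1.0 - co[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the closest pair of clusters (a < b, so deleting b is safe).
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(co)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical leaf ids: 3 trees (rows) by 6 patients (columns).
leaves = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [5, 5, 5, 7, 7, 7],
]
co = co_occurrence(leaves)
labels = agglomerate(co, k=2)  # one cluster label per patient
```

The resulting `labels` list plays the role of the single cluster-label feature that is then added to the clinical data.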


Forecasting of Steam Coal Price Based on Robust Regularized Kernel Regression and Empirical Mode Decomposition

Frontiers in Energy Research ◽

10.3389/fenrg.2021.752593 ◽

2021 ◽

Vol 9 ◽

Author(s):

Xiangwan Fu ◽

Mingzhu Tang ◽

Dongqun Xu ◽

Jun Yang ◽

Donglin Chen ◽

...

Keyword(s):

Empirical Mode Decomposition ◽

Kernel Function ◽

Dimensional Space ◽

Kernel Regression ◽

Model Performance ◽

Feature Space ◽

Evaluation Index ◽

High Dimensional ◽

Polynomial Kernel ◽

Mode Decomposition

To address the difficulty of modeling nonlinear relations in the steam coal dataset, this article proposes a method for forecasting the price of steam coal based on robust regularized kernel regression and empirical mode decomposition. A robust regularized kernel regression model is constructed from a polynomial kernel function, a robust loss function, and an L2 regularization term. The polynomial kernel function does not depend on the kernel parameters and can mine the global rules in the dataset, which improves the forecasting stability of the kernel model. The polynomial kernel maps the features to a high-dimensional space, so that a nonlinear relationship in the original feature space becomes a linear one in the high-dimensional feature space, where it can be learned with a linear model. The Huber loss function is selected to reduce the influence of abnormal noise in the dataset on model performance, and the L2 regularization term is used to reduce the risk of overfitting. A combined model based on empirical mode decomposition (EMD) and the autoregressive integrated moving average (ARIMA) model compensates for the error of the robust regularized kernel regression model, making up for the limitations of a single forecasting model. Finally, the proposed model is verified on the steam coal dataset, where it achieves the best values among the compared models on evaluation indices such as root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE).
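
As a rough illustration of the three ingredients named above (polynomial kernel, Huber loss, L2 term), the sketch below evaluates a regularized objective for a kernel expansion. The function names, the `degree`, `coef0`, `delta`, and `lam` values, and the choice to penalize the squared coefficients directly are all assumptions made for this example; the abstract does not give the exact formulation, and the EMD/ARIMA error-compensation stage is not shown.

```python
def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """k(x, y) = (x . y + coef0) ** degree"""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree

def huber_loss(residual, delta=1.0):
    """Quadratic near zero, linear for large residuals (less sensitive to outliers)."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def objective(alpha, X, y, lam=0.1, delta=1.0, degree=2):
    """Huber data term plus an L2 penalty on the kernel coefficients (assumed form)."""
    total = 0.0
    for xi, yi in zip(X, y):
        # Prediction is a kernel expansion over the training points.
        pred = sum(a * polynomial_kernel(xj, xi, degree) for a, xj in zip(alpha, X))
        total += huber_loss(yi - pred, delta)
    return total + lam * sum(a * a for a in alpha)

# Toy data (hypothetical): with all coefficients at zero, predictions are zero.
X = [[1.0], [2.0]]
y = [0.5, 3.0]
value_at_zero = objective([0.0, 0.0], X, y)
```

A solver would minimize `objective` over `alpha`; any gradient-based or iteratively reweighted method could be used for that step.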


Survival Prediction in Gallbladder Cancer Using CT Based Machine Learning

Frontiers in Oncology ◽

10.3389/fonc.2020.604288 ◽

2020 ◽

Vol 10 ◽

Author(s):

Zefan Liu ◽

Guannan Zhu ◽

Xian Jiang ◽

Yunuo Zhao ◽

Hao Zeng ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Gallbladder Cancer ◽

Clinical Information ◽

Risk Groups ◽

CT Images ◽

Survival Outcomes ◽

Survival Prediction ◽

Learning Technology ◽

Pre Treatment

Objective: To establish a classifier for accurately predicting the overall survival of gallbladder cancer (GBC) patients by analyzing pre-treatment CT images using machine learning technology. Methods: This retrospective study included 141 patients with pathologically confirmed GBC. After obtaining the pre-treatment CT images, manual segmentation of the tumor lesion was performed and the LIFEx package was used to extract the tumor signature. Next, LASSO and random forest methods were used for feature optimization and modeling. Finally, the clinical information was combined with the CT features to predict the survival outcomes of GBC patients. Results: Fifteen CT features were selected through LASSO and random forest. On the basis of relative importance, GLZLM-HGZE, GLCM-homogeneity, and NGLDM-coarseness were included in the final model. The hazard ratio of the CT-based model was 1.462 (95% CI: 1.014–2.107). According to the median risk score, all patients were divided into high-risk and low-risk groups, and survival analysis showed that the high-risk group had a poorer survival outcome (P = 0.012). After inclusion of clinical factors, multivariate Cox regression was used to classify patients with GBC. The 3-year AUC values in the test set and validation set reached 0.79 and 0.73, respectively. Conclusion: GBC survival outcomes could be predicted by radiomics based on LASSO and random forest.
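
The median-based grouping mentioned in the results can be sketched as follows. The risk scores are hypothetical, and treating a score exactly equal to the median as low risk is an assumption; the abstract does not say how ties were handled.

```python
from statistics import median

def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' relative to the median risk score."""
    cutoff = median(risk_scores)
    groups = ["high" if score > cutoff else "low" for score in risk_scores]
    return cutoff, groups

# Hypothetical risk scores for five patients.
risk_scores = [0.2, 0.9, 0.5, 0.7, 1.4]
cutoff, groups = split_by_median(risk_scores)
```

The two groups would then be compared with a survival analysis such as a log-rank test, which is outside this sketch.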


Gray-Level Co-occurrence Matrix and Random Forest Based Off-line Odia Handwritten Character Recognition

Recent Patents on Engineering ◽

10.2174/1872212112666180601085544 ◽

2019 ◽

Vol 13 (2) ◽

pp. 136-141 ◽

Cited By ~ 2

Author(s):

Abhisek Sethy ◽

Prashanta Kumar Patra ◽

Deepak Ranjan Nayak

Keyword(s):

Feature Extraction ◽

Random Forest ◽

Character Recognition ◽

Recognition Rate ◽

Discrete Wavelet ◽

Gray Level ◽

Handwritten Character Recognition ◽

Handwritten Character ◽

Wide Range ◽

Occurrence Matrix

Background: In the past decades, handwritten character recognition has received considerable attention from researchers across the globe because of its wide range of applications in daily life. The literature shows that studies on handwritten Indian scripts are limited, and Odia is one such script. We revised some of the patents relating to handwritten character recognition. Methods: This paper deals with the development of an automatic recognition system for offline handwritten Odia characters. The character images are preprocessed before feature extraction. For feature extraction, the gray level co-occurrence matrix (GLCM) is first computed from all the sub-bands of the two-dimensional discrete wavelet transform (2D DWT); feature descriptors such as energy, entropy, correlation, homogeneity, and contrast are then calculated from the GLCMs and together form the primary feature vector. To further reduce the feature space and generate more relevant features, principal component analysis (PCA) is employed. Because of their several salient features, random forest (RF) and K-nearest neighbor (K-NN) have become significant choices in pattern classification tasks; both are therefore applied separately in this study to classify the character images. Results: All the experiments were performed on a system running 64-bit Windows 8 with an Intel(R) i7-4770 CPU @ 3.40 GHz. Simulations were conducted in MATLAB 2014a on a standard database, the NIT Rourkela Odia Database. Conclusion: The proposed system has been validated on a standard database. The simulation results, based on 10-fold cross-validation, demonstrate that the proposed system achieves better accuracy than the existing methods while requiring the fewest features. The recognition rates using the RF and K-NN classifiers are 94.6% and 96.4%, respectively.
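
To make the GLCM descriptors concrete, here is a minimal sketch that builds a normalised co-occurrence matrix for one pixel offset and derives four of the descriptors named above. It works directly on a tiny hypothetical image; the wavelet decomposition and PCA stages are not reproduced, correlation is omitted for brevity, and the descriptor definitions (for example energy as the sum of squared entries) are common conventions assumed here rather than taken from the paper.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Normalised co-occurrence counts of gray levels at offset (dy, dx)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                total += 1
    return [[v / total for v in row] for row in counts]

def glcm_features(p):
    """Texture descriptors computed from a normalised GLCM."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
    }

# Hypothetical 4x4 image with 4 gray levels.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(image, levels=4)
features = glcm_features(matrix)
```

In a full pipeline, one such feature dictionary per sub-band and offset would be concatenated into the primary feature vector.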


Extending Geodemographics Using Data Primitives: A Review and a Methodological Proposal

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10060386 ◽

2021 ◽

Vol 10 (6) ◽

pp. 386

Author(s):

Jennie Gray ◽

Lisa Buckner ◽

Alexis Comber

Keyword(s):

Social Dynamics ◽

Feature Space ◽

Social Processes ◽

Multiple Time ◽

Current State ◽

Cluster Label ◽

Static Data ◽

The Social ◽

Using Data

This paper reviews geodemographic classifications and developments in contemporary classifications. It develops a critique of current approaches and identifies a number of key limitations. These include the problems associated with the geodemographic cluster label (few cluster members are typical or have the same properties as the cluster centre) and the failure of the static label to describe anything about the underlying neighbourhood processes and dynamics. To address these limitations, this paper proposes a data primitives approach. Data primitives are the fundamental dimensions or measurements that capture the processes of interest. They can be used to describe the current state of an area in a multivariate feature space, and states can be compared over multiple time periods for which data are available, through for example a change vector approach. In this way, emergent social processes, which may be too weak to result in a change in a cluster label, but are nonetheless important signals, can be captured. As states are updated (for example, as new data become available), inferences about different social processes can be made, as well as classification updates if required. State changes can also be used to determine neighbourhood trajectories and to predict or infer future states. A list of data primitives is suggested from a review of the mechanisms driving a number of neighbourhood-level social processes, with the aim of improving the wider understanding of the interaction of complex neighbourhood processes and their effects. A small case study is provided to illustrate the approach. In this way, the methods outlined in this paper suggest a more nuanced approach to geodemographic research, away from a focus on classifications and static data, towards approaches that capture the social dynamics experienced by neighbourhoods.
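
One simple reading of the change vector idea is the difference between an area's state vectors at two time periods, with the vector's length giving the overall amount of change. The sketch below assumes that reading; the three-element state vectors are hypothetical and do not correspond to the data primitives proposed in the paper.

```python
import math

def change_vector(state_t0, state_t1):
    """Per-dimension change between two states and its Euclidean magnitude."""
    delta = [b - a for a, b in zip(state_t0, state_t1)]
    magnitude = math.sqrt(sum(d * d for d in delta))
    return delta, magnitude

# Hypothetical state of one neighbourhood at two time periods.
state_2011 = [2.0, 5.0, 1.0]
state_2021 = [5.0, 9.0, 1.0]
delta, magnitude = change_vector(state_2011, state_2021)
```

A small magnitude can still be recorded as a signal even when it is too weak to move the area into a different cluster.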


Extraction of Arecanut Planting Distribution Based on the Feature Space Optimization of PlanetScope Imagery

Agriculture ◽

10.3390/agriculture11040371 ◽

2021 ◽

Vol 11 (4) ◽

pp. 371

Author(s):

Yu Jin ◽

Jiawei Guo ◽

Huichun Ye ◽

Jinling Zhao ◽

Wenjiang Huang ◽

...

Keyword(s):

Random Forest ◽

Satellite Imagery ◽

Feature Space ◽

Kappa Coefficient ◽

Classification Model ◽

Support Vector ◽

Textural Feature ◽

Monitoring Accuracy ◽

Areca Catechu ◽

High Level

The remote sensing extraction of large areas of arecanut (Areca catechu L.) planting plays an important role in investigating the distribution of arecanut planting area and the subsequent adjustment and optimization of regional planting structures. Satellite imagery has previously been used to investigate and monitor the agricultural and forestry vegetation in Hainan. However, the monitoring accuracy is affected by the cloudy and rainy climate of this region, as well as the high level of land fragmentation. In this paper, we used PlanetScope imagery at a 3 m spatial resolution over the Hainan arecanut planting area to investigate the high-precision extraction of the arecanut planting distribution based on feature space optimization. First, spectral and textural feature variables were selected to form the initial feature space, followed by the implementation of the random forest algorithm to optimize the feature space. Arecanut planting area extraction models based on the support vector machine (SVM), BP neural network (BPNN), and random forest (RF) classification algorithms were then constructed. The overall classification accuracies of the SVM, BPNN, and RF models optimized by the RF features were determined as 74.82%, 83.67%, and 88.30%, with kappa coefficients of 0.680, 0.795, and 0.853, respectively. The RF model with optimized features exhibited the highest overall classification accuracy and kappa coefficient. The overall accuracy of the SVM, BPNN, and RF models following feature optimization was improved by 3.90%, 7.77%, and 7.45%, respectively, compared with the corresponding unoptimized classification model. The kappa coefficient also improved. The results demonstrate the ability of PlanetScope satellite imagery to extract the planting distribution of arecanut. Furthermore, the RF is proven to effectively optimize the initial feature space, composed of spectral and textural feature variables, further improving the extraction accuracy of the arecanut planting distribution. This work can act as a theoretical and technical reference for the agricultural and forestry industries.
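
For readers unfamiliar with the two reported metrics, the sketch below computes overall accuracy and Cohen's kappa from a confusion matrix. The 2x2 matrix is hypothetical and unrelated to the study's results; rows are taken as reference classes and columns as predicted classes.

```python
def overall_accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix."""
    n = sum(sum(row) for row in confusion)
    size = len(confusion)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / n
    # Agreement expected by chance from the row and column totals.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical counts: arecanut vs. other land cover.
confusion = [
    [45, 5],
    [10, 40],
]
accuracy, kappa = overall_accuracy_and_kappa(confusion)
```

Kappa discounts the agreement that would occur by chance, which is why it is lower than the overall accuracy here, as it is for each model in the abstract.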


Comparison of Selected Immune and Hematological Parameters and Their Impact on Survival in Patients with HPV-Related and HPV-Unrelated Oropharyngeal Cancer

Cancers ◽

10.3390/cancers13133256 ◽

2021 ◽

Vol 13 (13) ◽

pp. 3256

Author(s):

Adam Brewczyński ◽

Beata Jabłońska ◽

Agnieszka Maria Mazurek ◽

Jolanta Mrochem-Kwarciak ◽

Sławomir Mrowiec ◽

...

Keyword(s):

Prognostic Factors ◽

Oropharyngeal Cancer ◽

Hematological Parameters ◽

White Blood Cells ◽

Post Treatment ◽

Neutrophil Lymphocyte Ratio ◽

Lymphocyte Ratio ◽

HPV Status ◽

Pre Treatment ◽

The Impact

Several immune and hematological parameters are associated with survival in patients with oropharyngeal cancer (OPC). The aim of the study was to analyze selected immune and hematological parameters of patients with HPV-related (HPV+) and HPV-unrelated (HPV−) OPC, before and after radiotherapy/chemoradiotherapy (RT/CRT), and to assess the impact of these parameters on survival. One hundred twenty-seven patients with HPV+ and HPV− OPC, treated with RT alone or concurrent chemoradiotherapy (CRT), were included. Patients were divided according to HPV status. HPV etiology was confirmed from FFPE (formalin-fixed, paraffin-embedded) tissue samples and/or by determination of extracellular circulating HPV DNA. The pre-treatment and post-treatment laboratory blood parameters were compared in both groups. The neutrophil/lymphocyte ratio (NLR), platelet/lymphocyte ratio (PLR), monocyte/lymphocyte ratio (MLR), and systemic immune inflammation (SII) index were calculated. The impact of these parameters on overall survival (OS) and disease-free survival (DFS) was analyzed. In HPV+ patients, a high pre-treatment white blood cell (WBC) count (>8.33 /mm3), NLR (>2.13), and SII (>448.60) significantly correlated with reduced OS, whereas high NLR (>2.29) and SII (>462.58) significantly correlated with reduced DFS. A higher pre-treatment NLR and SII were significant poor prognostic factors for both OS and DFS in the HPV+ group. These associations were not apparent in HPV− patients. There are different pre-treatment and post-treatment immune and hematological prognostic factors for OS and DFS in HPV+ and HPV− patients. The immune ratios could be considered valuable biomarkers for risk stratification and differentiation for HPV− and HPV+ OPC patients.
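
The four indices can be computed from a blood count as sketched below, using the usual definition of SII as platelets times neutrophils divided by lymphocytes. The input counts are hypothetical, the abstract does not state units, and the cutoffs are simply the HPV+ overall-survival thresholds quoted above; this is an arithmetic illustration only, not a clinical tool.

```python
def immune_ratios(neutrophils, lymphocytes, platelets, monocytes):
    """Ratios from absolute cell counts (all counts in the same unit)."""
    return {
        "NLR": neutrophils / lymphocytes,
        "PLR": platelets / lymphocytes,
        "MLR": monocytes / lymphocytes,
        "SII": platelets * neutrophils / lymphocytes,
    }

# Cutoffs reported in the abstract for reduced OS in HPV+ patients.
OS_CUTOFFS = {"NLR": 2.13, "SII": 448.60}

def exceeds_os_cutoffs(ratios):
    """Flag which indices lie above the quoted OS cutoffs."""
    return {name: ratios[name] > cutoff for name, cutoff in OS_CUTOFFS.items()}

# Hypothetical pre-treatment counts for one patient.
ratios = immune_ratios(neutrophils=4.0, lymphocytes=2.0, platelets=250.0, monocytes=0.5)
flags = exceeds_os_cutoffs(ratios)
```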


Influence of Random Forest Hyperparameterization on Short-Term Runoff Forecasting in an Andean Mountain Catchment

Atmosphere ◽

10.3390/atmos12020238 ◽

2021 ◽

Vol 12 (2) ◽

pp. 238

Author(s):

Pablo Contreras ◽

Johanna Orellana-Alvear ◽

Paul Muñoz ◽

Jörg Bendix ◽

Rolando Célleri

Keyword(s):

Random Forest ◽

Model Performance ◽

Early Warning Systems ◽

Point Of View ◽

Lead Times ◽

Physical Parameters ◽

Runoff Forecasting ◽

Spatio Temporal ◽

Improved Model ◽

Search Approach

The Random Forest (RF) algorithm, a decision-tree-based technique, has become a promising approach for applications addressing runoff forecasting in remote areas. This machine learning approach can overcome the limitations of scarce spatio-temporal data and physical parameters needed for process-based hydrological models. However, the influence of RF hyperparameters is still uncertain and needs to be explored. Therefore, the aim of this study is to analyze the sensitivity of RF runoff forecasting models of varying lead time to the hyperparameters of the algorithm. For this, models were trained using (a) default and (b) extensive hyperparameter combinations through a grid-search approach that allows the optimal set to be reached. Model performance was assessed using the R2, %Bias, and RMSE metrics. We found that: (i) The most influential hyperparameter is the number of trees in the forest; however, the combination of the tree depth and number-of-features hyperparameters produced the highest variability and instability in the models. (ii) Hyperparameter optimization significantly improved model performance for higher lead times (12- and 24-h). For instance, the performance of the 12-h forecasting model under default RF hyperparameters improved to R2 = 0.41 after optimization (gain of 0.17). However, for short lead times (4-h) there was no significant model improvement (0.69 < R2 < 0.70). (iii) There is a range of values for each hyperparameter in which the performance of the model is not significantly affected but remains close to the optimal. Thus, a compromise between hyperparameter interactions (i.e., their values) can produce similarly high model performance. Model improvements after optimization can be explained from a hydrological point of view: the generalization ability for lead times larger than the concentration time of the catchment tends to rely more on hyperparameterization than on what the models can learn from the input data. This insight can help in the development of operational early warning systems.
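
The evaluation side of such a grid search can be sketched with plain Python. The grid values below are hypothetical (the abstract does not list the ranges searched), R2 is assumed to be the coefficient of determination, and the sign convention for %Bias (positive means overestimation) is also an assumption, since conventions differ between hydrology references. Training the forest itself is not shown.

```python
import math
from itertools import product

def rmse(obs, sim):
    """Root mean square error between observed and simulated runoff."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

def percent_bias(obs, sim):
    """Total simulated minus observed volume, as a percentage of observed."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)

def r2(obs, sim):
    """Coefficient of determination (assumed definition of R2)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical grid: number of trees, maximum depth, features per split.
grid = list(product([100, 500, 1000], [5, 10], ["sqrt", "all"]))

# Hypothetical observed and simulated runoff for one candidate model.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.5, 2.0, 2.5, 4.0]
scores = {
    "R2": r2(observed, simulated),
    "%Bias": percent_bias(observed, simulated),
    "RMSE": rmse(observed, simulated),
}
```

In a full search, each combination in `grid` would be used to train a model and the three scores would be recorded per combination and lead time.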


SAR Oil Spill Detection System through Random Forest Classifiers

Remote Sensing ◽

10.3390/rs13112044 ◽

2021 ◽

Vol 13 (11) ◽

pp. 2044

Author(s):

Marcos R. A. Conceição ◽

Luis F. F. Mendonça ◽

Carlos A. D. Lentini ◽

André T. C. Lima ◽

José M. Lopes ◽

...

Keyword(s):

Random Forest ◽

Oil Spill ◽

Oil Spills ◽

Detection System ◽

Feature Space ◽

Dark Spot ◽

SAR Image ◽

Extensive Search ◽

Gradient Based ◽

Biological Films

A set of open-source routines capable of identifying possible oil-like spills based on two random forest classifiers was developed and tested with a Sentinel-1 SAR image dataset. The first random forest model is an ocean SAR image classifier whose labeling inputs were oil spills, biological films, rain cells, low wind regions, clean sea surface, ships, and terrain. The second is a SAR image oil detector named “Radar Image Oil Spill Seeker (RIOSS)”, which classifies oil-like targets. An optimized feature space to serve as input to these classification models, in terms of both variance and computational efficiency, was developed. It involved an extensive search over 42 image attribute definitions based on their correlations and classifier-based importance estimates. These attributes included statistics, shape, fractal geometry, texture, and gradient-based attributes. Mixed adaptive thresholding was performed to calculate some of the features studied, returning consistent dark spot segmentation results. The selected attributes were also related to the physical aspects of the imaged phenomena. This process helped us apply the attributes to a random forest, increasing our algorithm’s accuracy up to 90% and its ability to generate even more reliable results.


Optimal multi-kernel local fisher discriminant analysis for feature dimensionality reduction and fault diagnosis

Proceedings of the Institution of Mechanical Engineers Part O Journal of Risk and Reliability ◽

10.1177/1748006x211009335 ◽

2021 ◽

pp. 1748006X2110093

Author(s):

Qing Zhang ◽

Heng Li ◽

Xiaolong Zhang ◽

Haifeng Wang

Keyword(s):

Fault Diagnosis ◽

Discriminant Analysis ◽

Feature Space ◽

High Dimensional ◽

Fisher Discriminant Analysis ◽

Vibration Signals ◽

Fisher Discriminant ◽

Local Fisher Discriminant Analysis ◽

Diagnosis Accuracy ◽

Diagnosis Model

To achieve better fault diagnosis accuracy from multi-domain features of vibration signals, it is important and challenging to extract the most representative and intrinsic feature components from the original high-dimensional feature space. A novel dimensionality reduction method for fault diagnosis is proposed based on local Fisher discriminant analysis (LFDA), which takes both label information and the local geometric structure of the high-dimensional features into consideration. A multi-kernel trick is introduced into the LFDA to improve its performance in dealing with the nonlinearity of mapping a high-dimensional feature space into a lower-dimensional one. To obtain optimal diagnosis accuracy from the reduced low-dimensional features, the binary particle swarm optimization (BPSO) algorithm is used to search for the most appropriate parameters of the kernels and the K-nearest neighbor (kNN) recognition model. Labeled samples are used to train the optimal multi-kernel LFDA and kNN (OMKLFDA-kNN) fault diagnosis model to obtain the optimal transformation matrix. The trained fault diagnosis model then recognizes the machinery health condition using the most representative feature space of the vibration signals. A bearing fault diagnosis experiment is conducted to verify the effectiveness of the proposed diagnostic approach. The performance is compared with that of several other methods, and the improvement in fault diagnosis offered by the proposed method is confirmed in several respects.
