Systematic Framework to Predict Early-Stage Liver Carcinoma Using Hybrid of Feature Selection Techniques and Regression Techniques

Complexity ◽  
2022 ◽  
Vol 2022 ◽  
pp. 1-11
Author(s):  
Marium Mehmood ◽  
Nasser Alshammari ◽  
Saad Awadh Alanazi ◽  
Fahad Ahmad

The liver is a vital organ of the human body, but detecting liver disease at an early stage is difficult because its symptoms often remain hidden. Liver disease may cause loss of energy or weakness once irregularities in liver function become visible. Cancer is one of the most common liver diseases and also the most fatal: uncontrolled growth of harmful cells develops inside the liver, and if diagnosed late it may cause death. Treating liver disease at an early stage is therefore an important problem, as is designing a model for early diagnosis. First, the features that play the most significant part in detecting liver cancer at an early stage must be identified, which requires extracting a small set of essential features from thousands of unwanted ones. These features are mined using data mining and soft computing techniques, which give optimized results helpful for diagnosing the disease at an early stage. We use feature selection methods to reduce the dataset's features, including Filter, Wrapper, and Embedded methods. Different regression algorithms are then applied to each of these methods to evaluate the results: Linear Regression, Ridge Regression, LASSO Regression, Support Vector Regression, Decision Tree Regression, Multilayer Perceptron Regression, and Random Forest Regression. We evaluated our results based on the accuracy and error rates generated by these algorithms. The results show that, of all the deployed regression techniques, Random Forest Regression with the Wrapper method performs best, giving the highest R2-score of 0.8923 and the lowest MSE of 0.0618.
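The wrapper-method-plus-Random-Forest pipeline described above can be sketched with scikit-learn, using recursive feature elimination as the wrapper and R2/MSE as the scores. The synthetic data, feature counts, and hyperparameters here are illustrative stand-ins, not the paper's actual dataset or settings.

```python
# Sketch: wrapper feature selection (RFE) followed by Random Forest
# Regression, scored with R2 and MSE, on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in: 30 candidate features, 8 of which carry signal.
X, y = make_regression(n_samples=500, n_features=30, n_informative=8,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrapper method: recursively drop features using the model's own ranking.
selector = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
               n_features_to_select=8)
selector.fit(X_tr, y_tr)

# Fit the final regressor on the reduced feature set and evaluate.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(selector.transform(X_tr), y_tr)
pred = model.predict(selector.transform(X_te))
print("R2 :", round(r2_score(y_te, pred), 4))
print("MSE:", round(mean_squared_error(y_te, pred), 4))
```

The same loop can be repeated with Ridge, LASSO, SVR, and the other regressors to reproduce the paper's style of comparison.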

Advanced computing techniques and their applications in other engineering disciplines have accelerated many aspects and phases of the engineering process, and many computer-aided methods are now widely used in the civil engineering domain. The mathematical relationship between the ratios of different concrete components, together with other influencing factors, and the resulting compressive strength needs to be analyzed for various engineering needs. This paper aims to develop such a mathematical relationship after analyzing the above factors and to predict the compressive strength of concrete by applying various regression techniques, namely linear regression, support vector regression, decision tree regression, and random forest regression, on an assumed data set. After applying the various regression techniques, the accuracy of random forest regression was found to be considerable.


Author(s):  
Binish Khan ◽  
Piyush Kumar Shukla ◽  
Manish Kumar Ahirwar ◽  
Manish Mishra

Liver diseases impair the normal activity of the liver. Discovering the presence of a liver disorder at an early stage is a complex task for doctors. Predictive analysis of liver disease using classification algorithms is an efficacious approach that can help doctors diagnose the disease within a short time. The main motive of this study is to analyze the parameters of various classification algorithms and compare their predictive accuracies so as to find the best classifier for determining liver disease. This chapter surveys related work by various authors on liver disease; the algorithms were implemented using the Weka tool, a machine learning toolkit written in Java, and the Orange tool was used to compare several classification algorithms in terms of accuracy. In this chapter, random forest, logistic regression, and support vector machine were evaluated with the aim of identifying the best classifier. Based on this study, random forest, with the highest accuracy, outperformed the other algorithms.
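The accuracy comparison described above can be sketched in a few lines of scikit-learn (rather than Weka/Orange) using cross-validation. The synthetic data stands in for the liver-disease records, which are not bundled here.

```python
# Sketch: comparing random forest, logistic regression and SVM by
# cross-validated accuracy on a synthetic tabular dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
}

# Mean 5-fold accuracy per classifier; the best one "wins" the comparison.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```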


2017 ◽  
Vol 14 (23) ◽  
pp. 5551-5569 ◽  
Author(s):  
Luke Gregor ◽  
Schalk Kok ◽  
Pedro M. S. Monteiro

Abstract. The Southern Ocean accounts for 40 % of oceanic CO2 uptake, but the estimates are bound by large uncertainties due to a paucity of observations. Gap-filling empirical methods have been used to good effect to approximate pCO2 from satellite-observable variables in other parts of the ocean, but many of these methods do not agree in the Southern Ocean. In this study we propose two additional methods that perform well in the Southern Ocean: support vector regression (SVR) and random forest regression (RFR). The methods are used to estimate ΔpCO2 in the Southern Ocean based on SOCAT v3, achieving trends similar to those of the SOM-FFN method by Landschützer et al. (2014). Results show that the SOM-FFN and RFR approaches have RMSEs of similar magnitude (14.84 and 16.45 µatm, where 1 atm = 101 325 Pa), while the SVR method has a larger RMSE (24.40 µatm). However, the larger errors for SVR and RFR are, in part, due to an increase in coastal observations from SOCAT v2 to v3, where the SOM-FFN method used v2 data. The success of both SOM-FFN and RFR depends on the ability to adapt to different modes of variability: the SOM-FFN achieves this by having independent regression models for each cluster, while this flexibility is intrinsic to the RFR method. Analyses of the estimates show that SVR's sensitivity to outliers and RFR's robustness to them significantly shape the outcome. Further analyses of the methods were performed using a synthetic dataset to assess the following: which method (RFR or SVR) has the best performance? What is the effect of using time, latitude, and longitude as proxy variables on the ΔpCO2 estimates? What is the impact of the sampling bias in the SOCAT v3 dataset on the estimates? We find that while RFR is indeed better than SVR, the ensemble of the two methods outperforms either one, due to the complementary strengths and weaknesses of the methods.
Results also show that for the RFR and SVR implementations, it is better to include coordinates as proxy variables as RMSE scores are lowered and the phasing of the seasonal cycle is more accurate. Lastly, we show that there is only a weak bias due to undersampling. The synthetic data provide a useful framework to test methods in regions of sparse data coverage and show potential as a useful tool to evaluate methods in future studies.
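The two-member RFR+SVR ensemble the study reports can be sketched as a simple mean of the two models' predictions. Note this toy setup (synthetic data, arbitrary hyperparameters, no feature scaling) is only an illustration of the mechanism; the paper's finding that the ensemble beats either member concerns the real SOCAT data and need not hold on this stand-in.

```python
# Sketch: averaging RFR and SVR predictions as a two-member ensemble.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

rfr = RandomForestRegressor(n_estimators=200, random_state=2).fit(X_tr, y_tr)
svr = SVR(C=100.0).fit(X_tr, y_tr)

# Ensemble prediction: unweighted mean of the two members.
pred_ens = (rfr.predict(X_te) + svr.predict(X_te)) / 2.0

rmse_rfr = float(np.sqrt(mean_squared_error(y_te, rfr.predict(X_te))))
rmse_svr = float(np.sqrt(mean_squared_error(y_te, svr.predict(X_te))))
rmse_ens = float(np.sqrt(mean_squared_error(y_te, pred_ens)))
print(f"RMSE  RFR: {rmse_rfr:.2f}  SVR: {rmse_svr:.2f}  ensemble: {rmse_ens:.2f}")
```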


2019 ◽  
Vol 11 (11) ◽  
pp. 3222 ◽  
Author(s):  
Pascal Schirmer ◽  
Iosif Mporas

In this paper, we evaluate several well-known and widely used machine learning regression algorithms for the energy disaggregation task. Specifically, the Non-Intrusive Load Monitoring approach was considered, and the K-Nearest-Neighbours, Support Vector Machines, Deep Neural Networks, and Random Forest algorithms were evaluated across five datasets using seven different sets of statistical and electrical features. The experimental results demonstrated the importance of selecting both appropriate features and regression algorithms. Analysis at the device level showed that linear devices can be disaggregated using statistical features, while for non-linear devices the use of electrical features significantly improves the disaggregation accuracy, as non-linear appliances have non-sinusoidal current draw and thus cannot be well parametrized by their active power consumption alone. The best performance in terms of energy disaggregation accuracy was achieved by the Random Forest regression algorithm.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2673 ◽  
Author(s):  
Daniel Kristiyanto ◽  
Kevin E. Anderson ◽  
Ling-Hong Hung ◽  
Ka Yee Yeung

Prostate cancer is the most common cancer among men in developed countries. Androgen deprivation therapy (ADT) is the standard treatment for prostate cancer. However, approximately one third of all patients with metastatic disease treated with ADT develop resistance to it, a condition called metastatic castrate-resistant prostate cancer (mCRPC). Patients who do not respond to hormone therapy are often treated with a chemotherapy drug called docetaxel. Sub-challenge 2 of the Prostate Cancer DREAM Challenge aims to improve the prediction of whether a patient with mCRPC will discontinue docetaxel treatment due to adverse effects. Specifically, a dataset containing three distinct clinical studies of patients with mCRPC treated with docetaxel was provided. We applied the k-nearest neighbor method for missing data imputation, the hill climbing algorithm and random forest importance for feature selection, and the random forest algorithm for classification. We also empirically studied the performance of many classification algorithms, including support vector machines and neural networks. Additionally, we found that using random forest importance for feature selection provided slightly better results than the more computationally expensive hill climbing method.
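The kNN-imputation and random-forest-importance steps of the pipeline above can be sketched with scikit-learn. The synthetic data, missingness rate, and top-k cutoff are illustrative choices, not those of the challenge submission.

```python
# Sketch: kNN imputation of missing values, then feature selection by
# random forest importance (keep the top-k ranked features).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer

X, y = make_classification(n_samples=300, n_features=12, random_state=3)

# Simulate ~10% missing entries, as in real clinical tables.
rng = np.random.default_rng(3)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Step 1: fill gaps with the mean of the 5 nearest complete neighbours.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Step 2: rank features by random forest importance and keep the top 5.
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X_imputed, y)
top_k = 5
selected = np.argsort(rf.feature_importances_)[::-1][:top_k]
print("selected feature indices:", selected.tolist())
```

A final classifier is then trained on `X_imputed[:, selected]`, mirroring the paper's imputation-selection-classification order.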


Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware transmitted over encrypted networks. Traditional malware-detection approaches such as packet content analysis are inefficient on encrypted data. In the absence of actual packet contents, we can use other features such as packet size, arrival time, source and destination addresses, and similar metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms: support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset, which is then split into training and testing sets. The machine learning algorithms are trained on the training set, and the resulting models are evaluated against the testing set to assess their respective performances. We further tune the hyperparameters of the algorithms to achieve better results. The random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, yielding area-under-the-curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems.

Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
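The train/test split and AUC evaluation described above can be sketched as follows. Synthetic features stand in for the flow-metadata dataset, and scikit-learn's GradientBoostingClassifier stands in for extreme gradient boosting (the paper's XGBoost is a separate library); both substitutions are assumptions of this sketch.

```python
# Sketch: training tree ensembles on metadata-style features and
# scoring each with ROC AUC on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for packet-metadata features (size, timing, ...).
X, y = make_classification(n_samples=800, n_features=15, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

aucs = {}
for name, model in [
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=4)),
    ("gradient boosting", GradientBoostingClassifier(random_state=4)),
]:
    model.fit(X_tr, y_tr)
    # AUC is computed from the predicted probability of the positive class.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.4f}")
```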


Author(s):  
Beaulah Jeyavathana Rajendran ◽  
Kanimozhi K. V.

Tuberculosis is one of the hazardous infectious diseases, characterized by the evolution of tubercles in the tissues. The disease mainly affects the lungs but can also involve other parts of the body, and it can be diagnosed by radiologists. The main objective of this chapter is to obtain the best solution, selected by means of modified particle swarm optimization, which is regarded as the optimal feature descriptor. Five stages are used to detect tuberculosis: pre-processing the image, segmenting the lungs, extracting features, feature selection, and classification; these are the stages used in medical image processing to identify tuberculosis. In feature extraction, the GLCM approach is used to extract features, and from the extracted feature sets the optimal features are selected by random forest. Finally, a support vector machine classifier is used for image classification. Experiments were conducted and intermediate results obtained; the proposed system's classification accuracy is better than that of the existing method.


2017 ◽  
Vol 2017 ◽  
pp. 1-8 ◽  
Author(s):  
Elhadi Adam ◽  
Houtao Deng ◽  
John Odindi ◽  
Elfatih M. Abdel-Rahman ◽  
Onisimo Mutanga

Phaeosphaeria leaf spot (PLS) is considered one of the major diseases threatening the stability of maize production in tropical and subtropical African regions. The objective of the present study was to investigate the use of hyperspectral data in detecting the early stage of PLS in tropical maize. Field data were collected from healthy leaves and leaves at the early stage of PLS over two years (2013 and 2014) using a handheld spectroradiometer. A newly developed guided regularized random forest (GRRF) and a traditional random forest (RF) were used for feature selection and classification, respectively. The 2013 dataset was used to train the model, while the 2014 dataset was used as an independent test set. Results showed statistically significant differences in biochemical concentration between healthy leaves and leaves at an early stage of PLS infestation. The newly developed GRRF was able to reduce the high dimensionality of the hyperspectral data by selecting key wavelengths with less autocorrelation, located at 420 nm, 795 nm, 779 nm, 1543 nm, 1747 nm, and 1010 nm. Using these variables (n=6), a random forest classifier was able to discriminate between healthy maize and maize at an early stage of PLS infestation with an overall accuracy of 88% and a kappa value of 0.75. Overall, our study showed the potential of hyperspectral data, GRRF feature selection, and RF classifiers for detecting the early stage of PLS infestation in tropical maize.
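The study's evaluation design (train on one season, test on an independent later season, report overall accuracy and kappa) can be sketched as below. The six "band" features are simulated here, since the GRRF-selected reflectance data are not available; the class separation in the simulation is an arbitrary assumption.

```python
# Sketch: random forest trained on one year's data and evaluated on an
# independent year's data, scored by accuracy and Cohen's kappa.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(5)

def season(n_samples):
    """Simulate n_samples leaves: class 0 = healthy, 1 = early PLS,
    with six reflectance-band features shifted by class."""
    y = rng.integers(0, 2, n_samples)
    X = rng.normal(0.0, 1.0, (n_samples, 6)) + y[:, None] * 0.8
    return X, y

X_2013, y_2013 = season(200)  # training year
X_2014, y_2014 = season(150)  # independent test year

rf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X_2013, y_2013)
pred = rf.predict(X_2014)
acc = accuracy_score(y_2014, pred)
kappa = cohen_kappa_score(y_2014, pred)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

Testing on a season the model never saw guards against the optimistic bias of a random within-season split.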

