SMO-RF: A machine learning approach by random forest for predicting class imbalancing followed by SMOTE

Author(s):  
Ankur Goyal ◽  
Likhita Rathore ◽  
Avinash Sharma
2021 ◽  
Author(s):  
Merlin James Rukshan Dennis

Distributed Denial of Service (DDoS) attacks are a serious threat on today’s Internet. As traffic across the Internet increases day by day, distinguishing legitimate from malicious traffic becomes a challenge. This thesis proposes two approaches to building an efficient DDoS attack detection system in a Software Defined Networking (SDN) environment. SDN is a recent networking approach built around a centralized, programmable controller. The central control and programming capability of the controller are used in this thesis to implement the detection and mitigation mechanisms. Two approaches are proposed for DDoS detection: a statistical approach and a machine-learning approach. The statistical approach implements entropy computation and flow statistics analysis. It uses the mean and standard deviation of destination entropy, new flow arrival rate, packets per flow and flow duration to compute various thresholds. These thresholds are then used to distinguish normal traffic from attack traffic. The machine-learning approach uses a Random Forest classifier to detect DDoS attacks. We fine-tune the Random Forest algorithm to make it more accurate in DDoS detection. In particular, we introduce weighted voting in place of the standard majority voting to improve accuracy. Our results show that the proposed machine-learning approach outperforms the statistical approach. Furthermore, it also outperforms other machine-learning approaches found in the literature.
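
The two ideas named in this abstract, destination entropy with a mean/standard-deviation threshold and weighted voting across trees, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact threshold rule or the tree weights, so the `mean - k * std` rule, the parameter `k`, and the per-tree weights below are hypothetical choices, not the thesis's method.

```python
import math
from collections import Counter
from statistics import mean, stdev


def destination_entropy(dest_ips):
    """Shannon entropy (in bits) of the destination addresses seen in one window."""
    counts = Counter(dest_ips)
    total = len(dest_ips)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_threshold(baseline_entropies, k=3.0):
    """Assumed rule: a window is suspicious if its entropy falls below mean - k * std.

    The factor k is a hypothetical tuning parameter, not a value from the thesis.
    """
    return mean(baseline_entropies) - k * stdev(baseline_entropies)


def weighted_vote(tree_predictions, tree_weights):
    """Return the label whose supporting trees carry the largest total weight.

    The weights are assumed to come from some per-tree quality score
    (for example validation accuracy); the abstract does not say which.
    """
    totals = {}
    for label, weight in zip(tree_predictions, tree_weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


# Traffic spread over many destinations has high entropy; traffic that
# converges on one victim has low entropy.
normal_window = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
attack_window = ["10.0.0.9"] * 4
threshold = entropy_threshold([2.0, 2.1, 1.9, 2.0])
is_attack = destination_entropy(attack_window) < threshold

# One strong tree can outweigh two weak ones, unlike plain majority voting.
label = weighted_vote(["attack", "normal", "normal"], [0.9, 0.3, 0.4])
```

With plain majority voting the last call would return "normal"; the weights are what shift the decision, which is the kind of effect the abstract attributes to weighted voting.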


DYNA ◽  
2020 ◽  
Vol 87 (212) ◽  
pp. 63-72
Author(s):  
Jorge Iván Pérez Rave ◽  
Favián González Echavarría ◽  
Juan Carlos Correa Morales

The objective of this work is to develop a machine learning model for online pricing of apartments in a Colombian context. This article addresses three aspects: i) it compares the predictive capacity of linear regression, regression trees, random forest and bagging; ii) it studies the effect of a group of text attributes on the predictive capability of the models; and iii) it identifies the most stable and important attributes and interprets them from an inferential perspective to better understand the object of study. The sample consists of 15,177 observations of real estate. The ensemble methods (random forest and bagging) show predictive superiority over the others. The attributes derived from the text had a significant relationship with the property price (on a log scale). However, their contribution to the predictive capacity was almost nil, since four different attributes achieved highly accurate predictions and remained stable when the sample changed.


Molecules ◽  
2019 ◽  
Vol 24 (21) ◽  
pp. 3837 ◽  
Author(s):  
Seong-Eun Park ◽  
Seung-Ho Seo ◽  
Eun-Ju Kim ◽  
Dae-Hun Park ◽  
Kyung-Mok Park ◽  
...  

The purpose of this study was to analyze metabolic differences of ginseng berries according to cultivation age and ripening stage using a gas chromatography-mass spectrometry (GC-MS)-based metabolomics method. Ginseng berries were harvested every week during five different ripening stages of three-year-old and four-year-old ginseng. Using identified metabolites, a random forest machine learning approach was applied to obtain predictive models for the classification of cultivation age or ripening stage. The principal component analysis (PCA) score plot showed a clear separation by ripening stage, indicating that continuous metabolic changes occurred until the fifth ripening stage. Three-year-old ginseng berries had higher levels of valine, glutamic acid, and tryptophan, but lower levels of lactic acid and galactose, than four-year-old ginseng berries at the fully ripened stage. The metabolic pathways affected by cultivation age were those involved in amino acid metabolism. A random forest machine learning approach extracted some important metabolites for predicting cultivation age or ripening stage with a low error rate. This study demonstrates that different cultivation ages or ripening stages of ginseng berry can be successfully discriminated using a GC-MS-based metabolomic approach together with random forest analysis.


Author(s):  
Amy Marie Campbell ◽  
Marie-Fanny Racault ◽  
Stephen Goult ◽  
Angus Laurenson

Oceanic and coastal ecosystems have undergone complex environmental changes in recent years, amid a context of climate change. These changes are also reflected in the dynamics of water-borne diseases as some of the causative agents of these illnesses are ubiquitous in the aquatic environment and their survival rates are impacted by changes in climatic conditions. Previous studies have established strong relationships between essential climate variables and the coastal distribution and seasonal dynamics of the bacteria Vibrio cholerae, pathogenic types of which are responsible for human cholera disease. In this study we provide a novel exploration of the potential of a machine learning approach to forecast environmental cholera risk in coastal India, home to more than 200 million inhabitants, utilising atmospheric, terrestrial and oceanic satellite-derived essential climate variables. A Random Forest classifier model is developed, trained and tested on a cholera outbreak dataset over the period 2010–2018 for districts along coastal India. The random forest classifier model has an Accuracy of 0.99, an F1 Score of 0.942 and a Sensitivity score of 0.895, meaning that 89.5% of outbreaks are correctly identified. Spatio-temporal patterns emerged in terms of the model’s performance based on seasons and coastal locations. Further analysis of the specific contribution of each Essential Climate Variable to the model outputs shows that chlorophyll-a concentration, sea surface salinity and land surface temperature are the strongest predictors of the cholera outbreaks in the dataset used. The study reveals promising potential of the use of random forest classifiers and remotely-sensed essential climate variables for the development of environmental cholera-risk applications. 
Further exploration of the present random forest model and associated essential climate variables is encouraged on cholera surveillance datasets in other coastal areas affected by the disease to determine the model’s transferability potential and applicative value for cholera forecasting systems.
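
The three reported scores (Accuracy, F1 Score, Sensitivity) are all derived from the same confusion matrix, and the following sketch shows how they relate. The counts in the usage lines are made up for illustration and are not the study's data. The helper `precision_from_f1` simply inverts the F1 definition; applied to the reported F1 of 0.942 and Sensitivity of 0.895 it suggests a precision of roughly 0.99, subject to rounding in the published figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, sensitivity (recall), precision and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)   # share of true outbreaks that were detected
    precision = tp / (tp + fp)     # share of predicted outbreaks that were real
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


def precision_from_f1(f1, sensitivity):
    """Solve F1 = 2PR / (P + R) for the precision P, given the recall R."""
    return f1 * sensitivity / (2 * sensitivity - f1)


# Hypothetical counts, chosen only to show the calculation.
example = classification_metrics(tp=8, fp=2, fn=2, tn=88)

# Precision implied by the F1 and Sensitivity values quoted in the abstract.
implied_precision = precision_from_f1(0.942, 0.895)
```

When outbreaks are rare, accuracy alone can look high even for a weak model, which is why the abstract's F1 and Sensitivity values are the more informative figures.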


2021 ◽  
Author(s):  
Diti Roy ◽  
Md. Ashiq Mahmood ◽  
Tamal Joyti Roy

Heart disease is the most dominant disease, causing a large number of deaths every year. A 2016 WHO report stated that at least 17 million people die of heart disease every year. This number is steadily increasing, and WHO estimated that the death toll will reach 75 million by 2030. Despite modern technology and health care systems, predicting heart disease remains difficult. Because machine learning algorithms can make predictions from available data sets, we have used a machine learning approach to predict heart disease. We collected data from the UCI repository. In our study, we used the Random Forest, Zero R, Voted Perceptron and K star classifiers. We obtained the best result with the Random Forest classifier, with an accuracy of 97.69.


2022 ◽  
Vol 21 (1) ◽  
Author(s):  
Luca Boniardi ◽  
Federica Nobile ◽  
Massimo Stafoggia ◽  
Paola Michelozzi ◽  
Carla Ancona

Background: Air pollution is one of the main concerns for the health of European citizens, and cities are currently striving to comply with EU air pollution regulation. The 2020 COVID-19 lockdown measures can be seen as an unintended but effective experiment to assess the impact of traffic restriction policies on air pollution. Our objective was to estimate the impact of the lockdown measures on NO2 concentrations and health in the two largest Italian cities. Methods: NO2 concentration datasets were built using data from a 1-month citizen science monitoring campaign that took place in Milan and Rome just before the Italian lockdown period. Annual mean NO2 concentrations were estimated for a lockdown scenario (Scenario 1) and a scenario without lockdown (Scenario 2), by applying city-specific annual adjustment factors to the 1-month data. The latter were estimated using data from Air Quality Network stations and by applying a machine learning approach. NO2 spatial distribution was estimated at a neighbourhood scale by applying Land Use Random Forest models for the two scenarios. Finally, the impact of lockdown on health was estimated as the difference between attributable deaths for Scenario 1 and those for Scenario 2, both estimated by applying a literature-based dose–response function with a counterfactual concentration of 10 μg/m3. Results: The Land Use Random Forest models were able to capture 41–42% of the total NO2 variability. Passing from Scenario 2 (annual NO2 without lockdown) to Scenario 1 (annual NO2 with lockdown), the population-weighted exposure to NO2 for Milan and Rome decreased by 15.1% and 15.3% on an annual basis. Considering the 10 μg/m3 counterfactual, prevented deaths were respectively 213 and 604. Conclusions: Our results show that the lockdown had a beneficial impact on air quality and human health. However, compliance with the current EU legal limit is not enough to avoid a high number of NO2 attributable deaths. 
This contribution reaffirms the potentiality of the citizen science approach and calls for more ambitious traffic calming policies and a re-evaluation of the legal annual limit value for NO2 for the protection of human health.
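
The population-weighted exposure figures quoted in this abstract follow a standard calculation: each neighbourhood's concentration is weighted by the number of people living there. The sketch below shows that calculation and the relative decrease between two scenarios. The neighbourhood populations and NO2 values are hypothetical placeholders, not data from the study, and the function names are illustrative only.

```python
def population_weighted_exposure(populations, concentrations):
    """Average concentration weighted by the population of each neighbourhood."""
    if len(populations) != len(concentrations):
        raise ValueError("populations and concentrations must have the same length")
    total_population = sum(populations)
    weighted_sum = sum(p * c for p, c in zip(populations, concentrations))
    return weighted_sum / total_population


def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, expressed in percent."""
    return 100.0 * (before - after) / before


# Hypothetical two-neighbourhood city (concentrations in μg/m3).
populations = [1000, 3000]
no2_without_lockdown = [40.0, 30.0]   # plays the role of Scenario 2
no2_with_lockdown = [34.0, 25.5]      # plays the role of Scenario 1

exposure_without = population_weighted_exposure(populations, no2_without_lockdown)
exposure_with = population_weighted_exposure(populations, no2_with_lockdown)
decrease = percent_decrease(exposure_without, exposure_with)
```

Weighting by population means a reduction in a densely inhabited neighbourhood moves the city-wide figure more than the same reduction in a sparsely inhabited one.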


F1000Research ◽  
2019 ◽  
Vol 7 ◽  
pp. 702
Author(s):  
Pedro Morell Miranda ◽  
Francesca Bertolini ◽  
Haja N. Kadarmideen

Background: Inflammatory bowel disease (IBD) is a group of chronic diseases related to inflammatory processes in the digestive tract generally associated with an immune response to an altered gut microbiome in genetically predisposed subjects. For years, both researchers and clinicians have been reporting increased rates of anxiety and depression disorders in IBD, and these disorders have also been linked to an altered microbiome. However, the underlying pathophysiological mechanisms of comorbidity are poorly understood at the gut microbiome level. Methods: Metagenomic and metatranscriptomic data were retrieved from the Inflammatory Bowel Disease Multi-Omics Database. Samples from 70 individuals who had answered a self-reported depression and anxiety questionnaire were selected and classified by their IBD diagnosis and their questionnaire results, creating six different groups. The cross-validation random forest algorithm was used in 90% of the individuals (training set) to retain the most important species involved in discriminating the samples without losing predictive power. The validation set that represented the remaining 10% of the samples equally distributed across the six groups was used to train a random forest using only the species selected in order to evaluate their predictive power. Results: A total of 24 species were identified as the most informative in discriminating the 6 groups. Several of these species were frequently described in dysbiosis cases, such as species from the genus Bacteroides and Faecalibacterium prausnitzii. Despite the different compositions among the groups, no common patterns were found between samples classified as depressed. However, distinct taxonomic profiles among IBD patients depending on their depression status were detected. Conclusions: The machine learning approach is promising for investigating the role of the microbiome in IBD and depression. 
Abundance and functional changes in these species suggest that depression should be considered as a factor in future research on IBD.


2019 ◽  
Vol 2019 ◽  
pp. 1-7 ◽  
Author(s):  
Su Young Jeong ◽  
Wook Kim ◽  
Byung Hyun Byun ◽  
Chang-Bae Kong ◽  
Won Seok Song ◽  
...  

Purpose. Patients with high-grade osteosarcoma undergo several chemotherapy cycles before surgical intervention. Response to chemotherapy, however, is affected by intratumor heterogeneity. In this study, we assessed the ability of a machine learning approach using baseline 18F-fluorodeoxyglucose (18F-FDG) positron emission tomography (PET) textural features to predict response to chemotherapy in osteosarcoma patients. Materials and Methods. This study included 70 osteosarcoma patients who received neoadjuvant chemotherapy. Quantitative characteristics of the tumors were evaluated by standard uptake value (SUV), total lesion glycolysis (TLG), and metabolic tumor volume (MTV). Tumor heterogeneity was evaluated using textural analysis of 18F-FDG PET scan images. Assessments were performed at baseline and after chemotherapy using 18F-FDG PET; 18F-FDG textural features were evaluated using the Chang-Gung Image Texture Analysis toolbox. To predict the chemotherapy response, several features were chosen using the principal component analysis (PCA) feature selection method. Machine learning was performed using linear support vector machine (SVM), random forest, and gradient boost methods. The ability to predict chemotherapy response was evaluated using the area under the receiver operating characteristic curve (AUC). Results. AUCs of the baseline 18F-FDG features SUVmax, TLG, MTV, 1st entropy, and gray level co-occurrence matrix entropy were 0.553, 0.538, 0.536, 0.538, and 0.543, respectively. However, AUCs of the machine learning methods linear SVM, random forest, and gradient boost were 0.72, 0.78, and 0.82, respectively. Conclusion. We found that a machine learning approach based on 18F-FDG textural features could predict the chemotherapy response using baseline PET images. This early prediction of the chemotherapy response may aid in determining treatment plans for osteosarcoma patients.
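
Since every result in this abstract is an AUC, a small sketch of what that number measures may help. The function below uses the pairwise (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder, with ties counted as one half. The labels and scores in the usage lines are invented for illustration and do not come from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 1 (positive, e.g. responder) or 0 (negative).
    scores: model outputs, where higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one responder is ranked below one non-responder.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1])
```

An AUC of 0.5 corresponds to random ranking, which puts the single-feature values near 0.54 reported above close to chance and the machine learning values of 0.72 to 0.82 clearly above it.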


2019 ◽  
Vol 11 (8) ◽  
pp. 920 ◽  
Author(s):  
Syed Haleem Shah ◽  
Yoseline Angel ◽  
Rasmus Houborg ◽  
Shawkat Ali ◽  
Matthew F. McCabe

Developing rapid and non-destructive methods for chlorophyll estimation over large spatial areas is a topic of much interest, as it would provide an indirect measure of plant photosynthetic response, be useful in monitoring soil nitrogen content, and offer the capacity to assess vegetation structural and functional dynamics. Traditional methods of direct tissue analysis or the use of handheld meters are not able to capture chlorophyll variability at anything beyond point scales, so are not particularly useful for informing decisions on plant health and status at the field scale. Examining the spectral response of plants via remote sensing has shown much promise as a means to capture variations in vegetation properties, while offering a non-destructive and scalable approach to monitoring. However, determining the optimum combination of spectra or spectral indices to inform plant response remains an active area of investigation. Here, we explore the use of a machine learning approach to enhance the estimation of leaf chlorophyll (Chlt), defined as the sum of chlorophyll a and b, from spectral reflectance data. Using an ASD FieldSpec 4 Hi-Res spectroradiometer, 2700 individual leaf hyperspectral reflectance measurements were acquired from wheat plants grown across a gradient of soil salinity and nutrient levels in a greenhouse experiment. The extractable Chlt was determined from laboratory analysis of 270 collocated samples, each composed of three leaf discs. A random forest regression algorithm was trained against these data, with input predictors based upon (1) reflectance values from 2102 bands across the 400–2500 nm spectral range; and (2) 45 established vegetation indices. As a benchmark, a standard univariate regression analysis was performed to model the relationship between measured Chlt and the selected vegetation indices. 
Results show that the root mean square error (RMSE) was significantly reduced when using the machine learning approach compared to standard linear regression. When exploiting the entire spectral range of individual bands as input variables, the random forest estimated Chlt with an RMSE of 5.49 µg·cm−2 and an R2 of 0.89. Model accuracy was improved when using vegetation indices as input variables, producing an RMSE ranging from 3.62 to 3.91 µg·cm−2, depending on the particular combination of indices selected. In further analysis, input predictors were ranked according to their importance level, and a step-wise reduction in the number of input features (from 45 down to 7) was performed. Implementing this resulted in no significant effect on the RMSE, and showed that much the same prediction accuracy could be obtained by a smaller subset of indices. Importantly, the random forest regression approach identified many important variables that were not good predictors according to their linear regression statistics. Overall, the research illustrates the promise in using established vegetation indices as input variables in a machine learning approach for the enhanced estimation of Chlt from hyperspectral data.
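
The two figures of merit used throughout this abstract, RMSE and R2, can be computed as in the sketch below. It is a minimal illustration of the definitions only: the observed and predicted values in the usage lines are made-up numbers, not measurements from the study, and R2 is computed here as the coefficient of determination (1 minus the ratio of residual to total sum of squares), which is an assumption about the convention the authors used.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical chlorophyll values (μg·cm−2), for illustration only.
observed_chl = [30.0, 40.0, 50.0, 60.0]
predicted_chl = [32.0, 38.0, 53.0, 57.0]

error = rmse(observed_chl, predicted_chl)
fit = r_squared(observed_chl, predicted_chl)
```

RMSE is reported in the units of the target, so the values quoted above can be read directly as typical errors in µg·cm−2, while R2 is unitless and describes the share of variance explained.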

