Random Forest Algorithm
Recently Published Documents

Guo Zi–chen
Wang Tao
Liu Shu–lin
Kang Wen–ping
Chen Xiang

Arthur Blanluet
Sven Gastauer
Franck Cattanéo
Chloé Goulon
David Grimardias

With a growing demand for hydroelectric energy, the number of reservoirs is increasing dramatically worldwide. These new water bodies also present an opportunity to develop fishing activities. However, reservoirs are commonly impounded over uncut forests, leaving many immersed trees. These trees hinder fish assessments by disrupting both gill netting and acoustic sampling, and they are easily confused with fish schools on echograms. To overcome this issue, we developed a method to discriminate fish schools from immersed trees. A random forest algorithm was used to classify echo-traces at 120 and 200 kHz, recorded by an EK80 (SIMRAD) in narrowband (Continuous Wave) and broadband (Frequency Modulated) modes. We obtained good discrimination between trees and schools, especially in broadband (90% correct classification). We demonstrate that it is possible to discriminate fish schools from immersed trees and thus facilitate the use of fisheries acoustics in reservoirs.

Diagnostics
2021
Vol 11 (10)
pp. 1880
Giuseppe Murdaca
Simone Caprioli
Alessandro Tonacci
Lucia Billeci
Monica Greco

Introduction: Systemic sclerosis (SSc) is a systemic immune-mediated disease, featuring fibrosis of the skin and organs, and has the greatest mortality among rheumatic diseases. Nervous system involvement has recently been demonstrated, although lung involvement is considered the leading cause of death in SSc and should therefore be diagnosed early. Pulmonary function tests are not sensitive enough to be used for screening, so they should be complemented by other clinical examinations; however, this carries a risk of overtesting, with considerable costs for the health system and an unnecessary burden for patients. In this context, Machine Learning (ML) algorithms could be a useful addition to current clinical practice and could help identify the most informative examinations to carry out for diagnosis. Method: We retrospectively collected high resolution computed tomography, pulmonary function tests, esophageal pH impedance tests, esophageal manometry and reflux disease questionnaires from 38 patients with SSc, and applied, in R, different supervised ML algorithms, including lasso, ridge, elastic net, classification and regression trees (CART) and random forest, to estimate the most important predictors of pulmonary involvement from these data. Results: In terms of performance, the random forest algorithm outperformed the other models, with an estimated root-mean-square error (RMSE) of 0.810. However, this algorithm proved computationally intensive, which leaves room for the other models when a shorter response time is needed. Conclusions: Despite the notably small sample size, which may have prevented fully reliable results, the powerful tools available for ML can be useful for predicting early lung involvement in SSc patients. The combined use of predictors from spirometry and pH impedance testing might perform best for predicting early lung involvement in SSc.

Metabolites
2021
Vol 11 (10)
pp. 697
Serafina Perrone
Simona Negro
Elisa Laschi
Marco Calderisi
Maurizio Giordano

Prematurity is a risk factor for the development of chronic adult diseases. Metabolomics can correlate biochemical changes with a given phenotype, providing real information about the state of health of a subject at that precise moment. Significant differences in the metabolomic profile of preterm newborns compared to those born at term have already been identified at birth. An observational case–control study was performed at the University Hospital of Siena. The aim was to evaluate and compare the metabolomic profiles of young adults born preterm to those born at term. Urinary samples were collected from 67 young adults (18–23 years old) born preterm (mean gestational age of 30 weeks, n = 49) or at term (mean gestational age of 38 weeks, n = 18). The urinary spectra of young adults born preterm were different from those of young adults born at term and resembled what was previously described at birth. The Random Forest algorithm gave the best classification (accuracy 82%) and indicated the following metabolites as responsible for the classification: citrate, CH2 creatinine, fumarate and hippurate. Urine spectra are promising tools for the early identification of neonates at risk of disease in adulthood and may provide insight into the pathogenesis and effects of fetal programming and infants’ outcomes.

Oyelakin A. M
Alimi O. M
Mustapha I. O
Ajiboye I. K

Phishing attacks have been used in different ways to harvest the confidential information of unsuspecting internet users. To stem the tide of phishing-based attacks, several machine learning techniques have been proposed in the past. However, fewer studies have investigated both single and ensemble machine learning-based models for the classification of phishing attacks. This study carried out a performance analysis of selected single and ensemble machine learning (ML) classifiers in phishing classification. The focus is to investigate how these algorithms behave in the classification of phishing attacks in the chosen dataset. Logistic Regression and Decision Trees were chosen as single learning classifiers, while simple voting techniques and Random Forest were used as the ensemble machine learning algorithms. Accuracy, Precision, Recall and F1-score were used as performance metrics. The Logistic Regression algorithm recorded an accuracy of 0.86, a precision of 0.89, a recall of 0.87 and an F1-score of 0.81. Similarly, the Decision Trees classifier achieved an accuracy of 0.87, a precision of 0.83, a recall of 0.88 and an F1-score of 0.81. The voting ensemble achieved an accuracy of 0.92, a precision of 0.90, a recall of 0.92 and an F1-score of 0.92. The Random Forest algorithm recorded 0.98, 0.97, 0.98 and 0.97 for accuracy, precision, recall and F1-score respectively. In the experimental analyses, the Random Forest algorithm outperformed the simple averaging classifier and the two single algorithms used for phishing URL detection. The study established that the ensemble techniques used in the experiments are more effective for phishing URL identification than the single classifiers.
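
The four metrics reported above can all be derived from the counts in a binary confusion matrix. The following is a minimal sketch in plain Python; the function name and the example counts are hypothetical and are not taken from the study, which does not state its confusion matrices or how its scores were averaged.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "phishing" treated as the positive class.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not data from the study).
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a single-class F1 always lies between those two values; scores averaged across classes need not follow that pattern.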

Sensors
2021
Vol 21 (20)
pp. 6715
Yuequn Zhang
Lei Luo
Xu Ji
Yiyang Dai

Fault detection and diagnosis (FDD) has received considerable attention with the advent of big data. Many data-driven FDD procedures have been proposed, but most of them may not be accurate when data are missing. Therefore, this paper proposes an improved random forest (RF) based on decision paths, named DPRF, which uses correction coefficients to compensate for the influence of incomplete data. In the DPRF model, intact training samples are first used to grow all the decision trees in the RF. Then, for each test sample that may contain missing values, the decision paths and the corresponding node importance scores are obtained, so that a reliability score for the sample can be inferred for each tree in the RF. The prediction of each decision tree for the sample is thus assigned a reliability score. The final prediction is obtained by majority voting, combining the individual predictions with their corresponding reliability scores. To prove the feasibility and effectiveness of the proposed method, it is tested on the Tennessee Eastman (TE) process. Compared with other FDD methods, the proposed DPRF model shows better performance on incomplete data.
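
The final voting step described above, in which each tree's prediction counts in proportion to its reliability score, can be sketched as a weighted majority vote. This is only an illustration under stated assumptions: the abstract does not give the formula for the reliability scores, so they are supplied here as ready-made inputs, and the function name, labels and scores are hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, reliabilities):
    """Return the label with the largest summed reliability.

    predictions   -- one predicted label per tree
    reliabilities -- one non-negative reliability score per tree
                     (assumed to be computed elsewhere)
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if len(predictions) != len(reliabilities):
        raise ValueError("predictions and reliabilities must have equal length")
    totals = defaultdict(float)
    for label, weight in zip(predictions, reliabilities):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical example: three trees vote "fault_1", but the trees whose
# decision paths did not depend on missing features carry more weight.
tree_predictions = ["fault_1", "normal", "normal", "fault_1", "fault_1"]
tree_reliabilities = [0.2, 0.9, 0.8, 0.3, 0.1]

weighted_result = weighted_majority_vote(tree_predictions, tree_reliabilities)
unweighted_result = weighted_majority_vote(tree_predictions, [1.0] * 5)
print(weighted_result, unweighted_result)
```

With equal weights the function reduces to ordinary majority voting, which makes the effect of the reliability scores easy to compare.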

2021
Vol 13 (20)
pp. 4033
Giang V. Nguyen
Xuan-Hien Le
Linh Nguyen Van
Sungho Jung
Minho Yeon

Precipitation is a crucial component of the water cycle and plays a key role in hydrological processes. Recently, satellite-based precipitation products (SPPs) have provided grid-based precipitation with spatiotemporal variability. However, SPPs contain considerable uncertainty in estimated precipitation, and the spatial resolution of these products is still relatively coarse. To overcome these limitations, this study aims to generate new grid-based daily precipitation by combining rainfall observation data with multiple SPPs for the period 2003–2017 across South Korea. A Random Forest (RF) machine-learning model was applied to produce a new merged precipitation product. In addition, several statistical linear merging methods were adopted for comparison with the results of the RF model. To investigate the efficiency of RF, rainfall data from 64 Automated Synoptic Observation System (ASOS) stations were collected to analyze the accuracy of the products using several continuous and categorical indicators. The precipitation values produced by the merging procedure are generally more accurate than any single satellite rainfall product, and RF is more effective than the statistical merging methods. The results of this study therefore indicate that the RF model might be applied to merge multiple satellite precipitation products, especially in regions with sparse observations.

2021
Di Fang
Hong Wang
Fanglong Meng

Energies
2021
Vol 14 (19)
pp. 6283
Mingzhu Tang
Zixin Liang
Huawei Wu
Zimin Wang

A fault diagnosis method for wind turbine gearboxes based on undersampling, XGBoost feature selection, and improved whale optimization-random forest (IWOA-RF) was proposed to address the high false negative and false positive rates in wind turbine gearbox fault diagnosis. Normal samples of the raw data were first undersampled, and the features and labels in the raw data were then subjected to XGBoost importance analysis to select the features most strongly correlated with the labels. Two parameters of the random forest algorithm were optimized via the whale optimization algorithm (WOA), using a fitness function with the false negative rate (FNR) and false positive rate (FPR) as evaluation indexes. The minimum fitness function value within the given parameter range was then found. The step size of the WOA is controlled by the hyper-parameter α. This article uses a variant of the sigmoid function to change the trend of α from a linear decline to a rapid decline followed by a slow decline, thereby improving the WOA. In the initial stage, a larger step size and step size change rate move the model toward the optimization target faster, while in the later stage of optimization, a smaller step size and step size change rate allow the model to find the minimum value of the fitness function more accurately. Finally, the two hyper-parameters corresponding to the minimum fitness function value were substituted into a random forest algorithm for model training. The results showed that the proposed method can significantly reduce the false negative and false positive rates compared with other optimization classification methods.
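
The two ingredients described above, a fitness function built from FNR and FPR and a schedule that makes α fall quickly at first and slowly later, can be sketched as follows. This is an illustration only, not the authors' formulation: the abstract does not say how FNR and FPR are combined or which sigmoid variant is used, so the unweighted sum, the function names, the steepness `k` and the starting value of 2 (the usual WOA convention) are all assumptions.

```python
import math


def fitness(tp, fp, fn, tn):
    """Assumed fitness: unweighted sum of FNR and FPR (lower is better)."""
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # missed faults
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false alarms
    return fnr + fpr


def linear_alpha(t, t_max, alpha_start=2.0):
    """Linear decline of alpha from alpha_start to 0, as in standard WOA."""
    return alpha_start * (1.0 - t / t_max)


def sigmoid_alpha(t, t_max, alpha_start=2.0, k=6.0):
    """Hypothetical sigmoid-based decline: rapid at first, then slow.

    Equals alpha_start at t = 0 and approaches (but does not reach) 0
    at t = t_max; k controls how steep the early decline is.
    """
    return 2.0 * alpha_start * (1.0 - 1.0 / (1.0 + math.exp(-k * t / t_max)))


# Hypothetical confusion counts and iteration budget for illustration.
score = fitness(tp=45, fp=10, fn=5, tn=90)
print(f"fitness = {score:.3f}")
for t in (0, 25, 50, 75, 100):
    print(t, round(linear_alpha(t, 100), 3), round(sigmoid_alpha(t, 100), 3))
```

In an optimizer loop, `fitness` would be evaluated on validation predictions of a random forest trained with each candidate pair of hyper-parameters, and the pair with the smallest value would be kept.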

Information
2021
Vol 12 (10)
pp. 405
Mike Nkongolo
Jacobus Philippus van Deventer
Sydney Mambwe Kasongo

This research introduces the production methodology of an anomaly detection dataset using ten desirable requirements. Subsequently, the article presents the resulting dataset, named UGRansome, created with up-to-date network traffic (netflow) that represents cyclostationary patterns of normal and abnormal classes of threatening behaviours. It was discovered that the timestamp of various network attacks is less than one minute, and this feature pattern was used to record the time taken by the threat to infiltrate a network node. The main asset of the proposed dataset is its applicability to the detection of zero-day attacks and anomalies that have not been explored before and cannot be recognised by known threat signatures. For instance, the UDP Scan attack was found to utilise the lowest netflow in the corpus, while Razy utilises the highest. In turn, the EDA2 and Globe malware are the most abnormal zero-day threats in the proposed dataset. These feature patterns are included in the corpus but are derived from two well-known datasets, namely UGR’16 and a ransomware dataset, both of which include real-life instances. The former incorporates cyclostationary patterns while the latter includes ransomware features. The UGRansome dataset was tested with cross-validation and compared to the KDD99 and NSL-KDD datasets to assess the performance of Ensemble Learning algorithms. False alarms were minimized, with a null empirical error during the experiment, which demonstrates that applying the Random Forest algorithm to UGRansome can produce accurate results that enhance zero-day threat detection. Additionally, most zero-day threats such as Razy, Globe, EDA2, and TowerWeb are recognised as advanced persistent threats that are cyclostationary in nature, and it is predicted that they will use spamming and phishing for intrusion. Lastly, balancing UGRansome was found to be NP-Hard because the real-life threat classes are not uniformly distributed in terms of the number of instances.
