DECISION TREES AND RANDOM FOREST USING OPEN SOURCE TOOLS:

2021 ◽  
pp. 91-104
Author(s):  
Guadalupe Peláez Ramírez ◽  
Francisco Javier Lena-Acebo
PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0242458
Author(s):  
Minzheng Jiang ◽  
Tiancai Cheng ◽  
Kangxing Dong ◽  
Shufan Xu ◽  
Yulong Geng

Because the failure mode of a submersible screw pump is difficult to determine directly, undiagnosed faults shorten the life of the system and disrupt normal oil well production. This study aims to identify the fault forms of submersible screw pumps accurately and efficiently, and proposes a fault diagnosis method based on random forest. An HDFS storage system and a MapReduce processing system are established on the Hadoop big data processing platform, and the Bagging algorithm is used to draw the training set data. The CART method is used to build the sample library and the decision trees of the random forest model. Six continuous variables, four categorical variables, and the fault categories of the submersible screw pump oil production system are used to train the decision trees. Since several decision trees constitute a random forest model, the parameters to be tested are input into the random forest model, and the various decision trees are used to determine the failure category of the submersible screw pump. The verified accuracy of fault diagnosis is 92.86%. This study can provide meaningful guidance for timely detection of the causes of downhole unit failures, reducing oil well production losses, and accelerating the promotion and application of submersible screw pumps in oil fields.
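The Bagging and voting steps described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the trees are depth-1 CART-style stumps, the per-split feature subsampling of a full random forest is omitted, and the two-feature toy data and the "normal"/"fault" labels are invented for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def gini(labels):
    """Gini impurity, the split criterion CART uses for classification."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows):
    """Fit a depth-1 tree: the (feature, threshold) split with lowest weighted Gini."""
    best = None
    for f in range(len(rows[0][0])):
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, majority(left), majority(right))
    if best is None:
        # No usable split (the sample holds one repeated row): always predict it.
        label = majority([y for _, y in rows])
        return (0, float("inf"), label, label)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label


def fit_forest(rows, n_trees, seed=0):
    """Train each tree on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]


def predict_forest(forest, x):
    """Majority vote over the individual trees."""
    return majority([predict_stump(stump, x) for stump in forest])


# Hypothetical toy data: (feature vector, fault category).
rows = [
    ((1.0, 7.0), "normal"), ((2.0, 3.0), "normal"),
    ((3.0, 9.0), "normal"), ((4.0, 5.0), "normal"),
    ((10.0, 4.0), "fault"), ((11.0, 8.0), "fault"),
    ((12.0, 6.0), "fault"), ((13.0, 2.0), "fault"),
]
forest = fit_forest(rows, n_trees=25, seed=0)
print(predict_forest(forest, (0.5, 5.0)), predict_forest(forest, (20.0, 5.0)))
```

Categorical inputs such as the four mentioned in the abstract would need an encoding or a set-membership split, which this sketch leaves out.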


2021 ◽  
Author(s):  
Chris J. Kennedy ◽  
Dustin G. Mark ◽  
Jie Huang ◽  
Mark J. van der Laan ◽  
Alan E. Hubbard ◽  
...  

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important to improve resource allocation and reduce over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients that are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS). 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling, and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance to key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients to be below a 0.5% 60-day MACE risk threshold. 
The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: Use of ML algorithms, combined with broad predictor sets, improved MACE risk prediction compared to simpler alternatives, while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information available in other characteristics and combined in nuanced ways via ML.
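To make the two headline quantities in this abstract concrete, the sketch below computes a ROC-AUC and the share of patients falling under a risk threshold. It is only an illustration: the labels and predicted risks are invented toy values, not study data, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def fraction_below(scores, threshold):
    """Share of cases whose predicted risk is strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


# Invented toy example: 1 = event within 60 days, 0 = no event.
labels = [0, 0, 0, 0, 1, 0, 1, 0]
risks = [0.001, 0.002, 0.004, 0.010, 0.030, 0.003, 0.200, 0.050]

auc = roc_auc(labels, risks)
low_risk_share = fraction_below(risks, 0.005)  # 0.5% risk threshold
print(auc, low_risk_share)
```

The pairwise loop is quadratic in the number of cases, which is fine for a toy example; a rank-based formulation would be the usual choice for a cohort of the size reported here.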


Circulation ◽  
2020 ◽  
Vol 142 (Suppl_3) ◽  
Author(s):  
Koichi Sughimoto ◽  
Jacob Levman ◽  
Fazleem Baig ◽  
Derek Berger ◽  
Yoshihiro Oshima ◽  
...  

Introduction: Despite improvements in management for children after cardiac surgery, a non-negligible proportion of patients suffer from cardiac arrest, having a poor prognosis. Although serum lactate levels are widely accepted markers of hemodynamic instability, measuring lactate requires discrete blood sampling. An alternative method to evaluate hemodynamic stability/instability continuously and non-invasively may assist in improving the standard of patient care. Hypothesis: We hypothesize that blood lactate in PICU patients can be predicted using machine learning applied to arterial waveforms and perioperative characteristics. Methods: Forty-eight children, who underwent heart surgery, were included. Patient characteristics and physiological measurements were acquired and analyzed using specialized software/hardware, including heart rate, lactate level, arterial waveform sharpness, and area under the curve. Predicting a patient’s blood lactate levels was accomplished using regression-based supervised learning algorithms, including regression decision trees, tuned decision trees, random forest regressor, tuned random forest, AdaBoost regressor, and hypertuned AdaBoost. All algorithms were compared with hold-out cross validation. Two approaches were considered: basing prediction on the currently acquired physiological measurements along with those acquired at admission, as well as adding the most recent lactate measurement and the time since that measurement as prediction parameters. The second approach supports updating the learning system’s predictive capacity whenever a patient has a new ground truth blood lactate reading acquired. Results: In both approaches, the best performing machine learning method was the tuned random forest, which yielded a mean absolute error of 5.60 mg/dL in the first approach, and 4.62 mg/dL when predicting blood lactate with updated ground truth. 
Conclusions: The tuned random forest is capable of predicting the level of serum lactate by analyzing perioperative variables, including the arterial pressure waveform. Machine learning can predict the patient’s hemodynamics non-invasively, continuously, and with accuracy that may demonstrate clinical utility.
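The evaluation described above (a hold-out split scored by mean absolute error) can be sketched as follows. The split fraction, the seed, and the lactate values are assumptions chosen for illustration; they do not come from the study.

```python
import random


def holdout_split(rows, test_fraction, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as the hold-out set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# 48 hypothetical patient identifiers, 25% held out (an assumed fraction).
train_ids, test_ids = holdout_split(list(range(48)), test_fraction=0.25)

# Invented lactate readings in mg/dL and invented model predictions.
measured = [10.0, 20.0, 15.0, 30.0]
predicted = [12.0, 18.0, 15.0, 24.0]
mae = mean_absolute_error(measured, predicted)
print(len(train_ids), len(test_ids), mae)
```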


2018 ◽  
pp. 1587-1599
Author(s):  
Hiroaki Koma ◽  
Taku Harada ◽  
Akira Yoshizawa ◽  
Hirotoshi Iwasaki

Detecting distracted states can be applied to various problems such as danger prevention when driving a car. A cognitive distracted state is one example of a distracted state. It is known that eye movements express cognitive distraction. Eye movements can be classified into several types. In this paper, the authors detect a cognitive distraction using classified eye movement types when applying the Random Forest machine learning algorithm, which uses decision trees. They show the effectiveness of considering eye movement types for detecting cognitive distraction when applying Random Forest. The authors use visual experiments with still images for the detection.


2020 ◽  
Vol 11 (1) ◽  
pp. 2385-2410
Author(s):  
Quoc Bao Pham ◽  
Kaustuv Mukherjee ◽  
Akbar Norouzi ◽  
Nguyen Thi Thuy Linh ◽  
Saeid Janizadeh ◽  
...  

2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Gábor Szűcs

The paper deals with classification in privacy-preserving data mining. An algorithm, the Random Response Forest, is introduced constructing many binary decision trees, as an extension of Random Forest for privacy-preserving problems. Random Response Forest uses the Random Response idea among the anonymization methods, which instead of generalization keeps the original data, but mixes them. An anonymity metric is defined for undistinguishability of two mixed sets of data. This metric, the binary anonymity, is investigated and taken into consideration for optimal coding of the binary variables. The accuracy of Random Response Forest is presented at the end of the paper.
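As background for the randomized-response idea mentioned here, the sketch below shows the classic scheme for a single binary attribute: each value is reported truthfully with probability p and flipped otherwise, and the true proportion is then estimated from the noisy reports. This is the textbook technique, offered as an assumption about the general idea; it is not the mixing or coding procedure of Random Response Forest itself.

```python
import random


def randomize(bits, p_truth, seed=0):
    """Report each bit truthfully with probability p_truth, flipped otherwise."""
    rng = random.Random(seed)
    return [b if rng.random() < p_truth else 1 - b for b in bits]


def estimate_proportion(observed_share, p_truth):
    """Unbiased estimate of the true share of 1s from the observed share.

    Solves observed = p * true + (1 - p) * (1 - true) for the true share.
    """
    if p_truth == 0.5:
        raise ValueError("p_truth = 0.5 destroys all information")
    return (observed_share - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


# Hypothetical sensitive attribute for ten records.
true_bits = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
reported = randomize(true_bits, p_truth=0.75, seed=1)
observed_share = sum(reported) / len(reported)
print(estimate_proportion(observed_share, p_truth=0.75))
```

With only ten records the estimate is noisy and can fall outside the interval from 0 to 1; the correction is meaningful only in aggregate over many records.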


2017 ◽  
Vol 29 (3) ◽  
pp. 164-170 ◽  
Author(s):  
Hao Wu

Purpose: This paper aims to inspect solder joint defects on printed circuit boards in a real-time production line, where simple computation and high accuracy are the primary considerations for the feature extraction and classification algorithm. Design/methodology/approach: In this study, the author presents an ensemble method for the classification of solder joint defects. The method extracts color and geometry features after solder image acquisition and uses decision trees to keep the algorithm's execution efficient. To improve accuracy, the author proposes a random forest ensemble that combines several trees for the classification of solder joints. Findings: The proposed method has been tested on 280 samples of solder joints, including good joints and various defect types. The results show that the proposed method has high accuracy. Originality/value: The author extracted color and geometry features and used decision trees to keep the algorithm's execution efficient. To improve accuracy, the author proposes a random forest ensemble that combines several trees for the classification of solder joints. The results show that the proposed method has high accuracy.


2021 ◽  
Vol 8 (2) ◽  
pp. 257-272
Author(s):  
Yunai Yi ◽  
Diya Sun ◽  
Peixin Li ◽  
Tae-Kyun Kim ◽  
Tianmin Xu ◽  
...  

This paper presents an unsupervised clustering random-forest-based metric for affinity estimation in large and high-dimensional data. The criterion used for node splitting during forest construction can handle rank-deficiency when measuring cluster compactness. The binary forest-based metric is extended to continuous metrics by exploiting both the common traversal path and the smallest shared parent node. The proposed forest-based metric efficiently estimates affinity by passing down data pairs in the forest using a limited number of decision trees. A pseudo-leaf-splitting (PLS) algorithm is introduced to account for spatial relationships, which regularizes affinity measures and overcomes inconsistent leaf assignments. The random-forest-based metric with PLS facilitates the establishment of consistent and point-wise correspondences. The proposed method has been applied to automatic phrase recognition using color and depth videos and point-wise correspondence. Extensive experiments demonstrate the effectiveness of the proposed method in affinity estimation in a comparison with the state-of-the-art.
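A forest-based affinity of the kind described here can be sketched as follows, assuming each data point is summarized by its root-to-leaf path ("L"/"R" steps) in every tree. The binary version counts the trees in which two points share a leaf. The continuous version shown is one plausible normalization by shared path length; it is an assumption for illustration, not the paper's formula, and the PLS step is not covered.

```python
def common_prefix_len(path_a, path_b):
    """Number of leading steps two root-to-leaf paths share."""
    n = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        n += 1
    return n


def binary_affinity(paths_i, paths_j):
    """Fraction of trees in which both points end in the same leaf."""
    return sum(p == q for p, q in zip(paths_i, paths_j)) / len(paths_i)


def path_affinity(paths_i, paths_j):
    """Continuous variant: average share of the traversal path held in common."""
    total = 0.0
    for p, q in zip(paths_i, paths_j):
        longest = max(len(p), len(q))
        # Two points sitting at the root of a single-node tree count as identical.
        total += 1.0 if longest == 0 else common_prefix_len(p, q) / longest
    return total / len(paths_i)


# Hypothetical traversal paths of two points through a three-tree forest.
point_a = ["LL", "RL", "LR"]
point_b = ["LL", "RR", "RL"]
print(binary_affinity(point_a, point_b), path_affinity(point_a, point_b))
```

In the toy forest the two points share a leaf in only one tree, yet the path-based score is higher because they also travel together for part of the second tree.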


Author(s):  
Linlan Liu ◽  
Yi Feng ◽  
Shengrong Gao ◽  
Jian Shu

To address the imbalance of wireless link samples, we propose a link quality estimation method that combines the K-means synthetic minority over-sampling technique (K-means SMOTE) with a weighted random forest. The method adopts the mean, variance and asymmetry metrics of the physical layer parameters as the link quality parameters. Link quality is measured by the link quality level, which is determined by the packet receiving rate. K-means is used to cluster the link quality samples. SMOTE is employed to synthesize samples for the minority link quality levels, so that the samples of the different link quality levels are balanced. The link quality estimation model is built on the weighted random forest: decision trees with worse classification performance are assigned smaller weights, and decision trees with better classification performance are assigned larger weights. The experimental results show that the proposed link quality estimation method performs better with samples processed by K-means SMOTE. Furthermore, it has better estimation performance than the Naive Bayesian, Logistic Regression and K-nearest Neighbour estimation methods.
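Two of the building blocks named in this abstract can be sketched in a few lines: the SMOTE interpolation that creates a synthetic minority sample, and a weighted vote over tree predictions. The weights below stand in for per-tree validation accuracy, which is an assumption; the abstract does not say how the weights are computed. The K-means clustering step is omitted, and all values are invented.

```python
from collections import defaultdict


def smote_point(sample, neighbor, gap):
    """Synthetic sample on the segment between a minority sample and a neighbor.

    gap is a number in [0, 1]; SMOTE draws it uniformly at random.
    """
    return tuple(s + gap * (n - s) for s, n in zip(sample, neighbor))


def weighted_vote(predictions, weights):
    """Return the class whose supporting trees have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical minority-class samples (two link quality parameters each).
synthetic = smote_point((0.0, 0.0), (2.0, 4.0), gap=0.5)

# Hypothetical predictions of three trees and their assumed weights.
tree_predictions = ["good", "bad", "bad"]
tree_weights = [0.9, 0.3, 0.4]
decision = weighted_vote(tree_predictions, tree_weights)
print(synthetic, decision)
```

In this example an unweighted majority vote would return "bad", while the weighted vote follows the single more reliable tree.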


2016 ◽  
Vol 6 (3) ◽  
pp. 356-367 ◽  
Author(s):  
Ch. Ravi Sekhar ◽  
Minal Minal ◽  
Errampalli Madhu
