Classifying many-class high-dimensional fingerprint datasets using random forest of oblique decision trees

2014 ◽  
Vol 2 (1) ◽  
pp. 3-12 ◽  
Author(s):  
Thanh-Nghi Do ◽  
Philippe Lenca ◽  
Stéphane Lallich

2021 ◽  
Vol 8 (2) ◽  
pp. 257-272
Author(s):  
Yunai Yi ◽  
Diya Sun ◽  
Peixin Li ◽  
Tae-Kyun Kim ◽  
Tianmin Xu ◽  
...  

This paper presents an unsupervised clustering random-forest-based metric for affinity estimation in large, high-dimensional data. The criterion used for node splitting during forest construction can handle rank deficiency when measuring cluster compactness. The binary forest-based metric is extended to a continuous metric by exploiting both the common traversal path and the smallest shared parent node. The proposed forest-based metric efficiently estimates affinity by passing data pairs down a limited number of decision trees. A pseudo-leaf-splitting (PLS) algorithm is introduced to account for spatial relationships, which regularizes the affinity measure and overcomes inconsistent leaf assignments. The random-forest-based metric with PLS facilitates the establishment of consistent, point-wise correspondences. The proposed method has been applied to automatic phrase recognition using color and depth videos and to point-wise correspondence. Extensive experiments demonstrate the effectiveness of the proposed method for affinity estimation in comparison with the state of the art.
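The binary forest-based metric, affinity as the fraction of trees in which two points share a leaf, can be sketched in a few lines. The sketch below is a minimal illustration using scikit-learn's RandomTreesEmbedding as the unsupervised forest; it implements only the classical same-leaf proximity, not the paper's continuous traversal-path extension or the PLS algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

# Toy data: 100 points in 50 dimensions
X = np.random.default_rng(0).normal(size=(100, 50))

# Grow an unsupervised forest and record each point's leaf in every tree
forest = RandomTreesEmbedding(n_estimators=100, random_state=0).fit(X)
leaves = forest.apply(X)  # shape (n_samples, n_trees)

# Binary forest metric: affinity of a pair = fraction of trees in which
# the two points land in the same leaf
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```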


2012 ◽  
Vol 8 (2) ◽  
pp. 44-63 ◽  
Author(s):  
Baoxun Xu ◽  
Joshua Zhexue Huang ◽  
Graham Williams ◽  
Qiang Wang ◽  
Yunming Ye

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for the subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contain many features that are uninformative for classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model suffers significantly. In this paper, the authors propose an improved random forest method that uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high-dimensional data. A series of experiments on 9 real-life high-dimensional datasets demonstrated that, with a suitably chosen subspace size relative to the total number of features M in the dataset, our random forest model significantly outperforms existing random forest models.
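As an illustration, feature-weighted subspace sampling can replace uniform sampling as below. This is a sketch under assumptions: the paper defines its own feature-weighting scheme, for which mutual information is used here as a stand-in.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def weighted_subspace(X, y, subspace_size, seed=None):
    """Sample a feature subspace in proportion to feature informativeness
    rather than uniformly, so informative features are rarely missed."""
    rng = np.random.default_rng(seed)
    w = mutual_info_classif(X, y, random_state=0)  # stand-in for the paper's weights
    p = w / w.sum() if w.sum() > 0 else None       # fall back to uniform sampling
    return rng.choice(X.shape[1], size=subspace_size, replace=False, p=p)
```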


2021 ◽  
pp. 1-15
Author(s):  
Zhaozhao Xu ◽  
Derong Shen ◽  
Yue Kou ◽  
Tiezheng Nie

Due to high-dimensional features and strong correlations among features, the classification accuracy of medical data is not as good as expected. Feature selection is a common approach to this problem: it selects effective features by reducing the dimensionality of high-dimensional data. However, traditional feature selection algorithms suffer from blindness in threshold setting, and their search procedures are liable to fall into local optima. To address this, this paper proposes a hybrid feature selection algorithm combining ReliefF and particle swarm optimization (PSO). The algorithm has three main parts. First, ReliefF is used to calculate feature weights, and the features are ranked by weight. Then the ranked features are grouped by density equalization, so that the density of features in each group is the same. Finally, the PSO algorithm searches over the ranked feature groups, and feature selection is performed according to a new fitness function. Experimental results show that the random forest achieves the highest classification accuracy on the selected features while requiring the fewest features. In addition, experimental results on 2 medical datasets show that the average accuracy of the random forest reaches 90.20%, which demonstrates that the hybrid algorithm has practical application value.
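A minimal sketch of the first two parts (ReliefF weighting and density-equalized grouping) follows; the PSO search and the paper's fitness function are omitted, and "density equalization" is read here as equal-count groups, which is one possible interpretation.

```python
import numpy as np

def relieff_weights(X, y, n_neighbors=10, n_samples=200, seed=0):
    """Compact ReliefF: reward features whose values separate classes locally."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12  # scale differences to [0, 1]
    w = np.zeros(d)
    for i in rng.choice(n, size=min(n_samples, n), replace=False):
        dist = np.linalg.norm(X - X[i], axis=1)
        hits = np.flatnonzero((y == y[i]) & (dist > 0))
        misses = np.flatnonzero(y != y[i])
        if len(hits) == 0 or len(misses) == 0:
            continue
        hits = hits[np.argsort(dist[hits])][:n_neighbors]
        misses = misses[np.argsort(dist[misses])][:n_neighbors]
        w -= (np.abs(X[hits] - X[i]) / span).mean(axis=0)    # near-hits: penalize
        w += (np.abs(X[misses] - X[i]) / span).mean(axis=0)  # near-misses: reward
    return w

# Rank features by weight, then split the ranking into equal-count groups;
# PSO would then search over these groups.
# w = relieff_weights(X, y)
# groups = np.array_split(np.argsort(-w), 10)
```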


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0242458
Author(s):  
Minzheng Jiang ◽  
Tiancai Cheng ◽  
Kangxing Dong ◽  
Shufan Xu ◽  
Yulong Geng

Difficulty in directly determining the failure mode of a submersible screw pump shortens the life of the system and disrupts the normal production of the oil well. This thesis aims to identify the fault forms of submersible screw pumps accurately and efficiently, and proposes a fault diagnosis method for the submersible screw pump based on random forest. An HDFS storage system and a MapReduce processing system are established on the Hadoop big data processing platform. Furthermore, the Bagging algorithm is used to sample the training set data, and the CART method is adopted to build the sample library and the decision trees for a random forest model. Six continuous variables, four categorical variables, and the fault categories of the submersible screw pump oil production system are used to train the decision trees. With several decision trees constituting a random forest model, the parameters to be tested are fed into the random forest, and the ensemble of decision trees determines the failure category of the submersible screw pump. Verification shows that the fault diagnosis accuracy reaches 92.86%. This thesis provides meaningful guidance for the timely detection of the causes of downhole unit failures, reducing oil well production losses, and accelerating the promotion and application of submersible screw pumps in oil fields.
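A sketch of the core classifier follows, assuming hypothetical column names for the six continuous and four categorical variables (the abstract does not name them); scikit-learn's random forest stands in for the Hadoop/MapReduce pipeline.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical variable names; the thesis's actual variables are not listed here
continuous = ["pressure", "temperature", "current", "voltage", "flow_rate", "torque"]
categorical = ["pump_model", "well_type", "sensor_state", "run_mode"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # pass the continuous variables through unchanged
)
model = Pipeline([
    ("pre", pre),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
# model.fit(train_df[continuous + categorical], train_df["fault_class"])
# predicted_fault = model.predict(test_df[continuous + categorical])
```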


2021 ◽  
Author(s):  
Chris J. Kennedy ◽  
Dustin G. Mark ◽  
Jie Huang ◽  
Mark J. van der Laan ◽  
Alan E. Hubbard ◽  
...  

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important for improving resource allocation and reducing over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients who are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS); the 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance against key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients as below the 0.5% 60-day MACE risk threshold. The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved a 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: ML algorithms combined with broad predictor sets improved MACE risk prediction compared to simpler alternatives while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information that is available in other characteristics and combined in nuanced ways by ML.
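The stacked-ensembling idea can be illustrated with scikit-learn's StackingClassifier, which, like SuperLearner, feeds cross-validated out-of-fold predictions from base learners into a meta-learner. This is a simplified sketch, not the study's actual SuperLearner configuration, library, or learner set.

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier

# Base learners' out-of-fold predicted probabilities become the
# meta-learner's inputs, as in SuperLearner-style stacking.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
        ("tree", DecisionTreeClassifier(max_depth=4)),
    ],
    final_estimator=LogisticRegressionCV(max_iter=5000),
    stack_method="predict_proba",
    cv=5,
)
# stack.fit(X_train, y_train)
# risk = stack.predict_proba(X_test)[:, 1]  # estimated MACE risk
```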


Circulation ◽  
2020 ◽  
Vol 142 (Suppl_3) ◽  
Author(s):  
Koichi Sughimoto ◽  
Jacob Levman ◽  
Fazleem Baig ◽  
Derek Berger ◽  
Yoshihiro Oshima ◽  
...  

Introduction: Despite improvements in the management of children after cardiac surgery, a non-negligible proportion of patients suffer cardiac arrest and have a poor prognosis. Although serum lactate levels are widely accepted markers of hemodynamic instability, measuring lactate requires discrete blood sampling. An alternative method to evaluate hemodynamic stability or instability continuously and non-invasively may help improve the standard of patient care. Hypothesis: We hypothesize that blood lactate in PICU patients can be predicted using machine learning applied to arterial waveforms and perioperative characteristics. Methods: Forty-eight children who underwent heart surgery were included. Patient characteristics and physiological measurements, including heart rate, lactate level, arterial waveform sharpness, and area under the curve, were acquired and analyzed using specialized software and hardware. Blood lactate levels were predicted using regression-based supervised learning algorithms, including regression decision trees, tuned decision trees, a random forest regressor, a tuned random forest, an AdaBoost regressor, and hypertuned AdaBoost. All algorithms were compared using hold-out cross-validation. Two approaches were considered: the first based prediction on the currently acquired physiological measurements along with those acquired at admission; the second added the most recent lactate measurement and the time since that measurement as prediction parameters. The second approach supports updating the learning system's predictive capacity whenever a patient has a new ground-truth blood lactate reading. Results: In both approaches, the best performing machine learning method was the tuned random forest, which yielded a mean absolute error of 5.60 mg/dL in the first approach and 4.62 mg/dL when predicting blood lactate with updated ground truth. Conclusions: The tuned random forest is capable of predicting serum lactate levels by analyzing perioperative variables, including the arterial pressure waveform. Machine learning can predict a patient's hemodynamics non-invasively, continuously, and with accuracy that may demonstrate clinical utility.
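As a sketch of a "tuned random forest" with hold-out evaluation, the snippet below grid-searches a few hyperparameters and reports mean absolute error; synthetic data stands in for the physiological measurements, and the grid is illustrative rather than the study's.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the perioperative/waveform features
X, y = make_regression(n_samples=500, n_features=12, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# "Tuned random forest": cross-validated grid search over a small grid
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [200, 500],
                "max_depth": [None, 10],
                "min_samples_leaf": [1, 5]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)
print("hold-out MAE:", mean_absolute_error(y_test, search.predict(X_test)))
```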


2018 ◽  
pp. 1587-1599
Author(s):  
Hiroaki Koma ◽  
Taku Harada ◽  
Akira Yoshizawa ◽  
Hirotoshi Iwasaki

Detecting distracted states can be applied to various problems, such as danger prevention while driving a car. A cognitive distracted state is one example of a distracted state, and it is known that eye movements express cognitive distraction. Eye movements can be classified into several types. In this paper, the authors detect cognitive distraction from classified eye movement types using the Random Forest machine learning algorithm, an ensemble of decision trees. They show the effectiveness of considering eye movement types for detecting cognitive distraction with Random Forest, using visual experiments with still images for the detection.
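One plausible way to feed classified eye movement types into Random Forest, purely as an illustration, is to summarize the type sequence as per-window type frequencies; the type names and windowing below are assumptions, not the paper's actual feature construction.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed eye-movement taxonomy; the paper's actual types may differ
EYE_TYPES = ["fixation", "saccade", "smooth_pursuit", "blink"]

def type_frequency_features(type_sequence, window=50):
    """Summarize a classified eye-movement sequence as the frequency of
    each movement type within consecutive fixed-size windows."""
    feats = []
    for start in range(0, len(type_sequence) - window + 1, window):
        chunk = type_sequence[start:start + window]
        feats.append([chunk.count(t) / window for t in EYE_TYPES])
    return np.array(feats)

# clf = RandomForestClassifier(n_estimators=100, random_state=0)
# clf.fit(type_frequency_features(seq), distraction_labels)
```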


2021 ◽  
pp. 91-104
Author(s):  
Guadalupe Peláez Ramírez ◽  
Francisco Javier Lena-Acebo

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Jiali Sun ◽  
Qingtai Wu ◽  
Dafeng Shen ◽  
Yangjun Wen ◽  
Fengrong Liu ◽  
...  

One of the most important tasks in genome-wide association studies (GWAS) is the detection of single-nucleotide polymorphisms (SNPs) related to target traits. With the development of sequencing technology, traditional statistical methods struggle to analyze the resulting high-dimensional, massive SNP data. Recently, machine learning methods have become popular in high-dimensional genetic data analysis because of their fast computation. However, most machine learning methods have drawbacks such as poor generalization ability, over-fitting, unsatisfactory classification, and low detection accuracy. This study proposes a two-stage algorithm based on least angle regression and random forest (TSLRF), which first controls for population structure and polygenic effects, then selects the SNPs potentially related to target traits using least angle regression (LARS), and further analyzes this variable subset using random forest (RF) to detect quantitative trait nucleotides (QTNs) associated with the target traits. The new method shows more powerful detection in simulation experiments and real data analyses. Simulation results showed that, compared with existing approaches, the new method effectively improved QTN detection ability and model fit while requiring less computation time, and it clearly distinguished QTNs from other SNPs. The method was then applied to five flowering-related traits in Arabidopsis. The distinction between QTNs and unrelated SNPs was more pronounced than with the other methods; the new method detected 60 genes confirmed to be related to the target traits, significantly more than the other methods, and simultaneously detected multiple gene clusters associated with the target traits.
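A minimal sketch of the two-stage idea on simulated genotypes follows, using scikit-learn's Lars for the screening stage and a random forest for the second stage; the paper's control of population structure and polygenic effects is omitted, and all sizes and effect values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lars

# Simulated genotypes: n individuals x p SNPs coded 0/1/2, three true QTNs
rng = np.random.default_rng(0)
n, p = 200, 5000
X = rng.integers(0, 3, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[[10, 500, 2500]] = [1.5, -1.0, 2.0]   # true QTN effects
y = X @ beta + rng.normal(size=n)

# Stage 1: LARS screens a small candidate subset of SNPs
candidates = np.flatnonzero(Lars(n_nonzero_coefs=50).fit(X, y).coef_)

# Stage 2: random forest ranks the screened candidates by importance
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X[:, candidates], y)
ranked = candidates[np.argsort(-rf.feature_importances_)]
print("top candidate QTNs:", ranked[:10])
```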

