A data mining approach for lubricant-based fault diagnosis

2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
James Wakiru ◽  
Liliane Pintelon ◽  
Peter Muchiri ◽  
Peter Chemweno

Purpose: The purpose of this paper is to develop a maintenance decision support system (DSS) framework using in-service lubricant data for fault diagnosis. The DSS reveals embedded patterns in the data (knowledge discovery) and automatically quantifies the influence of lubricant parameters on the unhealthy state of the machine using alternative classifiers. The classifiers are compared for robustness, from which decision-makers select an appropriate classifier for a given lubricant data set.
Design/methodology/approach: The DSS embeds a framework integrating cluster analysis and principal component analysis for feature extraction, together with eight classifiers, among them extreme gradient boosting (XGB), random forest (RF), decision trees (DT) and logistic regression (LR). A qualitative and quantitative criterion is developed in conjunction with practitioners for comparing the classifier models.
Findings: The results show the importance of embedded knowledge, explored via a knowledge discovery approach, and emphasize its efficacy for a maintenance DSS. Importantly, the proposed framework is shown to be plausible for decision support owing to its high accuracy and its consideration of practitioners' needs.
Practical implications: The proposed framework will potentially assist maintenance managers in accurately exploiting lubricant data for maintenance decision support, while offering insights with reduced time and errors.
Originality/value: Lubricant-based intelligent approaches to fault diagnosis are seldom utilized in practice, yet they may be incorporated into information management systems and offer high predictive accuracy. The classification-model comparison approach will assist industry in selecting among divergent models for a DSS.
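A minimal sketch of the kind of pipeline this abstract describes: PCA-based feature extraction followed by a cross-validated comparison of several of the listed classifiers. The choice of scikit-learn/xgboost APIs, the hyperparameters and the accuracy metric are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a PCA + multi-classifier comparison on lubricant data
# (assumed structure; not the authors' code). Requires scikit-learn and xgboost.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def compare_classifiers(X, y):
    """Score several classifiers on PCA-reduced lubricant features."""
    candidates = {
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(),
        "RF": RandomForestClassifier(n_estimators=200),
        "XGB": XGBClassifier(n_estimators=200, eval_metric="logloss"),
    }
    results = {}
    for name, clf in candidates.items():
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("pca", PCA(n_components=0.95)),   # keep 95% of the variance
            ("clf", clf),
        ])
        results[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    return results
```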

2020 ◽  
Author(s):  
Ching-Chieh Huang ◽  
Jesyin Lai ◽  
Der-Yang Cho ◽  
Jiaxin Yu

Abstract Since the emergence of COVID-19, many hospitals have encountered challenges in performing efficient scheduling and good resource management to ensure that the quality of healthcare provided to patients is not compromised. Operating room (OR) scheduling is one such issue, because it is related to workflow efficiency and the critical care capacity of hospitals. Automatic scheduling and high predictive accuracy of surgical case duration play a critical role in improving OR utilization. To estimate surgical case duration, many hospitals rely on historic averages based on a specific surgeon or a specific procedure type obtained from electronic medical record (EMR) scheduling systems. However, the low predictive accuracy of EMR data leads to negative impacts on patients and hospitals, such as rescheduling or cancellation of surgeries. In this study, we aim to improve the prediction of surgical case duration with advanced machine learning (ML) algorithms. We obtained a large data set containing 170,748 surgical cases (from Jan 2017 to Dec 2019) from a hospital. The data covered a broad variety of details on patients, surgeries, specialties and surgical teams. In addition, a more recent data set with 8,672 cases (from Mar to Apr 2020) was available for external evaluation. We computed historic averages from the EMR data for surgeon- or procedure-specific cases, and these were used as baseline models for comparison. Subsequently, we developed our models using linear regression, random forest and extreme gradient boosting (XGB) algorithms. All models were evaluated with R-square (R2), mean absolute error (MAE), and the percentages of overage (actual duration longer than predicted), underage (actual duration shorter than predicted) and within (actual duration within the prediction). The XGB model was superior to the other models, achieving a higher R2 (85%) and percentage within (48%) as well as a lower MAE (30.2 min). The total prediction errors computed for all models showed that the XGB model had the lowest inaccurate percentage (23.7%). Overall, this study applied ML techniques in the field of OR scheduling to reduce the medical and financial burden on healthcare management. The results revealed the importance of surgery and surgeon factors in predicting surgical case duration. This study also demonstrated the importance of performing an external evaluation to better validate the performance of ML models.
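A hedged sketch of the evaluation loop described above: an XGB regressor for case duration scored with R2, MAE and overage/within/underage rates. The tolerance band defining "within", the feature layout and the hyperparameters are assumptions, not the study's actual configuration.

```python
# Sketch of the duration-prediction evaluation (assumed thresholds and
# hyperparameters; not the hospital's actual pipeline).
import numpy as np
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error

def fit_and_evaluate(X_train, y_train, X_test, y_test, tolerance=0.1):
    """Train an XGB regressor and report R2, MAE and overage/within/underage rates."""
    model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    actual = np.asarray(y_test, dtype=float)

    r2 = r2_score(actual, pred)
    mae = mean_absolute_error(actual, pred)

    # A case counts as "within" when the actual duration falls inside a
    # +/- tolerance band around the prediction (band width is an assumption).
    lower, upper = pred * (1 - tolerance), pred * (1 + tolerance)
    within = np.mean((actual >= lower) & (actual <= upper))
    overage = np.mean(actual > upper)     # surgery ran longer than predicted
    underage = np.mean(actual < lower)    # surgery ended earlier than predicted
    return {"R2": r2, "MAE": mae, "within": within,
            "overage": overage, "underage": underage}
```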


Processes ◽  
2019 ◽  
Vol 7 (9) ◽  
pp. 589 ◽  
Author(s):  
Yafei Lei ◽  
Wanlu Jiang ◽  
Anqi Jiang ◽  
Yong Zhu ◽  
Hongjie Niu ◽  
...  

A novel cloud-service-based fault diagnosis method is proposed for the typical faults in the hydraulic directional valve. The method, built on the Machine Learning Service (MLS) of HUAWEI CLOUD, achieves accurate diagnosis of hydraulic valve faults by combining the advantages of Principal Component Analysis (PCA) for dimensionality reduction with the eXtreme Gradient Boosting (XGBoost) algorithm. First, to obtain the principal component feature set of the pressure signal, PCA was utilized to reduce the dimension of the measured inlet and outlet pressure signals of the hydraulic directional valve. Second, a machine learning sample was constructed by replacing the original fault set with the principal component feature set. Third, the MLS was employed to create an XGBoost model to diagnose valve faults. Lastly, based on model evaluation indicators such as precision, recall rate, and F1 score, a test set was used to compare the XGBoost model with the Classification And Regression Trees (CART) model and the Random Forests (RF) model, respectively. The research results indicate that the proposed method can effectively identify faults in the hydraulic directional valve and has higher fault diagnosis accuracy.
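A minimal local sketch of the PCA + XGBoost diagnosis and the comparison with CART and RF, assuming the pressure-signal features arrive as a plain numeric matrix; the HUAWEI CLOUD MLS deployment step and the real data layout are not reproduced here.

```python
# Sketch of PCA-reduced valve-fault classification compared across XGBoost,
# CART and RF (assumed data layout; not the cloud deployment itself).
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

def diagnose_valve_faults(X_train, y_train, X_test, y_test, n_components=10):
    """Reduce inlet/outlet pressure features with PCA, then compare classifiers."""
    pca = PCA(n_components=n_components).fit(X_train)
    X_train_pc, X_test_pc = pca.transform(X_train), pca.transform(X_test)

    models = {
        "XGBoost": XGBClassifier(n_estimators=300, eval_metric="mlogloss"),
        "CART": DecisionTreeClassifier(),
        "RF": RandomForestClassifier(n_estimators=300),
    }
    for name, model in models.items():
        model.fit(X_train_pc, y_train)
        # Precision, recall and F1 per fault class, as in the evaluation above.
        print(name, classification_report(y_test, model.predict(X_test_pc)), sep="\n")
```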


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Irfan Ullah Khan ◽  
Nida Aslam ◽  
Rawan Alshehri ◽  
Seham Alzahrani ◽  
Manal Alghamdi ◽  
...  

Cervical cancer is a frequently deadly disease that is common in females. However, early diagnosis of cervical cancer can reduce the mortality rate and other associated complications, and cervical cancer risk factors can aid such early diagnosis. For better diagnostic accuracy, we propose a study for early diagnosis of cervical cancer using a reduced risk-factor feature set and three ensemble-based classification techniques, i.e., extreme Gradient Boosting (XGBoost), AdaBoost, and Random Forest (RF), along with the Firefly algorithm for optimization. The Synthetic Minority Oversampling Technique (SMOTE) was used to alleviate the data imbalance problem. The Cervical Cancer Risk Factors data set, containing 32 risk factors and four targets (Hinselmann, Schiller, Cytology, and Biopsy), is used in the study; the four targets are widely used diagnostic tests for cervical cancer. The effectiveness of the proposed study is evaluated in terms of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Moreover, the Firefly feature selection technique was used to achieve better results with a reduced number of features. Experimental results reveal the significance of the proposed model, which achieved the highest outcome for the Hinselmann test when compared with the other three diagnostic tests. Furthermore, the reduction in the number of features enhanced the outcomes. Additionally, the performance of the proposed models is notable in terms of accuracy when compared with other benchmark studies on cervical cancer diagnosis using the reduced risk-factors data set.
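A sketch of the SMOTE oversampling and ensemble comparison for one binary target, with the metrics listed above derived from the confusion matrix. The Firefly-based feature selection/optimization step is not reproduced, and the hyperparameters are assumptions.

```python
# Sketch of the SMOTE + ensemble setup for one binary diagnostic target
# (assumed; the Firefly feature selection step is omitted here).
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

def evaluate_target(X_train, y_train, X_test, y_test):
    """Oversample the minority class, then score the three ensemble classifiers."""
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    models = {
        "XGBoost": XGBClassifier(n_estimators=300, eval_metric="logloss"),
        "AdaBoost": AdaBoostClassifier(n_estimators=300),
        "RF": RandomForestClassifier(n_estimators=300),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_res, y_res)
        # Binary target, so the confusion matrix unravels to tn, fp, fn, tp.
        tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
        scores[name] = {
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "PPA": tp / (tp + fp),   # positive predictive accuracy
            "NPA": tn / (tn + fn),   # negative predictive accuracy
        }
    return scores
```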


2020 ◽  
Author(s):  
Xizhe Wang ◽  
Lijun Zhang

Abstract Because faults of a steam turbine occur frequently and cause huge losses, it is important to identify the fault category. A steam turbine clustering fault diagnosis method based on t-distributed stochastic neighbor embedding (t-SNE) and extreme gradient boosting (XGBoost) is proposed. Firstly, the t-SNE algorithm is used to map high-dimensional data to a low-dimensional space, and data clustering is performed in the low-dimensional space. Combined with the fault records of the power plant, the fault data and healthy data in the clustering result are distinguished. Then, the imbalance problem in the data is addressed by the synthetic minority over-sampling technique (SMOTE) to obtain a steam turbine characteristic data set with fault labels. Finally, XGBoost is used to solve this multiclass classification problem. In the experiment, the method achieved the best performance, with an overall accuracy of 97% and early warning at least two hours in advance. The experimental results show that this method can effectively evaluate the state of power plant equipment and provide fault warnings.
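A sketch of the t-SNE embedding, clustering, SMOTE balancing and XGBoost steps. The label assignment from plant fault records is simplified to an exploratory k-means pass, and the perplexity, cluster count and booster settings are assumptions.

```python
# Sketch of the t-SNE + clustering + SMOTE + XGBoost workflow (assumed
# parameters; labels ultimately come from the plant's fault records).
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def build_fault_classifier(X_high_dim, y_labels_from_records):
    """Embed, balance and train a multiclass XGBoost fault classifier."""
    # 1. Map high-dimensional turbine signals to 2-D for cluster inspection.
    X_embedded = TSNE(n_components=2, perplexity=30).fit_transform(X_high_dim)
    clusters = KMeans(n_clusters=4, n_init=10).fit_predict(X_embedded)  # exploratory

    # 2. Balance the labelled data with SMOTE.
    X_res, y_res = SMOTE().fit_resample(X_high_dim, y_labels_from_records)

    # 3. Multiclass XGBoost model for fault identification.
    model = XGBClassifier(objective="multi:softprob", n_estimators=400)
    model.fit(X_res, y_res)
    return model, X_embedded, clusters
```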


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2748
Author(s):  
Jersson X. Leon-Medina ◽  
Maribel Anaya ◽  
Núria Parés ◽  
Diego A. Tibaduiza ◽  
Francesc Pozo

Damage classification is an important topic in the development of structural health monitoring systems. When applied to wind-turbine foundations, it provides information about the state of the structure, helps in maintenance, and prevents catastrophic failures. A data-driven pattern-recognition methodology for structural damage classification was developed in this study. The proposed methodology involves several stages: (1) data acquisition, (2) data arrangement, (3) data normalization through the mean-centered unitary group-scaling method, (4) linear feature extraction, (5) classification using the extreme gradient boosting machine learning classifier, and (6) validation applying a 5-fold cross-validation technique. The linear feature extraction capabilities of principal component analysis are employed; the original data of 58,008 features is reduced to only 21 features. The methodology is validated with an experimental test performed in a small-scale wind-turbine foundation structure that simulates the perturbation effects caused by wind and marine waves by applying an unknown white noise signal excitation to the structure. A vibration-response methodology is selected for collecting accelerometer data from both the healthy structure and the structure subjected to four different damage scenarios. The datasets are satisfactorily classified, with performance measures over 99.9% after using the proposed damage classification methodology.
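A compact sketch of stages (3) through (6) of the methodology above: normalization, linear feature extraction with PCA down to 21 components, XGBoost classification and 5-fold cross-validation. Standard scaling stands in for the mean-centered unitary group-scaling step, so this is an approximation, not the authors' exact normalization.

```python
# Sketch of stages (3)-(6): scaling, PCA to 21 components, XGBoost, 5-fold CV
# (assumed; standard scaling approximates the group-scaling normalization).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def classify_damage(X, y):
    """5-fold cross-validated PCA + XGBoost damage classifier."""
    pipe = Pipeline([
        ("scale", StandardScaler()),          # stand-in for group scaling
        ("pca", PCA(n_components=21)),        # 58,008 features -> 21 components
        ("xgb", XGBClassifier(n_estimators=300, eval_metric="mlogloss")),
    ])
    return cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
```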


2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Wenzhi Zhang ◽  
Runchuan Li ◽  
Shengya Shen ◽  
Jinliang Yao ◽  
Yan Peng ◽  
...  

Myocardial infarction (MI) is one of the most common cardiovascular diseases threatening human life. In order to distinguish myocardial infarction accurately and with good interpretability, a classification method that combines rule features and ventricular activity features is proposed in this paper. Specifically, according to the clinical diagnosis rules and the pathological changes of myocardial infarction on the electrocardiogram, local information extracted from the Q wave, ST segment, and T wave is computed as the rule features. All samples of the QT segment are extracted as ventricular activity features. Then, in order to reduce the computational complexity of the ventricular activity features, the effects of the Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA), and Locality Preserving Projections (LPP) on the extracted ventricular activity features are compared. Combining rule features and ventricular activity features, the features of all 12 leads are fused into the final feature vector. Finally, eXtreme Gradient Boosting (XGBoost) is used to identify myocardial infarction, and an overall accuracy of 99.86% is obtained on the Physikalisch-Technische Bundesanstalt (PTB) database. This method provides a sound basis for medical diagnosis while improving accuracy, which is very important for clinical decision-making.
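A sketch of the feature-fusion and classification step: reduced QT-segment (ventricular activity) features are concatenated with the rule features before training XGBoost. Rule-feature extraction from the Q wave, ST segment and T wave is not shown, PCA stands in for the DWT/PCA/LPP alternatives compared in the paper, and the shapes and hyperparameters are assumptions.

```python
# Sketch of feature fusion + XGBoost MI classification (assumed shapes;
# PCA stands in for the DWT/PCA/LPP reduction alternatives).
import numpy as np
from sklearn.decomposition import PCA
from xgboost import XGBClassifier

def train_mi_classifier(rule_features, qt_segments, labels, n_components=20):
    """Fuse rule features with reduced QT-segment features and train XGBoost."""
    # Reduce the raw QT-segment samples (ventricular activity features).
    qt_reduced = PCA(n_components=n_components).fit_transform(qt_segments)
    # Fuse per-lead rule features with the reduced ventricular activity features.
    X = np.hstack([rule_features, qt_reduced])
    model = XGBClassifier(n_estimators=400, eval_metric="logloss")
    model.fit(X, labels)
    return model
```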


Risks ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 202
Author(s):  
Ge Gao ◽  
Hongxin Wang ◽  
Pengbin Gao

In China, SMEs are facing financing difficulties, and commercial banks and financial institutions are the main financing channels for SMEs. Thus, a reasonable and efficient credit risk assessment system is important for credit markets. Based on traditional statistical methods and AI technology, a soft voting fusion model, which incorporates logistic regression, support vector machine (SVM), random forest (RF), eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), is constructed to improve the predictive accuracy of SMEs’ credit risk. To verify the feasibility and effectiveness of the proposed model, we use data from 123 SMEs nationwide that worked with a Chinese bank from 2016 to 2020, including financial information and default records. The results show that the accuracy of the soft voting fusion model is higher than that of a single machine learning (ML) algorithm, which provides a theoretical basis for the government to control credit risk in the future and offers important references for banks to make credit decisions.
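A minimal sketch of the soft-voting fusion model described above, averaging class probabilities from the five listed learners; the hyperparameters are assumptions, not the study's configuration.

```python
# Sketch of the soft-voting fusion of LR, SVM, RF, XGBoost and LightGBM
# (assumed hyperparameters; not the study's exact setup).
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

def build_soft_voting_model():
    """Combine LR, SVM, RF, XGBoost and LightGBM by averaging class probabilities."""
    return VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("svm", SVC(probability=True)),   # probability estimates needed for soft voting
            ("rf", RandomForestClassifier(n_estimators=300)),
            ("xgb", XGBClassifier(n_estimators=300, eval_metric="logloss")),
            ("lgbm", LGBMClassifier(n_estimators=300)),
        ],
        voting="soft",
    )
```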


2020 ◽  
Vol 9 (9) ◽  
pp. 507
Author(s):  
Sanjiwana Arjasakusuma ◽  
Sandiaga Swahyu Kusuma ◽  
Stuart Phinn

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data with high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA), in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with tree (XGBtree and XGBdart) and linear (XGBlin) boosters, were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R2 = 0.53 and RMSE = 1.7 m (18.4% nRMSE and 0.046 m bias) for BO-XGBdart and R2 = 0.51 and RMSE = 1.8 m (15.8% nRMSE and −0.244 m bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection: it eliminated 95% of the data, selecting the 29 most important variables from the initial 516 variables drawn from the lidar metrics and hyperspectral data.
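A hedged sketch of the best-performing BO-XGBdart combination: Boruta feature selection followed by XGBoost regression with the dart booster. The BorutaPy port with a random-forest surrogate is an assumption (the study may have used the original R implementation), as are the hyperparameters.

```python
# Sketch of the BO-XGBdart combination (assumed; BorutaPy with a random-forest
# surrogate stands in for the study's Boruta run).
import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def estimate_forest_height(X, y):
    """Select features with Boruta, then fit an XGBoost model with the dart booster."""
    selector = BorutaPy(
        RandomForestRegressor(n_estimators=300, max_depth=5),
        n_estimators="auto",
        random_state=42,
    )
    selector.fit(np.asarray(X), np.asarray(y))
    X_selected = np.asarray(X)[:, selector.support_]   # e.g. 516 -> ~29 variables

    model = XGBRegressor(booster="dart", n_estimators=400, learning_rate=0.05)
    model.fit(X_selected, y)
    return model, selector.support_
```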


Author(s):  
Ade Jamal ◽  
Annisa Handayani ◽  
Ali Akbar Septiandri ◽  
Endang Ripmiatin ◽  
Yunus Effendi

Breast cancer is one of the most important causes of death among women. Predicting breast cancer at an early stage provides a greater possibility of cure, which calls for a prediction tool that can classify a breast tumor as either a harmful malignant tumor or a harmless benign tumor. In this paper, two machine learning algorithms, namely the Support Vector Machine and the Extreme Gradient Boosting technique, are compared for classification purposes. Prior to classification, the number of data attributes is reduced from the raw data by extracting features using Principal Component Analysis. A clustering method, namely K-Means, is also used for dimensionality reduction besides Principal Component Analysis. This paper presents a comparison among four models based on two dimensionality reduction methods combined with two classifiers, applied to the Wisconsin Breast Cancer Dataset. The comparison is measured using the accuracy, sensitivity and specificity metrics evaluated from the confusion matrices. The experimental results indicate that the K-Means method, which is not usually used for dimensionality reduction, can perform well compared with the popular Principal Component Analysis.
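A sketch of the four-way comparison (two reducers times two classifiers) on the Wisconsin data. K-Means "dimensionality reduction" is interpreted here as replacing the raw attributes with distances to the cluster centres via KMeans.transform, which is one plausible reading rather than the authors' stated construction; component and cluster counts are assumptions.

```python
# Sketch of the 2 reducers x 2 classifiers comparison on the Wisconsin
# Breast Cancer Dataset (assumed reduction sizes and metrics setup).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)   # Wisconsin (diagnostic) data

reducers = {"PCA": PCA(n_components=5),
            "KMeans": KMeans(n_clusters=5, n_init=10)}  # transform = distances to centres
classifiers = {"SVM": SVC(),
               "XGB": XGBClassifier(n_estimators=200, eval_metric="logloss")}

for r_name, reducer in reducers.items():
    for c_name, clf in classifiers.items():
        pipe = Pipeline([("scale", StandardScaler()), ("reduce", reducer), ("clf", clf)])
        score = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
        print(f"{r_name} + {c_name}: {score:.3f}")
```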


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
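A minimal sketch of the "Rule as Feature" hybrid: each hand-crafted rule's output is appended to the text features of a report segment before training the classifier. The TF-IDF representation, the logistic regression learner and the shape of rule_outputs are assumptions for illustration, not the authors' exact feature set.

```python
# Sketch of a "Rule as Feature" hybrid: rule outputs concatenated with text
# features (assumed representation; not the authors' implementation).
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_rule_as_feature(segments, rule_outputs, labels):
    """Concatenate rule outputs with TF-IDF features and train a classifier.

    segments: list of report core segments (str)
    rule_outputs: array of shape (n_segments, n_rules) with each rule's decision
    """
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X_text = vectorizer.fit_transform(segments)
    X = hstack([X_text, csr_matrix(np.asarray(rule_outputs, dtype=float))])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, labels)
    return vectorizer, clf
```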

