Investigating the use of random forest, gradient boosting machine, support vector machine and their ensemble applied to fault detection

Author(s):  
Luis Felipe Nogoseke ◽  
Gabriel Herman Bernardim Andrade ◽  
Marco Boaretto ◽  
Leandro Coelho


2020 ◽  
Author(s):  
Zhanyou Xu ◽  
Andreomar Kurek ◽  
Steven B. Cannon ◽  
Williams D. Beavis

Abstract: Selection of markers linked to alleles at quantitative trait loci (QTL) for tolerance to Iron Deficiency Chlorosis (IDC) has not been successful. Genomic selection has been advocated for continuous numeric traits such as yield and plant height. For ordinal data types such as IDC, genomic prediction models have not been systematically compared. The objectives of the research reported in this manuscript were to compare the most commonly used genomic prediction method, ridge regression, and its equivalent logistic ridge regression method with algorithmic modeling methods including random forest, gradient boosting, support vector machine, K-nearest neighbors, Naïve Bayes, and artificial neural network, using the usual comparator metric of prediction accuracy. In addition, we compared the methods using metrics of greater importance for decisions about selecting and culling lines for use in variety development and genetic improvement projects. These metrics include specificity, sensitivity, precision, decision accuracy, and area under the receiver operating characteristic curve. We found that Support Vector Machine provided the best specificity for culling IDC susceptible lines, while Random Forest GP models provided the best combined set of decision metrics for retaining IDC tolerant and culling IDC susceptible lines.
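
The decision metrics named in this abstract can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch, not the authors' code: the function name `decision_metrics`, the label encoding (1 = tolerant, 0 = susceptible), and the toy labels are assumptions made for illustration only.

```python
def decision_metrics(y_true, y_pred, positive=1):
    """Compute sensitivity, specificity, precision, and accuracy for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),    # positive predictive value
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Hypothetical toy data: 1 = tolerant line, 0 = susceptible line (assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = decision_metrics(y_true, y_pred)
print(metrics)
```

A model with high specificity under this encoding rarely mislabels a susceptible line as tolerant, which is why specificity matters for culling decisions.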


2021 ◽  
Vol 12 (2) ◽  
pp. 28-55
Author(s):  
Fabiano Rodrigues ◽  
Francisco Aparecido Rodrigues ◽  
Thelma Valéria Rocha Rodrigues

This study analyzes results obtained with machine learning models for predicting startup success. As a proxy for success, it adopts the investor's perspective, in which the acquisition of the startup or an IPO (Initial Public Offering) is a way of recovering the investment. The literature review covers startups and funding vehicles, previous studies on predicting startup success with machine learning models, and trade-offs among machine learning techniques. In the empirical part, a quantitative study was carried out using secondary data from the American platform Crunchbase, with startups from 171 countries. The research design filtered for startups founded between June 2010 and June 2015 and used a prediction window between June 2015 and June 2020 to predict startup success. The sample used, after data preprocessing, comprised 18,571 startups. Six binary classification models were used for prediction: Logistic Regression, Decision Tree, Random Forest, Extreme Gradient Boosting, Support Vector Machine, and Neural Network. In the end, the Random Forest and Extreme Gradient Boosting models showed the best performance in the classification task. This article, involving machine learning and startups, contributes to hybrid research areas by bridging the fields of Business Administration and Data Science. In addition, it offers investors a tool for the initial screening of startups when searching for targets with a higher probability of success.


2021 ◽  
Vol 4 (2(112)) ◽  
pp. 58-72
Author(s):  
Chingiz Kenshimov ◽  
Zholdas Buribayev ◽  
Yedilkhan Amirgaliyev ◽  
Aisulyu Ataniyazova ◽  
Askhat Aitimov

In the course of this research, the American, Russian, and Turkish sign languages were analyzed, and a program for recognizing the Kazakh dactylic sign language using machine learning methods was implemented. A dataset of 5000 images was formed for each gesture, and the Random Forest, Support Vector Machine, and Extreme Gradient Boosting gesture recognition algorithms were applied. Two data types were combined into one database, which changed the architecture of the system as a whole. The quality of the algorithms was also evaluated. The work was motivated by the fact that existing research on recognition systems for the Kazakh dactyl sign language is insufficient for a complete representation of the language. The Kazakh language has specific letters, and the peculiarities of its spelling cause problems when developing recognition systems for the Kazakh sign language. The results showed that the Support Vector Machine and Extreme Gradient Boosting algorithms are superior in real-time performance, while the Random Forest algorithm has higher recognition accuracy. The classification accuracy was 98.86 % for Random Forest, 98.68 % for Support Vector Machine, and 98.54 % for Extreme Gradient Boosting. The classical algorithms also scored highly in the quality evaluation. The practical significance of this work is that gesture recognition with the updated alphabet of the Kazakh language has not previously been studied, so the results can be used by other researchers working on recognition of the Kazakh dactyl sign language, as well as by researchers engaged in developing the international sign language.


JNANALOKA ◽  
2020 ◽  
pp. 1-10
Author(s):  
Muhammad Kurniawan

Data mining is concerned with searching data to find patterns or knowledge in the data as a whole. Data mining can be used to predict a condition, such as whether or not a person has chronic kidney disease. In this study, the symmetrical uncertainty feature reduction method was used with the Gradient Boosting, Random Forest, Support Vector Machine, and Naïve Bayes classification algorithms to predict chronic kidney disease. Classification was performed with 24, 12, 6, 5, and 4 attributes. Accuracy improved when the attributes were reduced from 24 to 12 with the Naïve Bayes algorithm. In addition, Support Vector Machine had the best accuracy for every number of attributes, followed by Gradient Boosting, Random Forest, and Naïve Bayes. In the 5-attribute classification, the Support Vector Machine and Gradient Boosting algorithms still had an accuracy of 1. The five attributes are hemoglobin, packed cell volume, serum creatinine, albumin, and specific gravity. Reducing the attributes can improve accuracy and simplify the prediction process because there are fewer attributes. There is not yet
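
Symmetrical uncertainty, the feature reduction criterion named in this abstract, is commonly defined for discrete variables as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information. The sketch below is an illustration under that standard definition, not the author's implementation; the feature names and values are invented, and continuous attributes would need to be discretized first.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    denom = h_x + h_y
    return 2 * mutual_info / denom if denom else 0.0


# Hypothetical discretized attributes and class labels (illustrative only).
labels = [1, 1, 0, 0]
features = {
    "informative": ["low", "low", "high", "high"],  # determines the label
    "uninformative": ["a", "b", "a", "b"],          # independent of the label
}
scores = {name: symmetrical_uncertainty(col, labels) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Keeping only the top-ranked attributes from such a ranking is one way to obtain reduced attribute sets like those compared in the abstract.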


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8163
Author(s):  
Wunna Tun ◽  
Johnny Kwok-Wai Wong ◽  
Sai-Ho Ling

The malfunctioning of the heating, ventilating, and air conditioning (HVAC) system is considered to be one of the main challenges in modern buildings. Due to the complexity of the building management system (BMS), which takes operational data from a large number of sensors in the HVAC system, faults can be very difficult to detect at an early stage. While numerous fault detection and diagnosis (FDD) methods based on statistical modeling and machine learning have shown notable results in recent years, early detection remains challenging because many current approaches cannot diagnose some HVAC faults and have accuracy issues. In view of this, this study presents a novel hybrid FDD approach that combines random forest (RF) and support vector machine (SVM) classifiers for FDD in HVAC systems. Experimental results demonstrate that the proposed hybrid random forest–support vector machine (HRF–SVM) outperforms other methods with higher prediction accuracy (98%), even though the fault symptoms were insignificant. Furthermore, the proposed framework can significantly reduce the number of sensors required and works well with the small number of faulty training data samples available in real-world applications.


2021 ◽  
Vol 9 (Suppl 3) ◽  
pp. A838-A839
Author(s):  
Steven Tran ◽  
Luke Rasmussen ◽  
Jennifer Pacheco ◽  
Carlos Galvez ◽  
Kyle Tegtmeyer ◽  
...  

Background: Immune checkpoint inhibitors (ICIs) are a pillar of cancer therapy with demonstrated efficacy in a variety of malignancies. However, they are associated with immune-related adverse events (irAEs) that affect many organ systems with varying severity, inhibiting patient quality of life and in some cases the ability to continue immunotherapy. Research into irAEs is nascent, and identifying patients with adverse events poses a critical challenge for future research efforts and patient care. This study's objective was to develop an electronic health record (EHR)-based model to identify and characterize patients with ICI-associated arthritis (checkpoint arthritis).
Methods: Forty-two patients with checkpoint arthritis were chart abstracted from a cohort of all patients who received checkpoint therapy for cancer (n=2,612) in a single-center retrospective study. All EHR clinical codes (N=32,198) were extracted including International Classification of Diseases (ICD)-9 and ICD-10, Logical Observation Identifiers Names and Codes (LOINC), RxNorm, and Current Procedural Terminology (CPT). Logistic regression, random forest, gradient boosting, support vector machine, K-nearest neighbors, and neural network machine learning models were trained to identify checkpoint arthritis patients using these clinical codes. Models were evaluated using receiver operating characteristic area under the curve (ROC-AUC), and the most important variables were determined from the logistic regression model. Models were retrained on smaller fractions of the important variables to determine the minimum variable set necessary to achieve accurate identification of checkpoint arthritis.
Results: Logistic regression and random forest were the highest performing models on the full variable set of 32,198 clinical codes (AUCs: 0.911, 0.894, respectively) (table 1). Retraining the models on smaller fractions of the most important variables demonstrated peak performance using the top 31 clinical codes, or 0.1% of the total variables (figure 1). The most important features included presence of ESR, CRP, rheumatoid factor lab, prednisone, joint pain, creatine kinase lab, thyroid labs, and immunization, all positively associated with checkpoint arthritis (figure 2).
Conclusions: Our study demonstrates that a data-driven, EHR based approach can robustly identify checkpoint arthritis patients. The high performance of the models using only the 0.1% most important variables suggests that only a small number of clinical attributes are needed to identify these patients. The variables most important for identifying checkpoint arthritis included several unexpected clinical features, such as thyroid labs and immunization, indicating potential underlying irAE associations that warrant further exploration. Finally, the flexibility of this approach and its demonstrated effectiveness could be applied to identify and characterize other irAEs.
Ethics Approval: This study was approved by the Northwestern University Institutional Review Board, ID STU00210502, with a granted waiver of consent.
Abstract 802 Table 1: Model performance metrics. AUC was calculated from the ROC curve. Sensitivity, specificity, PPV, and NPV were determined at the threshold maximizing the F1-score. AUC = area under the curve, ROC = receiver operating characteristic, PPV = positive predictive value, NPV = negative predictive value.
Abstract 802 Figure 1: Model AUC trained on decreasing fractions of the most important variables, determined by the random forest model. 100% = 32,198 clinical codes. LReg = logistic regression, RF = random forest, GB = gradient boosting, NN = neural network, KNN = K-nearest neighbor, SVM = support vector machine, SVMAnom = SVM anomaly detection.
Abstract 802 Figure 2: The 31 most important variables determined by the logistic regression (A, coefficients) and random forest (B, relative importance) models.
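
The Table 1 caption states that sensitivity, specificity, PPV, and NPV were determined at the threshold maximizing the F1-score. One plausible way to do this is sketched below; it is not the study's code. The function names, the choice of candidate thresholds (each distinct predicted score), and the toy scores are assumptions for illustration.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores):
    """Scan each distinct score as a threshold and keep the one with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
        f1 = f1_score(tp, fp, fn)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical predicted probabilities for eight patients (1 = checkpoint arthritis).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90]

threshold, f1 = best_f1_threshold(y_true, scores)
tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
report = {
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "ppv": tp / (tp + fp),
    "npv": tn / (tn + fn),
}
print(threshold, f1, report)
```

With heavily imbalanced cohorts such as the one described (42 cases among 2,612 patients), the F1-maximizing threshold is often far from 0.5, which is one reason to select it from the data rather than fix it in advance.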


Author(s):  
Linlin Kou ◽  
Yong Qin ◽  
Xunjun Zhao ◽  
Yong Fu

Bogies are critical components of a rail vehicle and are important for the safe operation of rail transit. In this study, the authors analyzed real vibration data from the bogies of a railway vehicle, obtained from a Chinese subway company under four different operating conditions. The authors selected 15 feature indexes, covering time-domain, energy, and entropy features, and examined their correlations. The adaptive synthetic sampling approach–gradient boosting decision tree (ADASYN–GBDT) method is proposed for bogie fault diagnosis. ADASYN–GBDT was compared with three commonly used classifiers (K-nearest neighbor, support vector machine, and Gaussian naïve Bayes), each combined with random forest for feature selection, under different test data sizes. A confusion matrix was used to evaluate the classifiers. For K-nearest neighbor, support vector machine, and Gaussian naïve Bayes, the optimal features must be selected first, whereas the proposed method does not require feature selection. K-nearest neighbor, support vector machine, and Gaussian naïve Bayes produced inaccurate results in multi-class identification. It can be seen that the lowest false detection rates of the proposed ADASYN–GBDT model are 92.95% and 87.81% when the proportion of the test dataset is 0.4 and 0.9, respectively. In addition, the ADASYN–GBDT model has the ability to correctly identify a fault, which makes it more practical and suitable for use in railway operations. The entire process (training and testing) was finished in 2.4231 s, and the detection procedure took 0.0027 s on average. The results show that the proposed ADASYN–GBDT method satisfied the requirements of real-time performance and accuracy for online fault detection. It might therefore aid in the fault detection of bogies.
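
The abstract evaluates classifiers with a confusion matrix in a multi-class setting. The following minimal sketch shows how such a matrix and a per-class recall can be computed; the class names and predictions are invented for illustration and do not come from the study's data.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes, in the order of `labels`."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix, labels):
    """Fraction of each true class that was predicted correctly."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recall[label] = matrix[i][i] / row_total if row_total else float("nan")
    return recall


# Hypothetical operating conditions and predictions (illustrative only).
labels = ["normal", "fault_a", "fault_b"]
y_true = ["normal", "normal", "normal", "fault_a", "fault_a", "fault_b", "fault_b", "fault_b"]
y_pred = ["normal", "normal", "fault_a", "fault_a", "fault_a", "fault_b", "normal", "fault_b"]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(recall)
```

Off-diagonal entries show which conditions are confused with one another, which a single overall accuracy figure would hide.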



