Recognition and labeling of faults in wind turbines with a density-based clustering algorithm

2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Shuai Luo ◽  
Hongwei Liu ◽  
Ershi Qi

Purpose – The purpose of this paper is to recognize and label faults in wind turbines with a new density-based clustering algorithm, named the contour density scanning clustering (CDSC) algorithm. Design/methodology/approach – The algorithm includes four components: (1) computation of neighborhood density, (2) selection of core and noise data, (3) scanning of core data and (4) updating of clusters. The proposed algorithm considers the relationship between neighborhood data points according to a contour density scanning strategy. Findings – The first experiment is conducted with artificial data to validate that the proposed CDSC algorithm is suitable for handling data points with arbitrary shapes. The second experiment, with industrial gearbox vibration data, compares the time complexity and accuracy of the proposed CDSC algorithm with those of other conventional clustering algorithms, including k-means, density-based spatial clustering of applications with noise, density peak clustering, neighborhood grid clustering, support vector clustering, random forest, core fusion-based density peak clustering, AdaBoost and extreme gradient boosting. The third experiment is conducted with an industrial bearing vibration data set to highlight that the CDSC algorithm can automatically track the emerging fault patterns of bearings in wind turbines over time. Originality/value – Data points with different densities are clustered using three strategies: direct density reachability, density reachability and density connectivity. A contour density scanning strategy is proposed to determine whether data points with the same density belong to one cluster. The proposed CDSC algorithm achieves automatic clustering, which means that trends in fault patterns can be tracked.
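
The paper itself does not include code; the snippet below is only a minimal Python sketch of the generic density-based workflow that the four components describe (neighborhood density, core/noise selection, scanning core points, growing clusters), using plain DBSCAN-style radius counting rather than the contour density scanning strategy proposed here. The radius `eps` and threshold `min_pts` are illustrative assumptions.

```python
import numpy as np

def neighborhood_density(X, eps):
    """Component 1: count neighbours within radius eps for every point."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (dist <= eps).sum(axis=1) - 1, dist  # exclude the point itself

def density_cluster(X, eps=0.5, min_pts=5):
    """Toy density-based clustering: core/noise selection, then cluster growth."""
    counts, dist = neighborhood_density(X, eps)
    is_core = counts >= min_pts                 # component 2: core vs noise data
    labels = np.full(len(X), -1)                # -1 marks noise
    cluster_id = 0
    for i in np.where(is_core)[0]:              # component 3: scan core points
        if labels[i] != -1:
            continue
        labels[i] = cluster_id
        frontier = [i]
        while frontier:                         # component 4: update the cluster
            p = frontier.pop()
            for q in np.where(dist[p] <= eps)[0]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if is_core[q]:
                        frontier.append(q)
        cluster_id += 1
    return labels

# Example: two well separated blobs plus whatever points end up as noise
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
print(density_cluster(X, eps=1.0, min_pts=5))
```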

2016 ◽  
Vol 13 (10) ◽  
pp. 6935-6943 ◽  
Author(s):  
Jia-Lin Hua ◽  
Jian Yu ◽  
Miin-Shen Yang

Mountains, which are heaped up by the densities of a data set, intuitively reflect the structure of the data points. Mountain clustering methods are useful for grouping data points. However, previous mountain-based clustering suffers from the choice of the parameters used to compute the density. In this paper, we adopt correlation analysis to determine the density and propose a new clustering algorithm, called Correlative Density-based Clustering (CDC). The new algorithm computes the density in a modified way and determines the parameters based on the inherent structure of the data points. Experiments on artificial and real data sets demonstrate the simplicity and effectiveness of the proposed approach.
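
For context, the classical mountain method builds its density surface with a Gaussian kernel; the sketch below shows that textbook version, not the correlation-based density proposed here, and the bandwidth `sigma` is exactly the kind of hand-tuned parameter the CDC algorithm is meant to avoid.

```python
import numpy as np

def mountain_height(candidates, data, sigma=0.2):
    """Classical mountain (density) function: every data point adds a Gaussian
    bump, and peaks of the resulting surface suggest cluster centres."""
    d2 = ((candidates[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)

# Example: evaluate the surface on the data points themselves
data = np.random.rand(200, 2)
heights = mountain_height(data, data, sigma=0.2)
peak = data[heights.argmax()]   # highest-density point, a candidate cluster centre
```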


Author(s):  
Yatish H. R. ◽  
Shubham Milind Phal ◽  
Tanmay Sanjay Hukkeri ◽  
Lili Xu ◽  
Shobha G ◽  
...  

Dealing with large samples of unlabeled data is a key challenge in today's world, especially in applications such as traffic pattern analysis and disaster management. DBSCAN, or density-based spatial clustering of applications with noise, is a well-known density-based clustering algorithm. Its key strengths lie in its capability to detect outliers and handle arbitrarily shaped clusters. However, the algorithm, being fundamentally sequential in nature, proves expensive and time consuming when operated on extensively large data chunks. This paper thus presents a novel implementation of a parallel and distributed DBSCAN algorithm on the HPCC Systems platform. The implementation seeks to fully parallelize the algorithm by making use of the HPCC Systems distributed architecture and performing a tree-based union to merge local clusters. The proposed approach was tested on both synthetic and standard data sets (the MFCCs data set) and found to be completely accurate. Additionally, when compared against a single-node setup, a significant decrease in computation time was observed with no impact on accuracy. The parallelized algorithm performed eight times better for larger numbers of data points, and its advantage over the sequential version grows as the number of data points increases.
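
The actual implementation is written in ECL on HPCC Systems; the Python sketch below illustrates only the tree-based union step, assuming each partition has already produced local cluster labels and that pairs of local clusters sharing boundary points have been collected as `cross_edges`.

```python
class UnionFind:
    """Tree-based union with path compression, used to merge local clusters."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_local_clusters(n_local_clusters, cross_edges):
    """Return one global representative per local cluster after merging."""
    uf = UnionFind(n_local_clusters)
    for a, b in cross_edges:
        uf.union(a, b)
    return [uf.find(c) for c in range(n_local_clusters)]

# Example: local clusters 0-1-2 overlap across partitions, as do 3-4
print(merge_local_clusters(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3]
```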


2016 ◽  
Vol 23 (5) ◽  
pp. 1381-1390 ◽  
Author(s):  
M. Punniyamoorthy ◽  
P. Sridevi

Purpose – Credit risk assessment has gained importance in recent years due to the global financial crisis and credit crunch. Financial institutions therefore seek the support of credit rating agencies to predict the ability of creditors to meet their financial obligations. The purpose of this paper is to construct neural network (NN) and fuzzy support vector machine (FSVM) classifiers to discriminate good creditors from bad ones and to identify the best classifier for credit risk assessment. Design/methodology/approach – This study uses the artificial neural network, the most popular AI technique in the field of financial applications for classification and prediction, and FSVM, a newer machine learning classification algorithm, to differentiate good creditors from bad. Because the membership values of data points influence the classification, this paper presents a new FSVM model in which instance memberships are computed using fuzzy c-means with a newly devised membership function. The FSVM model is also tested with different kernels, and the kernel yielding the highest classification accuracy is identified. Findings – The paper identifies a standard AI model by comparing the performance of the NN and FSVM models on a credit risk data set. This work shows that the FSVM model performs better than the back-propagation neural network. Practical implications – The proposed model can be used by financial institutions to accurately assess the credit risk pattern of customers and make better decisions. Originality/value – This paper develops a new membership function for data points and proposes a new FCM-based FSVM model for more accurate predictions.
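
The paper's contribution is a newly devised membership function, which is not reproduced here; the sketch below shows only the standard fuzzy c-means membership update such a scheme starts from, assuming cluster centres are already known and a fuzzifier of m = 2.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-9):
    """Standard fuzzy c-means memberships: u[i, k] shrinks with the distance of
    point i from centre k, and each row is normalised to sum to one."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Example: memberships of five applicants with respect to two prototype centres
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.5, 0.5]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
print(fcm_memberships(X, centers))
```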


Author(s):  
Xulei Yang ◽  
Qing Song ◽  
Yue Wang

This paper presents a weighted support vector machine (WSVM) to improve the outlier sensitivity problem of the standard support vector machine (SVM) for two-class data classification. The basic idea is to assign different weights to different data points such that the WSVM training algorithm learns the decision surface according to the relative importance of data points in the training data set. The weights used in WSVM are generated by a robust fuzzy clustering algorithm, the kernel-based possibilistic c-means (KPCM) algorithm, whose partition yields relatively high values for important data points but low values for outliers. Experimental results indicate that the proposed method reduces the effect of outliers and yields a higher classification rate than the standard SVM does when outliers exist in the training data set.
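
One direct way to realise this weighting in practice is through per-sample weights in an off-the-shelf SVM; the sketch below uses scikit-learn's `SVC` with placeholder weights standing in for the KPCM typicality values the paper actually generates.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data; the weights stand in for KPCM-derived values:
# high for reliable points, low for suspected outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
weights = np.ones(len(X))
weights[:5] = 0.1                       # down-weight a few points as if they were outliers

clf = SVC(kernel="rbf")
clf.fit(X, y, sample_weight=weights)    # weighted training: outliers influence the margin less
```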


2022 ◽  
pp. ASN.2021040538
Author(s):  
Arthur M. Lee ◽  
Jian Hu ◽  
Yunwen Xu ◽  
Alison G. Abraham ◽  
Rui Xiao ◽  
...  

Background Untargeted plasma metabolomic profiling combined with machine learning (ML) may lead to discovery of metabolic profiles that inform our understanding of pediatric CKD causes. We sought to identify metabolomic signatures in pediatric CKD based on diagnosis: FSGS, obstructive uropathy (OU), aplasia/dysplasia/hypoplasia (A/D/H), and reflux nephropathy (RN). Methods Untargeted metabolomic quantification (GC-MS/LC-MS, Metabolon) was performed on plasma from 702 Chronic Kidney Disease in Children study participants (n: FSGS=63, OU=122, A/D/H=109, and RN=86). Lasso regression was used for feature selection, adjusting for clinical covariates. Four methods were then applied to stratify significance: logistic regression, support vector machine, random forest, and extreme gradient boosting. ML training was performed on 80% total cohort subsets and validated on 20% holdout subsets. Important features were selected based on being significant in at least two of the four modeling approaches. We additionally performed pathway enrichment analysis to identify metabolic subpathways associated with CKD cause. Results ML models were evaluated on holdout subsets with receiver-operator and precision-recall area-under-the-curve, F1 score, and Matthews correlation coefficient. ML models outperformed no-skill prediction. Metabolomic profiles were identified based on cause. FSGS was associated with the sphingomyelin-ceramide axis. FSGS was also associated with individual plasmalogen metabolites and the subpathway. OU was associated with gut microbiome–derived histidine metabolites. Conclusion ML models identified metabolomic signatures based on CKD cause. Using ML techniques in conjunction with traditional biostatistics, we demonstrated that sphingomyelin-ceramide and plasmalogen dysmetabolism are associated with FSGS and that gut microbiome–derived histidine metabolites are associated with OU.
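
The study's analysis code is not shown; the following is a rough sketch of the same pattern, lasso-style feature selection followed by several classifiers evaluated on a 20% holdout with the listed metrics, run on a synthetic placeholder matrix. scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, matthews_corrcoef)

# Placeholder metabolite matrix X and binary diagnosis label y
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# Lasso-style selection via L1-penalised logistic regression
selector = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_tr, y_tr)
keep = np.flatnonzero(selector.coef_[0] != 0)
if keep.size == 0:
    keep = np.arange(X.shape[1])          # fall back to all features in this toy example

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "svm": SVC(probability=True),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),   # stand-in for XGBoost
}
for name, model in models.items():
    model.fit(X_tr[:, keep], y_tr)
    prob = model.predict_proba(X_te[:, keep])[:, 1]
    pred = (prob >= 0.5).astype(int)
    print(name, roc_auc_score(y_te, prob), average_precision_score(y_te, prob),
          f1_score(y_te, pred), matthews_corrcoef(y_te, pred))
```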


2014 ◽  
Vol 23 (1) ◽  
pp. 75-82 ◽  
Author(s):  
Cagatay Catal

Predicting defect-prone modules when the previous defect labels of modules are limited is a challenging problem encountered in the software industry. Supervised classification approaches cannot build high-performance prediction models with few defect data, leading to the need for new methods, techniques, and tools. One solution is to combine labeled data points with unlabeled data points during the learning phase. Semi-supervised classification methods use not only labeled data points but also unlabeled ones to improve generalization capability. In this study, we evaluated four semi-supervised classification methods for semi-supervised defect prediction. Low-density separation (LDS), support vector machine (SVM), expectation-maximization (EM-SEMI), and class mass normalization (CMN) methods were investigated on the NASA data sets CM1, KC1, KC2, and PC1. Experimental results showed that the SVM and LDS algorithms outperform the CMN and EM-SEMI algorithms. In addition, the LDS algorithm performs much better than SVM when the data set is large. Based on this study, the LDS-based prediction approach is suggested for software defect prediction when there are limited fault data.
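
LDS, EM-SEMI, and CMN have no off-the-shelf scikit-learn implementations; purely to illustrate the labeled-plus-unlabeled training pattern, the sketch below uses the library's SelfTrainingClassifier with an SVM base learner on placeholder metric data, where unlabeled modules carry the label -1.

```python
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Placeholder software-metric features and defect labels; most labels are hidden
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
y_true = rng.integers(0, 2, size=400)
y = y_true.copy()
y[40:] = -1                      # only the first 40 modules keep their defect labels

base = SVC(probability=True)     # the base learner must expose predict_proba
model = SelfTrainingClassifier(base, threshold=0.8)
model.fit(X, y)                  # trains on labelled plus confidently self-labelled modules

accuracy = (model.predict(X[40:]) == y_true[40:]).mean()
```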


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second-best performance with a support vector machine.
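
As a rough illustration of why landmarking helps, the sketch below embeds only a landmark subset and maps the remaining pixels into that skeleton before k-nearest-neighbor classification; it uses random landmarks and scikit-learn's Isomap in place of the local-curvature-variation sampling and the incremental learner described in the article.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier

# Placeholder hyperspectral pixels (N x bands) and land-use labels
rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 30))
y = rng.integers(0, 4, size=2000)

# Random landmarks stand in for curvature-based sampling; the similarity matrix
# now scales with the number of landmarks rather than with N.
landmarks = rng.choice(len(X), size=400, replace=False)
embedder = Isomap(n_neighbors=15, n_components=5).fit(X[landmarks])

Z = embedder.transform(X)                                   # map every pixel onto the skeleton
clf = KNeighborsClassifier(n_neighbors=1).fit(Z[landmarks], y[landmarks])
pred = clf.predict(Z)
```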


2020 ◽  
pp. 865-874
Author(s):  
Enrico Santus ◽  
Tal Schuster ◽  
Amir M. Tahmasebi ◽  
Clara Li ◽  
Adam Yala ◽  
...  

PURPOSE Literature on clinical note mining has highlighted the superiority of machine learning (ML) over hand-crafted rules. Nevertheless, most studies assume the availability of large training sets, which is rarely the case. For this reason, in the clinical setting, rules are still common. We suggest 2 methods to leverage the knowledge encoded in pre-existing rules to inform ML decisions and obtain high performance, even with scarce annotations. METHODS We collected 501 prostate pathology reports from 6 American hospitals. Reports were split into 2,711 core segments, annotated with 20 attributes describing the histology, grade, extension, and location of tumors. The data set was split by institutions to generate a cross-institutional evaluation setting. We assessed 4 systems, namely a rule-based approach, an ML model, and 2 hybrid systems integrating the previous methods: a Rule as Feature model and a Classifier Confidence model. Several ML algorithms were tested, including logistic regression (LR), support vector machine (SVM), and eXtreme gradient boosting (XGB). RESULTS When training on data from a single institution, LR lags behind the rules by 3.5% (F1 score: 92.2% v 95.7%). Hybrid models, instead, obtain competitive results, with Classifier Confidence outperforming the rules by +0.5% (96.2%). When a larger amount of data from multiple institutions is used, LR improves by +1.5% over the rules (97.2%), whereas hybrid systems obtain +2.2% for Rule as Feature (97.7%) and +2.6% for Classifier Confidence (98.3%). Replacing LR with SVM or XGB yielded similar performance gains. CONCLUSION We developed methods to use pre-existing handcrafted rules to inform ML algorithms. These hybrid systems obtain better performance than either rules or ML models alone, even when training data are limited.
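
The Rule as Feature idea can be pictured as appending the binary output of each hand-crafted rule to the text features before training the classifier; the report snippets, the single rule, and the attribute below are invented placeholders rather than the study's actual rules.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy report segments for one invented attribute ("perineural invasion present")
segments = ["perineural invasion is present",
            "no evidence of perineural invasion",
            "gleason score 3+4",
            "perineural invasion identified"]
labels = [1, 0, 0, 1]

def rule_fires(text):
    """Hand-crafted rule: fire unless the mention is explicitly negated (placeholder logic)."""
    return int("perineural invasion" in text and "no evidence" not in text)

text_features = TfidfVectorizer().fit_transform(segments)
rule_features = csr_matrix([[rule_fires(s)] for s in segments])

X = hstack([text_features, rule_features])   # Rule as Feature: the rule output joins the features
clf = LogisticRegression().fit(X, labels)
```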


2018 ◽  
Vol 6 (2) ◽  
pp. 69-92 ◽  
Author(s):  
Asanka G. Perera ◽  
Yee Wei Law ◽  
Ali Al-Naji ◽  
Javaan Chahl

Purpose The purpose of this paper is to present a preliminary solution to address the problem of estimating human pose and trajectory by an aerial robot with a monocular camera in near real time. Design/methodology/approach The distinguishing feature of the solution is a dynamic classifier selection architecture. Each video frame is corrected for perspective using projective transformation. Then, a silhouette is extracted and represented as a Histogram of Oriented Gradients (HOG) descriptor. The HOG is then classified using a dynamic classifier. A class is defined as a pose-viewpoint pair, and a total of 64 classes are defined to represent a forward walking and turning gait sequence. The dynamic classifier consists of a Support Vector Machine (SVM) classifier C64 that recognizes all 64 classes, and 64 SVM classifiers that recognize four classes each – these four classes are chosen based on the temporal relationship between them, dictated by the gait sequence. Findings The solution provides three main advantages: first, classification is efficient due to dynamic selection (4-class vs 64-class classification); second, classification errors are confined to neighbors of the true viewpoints, meaning a wrongly estimated viewpoint is at most an adjacent viewpoint of the true viewpoint, enabling fast recovery from incorrect estimations; third, the robust temporal relationship between poses is used to resolve the left-right ambiguities of human silhouettes. Originality/value Experiments conducted on both fronto-parallel videos and aerial videos confirm that the solution can achieve accurate pose and trajectory estimation for these different kinds of videos. For example, the “walking on an 8-shaped path” data set (1,652 frames) can achieve the following estimation accuracies: 85 percent for viewpoints and 98.14 percent for poses.
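
A schematic sketch of the dynamic classifier selection idea follows: one full 64-class SVM plus, for each possible previous class, a small SVM trained only on the classes that can follow it in the gait sequence. The transition rule, HOG dimensionality, and training data below are invented placeholders, not the paper's pose-viewpoint definitions.

```python
import numpy as np
from sklearn.svm import SVC

N_CLASSES = 64

def next_candidates(prev_class):
    """Placeholder temporal model: a class can be followed by itself, its two
    gait-sequence neighbours, and one adjacent-viewpoint class."""
    return [prev_class, (prev_class + 1) % N_CLASSES,
            (prev_class - 1) % N_CLASSES, (prev_class + 16) % N_CLASSES]

# Placeholder HOG descriptors and labels (20 frames per pose-viewpoint class)
rng = np.random.default_rng(4)
X_train = rng.normal(size=(N_CLASSES * 20, 36))
y_train = np.repeat(np.arange(N_CLASSES), 20)

full_clf = SVC().fit(X_train, y_train)                      # the 64-class classifier C64
small_clfs = {}
for c in range(N_CLASSES):                                  # one 4-class SVM per previous class
    mask = np.isin(y_train, next_candidates(c))
    small_clfs[c] = SVC().fit(X_train[mask], y_train[mask])

def classify(hog, prev_class=None):
    """Dynamic selection: use the cheap 4-class SVM whenever the previous class is known."""
    clf = small_clfs[prev_class] if prev_class is not None else full_clf
    return int(clf.predict(hog.reshape(1, -1))[0])
```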


2019 ◽  
Vol 47 (3) ◽  
pp. 154-170
Author(s):  
Janani Balakumar ◽  
S. Vijayarani Mohan

Purpose Owing to the huge volume of documents available on the internet, text classification becomes a necessary task for handling these documents. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research work is to classify personal computer documents based on their content. Design/methodology/approach This paper proposes a new feature selection algorithm based on the artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed ABCFS algorithm is evaluated on real and benchmark data sets and compared with existing feature selection approaches such as information gain and the χ2 statistic. To justify the efficiency of the proposed algorithm, the support vector machine (SVM) and an improved SVM classifier are used in this paper. Findings The experiments were conducted on real and benchmark data sets. The real data set was collected in the form of documents stored on a personal computer, and the benchmark data set was collected from the Reuters and 20 Newsgroups corpora. The results demonstrate that the proposed feature selection algorithm enhances text document classification accuracy. Originality/value This paper proposes the new ABCFS algorithm for feature selection, evaluates its efficiency, and improves the support vector machine. Whereas existing work applies bee colony-based selection to (structured) data features and offers no text feature selection algorithm, in this paper the ABCFS algorithm is used to select features from (unstructured) text documents. The proposed algorithm classifies documents automatically based on their content.
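
The full artificial bee colony search is too long to sketch here; the snippet below shows only the wrapper-style fitness evaluation such a search relies on, namely cross-validated SVM accuracy on a candidate feature subset, with a naive random search standing in for the bee-colony operators and placeholder document features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Placeholder TF-IDF-like document features and category labels
rng = np.random.default_rng(5)
X = rng.random(size=(200, 80))
y = rng.integers(0, 3, size=200)

def fitness(subset_mask):
    """Wrapper fitness: mean cross-validated SVM accuracy on the selected features."""
    if not subset_mask.any():
        return 0.0
    return cross_val_score(LinearSVC(), X[:, subset_mask], y, cv=3).mean()

# Naive random search over subsets; the ABC employed/onlooker/scout phases would replace this loop
best_mask, best_score = None, -1.0
for _ in range(20):
    mask = rng.random(X.shape[1]) < 0.3      # candidate subset: roughly 30% of the features
    score = fitness(mask)
    if score > best_score:
        best_mask, best_score = mask, score
```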

