Probing for Sparse and Fast Variable Selection with Model-Based Boosting

We present a new variable selection method based on model-based gradient boosting and randomly permuted variables. Model-based boosting is a tool to fit a statistical model while performing variable selection at the same time. A drawback of the fitting lies in the need of multiple model fits on slightly altered data (e.g., cross-validation or bootstrap) to find the optimal number of boosting iterations and prevent overfitting. In our proposed approach, we augment the data set with randomly permuted versions of the true variables, so-called shadow variables, and stop the stepwise fitting as soon as such a variable would be added to the model. This allows variable selection in a single fit of the model without requiring further parameter tuning. We show that our probing approach can compete with state-of-the-art selection methods like stability selection in a high-dimensional classification benchmark and apply it on three gene expression data sets.

Download Full-text

Predictive and Descriptive CoMFA Models: The Effect of Variable Selection

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207321666180212162028 ◽

2018 ◽

Vol 21 (2) ◽

pp. 117-124 ◽

Cited By ~ 4

Author(s):

Bakhtyar Sepehri ◽

Nematollah Omidikia ◽

Mohsen Kompany-Zareh ◽

Raouf Ghavami

Keyword(s):

Variable Selection ◽

Predictive Power ◽

Selection Method ◽

Data Sets ◽

Data Set ◽

Comfa Model ◽

Variable Selection Method

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets including 36 EPAC antagonists, 79 CD38 inhibitors and 57 ATAD2 bromodomain inhibitors were modelled by CoMFA. First of all, for all three data sets, CoMFA models with all CoMFA descriptors were created then by applying each variable selection method a new CoMFA model was developed so for each data set, 9 CoMFA models were built. Obtained results show noisy and uninformative variables affect CoMFA results. Based on created models, applying 5 variable selection approaches including FFD, SRD-FFD, IVE-PLS, SRD-UVEPLS and SPA-jackknife increases the predictive power and stability of CoMFA models significantly. Result & Conclusion: Among them, SPA-jackknife removes most of the variables while FFD retains most of them. FFD and IVE-PLS are time consuming process while SRD-FFD and SRD-UVE-PLS run need to few seconds. Also applying FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS protect CoMFA countor maps information for both fields.

Download Full-text

Models for Internal Clustering Validation Indexes Based on Hadoop-MapReduce

International Journal of Distributed Systems and Technologies ◽

10.4018/ijdst.2020070103 ◽

2020 ◽

Vol 11 (3) ◽

pp. 42-67

Author(s):

Soumeya Zerabi ◽

Souham Meshoul ◽

Samia Chikhi Boucherkha

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Optimal Number ◽

Data Sets ◽

Data Set ◽

Number Of Clusters ◽

Distributed Models ◽

Hadoop Mapreduce ◽

Distributed Solutions ◽

Clustering Validation

Cluster validation aims to both evaluate the results of clustering algorithms and predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based in computing pairwise distances which results in a quadratic complexity of the related algorithms. The existing CVIs cannot handle large data sets properly and need to be revisited to take account of the ever-increasing data set volume. Therefore, design of parallel and distributed solutions to implement these indexes is required. To cope with this issue, the authors propose two parallel and distributed models for internal CVIs namely for Silhouette and Dunn indexes using MapReduce framework under Hadoop. The proposed models termed as MR_Silhouette and MR_Dunn have been tested to solve both the issue of evaluating the clustering results and identifying the optimal number of clusters. The results of experimental study are very promising and show that the proposed parallel and distributed models achieve the expected tasks successfully.

Download Full-text

A classification approach based on variable precision rough sets and cluster validity index function

Engineering Computations ◽

10.1108/ec-11-2012-0297 ◽

2014 ◽

Vol 31 (8) ◽

pp. 1778-1789

Author(s):

Hongkang Lin

Keyword(s):

Optimal Number ◽

Data Sets ◽

Cluster Validity ◽

Cluster Validity Index ◽

Index Method ◽

Data Set ◽

Content Type ◽

The Individual ◽

Variable Precision Rough Sets ◽

Optimal Number Of Clusters

Purpose – The clustering/classification method proposed in this study, designated as the PFV-index method, provides the means to solve the following problems for a data set characterized by imprecision and uncertainty: first, discretizing the continuous values of all the individual attributes within a data set; second, evaluating the optimality of the discretization results; third, determining the optimal number of clusters per attribute; and fourth, improving the classification accuracy (CA) of data sets characterized by uncertainty. The paper aims to discuss these issues. Design/methodology/approach – The proposed method for the solution of the clustering/classifying problem, designated as PFV-index method, combines a particle swarm optimization algorithm, fuzzy C-means method, variable precision rough sets theory, and a new cluster validity index function. Findings – This method could cluster the values of the individual attributes within the data set and achieves both the optimal number of clusters and the optimal CA. Originality/value – The validity of the proposed approach is investigated by comparing the classification results obtained for UCI data sets with those obtained by supervised classification BPNN, decision-tree methods.

Download Full-text

Core looseness fault identification model based on Mel spectrogram-CNN

Journal of Physics Conference Series ◽

10.1088/1742-6596/2137/1/012060 ◽

2021 ◽

Vol 2137 (1) ◽

pp. 012060

Author(s):

Ping He ◽

Yong Li ◽

Shoulong Chen ◽

Hoghua Xu ◽

Lei Zhu ◽

...

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Convolution Neural Network ◽

Data Sets ◽

Data Set ◽

Recognition Model ◽

Model Based ◽

Fault Recognition ◽

Transformer Core ◽

Identification Model

Abstract In order to realize transformer voiceprint recognition, a transformer voiceprint recognition model based on Mel spectrum convolution neural network is proposed. Firstly, the transformer core looseness fault is simulated by setting different preloads, and the sound signals under different preloads are collected; Secondly, the sound signal is converted into a spectrogram that can be trained by convolutional neural network, and then the dimension is reduced by Mel filter bank to draw Mel spectrogram, which can generate spectrogram data sets under different preloads in batch; Finally, the data set is introduced into convolutional neural network for training, and the transformer voiceprint fault recognition model is obtained. The results show that the training accuracy of the proposed Mel spectrum convolution neural network transformer identification model is 99.91%, which can well identify the core loosening faults.

Download Full-text

Evaluation of machine learning algorithms for classification of primary biological aerosol using a new UV-LIF spectrometer

Atmospheric Measurement Techniques ◽

10.5194/amt-10-695-2017 ◽

2017 ◽

Vol 10 (2) ◽

pp. 695-708 ◽

Cited By ~ 25

Author(s):

Simon Ruske ◽

David O. Topping ◽

Virginia E. Foot ◽

Paul H. Kaye ◽

Warren R. Stanley ◽

...

Keyword(s):

Neural Networks ◽

Decision Trees ◽

Supervised Learning ◽

Ensemble Methods ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Data Set ◽

Shape Information ◽

Accuracy Of Measurements

Abstract. Characterisation of bioaerosols has important implications within environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen.This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification.For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations for support vector machines (libsvm and liblinear) and Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis, the k-nearest neighbours algorithm and artificial neural networks).The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol.Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67. 6 and 91. 1 % for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82. 8 and 98. 27 % of the testing data, respectively, across the two data sets.A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained, which could possibly yield better results.We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.

Download Full-text

The benefits of segmentation: Evidence from a South African bank and other studies

South African Journal of Science ◽

10.17159/sajs.2017/20160345 ◽

2017 ◽

Vol 113 (9/10) ◽

Cited By ~ 2

Author(s):

Douw G. Breed ◽

Tanja Verster

Keyword(s):

South African ◽

Direct Marketing ◽

Model Performance ◽

Predictive Modelling ◽

Modelling Technique ◽

Gradient Boosting ◽

Data Sets ◽

Data Set ◽

Linear Modelling ◽

Modelling Techniques

We applied different modelling techniques to six data sets from different disciplines in the industry, on which predictive models can be developed, to demonstrate the benefit of segmentation in linear predictive modelling. We compared the model performance achieved on the data sets to the performance of popular non-linear modelling techniques, by first segmenting the data (using unsupervised, semi-supervised, as well as supervised methods) and then fitting a linear modelling technique. A total of eight modelling techniques was compared. We show that there is no one single modelling technique that always outperforms on the data sets. Specifically considering the direct marketing data set from a local South African bank, it is observed that gradient boosting performed the best. Depending on the characteristics of the data set, one technique may outperform another. We also show that segmenting the data benefits the performance of the linear modelling technique in the predictive modelling context on all data sets considered. Specifically, of the three segmentation methods considered, the semi-supervised segmentation appears the most promising.

Download Full-text

Corrigendum to “Probing for Sparse and Fast Variable Selection with Model-Based Boosting”

Computational and Mathematical Methods in Medicine ◽

10.1155/2018/2430438 ◽

2018 ◽

Vol 2018 ◽

pp. 1-1

Author(s):

Janek Thomas ◽

Tobias Hepp ◽

Andreas Mayr ◽

Bernd Bischl

Keyword(s):

Variable Selection ◽

Fast Variable ◽

Model Based

Download Full-text

A new cluster validity index using maximum cluster spread based compactness measure

International Journal of Intelligent Computing and Cybernetics ◽

10.1108/ijicc-02-2016-0006 ◽

2016 ◽

Vol 9 (2) ◽

pp. 179-204 ◽

Cited By ~ 10

Author(s):

M. Arif Wani ◽

Romana Riyaz

Keyword(s):

Optimal Number ◽

Data Sets ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Data Set ◽

Number Of Clusters ◽

Content Type ◽

Validity Indices ◽

Optimal Number Of Clusters

Purpose – The most commonly used approaches for cluster validation are based on indices but the majority of the existing cluster validity indices do not work well on data sets of different complexities. The purpose of this paper is to propose a new cluster validity index (ARSD index) that works well on all types of data sets. Design/methodology/approach – The authors introduce a new compactness measure that depicts the typical behaviour of a cluster where more points are located around the centre and lesser points towards the outer edge of the cluster. A novel penalty function is proposed for determining the distinctness measure of clusters. Random linear search-algorithm is employed to evaluate and compare the performance of the five commonly known validity indices and the proposed validity index. The values of the six indices are computed for all nc ranging from (nc min, nc max) to obtain the optimal number of clusters present in a data set. The data sets used in the experiments include shaped, Gaussian-like and real data sets. Findings – Through extensive experimental study, it is observed that the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters compared to other validity indices. This is experimentally demonstrated on 11 data sets where the proposed index has achieved better results. Originality/value – The originality of the research paper includes proposing a novel cluster validity index which is used to determine the optimal number of clusters present in data sets of different complexities.

Download Full-text

Joining the Dots: Linking Disconnected Networks of Evidence Using Dose-Response Model-Based Network Meta-Analysis

Medical Decision Making ◽

10.1177/0272989x20983315 ◽

2021 ◽

Vol 41 (2) ◽

pp. 194-208

Author(s):

Hugo Pedder ◽

Sofia Dias ◽

Meg Bennetts ◽

Martin Boucher ◽

Nicky J. Welton

Keyword(s):

Dose Response ◽

Meta Analysis ◽

Relative Effectiveness ◽

Data Sets ◽

Response Relationship ◽

Phase Ii Trials ◽

Dose Response Relationship ◽

Data Set ◽

Model Based ◽

Multiple Doses

Background Network meta-analysis (NMA) synthesizes direct and indirect evidence on multiple treatments to estimate their relative effectiveness. However, comparisons between disconnected treatments are not possible without making strong assumptions. When studies including multiple doses of the same drug are available, model-based NMA (MBNMA) presents a novel solution to this problem by modeling a parametric dose-response relationship within an NMA framework. In this article, we illustrate several scenarios in which dose-response MBNMA can connect and strengthen evidence networks. Methods We created illustrative data sets by removing studies or treatments from an NMA of triptans for migraine relief. We fitted MBNMA models with different dose-response relationships. For connected networks, we compared MBNMA estimates with NMA estimates. For disconnected networks, we compared MBNMA estimates with NMA estimates from an “augmented” network connected by adding studies or treatments back into the data set. Results In connected networks, relative effect estimates from MBNMA were more precise than those from NMA models (ratio of posterior SDs NMA v. MBNMA: median = 1.13; range = 1.04–1.68). In disconnected networks, MBNMA provided estimates for all treatments where NMA could not and were consistent with NMA estimates from augmented networks for 15 of 18 data sets. In the remaining 3 of 18 data sets, a more complex dose-response relationship was required than could be fitted with the available evidence. Conclusions Where information on multiple doses is available, MBNMA can connect disconnected networks and increase precision while making less strong assumptions than alternative approaches. MBNMA relies on correct specification of the dose-response relationship, which requires sufficient data at different doses to allow reliable estimation. We recommend that systematic reviews for NMA search for and include evidence (including phase II trials) on multiple doses of agents where available.

Download Full-text