Automatic catalog of RR Lyrae from ∼14 million VVV light curves: How far can we go with traditional machine-learning?

Context. The creation of a 3D map of the bulge using RR Lyrae (RRL) is one of the main goals of the VISTA Variables in the Via Lactea Survey (VVV) and VVV(X) surveys. The overwhelming number of sources undergoing analysis undoubtedly requires the use of automatic procedures. In this context, previous studies have introduced the use of machine learning (ML) methods for the task of variable star classification. Aims. Our goal is to develop and test an entirely automatic ML-based procedure for the identification of RRLs in the VVV Survey. This automatic procedure is meant to be used to generate reliable catalogs integrated over several tiles in the survey. Methods. Following the reconstruction of light curves, we extracted a set of period- and intensity-based features, which were already defined in previous works. Also, for the first time, we put a new subset of useful color features to use. We discuss in considerable detail all the appropriate steps needed to define our fully automatic pipeline, namely: the selection of quality measurements, sampling procedures, classifier setup, and model selection. Results. As a result, we were able to construct an ensemble classifier with an average recall of 0.48 and average precision of 0.86 over 15 tiles. We also made all our processed datasets available and we published a catalog of candidate RRLs. Conclusions. Perhaps most interestingly, from a classification perspective based on photometric broad-band data, our results indicate that color is an informative feature type for the RRL target class that should always be considered in automatic classification methods via ML. We also argue that recall and precision in both tables and curves are high-quality metrics with regard to this highly imbalanced problem. Furthermore, we show for our VVV data set that to have good estimates, it is important to favor the original distribution over reduced samples with an artificial balance. Finally, we show that the use of ensemble classifiers helps resolve the crucial model selection step and that most errors in the identification of RRLs are related to low-quality observations of some sources or to the increased difficulty in resolving the RRL-C type given the data.
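
The argument for recall and precision over plain accuracy on a highly imbalanced problem can be illustrated with a minimal sketch. The counts below are invented for illustration and are not taken from the VVV data; the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def accuracy(tp, fp, fn, tn):
    """Fraction of all sources that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical, highly imbalanced sample: 100 true RRLs among 100,000 sources.
tp, fn = 48, 52   # 48 RRLs recovered, 52 missed
fp = 8            # 8 non-RRLs wrongly flagged as RRL
tn = 100_000 - tp - fn - fp

precision, recall = precision_recall(tp, fp, fn)
acc = accuracy(tp, fp, fn, tn)

# A classifier that never predicts "RRL" recovers nothing,
# yet its accuracy stays very high because the negative class dominates.
trivial_acc = accuracy(0, 0, 100, 99_900)
trivial_precision, trivial_recall = precision_recall(0, 0, 100)
```

In this toy setting both classifiers have an accuracy above 0.99, while only recall and precision separate the useful one from the trivial one.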

A practical approach for applying Machine Learning in the detection and classification of network devices used in building management

10.22541/au.160689781.19054555/v1 ◽

2020 ◽

Author(s):

Maroun Touma ◽

Shalisha Witherspoon ◽

Shonda Witherspoon ◽

Isabelle Crawford-Eng

Keyword(s):

Machine Learning ◽

Critical Infrastructure ◽

Ensemble Classifier ◽

Essential Elements ◽

Small Sample ◽

Training Data ◽

Feature Engineering ◽

Ensemble Classifiers ◽

Commercial Building ◽

Automation And Control

With the increasing deployment of smart buildings and infrastructure, Supervisory Control and Data Acquisition (SCADA) devices and the underlying IT network have become essential elements for the proper operation of these highly complex systems. With the increase in automation and the proliferation of SCADA devices, the attack surface of critical infrastructure has grown correspondingly. Understanding device behaviors in terms of known and understood or potentially qualified activities versus unknown and potentially nefarious activities in near-real time is a key component of any security solution. In this paper, we investigate the challenges of building robust machine learning models to identify unknowns purely from network traffic both inside and outside firewalls, starting with missing or inconsistent labels across sites, feature engineering and learning, temporal dependencies and analysis, and training data quality (including small sample sizes) for both shallow and deep learning methods. To demonstrate these challenges and the capabilities we have developed, we focus on Building Automation and Control networks (BACnet) from a private commercial building system. Our results show that a “Model Zoo” built from binary classifiers for each device or behavior, combined with an ensemble classifier integrating information from all classifiers, provides a reliable methodology to identify unknown devices as well as to determine specific known devices when the device type is in the training set. The capability of the Model Zoo framework is shown to be directly linked to feature engineering and learning, and the dependence on feature selection varies with both the binary and the ensemble classifiers.

Finding flares in Kepler data using machine-learning tools

Astronomy and Astrophysics ◽

10.1051/0004-6361/201833194 ◽

2018 ◽

Vol 616 ◽

pp. A163 ◽

Cited By ~ 10

Author(s):

Krisztián Vida ◽

Rachael M. Roettenbacher

Keyword(s):

Machine Learning ◽

Rotation Period ◽

Light Curves ◽

Consensus Algorithm ◽

Data Sets ◽

Learning Tools ◽

Stellar Rotation ◽

Data Set ◽

Single Target ◽

Long Time

Context. Archives of long photometric surveys, such as the Kepler database, are a great basis for studying flares. However, identifying the flares is a complex task; it is easily done in the case of single-target observations by visual inspection, but is nearly impossible for several year-long time series for several thousand targets. Although automated methods for this task exist, several problems are difficult (or impossible) to overcome with traditional fitting and analysis approaches. Aims. We introduce a code for identifying and analyzing flares based on machine-learning methods, which are intrinsically adept at handling such data sets. Methods. We used the RANSAC (RANdom SAmple Consensus) algorithm to model light curves, as it yields robust fits even in the case of several outliers, such as flares. The light curves were divided into search windows, approximately on the order of the stellar rotation period. This search window was shifted over the data set, and a voting system was used to keep false positives to a minimum: only those flare candidate points were kept that were identified as a flare in several windows. Results. The code was tested on short-cadence K2 observations of TRAPPIST-1 and on long-cadence Kepler data of KIC 1722506. The detected flare events and flare energies are consistent with earlier results from manual inspections.
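
The windowed RANSAC search with voting described in the Methods can be sketched as follows. This is an illustrative simplification, not the authors' code: it assumes a straight-line model inside each window, a fixed residual tolerance, and hypothetical names (`ransac_line`, `flare_candidates`) and parameter values.

```python
import random


def ransac_line(xs, ys, n_iter=50, tol=0.5, rng=None):
    """Fit y = a*x + b by RANSAC and return the positions (within xs) judged to be outliers."""
    rng = rng or random.Random(0)
    best_inliers = set()
    for _ in range(n_iter):
        i, j = rng.sample(range(len(xs)), 2)  # minimal sample: two distinct points
        a = (ys[j] - ys[i]) / (xs[j] - xs[i])
        b = ys[i] - a * xs[i]
        inliers = {k for k in range(len(xs)) if abs(ys[k] - (a * xs[k] + b)) <= tol}
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return set(range(len(xs))) - best_inliers


def flare_candidates(xs, ys, window=10, step=5, min_votes=2, seed=0):
    """Slide a window over the series; keep points flagged in at least min_votes windows."""
    rng = random.Random(seed)
    votes = [0] * len(xs)
    for start in range(0, len(xs) - window + 1, step):
        stop = start + window
        for k in ransac_line(xs[start:stop], ys[start:stop], rng=rng):
            votes[start + k] += 1
    return [i for i, v in enumerate(votes) if v >= min_votes]


# Synthetic light curve: a linear trend plus three artificially brightened points.
times = [float(t) for t in range(40)]
flux = [2.0 * t + 1.0 for t in times]
for idx in (12, 13, 27):
    flux[idx] += 5.0

candidates = flare_candidates(times, flux)
```

Because the windows overlap, each point is examined more than once, and requiring agreement between windows is what suppresses isolated false positives.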

Anomaly Detection using Optimized Features using Genetic Algorithm and MultiEnsemble Classifier

IJOSTHE ◽

10.24113/ojssports.v5i6.79 ◽

2018 ◽

Vol 5 (6) ◽

pp. 7

Author(s):

Apoorva Deshpande ◽

Ramnaresh Sharma

Keyword(s):

Machine Learning ◽

Genetic Algorithm ◽

Intrusion Detection ◽

Anomaly Detection ◽

Detection System ◽

Research Work ◽

Ensemble Classifier ◽

Machine Learning Algorithms ◽

Data Set ◽

Machine Learning Classification

Anomaly detection systems play an important role in network security. An anomaly detection or intrusion detection model is a predictive model used to classify network data traffic as normal or intrusion. Machine learning algorithms are used to build accurate models for clustering, classification and prediction. In this paper, classification and predictive models for intrusion detection are built using machine learning classification algorithms, namely Random Forest. These algorithms are tested with the KDD-99 data set. In this research work, the model for anomaly detection is based on normalized, reduced features and a multilevel ensemble classifier. The work is divided into two stages. In the first stage, the data is normalized using mean normalization. In the second stage, a genetic algorithm is used to reduce the number of features, and a multilevel ensemble classifier is then used to classify the data into different attack groups. The result analysis indicates that intrusions can be classified more efficiently with the reduced feature set.
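
The first stage can be sketched under one common definition of mean normalization, x' = (x - mean) / (max - min). The abstract does not give the exact formula, so this definition, the function names, and the toy values are assumptions.

```python
def mean_normalize(column):
    """Scale values to (x - mean) / (max - min); a constant column maps to all zeros."""
    mean = sum(column) / len(column)
    spread = max(column) - min(column)
    if spread == 0:
        return [0.0 for _ in column]
    return [(x - mean) / spread for x in column]


def normalize_features(rows):
    """Apply mean normalization to each feature (column) of a list of equal-length rows."""
    columns = [mean_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Toy records with two features on very different scales (values are made up).
records = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
normalized = normalize_features(records)
```

After this step every feature is centered on zero and spans a range of one, so no single feature dominates because of its units.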

Comparison of Bagging and Boosting Ensemble Machine Learning Methods for Automated EMG Signal Classification

BioMed Research International ◽

10.1155/2019/9152506 ◽

2019 ◽

Vol 2019 ◽

pp. 1-13 ◽

Cited By ~ 5

Author(s):

Emine Yaman ◽

Abdulhamit Subasi

Keyword(s):

Machine Learning ◽

Neuromuscular Disorders ◽

Real Life ◽

Kappa Statistic ◽

Ensemble Classifier ◽

Machine Learning Algorithms ◽

Ensemble Classifiers ◽

Learning Methods ◽

Emg Signal ◽

Ensemble Machine Learning

Neuromuscular disorders are diagnosed using electromyographic (EMG) signals. Machine learning algorithms are employed as a decision support system to diagnose neuromuscular disorders. This paper compares bagging and boosting ensemble learning methods to classify EMG signals automatically. Even though the efficacy of ensemble classifiers on real-life problems has been presented in numerous studies, almost no studies focus on the feasibility of bagging and boosting ensemble classifiers to diagnose neuromuscular disorders. Therefore, the purpose of this paper is to assess the feasibility of bagging and boosting ensemble classifiers to diagnose neuromuscular disorders through the use of EMG signals. The method has three steps. The first step is to calculate the wavelet packet coefficients (WPC) for every type of EMG signal. Next, statistical values of the WPC are calculated to represent the distribution of the wavelet coefficients. In the last step, an ensemble classifier takes the extracted features as input to diagnose the neuromuscular disorders. Experimental results showed that the ensemble classifiers achieved better performance for the diagnosis of neuromuscular disorders. The results are promising and showed that AdaBoost with a random forest ensemble method achieved an accuracy of 99.08%, an F-measure of 0.99, an AUC of 1, and a kappa statistic of 0.99.
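
The kappa statistic reported alongside accuracy corrects the observed agreement for agreement expected by chance. A minimal sketch of Cohen's kappa follows; the confusion matrix is invented for illustration and is unrelated to the EMG results, and the function name is hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: true class, columns: predicted class)."""
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    # Chance agreement: product of the row and column marginals, summed over classes.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Made-up two-class confusion matrix: 35 of 50 samples are classified correctly.
matrix = [[20, 5], [10, 15]]
kappa = cohen_kappa(matrix)
```

Here the raw accuracy is 0.70, but half of that agreement would be expected by chance, so kappa is considerably lower than the accuracy.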

Detection of phishing websites using a novel twofold ensemble model

Journal of Systems and Information Technology ◽

10.1108/jsit-09-2017-0074 ◽

2018 ◽

Vol 20 (3) ◽

pp. 321-357 ◽

Cited By ~ 2

Author(s):

Kalyan Nagaraj ◽

Biplab Bhattacharjee ◽

Amulyashree Sridhar ◽

Sharvani GS

Keyword(s):

Mean Squared Error ◽

Statistical Tests ◽

Ensemble Classifier ◽

Superior Performance ◽

Ensemble Model ◽

Ensemble Classifiers ◽

Time Data ◽

Data Set ◽

Content Type ◽

Detection Techniques

Purpose: Phishing is one of the major threats affecting businesses worldwide in current times. Organizations and customers face the hazards arising out of phishing attacks because of anonymous access to vulnerable details. Such attacks often result in substantial financial losses. Thus, there is a need for effective intrusion detection techniques to identify and possibly nullify the effects of phishing. Classifying phishing and non-phishing web content is a critical task in information security protocols, and foolproof mechanisms have yet to be implemented in practice. The purpose of the current study is to present an ensemble machine learning model for classifying phishing websites. Design/methodology/approach: A publicly available data set comprising 10,068 instances of phishing and legitimate websites was used to build the classifier model. Feature extraction was performed by deploying a group of methods, and the relevant features extracted were used for building the model. A twofold ensemble learner was developed by feeding the results of a random forest (RF) classifier into a feedforward neural network (NN). Performance of the ensemble classifier was validated using k-fold cross-validation. The twofold ensemble learner was implemented as a user-friendly, interactive decision support system for classifying websites as phishing or legitimate ones. Findings: Experimental simulations were performed to assess and compare the performance of the ensemble classifiers. The statistical tests indicated that the RF_NN model gave superior performance, with an accuracy of 93.41 per cent and a minimal mean squared error of 0.000026. Research limitations/implications: The research data set used in this study is publicly available and easy to analyze. Comparative analysis with other real-time data sets of recent origin must be performed to ensure generalization of the model against various security breaches. Different variants of phishing threats must also be detected, rather than focusing only on phishing website detection. Originality/value: To the best of the authors' knowledge, the twofold ensemble model has not been applied to the classification of phishing websites in any previous study.
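
The k-fold cross-validation used to validate the ensemble can be sketched as an index-splitting routine. The abstract does not state the value of k or how the folds were drawn, so the shuffling, the seed, and the name `k_fold_indices` are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Ten samples split into five folds of two validation samples each.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample is used for validation exactly once and for training in the remaining k - 1 rounds, so the averaged score uses all of the data without scoring a model on samples it was trained on.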

Hyperparameter Tuning and Pipeline Optimization via Grid Search Method and Tree-Based AutoML in Breast Cancer Prediction

Journal of Personalized Medicine ◽

10.3390/jpm11100978 ◽

2021 ◽

Vol 11 (10) ◽

pp. 978

Author(s):

Siti Fairuz Mat Radzi ◽

Muhammad Khalis Abdul Karim ◽

M Iqbal Saripan ◽

Mohd Amiruddin Abdul Rahman ◽

Iza Nurzawani Che Isa ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Model Selection ◽

Principal Component ◽

Receiver Operating Curve ◽

Support Vector ◽

Grid Search ◽

Breast Cancer Data ◽

Data Set ◽

Cancer Data

Automated machine learning (AutoML) has been recognized as a powerful tool for building systems that automate the design and optimize the model selection of machine learning (ML) pipelines. In this study, we present a tree-based pipeline optimization tool (TPOT) as a method for determining ML models with significant performance and less complex breast cancer diagnostic pipelines. Some features of pre-processors and ML models are defined as expression trees and optimal genetic programming (GP) pipelines, a stochastic search system. Features of radiomics have been presented as a guide for the ML pipeline selection from the breast cancer data set based on TPOT. Breast cancer data were used in a comparative analysis of the TPOT-generated ML pipelines with the selected ML classifiers, optimized by a grid search approach. The pipeline combining principal component analysis (PCA) with random forest (RF) classification was proven to be the most reliable pipeline with the lowest complexity. The TPOT model selection technique exceeded the performance of grid search (GS) optimization. The RF classifier showed an outstanding outcome amongst the models in combination with only two pre-processors, with a precision of 0.83. The grid-search-optimized support vector machine (SVM) classifier generated a difference of 12% in comparison, while the other two classifiers, naïve Bayes (NB) and artificial neural network multilayer perceptron (ANN-MLP), generated a difference of almost 39%. The method’s performance was assessed using sensitivity, specificity, accuracy, precision, and receiver operating curve (ROC) analysis.
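
The grid search baseline amounts to scoring every combination of candidate hyperparameter values and keeping the best one. The sketch below shows only that exhaustive loop: `toy_score` is a stand-in for a cross-validated model score, and the parameter names and values are hypothetical rather than those used in the study.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(c, gamma):
    """Stand-in for a validation score; it peaks at c=1.0 and gamma=0.1."""
    return -((c - 1.0) ** 2) - (gamma - 0.1) ** 2


grid = {"c": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the grid sizes (nine evaluations here), which is one reason stochastic searches such as genetic programming are considered for larger pipeline spaces.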

Aggregate Linear Discriminate Analyzed Feature Extraction and Ensemble of Bootstrap with Knn Classifier for Malicious Tumour Detection

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c4802.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 3686-3694

Keyword(s):

Feature Extraction ◽

Classification Accuracy ◽

Extraction Methods ◽

Ensemble Classifier ◽

Detection Accuracy ◽

Ensemble Classifiers ◽

Data Set ◽

Tumour Detection ◽

Tumour Classification ◽

Classification Feature

Tumour detection medical applications utilize classification techniques to categorize malicious and non-malicious tumour features to provide an efficient medical diagnosis of the human individual under investigation. To enable efficient classification, feature extraction methods are used to eliminate redundant features and obtain the most relevant ones. However, challenges concerning the dimensionality and size of the tumour dataset persist. To address them, this paper aims to maximize the malicious tumour classification accuracy using two reliable ensemble classifiers, namely Bootstrap Aggregation and k-nearest neighbour. Tumour features are extracted by Aggregate Linear Discriminate Analysis (LDA), and the feature distance is calculated with an iterative scattering matrix algorithm. The extracted features are further refined by aggregation to select the most effective feature values. After this, an ensemble classifier technique is employed to construct malicious and non-malicious tumour classes. The tumour classification is based on an ensemble of bagging and k-nearest neighbour. Simulation is carried out on the Tumour Repository data set to show that the proposed ensemble classifiers have considerably better tumour detection accuracy than existing conventional techniques. Numerical performance evaluations show that the proposed method achieves an 8% improvement in classification accuracy for malicious tumour detection in human individuals.
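
The combination of bootstrap aggregation with a k-nearest-neighbour base classifier can be sketched on one-dimensional toy data. This is a generic bagging illustration, not the paper's method: the feature values, the number of estimators, k, and the function names are all assumptions.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points nearest to query (1-D features)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def bagged_knn_predict(train, query, n_estimators=11, k=3, seed=0):
    """Vote over k-NN predictions, each made on a bootstrap resample of train."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_estimators):
        sample = [rng.choice(train) for _ in train]  # bootstrap: draw with replacement
        votes[knn_predict(sample, query, k)] += 1
    return votes.most_common(1)[0][0]


# Two well-separated toy classes described by a single made-up feature.
data = [(x, "non-malicious") for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
data += [(x, "malicious") for x in (8.0, 8.5, 9.0, 9.5, 10.0)]

low = bagged_knn_predict(data, 1.2)
high = bagged_knn_predict(data, 9.1)
```

Each estimator sees a slightly different resample, so the vote averages out the sensitivity of k-NN to individual training points.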

Exchange Spin Coupling from Gaussian Process Regression

10.26434/chemrxiv.12589541.v3 ◽

2020 ◽

Author(s):

Marc Philipp Bahlke ◽

Natnael Mogos ◽

Jonny Proppe ◽

Carmen Herrmann

Keyword(s):

Machine Learning ◽

Gaussian Process ◽

Gaussian Process Regression ◽

Molecular Magnets ◽

Molecular Structures ◽

Spin Coupling ◽

Structure Property ◽

Data Set ◽

Uncertainty Estimates

Heisenberg exchange spin coupling between metal centers is essential for describing and understanding the electronic structure of many molecular catalysts, metalloenzymes, and molecular magnets for potential application in information technology. We explore the machine-learnability of exchange spin coupling, which has not been studied yet. We employ Gaussian process regression since it can potentially deal with small training sets (as likely associated with the rather complex molecular structures required for exploring spin coupling) and since it provides uncertainty estimates (“error bars”) along with predicted values. We compare a range of descriptors and kernels for 257 small dicopper complexes and find that a simple descriptor based on chemical intuition, consisting only of copper-bridge angles and copper-copper distances, clearly outperforms several more sophisticated descriptors when it comes to extrapolating towards larger experimentally relevant complexes. Exchange spin coupling is similarly easy to learn as the polarizability, while learning dipole moments is much harder. The strength of the sophisticated descriptors lies in their ability to linearize structure-property relationships, to the point that a simple linear ridge regression performs just as well as the kernel-based machine-learning model for our small dicopper data set. The superior extrapolation performance of the simple descriptor is unique to exchange spin coupling, reinforcing the crucial role of choosing a suitable descriptor, and highlighting the interesting question of the role of chemical intuition vs. systematic or automated selection of features for machine learning in chemistry and material science.

Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found that our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The accuracy comparison of RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.

In silico Prediction of Inhibitory Constant of Thrombin Inhibitors Using Machine Learning

Combinatorial Chemistry & High Throughput Screening ◽

10.2174/1386207322666181220130232 ◽

2019 ◽

Vol 21 (9) ◽

pp. 662-669 ◽

Cited By ~ 1

Author(s):

Junnan Zhao ◽

Lu Zhu ◽

Weineng Zhou ◽

Lingfeng Yin ◽

Yuchen Wang ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Regression Tree ◽

Large Data ◽

Thrombin Inhibitors ◽

Coagulation Cascade ◽

Gradient Boosting ◽

Support Vector ◽

Data Set ◽

Descriptor Selection

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade, which is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because it can find non-intuitive regularities in high-dimensional datasets, machine learning can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best one among these methods, with R²=0.84, MSE=0.55 for the training set and R²=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
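
The two reported metrics can be sketched as follows, assuming R² denotes the coefficient of determination, 1 - SS_res / SS_tot. The abstract does not say which definition was used, and the observed and predicted values below are invented.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up observed and predicted values for four compounds.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

error = mse(observed, predicted)
fit = r_squared(observed, predicted)
```

Reporting both metrics for the training and the test set, as the abstract does, makes it possible to check that the fit does not degrade noticeably on unseen compounds.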
