A practical approach for applying Machine Learning in the detection and classification of network devices used in building management

With the increasing deployment of smart buildings and infrastructure, Supervisory Control and Data Acquisition (SCADA) devices and the underlying IT network have become essential to the proper operation of these highly complex systems. As automation grows and SCADA devices proliferate, the attack surface of critical infrastructure grows correspondingly. Distinguishing, in near-real time, device behaviors that are known and understood (or potentially qualified) from those that are unknown and potentially nefarious is a key component of any security solution. In this paper, we investigate the challenges of building robust machine learning models that identify unknowns purely from network traffic both inside and outside firewalls: missing or inconsistent labels across sites, feature engineering and learning, temporal dependencies and analysis, and training data quality (including small sample sizes) for both shallow and deep learning methods. To demonstrate these challenges and the capabilities we have developed, we focus on Building Automation and Control networks (BACnet) from a private commercial building system. Our results show that a "Model Zoo" built from binary classifiers for each device or behavior, combined with an ensemble classifier that integrates information from all of them, provides a reliable methodology for identifying unknown devices as well as determining specific known devices when the device type is in the training set. The capability of the Model Zoo framework is shown to be directly linked to feature engineering and learning, and the dependence on feature selection varies with both the binary and ensemble classifiers.
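The idea of one model per device type feeding an ensemble decision can be illustrated with a small sketch. This is not the authors' implementation: each per-device model below is a simple distance-based scorer standing in for a trained binary classifier, the two-dimensional features and device names are hypothetical, and the `threshold` value is an arbitrary assumption.

```python
import math

def train_binary_scorer(positives):
    """Build a scorer for one device type from its training samples.

    Stand-in for a trained binary classifier (an assumption of this sketch):
    the score is 1.0 at the class centroid and decays with distance.
    """
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    # The largest training distance sets the scale; fall back to 1.0 if it is zero.
    spread = max(math.dist(p, centroid) for p in positives) or 1.0

    def score(x):
        return math.exp(-math.dist(x, centroid) / spread)

    return score

def build_model_zoo(labelled):
    """One scorer per device type, keyed by the device label."""
    return {device: train_binary_scorer(samples) for device, samples in labelled.items()}

def ensemble_predict(zoo, x, threshold=0.3):
    """Return the best-scoring device, or "unknown" if no model is confident."""
    scores = {device: scorer(x) for device, scorer in zoo.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"

# Hypothetical, already-normalised traffic features for two device types.
TRAINING = {
    "thermostat": [(0.10, 0.20), (0.15, 0.25), (0.12, 0.18)],
    "controller": [(0.80, 0.90), (0.85, 0.80), (0.90, 0.95)],
}
zoo = build_model_zoo(TRAINING)
print(ensemble_predict(zoo, (0.12, 0.20)))
print(ensemble_predict(zoo, (0.50, 0.50)))
```

A sample that no per-device model scores above the threshold is reported as unknown, which is the behaviour the abstract describes for devices absent from the training set.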


Automatic catalog of RR Lyrae from ∼14 million VVV light curves: How far can we go with traditional machine-learning?

Astronomy and Astrophysics ◽

10.1051/0004-6361/202038314 ◽

2020 ◽

Vol 642 ◽

pp. A58

Author(s):

J. B. Cabral ◽

F. Ramos ◽

S. Gurovich ◽

P. M. Granitto

Keyword(s):

Machine Learning ◽

Model Selection ◽

Broad Band ◽

Ensemble Classifier ◽

Light Curves ◽

Ensemble Classifiers ◽

Data Set ◽

Rr Lyrae ◽

Selection Step ◽

Sampling Procedures

Context. The creation of a 3D map of the bulge using RR Lyrae (RRL) is one of the main goals of the VISTA Variables in the Via Lactea Survey (VVV) and VVV(X) surveys. The overwhelming number of sources undergoing analysis undoubtedly requires the use of automatic procedures. In this context, previous studies have introduced the use of machine learning (ML) methods for the task of variable star classification. Aims. Our goal is to develop and test an entirely automatic ML-based procedure for the identification of RRLs in the VVV Survey. This automatic procedure is meant to be used to generate reliable catalogs integrated over several tiles in the survey. Methods. Following the reconstruction of light curves, we extracted a set of period- and intensity-based features, which were already defined in previous works. Also, for the first time, we put a new subset of useful color features to use. We discuss in considerable detail all the appropriate steps needed to define our fully automatic pipeline, namely: the selection of quality measurements; sampling procedures; classifier setup; and model selection. Results. As a result, we were able to construct an ensemble classifier with an average recall of 0.48 and average precision of 0.86 over 15 tiles. We also made all our processed datasets available and we published a catalog of candidate RRLs. Conclusions. Perhaps most interestingly, from a classification perspective based on photometric broad-band data, our results indicate that color is an informative feature type of the RRL objective class that should always be considered in automatic classification methods via ML. We also argue that recall and precision in both tables and curves are high-quality metrics with regard to this highly imbalanced problem. Furthermore, we show for our VVV dataset that to have good estimates, it is important to use the original distribution more abundantly than reduced samples with an artificial balance. Finally, we show that the use of ensemble classifiers helps resolve the crucial model selection step and that most errors in the identification of RRLs are related to low-quality observations of some sources or to the increased difficulty in resolving the RRL-C type given the data.
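The abstract's argument that recall and precision suit a highly imbalanced problem can be made concrete with a small sketch. The counts below are invented for illustration and are not taken from the VVV data; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented toy sample: 10 positives (e.g. RRL) among 1000 sources.
y_true = [1] * 10 + [0] * 990

# A classifier that never predicts the rare class still looks accurate.
all_negative = [0] * 1000

# A classifier that recovers 5 of the 10 positives with 1 false alarm.
selective = [1] * 5 + [0] * 5 + [1] * 1 + [0] * 989

for name, y_pred in [("all_negative", all_negative), ("selective", selective)]:
    print(name, accuracy(y_true, y_pred), precision_recall(y_true, y_pred))
```

Accuracy barely separates the two classifiers here, while precision and recall expose that the first one finds none of the rare objects.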


Object Detection using Feature Mining in a Distributed Machine Learning Framework

10.51202/9783186855107 ◽

2017 ◽

Author(s):

Arne Ehlers

Keyword(s):

Machine Learning ◽

Object Detection ◽

Training Data ◽

Visual Object ◽

Ensemble Classifiers ◽

Adaptive Boosting ◽

Learning Framework ◽

Theory Of Evidence ◽

Feature Mining ◽

Distributed Machine Learning

This dissertation addresses the problem of visual object detection based on machine-learned classifiers. A distributed machine learning framework is developed to learn detectors for several object classes, creating cascaded ensemble classifiers with the Adaptive Boosting algorithm. Methods are proposed that enhance several components of an object detection framework. First, the thesis deals with augmenting the training data in order to improve the performance of object detectors learned from sparse training sets. Second, feature mining strategies are introduced to create feature sets that are customized to the object class to be detected. Furthermore, a novel class of fractal features is proposed that makes it possible to represent a wide variety of shapes. Third, a method is introduced that models and combines internal confidences and uncertainties of the cascaded detector using Dempster's theory of evidence in order to increase the quality of the post-processing. ...
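Dempster's rule of combination, the core operation of the theory of evidence mentioned above, can be sketched in a few lines. How the dissertation derives mass functions from the cascaded detector is not stated in the abstract, so the two mass functions and the `object`/`background` frame below are purely illustrative assumptions.

```python
from itertools import product

def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its belief mass.
    Mass landing on an empty intersection is conflict and is normalised away.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Hypothetical frame of discernment for a detection window.
OBJ = frozenset({"object"})
BG = frozenset({"background"})
EITHER = OBJ | BG  # mass assigned here expresses uncertainty

# Illustrative evidence from two sources (values are assumptions).
m_a = {OBJ: 0.6, EITHER: 0.4}
m_b = {OBJ: 0.5, BG: 0.2, EITHER: 0.3}

fused = combine(m_a, m_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Keeping explicit mass on the whole frame lets a source express uncertainty rather than being forced to split its belief between the two hypotheses.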


Democratizing AI: non-expert design of prediction tasks

PeerJ Computer Science ◽

10.7717/peerj-cs.296 ◽

2020 ◽

Vol 6 ◽

pp. e296

Author(s):

James P. Bagrow

Keyword(s):

Machine Learning ◽

Health Behavior ◽

Randomized Trial ◽

Recent Work ◽

Predictive Models ◽

Automatic Machine ◽

Training Data ◽

Feature Engineering ◽

Crowdsourced Data ◽

Current Events

Non-experts have long made important contributions to machine learning (ML) by contributing training data, and recent work has shown that non-experts can also help with feature engineering by suggesting novel predictive features. However, non-experts have only contributed features to prediction tasks already posed by experienced ML practitioners. Here we study how non-experts can design prediction tasks themselves, what types of tasks non-experts will design, and whether predictive models can be automatically trained on data sourced for their tasks. We use a crowdsourcing platform where non-experts design predictive tasks that are then categorized and ranked by the crowd. Crowdsourced data are collected for top-ranked tasks and predictive models are then trained and evaluated automatically using those data. We show that individuals without ML experience can collectively construct useful datasets and that predictive models can be learned on these datasets, but challenges remain. The prediction tasks designed by non-experts covered a broad range of domains, from politics and current events to health behavior, demographics, and more. Proper instructions are crucial for non-experts, so we also conducted a randomized trial to understand how different instructions may influence the types of prediction tasks being proposed. In general, a better understanding of how non-experts can contribute to ML can further leverage advances in automatic machine learning and has important implications as ML continues to drive workplace automation.


Identifying At-Risk Online Learners by Psychological Variables Using Machine Learning Techniques

Online Learning ◽

10.24059/olj.v24i4.2320 ◽

2020 ◽

Vol 24 (4) ◽

Author(s):

Hsiang-yu Chien ◽

Oi-Man Kwok ◽

Yu-Chen Yeh ◽

Noelle Wall Sweany ◽

Eunkyeng Baek ◽

...

Keyword(s):

Machine Learning ◽

At Risk ◽

Online Courses ◽

Small Sample ◽

Training Data ◽

Machine Learning Techniques ◽

Stepwise Logistic Regression ◽

Online Learners ◽

Psychological Variables ◽

Learning Techniques

The purpose of this study was to investigate a predictive model of online learners’ learning outcomes through machine learning. To create a model, we observed students’ motivation, learning tendencies, online learning-motivated attention, and supportive learning behaviors along with final test scores. A total of 225 college students who were taking online courses participated. Longitudinal data were collected over three semesters (T1, T2, and T3). T3 was used as training data given that it contained the largest sample size across all three data waves. To analyze the data, two approaches were applied: (a) stepwise logistic regression and (b) random forest (RF). Results showed that RF used fewer items and predicted final grades more accurately in a small sample. Furthermore, it selected four items that might potentially be used to identify at-risk learners even before they enroll in an online course.


Comparison of Bagging and Boosting Ensemble Machine Learning Methods for Automated EMG Signal Classification

BioMed Research International ◽

10.1155/2019/9152506 ◽

2019 ◽

Vol 2019 ◽

pp. 1-13 ◽

Cited By ~ 5

Author(s):

Emine Yaman ◽

Abdulhamit Subasi

Keyword(s):

Machine Learning ◽

Neuromuscular Disorders ◽

Real Life ◽

Kappa Statistic ◽

Ensemble Classifier ◽

Machine Learning Algorithms ◽

Ensemble Classifiers ◽

Learning Methods ◽

Emg Signal ◽

Ensemble Machine Learning

Neuromuscular disorders are diagnosed using electromyographic (EMG) signals, and machine learning algorithms are employed as decision support systems for this diagnosis. This paper compares bagging and boosting ensemble learning methods for classifying EMG signals automatically. Even though the efficacy of ensemble classifiers on real-life problems has been presented in numerous studies, almost no studies focus on the feasibility of bagging and boosting ensemble classifiers for diagnosing neuromuscular disorders. Therefore, the purpose of this paper is to assess that feasibility using EMG signals. The method has three steps. First, the wavelet packet coefficients (WPC) are calculated for every type of EMG signal. Second, statistical values of the WPC are calculated to represent the distribution of the wavelet coefficients. Finally, an ensemble classifier takes the extracted features as input to diagnose the neuromuscular disorders. Experimental results showed that the ensemble classifiers achieved better performance in diagnosing neuromuscular disorders. The results are promising: the AdaBoost with random forest ensemble method achieved an accuracy of 99.08%, F-measure 0.99, AUC 1, and kappa statistic 0.99.
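The first two steps of the method (wavelet packet coefficients, then per-band statistics) can be sketched as follows. The abstract does not name the wavelet, the decomposition depth, or the exact statistics, so the Haar wavelet, two levels, and the mean-absolute-value and standard-deviation features below are all assumptions of this sketch, as is the toy signal.

```python
import math
from statistics import mean, pstdev

def haar_step(x):
    """One orthonormal Haar split into approximation and detail coefficients."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def wavelet_packet(x, levels):
    """Full packet tree: unlike a plain wavelet transform, every band is split again."""
    bands = [list(x)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            next_bands.extend(haar_step(band))
        bands = next_bands
    return bands

def band_features(bands):
    """Two statistics per sub-band: mean absolute value and standard deviation."""
    feats = []
    for band in bands:
        feats.append(mean(abs(c) for c in band))
        feats.append(pstdev(band))
    return feats

# Toy signal; its length must be divisible by 2**levels.
signal = [4.0, 2.0, 5.0, 7.0, 1.0, 3.0, 6.0, 8.0]
bands = wavelet_packet(signal, levels=2)
features = band_features(bands)
print(len(bands), len(features))
```

The resulting feature vector is what a bagging or boosting ensemble would take as input in the third step.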


Flow Cytometry-Based Classification in Cancer Research: A View on Feature Selection

Cancer Informatics ◽

10.4137/cin.s30795 ◽

2015 ◽

Vol 14s5 ◽

pp. CIN.S30795 ◽

Cited By ~ 5

Author(s):

S. Sakira Hassan ◽

Pekka Ruusuvuori ◽

Leena Latonen ◽

Heikki Huttunen

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Cross Validation ◽

Small Sample ◽

Training Data ◽

Error Estimator ◽

Classification Models ◽

Learning Tasks ◽

Small Sample Sizes ◽

Model Selection Problem

In this paper, we study the problem of feature selection in cancer-related machine learning tasks. In particular, we study the accuracy and stability of different feature selection approaches within simplistic machine learning pipelines. Earlier studies have shown that for certain cases, the accuracy of detection can easily reach 100% given enough training data. Here, however, we concentrate on simplifying the classification models and seek feature selection approaches that are reliable even with extremely small sample sizes. We show that as much as 50% of features can be discarded without compromising the prediction accuracy. Moreover, we study the model selection problem along the ℓ1 regularization path of logistic regression classifiers. To this end, we compare a more traditional cross-validation approach with a recently proposed Bayesian error estimator.


Scalable Approach to High Coverages on Oxides via Iterative Training of a Machine-Learning Algorithm

10.26434/chemrxiv.10288514.v1 ◽

2019 ◽

Author(s):

Andrew Medford ◽

Shengchun Yang ◽

Fuzhu Liu

Keyword(s):

Machine Learning ◽

Chemical Potential ◽

Learning Algorithm ◽

Absolute Error ◽

Low Energy ◽

Training Data ◽

High Coverage ◽

Metal Compounds ◽

Adsorption Energies ◽

The Stability

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
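A scaling exponent such as the N^1.12 quoted above is typically estimated as the slope of a straight-line fit in log-log space. The sketch below shows that calculation only; the cost values are synthetic, generated from an assumed power law, and are not the paper's measurements.

```python
import math

def fit_power_law_exponent(sizes, costs):
    """Least-squares slope of log(cost) against log(size).

    If cost = c * size**k, then log(cost) = log(c) + k * log(size),
    so the fitted slope is the scaling exponent k.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic costs following an assumed power law with exponent 1.12.
sizes = [2, 4, 8, 16, 32]
costs = [3.0 * n ** 1.12 for n in sizes]

exponent = fit_power_law_exponent(sizes, costs)
print(round(exponent, 3))
```

An exponent close to 1 indicates near-linear growth of cost with the number of adsorbates, which is the claim the abstract makes.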


Optimization of Diabetes Training DATA using Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i2.283286 ◽

2018 ◽

Vol 6 (2) ◽

pp. 283-286

Author(s):

M. Samba Siva Rao ◽

M. Yaswanth ◽

K. Raghavendra Swamy ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data


Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing - FeatureEng '05

10.3115/1610230 ◽

2005 ◽

Keyword(s):

Machine Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Feature Engineering


Comparative Analysis of Machine Learning Techniques Using Predictive Modeling

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904164539 ◽

2020 ◽

Vol 13 ◽

Author(s):

Ritu Khandelwal ◽

Hemlata Goyal ◽

Rajveer Singh Shekhawat

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Data Science ◽

Training Data ◽

Machine Learning Techniques ◽

Future Trends ◽

Data Set ◽

Learning Stage ◽

Learning Techniques ◽

Different Types

Introduction: Machine learning is an intelligent technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal is to gain valuable insights from the available data. A large part of Indian cinema is Bollywood, which is a multi-million dollar industry. This paper attempts to predict whether an upcoming Bollywood movie will be a Blockbuster, Superhit, Hit, Average, or Flop. For this, machine learning techniques (classification and prediction) are applied. To build a classifier or prediction model, the first step is the learning stage, in which the training data set is used to train the model with some technique or algorithm; rules are then generated that help to build a model and predict future trends in different types of organizations. Methods: Classification and prediction techniques such as Support Vector Machine (SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, AdaBoost, and KNN are applied to find efficient and effective results. All of these can be applied through GUI-based workflows organized into categories such as Data, Visualize, Model, and Evaluate. Result: To build a classifier or prediction model, the first step is the learning stage, in which the training data set is used to train the model with some technique or algorithm; rules are then generated that help to build a model and predict future trends in different types of organizations. Conclusion: This paper focuses on a comparative analysis based on parameters such as accuracy and the confusion matrix to identify the best possible model for predicting movie success. Using advertisement propaganda, production houses can plan the best time to release a movie according to the predicted success rate to gain higher benefits. Discussion: Data mining is the process of discovering patterns in large data sets; the relationships it uncovers help to solve business problems and to predict forthcoming trends. This prediction can help production houses with advertisement propaganda and with planning their costs, and by accounting for these factors they can make a movie more profitable.
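The two comparison parameters named in the abstract, accuracy and the confusion matrix, can be computed for the five success classes as sketched below. The true and predicted labels are invented for illustration; the abstract reports no per-movie predictions.

```python
from collections import Counter

LABELS = ["Blockbuster", "Superhit", "Hit", "Average", "Flop"]

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Correct predictions lie on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0

# Invented labels for six movies.
y_true = ["Hit", "Flop", "Flop", "Average", "Blockbuster", "Hit"]
y_pred = ["Hit", "Flop", "Average", "Average", "Superhit", "Hit"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
for label, row in zip(LABELS, matrix):
    print(f"{label:12s}", row)
print("accuracy:", accuracy(matrix))
```

Computing the same matrix for each candidate model shows not only which model is most accurate but also which classes it tends to confuse, such as Flop versus Average.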
