Imbal-OL: Online Machine Learning from Imbalanced Data Streams in Real-world IoT

Current-generation real-world data sets processed through machine learning are imbalanced by nature. Such imbalanced data presents researchers with a challenging prediction scenario for both machine learning and data mining algorithms. Past research studies observe that most imbalanced data sets consist of major and minor classes, with the major class dominating the minor class. Several standard and hybrid prediction algorithms have been proposed in various application domains, but most of the real-time data sets analyzed in these studies are imbalanced by nature, which affects prediction accuracy. This paper presents a systematic survey of past research studies to analyze intrinsic data characteristics and the techniques used for handling class-imbalanced data. In addition, this study reveals the research gaps, trends, and patterns in existing studies and briefly discusses future research directions.

ADES: A New Ensemble Diversity-Based Approach for Handling Concept Drift

Mobile Information Systems ◽

10.1155/2021/5549300 ◽

2021 ◽

Vol 2021 ◽

pp. 1-17

Author(s):

Tinofirei Museba ◽

Fulufhelo Nelwamondo ◽

Khmaies Ouahada

Keyword(s):

Machine Learning ◽

Real World ◽

Data Streams ◽

Predictive Models ◽

Concept Drift ◽

Dynamic Environments ◽

Real World Data ◽

World Data ◽

Different Types ◽

Concept Drifts

Beyond applying machine learning predictive models to static tasks, a significant corpus of research applies them to streaming environments that incur concept drift. With the prevalence of streaming real-world applications associated with changes in the underlying data distribution, the need for applications capable of adapting to evolving, time-varying dynamic environments can hardly be overstated. Dynamic environments are nonstationary and change with time, and the target variables to be predicted by the learning algorithm often evolve with time, a phenomenon known as concept drift. Most work on handling concept drift focuses on updating the prediction model so that it can recover from concept drift, while little effort has been dedicated to formulating a learning system capable of learning different types of drifting concepts at any time with minimum overhead. This work proposes a novel, evolving data stream classifier called Adaptive Diversified Ensemble Selection Classifier (ADES) that significantly optimizes adaptation to different types of concept drift at any time and improves convergence to new concepts by exploiting different amounts of ensemble diversity. The ADES algorithm generates diverse base classifiers, thereby optimizing the margin distribution to exploit ensemble diversity and formulate an ensemble classifier that generalizes well to unseen instances and recovers quickly from different types of concept drift. Empirical experiments conducted on both artificial and real-world data streams demonstrate that ADES can adapt to different types of drift at any given time. The prediction performance of ADES is compared to that of three other ensemble classifiers designed to handle concept drift, using both artificial and real-world data streams. The comparative evaluation demonstrates the ability of ADES to handle different types of concept drift. The experimental results, including statistical test results, indicate performance comparable with other algorithms designed to handle concept drift and demonstrate the significance and effectiveness of ADES.

A Multi-Tier Streaming Analytics Model of 0-Day Ransomware Detection Using Machine Learning

Applied Sciences ◽

10.3390/app10093210 ◽

2020 ◽

Vol 10 (9) ◽

pp. 3210 ◽

Cited By ~ 1

Author(s):

Hiba Zuhair ◽

Ali Selamat ◽

Ondrej Krejcar

Keyword(s):

Machine Learning ◽

Information Systems ◽

Data Streams ◽

Classification Accuracy ◽

Imbalanced Data ◽

Learner Model ◽

Related Information ◽

Proposed Model ◽

Hybrid Machine ◽

Dynamic Traits

Desktop and portable platform-based information systems have become the most tempting targets of crypto and locker ransomware attacks over the last decades. Hence, researchers have developed anti-ransomware tools to assist the Windows platform in thwarting ransomware attacks, protecting information, preserving users’ privacy, and securing inter-related information systems across the Internet. Furthermore, they have used machine learning to devise useful anti-ransomware tools that detect sophisticated versions. However, such tools remain sub-optimal in efficacy, partial in analyzing ransomware traits, ineffective at learning from significant and imbalanced data streams, limited in attributing versions to their ancestor families, and indecisive about fusing multi-descent versions. In this paper, we propose a hybrid machine learner model, a multi-tiered streaming analytics model that classifies various ransomware versions of 14 families by learning 24 static and dynamic traits. The proposed model classifies ransomware versions into their ancestor families numerically and fuses those of multi-descent families statistically. Thus, it classifies ransomware versions within a 40K corpus of ransomware, malware, and goodware versions in both semi-realistic and realistic environments. The superiority of this ransomware streaming analytics model over competing anti-ransomware technologies is demonstrated experimentally and justified critically, with an average of 97% classification accuracy, a 2.4% mistake rate, and a 0.34% miss rate under comparative and realistic tests.

Machine learning with asymmetric abstention for biomedical decision-making

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-021-01655-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Mariem Gandouz ◽

Hajo Holzmann ◽

Dominik Heider

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Decision Making ◽

Real World ◽

Computational Models ◽

Imbalanced Data ◽

Classification Performance ◽

Biomedical Data ◽

Decision Boundary ◽

Human Decision

Machine learning and artificial intelligence have entered biomedical decision-making for diagnostics, prognostics, and therapy recommendations. However, these methods need to be interpreted with care because of the severe consequences for patients. In contrast to human decision-making, computational models typically make a decision even with low confidence. Machine learning with abstention better reflects human decision-making by introducing a reject option for samples with low confidence. The abstention intervals are typically symmetric intervals around the decision boundary. In the current study, we use asymmetric abstention intervals, which we demonstrate to be better suited for biomedical data, which is typically highly imbalanced. We evaluate symmetric and asymmetric abstention on three real-world biomedical datasets and show that both approaches can significantly improve classification performance. However, asymmetric abstention rejects as many or fewer samples compared to symmetric abstention and thus should be used with imbalanced data.
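
The reject option described above can be illustrated with a minimal sketch. It assumes a binary classifier that outputs a positive-class probability and a decision boundary at 0.5; the function name and the interval endpoints are hypothetical and are not taken from the paper, which does not state here how its intervals are chosen.

```python
from typing import List, Optional, Tuple


def predict_with_abstention(p_pos: float, lower: float, upper: float) -> Optional[int]:
    """Return 1, 0, or None (abstain) for a positive-class probability.

    The classifier abstains when lower < p_pos < upper, i.e. when the
    score falls inside the abstention interval around the boundary.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("need 0 <= lower <= upper <= 1")
    if p_pos >= upper:
        return 1
    if p_pos <= lower:
        return 0
    return None  # low confidence: reject the sample


# Illustrative intervals only (assumed values, boundary assumed at 0.5).
SYMMETRIC: Tuple[float, float] = (0.4, 0.6)    # 0.5 - 0.1 and 0.5 + 0.1
ASYMMETRIC: Tuple[float, float] = (0.3, 0.55)  # wider below the boundary

scores: List[float] = [0.05, 0.35, 0.45, 0.58, 0.9]
symmetric_decisions = [predict_with_abstention(s, *SYMMETRIC) for s in scores]
asymmetric_decisions = [predict_with_abstention(s, *ASYMMETRIC) for s in scores]
```

With these assumed endpoints, the two intervals reject the same number of samples but not the same ones: the score 0.35 is rejected only by the asymmetric interval and the score 0.58 only by the symmetric one.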

Predicting Future Occurrence of Acute Hypotensive Episodes Using Noninvasive and Invasive Features

Military Medicine ◽

10.1093/milmed/usaa418 ◽

2021 ◽

Vol 186 (Supplement_1) ◽

pp. 445-451

Author(s):

Yifei Sun ◽

Navid Rashedi ◽

Vikrant Vaze ◽

Parikshit Shah ◽

Ryan Halter ◽

...

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Real World ◽

Short Term Memory ◽

Model Performance ◽

Learning Technologies ◽

Machine Learning Algorithms ◽

Support Vector ◽

K Nearest Neighbor ◽

Continuous Map

Introduction: Early prediction of the acute hypotensive episode (AHE) in critically ill patients has the potential to improve outcomes. In this study, we apply different machine learning algorithms to the MIMIC III Physionet dataset, containing more than 60,000 real-world intensive care unit records, to test commonly used machine learning technologies and compare their performance. Materials and Methods: Five classification methods, including K-nearest neighbor, logistic regression, support vector machine, random forest, and a deep learning method called long short-term memory, are applied to predict an AHE 30 minutes in advance. An analysis comparing model performance when including versus excluding invasive features was conducted. To further study the pattern of the underlying mean arterial pressure (MAP), we apply linear regression to predict the continuous MAP values over the next 60 minutes. Results: Support vector machine yields the best performance in terms of recall (84%). Including the invasive features in the classification improves the performance significantly, with both recall and precision increasing by more than 20 percentage points. We were able to predict the MAP 60 minutes in the future with a root mean square error (a frequently used measure of the differences between the predicted values and the observed values) of 10 mmHg. After converting continuous MAP predictions into AHE binary predictions, we achieve a 91% recall and 68% precision. In addition to predicting AHE, the MAP predictions provide clinically useful information regarding the timing and severity of the AHE occurrence. Conclusion: We were able to predict AHE with precision and recall above 80% 30 minutes in advance with the large real-world dataset. The predictions of the regression model can provide a more fine-grained, interpretable signal to practitioners. Model performance is improved by the inclusion of invasive features in predicting AHE, when compared to predicting the AHE based on only the available, restricted set of noninvasive technologies. This demonstrates the importance of exploring more noninvasive technologies for AHE prediction.
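
The evaluation steps mentioned in the abstract (root mean square error on continuous MAP values, then precision and recall after converting them to binary AHE predictions) can be sketched as follows. This is an illustrative sketch only: the 60 mmHg cutoff, the function names, and the sample values are assumptions, since the abstract does not state the study's AHE definition or data.

```python
import math
from typing import List, Sequence, Tuple


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def precision_recall(pred: Sequence[bool], true: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall for binary labels (True = episode)."""
    tp = sum(1 for p, t in zip(pred, true) if p and t)
    fp = sum(1 for p, t in zip(pred, true) if p and not t)
    fn = sum(1 for p, t in zip(pred, true) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical cutoff in mmHg, chosen for illustration only.
MAP_THRESHOLD_MMHG = 60.0

# Made-up MAP values in mmHg; not data from the study.
predicted_map: List[float] = [72.0, 58.0, 55.0, 66.0, 59.0]
observed_map: List[float] = [70.0, 62.0, 52.0, 57.0, 56.0]

error = rmse(predicted_map, observed_map)
predicted_ahe = [m < MAP_THRESHOLD_MMHG for m in predicted_map]
observed_ahe = [m < MAP_THRESHOLD_MMHG for m in observed_map]
precision, recall = precision_recall(predicted_ahe, observed_ahe)
```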

The graph neural networking challenge

ACM SIGCOMM Computer Communication Review ◽

10.1145/3477482.3477485 ◽

2021 ◽

Vol 51 (3) ◽

pp. 9-16

Author(s):

José Suárez-Varela ◽

Miquel Ferriol-Galmés ◽

Albert López ◽

Paul Almasan ◽

Guillermo Bernárdez ◽

...

Keyword(s):

Machine Learning ◽

Computer Networks ◽

Real World ◽

Large Scale ◽

Lessons Learned ◽

Educational Resources ◽

Global Competition ◽

International Telecommunication Union ◽

International Telecommunication ◽

Broad Audience

During the last decade, Machine Learning (ML) has increasingly become a hot topic in the field of Computer Networks and is expected to be gradually adopted for a plethora of control, monitoring and management tasks in real-world deployments. This creates a need for new generations of students, researchers and practitioners with a solid background in ML applied to networks. In 2020, the International Telecommunication Union (ITU) organized the "ITU AI/ML in 5G challenge", an open global competition that introduced a broad audience to some of the current main challenges in ML for networks. This large-scale initiative gathered 23 different challenges proposed by network operators, equipment manufacturers and academia, and attracted a total of 1300+ participants from 60+ countries. This paper narrates our experience organizing one of the proposed challenges: the "Graph Neural Networking Challenge 2020". We describe the problem presented to participants, the tools and resources provided, some organization aspects and participation statistics, an outline of the top-3 awarded solutions, and a summary of some lessons learned throughout this journey. As a result, this challenge leaves a curated set of educational resources openly available to anyone interested in the topic.

A Review of Machine Learning Classification Using Quantum Annealing for Real-World Applications

SN Computer Science ◽

10.1007/s42979-021-00751-0 ◽

2021 ◽

Vol 2 (5) ◽

Author(s):

Rajdeep Kumar Nath ◽

Himanshu Thapliyal ◽

Travis S. Humble

Keyword(s):

Machine Learning ◽

Real World ◽

Quantum Annealing ◽

Machine Learning Classification ◽

Real World Applications

A novel multi-stage ensemble model with multiple K-means-based selective undersampling: An application in credit scoring

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201954 ◽

2021 ◽

Vol 40 (5) ◽

pp. 9471-9484

Author(s):

Yilun Jin ◽

Yanan Liu ◽

Wenyu Zhang ◽

Shuai Zhang ◽

Yu Lou

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Credit Scoring ◽

Imbalanced Data ◽

Ensemble Model ◽

Selective Sampling ◽

Machine Learning Methods ◽

Multi Stage ◽

Proposed Model ◽

New Feature

With the advancement of machine learning, credit scoring can be performed better. As one of the widely recognized machine learning methods, ensemble learning has demonstrated significant improvements in the predictive accuracy over individual machine learning models for credit scoring. This study proposes a novel multi-stage ensemble model with multiple K-means-based selective undersampling for credit scoring. First, a new multiple K-means-based undersampling method is proposed to deal with the imbalanced data. Then, a new selective sampling mechanism is proposed to select the better-performing base classifiers adaptively. Finally, a new feature-enhanced stacking method is proposed to construct an effective ensemble model by composing the shortlisted base classifiers. In the experiments, four datasets with four evaluation indicators are used to evaluate the performance of the proposed model, and the experimental results prove the superiority of the proposed model over other benchmark models.
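
To make the idea of K-means-based undersampling concrete, here is a generic sketch and not the authors' multi-stage method: it clusters the majority class with plain Lloyd's K-means and keeps the one sample nearest each centroid. The function names, the deterministic initialization, and the toy points are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points: Sequence[Point], k: int, iterations: int = 20) -> List[Point]:
    """Lloyd's algorithm with a deterministic init (the first k points)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: _dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [_mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids


def kmeans_undersample(majority: Sequence[Point], n_keep: int) -> List[Point]:
    """Keep n_keep majority samples: the one nearest each centroid."""
    remaining = list(majority)
    kept: List[Point] = []
    for centroid in kmeans(majority, n_keep):
        best = min(remaining, key=lambda p: _dist2(p, centroid))
        kept.append(best)
        remaining.remove(best)  # never pick the same sample twice
    return kept


# Toy majority class with two visible groups (made-up values).
majority_class = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                  (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
kept_samples = kmeans_undersample(majority_class, n_keep=2)
```

Choosing representatives per cluster, rather than sampling at random, is one way to shrink the majority class while still covering its distinct regions.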

Deep Learning Classification of Canine Behavior Using a Single Collar-Mounted Accelerometer: Real-World Validation

Animals ◽

10.3390/ani11061549 ◽

2021 ◽

Vol 11 (6) ◽

pp. 1549

Author(s):

Robert D. Chambers ◽

Nathanael C. Yoder ◽

Aletha B. Carson ◽

Christian Junge ◽

David E. Allen ◽

...

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Real World ◽

Learning Algorithm ◽

Drinking Behavior ◽

True Positive Rate ◽

Training Dataset ◽

Activity Levels ◽

Accelerometer Data ◽

Activity Monitors

Collar-mounted canine activity monitors can use accelerometer data to estimate dog activity levels, step counts, and distance traveled. With recent advances in machine learning and embedded computing, much more nuanced and accurate behavior classification has become possible, giving these affordable consumer devices the potential to improve the efficiency and effectiveness of pet healthcare. Here, we describe a novel deep learning algorithm that classifies dog behavior at sub-second resolution using commercial pet activity monitors. We built machine learning training databases from more than 5000 videos of more than 2500 dogs and ran the algorithms in production on more than 11 million days of device data. We then surveyed project participants representing 10,550 dogs, which provided 163,110 event responses to validate real-world detection of eating and drinking behavior. The resultant algorithm displayed a sensitivity and specificity for detecting drinking behavior (0.949 and 0.999, respectively) and eating behavior (0.988, 0.983). We also demonstrated detection of licking (0.772, 0.990), petting (0.305, 0.991), rubbing (0.729, 0.996), scratching (0.870, 0.997), and sniffing (0.610, 0.968). We show that the devices’ position on the collar had no measurable impact on performance. In production, users reported a true positive rate of 95.3% for eating (among 1514 users), and of 94.9% for drinking (among 1491 users). The study demonstrates the accurate detection of important health-related canine behaviors using a collar-mounted accelerometer. We trained and validated our algorithms on a large and realistic training dataset, and we assessed and confirmed accuracy in production via user validation.
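
For readers less familiar with the two metrics reported above, the following sketch shows how sensitivity and specificity are computed from confusion-matrix counts. The function name and the counts are hypothetical and are not taken from the study.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true events that were detected.
    specificity = TN / (TN + FP): share of non-events correctly left unflagged.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up counts for illustration only; not data from the study.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=980, fp=20)
```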
