Classification Problem in Imbalanced Datasets

Deep learning has gained much popularity in recent years owing to GPU advancements, improvements in cloud computing, and its superior accuracy when trained on massive datasets. Like other machine learning models, deep learning models may perform poorly when trained on imbalanced datasets. In this paper, we focus on the trajectory classification problem and examine deep learning techniques for coping with imbalanced class data. We extend a deep learning model, called DeepeST (Deep Learning for Sub-Trajectory classification), to predict the class or label of sub-trajectories from imbalanced datasets. To the authors' knowledge, DeepeST is the first deep learning model for trajectory classification that provides approaches for coping with imbalanced datasets. We perform experiments with three real datasets of LBSN (Location-Based Social Network) trajectories to identify the user of a sub-trajectory (similar to the Trajectory-User Linking problem). We show that DeepeST outperforms other state-of-the-art deep learning approaches in accuracy, precision, recall, and F1-score.
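The abstract does not say which imbalance-handling techniques DeepeST uses. As a generic illustration only, one common option is to weight each class inversely to its frequency so that rare classes contribute more to the training loss. The sketch below assumes hypothetical user labels (`user_a`, `user_b`) and a hypothetical helper name; it is not taken from the paper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Hypothetical helper for illustration; rare classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label set: 8 sub-trajectories of one user, 2 of another.
labels = ["user_a"] * 8 + ["user_b"] * 2
weights = inverse_frequency_weights(labels)
print(weights)
```

Weights of this kind are typically passed to a weighted loss function, so that misclassifying a sample from the minority class costs more than misclassifying one from the majority class.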

Prognostication Stereotype of Patients Morbidity and Mortality by Extraction of E-Health Records

International Journal of Emerging Research in Management and Technology ◽ 10.23956/ijermt.v6i6.271 ◽ 2018 ◽ Vol 6 (6) ◽ pp. 215

Author(s): Sunitha .T ◽ Shyamala .J ◽ Annie Jesus Suganthi Rani.A

Keyword(s): At Risk ◽ Risk Prediction ◽ Health Risks ◽ Preventive Intervention ◽ Classification Problem ◽ Health Examination ◽ Disease Area ◽ Health Records ◽ Mortality And Morbidity ◽ Main Motive

Data mining offers an innovative way to build prognostic models of patients' health risks. The large volume of Electronic Health Records (EHRs) collected over the years provides a rich base for risk analysis and prediction. An EHR contains digitally stored healthcare information about an individual, such as observations, laboratory tests, diagnostic reports, medications, procedures, patient-identifying information, and allergies. A special type of EHR is the Health Examination Record (HER) from annual general health check-ups. Identifying participants at risk based on their current and past HERs is important for early warning and preventive intervention. By “risk”, we mean unwanted outcomes such as mortality and morbidity. This approach is limited by the classification problem and, consequently, is not informative about the specific disease area in which a person is at risk. The limited amount of data extracted from health records is insufficient for accurate risk prediction. The main motive of this project is risk prediction that classifies a progressively developing situation when the majority of the data is unlabeled.

Solving the Message Classification Problem in Voice Interaction Systems

Vestnik MEI ◽ 10.24160/1993-6982-2020-5-132-139 ◽ 2020 ◽ Vol 5 (5) ◽ pp. 132-139

Author(s): Ivan E. Kurilenko ◽ Igor E. Nikonov

Keyword(s): Artificial Intelligence ◽ Classification Problem ◽ Subject Area ◽ Text Messages ◽ Case Based Reasoning ◽ Short Text ◽ Proposed Modification ◽ Voice Interaction ◽ Text Content ◽ Case Based

This paper considers a method for classifying short text messages, namely sentences spoken by customers over an organization's telephone line. To solve this problem, a classifier was developed that combines two methods: a description of the subject area as a hierarchy of entities, and plausible reasoning based on the case-based reasoning approach, which is actively used in artificial intelligence systems. In various problems of artificial intelligence-based data analysis, these methods have shown a high degree of efficiency, scalability, and independence from data structure. Within the case-based reasoning part of the classifier, it is proposed to modify the TF-IDF (Term Frequency - Inverse Document Frequency) measure of text content so that it takes into account known information about the distribution of documents by topic. The proposed modification improves classification quality compared with classical measures, since it takes into account the distribution of words not only in a separate document or topic but in the entire database of cases. Experimental results are presented that confirm the effectiveness of the proposed metric and of the developed classifier when applied to classifying customer sentences and providing customers with the necessary information depending on the classification result. The developed text classification service prototype is used as part of the voice interaction module, with the aim of automating the telephone call routing system and shifting user-system interaction from buttons to voice.
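For reference, the classical TF-IDF measure that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the standard formula only; the topic-aware modification proposed by the authors is not specified in the abstract and is not reproduced here. The function name and the sample customer sentences are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classical TF-IDF for a list of tokenized documents.

    tf(term, doc) = count of term in doc / number of tokens in doc
    idf(term)     = log(number of docs / number of docs containing term)
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical short customer utterances, already tokenized.
docs = [
    ["check", "my", "balance"],
    ["block", "my", "card"],
    ["check", "card", "status"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus, "balance" occurs in only one utterance and therefore receives a higher weight than "my", which occurs in two.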

TOOLS AND PECULIARITIES OF SOLVING THE CLASSIFICATION PROBLEM IN CREDIT SCORING SYSTEMS

CONTINUUM. MATHS. INFORMATICS. EDUCATION ◽ 10.24888/2500-1957-2020-17-1-51-59 ◽ 2020 ◽ Vol 17 (1) ◽ pp. 51-59

Author(s): A.A. Grishin ◽ S.P. Stroyev

Keyword(s): Credit Scoring ◽ Scoring Systems ◽ Classification Problem

Faculty Opinions recommendation of The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽ 10.3410/f.725375858.793530299 ◽ 2017

Author(s): Michael Barnes ◽ David Watson

Keyword(s): Imbalanced Datasets ◽ Binary Classifiers

Machine Learning for the Classification of Alzheimer’s Disease and Its Prodromal Stage Using Brain Diffusion Tensor Imaging Data: A Systematic Review

Processes ◽ 10.3390/pr8091071 ◽ 2020 ◽ Vol 8 (9) ◽ pp. 1071

Author(s): Lucia Billeci ◽ Asia Badolato ◽ Lorenzo Bachi ◽ Alessandro Tonacci

Keyword(s): Machine Learning ◽ Systematic Review ◽ Alzheimer’s Disease ◽ Diffusion Tensor Imaging ◽ Magnetic Resonance ◽ Diffusion Tensor ◽ Classification Problem ◽ Computer Algorithms ◽ Imaging Data

Alzheimer’s disease is notoriously the most common cause of dementia in the elderly, affecting an increasing number of people. Although widespread, its causes and progression modalities are complex and still not fully understood. Through neuroimaging techniques, such as diffusion Magnetic Resonance (MR), more sophisticated and specific studies of the disease can be performed, offering a valuable tool for both its diagnosis and early detection. However, processing large quantities of medical images is not an easy task, and researchers have turned their attention towards machine learning, a set of computer algorithms that automatically adapt their output towards the intended goal. In this paper, a systematic review of recent machine learning applications on diffusion tensor imaging studies of Alzheimer’s disease is presented, highlighting the fundamental aspects of each work and reporting their performance score. A few examined studies also include mild cognitive impairment in the classification problem, while others combine diffusion data with other sources, like structural magnetic resonance imaging (MRI) (multimodal analysis). The findings of the retrieved works suggest a promising role for machine learning in evaluating effective classification features, like fractional anisotropy, and in possibly performing on different image modalities with higher accuracy.

Imbalanced datasets in the generation of fuzzy classification systems - an investigation using a multiobjective evolutionary algorithm based on decomposition

2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) ◽ 10.1109/fuzz-ieee.2016.7737859 ◽ 2016 ◽ Cited By ~ 4

Author(s): Edward Hinojosa Cardenas ◽ Heloisa A. Camargo ◽ Yvan J. Tupac

Keyword(s): Evolutionary Algorithm ◽ Classification Systems ◽ Fuzzy Classification ◽ Imbalanced Datasets ◽ Multiobjective Evolutionary Algorithm

Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal Of Big Data ◽ 10.1186/s40537-021-00488-w ◽ 2021 ◽ Vol 8 (1)

Author(s): Yahya Albalawi ◽ Jim Buckley ◽ Nikola S. Nikolov

Keyword(s): Social Media ◽ Deep Learning ◽ Comprehensive Evaluation ◽ Classification Problem ◽ Data Sets ◽ Word Embeddings ◽ Data Set ◽ Lower Accuracy ◽ Health Related ◽ The Impact

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another only for testing the generality of our models. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best traditional classifier we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
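Because the abstract reports both accuracy and F1 score, and the two rank the models differently, it may help to recall how the metrics can diverge when classes are imbalanced. The sketch below uses made-up labels (not data from the paper) and a hypothetical helper name: a classifier that misses most of the rare positive class still scores high accuracy but a low F1.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1 score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Illustrative data: 10 positives among 100 samples.
y_true = [1] * 10 + [0] * 90
# The classifier finds only 2 of the 10 positives and raises no false alarms.
y_pred = [1] * 2 + [0] * 8 + [0] * 90

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Here 92 of the 100 predictions are correct, yet recall on the positive class is only 0.2, which pulls the F1 score down to about one third.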

Advancing Stress Detection Methodology with Deep Learning Techniques Targeting UX Evaluation in AAL Scenarios: Applying Embeddings for Categorical Variables

Electronics ◽ 10.3390/electronics10131550 ◽ 2021 ◽ Vol 10 (13) ◽ pp. 1550

Author(s): Alexandros Liapis ◽ Evanthia Faliagka ◽ Christos P. Antonopoulos ◽ Georgios Keramidas ◽ Nikolaos Voros

Keyword(s): Machine Learning ◽ Deep Learning ◽ User Experience ◽ Electrodermal Activity ◽ Binary Classification ◽ Research Question ◽ Classification Problem ◽ Categorical Variables ◽ Stress Detection ◽ Software Failures

Physiological measurements have been widely used by researchers and practitioners to address the stress detection challenge. So far, various datasets for stress detection have been recorded and are available to the research community for testing and benchmarking. The majority of the available stress-related datasets were recorded while users were exposed to intense stressors, such as songs, movie clips, major hardware/software failures, image datasets, and gaming scenarios. However, it remains an open research question whether such datasets can be used to create models that effectively detect stress in different contexts. This paper investigates the performance of the publicly available physiological dataset named WESAD (wearable stress and affect detection) in the context of user experience (UX) evaluation. More specifically, electrodermal activity (EDA) and skin temperature (ST) signals from WESAD were used to train three traditional machine learning classifiers and a simple feed-forward deep learning artificial neural network combining continuous variables and entity embeddings. For the binary classification problem (stress vs. no stress), high accuracy (up to 97.4%) was achieved with both training approaches (deep learning and machine learning). Regarding the stress detection effectiveness of the created models in another context, such as user experience (UX) evaluation, the results were quite impressive. More specifically, the deep learning model achieved a rather high agreement when a user-annotated dataset was used for validation.
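The idea of combining continuous variables with entity embeddings can be sketched as follows: each value of a categorical variable is mapped to a small dense vector, and that vector is concatenated with the continuous features to form the network input. The abstract does not say which categorical variables were embedded, so the `activity` categories, the embedding size, and the helper names below are assumptions for illustration. In a real network the embedding table would be learned during training rather than left at its random initial values.

```python
import random


def make_embedding_table(num_categories, dim, seed=0):
    """Create a randomly initialized embedding table (one row per category)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)]


def build_input(continuous, category_index, table):
    """Concatenate continuous features with the embedding of one category."""
    return list(continuous) + list(table[category_index])


# Hypothetical categorical variable; not named in the abstract.
activities = ["sitting", "walking", "typing"]
table = make_embedding_table(len(activities), dim=2)

# Continuous features: an EDA value and a skin temperature value (made up).
x = build_input([0.41, 33.2], activities.index("walking"), table)
print(x)
```

The resulting vector has four entries: the two continuous signals followed by the two-dimensional embedding of the chosen category.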

Feature Selection and Ensemble Learning Techniques in One-Class Classifiers: An Empirical Study of Two-Class Imbalanced Datasets

IEEE Access ◽ 10.1109/access.2021.3051969 ◽ 2021 ◽ Vol 9 ◽ pp. 13717-13726

Author(s): Chih-Fong Tsai ◽ Wei-Chao Lin

Keyword(s): Feature Selection ◽ Empirical Study ◽ Ensemble Learning ◽ Imbalanced Datasets ◽ Learning Techniques
