A Systematic Methodology on Class Imbalanced Problems involved in the Classification of Real-World Datasets

2019
Vol 8 (3)
pp. 7071-7081

Real-world data sets processed with machine learning are frequently imbalanced by nature. This imbalance presents researchers with a challenging prediction scenario for both machine learning and data mining algorithms. Past research studies show that most imbalanced data sets consist of majority and minority classes, with the majority class dominating the minority class. Several standard and hybrid prediction algorithms have been proposed across application domains, but most of the real-world data sets analyzed in these studies are imbalanced by nature, which degrades prediction accuracy. This paper presents a systematic survey of past research studies, analyzing the intrinsic data characteristics and the techniques used for handling class-imbalanced data. In addition, the study reveals research gaps, trends, and patterns in existing studies and briefly discusses future research directions.
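
The accuracy problem the survey highlights is easy to reproduce. In the sketch below (our illustration with scikit-learn, not code from the paper), a baseline that always predicts the majority class reaches high plain accuracy on a 95/5 class split, while balanced accuracy exposes that the minority class is never predicted.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data set: 95% majority class, 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A "classifier" that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
pred = baseline.predict(X_te)

print("plain accuracy:   ", accuracy_score(y_te, pred))           # ~0.95, looks strong
print("balanced accuracy:", balanced_accuracy_score(y_te, pred))  # 0.50, reveals the failure
```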

Author(s):  
Lincy Mathews
Seetha Hari

A very challenging issue in real-world data is that in many domains, such as medicine, finance, marketing, the web, telecommunications, and management, the distribution of data among classes is inherently imbalanced. A widely accepted finding is that traditional classification algorithms assume a balanced distribution among the classes. Data imbalance arises when the number of instances representing the class of interest is much smaller than that of the other classes. Consequently, classifiers tend to be biased toward the well-represented class, which leads to a higher misclassification rate for the under-represented class. Hence, efficient learners are needed to classify imbalanced data. This chapter addresses the need, challenges, existing methods, and evaluation metrics identified when learning from imbalanced data sets. Future research challenges and directions are also highlighted.
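
Many learning libraries expose class weighting as a first-line remedy for the majority-class bias described above. The sketch below is our illustration with scikit-learn, not code from the chapter: classes are re-weighted inversely to their frequency so that minority-class errors cost more during training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority class (label 1) typically improves with weighting.
print(classification_report(y_te, plain.predict(X_te), digits=3))
print(classification_report(y_te, weighted.predict(X_te), digits=3))
```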



Author(s):  
Yuguang Yan
Mingkui Tan
Yanwu Xu
Jiezhang Cao
Michael Ng
et al.

The issue of data imbalance occurs in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal ones. One of the most important approaches to alleviating this issue is oversampling, which synthesizes minority class samples to balance the class sizes. However, existing methods barely consider the global geometric information involved in the distribution of minority class samples, and thus may incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits the global geometric information of the data to make synthetic samples follow a distribution similar to that of the minority class samples. Moreover, we introduce a novel regularization based on the synthetic samples and shift the distribution of minority class samples according to loss information. Experiments on toy and real-world data sets demonstrate the efficacy of the proposed method in terms of multiple metrics.
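
For context, the simplest form of interpolation-based oversampling can be sketched in a few lines of NumPy. This is a SMOTE-style toy shown only to illustrate the general approach the paper improves upon; it uses only local neighbour information, and therefore lacks exactly the global geometric awareness that the optimal-transport formulation adds.

```python
import numpy as np

def interpolate_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (plain NumPy)."""
    rng = np.random.default_rng(rng)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)          # anchor samples
    nb = neighbours[base, rng.integers(0, k, size=n_new)]   # random neighbour each
    lam = rng.random((n_new, 1))                            # interpolation weights
    return X_min[base] + lam * (X_min[nb] - X_min[base])

X_min = np.random.default_rng(0).normal(size=(20, 2))  # toy minority class
synthetic = interpolate_oversample(X_min, n_new=80, k=3, rng=1)
print(synthetic.shape)  # (80, 2)
```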


2020
Author(s):  
Renato Cordeiro de Amorim

In a real-world data set there is always the possibility, rather high in our opinion, that different features have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm. The first K-Means-based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since, but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based on K-Means.
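
The mechanism being surveyed is easy to state concretely: inject per-feature weights into the distance used for cluster assignment, so that more relevant features dominate. Below is a minimal Lloyd-style sketch with fixed weights, our own illustration; the algorithms the paper reviews additionally update the weights at each iteration.

```python
import numpy as np

def weighted_kmeans(X, k, w, n_iter=100, rng=0):
    """K-Means where the squared Euclidean distance is scaled per feature
    by the fixed weight vector w."""
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Weighted squared distance of every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2 * w).sum(axis=-1)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 3)),
               np.random.default_rng(2).normal(3, 1, (50, 3))])
w = np.array([1.0, 1.0, 0.1])  # third feature down-weighted as less relevant
labels, centroids = weighted_kmeans(X, k=2, w=w)
```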


2020
Vol 34 (04)
pp. 4691-4698
Author(s):  
Shu Li
Wen-Tao Li
Wei Wang

In many real-world applications, the data have several disjoint sets of features, each of which is called a view. Researchers have developed many multi-view learning methods in the past decade. In this paper, we bring the Graph Convolutional Network (GCN) into multi-view learning and propose Co-GCN, a novel multi-view semi-supervised learning method that adaptively exploits the graph information from the multiple views with combined Laplacians. Experimental results on real-world data sets verify that Co-GCN can achieve better performance than state-of-the-art multi-view semi-supervised methods.
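
The combination of per-view graph information can be illustrated directly: compute a normalized Laplacian for each view and mix them with convex weights before feeding the result to a GCN. The sketch below is our hedged illustration of that combination step only; in Co-GCN itself the view weights are learned adaptively during training, not fixed.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# Two toy views over the same 4 nodes, each with its own adjacency matrix.
A_view1 = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
A_view2 = np.array([[0,0,1,1],[0,0,1,0],[1,1,0,0],[1,0,0,0]], dtype=float)

alphas = np.array([0.6, 0.4])  # fixed view weights (sum to 1), for illustration
L = alphas[0] * normalized_laplacian(A_view1) + alphas[1] * normalized_laplacian(A_view2)
print(L.round(2))
```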


AI Magazine
2008
Vol 29 (3)
pp. 93
Author(s):  
Prithviraj Sen
Galileo Namata
Mustafa Bilgic
Lise Getoor
Brian Galligher
et al.

Many real-world applications produce networked data such as the world-wide web (hypertext documents connected via hyperlinks), social networks (for example, people connected by friendship links), communication networks (computers connected via communication links) and biological networks (for example, protein interaction networks). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such networks. In this article, we provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.
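
One widely used inference scheme in this family is iterative classification: each node is repeatedly re-classified from its own attributes plus summary statistics of its neighbours' current labels, until the labelling stabilizes. The sketch below is our simplified toy of that scheme, not the article's experimental code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ica(features, adj, labels, known_mask, n_rounds=10):
    """features: (n, d) node attributes; adj: {node: [neighbours]};
    labels: (n,) int labels (placeholders where unknown); known_mask: bool (n,)."""
    n_classes = len(np.unique(labels[known_mask]))

    def with_neighbour_counts(y):
        # Append a histogram of neighbour labels to each node's own features.
        counts = np.zeros((len(features), n_classes))
        for v, nbrs in adj.items():
            for u in nbrs:
                counts[v, y[u]] += 1
        return np.hstack([features, counts])

    y = labels.copy()
    y[~known_mask] = np.bincount(labels[known_mask]).argmax()  # bootstrap unknowns
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        X_full = with_neighbour_counts(y)
        clf.fit(X_full[known_mask], y[known_mask])
        y_new = y.copy()
        y_new[~known_mask] = clf.predict(X_full[~known_mask])
        if np.array_equal(y_new, y):   # labels stabilized
            break
        y = y_new
    return y

# Toy: 4 nodes in a chain, the first two labelled.
feats = np.array([[0.1], [0.9], [0.15], [0.85]])
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = np.array([0, 1, 0, 0])
known = np.array([True, True, False, False])
print(ica(feats, adj, labels, known))
```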


2016
Vol 28 (1)
pp. 216-228
Author(s):  
David Gómez
Alfonso Rojas

A sizable amount of research has been done to improve the mechanisms for knowledge extraction such as machine learning classification or regression. Quite unintuitively, the no free lunch (NFL) theorem states that all optimization problem strategies perform equally well when averaged over all possible problems. This fact seems to clash with the effort put forth toward better algorithms. This letter explores empirically the effect of the NFL theorem on some popular machine learning classification techniques over real-world data sets.
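
The flavour of such an empirical exploration can be reproduced in a few lines of scikit-learn: cross-validate several popular classifiers on a real-world data set and compare their mean accuracies. The data set and model choices below are ours, purely illustrative of the approach.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "k-nearest neighbours": KNeighborsClassifier(),
    "SVM (RBF)": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
# Despite the NFL theorem's average-case equivalence, performance on any
# one concrete data set usually differs across techniques.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:22s} mean accuracy = {scores.mean():.3f}")
```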


Entropy
2021
Vol 23 (5)
pp. 507
Author(s):  
Piotr Białczak
Wojciech Mazurczyk

Malicious software uses the HTTP protocol for communication, creating network traffic that is hard to identify because it blends into the traffic generated by benign applications. To help track and identify such traffic, fingerprinting tools have been developed that provide a short representation of malicious HTTP requests. However, existing tools either do not analyze all of the information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from parts of the request such as the URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. We have performed an extensive experimental evaluation of the developed solution using real-world data sets and compared Hfinger with the most related and popular existing tools, such as FATT, Mercury, and p0f. The effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8 to 34 times lower than for the existing tools. Moreover, unlike these tools, Hfinger in its default mode does not introduce collisions between malware and benign applications, at the cost of increasing the number of fingerprints by at most a factor of three. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than the other standard tools.
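
To make the idea concrete, a fingerprint of this kind condenses salient, stable parts of a request into a short, human-readable string. The toy sketch below is our illustration of the general shape of such a representation; Hfinger's actual feature set and output format are richer and differ in detail.

```python
def toy_fingerprint(method, uri, headers, body=b""):
    """Condense structural traits of an HTTP request into a short,
    human-readable fingerprint string (toy illustration only)."""
    path, _, query = uri.partition("?")
    parts = [
        method,
        str(path.count("/")),                          # URI depth
        str(len(query.split("&")) if query else 0),    # number of query params
        ",".join(sorted(h.lower() for h in headers)),  # header names present
        str(len(body)),                                # payload length
    ]
    return "|".join(parts)

fp = toy_fingerprint(
    "GET", "/gate.php?id=42&os=win",
    {"Host": "example.com", "User-Agent": "Mozilla/4.0", "Accept": "*/*"},
)
print(fp)  # GET|1|2|accept,host,user-agent|0
```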


Author(s):  
Martyna Daria Swiatczak

This study assesses the extent to which the two main Configurational Comparative Methods (CCMs), i.e. Qualitative Comparative Analysis (QCA) and Coincidence Analysis (CNA), produce different models. It further explains how this non-identity is due to the different algorithms upon which the two methods are based, namely QCA's Quine–McCluskey algorithm and the CNA algorithm. I offer an overview of the fundamental differences between QCA and CNA and demonstrate both underlying algorithms on three data sets of ascending proximity to real-world data. Subsequent simulation studies in scenarios of varying sample sizes and degrees of noise in the data show high overall ratios of non-identity between the QCA parsimonious solution and the CNA atomic solution for varying analytical choices, i.e. different consistency and coverage threshold values and different ways of deriving QCA's parsimonious solution. Clarity on the contrasts between the two methods should enable scholars to make more informed decisions on their methodological approaches, enhance their understanding of what happens behind the results generated by the software packages, and better navigate the interpretation of results. Clarity on the non-identity between the underlying algorithms and its consequences for the results should also provide a basis for a methodological discussion about which method, and which variants thereof, are more successful in reaching which search target.
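
The consistency and coverage thresholds referred to above are standard set-theoretic ratios. For crisp sets, consistency measures how reliably a condition is sufficient for the outcome, and coverage measures how much of the outcome the condition accounts for. A minimal sketch of these two standard formulas:

```python
import numpy as np

def consistency(X, Y):
    """Share of cases with condition X = 1 that also show outcome Y = 1."""
    return (X & Y).sum() / X.sum()

def coverage(X, Y):
    """Share of cases with outcome Y = 1 that are covered by X = 1."""
    return (X & Y).sum() / Y.sum()

X = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # candidate condition
Y = np.array([1, 1, 0, 0, 1, 1, 0, 1])  # outcome
print(consistency(X, Y))  # 4/5 = 0.8
print(coverage(X, Y))     # 4/5 = 0.8
```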


2021
pp. 1-24
Author(s):  
Avidit Acharya
Kirk Bansak
Jens Hainmueller

We introduce a constrained priority mechanism that combines outcome-based matching from machine learning with preference-based allocation schemes common in market design. Using real-world data, we illustrate how our mechanism could be applied to the assignment of refugee families to host country locations, and kindergarteners to schools. Our mechanism allows a planner to first specify a threshold $\bar g$ for the minimum acceptable average outcome score that should be achieved by the assignment. In the refugee matching context, this score corresponds to the probability of employment, whereas in the student assignment context, it corresponds to standardized test scores. The mechanism is a priority mechanism that considers both outcomes and preferences by assigning agents (refugee families and students) based on their preferences, but subject to meeting the planner’s specified threshold. The mechanism is both strategy-proof and constrained efficient in that it always generates a matching that is not Pareto dominated by any other matching that respects the planner’s threshold.
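
A simplified sketch of the idea: process agents in priority order, give each their most-preferred location with remaining capacity, but skip choices under which the threshold $\bar g$ would become unreachable. The feasibility test below uses an optimistic upper bound and is purely illustrative; the paper's mechanism handles constrained efficiency more carefully than this toy does.

```python
def assign(agents, scores, prefs, capacity, g_bar):
    """agents: ids in priority order; scores[(a, loc)]: outcome score of
    agent a at location loc; prefs[a]: a's locations in preference order;
    capacity[loc]: seats; g_bar: minimum acceptable mean outcome score."""
    capacity = dict(capacity)          # avoid mutating the caller's dict
    assignment, total = {}, 0.0
    n = len(agents)
    for i, a in enumerate(agents):
        later = agents[i + 1:]
        for loc in prefs[a]:
            if capacity[loc] == 0:
                continue
            cap = dict(capacity)
            cap[loc] -= 1
            # Optimistic bound: every later agent takes its best-scoring
            # still-open location (conflicts between them are ignored).
            bound = total + scores[(a, loc)] + sum(
                max((scores[(b, l)] for l in cap if cap[l] > 0), default=0.0)
                for b in later
            )
            if bound >= g_bar * n:     # threshold still reachable: accept
                assignment[a] = loc
                total += scores[(a, loc)]
                capacity[loc] -= 1
                break
    return assignment

agents = ["f1", "f2"]                  # f1 has higher priority
scores = {("f1", "A"): 0.7, ("f1", "B"): 0.2,
          ("f2", "A"): 0.6, ("f2", "B"): 0.5}
prefs = {"f1": ["B", "A"], "f2": ["A", "B"]}
# f1 prefers B, but assigning it would make an average of 0.55 unreachable,
# so f1 is steered to A and f2 then takes B.
print(assign(agents, scores, prefs, {"A": 1, "B": 1}, g_bar=0.55))
# -> {'f1': 'A', 'f2': 'B'}
```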

