Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management

Effective Multi-Label Classification Using Data Preprocessing

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch005 ◽

2021 ◽

pp. 90-109

Author(s):

Vaishali S. Tidake ◽

Shirish S. Sane

Keyword(s):

Hamming Distance ◽

State Of The Art ◽

Nearest Neighbors ◽

Data Preprocessing ◽

The State ◽

Distance Metrics ◽

Feature Similarity ◽

Improved Performance ◽

Using Data

Feature similarity is the usual basis for exploring nearest neighbors. Examples in multi-label datasets, however, are associated with multiple labels, so combining label dissimilarity with feature similarity may reveal better neighbors. The devised MLFLD and MLFLD-MAXP algorithms exploit the information extracted from such neighbors. Of the three distance metrics used to compute label dissimilarity, Hamming distance showed the greatest improvement in performance and is therefore used for further evaluation. The performance of the implemented algorithms is compared with the state-of-the-art MLkNN algorithm; they showed an improvement for some datasets only. This chapter introduces the parameters MLE and skew, which, along with an outlier parameter, help to analyze the multi-label and imbalanced nature of datasets. Investigating the datasets with respect to these parameters, together with experimentation, revealed the need for data preprocessing to remove outliers. This preprocessing improved the performance of the implemented algorithms on all measures, and its effectiveness is empirically validated.
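
As an illustration of the label dissimilarity mentioned in this abstract, the following minimal sketch computes the Hamming distance between two binary label vectors. It is not the MLFLD or MLFLD-MAXP algorithm; the function names `hamming_distance` and `normalized_hamming` and the example vectors are assumptions made here for illustration.

```python
def hamming_distance(labels_a, labels_b):
    """Count the positions at which two binary label vectors differ."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    return sum(a != b for a, b in zip(labels_a, labels_b))


def normalized_hamming(labels_a, labels_b):
    """Hamming distance divided by the number of labels (range 0 to 1)."""
    if not labels_a:
        raise ValueError("label vectors must not be empty")
    return hamming_distance(labels_a, labels_b) / len(labels_a)


# Hypothetical label sets over four labels (1 = label present).
x = [1, 0, 1, 0]
y = [1, 1, 0, 0]
print(hamming_distance(x, y))    # 2
print(normalized_hamming(x, y))  # 0.5
```

Two examples that are close in feature space but have a large label distance would, under this kind of measure, count as dissimilar neighbors.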

Indian Judgment Categorization for Practicing Similar Judgment Identification

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch013 ◽

2021 ◽

pp. 232-241

Author(s):

Jenish Dhanani ◽

Rupa G. Mehta ◽

Dipti P. Rana ◽

Rahul Lad ◽

Amogh Agrawal ◽

...

Keyword(s):

Search Space ◽

Automated System ◽

Court Case ◽

Criminal Cases ◽

Legal Information ◽

Law System ◽

Legal Professionals ◽

Related Information ◽

Legal Document ◽

Legal Domain

Legal information retrieval has recently emerged as an essential practice for the legal fraternity. In the legal domain, a judgment is a specific kind of legal document that discusses case-related information and the verdict of a court case. In the common law system, legal professionals draw on relevant judgments to prepare arguments, so an automated system that identifies similar judgments effectively is in high demand. Judgments can be broadly categorized into civil and criminal cases, and judgments on similar case matters tend to be more relevant to one another than judgments on different case matters. In similar judgment identification, categorizing judgments can significantly prune the search space by restricting the search to a specific case category. This chapter therefore provides a novel methodology that classifies Indian judgments into one of the two case categories. Crucial challenges specific to similarity analysis of Indian judgments, such as imbalance and the intrinsic characteristics of legal data, are also highlighted and may motivate further work by the research community.

A Novel Feature Correlation Approach for Brand Spam Detection

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch008 ◽

2021 ◽

pp. 149-161

Author(s):

Bharat Tidke ◽

Swati Tidke

Keyword(s):

State Of The Art ◽

The Internet ◽

Spam Detection ◽

Feature Engineering ◽

Novel Approach ◽

Feature Correlation ◽

The Given ◽

Improve State

In the age of the internet, people seldom want to make decisions entirely on their own. Whether purchasing a product, watching a movie, or reading a book, a person looks for reviews, often unaware that these reviews may not be true. This is the age of paid reviews, written not just to promote one's own product but also to demote a competitor's. The most critical are those that target the brand of a product. This chapter proposes a novel approach for brand spam detection that uses feature correlation to improve on state-of-the-art approaches. Correlation-based feature engineering is considered one of the finest methods for determining the relations among features. Several features attached to reviews are important for the strong decisions that customers and companies need to make: users when purchasing, and companies when improving sales and services. Because of severe spamming, it has become nearly impossible to judge whether a given review is trustworthy or fake.
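
The abstract does not say which correlation measure the chapter uses. As a hedged sketch of what measuring the relation between two review features can look like, the code below computes the Pearson correlation coefficient in plain Python. The feature names `review_length` and `rating` and their values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length feature columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical review features, one value per review.
review_length = [10, 20, 30, 40]
rating = [1, 2, 3, 4]
print(pearson(review_length, rating))
```

A coefficient near 1 or -1 indicates a strong linear relation between two features, while a value near 0 indicates little linear relation.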

Detection of Bot Accounts on Social Media Considering Its Imbalanced Nature

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch009 ◽

2021 ◽

pp. 162-176

Author(s):

Isha Y. Agarwal ◽

Dipti P. Rana ◽

Devanshi Bhatia ◽

Jay Rathod ◽

Kaneesha J. Gandhi ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Machine Learning Algorithms ◽

Security Threats ◽

Private Data ◽

Normal Behavior ◽

Social Media Platforms ◽

Harmful Content ◽

Negative Impacts ◽

Ease Of Access

Social media has completely transformed the way people communicate. However, every revolution brings some negative impacts. Because of their popularity among a vast number of users worldwide, these platforms hold a huge volume of data. The ease of access and the minimal verification of new users have led to the creation of bot accounts, which are used to collect private data and spread false and harmful content, and which pose many security threats. The growing number of bot accounts on different social media platforms has raised many concerns. There is also a high imbalance between bot and non-bot accounts, where the imbalance is a result of the 'normal behavior' of bot users. The research aims to identify artificial bot accounts on Twitter using various machine learning algorithms, based on features provided by the platform, and content-based classification, based on users' recent tweets.

Mitigating Data Imbalance Issues in Medical Image Analysis

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch004 ◽

2021 ◽

pp. 66-89

Author(s):

Debapriya Banik ◽

Debotosh Bhattacharjee

Keyword(s):

Learning Community ◽

Medical Diagnosis ◽

Medical Image ◽

Learning Algorithm ◽

Medical Image Analysis ◽

Imbalanced Data ◽

Disease Classification ◽

Data Imbalance ◽

Imbalance Problem ◽

Optimal Procedures

Medical images often suffer from data imbalance, which makes disease classification very difficult. The distribution of a medical dataset is imbalanced when a specific type of disease accounts for only a small proportion of the entire dataset. Analyzing such datasets is a significant challenge for the machine learning and deep learning community. A standard classification learning algorithm may be biased towards the majority class and ignore the minority class (the class of interest), which generally leads to wrong diagnoses. Addressing data imbalance in medical image datasets is therefore of utmost importance for the early prediction of disease, specifically cancer. This chapter explores different problems concerning data imbalance in medical diagnosis. The authors discuss different rebalancing strategies that offer guidelines for choosing appropriate procedures for training a classifier for efficient medical diagnosis.
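
The abstract does not list the rebalancing strategies it covers. As one generic illustration, the sketch below performs random oversampling, which duplicates minority-class samples until every class matches the majority count. The function name `random_oversample`, the class names, and the sample identifiers are assumptions for illustration only.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical image identifiers with a 4:1 class imbalance.
images = ["img0", "img1", "img2", "img3", "img4"]
diagnosis = ["benign", "benign", "benign", "benign", "malignant"]
new_images, new_diagnosis = random_oversample(images, diagnosis)
print(Counter(new_diagnosis))
```

Oversampling of this kind should be applied to the training split only, so that duplicated samples do not leak into evaluation data.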

Leaf Disease Detection Using AI

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch006 ◽

2021 ◽

pp. 110-136

Author(s):

Praveen Kumar Maduri ◽

Tushar Biswas ◽

Preeti Dhiman ◽

Apurva Soni ◽

Kushagra Singh

Keyword(s):

Machine Learning ◽

Clustering Algorithms ◽

Essential Elements ◽

Plant Diseases ◽

Machine Learning Techniques ◽

Learning Techniques ◽

Initial Stage ◽

Large Farm ◽

Time And Energy ◽

Financial Losses

Plants play a significant role in everyone's life. They provide essentials such as food, oxygen, and shelter, so they must be supervised and nurtured properly. During cultivation, crops are prone to various diseases that can severely damage the whole yield and cause financial losses for farmers. In the last 10 years, researchers have used different machine learning techniques to detect plant diseases, but the methods were either not efficient enough to be implemented or unable to cover the wide area over which plant diseases must be detected. The authors therefore introduce a method that is efficient enough to detect plant disease easily and can be implemented in large fields. It uses a combination of a CNN and the k-means clustering algorithm. Crop disease is detected by analyzing the leaves, and users are notified so that they can act at the initial stage. The proposed method thus prevents whole crops from being damaged and saves farmers time and energy, because disease is identified well before the human eye could detect it on a large farm.

Impact of Balancing Techniques for Imbalanced Class Distribution on Twitter Data for Emotion Analysis

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch012 ◽

2021 ◽

pp. 211-231

Author(s):

Shivani Vasantbhai Vora ◽

Rupa G. Mehta ◽

Shreyas Kishorkumar Patel

Keyword(s):

Data Augmentation ◽

Imbalanced Data ◽

Training Data ◽

Human Needs ◽

Class Distribution ◽

Redundant Data ◽

Imbalanced Class ◽

Understanding Emotions ◽

Imbalanced Class Distribution

Continuously growing technology enhances creativity, simplifies people's lives, and offers the possibility of anticipating and satisfying their unmet needs. Understanding emotions is a crucial part of understanding human behavior, and machines must understand emotions deeply to be able to predict human needs. Most tweets carry the sentiments of the user, and tweet data has an inherently imbalanced class distribution. Most machine learning (ML) algorithms are likely to become biased towards the majority classes. The imbalanced distribution of classes has gained extensive attention because it poses many research challenges and demands efficient approaches for handling imbalanced datasets. The strategies used for balancing the class distribution in the case study are handling redundant data, resampling training data, and data augmentation. Six methods related to these techniques are examined. Experiments on the Twitter dataset show that the merging-minority-classes and shuffle-sentence methods outperform the other techniques.
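
The abstract names a shuffle-sentence method but does not describe how it is implemented. One plausible reading, sketched here as an assumption, is a text augmentation that reorders the sentences of a tweet to create an additional training example for a minority class. The function name `shuffle_sentences` and the naive split on periods are choices made for this sketch.

```python
import random


def shuffle_sentences(text, seed=None):
    """Return a copy of `text` with its sentences in a random order.

    Sentences are split naively on '.', which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return (". ".join(sentences) + ".") if sentences else ""


# Hypothetical minority-class tweet.
tweet = "I missed the train. The rain would not stop. What a day."
print(shuffle_sentences(tweet, seed=1))
```

The augmented text keeps the same words and label as the original, so it adds variety to a minority class without inventing new content.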

A Survey of Class Imbalance Problem on Evolving Data Stream

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch002 ◽

2021 ◽

pp. 23-41

Author(s):

D. Himaja ◽

T. Maruthi Padmaja ◽

P. Radha Krishna

Keyword(s):

Change Detection ◽

Data Streams ◽

Data Stream ◽

Concept Drift ◽

Class Imbalance ◽

Detection Methods ◽

Class Imbalance Problem ◽

Imbalance Problem ◽

Learning From Data ◽

Main Emphasis

Learning from data streams with both online class imbalance and concept drift (OCI-CD) is receiving much attention. This problem degrades the performance of current models that learn from both stationary and non-stationary environments. In non-stationary environments, the imbalance makes it hard to spot concept drift using conventional drift detection methods, which track change based on the learner's performance. There is limited work on the combined problem for imbalanced evolving streams in both stationary and non-stationary environments. The data may evolve with complete labels or with only limited labels. This chapter's main emphasis is to present different methods for resolving class imbalance in evolving streams, covering both changing and unchanging environments and both fully supervised settings and settings with limited label availability.

An Experimental Analysis to Learn Data Imbalance in Scholarly Data

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch014 ◽

2021 ◽

pp. 242-254

Author(s):

Mitali Desai ◽

Rupa G. Mehta ◽

Dipti P. Rana

Keyword(s):

Real World ◽

Experimental Analysis ◽

Meteorological Data ◽

Medical Data ◽

Classification Problems ◽

Data Imbalance ◽

Scholarly Data ◽

Class Labels ◽

Media Data ◽

Existing Data

Data imbalance is a key challenge in the majority of real-world classification problems. It refers to a disparity in the number of data instances belonging to each class label. Data imbalance has been studied in detail in many data domains, such as transaction data, medical data, e-commerce data, meteorological data, social media data, and web data, but the scholarly data domain has yet to be analyzed for it. In this chapter, the scholarly data domain is explored with a focus on studying various forms of data imbalance. Real scholarly data is extracted from ResearchGate (RG), a popular scholarly platform. An extensive experimental analysis of the extracted data identifies the existence of both data-level and network-level imbalance. The outcome contributes to the understanding of the types of data imbalance that exist in scholarly data. Resolving this imbalance will substantially help in achieving efficient and accurate outcomes in many real-world scholarly literature applications.
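
The disparity between class labels described here is often summarized by the imbalance ratio, the size of the largest class divided by the size of the smallest. The abstract does not state which measure the chapter uses, so the sketch below is a generic illustration; the function name `imbalance_ratio` and the label values are assumptions.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (1.0 means balanced)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 90 instances of one class, 10 of the other.
labels = ["cited"] * 90 + ["uncited"] * 10
print(imbalance_ratio(labels))  # 9.0
```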

Gender-Based Tweet Analysis (GTA)

Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management ◽

10.4018/978-1-7998-7371-6.ch015 ◽

2021 ◽

pp. 255-267

Author(s):

Dipti P. Rana ◽

Navodita Saini

Keyword(s):

Mental Health ◽

Social Media ◽

Real World ◽

Gender Biases ◽

Implicit Biases ◽

Health Advice ◽

Gender Based ◽

Behavior Characteristics ◽

And Behavior ◽

Global World

Each gender has distinctive personality and behavior characteristics that can be naturally reflected in the language used on social media to review, spread information, build relationships, etc. This information is used by different agencies for profit. A magnified study of this information can reveal the implicit biases associated with its creators' gender. The gender ratio is imbalanced across the world, social media, discussions, etc. Twitter is used to discuss issues caused by COVID-19, such as its symptoms, mental health, and advice. This motivated the research to propose a methodology, gender-based tweet analysis (GTA), to study and magnify the impact of gender on the emotions in tweet data. The analysis of the experiment uncovered gender biases in the emotions of tweet data and highlighted future real-world applications that may become more productive if gender biases are considered for the safety and benefit of society.

Published By IGI Global
