Data Preprocessing, Active Learning, and Cost Perceptive Approaches for Resolving Data Imbalance - Advances in Data Mining and Database Management
Latest Publications


Total documents: 15 (five years: 15)
H-index: 0 (five years: 0)

Published By IGI Global

9781799873716, 9781799873730

Author(s): Vaishali S. Tidake, Shirish S. Sane

Nearest-neighbor search conventionally relies on feature similarity. Examples in multi-label datasets, however, are associated with multiple labels, so combining label dissimilarity with feature similarity may reveal better neighbors. The MLFLD and MLFLD-MAXP algorithms devised in this chapter exploit information extracted from such neighbors. Of the three distance metrics used to compute label dissimilarity, Hamming distance gave the largest improvement and is therefore used for further evaluation. The implemented algorithms are compared with the state-of-the-art MLkNN algorithm and improve on it for some datasets only. This chapter also introduces the parameters MLE and skew, which, together with an outlier parameter, help to analyze the multi-label and imbalanced nature of datasets. Investigating the datasets with these parameters, together with experimentation, showed the need for data preprocessing to remove outliers. This preprocessing improved the performance of the implemented algorithms on all measures, and its effectiveness is validated empirically.
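As a rough illustration of the label-dissimilarity idea, the sketch below computes a normalized Hamming distance between two binary label vectors. The function name `hamming_label_distance` and the normalization by vector length are assumptions made for this example; the abstract does not state how MLFLD or MLFLD-MAXP combine this value with feature similarity, so that step is not shown.

```python
from typing import Sequence


def hamming_label_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of label positions at which two binary label vectors differ.

    Hypothetical helper for illustration only; not taken from the chapter.
    """
    if len(a) != len(b):
        raise ValueError("label vectors must have the same length")
    if len(a) == 0:
        return 0.0
    # Count positions where the two label assignments disagree.
    differing = sum(x != y for x, y in zip(a, b))
    return differing / len(a)


# Two examples that share the first and last label assignments
# but disagree on the two middle labels.
print(hamming_label_distance([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.5
```

A value of 0 means the two examples carry identical label sets, and 1 means they disagree on every label.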


Author(s): Jenish Dhanani, Rupa G. Mehta, Dipti P. Rana, Rahul Lad, Amogh Agrawal, ...

Legal information retrieval has recently emerged as an essential practice for the legal fraternity. In the legal domain, a judgment is a specific kind of legal document that discusses case-related information and the verdict of a court case. In the common law system, legal professionals draw on relevant judgments to prepare arguments, so an automated system that identifies similar judgments effectively is in strong demand. Judgments can be broadly categorized into civil and criminal cases, and judgments with similar case matters tend to be more relevant to one another than judgments with different case matters. In similar-judgment identification, categorizing judgments can significantly prune the search space by restricting the search to a specific case category. This chapter therefore presents a novel methodology that classifies Indian judgments into one of the two case matters. Crucial challenges specific to the similarity analysis of Indian judgments, such as imbalance and the intrinsic characteristics of legal data, are also highlighted and may motivate further work by the research community.


Author(s): Bharat Tidke, Swati Tidke

In the age of the internet, few people make decisions entirely on their own. Whether purchasing a product, watching a movie, or reading a book, a person looks for reviews. People are often unaware that these reviews may not be true. This is the age of paid reviews, written not only to promote one's own product but also to demote a competitor's. The most critical of these are reviews that target the brand of a product. This chapter proposes a novel approach for brand spam detection that uses feature correlation to improve on state-of-the-art approaches. Correlation-based feature engineering is considered one of the most effective methods for determining the relations among features. Several features attached to reviews matter to both customers and companies in making sound decisions: customers when purchasing, and companies when improving sales and services. Because spamming is now so severe, it has become nearly impossible to judge whether a given review is trustworthy or fake.


Author(s): Isha Y. Agarwal, Dipti P. Rana, Devanshi Bhatia, Jay Rathod, Kaneesha J. Gandhi, ...

Social media has completely transformed the way people communicate. However, every revolution brings some negative impacts. Because of their popularity among users worldwide, these platforms hold a huge volume of data. The ease of access, with minimal verification of new users, has led to the creation of bot accounts that collect private data, spread false and harmful content, and pose many security threats. The growing number of bot accounts on different social media platforms has raised many concerns. There is also a high imbalance between bot and non-bot accounts, which results from the 'normal behavior' of bot users. This research aims to identify bot accounts on Twitter using various machine learning algorithms, based on features provided by the platform, and content-based classification, based on users' recent tweets.


Author(s): Debapriya Banik, Debotosh Bhattacharjee

Medical images frequently suffer from data imbalance, which makes disease classification very difficult. An imbalanced distribution arises in a medical dataset when a specific type of disease accounts for only a small proportion of the entire dataset. Analyzing such datasets is therefore a significant challenge for the machine learning and deep learning community. A standard classification learning algorithm may be biased towards the majority class and ignore the minority class (the class of interest), which generally leads to a wrong diagnosis. Addressing data imbalance in medical image datasets is thus of utmost importance for the early prediction of disease, specifically cancer. This chapter explores different problems concerning data imbalance in medical diagnosis. The authors discuss different rebalancing strategies and offer guidelines for choosing appropriate procedures for training a classifier on the samples, with the aim of efficient medical diagnosis.
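To make the notion of rebalancing concrete, here is a minimal sketch of random oversampling, one widely used rebalancing strategy. It is shown only as an assumed example: the abstract does not say which strategies the chapter covers, and the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class.

    Illustrative helper; not the chapter's own procedure.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples that belong to this class.
        pool = [x for x, y in zip(samples, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data: four "benign" scans and one "malignant" scan (placeholder names).
scans = ["scan_a", "scan_b", "scan_c", "scan_d", "scan_e"]
diagnoses = ["benign", "benign", "benign", "benign", "malignant"]
balanced_scans, balanced_diagnoses = random_oversample(scans, diagnoses)
print(Counter(balanced_diagnoses))
```

Oversampling should be applied to the training split only; duplicating samples before splitting would leak copies of the same example into the evaluation data.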


Author(s): Praveen Kumar Maduri, Tushar Biswas, Preeti Dhiman, Apurva Soni, Kushagra Singh

Plants play a significant role in everyone's life. They provide essentials such as food, oxygen, and shelter, so they must be supervised and nurtured properly. During cultivation, crops are prone to different kinds of diseases that can severely damage the whole yield and lead to financial losses for farmers. In the last 10 years, researchers have used different machine learning techniques to detect plant diseases, but the methods were either not efficient enough to be implemented or unable to cover the wide area over which plant diseases must be detected. The authors therefore introduce a method that is efficient enough to detect plant disease easily and can be implemented in large fields. The method combines a CNN with the k-means clustering algorithm. Crop disease is detected by analyzing the leaves, and users are notified so that they can act at an early stage. The proposed method thus prevents whole crops from being damaged and saves farmers time and energy, because disease is identified well before the human eye could detect it on a large farm.


Author(s): Shivani Vasantbhai Vora, Rupa G. Mehta, Shreyas Kishorkumar Patel

Continuously growing technology enhances creativity, simplifies people's lives, and offers the possibility of anticipating and satisfying their unmet needs. Understanding emotions is a crucial part of understanding human behavior, and machines must understand emotions deeply to be able to predict human needs. Most tweets carry the sentiments of the user, and tweet data inherits an imbalanced class distribution. Most machine learning (ML) algorithms are likely to become biased towards the majority classes. Imbalanced class distribution has gained extensive attention because it raises many research challenges and demands efficient approaches for handling imbalanced data sets. The strategies used in the case study to balance the class distribution are handling redundant data, resampling training data, and data augmentation. Six methods related to these techniques are examined. Experiments on the Twitter dataset show that the merging-minority-classes and shuffle-sentence methods outperform the other techniques.
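One plausible reading of the shuffle-sentence augmentation named above is to create a new training example by permuting the sentences of an existing one. The sketch below follows that reading as an assumption; the abstract gives no implementation details, and the function name `shuffle_sentences` and the punctuation-based sentence splitter are choices made for this example.

```python
import random
import re


def shuffle_sentences(text, seed=None):
    """Return a copy of `text` with its sentences in random order.

    Sentences are split naively after '.', '!' or '?' followed by whitespace.
    Illustrative helper; not the chapter's own implementation.
    """
    rng = random.Random(seed)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng.shuffle(sentences)
    return " ".join(sentences)


# Generate an augmented variant of a minority-class tweet (made-up text).
tweet = "The update broke my phone. I am so frustrated! Please fix this soon."
print(shuffle_sentences(tweet, seed=42))
```

The augmented text keeps the same words and label as the original, so it adds minority-class examples without introducing new vocabulary.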


Author(s): D. Himaja, T. Maruthi Padmaja, P. Radha Krishna

Learning from data streams with both online class imbalance and concept drift (OCI-CD) is receiving much attention. This combined problem degrades the performance of current models that learn from stationary as well as non-stationary environments. In non-stationary environments, the imbalance makes it hard to spot concept drift using conventional drift detection methods, which track change through the learner's performance. There is limited work on the combined problem for imbalanced evolving streams in both stationary and non-stationary environments. The data in such streams may arrive with complete labels or with only limited labels. This chapter's main emphasis is to present different methods for resolving class imbalance in evolving streams, covering stationary and non-stationary environments, in both fully supervised settings and settings with limited labels.


Author(s): Mitali Desai, Rupa G. Mehta, Dipti P. Rana

Data imbalance is a key challenge in the majority of real-world classification problems. It refers to a disparity in the number of data instances belonging to each class label. Data imbalance has been studied in detail in many data domains, such as transaction data, medical data, e-commerce data, meteorological data, social media data, and web data, but the scholarly data domain has yet to be analyzed in this respect. This chapter explores the scholarly data domain with a focus on studying various forms of data imbalance. A popular scholarly platform, ResearchGate (RG), is targeted to extract real scholarly data. An extensive experimental analysis is performed on the extracted data to identify the existence of both data-level and network-level imbalance. The outcome contributes to the understanding of the various types of data imbalance that exist in scholarly data. Resolving this imbalance will substantially help in achieving efficient and accurate outcomes in many real-world scholarly literature applications.


Author(s): Dipti P. Rana, Navodita Saini

Each gender has distinctive personality and behavioral characteristics that are naturally reflected in the language used on social media to write reviews, spread information, build relationships, etc. This information is used by different agencies for profit. A closer study of this information can reveal the implicit biases associated with its creators' gender. The gender ratio is imbalanced across the world, on social media, in discussions, etc. Twitter is used to discuss issues caused by COVID-19, such as its symptoms, mental health, and advice. This motivated the present research to propose a methodology, gender-based tweet analysis (GTA), to study and magnify the impact of gender on the emotions in tweet data. The experimental analysis uncovered gender biases in the emotions of tweet data and highlighted future real-world applications that may become more productive if gender biases are considered for the safety and benefit of society.

