imbalanced class distribution
Recently Published Documents



Author(s):  
Dr. Yelepi Usha Rani ◽  
Lakshmi Sowmya Kotturi ◽  
Dr. G. Sudhakar ◽  
...  

In recent years, researchers have made intensive use of machine learning and AI techniques in the medical field, particularly in the domain of cancer. Breast cancer is one such example, and many studies have proposed CAD systems and algorithms to detect cancer cells and tumors efficiently. Breast cancer accounts for a large portion of cancer deaths worldwide, mostly affecting women, and it needs early detection for proper diagnosis and a subsequent decrease in the death rate. For classification, we applied several ML techniques to the Wisconsin dataset [1], namely SVM, KNN, Decision Tree, Random Forest, and Naive Bayes, using accuracy as the performance metric; in our observations, SVM showed better results than the other algorithms. We also worked on the Breast Histopathology Images dataset [2], scanned at 40x, which contains images of invasive ductal carcinoma (IDC), one of the most common types of breast cancer. For the image dataset, alongside exploratory data analysis (EDA), we used MobileNet, with SMOTE resampling to handle the imbalanced class distribution, as well as CNN, SVC, and InceptionResNetV2, with the TensorFlow and Keras frameworks providing the environment for implementing the algorithms.
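The SMOTE resampling mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `smote_like_oversample`, the toy points, and the choice of `k=2` are assumptions made for illustration. It shows only the core idea of SMOTE, which is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical 2-D minority-class samples (not taken from either dataset).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. In practice a maintained implementation would normally be preferred over a hand-written one.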


2021 ◽  
Author(s):  
Peiyuan Zhou ◽  
Andrew K.C. Wong ◽  
Yang Yang ◽  
Scott T. Leatherdale ◽  
Kate Battista ◽  
...  

Abstract Background: COMPASS is a longitudinal, prospective cohort study collecting data annually from students attending high school in jurisdictions across Canada. We aimed to discover significant frequent and rare associations of behavioral factors related to cannabis use among Canadian adolescents. Methods: We used a subset of the COMPASS dataset containing 18,761 records of students in grades 9 to 12, with 31 selected features (attributes) covering various characteristics, from living habits to academic performance. We then used the Pattern Discovery and Disentanglement (PDD) algorithm to detect strong and rare (yet statistically significant) associations in the dataset. Results: Cohort characteristics, factors associated with cannabis use, and other associations detected by PDD are consistent with common sense and literature surveys. In addition, PDD outperformed methods using other criteria that are popular in the literature (i.e. support and confidence). Association results showed that PDD could discover: i) a smaller set of succinct significant associations in clusters; ii) frequent and rare, yet significant, patterns supported by relevant population health studies; iii) patterns from a dataset with extremely imbalanced groups (majority class (None-user): minority class (Regular) = 88.3%: 11.7%). Conclusions: Results on the COMPASS dataset have validated PDD’s efficacy in discovering succinct, interpretable frequent associations with comprehensive coverage, as well as rare yet significant associations, from datasets with extremely imbalanced class distribution without relying on any balancing process. The frequent associations are consistent with common sense and literature surveys, while the rare patterns reveal very special cases. The success of PDD on this project indicates that PDD has great potential for population health data analysis.
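The support and confidence criteria that PDD is compared against can be sketched as follows. The records, attribute names (`skips_class`, `cannabis`), and values are hypothetical toy data, not COMPASS records, and the code does not implement PDD itself. It only shows why a rare pattern can score low on support while still having high confidence.

```python
def support(records, items):
    """Fraction of records that contain every (attribute, value) pair in items."""
    hits = sum(1 for r in records if all(r.get(k) == v for k, v in items.items()))
    return hits / len(records)

def confidence(records, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the records."""
    both = support(records, {**antecedent, **consequent})
    ante = support(records, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical toy records: one regular user among ten students.
records = (
    [{"skips_class": "yes", "cannabis": "regular"}]
    + [{"skips_class": "yes", "cannabis": "none"}]
    + [{"skips_class": "no", "cannabis": "none"}] * 8
)

rule_support = support(records, {"skips_class": "yes", "cannabis": "regular"})
rule_confidence = confidence(records, {"skips_class": "yes"}, {"cannabis": "regular"})
print(rule_support)     # 0.1
print(rule_confidence)  # 0.5
```

With a minimum-support threshold above 0.1, this rule would be discarded even though half of the matching records belong to the minority class, which is the kind of rare association that threshold-based mining tends to miss.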


Author(s):  
M. Voelsen ◽  
D. Lobo Torres ◽  
R. Q. Feitosa ◽  
F. Rottensteiner ◽  
C. Heipke

Abstract. Fully convolutional neural networks (FCN) are successfully used for pixel-wise land cover classification - the task of identifying the physical material of the Earth’s surface for every pixel in an image. The acquisition of large training datasets is challenging, especially in remote sensing, but necessary for an FCN to perform well. One way to circumvent manual labelling is the usage of existing databases, which usually contain a certain amount of label noise when combined with another data source. As a first part of this work, we investigate the impact of training data on an FCN. We experiment with different amounts of training data, varying w.r.t. the covered area, the available acquisition dates and the amount of label noise. We conclude that the more data is used for training, the better the generalization performance of the model, and that the FCN is able to mitigate the effect of label noise to a high degree. Another challenge is the imbalanced class distribution in most real-world datasets, which can cause the classifier to focus on the majority classes, leading to poor classification performance for minority classes. To tackle this problem, in this paper, we use the cosine similarity loss to force feature vectors of the same class to be close to each other in feature space. Our experiments show that the cosine loss helps to obtain more similar feature vectors, but the similarity of the cluster centers also increases.
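A cosine similarity loss of the kind described can be sketched as below. The abstract does not give the exact formulation, so this pairwise form (mean of one minus the cosine similarity over same-class pairs) is an assumption, as are the function names and toy vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def same_class_cosine_loss(features, labels):
    """Mean of (1 - cosine similarity) over all pairs sharing a label.
    The loss is 0 when same-class vectors point in the same direction."""
    total, pairs = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if labels[i] == labels[j]:
                total += 1.0 - cosine_similarity(features[i], features[j])
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical feature vectors: two of class 0, one of class 1.
aligned = same_class_cosine_loss([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]], [0, 0, 1])
orthogonal = same_class_cosine_loss([[1.0, 0.0], [0.0, 1.0]], [0, 0])
print(aligned, orthogonal)
```

Note that this loss only pulls same-class vectors together. Nothing in it pushes different classes apart, which is consistent with the reported observation that the similarity of the cluster centers also increases.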


2021 ◽  
Vol 11 (8) ◽  
pp. 3543
Author(s):  
Xiang Yang Lim ◽  
Kok Beng Gan ◽  
Noor Azah Abd Aziz

Human activity recognition (HAR) is the study of identifying specific human movements and actions based on images, accelerometer data, and inertial measurement unit (IMU) sensors. In sensor-based HAR applications, most researchers have used many IMU sensors to obtain accurate HAR classification. The use of many IMU sensors not only limits deployment but also increases the difficulty and discomfort for users. As reported in the literature, the original model used data from 19 sensors consisting of accelerometers and IMU sensors. Imbalanced class distribution is another challenge to the recognition of human activity in real life. This is a realistic scenario, and the classifier may predict some of the imbalanced classes with very high accuracy. However, training a model on an imbalanced dataset can degrade its performance. In this paper, two approaches, namely resampling and multiclass focal loss, were used to address the imbalanced dataset. The resampling method was used to reconstruct the imbalanced class distribution of the IMU sensor dataset prior to model development and learning with the cross-entropy loss function. A deep ConvLSTM network with a minimal number of IMU sensors was used to develop the upper-body HAR model. Alternatively, the multiclass focal loss function was used in the HAR model to classify minority classes without the need to resample the imbalanced dataset. In the experiments, the HAR model developed with the cross-entropy loss function and the reconstructed dataset achieved 0.91 in both accuracy and F1-score. The HAR model with the multiclass focal loss function and the imbalanced dataset was slightly lower in both accuracy and F1-score, with a 1% difference from the resampling method in each. In conclusion, the upper-body HAR model using a minimal number of IMU sensors and proper handling of imbalanced class distribution by the resampling method is useful for the assessment of home-based rehabilitation involving activities of daily living.
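The difference between cross-entropy and multiclass focal loss can be sketched for a single sample. This is an illustrative sketch, not the paper's implementation: the value `gamma=2.0` and the example probabilities are assumptions, and the optional per-class weighting factor is omitted.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy for one sample: -log of the true-class probability."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).
    gamma = 0 recovers the plain cross-entropy."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical softmax outputs over three activity classes.
easy = [0.9, 0.05, 0.05]  # true class 0 predicted confidently
hard = [0.1, 0.6, 0.3]    # true class 0 predicted poorly

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 0), focal_loss(probs, 0))
```

The factor `(1 - p_t) ** gamma` shrinks the loss of well-classified samples far more than that of poorly classified ones, so training effort shifts toward hard examples, which are often those from minority classes. This is why no resampling step is needed with this loss.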


Author(s):  
Seunghoon Kim ◽  
Youngbin Lym ◽  
Ki-Jung Kim

Along with rapid demographic change, there has been increased attention to the risk of vehicle crashes involving older drivers. Given seniors' involvement in crashes and their physical vulnerability, it is crucial to develop models that accurately predict the severity of senior-involved crashes. The challenge, however, is how to cope with an imbalanced severity class distribution and the ordered nature of crash severities, as both can complicate the classification of crash severity. This study therefore investigates how accounting for the ordinal nature of the classes and handling the imbalanced class distribution influence prediction performance. Using vehicle crash data from Ohio, U.S., as an example, eight machine learning classifiers (logistic and ordered logistic regressions, and random forest and ordered random forest, each with or without handling of imbalanced classes) are proposed and their performances compared. The results show that a balancing strategy enhances performance in predicting severe crashes. In contrast, the effects of accounting for the ordinal nature vary across models. Specifically, the ordered random forest classifier without balancing appears to be superior in terms of overall prediction accuracy, while the ordered random forest with balancing outperforms the others in predicting more severe crashes.
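The gap between overall accuracy and performance on severe crashes can be made concrete with a small sketch. The labels and the trivial "always predict the majority class" predictor are hypothetical and are not drawn from the Ohio data. The sketch only shows why the two measures can favour different models.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, cls):
    """Share of true `cls` samples that were predicted as `cls`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in relevant) / len(relevant) if relevant else 0.0

# Hypothetical severity labels: nine minor crashes, one severe crash.
y_true = ["minor"] * 9 + ["severe"]
always_minor = ["minor"] * 10

print(accuracy(y_true, always_minor))          # 0.9
print(recall(y_true, always_minor, "severe"))  # 0.0
```

A predictor that never flags a severe crash still reaches 0.9 accuracy here, so a model tuned without balancing can look best on overall accuracy while a balanced one does better on the severe class.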


Author(s):  
Shivani Vasantbhai Vora ◽  
Rupa G. Mehta ◽  
Shreyas Kishorkumar Patel

Continuously advancing technology enhances creativity, simplifies people's lives, and offers the possibility of anticipating and satisfying their unmet needs. Understanding emotions is a crucial part of understanding human behavior, and machines must understand emotions deeply to be able to predict human needs. Most tweets carry the sentiments of the user, and tweet data inherits an imbalanced class distribution. Most machine learning (ML) algorithms are likely to become biased towards the majority classes. The imbalanced distribution of classes has gained extensive attention because it poses many research challenges, and it demands efficient approaches to handling imbalanced datasets. The strategies used for balancing the class distribution in the case study are handling redundant data, resampling training data, and data augmentation. Six methods related to these techniques were examined in the case study. In experiments on the Twitter dataset, the merging-minority-classes and shuffle-sentence methods outperformed the other techniques.
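A shuffle-sentence augmentation can be sketched as follows. The abstract does not describe the procedure, so the sentence-splitting rule, the function name `shuffle_sentences`, and the example tweet are all assumptions. The idea shown is simply to create an additional minority-class sample by reordering the sentences of an existing one.

```python
import random
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shuffle_sentences(text, seed=0):
    """Return a copy of text with its sentences in a shuffled order."""
    sentences = split_sentences(text)
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return " ".join(sentences)

# Hypothetical minority-class tweet.
tweet = "Missed my train again. So tired of this! Why does it always rain?"
augmented = shuffle_sentences(tweet, seed=1)
print(augmented)
```

The augmented text keeps the same words and label while presenting a different surface form, which gives the classifier more minority-class examples without collecting new data.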


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Zina Z. R. Al-Shamaa ◽  
Sefer Kurnaz ◽  
Adil Deniz Duru ◽  
Nadia Peppa ◽  
Alex H. Mirnezami ◽  
...  

Imbalanced class distribution in medical datasets is a challenge that hinders correct disease classification. It arises when the number of healthy class instances is much larger than the number of disease class instances. To address this problem, we propose undersampling the healthy class instances to improve classification of the disease class. The model is named Hellinger Distance Undersampling (HDUS). It employs the Hellinger distance to measure the resemblance between a majority class instance and its neighbouring minority class instances, in order to separate the classes effectively and boost the discrimination power for each class. An extensive experiment was conducted on four imbalanced medical datasets using three classifiers to compare HDUS with a baseline model and three state-of-the-art undersampling models. The results show that HDUS can perform better than the other models in terms of sensitivity, F1 measure, and balanced accuracy.
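The Hellinger distance itself can be sketched for two discrete probability distributions. How HDUS applies it to individual instances and their neighbours is not specified in the abstract, so this block shows only the standard distance; the example distributions are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.
    The result lies in [0, 1]: 0 for identical distributions, 1 for
    distributions with no overlap."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2.0)

identical = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
partial = hellinger([0.7, 0.3], [0.4, 0.6])
print(identical, disjoint, partial)
```

Because the distance is bounded and symmetric, it gives a comparable score for how much two distributions resemble each other, which is the property an undersampling rule could use to decide which majority instances overlap most with the minority class.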

