Optimization Based Undersampling for Imbalanced Classes

Author(s):  
Fatih SAĞLAM
2016 ◽  
Vol 63 (3) ◽  
pp. 353-372 ◽  
Author(s):  
David Holleran ◽  
Bruce D. Stout

In this study, we examine how important juvenile race and other factors are in juvenile commitment classification in the New Jersey Family Court. Data from the Family Court in New Jersey for the year 2010 comprise the population. Given the class imbalance in the dependent variable, we employ balanced random forest (RF). Variable importance plots and an information gain summary are used to assess the role of the juvenile’s race and other variables for each class of the dependent variable. The results from balanced RF indicate that the juvenile’s delinquency history and the offense seriousness make the most important contributions to commitment to juvenile state incarceration. The juvenile’s race makes a very weak contribution to commitment; in fact, when the balanced RF was rerun with the juvenile’s race omitted, the estimated misclassification error for commitments dropped slightly. Balanced RF is an attractive procedure for handling dependent variables with highly imbalanced classes. The juvenile’s adjudication history and offense seriousness emerged as the most important variables for commitment to state incarceration. The race of the juvenile was not an important variable with respect to commitment.
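The abstract does not say how its balanced RF was implemented. As a rough illustration only, the sampling step usually associated with balanced random forests can be sketched as follows: for each tree, draw a bootstrap sample from the minority class and an equally sized sample, with replacement, from every other class. The function name `balanced_bootstrap` and the toy labels are hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: equal-size draws (with replacement) per class."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # The smallest class sets the per-class sample size.
    n_min = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_min))  # sampling with replacement
    return sample


# Hypothetical toy labels: 95 cases in class 0 and 5 cases in class 1.
labels = [0] * 95 + [1] * 5
rng = random.Random(0)  # fixed seed, an assumption made for repeatability
sample = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in sample)
print(counts)
```

Each tree would then be grown on its own balanced sample, so the rare class is not swamped by the common one during training.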


Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1792
Author(s):  
Shu-Fen Huang ◽  
Ching-Hsue Cheng

Medical data usually have missing values; hence, imputation has become an important issue. In previous studies, many imputation methods, such as expectation-maximization and regression-based imputation, assumed that the data follow a multivariate normal distribution. These assumptions may lead to deviations in the results, which sometimes create a bottleneck. In addition, directly deleting instances with missing values may cause several problems, such as losing important data, producing invalid research samples, and leading to research deviations. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared deleting instances with missing values against imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used kNN imputation, multiple imputation, and the proposed imputation to impute the missing data and then applied four attribute selection methods to select the important attributes. Then, we used the decision tree (C4.5), random forest, REP tree, and LMT classifiers to generate the rules, accuracy, and AUC for comparison. Because four datasets had imbalanced classes (asymmetric classes), the AUC was an important criterion. In the experiment, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation is better than the listed imputation methods and that imputing offers better results than directly deleting instances with missing values in terms of the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.
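To illustrate why AUC matters more than accuracy when classes are imbalanced, the sketch below computes AUC as the fraction of positive/negative pairs that a scorer ranks correctly, counting ties as half. It is a generic illustration with hypothetical toy data, not the evaluation code used in the study.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical imbalanced toy set: nine negatives, one positive.
labels = [0] * 9 + [1]
# A useless model that gives every case the same score and always predicts 0.
constant_scores = [0.0] * 10
constant_preds = [0] * 10

acc = accuracy(constant_preds, labels)
area = auc(constant_scores, labels)
print(acc, area)
```

The always-negative model looks good on accuracy because of the class ratio, while its AUC sits at chance level, which is the behavior that makes AUC the more informative criterion here.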


2019 ◽  
Vol 16 (12) ◽  
pp. 1254-1261 ◽  
Author(s):  
Wei Ouyang ◽  
Casper F. Winsnes ◽  
Martin Hjelmare ◽  
Anthony J. Cesnik ◽  
Lovisa Åkesson ◽  
...  

Pinpointing subcellular protein localizations from microscopy images is easy to the trained eye, but challenging to automate. Based on the Human Protein Atlas image collection, we held a competition to identify deep learning solutions to solve this task. Challenges included training on highly imbalanced classes and predicting multiple labels per image. Over 3 months, 2,172 teams participated. Despite convergence on popular networks and training techniques, there was considerable variety among the solutions. Participants applied strategies for modifying neural networks and loss functions, augmenting data and using pretrained networks. The winning models far outperformed our previous effort at multi-label classification of protein localization patterns by ~20%. These models can be used as classifiers to annotate new images, feature extractors to measure pattern similarity or pretrained networks for a wide range of biological applications.

