The Influence of Unbalanced Economic Data on Feature Selection and Quality of Classifiers

2020 ◽  
Vol 20 (1) ◽  
pp. 232-247
Author(s):  
Mariusz Kubus

Research background: The successful learning of classifiers depends on the quality of data. Modeling is especially difficult when the data are unbalanced or contain many irrelevant variables. This is the case in many applications. The classification of rare events is the overarching goal, e.g. in bankruptcy prediction, churn analysis or fraud detection. The problem of irrelevant variables arises when the specification of the model is not known a priori, that is, under the conditions data mining analysts typically face.
Purpose: The purpose of this paper is to compare combinations of the most popular strategies for handling unbalanced data with feature selection methods that represent filters, wrappers and embedded methods.
Research methodology: In the empirical study, we use real datasets with additionally introduced irrelevant variables. In this way, we are able to recognize which method correctly eliminates irrelevant variables.
Results: Having carried out the experiment, we conclude that over-sampling does not work in combination with feature selection. Recommendations on the most promising methods are also given.
Novelty: Many solutions have been proposed in the literature for unbalanced data as well as for feature selection. Our contribution is to examine how they interact.
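The abstract does not say which resampling variants were compared, so the following is only a minimal sketch of two common strategies, random over-sampling and random under-sampling, on made-up toy data. The function names and the toy rows are assumptions for illustration, not the paper's code.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (made up): 95 majority rows (class 0) and 5 minority rows (class 1).
rows = [(i, i % 7) for i in range(100)]
labels = [0] * 95 + [1] * 5

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)

# Over-sampling balances the class counts but adds no new distinct minority rows.
distinct_minority = {r for r, y in zip(over_rows, over_labels) if y == 1}
print(Counter(over_labels), Counter(under_labels), len(distinct_minority))
```

In this sketch the over-sampled set contains only exact copies of the original minority rows, whereas under-sampling discards majority rows. Either step would normally be applied to the training split only.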

Information ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 451
Author(s):  
Okechinyere J. Achilonu ◽  
Victor Olago ◽  
Elvira Singh ◽  
René M. J. C. Eijkemans ◽  
Gideon Nimako ◽  
...  

A cancer pathology report is a valuable medical document that provides information for clinical management of the patient and evaluation of health care. However, there are variations in the quality of reporting in free-text style formats, ranging from comprehensive to incomplete reporting. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be intrinsically complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of support vector machine (SVM) in the classification of reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selection, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms, which are synonymous with each of the three cancers. With uni-gram tokenisation, the predictive power of the RF model was consistent across various feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance than its linear variant for most of the feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and producing high-quality information abstractions for cancer prognosis and research.
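The abstract does not name the four filter methods. As one illustration of how a filter can rank uni-gram features for a benign/malignant split, here is a minimal sketch that assumes a chi-square score computed from document frequencies. The report snippets are invented toy strings, not data from the study.

```python
from collections import defaultdict


def chi2_scores(docs, labels, positive):
    """Chi-square score of each unigram from its 2x2 document-frequency table."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    df_pos, df_neg = defaultdict(int), defaultdict(int)
    for text, y in zip(docs, labels):
        for term in set(text.lower().split()):  # count each term once per document
            if y == positive:
                df_pos[term] += 1
            else:
                df_neg[term] += 1
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]  # documents containing the term
        c, d = n_pos - a, n_neg - b        # documents lacking the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


# Invented toy snippets, for illustration only.
docs = [
    "invasive ductal carcinoma present",
    "carcinoma with invasive margins",
    "adenocarcinoma invasive grade",
    "benign fibroadenoma no atypia",
    "benign tissue no malignancy",
    "no evidence of tumour benign",
]
labels = ["malignant"] * 3 + ["benign"] * 3

scores = chi2_scores(docs, labels, positive="malignant")
# Rank by descending score, breaking ties alphabetically for determinism.
top_terms = sorted(scores, key=lambda t: (-scores[t], t))[:3]
print(top_terms)
```

The retained terms would then feed a classifier such as RF or SVM. Varying how many top-ranked terms are kept corresponds to the "feature sizes" mentioned in the abstract.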


2015 ◽  
Vol 1 (311) ◽  
Author(s):  
Katarzyna Stąpor

Discriminant analysis can best be defined as a technique that allows the classification of an individual into one of several distinct populations on the basis of a set of measurements. Stepwise discriminant analysis (SDA) is concerned with selecting the most important variables whilst retaining the highest discrimination power possible. Selecting a smaller number of variables is often necessary for a variety of reasons. In existing statistical software packages, SDA is based on classic feature selection methods. Many problems with such stepwise procedures have been identified. In this work, a new method based on the tabu search metaheuristic is presented, together with experimental results obtained on selected benchmark datasets. The results are promising.
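The abstract gives no details of the proposed method, so the following is only a generic sketch of tabu search over feature subsets. The `toy_score` function and its relevance values are hypothetical stand-ins for a real discrimination criterion, and the single-flip neighbourhood, tenure and aspiration rule are assumptions.

```python
def tabu_search(n_features, score, n_iter=20, tenure=2):
    """Generic tabu search over feature subsets with a single-flip neighbourhood."""
    current = frozenset()
    best, best_score = current, score(current)
    tabu = {}  # feature index -> last iteration at which flipping it is forbidden
    for it in range(n_iter):
        candidates = []
        for j in range(n_features):
            neighbour = current ^ {j}  # add feature j if absent, drop it if present
            s = score(neighbour)
            is_tabu = tabu.get(j, -1) >= it
            # Aspiration: a tabu move is allowed if it beats the best score so far.
            if not is_tabu or s > best_score:
                candidates.append((s, j, neighbour))
        if not candidates:
            continue
        # Move to the best admissible neighbour, even if it is worse than current.
        s, j, current = max(candidates, key=lambda c: (c[0], -c[1]))
        tabu[j] = it + tenure  # forbid flipping j back for the next `tenure` iterations
        if s > best_score:
            best, best_score = current, s
    return best, best_score


# Hypothetical criterion: summed per-feature relevance minus a size penalty.
RELEVANCE = [0.9, 0.1, 0.8, 0.05, 0.7]


def toy_score(subset):
    return sum(RELEVANCE[j] for j in sorted(subset)) - 0.3 * len(subset)


best_subset, best_value = tabu_search(len(RELEVANCE), toy_score)
print(sorted(best_subset), round(best_value, 3))
```

Unlike a purely greedy stepwise procedure, the search keeps moving after it reaches a local optimum, and the tabu list stops it from immediately undoing recent moves. On this separable toy criterion a greedy search would reach the same subset, so the example shows only the mechanics.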


2015 ◽  
Vol 1 (311) ◽  
Author(s):  
Mariusz Kubus

Feature selection methods are usually classified into three groups: filters, wrappers and embedded methods. A second important criterion for their classification is whether feature relevance is evaluated individually or in a multivariate way. The chessboard problem is an illustrative example, where two variables which have no individual influence on the dependent variable can be essential to separate the classes. The classifiers which deal well with such a data structure are sensitive to irrelevant variables. The generalization error increases with the number of noisy variables. We discuss feature selection methods in the context of a chessboard-like structure in the data with numerous irrelevant variables.
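The chessboard effect can be reproduced on a small synthetic grid. The sketch below assumes a 2-by-2 board on the unit square, which is our choice for illustration and not taken from the paper. It checks that each variable alone is uncorrelated with the class and cannot beat chance with a single threshold, while a feature combining both separates the classes.

```python
from itertools import product


def chessboard_label(x1, x2, cells=2):
    """Class is the parity of the cell position on a cells-by-cells board over [0, 1)^2."""
    return (int(x1 * cells) + int(x2 * cells)) % 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def best_stump_accuracy(values, labels):
    """Best accuracy of a one-threshold rule on a single feature (either orientation)."""
    best = 0.0
    for t in sorted(set(values)):
        pred = [1 if v >= t else 0 for v in values]
        acc = sum(p == y for p, y in zip(pred, labels)) / len(labels)
        best = max(best, acc, 1 - acc)
    return best


# Regular 20 x 20 grid of points; no point lies on a cell boundary.
grid = [(i + 0.5) / 20 for i in range(20)]
points = list(product(grid, grid))
labels = [chessboard_label(x1, x2) for x1, x2 in points]
x1s = [p[0] for p in points]
x2s = [p[1] for p in points]

corr_x1 = pearson(x1s, labels)
corr_x2 = pearson(x2s, labels)
single_acc = max(best_stump_accuracy(x1s, labels), best_stump_accuracy(x2s, labels))

# A feature built from both variables: its sign encodes the board cell's class.
interaction = [(a - 0.5) * (b - 0.5) for a, b in points]
joint_acc = best_stump_accuracy(interaction, labels)

print(corr_x1, corr_x2, single_acc, joint_acc)
```

A univariate filter that scores each variable separately would therefore rank both chessboard variables no higher than pure noise, which is why multivariate evaluation matters for this kind of structure.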


Author(s):  
Mohamed Farah ◽  
Hafedh Nefzi ◽  
Imed Riadh Farah

Geographic information has become so complex and abundant that recent research projects have been undertaken to make it manageable and exploitable. Ontologies are considered a valuable support for geographic information representation. Building geographic ontologies can be viewed as an enrichment process. Alignment of concepts coming from different ontologies is central to the enrichment process and deeply affects the quality of the resulting ontology. The alignment of ontologies is based on similarity measures. In the literature, there are many models for ontology alignment, which differ mainly in the similarity measures they use and the way these measures are combined. Most alignment methods do not deal with the problem of correlation between similarity measures. In this chapter, we address this issue in order to decide which similarity measures should be considered to better assess the true similarity between concepts. Our proposal consists of using feature selection methods to select a reduced set of relevant similarity measures.
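The chapter's actual selection method is not described in the abstract. As a rough illustration of handling correlated similarity measures, the sketch below assumes a simple greedy redundancy filter. The measure names, the scores and the 0.9 threshold are all hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def drop_correlated(measures, threshold=0.9):
    """Greedy filter: keep a measure only if |r| with every kept measure is below threshold."""
    kept = []
    for name, values in measures.items():
        if all(abs(pearson(values, measures[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical similarity scores of three measures over the same five concept pairs.
measures = {
    "string_sim": [0.10, 0.40, 0.50, 0.90, 0.30],
    "token_sim": [0.05, 0.20, 0.25, 0.45, 0.15],  # exactly half of string_sim
    "structure_sim": [0.90, 0.10, 0.60, 0.20, 0.70],
}

kept = drop_correlated(measures)
print(kept)
```

The result depends on the order in which measures are visited. A filter that also took relevance into account would need reference alignments to score each measure against, which this sketch does not assume.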


2019 ◽  
Vol 64 (6) ◽  
pp. 5-15
Author(s):  
Iwona Markowicz ◽  
Paweł Baran

Official statistics on trade in goods between EU member states are collected at country level and then aggregated by Eurostat. The methodology of data collection differs slightly between member states (e.g. various statistical thresholds and coverage), including differences in exchange rates as well as undeclared or late-declared transactions, errors in the classification of goods and other mistakes. This often results in incomparability of mirror data (nominally concerning the same transactions recorded in the statistics of both dispatcher and receiver countries). A large part of these differences can be explained by the variable quality of data resources in the Eurostat database. In the study, data quality on intra-EU trade in goods for 2017 was compared between Poland and neighbouring EU countries, i.e. Germany, the Czech Republic, Slovakia and Lithuania, and the other Baltic states, Latvia and Estonia. An additional aim was to indicate the directions having the greatest influence on the observed differences in mirror data. The results of the study indicate that declarations made in Estonia contribute most to the poor quality of data on trade in goods between the countries mentioned above.


2017 ◽  
Vol 222 ◽  
pp. 49-56 ◽  
Author(s):  
Lucas R. Trambaiolli ◽  
Claudinei E. Biazoli ◽  
Joana B. Balardin ◽  
Marcelo Q. Hoexter ◽  
João R. Sato
