The Influence of Unbalanced Economic Data on Feature Selection and Quality of Classifiers

Research background: The successful learning of classifiers depends on the quality of data. Modeling is especially difficult when the data are unbalanced or contain many irrelevant variables. This is the case in many applications. The classification of rare events is the overarching goal, e.g. in bankruptcy prediction, churn analysis or fraud detection. The problem of irrelevant variables accompanies situations where the specification of the model is not known a priori, as is typically the case for data mining analysts.

Purpose: The purpose of this paper is to compare combinations of the most popular strategies for handling unbalanced data with feature selection methods that represent filters, wrappers and embedded methods.

Research methodology: In the empirical study, we use real datasets with additionally introduced irrelevant variables. In this way, we are able to recognize which method correctly eliminates irrelevant variables.

Results: Having carried out the experiment, we conclude that over-sampling does not work in combination with feature selection. Some recommendations of the most promising methods are also given.

Novelty: There are many solutions proposed in the literature concerning unbalanced data as well as feature selection. The novel aspect of our work is the examination of their interactions.
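The study design described in this abstract (inject known irrelevant variables, rebalance the classes, then check what a selection method keeps) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' experiment: the data are synthetic rather than real, the only rebalancing strategy shown is random over-sampling, and the only selection method shown is a simple correlation filter. All function names (`make_data`, `random_oversample`, `filter_ranking`) are hypothetical.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_data(n=500, n_minority=50, n_noise=5, seed=0):
    """Synthetic unbalanced data: one relevant variable plus injected noise variables."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        y = 1 if i < n_minority else 0  # rare class = first n_minority rows
        row = {"relevant": 3.0 * y + rng.gauss(0, 1)}
        for j in range(n_noise):
            row[f"noise_{j}"] = rng.gauss(0, 1)  # irrelevant by construction
        rows.append((row, y))
    return rows


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


def filter_ranking(rows):
    """Filter method: rank variables by absolute correlation with the class label."""
    ys = [y for _, y in rows]
    names = list(rows[0][0].keys())
    scores = {name: abs(pearson([r[name] for r, _ in rows], ys)) for name in names}
    return sorted(scores, key=scores.get, reverse=True)


data = make_data()
balanced = random_oversample(data)
print("ranking on raw data:     ", filter_ranking(data))
print("ranking after oversample:", filter_ranking(balanced))
```

Because the irrelevant variables are known in advance, any of them that survives selection is a detectable mistake, which is the point of the design. The toy signal here is deliberately strong, so the sketch says nothing about the paper's finding on over-sampling.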


A Text Mining Approach in the Classification of Free-Text Cancer Pathology Reports from the South African National Health Laboratory Services

Information ◽

10.3390/info12110451 ◽

2021 ◽

Vol 12 (11) ◽

pp. 451

Author(s):

Okechinyere J. Achilonu ◽

Victor Olago ◽

Elvira Singh ◽

René M. J. C. Eijkemans ◽

Gideon Nimako ◽

...

Keyword(s):

Prostate Cancer ◽

Feature Selection ◽

Text Mining ◽

Support Vector ◽

Free Text ◽

Selection Methods ◽

Cancer Pathology ◽

Pathology Reports

A cancer pathology report is a valuable medical document that provides information for clinical management of the patient and evaluation of health care. However, there are variations in the quality of reporting in free-text style formats, ranging from comprehensive to incomplete reporting. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be intrinsically complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of support vector machine (SVM) in the classification of reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selection, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms, which are synonymous with each of the three cancers. With uni-gram tokenisation, the predictive power of the RF model was consistent across various feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance than its linear variant for most of the feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and production of high-quality information abstractions for cancer prognosis and research.


BETTER ALTERNATIVES FOR STEPWISE DISCRIMINANT ANALYSIS

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.311.02 ◽

2015 ◽

Vol 1 (311) ◽

Author(s):

Katarzyna Stąpor

Keyword(s):

Feature Selection ◽

Discriminant Analysis ◽

Tabu Search ◽

Stepwise Discriminant Analysis ◽

Selection Methods ◽

Discrimination Power ◽

Statistical Software ◽

Software Packages ◽

Benchmark Datasets

Discriminant analysis can best be defined as a technique which allows the classification of an individual into one of several distinct populations on the basis of a set of measurements. Stepwise discriminant analysis (SDA) is concerned with selecting the most important variables whilst retaining the highest discrimination power possible. Selecting a smaller number of variables is often necessary for a variety of reasons. In the existing statistical software packages, SDA is based on the classic feature selection methods. Many problems with such stepwise procedures have been identified. In this work, a new method based on the metaheuristic strategy tabu search is presented, together with experimental results obtained on selected benchmark datasets. The results are promising.


Classification of Imaginary motor task from Electroencephalographic Signals: A Comparison of Feature Selection Methods and Classification Algorithms

10.17488/rmib.39.1.8 ◽

2017 ◽

Author(s):

H. J. Vélez-Lora

Keyword(s):

Feature Selection ◽

Motor Task ◽

Classification Algorithms ◽

Selection Methods ◽

Electroencephalographic Signals


FEATURE SELECTION AND THE CHESSBOARD PROBLEM

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.311.03 ◽

2015 ◽

Vol 1 (311) ◽

Author(s):

Mariusz Kubus

Keyword(s):

Feature Selection ◽

Data Structure ◽

Important Criterion ◽

Multivariate Approach ◽

Generalization Error ◽

Selection Methods ◽

Embedded Methods ◽

Feature Relevance

Feature selection methods are usually classified into three groups: filters, wrappers and embedded methods. The second important criterion of their classification is an individual or multivariate approach to evaluation of the feature relevance. The chessboard problem is an illustrative example, where two variables which have no individual influence on the dependent variable can be essential to separate the classes. The classifiers which deal well with such data structure are sensitive to irrelevant variables. The generalization error increases with the number of noisy variables. We discuss the feature selection methods in the context of chessboard-like structure in the data with numerous irrelevant variables.
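The chessboard structure described in this abstract can be illustrated with a small synthetic experiment. The sketch below is not taken from the paper. It assumes a 2x2 board (the XOR case), uniformly distributed variables, 20 injected noise variables, and a 1-nearest-neighbour classifier as a stand-in for "classifiers which deal well with such data structure". All names and parameter values are illustrative choices.

```python
import random


def chessboard_label(x1, x2, k=2):
    """Class is the colour of the k x k chessboard square containing (x1, x2)."""
    return (int(x1 * k) + int(x2 * k)) % 2


def make_points(n, n_noise, rng):
    """Columns 0 and 1 define the class; the remaining columns are pure noise."""
    points = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        noise = [rng.random() for _ in range(n_noise)]
        points.append(([x1, x2] + noise, chessboard_label(x1, x2)))
    return points


def pearson(xs, ys):
    """Plain Pearson correlation, used here as an individual (univariate) filter."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def one_nn_accuracy(train, test, columns):
    """Accuracy of 1-nearest-neighbour using only the given columns."""
    hits = 0
    for x, y in test:
        nearest = min(train, key=lambda t: sum((t[0][c] - x[c]) ** 2 for c in columns))
        hits += nearest[1] == y
    return hits / len(test)


rng = random.Random(0)
N_NOISE = 20
train = make_points(400, N_NOISE, rng)
test = make_points(200, N_NOISE, rng)
labels = [y for _, y in train]

# Individual evaluation: each board variable looks irrelevant on its own.
corr_x1 = pearson([x[0] for x, _ in train], labels)
corr_x2 = pearson([x[1] for x, _ in train], labels)

# Multivariate evaluation: the pair separates the classes; noise columns hurt.
acc_x1_only = one_nn_accuracy(train, test, [0])
acc_pair = one_nn_accuracy(train, test, [0, 1])
acc_all = one_nn_accuracy(train, test, range(2 + N_NOISE))

print(f"corr(x1, y) = {corr_x1:.3f}, corr(x2, y) = {corr_x2:.3f}")
print(f"1-NN accuracy: x1 only = {acc_x1_only:.2f}, "
      f"x1 and x2 = {acc_pair:.2f}, all columns = {acc_all:.2f}")
```

In a setup like this, a filter that scores variables one at a time has no basis for preferring the two board variables over the noise columns, whereas a classifier given exactly those two columns separates the classes well and loses accuracy once the noise columns are included.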


Classification of Prostatic Tissues using Feature Selection Methods

11th Mediterranean Conference on Medical and Biomedical Engineering and Computing 2007 - IFMBE Proceedings ◽

10.1007/978-3-540-73044-6_219 ◽

2007 ◽

pp. 843-846 ◽

Cited By ~ 1

Author(s):

S. Bouatmane ◽

B. Nekhoul ◽

A. Bouridane ◽

C. Tanougast

Keyword(s):

Feature Selection ◽

Selection Methods


A Feature Selection-Based Method for an Ontological Enrichment Process in Geographic Knowledge Modelling

Advances in Geospatial Technologies - Handbook of Research on Geographic Information Systems Applications and Advancements ◽

10.4018/978-1-5225-0937-0.ch016 ◽

2017 ◽

pp. 407-426

Author(s):

Mohamed Farah ◽

Hafedh Nefzi ◽

Imed Riadh Farah

Keyword(s):

Feature Selection ◽

Similarity Measures ◽

Geographic Information ◽

Information Representation ◽

Ontology Alignment ◽

Research Projects ◽

Selection Methods ◽

Geographic Knowledge ◽

Enrichment Process

Geographic information has become so complex and abundant that recent research projects have been undertaken to make it manageable and exploitable. Ontologies are considered a valuable support for geographic information representation. Building geographic ontologies can be viewed as an enrichment process. Alignment of concepts coming from different ontologies is central to the enrichment process and deeply affects the quality of the resulting ontology. The alignment of ontologies is based on similarity measures. In the literature, there are many models for ontology alignment that mainly differ with respect to the similarity measures they use and the way these measures are combined. Most alignment methods do not deal with the problem of correlation between similarity measures. In this chapter, we address this issue in order to decide which similarity measures should be considered to better assess the true similarity between concepts. Our proposal consists of using feature selection methods to select a reduced set of relevant similarity measures.


The Effect of Different Feature Selection Methods for Classification of Melanoma

Recent Trends in Signal and Image Processing - Advances in Intelligent Systems and Computing ◽

10.1007/978-981-33-6966-5_13 ◽

2021 ◽

pp. 123-133

Author(s):

Ananjan Maiti ◽

Biswajoy Chatterjee

Keyword(s):

Feature Selection ◽

Selection Methods


A comparison of three feature selection methods for object-based classification of sub-decimeter resolution UltraCam-L imagery

International Journal of Applied Earth Observation and Geoinformation ◽

10.1016/j.jag.2011.05.011 ◽

2012 ◽

Vol 15 ◽

pp. 70-78 ◽

Cited By ~ 85

Author(s):

A.S. Laliberte ◽

D.M. Browning ◽

A. Rango

Keyword(s):

Feature Selection ◽

Selection Methods ◽

Object Based


Quality of data on intra-EU trade in goods

Wiadomości Statystyczne. The Polish Statistician ◽

10.5604/01.3001.0013.8496 ◽

2019 ◽

Vol 64 (6) ◽

pp. 5-15

Author(s):

Iwona Markowicz ◽

Paweł Baran

Keyword(s):

Poor Quality ◽

Study Data ◽

Quality Of Data ◽

Member States ◽

Eu Member States ◽

Country Level ◽

The Poor ◽

Made In

Official statistics on trade in goods between EU member states are collected at country level and then aggregated by Eurostat. The methodology of data collection differs slightly between member states (e.g. various statistical thresholds and coverage), including differences in exchange rates as well as undeclared or late-declared transactions, errors in classification of goods and other mistakes. It often involves incomparability of mirror data (nominally concerning the same transactions recorded in statistics of both dispatcher and receiver countries). A huge part of these differences can be explained by the variable quality of data resources in the Eurostat database. In the study, data quality on intra-EU trade in goods for 2017 was compared between Poland and neighbouring EU countries, i.e. Germany, Czech Republic, Slovakia, Lithuania, and other Baltic states (Latvia and Estonia). The additional aim was to indicate the directions having the greatest influence on the observed differences in mirror data. The results of the study indicate that the declarations made in Estonia affect the poor quality of data on trade in goods between the countries mentioned above to the greatest extent.
