A Text Mining Approach in the Classification of Free-Text Cancer Pathology Reports from the South African National Health Laboratory Services

A cancer pathology report is a valuable medical document that provides information for the clinical management of the patient and the evaluation of health care. However, the quality of free-text reporting varies, ranging from comprehensive to incomplete. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be intrinsically complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of support vector machine (SVM) in classifying reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selection, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms that are closely associated with each of the three cancers. With uni-gram tokenisation, the predictive power of the RF model was consistent across feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance than its linear variant for most feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and producing high-quality information abstractions for cancer prognosis and research.
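
The precision, recall, F-score, and accuracy reported in the abstract above all derive from the counts in a binary confusion matrix. The following is a minimal sketch of those standard definitions, treating "malignant" as the positive class. The function name and the counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F-score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Hypothetical counts for illustration only (positive class = malignant).
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

Because the F-score combines precision and recall, it is less easily inflated than accuracy when one class dominates the data.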

Classification of Noisy Free-Text Prostate Cancer Pathology Reports Using Natural Language Processing

Pattern Recognition. ICPR International Workshops and Challenges - Lecture Notes in Computer Science ◽

10.1007/978-3-030-68763-2_12 ◽

2021 ◽

pp. 154-166

Author(s):

Anjani Dhrangadhariya ◽

Sebastian Otálora ◽

Manfredo Atzori ◽

Henning Müller

Keyword(s):

Prostate Cancer ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Free Text ◽

Cancer Pathology ◽

Pathology Reports

The Influence of Unbalanced Economic Data on Feature Selection and Quality of Classifiers

Folia Oeconomica Stetinensia ◽

10.2478/foli-2020-0014 ◽

2020 ◽

Vol 20 (1) ◽

pp. 232-247

Author(s):

Mariusz Kubus

Keyword(s):

Feature Selection ◽

A Priori ◽

Bankruptcy Prediction ◽

Unbalanced Data ◽

Selection Methods ◽

Economic Data ◽

Quality Of Data ◽

Embedded Methods

Research background: The successful learning of classifiers depends on the quality of data. Modeling is especially difficult when the data are unbalanced or contain many irrelevant variables. This is the case in many applications. The classification of rare events is the overarching goal, e.g. in bankruptcy prediction, churn analysis or fraud detection. The problem of irrelevant variables accompanies situations where the specification of the model is not known a priori, which is the typical situation for data mining analysts. Purpose: The purpose of this paper is to compare combinations of the most popular strategies for handling unbalanced data with feature selection methods that represent filters, wrappers and embedded methods. Research methodology: In the empirical study, we use real datasets with additionally introduced irrelevant variables. In this way, we are able to recognize which method correctly eliminates irrelevant variables. Results: Having carried out the experiment, we conclude that over-sampling does not work in connection with feature selection. Some recommendations of the most promising methods are also given. Novelty: There are many solutions proposed in the literature concerning unbalanced data as well as feature selection. The innovative aspect of our work is to examine their interactions.
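
For readers unfamiliar with over-sampling, the following is a minimal sketch of random over-sampling, one common variant. The abstract above does not state which over-sampling variant was studied, so this choice, the function name, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (hypothetical): four majority rows, one minority row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

The duplicated rows add no new information. This is one plausible reason why over-sampling could interact poorly with feature selection, although the abstract itself gives no explanation.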

Classification of Gene Expression Data Using Feature Selection Based on Type Combination Approach Model With Advanced Feature Selection Technology

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001.oa46 ◽

2021 ◽

Vol 15 (4) ◽

pp. 1-18

Author(s):

Siddesh G. M. ◽

Gururaj T.

Keyword(s):

Feature Selection ◽

Classification Accuracy ◽

Error Rates ◽

Recursive Feature Elimination ◽

Support Vector ◽

Selection Methods ◽

Maximum Information ◽

Combination Approach ◽

Selection Of

A key step in addressing the classification problem is gene selection, which removes redundant and irrelevant genes. The proposed Type Combination Approach-Feature Selection (TCA-FS) model uses efficient feature selection methods to enhance classification accuracy. Three classifiers, K-Nearest Neighbour (KNN), Support Vector Machine (SVM), and Random Forest (RF), are used to evaluate the chosen feature selection methods and their prediction accuracy. The effects of three new feature selection approaches, Improved Recursive Feature Elimination (IRFE), Revised Maximum Information Coefficient (RMIC), and Upgraded Masked Painter (UMP), are analysed. These three proposed techniques are compared with existing techniques and are validated with (i) a stability determination test, (ii) classification accuracy, and (iii) error rates. Because a proper classification threshold is selected, the proposed TCA-FS method provides higher accuracy than the existing system.

Classification of Gene Expression Data Using Feature Selection Based on Type Combination Approach Model with Advanced Feature Selection Techn

International Journal of Cognitive Informatics and Natural Intelligence ◽

10.4018/ijcini.20211001oa34 ◽

2021 ◽

Vol 15 (4) ◽

pp. 0-0

Keyword(s):

Feature Selection ◽

Classification Accuracy ◽

Error Rates ◽

Recursive Feature Elimination ◽

Support Vector ◽

Selection Methods ◽

Maximum Information ◽

Combination Approach ◽

Selection Of

A fuzzy gaussian rank aggregation ensemble feature selection method for microarray data

International Journal of Knowledge-based and Intelligent Engineering Systems ◽

10.3233/kes-190134 ◽

2021 ◽

Vol 24 (4) ◽

pp. 289-301

Author(s):

B. Venkatesh ◽

J. Anuradha

Keyword(s):

Feature Selection ◽

Microarray Data ◽

Classification Accuracy ◽

Performance Metrics ◽

Feature Selection Method ◽

Selection Method ◽

Support Vector ◽

Svm Classifier ◽

Binary Particle Swarm Optimization ◽

Selection Methods

In microarray data, high classification accuracy is difficult to achieve because the data are high-dimensional, noisy, and contain irrelevant features, with many gene expression values but few samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks of the Relief, minimum Redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods. This paper uses Fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used to select the optimal features, and an RBF kernel-based Support Vector Machine (SVM) classifier is used as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets. Performance metrics such as Accuracy, Recall, Precision, and F1-Score are used for evaluation. The experimental results show that the proposed method outperforms the other feature selection methods.
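
The idea of aggregating feature ranks from several filters can be sketched as follows. The abstract does not give the aggregation formula, so the scoring here is an assumption for illustration and is not the authors' method: each rank is mapped to a Gaussian membership centred on rank 1 with width `sigma`, and the memberships are averaged. The gene names and ranks are hypothetical.

```python
import math


def aggregate_ranks(rankings, sigma=2.0):
    """Combine several rankings (dicts mapping feature -> rank, 1 = best)
    into one ordering, best feature first.

    Assumption: each rank r is scored with a Gaussian membership
    exp(-(r - 1)^2 / (2 * sigma^2)), and scores are averaged across filters.
    """
    features = list(rankings[0].keys())
    scores = {}
    for f in features:
        memberships = [math.exp(-((r[f] - 1) ** 2) / (2 * sigma ** 2))
                       for r in rankings]
        scores[f] = sum(memberships) / len(memberships)
    return sorted(features, key=lambda f: scores[f], reverse=True)


# Hypothetical ranks from three filters (e.g. Relief, mRMR, FC).
relief = {"g1": 1, "g2": 2, "g3": 3}
mrmr = {"g1": 2, "g2": 1, "g3": 3}
fc = {"g1": 1, "g2": 3, "g3": 2}
order = aggregate_ranks([relief, mrmr, fc])
print(order)
```

In a two-phase design like the one described, an aggregated ordering of this kind would then feed the wrapper phase, which searches for the final feature subset.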

CLASSIFICATION OF HIGH-DIMENSIONAL MICROARRAY DATA WITH A TWO-STEP PROCEDURE VIA A WILCOXON CRITERION AND MULTILAYER PERCEPTRON

International Journal of Computational Intelligence and Applications ◽

10.1142/s1469026811002969 ◽

2011 ◽

Vol 10 (01) ◽

pp. 1-14

Author(s):

VLADIMIR NIKULIN ◽

TIAN-HSIANG HUANG ◽

GEOFFREY J. MCLACHLAN

Keyword(s):

Data Mining ◽

Feature Selection ◽

High Dimensional ◽

Second Step ◽

Support Vector ◽

Step Procedure ◽

Leave One Out ◽

Natural Combination ◽

Feature Selection Techniques

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
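
A Wilcoxon-based feature criterion can be sketched as below. The exact "separation type" criterion is not specified in the abstract, so this sketch makes an assumption: it uses the Mann-Whitney U statistic, which is equivalent to the Wilcoxon rank-sum statistic, and scores each feature by how far U lies from its expected value under no class separation. The function names and data are hypothetical.

```python
def rank_sum_u(pos, neg):
    """Mann-Whitney U for the positive class: number of (pos, neg)
    pairs with pos > neg, counting ties as 0.5."""
    u = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                u += 1.0
            elif p == n:
                u += 0.5
    return u


def separation_score(values, labels):
    """Score in [0, 1]: 0 = classes fully mixed, 1 = perfectly separated."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    expected = len(pos) * len(neg) / 2  # value of U with no separation
    return abs(rank_sum_u(pos, neg) - expected) / expected


# Hypothetical expression values for two genes across six samples.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "gene_a": [5.1, 4.8, 5.5, 1.0, 1.2, 0.9],  # separates the classes
    "gene_b": [2.0, 3.0, 1.0, 2.5, 1.5, 3.5],  # largely mixed
}
scores = {name: separation_score(vals, labels) for name, vals in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In a two-step procedure like the one described, the top-ranked features would then be passed to the second-step classifier.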

Multiclass classification of leukemia cancer data using Fuzzy Support Vector Machine (FSVM) with feature selection using Principal Component Analysis (PCA)

Journal of Physics Conference Series ◽

10.1088/1742-6596/1725/1/012012 ◽

2021 ◽

Vol 1725 ◽

pp. 012012

Author(s):

I R Fauzi ◽

Z Rustam ◽

A Wibowo

Keyword(s):

Principal Component Analysis ◽

Support Vector Machine ◽

Feature Selection ◽

Principal Component ◽

Component Analysis ◽

Multiclass Classification ◽

Support Vector ◽

Fuzzy Support Vector Machine ◽

Cancer Data

Support Vector Machine VS Information Gain: Analisis Sentimen Cyberbullying di Twitter Indonesia [Sentiment Analysis of Cyberbullying on Indonesian Twitter]

Jurnal ULTIMA InfoSys ◽

10.31937/si.v11i2.1740 ◽

2020 ◽

Vol 11 (2) ◽

pp. 107-111

Author(s):

Christevan Destitus ◽

Wella Wella ◽

Suryasari Suryasari

Keyword(s):

Support Vector Machine ◽

Feature Selection ◽

Text Mining ◽

Information Gain ◽

Text Processing ◽

Support Vector ◽

Term Weighting ◽

System Process ◽

Research Stage

This study aims to classify tweets on Twitter using the Support Vector Machine and Information Gain methods. The classification aims to find a hyperplane that separates the negative and positive classes. The research stage comprises a system process, namely text mining and text processing, with the stages of tokenizing, filtering, stemming, and term weighting. Feature selection is then performed with information gain, which calculates the entropy value of each word. Finally, tweets are classified based on the selected features, and the output identifies whether a tweet is bullying or not. The results of this study indicate that the Support Vector Machine and Information Gain methods give sufficiently good results.
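
The entropy-based information gain of a word can be sketched as follows, using the standard definition: the entropy of the class labels minus the weighted entropy after splitting the documents on whether the word is present. The toy tweets, labels, and function names are hypothetical and are not data from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from splitting docs on presence of `word`.
    `docs` is a list of token sets."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (with_word, without) if part)
    return entropy(labels) - conditional


# Toy tokenised tweets (hypothetical).
docs = [{"you", "idiot"}, {"nice", "day"}, {"idiot", "loser"}, {"good", "day"}]
labels = ["bully", "not", "bully", "not"]
gains = {w: information_gain(docs, labels, w) for w in ("idiot", "day", "you")}
print(gains)
```

Words with the highest gain would be kept as features for the classifier, and low-gain words discarded.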

Prediction of Diabetic Nephropathy from the Relationship between Fatigue, Sleep and Quality of Life

Applied Sciences ◽

10.3390/app10093282 ◽

2020 ◽

Vol 10 (9) ◽

pp. 3282

Author(s):

Angela Shin-Yu Lien ◽

Yi-Der Jiang ◽

Jia-Ling Tsai ◽

Jawl-Shan Hwang ◽

Wei-Chao Lin

Keyword(s):

Quality Of Life ◽

Feature Selection ◽

Diabetic Nephropathy ◽

Sleep Quality ◽

Support Vector ◽

Svm Classifier ◽

Poor Sleep ◽

Poor Sleep Quality ◽

The Relationship

Fatigue and poor sleep quality are the most common clinical complaints of people with diabetes mellitus (DM). These complaints are early signs of DM and are closely related to diabetic control and the presence of complications, which lead to a decline in the quality of life. Therefore, an accurate measurement of the relationship between fatigue, sleep status, and the complication of DM nephropathy could lead to a specific definition of fatigue and an appropriate medical treatment. This study recruited 307 people with Type 2 diabetes from two medical centers in Northern Taiwan through a questionnaire survey and a retrospective investigation of medical records. In an attempt to identify the related factors and accurately predict diabetic nephropathy, we applied hybrid research methods, integrating biostatistics with feature selection methods from data mining and machine learning, to compare and verify the results. The results demonstrated that patients with diabetic nephropathy have a higher fatigue level and Charlson comorbidity index (CCI) score than those without neuropathy, and that the presence of neuropathy leads to poor sleep quality, lower quality of life, and poor metabolism. Furthermore, by using feature selection to choose representative features or variables, we achieved consistent results with a support vector machine (SVM) classifier using merely ten representative factors, with a prediction accuracy as high as 74% in predicting the presence of diabetic nephropathy.

BETTER ALTERNATIVES FOR STEPWISE DISCRIMINANT ANALYSIS

Acta Universitatis Lodziensis Folia oeconomica ◽

10.18778/0208-6018.311.02 ◽

2015 ◽

Vol 1 (311) ◽

Author(s):

Katarzyna Stąpor

Keyword(s):

Feature Selection ◽

Discriminant Analysis ◽

Tabu Search ◽

Stepwise Discriminant Analysis ◽

Selection Methods ◽

Discrimination Power ◽

Statistical Software ◽

Software Packages ◽

Benchmark Datasets

Discriminant analysis can best be defined as a technique that allows the classification of an individual into one of several distinct populations on the basis of a set of measurements. Stepwise discriminant analysis (SDA) is concerned with selecting the most important variables whilst retaining the highest discrimination power possible. Selecting a smaller number of variables is often necessary for a variety of reasons. In the existing statistical software packages, SDA is based on the classic feature selection methods. Many problems with such stepwise procedures have been identified. In this work, a new method based on the metaheuristic strategy tabu search is presented, together with experimental results obtained on selected benchmark datasets. The results are promising.
