scholarly journals A Text Mining Approach in the Classification of Free-Text Cancer Pathology Reports from the South African National Health Laboratory Services

Information ◽  
2021 ◽  
Vol 12 (11) ◽  
pp. 451
Author(s):  
Okechinyere J. Achilonu ◽  
Victor Olago ◽  
Elvira Singh ◽  
René M. J. C. Eijkemans ◽  
Gideon Nimako ◽  
...  

A cancer pathology report is a valuable medical document that provides information for clinical management of the patient and evaluation of health care. However, there are variations in the quality of reporting in free-text style formats, ranging from comprehensive to incomplete reporting. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be intrinsically complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of support vector machine (SVM) in the classification of reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selections, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms, which are synonymous with each of the three cancers. Uni-gram tokenisation using the classifiers showed that the predictive power of RF model was consistent across various feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance compared with its linear variant for most of the feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and production of high-quality information abstractions for cancer prognosis and research.

2020 ◽  
Vol 20 (1) ◽  
pp. 232-247
Author(s):  
Mariusz Kubus

AbstractResearch background: The successful learning of classifiers depends on the quality of data. Modeling is especially difficult when the data are unbalanced or contain many irrelevant variables. This is the case in many applications. The classification of rare events is the overarching goal, e.g. in bankruptcy prediction, churn analysis or fraud detection. The problem of irrelevant variables accompanies situations where the specification of the model is not known a priori, thus in typical conditions for data mining analysts.Purpose: The purpose of this paper is to compare the combinations of the most popular strategies of handling unbalanced data with feature selection methods that represent filters, wrappers and embedded methods.Research methodology: In the empirical study, we use real datasets with additionally introduced irrelevant variables. In this way, we are able to recognize which method correctly eliminates irrelevant variables.Results: Having carried out the experiment we conclude that over-sampling does not work in connection with feature selection. Some recommendations of the most promising methods also are given.Novelty: There are many solutions proposed in the literature concerning unbalanced data as well as feature selection. The innovative field of our interests is to examine their interactions.


Author(s):  
Siddesh G. M. ◽  
Gururaj T.

A key step in addressing the classification issue was the selection of genes for removing redundant and irrelevant genes. The proposed Type Combination Approach –Feature Selection(TCA-FS) model uses the efficient feature selection methods, and the classification accuracy can be enhanced. The three classifiers such as K Nearest Neighbour(KNN), Support Vector Machine(SVM) and Random Forest(RF) are selected for evaluating the opted feature selection methods, and prediction accuracy. The effects of three new approaches for feature selection are Improved Recursive Feature Elimination (IRFE), Revised Maximum Information co-efficient (RMIC), as well as Upgraded Masked Painter (UMP), are analysed. These three proposed techniques are compared with existing techniques and are validated with (i) Stability determination test. (ii) Classification accuracy. (iii) Error rates of three proposed techniques are analysed. Due to the selection of proper threshold on classification, the proposed TCA-FS method provides a higher accuracy compared to the existing system.


A key step in addressing the classification issue was the selection of genes for removing redundant and irrelevant genes. The proposed Type Combination Approach –Feature Selection(TCA-FS) model uses the efficient feature selection methods, and the classification accuracy can be enhanced. The three classifiers such as K Nearest Neighbour(KNN), Support Vector Machine(SVM) and Random Forest(RF) are selected for evaluating the opted feature selection methods, and prediction accuracy. The effects of three new approaches for feature selection are Improved Recursive Feature Elimination (IRFE), Revised Maximum Information co-efficient (RMIC), as well as Upgraded Masked Painter (UMP), are analysed. These three proposed techniques are compared with existing techniques and are validated with (i) Stability determination test. (ii) Classification accuracy. (iii) Error rates of three proposed techniques are analysed. Due to the selection of proper threshold on classification, the proposed TCA-FS method provides a higher accuracy compared to the existing system.


Author(s):  
B. Venkatesh ◽  
J. Anuradha

In Microarray Data, it is complicated to achieve more classification accuracy due to the presence of high dimensions, irrelevant and noisy data. And also It had more gene expression data and fewer samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features need to extract, this can be achieved by applying the feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, filter and wrapper phase in filter phase ensemble technique is used for aggregating the feature ranks of the Relief, minimum redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods. This paper uses the Fuzzy Gaussian membership function ordering for aggregating the ranks. In wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) is used for selecting the optimal features, and the RBF Kernel-based Support Vector Machine (SVM) classifier is used as an evaluator. The performance of the proposed model are compared with state of art feature selection methods using five benchmark datasets. For evaluation various performance metrics such as Accuracy, Recall, Precision, and F1-Score are used. Furthermore, the experimental results show that the performance of the proposed method outperforms the other feature selection methods.


Author(s):  
VLADIMIR NIKULIN ◽  
TIAN-HSIANG HUANG ◽  
GEOFFREY J. MCLACHLAN

The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.


2020 ◽  
Vol 11 (2) ◽  
pp. 107-111
Author(s):  
Christevan Destitus ◽  
Wella Wella ◽  
Suryasari Suryasari

This study aims to clarify tweets on twitter using the Support Vector Machine and Information Gain methods. The clarification itself aims to find a hyperplane that separates the negative and positive classes. In the research stage, there is a system process, namely text mining, text processing which has stages of tokenizing, filtering, stemming, and term weighting. After that, a feature selection is made by information gain which calculates the entropy value of each word. After that, clarify based on the features that have been selected and the output is in the form of identifying whether the tweet is bully or not. The results of this study found that the Support Vector Machine and Information Gain methods have sufficiently maximum results.


2020 ◽  
Vol 10 (9) ◽  
pp. 3282
Author(s):  
Angela Shin-Yu Lien ◽  
Yi-Der Jiang ◽  
Jia-Ling Tsai ◽  
Jawl-Shan Hwang ◽  
Wei-Chao Lin

Fatigue and poor sleep quality are the most common clinical complaints of people with diabetes mellitus (DM). These complaints are early signs of DM and are closely related to diabetic control and the presence of complications, which lead to a decline in the quality of life. Therefore, an accurate measurement of the relationship between fatigue, sleep status, and the complication of DM nephropathy could lead to a specific definition of fatigue and an appropriate medical treatment. This study recruited 307 people with Type 2 diabetes from two medical centers in Northern Taiwan through a questionnaire survey and a retrospective investigation of medical records. In an attempt to identify the related factors and accurately predict diabetic nephropathy, we applied hybrid research methods, integrated biostatistics, and feature selection methods in data mining and machine learning to compare and verify the results. Consequently, the results demonstrated that patients with diabetic nephropathy have a higher fatigue level and Charlson comorbidity index (CCI) score than without neuropathy, the presence of neuropathy leads to poor sleep quality, lower quality of life, and poor metabolism. Furthermore, by considering feature selection in selecting representative features or variables, we achieved consistence results with a support vector machine (SVM) classifier and merely ten representative factors and a prediction accuracy as high as 74% in predicting the presence of diabetic nephropathy.


2015 ◽  
Vol 1 (311) ◽  
Author(s):  
Katarzyna Stąpor

Discriminant Analysis can best be defined as a technique which allows the classification of an individual into several dictinctive populations on the basis of a set of measurements. Stepwise discriminant analysis (SDA) is concerned with selecting the most important variables whilst retaining the highest discrimination power possible. The process of selecting a smaller number of variables is often necessary for a variety number of reasons. In the existing statistical software packages SDA is based on the classic feature selection methods. Many problems with such stepwise procedures have been identified. In this work the new method based on the metaheuristic strategy tabu search will be presented together with the experimental results conducted on the selected benchmark datasets. The results are promising.


Sign in / Sign up

Export Citation Format

Share Document