Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

2021 ◽  
Vol 11 (2) ◽  
pp. 796
Author(s):  
Alhanoof Althnian ◽  
Duaa AlSaeed ◽  
Heyam Al-Baity ◽  
Amani Samha ◽  
Alanoud Bin Dris ◽  
...  

Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.
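
As an illustration of the threshold-based metrics named in this abstract, the following minimal sketch derives them from the four counts of a binary confusion matrix. The function name and the example counts are hypothetical and are not taken from the paper. AUC is omitted because it requires ranked scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F-score here is the harmonic mean of precision and recall (F1).
    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting specificity alongside recall matters in medical settings because the two can diverge sharply when the classes are imbalanced.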

2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Mohamed Abdel-Nasser ◽  
Jaime Melendez ◽  
Antonio Moreno ◽  
Domenec Puig

Texture analysis methods are widely used to characterize breast masses in mammograms. Texture gives information about the spatial arrangement of the intensities in the region of interest. This information has been used in mammogram analysis applications such as mass detection, mass classification, and breast density estimation. In this paper, we study the effect of factors such as pixel resolution, integration scale, preprocessing, and feature normalization on the performance of those texture methods for mass classification. The classification performance was assessed considering linear and nonlinear support vector machine classifiers. To find the best combination among the studied factors, we used three approaches: greedy, sequential forward selection (SFS), and exhaustive search. On the basis of our study, we conclude that the factors studied affect the performance of texture methods, so the best combination of these factors should be determined to achieve the best performance with each texture method. SFS can be an appropriate way to approach the factor combination problem because it is less computationally intensive than the other methods.
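
The greedy idea behind sequential forward selection can be sketched as follows. This is a generic illustration, not the authors' procedure: the candidate names, the `TOY_GAIN` table, and `toy_score` are hypothetical stand-ins for a real cross-validated classification score.

```python
def sequential_forward_selection(candidates, score):
    """Greedily add the candidate that most improves score(selection).

    Stops as soon as no single addition improves the current best score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-element extension of the current selection.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(trials, key=lambda t: t[0])
        if top_score <= best_score:
            break
        selected.append(top_candidate)
        remaining.remove(top_candidate)
        best_score = top_score
    return selected, best_score


# Hypothetical additive gains; a real score would train and evaluate a classifier.
TOY_GAIN = {
    "resolution": 0.05,
    "integration_scale": 0.03,
    "preprocessing": -0.01,
    "normalization": 0.02,
}


def toy_score(factors):
    return 0.70 + sum(TOY_GAIN[f] for f in factors)


chosen, final_score = sequential_forward_selection(TOY_GAIN, toy_score)
print(chosen, round(final_score, 2))
```

With n candidates, this search evaluates at most n(n+1)/2 selections, whereas an exhaustive search over subsets evaluates 2^n, which is the computational advantage the abstract refers to.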


Author(s):  
Tao Yang ◽  
Dongmei Fu ◽  
Chunhong Wu

Promoted by its convexity and low time complexity, the Laplacian embedded support vector regression (LapESVR) model based on manifold regularization (MR) has assumed an important role in semi-supervised classification. Conventionally, the LapESVR model is based on a single kernel function that is intrinsically capable of describing only one feature mapping relation. However, when the data to be processed come from a complex dataset in which multiple features must be treated, the classification performance of the single-kernel LapESVR degrades substantially, indicating that the classification requirement in this case is beyond the capability of the LapESVR. In addition, the data being processed are often affected by abnormal samples; therefore, in practice, assigning a fixed value related to the average distance of the data as the kernel parameter of the LapESVR is by no means optimal. To solve these problems, this paper proposes a Laplacian embedded infinite kernel regression (LapEIKR) model. The proposed model combines multiple kernels linearly to improve its ability to characterize the data, as is typically needed in semi-supervised classification of complex datasets with multiple features. Further, the parameter setting of the multiple kernels of the LapEIKR model is turned into an optimization problem by formulating a corresponding objective function to be minimized and an iterative algorithm; the values are then obtained by an explicit calculation as the optimal values with respect to the designed objective function. Comparative experiments on the UCI datasets, benchmark datasets, and the Caltech256 dataset show that the proposed LapEIKR model improves adaptivity and efficiency.
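
To make "combines multiple kernels linearly" concrete, here is a minimal sketch of a weighted sum of Gaussian (RBF) kernels with different widths. The choice of RBF kernels, the `gammas`, and the `weights` are assumptions made for illustration; the abstract does not specify the kernel family, and the paper's optimization of the kernel parameters is not reproduced here.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def combined_kernel(x, y, gammas, weights):
    """Linear combination of RBF kernels, one per (gamma, weight) pair."""
    return sum(w * rbf_kernel(x, y, g) for g, w in zip(gammas, weights))


# Hypothetical widths and non-negative weights that sum to 1.
gammas = [0.1, 1.0, 10.0]
weights = [0.5, 0.3, 0.2]

x, y = [0.0, 0.0], [1.0, 0.0]
print(combined_kernel(x, y, gammas, weights))
```

A non-negative weighted sum of valid kernels is itself a valid kernel, which is why the weights above are kept non-negative.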


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6677
Author(s):  
Sahand Hajifar ◽  
Saeb Ragani Lamooki ◽  
Lora A. Cavuoto ◽  
Fadel M. Megahed ◽  
Hongyue Sun

Human activity recognition has been extensively used for the classification of occupational tasks. Existing activity recognition approaches perform well when training and testing data follow an identical distribution. However, in the real world, this condition may be violated due to existing heterogeneities among training and testing data, which results in degradation of classification performance. This study aims to investigate the impact of four heterogeneity sources, cross-sensor, cross-subject, joint cross-sensor and cross-subject, and cross-scenario heterogeneities, on classification performance. To that end, two experiments called separate task scenario and mixed task scenario were conducted to simulate tasks of electrical line workers under various heterogeneity sources. Furthermore, a support vector machine classifier equipped with domain adaptation was used to classify the tasks and benchmarked against a standard support vector machine baseline. Our results demonstrated that the support vector machine equipped with domain adaptation outperformed the baseline for cross-sensor, joint cross-subject and cross-sensor, and cross-subject cases, while the performance of support vector machine equipped with domain adaptation was not better than that of the baseline for cross-scenario case. Therefore, it is of great importance to investigate the impact of heterogeneity sources on classification performance and if needed, leverage domain adaptation methods to improve the performance.


2013 ◽  
Vol 6 ◽  
pp. BII.S11987 ◽  
Author(s):  
Mindy K. Ross ◽  
Ko-Wei Lin ◽  
Karen Truong ◽  
Abhishek Kumar ◽  
Mike Conway

The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ² feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP
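
The χ² feature selection mentioned in this abstract can be illustrated with the standard statistic for a 2×2 contingency table of term presence against class membership. This is a minimal sketch: the terms and counts below are hypothetical, and the authors' exact implementation is not described in the abstract.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table.

    a: documents containing the term, in the class
    b: documents containing the term, not in the class
    c: documents without the term, in the class
    d: documents without the term, not in the class
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no association information
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical document counts per term, for illustration only.
term_counts = {
    "cardiac": (30, 10, 20, 40),
    "study": (10, 10, 20, 20),
}
ranked = sorted(
    term_counts,
    key=lambda term: chi_square_2x2(*term_counts[term]),
    reverse=True,
)
print(ranked)
```

A term that occurs at the same rate inside and outside the class scores zero and would be dropped first when keeping only the top-ranked features.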


2018 ◽  
Vol 2018 ◽  
pp. 1-8 ◽  
Author(s):  
Na’eem Hoosen Agjee ◽  
Onisimo Mutanga ◽  
Kabir Peerbhay ◽  
Riyad Ismail

Hyperspectral datasets contain spectral noise, the presence of which adversely affects a classifier's ability to generalize accurately. Despite machine learning algorithms being regarded as robust classifiers that generalize well under unfavourable noisy conditions, the extent of this robustness is poorly understood. This study aimed to evaluate the influence of simulated spectral noise (10%, 20%, and 30%) on random forest (RF) and oblique random forest (oRF) classification performance using two node-splitting models (ridge regression (RR) and support vector machines (SVM)) to discriminate healthy and low infested water hyacinth plants. Results from this study showed that RF was slightly influenced by simulated noise, with classification accuracies decreasing for week one and week two with the addition of 30% noise. In comparison to RF, oRF-RR and oRF-SVM yielded higher test accuracies (oRF-RR: 5.36%–7.15%; oRF-SVM: 3.58%–5.36%) and test kappa coefficients (oRF-RR: 10.72%–14.29%; oRF-SVM: 7.15%–10.72%). Notably, oRF-RR test accuracies and kappa coefficients remained consistent irrespective of simulated noise level for week one and week two, while similar results were achieved for week three using oRF-SVM. Overall, this study has demonstrated that oRF-RR can be regarded as a robust classification algorithm that is not influenced by noisy spectral conditions.
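
Because this abstract reports both accuracy and kappa, a short sketch of Cohen's kappa may help in relating the two. The confusion matrix below is hypothetical and does not come from the study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = sum(sum(row) for row in confusion)
    size = len(confusion)
    observed = sum(confusion[i][i] for i in range(size)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-class result: accuracy is 0.85.
toy_confusion = [[45, 5], [10, 40]]
print(round(cohen_kappa(toy_confusion), 3))
```

When chance agreement is 0.5, kappa equals 2 × accuracy − 1, so a given difference in accuracy appears doubled in kappa.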


2019 ◽  
Vol 9 (4) ◽  
pp. 643 ◽  
Author(s):  
Geun-Ho Kwak ◽  
No-Wook Park

Unmanned aerial vehicle (UAV) images that can provide thematic information at much higher spatial and temporal resolutions than satellite images have great potential in crop classification. Due to the ultra-high spatial resolution of UAV images, spatial contextual information such as texture is often used for crop classification. From a data availability viewpoint, it is not always possible to acquire time-series UAV images due to limited accessibility to the study area. Thus, it is necessary to improve classification performance for situations in which only a single UAV image or a minimal number of UAV images is available for crop classification. In this study, we investigate the potential of gray-level co-occurrence matrix (GLCM)-based texture information for crop classification with time-series UAV images and machine learning classifiers including random forest and support vector machine. In particular, the impact of combining texture and spectral information on the classification performance is evaluated for cases that use only one UAV image or multi-temporal images as input. A case study of crop classification in Anbandegi of Korea was conducted for the above comparisons. The best classification accuracy was achieved when multi-temporal UAV images, which can fully account for the growth cycles of crops, were combined with GLCM-based texture features. However, the impact of the utilization of texture information was not significant. In contrast, when one August UAV image was used for crop classification, the utilization of texture information significantly affected the classification performance. Classification using texture features extracted from GLCM with larger kernel size significantly improved classification accuracy, an improvement of 7.72%p in overall accuracy for the support vector machine classifier, compared with classification based solely on spectral information. These results indicate the usefulness of texture information for classification of ultra-high-spatial-resolution UAV images, particularly when acquisition of time-series UAV images is difficult and only one UAV image is used for crop classification.
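
A gray-level co-occurrence matrix can be sketched in a few lines. The example below computes a symmetric GLCM for a horizontal one-pixel offset over a whole 4×4 patch and derives the contrast feature from it. The patch, the offset, and the choice of contrast are illustrative assumptions; the study computes GLCM features within moving kernels over UAV imagery, which is not reproduced here.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dx, dy)."""
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # also count the pair in the reverse direction
    return counts


def contrast(counts):
    """Contrast feature: sum of p(i, j) * (i - j)^2 over the normalized matrix."""
    total = sum(sum(row) for row in counts)
    return sum(
        value / total * (i - j) ** 2
        for i, row in enumerate(counts)
        for j, value in enumerate(row)
    )


# Small hypothetical patch with four gray levels (0 to 3).
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
horizontal = glcm(patch, levels=4)
patch_contrast = contrast(horizontal)
print(horizontal, round(patch_contrast, 3))
```

Larger analysis windows collect more pixel pairs per matrix, which generally gives more stable texture estimates at the cost of blurring field boundaries.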


2021 ◽  
Vol 12 (1) ◽  
pp. 197
Author(s):  
Chunxia Zhang ◽  
Xiaoli Wei ◽  
Sang-Woon Kim

This paper empirically evaluates two kinds of features, which are extracted, respectively, with traditional statistical methods and convolutional neural networks (CNNs), in order to improve the performance of seismic patch image classification. In the latter case, feature vectors, named “CNN-features”, were extracted from one trained CNN model, and were then used to train existing classifiers, such as support vector machines. In this case, to train the CNN model, a technique of transfer learning using synthetic seismic patch data in the source domain, and real-world patch data in the target domain, was applied. The experimental results show that CNN-features lead to some improvements in the classification performance. By analyzing the data complexity measures, the CNN-features are found to have the strongest discriminant capabilities. Furthermore, the transfer learning technique alleviates the problems of long processing times and the lack of training data.


2018 ◽  
Vol 7 (4.19) ◽  
pp. 1025
Author(s):  
Mr. Manoj Ashok Wakchaure ◽  
Prof. Dr. S. S. Sane

Discrimination prevention in data mining has been studied by researchers. Several methods have been devised to take care of both direct and indirect discrimination prevention. In order to prevent discrimination, each of these methods tries to minimize the impact of discriminating attributes by modifying certain discriminating rules. The discriminating rules are identified using a certain threshold and a discrimination measure such as elift for direct discrimination and elb for indirect discrimination. The performance of these methods is measured and compared in terms of discrimination removal, using DDPD and DDPP for direct discrimination and IDPD and IDPP for indirect discrimination, as well as resultant data quality, using MC and GC for both kinds of discrimination. This paper studies the use of discrimination measures other than elift, such as slift, clift, and olift. The empirical evaluation presented here shows that slift provides the best overall performance.
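
For readers unfamiliar with these measures, the sketch below computes elift and slift for a classification rule "A, B → C" using the definitions commonly given in the discrimination-discovery literature: elift = conf(A, B → C) / conf(B → C) and slift = conf(A, B → C) / conf(¬A, B → C). The abstract does not state the definitions, so treat these as assumptions; the records and attribute names are hypothetical.

```python
def confidence(records, premise, conclusion):
    """Fraction of records satisfying the premise that also satisfy the conclusion."""
    covered = [r for r in records if premise(r)]
    if not covered:
        raise ValueError("premise matches no records")
    return sum(1 for r in covered if conclusion(r)) / len(covered)


def elift(records, in_group, context, conclusion):
    """Extended lift: conf(A, B -> C) / conf(B -> C)."""
    with_group = confidence(records, lambda r: in_group(r) and context(r), conclusion)
    return with_group / confidence(records, context, conclusion)


def slift(records, in_group, context, conclusion):
    """Selection lift: conf(A, B -> C) / conf(not A, B -> C)."""
    with_group = confidence(records, lambda r: in_group(r) and context(r), conclusion)
    without_group = confidence(
        records, lambda r: not in_group(r) and context(r), conclusion
    )
    return with_group / without_group


# Hypothetical records: A is group == "a", B is city == "X", C is denied.
records = (
    [{"group": "a", "city": "X", "denied": True}] * 3
    + [{"group": "a", "city": "X", "denied": False}] * 1
    + [{"group": "b", "city": "X", "denied": True}] * 3
    + [{"group": "b", "city": "X", "denied": False}] * 3
    + [{"group": "a", "city": "Y", "denied": False}] * 2
)


def is_group_a(r):
    return r["group"] == "a"


def in_city_x(r):
    return r["city"] == "X"


def was_denied(r):
    return r["denied"]


toy_elift = elift(records, is_group_a, in_city_x, was_denied)
toy_slift = slift(records, is_group_a, in_city_x, was_denied)
print(round(toy_elift, 3), round(toy_slift, 3))
```

On the same rule, slift is larger than elift here because its denominator excludes the protected group, so the comparison is against the unprotected group alone rather than against the whole context.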


2020 ◽  
Vol 39 (6) ◽  
pp. 8927-8935
Author(s):  
Bing Zheng ◽  
Dawei Yun ◽  
Yan Liang

Under the impact of COVID-19, research on behavior recognition is highly needed. In this paper, we combine the algorithm of self-adaptive coder and recurrent neural network to realize the research of behavior pattern recognition. At present, most of the research of human behavior recognition is focused on the video data, which is based on the video number. At the same time, due to the complexity of video image data, it is easy to violate personal privacy. With the rapid development of Internet of Things technology, it has attracted the attention of a large number of experts and scholars. Researchers have tried to use many machine learning methods, such as random forest, support vector machine, and other shallow learning methods, which perform well in the laboratory environment, but there is still a long way to go to practical application. In this paper, a recurrent neural network algorithm based on long short-term memory (LSTM) is proposed to realize the recognition of behavior patterns, so as to improve the accuracy of human activity behavior recognition.

