Arabic Authorship Attribution Using Synthetic Minority Over-sampling Technique and Principal Components Analysis for Imbalanced Documents

Dealing with imbalanced data is a major challenge in data mining and in machine learning tasks. In this investigation, we address the problem of class imbalance in the Authorship Attribution (AA) task, with a specific application to Arabic text data. This article proposes a new hybrid approach based on Principal Components Analysis (PCA) and the Synthetic Minority Over-sampling Technique (SMOTE), which considerably improves the performance of authorship attribution on imbalanced data. The dataset contains 7 Arabic books written by 7 different scholars, which are segmented into text segments of the same size, with an average length of 2900 words per text. The results of our experiments show that the proposed approach, using the SMO-SVM classifier, achieves high authorship attribution accuracy (100%), especially with starting character-bigrams. In addition, the proposed method improves AA performance on imbalanced datasets, mainly with function words.
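The abstract gives no implementation details, so the following is only a minimal sketch of the standard SMOTE interpolation step, not the authors' code. It assumes each text segment has already been turned into a numeric feature vector (for example, after PCA); the function name `smote` and the parameters `n_new`, `k`, and `seed` are illustrative choices, not taken from the article.

```python
import random
from typing import List, Sequence

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Vector], n_new: int, k: int = 5, seed: int = 0) -> List[Vector]:
    """Create n_new synthetic samples for a minority class.

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic: List[Vector] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

In a pipeline like the one the abstract outlines, the synthetic vectors would be appended to the training set of the under-represented author before the classifier is trained; test segments would be left untouched.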

Author(s):  
Hassina Hadjadj ◽  
Halim Sayoud


2016 ◽  
Vol 22 (4) ◽  
pp. 97-103
Author(s):  
Donia Ben Hassen ◽  
Sihem Ben Zakour ◽  
Hassen Taleb

A novel scheme for lesion classification in chest radiographs is presented in this paper. Features are extracted from lesions detected in lung regions, which are segmented automatically. Redundant variables are then eliminated from the extracted subset because they degrade classification performance. We used Stepwise Forward Selection and Principal Components Analysis, obtaining two subsets of features. Finally, we experimented with the Stepwise/FCM/SVM classification and the PCA/FCM/SVM one. The ROC curves show that the hybrid PCA/FCM/SVM has relatively better accuracy and remarkably higher efficiency. Experimental results suggest that this approach may be helpful to radiologists for reading chest images.


1980 ◽  
Vol 19 (04) ◽  
pp. 205-209
Author(s):  
L. A. Abbott ◽  
J. B. Mitton

Data taken from the blood of 262 patients diagnosed with malabsorption, elective cholecystectomy, acute cholecystitis, infectious hepatitis, liver cirrhosis, or chronic renal disease were analyzed with three numerical taxonomy (NT) methods: cluster analysis, principal components analysis, and discriminant function analysis. Principal components analysis revealed discrete clusters of patients suffering from chronic renal disease, liver cirrhosis, and infectious hepatitis, which could be displayed by NT clustering as well as by plotting, but other disease groups were poorly defined. Sharper resolution of the same disease groups was attained by discriminant function analysis.

