Hybrid supervised clustering based ensemble scheme for text classification

Kybernetes ◽  
2017 ◽  
Vol 46 (2) ◽  
pp. 330-348 ◽  
Author(s):  
Aytug Onan

Purpose The immense quantity of available unstructured text documents serves as one of the largest sources of information. Text classification can be an essential task for many purposes in information retrieval, such as document organization, text filtering and sentiment analysis. Ensemble learning has been extensively studied to construct efficient text classification schemes with higher predictive performance and generalization ability. The purpose of this paper is to provide diversity among the classification algorithms of the ensemble, which is a key issue in ensemble design. Design/methodology/approach An ensemble scheme based on hybrid supervised clustering is presented for text classification. In the presented scheme, supervised hybrid clustering, based on the cuckoo search algorithm and k-means, is introduced to partition the data samples of each class into clusters so that training subsets with higher diversity can be provided. Each classifier is trained on the diversified training subsets, and the predictions of the individual classifiers are combined by the majority voting rule. The predictive performance of the proposed classifier ensemble is compared to conventional classification algorithms (such as Naïve Bayes, logistic regression, support vector machines and the C4.5 algorithm) and ensemble learning methods (such as AdaBoost, bagging and random subspace) using 11 text benchmarks. Findings The experimental results indicate that the presented classifier ensemble outperforms the conventional classification algorithms and ensemble learning methods for text classification. Originality/value The presented ensemble scheme is the first to use supervised clustering to obtain a diverse ensemble for text classification.
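
A minimal sketch of the clustering-driven ensemble idea, assuming plain k-means stands in for the paper's cuckoo-search-optimized supervised clustering, and that one random cluster per class is drawn to form each diversified training subset (one plausible reading of the scheme); integer class labels are assumed for the vote:

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def diversified_subsets(X, y, n_clusters=3, n_subsets=5, seed=0):
    """Partition each class into clusters, then draw one random cluster
    per class to form each diversified training subset."""
    rng = np.random.default_rng(seed)
    per_class = {}
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = min(n_clusters, len(idx))  # guard against tiny classes
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X[idx])
        per_class[c] = [idx[labels == j] for j in range(k)]
    return [np.concatenate([cls[rng.integers(len(cls))]
                            for cls in per_class.values()])
            for _ in range(n_subsets)]

def fit_ensemble(X, y, base=LogisticRegression(max_iter=1000)):
    # one base classifier per diversified subset
    return [clone(base).fit(X[s], y[s]) for s in diversified_subsets(X, y)]

def majority_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])  # integer labels assumed
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```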

Kybernetes ◽  
2016 ◽  
Vol 45 (10) ◽  
pp. 1576-1588 ◽  
Author(s):  
Mohammadali Abedini ◽  
Farzaneh Ahmadzadeh ◽  
Rassoul Noorossana

Purpose A crucial decision in financial services is how to classify credit or loan applicants into good and bad applicants. The purpose of this paper is to propose a four-stage hybrid data mining approach to support the decision-making process. Design/methodology/approach The approach is inspired by the bagging ensemble learning method and proposes a new voting method, namely two-level majority voting, in the last stage. First, some training subsets are generated. Then, several different base classifiers are tuned, and afterward ensemble methods are applied to strengthen the tuned classifiers. Finally, two-level majority voting schemes help the approach achieve more accuracy. Findings A comparison of results shows the proposed model outperforms powerful single classifiers such as multilayer perceptron (MLP), support vector machine and logistic regression (LR). In addition, it is more accurate than ensemble learning methods such as bagging-LR or rotation forest (RF)-MLP. The model outperforms single classifiers in terms of type I and II errors; it is close to some ensemble approaches such as bagging-LR and RF-MLP but fails to outperform them on these errors. Moreover, majority voting in the final stage provides more reliable results. Practical implications The study concludes the approach would be beneficial for banks, credit card companies and other credit provider organisations. Originality/value A novel four-stage hybrid approach inspired by the bagging ensemble method is proposed. Moreover, the two-level majority voting, applied in two different schemes in the last stage, provides more accuracy. An integrated evaluation criterion for classification errors provides an enhanced insight for error comparisons.
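
A minimal sketch of one plausible reading of the two-level voting scheme, in which level one is the internal majority vote of each bagging ensemble built on a tuned base classifier and level two is a hard vote across the ensembles; the synthetic data and classifier settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Level 1: each tuned base classifier is strengthened by bagging,
# which already aggregates its members by majority vote.
ensembles = [
    ("bag_lr", BaggingClassifier(LogisticRegression(max_iter=1000),
                                 n_estimators=25, random_state=0)),
    ("bag_mlp", BaggingClassifier(MLPClassifier(hidden_layer_sizes=(32,),
                                                max_iter=500, random_state=0),
                                  n_estimators=10, random_state=0)),
    ("bag_svm", BaggingClassifier(SVC(), n_estimators=25, random_state=0)),
]

# Level 2: a hard vote across the bagged ensembles gives the final label.
two_level = VotingClassifier(estimators=ensembles, voting="hard")
two_level.fit(X, y)
print(two_level.predict(X[:5]))
```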


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are the precise explanation and summary of legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issue discussed in the case. Headnotes comprise two parts: the first states the topic discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop and evaluate headnote prediction using machine learning, without human involvement. We divided this task into a two-step process. In the first step, we predict the law points used in the judgment by using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this task, we created a Databank by extracting data from different law sources in Pakistan. We labelled the training data based on Pakistani law websites. We tested different feature extraction methods on the judiciary data to improve our system. Using these feature extraction methods, we developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy by using Linear Support Vector Classification with tri-grams and without a stemmer. Using active learning, our system can continuously improve its accuracy as the users of the system provide more labelled examples.
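
A minimal sketch of the first step (law-point classification), assuming the reported best configuration means TF-IDF-weighted word n-grams up to tri-grams with no stemming feeding a LinearSVC; the judgment texts and labels below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

judgments = ["the appellant challenged the decree ...",
             "bail was granted under section ..."]      # placeholder corpus
law_points = ["civil procedure", "criminal procedure"]  # placeholder labels

clf = Pipeline([
    # no stemmer; uni- to tri-grams, per the reported best configuration
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("svc", LinearSVC()),
])
clf.fit(judgments, law_points)
print(clf.predict(["the court dismissed the appeal ..."]))
```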


2021 ◽  
pp. 42-51
Author(s):  
Muhammed J. A. Patwary ◽  
S. Akter ◽  
M. S. Bin Alam ◽  
A. N. M. Rezaul Karim

Bank deposit is one of the vital issues for any financial institution. It is very challenging to predict whether a customer will become a depositor by analyzing related information. Some recent reports demonstrate that economic depression and the continuous decline of the economy negatively impact business organizations and banking sectors. Due to such economic depression, banks cannot attract customers' attention. Thus, marketing is preferred to be a handy tool for the banking sector to draw customers' attention to a term deposit. The purpose of this paper is to study the performance of ensemble learning algorithms, a novel approach, to predict whether a new customer will open a term deposit or not. A Portuguese retail bank dataset is used for our study, containing 45,211 phone contacts with 16 input attributes and one decision attribute. The data are preprocessed using the discretization technique. 40,690 samples are used for training the classifiers, and 4,521 samples are used for testing. In this work, the performance of three widely used classification algorithms, namely Support Vector Machine (SVM), Neural Network (NN) and Naive Bayes (NB), is analyzed. Then the ability of ensemble methods to improve the efficiency of these basic classification algorithms is investigated and experimentally demonstrated. Experimental results exhibit that the performance metrics of Neural Network (Bagging) are higher than those of the other ensemble methods: its accuracy, sensitivity and specificity are 96.62%, 97.14% and 99.08%, respectively. Although all input attributes are considered in the classification method, a descriptive analysis shows that some input attributes are more important for this classification. Overall, it is shown that ensemble methods outperformed the traditional algorithms in this domain. We believe our contribution can be used as a depositor prediction system to provide additional support for bank deposit prediction.
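
A minimal sketch of the best-performing setup reported here, a bagged neural network, assuming the bank-marketing attributes have already been discretized and encoded; the arrays below are random placeholders sized to the reported 40,690/4,521 split:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(45211, 16)          # placeholder for 16 input attributes
y = np.random.randint(0, 2, 45211)     # placeholder deposit/no-deposit label

# 40,690 training and 4,521 test samples, matching the reported split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=4521,
                                          random_state=0)

nn_bagging = BaggingClassifier(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0),
    n_estimators=10, random_state=0)
nn_bagging.fit(X_tr, y_tr)
print("test accuracy:", nn_bagging.score(X_te, y_te))
```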


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hima Bindu Valiveti ◽  
Anil Kumar B. ◽  
Lakshmi Chaitanya Duggineni ◽  
Swetha Namburu ◽  
Swaraja Kuraparthi

Purpose Road accidents, inadvertent mishaps, can be detected automatically and alerts sent instantly through the collaboration of image processing techniques and on-road video surveillance systems. However, relying exclusively on visual information, especially under adverse conditions such as night time, dark areas and unfavourable weather (snowfall, rain and fog) that result in faint visibility, leads to uncertainty. The main goal of the proposed work is certainty of accident occurrence. Design/methodology/approach The authors of this work propose a method for detecting road accidents by analyzing audio signals to identify hazardous situations such as tire skidding and car crashes. The motive of this project is to build a simple and complete audio event detection system using signal feature extraction methods to improve its detection accuracy. The experimental analysis is carried out on a publicly available real-time data set consisting of audio samples such as car crashes and tire skidding. The temporal features of the recorded audio signal, such as Energy, Volume and Zero Crossing Rate (ZCR); the spectral features, such as Spectral Centroid, Spectral Spread, Spectral Roll-off factor and Spectral Flux; and the psychoacoustic features, Energy Sub-Bands ratio and Gammatonegram, are computed. The extracted features are pre-processed, then trained and tested using Support Vector Machine (SVM) and K-nearest neighborhood (KNN) classification algorithms for exact prediction of accident occurrence over various SNR ranges. The combination of the Gammatonegram with the temporal and spectral features of the audio signal proves to be superior compared to the existing detection techniques. Findings Temporal, spectral and psychoacoustic features and the Gammatonegram of the recorded audio signal are extracted. A high-level vector is generated based on the centroid, and the extracted features are classified with the help of machine learning algorithms such as SVM, KNN and DT. The audio samples collected have varied SNR ranges, and the accuracy of the classification algorithms is thoroughly tested. Practical implications Denoising of the audio samples for perfect feature extraction was a tedious chore. Originality/value The existing literature cites extraction of temporal and spectral features and then the application of classification algorithms. For perfect classification, the authors have chosen to construct a high-level vector from all four extracted feature groups: temporal, spectral, psychoacoustic and Gammatonegram. The classification algorithms are employed on samples collected at varied SNR ranges.
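
A minimal sketch of the frame-level feature extraction, computing three of the listed temporal and spectral features (energy, zero crossing rate, spectral centroid) with plain NumPy; the frame and hop sizes and the mean-pooled clip vector are illustrative assumptions:

```python
import numpy as np

sr = 16000
signal = np.random.randn(sr)  # placeholder one-second audio clip

def short_time_features(x, sr, frame=1024, hop=512):
    feats = []
    for start in range(0, len(x) - frame, hop):
        w = x[start:start + frame]
        energy = np.sum(w ** 2)                           # frame energy
        zcr = np.mean(np.abs(np.diff(np.sign(w))) > 0)    # zero crossing rate
        mag = np.abs(np.fft.rfft(w))
        freqs = np.fft.rfftfreq(frame, d=1 / sr)
        centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-12)
        feats.append([energy, zcr, centroid])
    return np.asarray(feats)

frames = short_time_features(signal, sr)
# One high-level vector per clip, here the centroid (mean) of the frame
# features, which can then be fed to an SVM or KNN classifier.
clip_vector = frames.mean(axis=0)
print(clip_vector)
```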


2019 ◽  
Vol 47 (3) ◽  
pp. 154-170
Author(s):  
Janani Balakumar ◽  
S. Vijayarani Mohan

Purpose Owing to the huge volume of documents available on the internet, text classification becomes a necessary task for handling these documents. To achieve optimal text classification results, feature selection, an important stage, is used to curtail the dimensionality of text documents by choosing suitable features. The main purpose of this research work is to classify personal computer documents based on their content. Design/methodology/approach This paper proposes a new algorithm for feature selection based on artificial bee colony (ABCFS) to enhance text classification accuracy. The proposed algorithm (ABCFS) is scrutinized on real and benchmark data sets and compared with other existing feature selection approaches such as information gain and the χ2 statistic. To justify the efficiency of the proposed algorithm, the support vector machine (SVM) and an improved SVM classifier are used in this paper. Findings The experiment was conducted on real and benchmark data sets. The real data set was collected in the form of documents stored on a personal computer, and the benchmark data set was collected from the Reuters and 20 Newsgroups corpora. The results prove the performance of the proposed feature selection algorithm by enhancing text document classification accuracy. Originality/value This paper proposes a new ABCFS algorithm for feature selection, evaluates its efficiency and improves the support vector machine. In this paper, the ABCFS algorithm is used to select features from text (unstructured) documents; in the existing work, there is no such text feature selection algorithm, as bee colony algorithms have only been used to select features from structured data. The proposed algorithm will classify the documents automatically based on their content.
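
A minimal sketch of wrapper-style feature selection in the spirit of ABCFS, with the bee-colony search simplified to random neighbour moves around the best subset found so far (an assumption, not the paper's exact algorithm); the data and classifier are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=40, random_state=0)
rng = np.random.default_rng(0)

def fitness(mask):
    """Cross-validated SVM accuracy of a candidate feature subset."""
    if not mask.any():
        return 0.0
    return cross_val_score(LinearSVC(max_iter=5000),
                           X[:, mask], y, cv=3).mean()

best = rng.random(X.shape[1]) < 0.5       # random initial subset
best_fit = fitness(best)
for _ in range(30):                       # "food source" exploration rounds
    neighbour = best.copy()
    flip = rng.integers(X.shape[1], size=3)
    neighbour[flip] = ~neighbour[flip]    # flip a few feature bits
    f = fitness(neighbour)
    if f > best_fit:                      # greedy replacement, as in ABC
        best, best_fit = neighbour, f
print(best_fit, int(best.sum()), "features selected")
```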


2017 ◽  
Vol 10 (2) ◽  
pp. 111-129 ◽  
Author(s):  
Ali Hasan Alsaffar

Purpose The purpose of this paper is to present an empirical study on the effect of two synthetic attributes on popular classification algorithms applied to data originating from student transcripts. The attributes represent past performance achievements in a course, defined as global performance (GP) and local performance (LP). The GP of a course is the aggregated performance achieved by all students who have taken the course, and the LP of a course is the aggregated performance achieved in the prerequisite courses by the student taking the course. Design/methodology/approach The paper uses Educational Data Mining techniques to predict student performance in courses, identifying the attributes that are the key influencers for predicting the final grade (performance) and reporting the effect of the two suggested attributes on the classification algorithms. As a research paradigm, the paper follows the Cross-Industry Standard Process for Data Mining using the RapidMiner Studio software tool. Six classification algorithms are experimented with: C4.5 and CART decision trees, Naive Bayes, k-nearest neighbors, rule-based induction and support vector machines. Findings The outcomes of the paper show that the synthetic attributes have positively improved the performance of the classification algorithms and have been ranked highly according to their influence on the target variable. Originality/value This paper proposes two synthetic attributes that are integrated into a real data set. The key motivation is to improve the quality of the data and make classification algorithms perform better. The paper also presents empirical results showing the effect of these attributes on selected classification algorithms.
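
A minimal sketch of how the two synthetic attributes could be derived, assuming a transcript table with (student, course, grade) columns and a prerequisite map; the column names and grade scale are illustrative:

```python
import pandas as pd

transcripts = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s3"],
    "course":  ["CS101", "CS201", "CS101", "CS201", "CS101"],
    "grade":   [3.0, 2.5, 3.5, 3.0, 2.0],
})
prereqs = {"CS201": ["CS101"]}  # CS201 requires CS101

# GP: aggregated performance of all students who took the course
gp = transcripts.groupby("course")["grade"].mean().rename("GP")

# LP: a student's aggregated performance in the course's prerequisites
def lp(row):
    pre = prereqs.get(row["course"], [])
    own = transcripts[(transcripts["student"] == row["student"])
                      & (transcripts["course"].isin(pre))]
    return own["grade"].mean() if len(own) else None

enriched = transcripts.join(gp, on="course")
enriched["LP"] = transcripts.apply(lp, axis=1)
print(enriched)
```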


2015 ◽  
Vol 49 (1) ◽  
pp. 2-22
Author(s):  
Jiunn-Liang Guo ◽  
Hei-Chia Wang ◽  
Ming-Way Lai

Purpose – The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents – e-books in an online library system. The main idea is to automatically identify discourse features in order to improve the feature selection process, rather than focussing on the size of the corpus. Design/methodology/approach – The proposed framework automatically identifies the discourse segments within e-books and captures proper discourse subtopics that are cohesively expressed in those segments, treating these subtopics as informative and prominent features. The selected set of features is then used to train and perform the e-book classification task based on the support vector machine technique. Findings – The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features leads to better performance, in comparison with two conventional feature selection techniques: TF-IDF and mutual information. It also demonstrates that discourse features play important roles among textual features, especially for large documents such as e-books. Research limitations/implications – Automatically extracted subtopic features cannot be entered directly into the FS process but require control of a threshold. Practical implications – The proposed technique demonstrates the promising application of discourse analysis to enhance the classification of large digital documents – e-books – as against conventional techniques. Originality/value – A new FS technique is proposed that can inspect the narrative structure of large documents, which is new to the text classification domain. The other contribution is that it inspires the consideration of discourse information in future text analysis by providing more evidence through evaluation of the results. The proposed system can be integrated into other library management systems.
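
A minimal sketch of the two conventional baselines the framework is compared against, TF-IDF weighting and mutual-information feature selection, feeding an SVM; the e-book snippets and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

ebooks = ["chapter one of a detective mystery novel",
          "an introduction to classical thermodynamics",
          "the detective follows a trail of clues",
          "heat engines and the laws of thermodynamics"]  # placeholders
labels = ["fiction", "science", "fiction", "science"]     # placeholders

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),                    # TF-IDF weighting
    ("mi", SelectKBest(mutual_info_classif, k=2)),   # keep top-k features
    ("svm", LinearSVC()),
])
clf.fit(ebooks, labels)
print(clf.predict(["a tale of two detectives"]))
```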


Text Classification plays a vital role in the world of data mining, and the same is true for classification algorithms in text categorization. There are many techniques for text classification, but this paper mainly focuses on three approaches: support vector machine (SVM), Naïve Bayes (NB) and k-nearest neighbor (k-NN). This paper reports the results of these classifiers on the mini-newsgroups data, which consists of a large number of documents, and walks through step-by-step tasks such as listing the files, preprocessing, creating terms (a specific subset of terms) and applying the classifiers to specific subsets of the dataset. Finally, after the experiments over the dataset, it is concluded that SVM achieves good classification output in terms of accuracy, precision, F-measure and recall, whereas execution time is best for the k-NN approach.
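
A minimal sketch of the three-way comparison, using scikit-learn's 20-newsgroups loader as a stand-in for the mini-newsgroups data and timing each classifier alongside its quality metrics:

```python
import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"])
X = TfidfVectorizer().fit_transform(data.data)
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, random_state=0)

for name, model in [("SVM", LinearSVC()),
                    ("NB", MultinomialNB()),
                    ("k-NN", KNeighborsClassifier())]:
    t0 = time.time()
    pred = model.fit(X_tr, y_tr).predict(X_te)   # train + predict, timed
    print(name, f"{time.time() - t0:.2f}s")
    print(classification_report(y_te, pred))
```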


Author(s):  
Wolfgang Drobetz ◽  
Tizian Otto

This paper evaluates the predictive performance of machine learning methods in forecasting European stock returns. Compared to a linear benchmark model, interactions and nonlinear effects help improve the predictive performance. But machine learning models must be adequately trained and tuned to overcome the high dimensionality problem and to avoid overfitting. Across all machine learning methods, the most important predictors are based on price trends and fundamental signals from valuation ratios. However, the models exhibit substantial variation in statistical predictive performance that translates into pronounced differences in economic profitability. The return and risk measures of long-only trading strategies indicate that machine learning models produce sizeable gains relative to our benchmark. Neural networks perform best, also after accounting for transaction costs. A classification-based portfolio formation, utilizing a support vector machine that avoids estimating stock-level expected returns, performs even better than the neural network architecture.
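
A minimal sketch of classification-based portfolio formation, assuming stocks are labelled by whether their forward return beats the cross-sectional median (so no stock-level expected returns are estimated); features and returns are simulated placeholders:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_stocks, n_features = 500, 10
X = rng.normal(size=(n_stocks, n_features))   # e.g. momentum, valuation
fwd_ret = X[:, 0] * 0.01 + rng.normal(scale=0.05, size=n_stocks)

# Binary label: outperform the cross-sectional median or not -- this
# sidesteps estimating stock-level expected returns directly.
y = (fwd_ret > np.median(fwd_ret)).astype(int)

train, test = np.arange(0, 400), np.arange(400, n_stocks)
svm = SVC(kernel="rbf").fit(X[train], y[train])

# Long-only portfolio: equal-weight the stocks predicted to outperform.
longs = test[svm.predict(X[test]) == 1]
print("portfolio size:", len(longs),
      "mean forward return:", fwd_ret[longs].mean())
```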


2021 ◽  
Vol 11 (3) ◽  
pp. 767-772
Author(s):  
Wenxian Peng ◽  
Yijia Qian ◽  
Yingying Shi ◽  
Shuyun Chen ◽  
Kexin Chen ◽  
...  

Purpose: Calcification nodules in the thyroid can be found in thyroid disease. Current clinical computed tomography systems can be used to detect calcification nodules. Our aim is to identify the nature of thyroid calcification nodules based on plain CT images. Method: Sixty-three patients with thyroid calcification nodules (36 benign and 27 malignant) were retrospectively analyzed, together with their computed tomography images and pathology findings. Regions of interest (ROI) of 64×64 pixels containing the calcification nodules were manually delineated by radiologists in plain CT images. We extracted thirty-one texture features from each ROI, and nineteen texture features were retained after feature optimization by logistic regression analysis. All the texture features were normalized to [0, 1]. Four classification algorithms, namely ensemble learning, support vector machine, K-nearest neighbor and decision tree, were used to identify benign and malignant nodules. Accuracy, PPV, NPV, SEN, SPC and AUC were calculated to evaluate the performance of the different classifiers. Results: Nineteen texture features were selected after feature optimization by logistic regression analysis (P < 0.05). Both ensemble learning and the support vector machine achieved the highest accuracy of 97.1%. The PPV, NPV, SEN and SPC are 96.9%, 97.4%, 98.4% and 95.0%, respectively. The AUC was 1. Conclusion: Texture features extracted from calcification nodules can be used as biomarkers to identify benign or malignant thyroid calcification.
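
A minimal sketch of the four-classifier comparison, assuming the nineteen selected texture features are already extracted (random placeholders here) and using a random forest as a stand-in for the unspecified ensemble learner:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((63, 19))                 # 63 nodules, 19 texture features
y = np.r_[np.zeros(36), np.ones(27)]     # 36 benign, 27 malignant

X = MinMaxScaler().fit_transform(X)      # normalize features to [0, 1]

models = {
    "ensemble": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
}
for name, m in models.items():
    proba = cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    print(name, "accuracy:", ((proba > 0.5) == y).mean(),
          "AUC:", round(roc_auc_score(y, proba), 3))
```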

