An novel cluster based feature selection and document classification model on high dimension trec data

2017 ◽  
Vol 7 (1.1) ◽  
pp. 466
Author(s):  
Lalitha Kumari ◽  
Ch. Satyanarayana

TREC text documents are difficult to analyze for features and relevant similar documents using traditional document similarity measures. As the TREC repository grows, finding relevant clustered documents in a large collection of unstructured documents becomes a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find essential features for document entities that are similar to the TREC documents. Also, most traditional models are applicable only to limited text document sets. The main issues with traditional text mining models on the TREC repository are: 1) each document is represented as a highly sparse vector; 2) they fail to capture the semantic similarity of documents within and between clusters; and 3) they have a high mean squared error rate. In this paper, a novel feature selection based clustering and classification model is proposed for a large number of different TREC repositories. Traditional latent semantic indexing and document clustering models fail to find topic relevance on a large number of TREC clinical text document sets because of their computational memory and time requirements. The proposed document feature selection and cluster based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than the existing models in terms of computational memory, accuracy, and error rate.

Healthcare ◽  
2021 ◽  
Vol 9 (7) ◽  
pp. 884
Author(s):  
Antonio García-Domínguez ◽  
Carlos E. Galván-Tejada ◽  
Ramón F. Brena ◽  
Antonio A. Aguileta ◽  
Jorge I. Galván-Tejada ◽  
...  

Children’s healthcare is a relevant issue, especially the prevention of domestic accidents, since it has even been defined as a global health problem. Children’s activity classification generally uses sensors embedded in children’s clothing, which can lead to erroneous measurements due to possible damage or mishandling. Having a non-invasive data source for a children’s activity classification model provides reliability to the monitoring system where it is applied. This work proposes the use of environmental sound as a data source for the generation of children’s activity classification models, implementing feature selection methods and classification techniques based on Bayesian networks, focused on the recognition of activities that can potentially trigger domestic accidents, applicable in child monitoring systems. Two feature selection techniques were used: the Akaike criterion and genetic algorithms. Likewise, models were generated using three classifiers: naive Bayes, semi-naive Bayes and tree-augmented naive Bayes. Most of the generated models, combining the feature selection methods and the classifiers used, present an accuracy greater than 97%, from which we conclude that the proposal of the present work is effective in recognizing activities that can potentially trigger domestic accidents.


Symmetry ◽  
2021 ◽  
Vol 13 (4) ◽  
pp. 556
Author(s):  
Thaer Thaher ◽  
Mahmoud Saheb ◽  
Hamza Turabieh ◽  
Hamouda Chantar

Fake or false information on social media platforms is a significant challenge that leads to deliberately misleading users due to the inclusion of rumors, propaganda, or deceptive information about a person, organization, or service. Twitter is one of the most widely used social media platforms, especially in the Arab region, where the number of users is steadily increasing, accompanied by an increase in the rate of fake news. This drew the attention of researchers to provide a safe online environment free of misleading information. This paper proposes a smart classification model for the early detection of fake news in Arabic tweets utilizing Natural Language Processing (NLP) techniques, Machine Learning (ML) models, and Harris Hawks Optimizer (HHO) as a wrapper-based feature selection approach. An Arabic Twitter corpus composed of 1862 previously annotated tweets was used to assess the efficiency of the proposed model. The Bag of Words (BoW) model is used with different term-weighting schemes for feature extraction. Eight well-known learning algorithms are investigated with varying combinations of features, including user-profile, content-based, and words-features. Reported results showed that the Logistic Regression (LR) with Term Frequency-Inverse Document Frequency (TF-IDF) model ranks best. Moreover, feature selection based on the binary HHO algorithm plays a vital role in reducing dimensionality, thereby enhancing the learning model’s performance for fake news detection. Interestingly, the proposed BHHO-LR model yields an improvement of 5% compared with previous works on the same dataset.


2020 ◽  
Vol 11 (1) ◽  
pp. 96
Author(s):  
Wen-Lan Wu ◽  
Meng-Hua Lee ◽  
Hsiu-Tao Hsu ◽  
Wen-Hsien Ho ◽  
Jing-Min Liang

Background: In this study, an automatic scoring system for the functional movement screen (FMS) was developed. Methods: Thirty healthy adults fitted with full-body inertial measurement unit sensors completed six FMS exercises. The system recorded kinematics data, and a professional athletic trainer graded each participant. To reduce the number of input variables for the predictive model, ordinal logistic regression was used for subset feature selection. The ensemble learning algorithm AdaBoost.M1 was used to construct classifiers. Accuracy and F score were used for classification model evaluation. The consistency between automatic and manual scoring was assessed using a weighted kappa statistic. Results: When all the features were used, the predictive model presented moderate to high accuracy, with kappa values ranging from fair to very good agreement. After feature selection, model accuracy decreased by about 10%, with kappa values ranging from poor to moderate agreement. Conclusions: The results indicate that higher prediction accuracy was achieved using the full feature set compared with using the reduced feature set.
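
The weighted kappa statistic mentioned in this abstract can be sketched in a few lines. The abstract does not say which weighting scheme was used, so the sketch below assumes linear disagreement weights; the function name `weighted_kappa` and the sample scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories):
    """Linear-weighted Cohen's kappa for two equal-length lists of ordinal labels.

    `categories` lists the ordinal labels in increasing order.
    Linear weights are an assumption; the abstract does not name the scheme.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Joint counts of (score by A, score by B) and each rater's marginal counts.
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 at the extremes
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical manual and automatic scores for six trials.
manual = [1, 2, 3, 1, 2, 3]
automatic = [1, 2, 3, 2, 2, 3]
print(weighted_kappa(manual, automatic, categories=[1, 2, 3]))
```

With these made-up scores, a single one-step disagreement out of six trials gives a kappa of about 0.8. Near misses are penalized less than distant ones, which is why a weighted form suits ordinal scores.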


Author(s):  
Ritam Guha ◽  
Manosij Ghosh ◽  
Pawan Kumar Singh ◽  
Ram Sarkar ◽  
Mita Nasipuri

In any multi-script environment, handwritten script classification is an unavoidable pre-requisite before the document images are fed to their respective Optical Character Recognition (OCR) engines. Over the years, this complex pattern classification problem has been solved by researchers proposing various feature vectors mostly having large dimensions, thereby increasing the computation complexity of the whole classification model. Feature Selection (FS) can serve as an intermediate step to reduce the size of the feature vectors by restricting them only to the essential and relevant features. In the present work, we have addressed this issue by introducing a new FS algorithm, called Hybrid Swarm and Gravitation-based FS (HSGFS). This algorithm has been applied over three feature vectors introduced in the literature recently: Distance-Hough Transform (DHT), Histogram of Oriented Gradients (HOG), and Modified log-Gabor (MLG) filter Transform. Three state-of-the-art classifiers, namely, Multi-Layer Perceptron (MLP), K-Nearest Neighbour (KNN), and Support Vector Machine (SVM), are used to evaluate the optimal subset of features generated by the proposed FS model. Handwritten datasets at block, text line, and word level, consisting of 12 officially recognized Indic scripts, are prepared for experimentation. An average improvement in the range of 2–5% is achieved in the classification accuracy by utilizing only about 75–80% of the original feature vectors on all three datasets. The proposed method also shows better performance when compared to some popularly used FS models. The codes used for implementing HSGFS can be found in the following Github link: https://github.com/Ritam-Guha/HSGFS.


Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 2910
Author(s):  
Kei Suzuki ◽  
Tipporn Laohakangvalvit ◽  
Ryota Matsubara ◽  
Midori Sugaya

In human emotion estimation using an electroencephalogram (EEG) and heart rate variability (HRV), there are two main issues as far as we know. The first is that measurement devices for physiological signals are expensive and not easy to wear. The second is that unnecessary physiological indexes have not been removed, which is likely to decrease the accuracy of machine learning models. In this study, we used a single-channel EEG sensor and a photoplethysmography (PPG) sensor, which are inexpensive and easy to wear. We collected data from 25 participants (18 males and 7 females) and used a deep learning algorithm to construct an emotion classification model based on Arousal–Valence space using several feature combinations obtained from physiological indexes selected based on our criteria, including our proposed feature selection methods. We then performed accuracy verification, applying a stratified 10-fold cross-validation method to the constructed models. The results showed that model accuracies are as high as 90% to 99% when the feature selection methods we proposed are applied, which suggests that a small number of physiological indexes, even from inexpensive sensors, can be used to construct an accurate emotion classification model if an appropriate feature selection method is applied. Our research results contribute to an emotion classification model with higher accuracy, lower cost, and less time consumption, which has the potential to be further applied to various areas of applications.
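
The stratified 10-fold cross-validation named in this abstract splits the data so that every fold keeps roughly the same class proportions as the full set. The following is a minimal sketch of such a split, not the authors' procedure; the function name `stratified_k_fold`, the seed, and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for stratified k-fold validation."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds,
        # so every fold receives a near-equal share of that class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical labels: 20 samples of one emotion class and 30 of another.
labels = ["high_arousal"] * 20 + ["low_arousal"] * 30
for train_idx, test_idx in stratified_k_fold(labels, k=10):
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, and a model would be trained and scored once per fold, with the ten scores averaged.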


2021 ◽  
Vol 5 (3) ◽  
pp. 527-533
Author(s):  
Yoga Religia ◽  
Amali Amali

The quality of an airline's services cannot be measured from the company's point of view, but must be seen from the point of view of customer satisfaction. Data mining techniques make it possible to predict airline customer satisfaction with a classification model. The Naïve Bayes algorithm has demonstrated outstanding classification accuracy, but its independence assumption is currently rarely discussed. Some literature suggests the use of attribute weighting to relax the independence assumption, which can be done using particle swarm optimization (PSO) and genetic algorithm (GA) through feature selection. This study compared PSO and GA optimization of Naïve Bayes for the classification of Airline Passenger Satisfaction data taken from www.kaggle.com. After testing, the best performance was obtained by the Naïve Bayes algorithm with PSO optimization, with an accuracy of 86.13%, a precision of 87.90%, a recall of 87.29%, and an AUC of 0.923.


Author(s):  
Danlei Xu ◽  
Lan Du ◽  
Hongwei Liu ◽  
Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, where a set of nonlinear mappings for the original data is performed as a pre-processing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct the nonlinear classification boundary, but also realize the feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings in our model, respectively. We derive the Variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results based on the synthetic data set, measured radar data set, high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability and comparable classification accuracy of our method compared with some other existing classifiers.


2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. Thus, this paper proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high dimensional text classification. This technique can measure a feature’s importance in a high-dimensional text document. In addition, it aims to increase the efficiency of feature selection and thereby obtain promising text classification accuracy. TF-IDF acts as a filter approach that measures the importance of features in the text documents at the first stage. SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets at the second stage. This research runs sets of experiments using text documents retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is implemented on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is applied for feature ranking as the next feature selection step. Only top-ranked features are selected for text classification using the SVM classifier. The experiments show that the proposed technique achieves 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high dimensional text documents.
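
The first (filter) stage described in this abstract, scoring each term by TF-IDF and keeping the highest-scoring ones, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how per-document scores are aggregated, so the sketch sums them over the corpus, and the function names `tfidf_scores` and `top_terms` and the sample posts are hypothetical. The second stage (SVM-RFE) is not shown.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return {term: TF-IDF summed over all documents} for tokenised documents."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in documents:
        if not tokens:
            continue  # an empty document contributes nothing
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)                   # relative term frequency
            idf = math.log(n_docs / doc_freq[term])    # 0 for terms in every document
            scores[term] += tf * idf
    return dict(scores)


def top_terms(documents, keep):
    """Filter stage: keep the `keep` terms with the highest summed TF-IDF."""
    scores = tfidf_scores(documents)
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:keep]


# Hypothetical pre-processed posts.
posts = [
    ["good", "flight", "good"],
    ["bad", "flight"],
    ["late", "flight", "bad"],
]
print(top_terms(posts, keep=2))
```

In this toy corpus the term "flight" occurs in every post, so its inverse document frequency is zero and the filter discards it. The surviving terms would then be passed to a recursive elimination step such as SVM-RFE.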


2010 ◽  
Vol 44-47 ◽  
pp. 1130-1134
Author(s):  
Sheng Li ◽  
Pei Lin Zhang ◽  
Bing Li

Feature selection is a key step in hydraulic system fault diagnosis. Some of the collected features are unrelated to the classification model, and some are highly correlated with other features. These features are harmful when establishing a classification model. To solve this problem, genetic algorithm-partial least squares (GA-PLS) is proposed for selecting representative and optimal features. The K nearest neighbor algorithm (KNN) is used for diagnosing and classifying hydraulic system faults. To demonstrate the performance of GA-PLS, the original data of a model engineering hydraulic system is used, and the results of GA-PLS are compared with those obtained using all features and using GA. The experimental results show that the proposed feature selection method can diagnose and classify hydraulic system faults more efficiently while using fewer features.
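
To illustrate how a selected feature subset feeds a K nearest neighbor classifier, here is a minimal sketch. It does not reproduce GA-PLS: the `selected` indices are simply given as input, and the function name `knn_predict`, the distance measure (Euclidean), and the sample data are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, selected, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    Distances are Euclidean and computed over the `selected` feature indices
    only, so discarded features have no influence on the result.
    """
    def distance(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in selected))

    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: distance(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical data: feature 0 separates the classes, feature 1 is noise.
train_X = [[0.1, 9.0], [0.2, 1.0], [0.9, 8.5], [1.0, 0.5]]
train_y = ["normal", "normal", "fault", "fault"]

print(knn_predict(train_X, train_y, [0.15, 0.6], selected=[0], k=3))
print(knn_predict(train_X, train_y, [0.95, 9.0], selected=[0], k=3))
```

Restricting the distance to the selected indices is what lets a smaller feature set reduce the cost of each prediction, since every query compares fewer values per training sample.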


2020 ◽  
Author(s):  
Yu Wang ◽  
Zaheer Ullah Khan ◽  
Shaukat Ali ◽  
Maqsood Hayat

Background: Bacteriophage, or phage, is a type of virus that replicates itself inside bacteria. It consists of genetic material surrounded by a protein structure. Bacteriophage plays a vital role in the domain of phage therapy and genetic engineering. Phage and hydrolase enzyme proteins have a significant impact on the cure of pathogenic bacterial infections and disease treatment. Accurate identification of bacteriophage proteins is important in the host subcellular localization for further understanding of the interaction between phage and hydrolases, and in designing antibacterial drugs. Given the significance of bacteriophage proteins, several computational models have been developed so far besides wet laboratory-based methods. However, their performance was not considerable due to inefficient feature schemes, redundancy, noise, and the lack of an intelligent learning engine. Therefore, we have developed an innovative bi-layered model named DeepEnzyPred. A hybrid feature vector was obtained via a novel Multi-Level Multi-Threshold subset feature selection (MLMT-SFS) algorithm. A two-dimensional convolutional neural network was adopted as a baseline classifier. Results: A conductive hybrid feature was obtained via a serial combination of CTD and KSAACGP features. The optimum feature was selected via a novel Multi-Level Multi-Threshold Subset Feature selection algorithm. Over 5-fold jackknife cross-validation, an accuracy of 91.6%, sensitivity of 63.39%, specificity of 95.72%, MCC of 0.6049, and ROC value of 0.8772 were recorded over Layer-1. Similarly, the underlying model obtained an accuracy of 96.05%, sensitivity of 96.22%, specificity of 95.91%, MCC of 0.9219, and ROC value of 0.9899 over Layer-2. Conclusion: A robust and effective classification model was developed for bacteriophages and their types. Primitive features were extracted via CTD and KSAACGP. A novel method (MLMT-SFS) was devised for yielding an optimum hybrid feature space out of primitive features. The results obtained with the hybrid feature space and 2D-CNN showed excellent classification performance. Based on the recorded results, we believe that the developed predictor will be a valuable resource for large scale discrimination of unknown phage and hydrolase enzymes in particular and new antibacterial drug design in pharmaceutical companies in general.
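
The evaluation measures reported in this abstract (accuracy, sensitivity, specificity, and MCC) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function name `binary_metrics` and the counts are hypothetical and are not the study's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
print(binary_metrics(tp=40, fp=5, tn=45, fn=10))
```

Because MCC uses all four cells, it stays moderate when accuracy is high but sensitivity is low, which is the pattern in the Layer-1 figures quoted above.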

