The Research on Tibetan Text Classification Based on N-Gram Model

2014 ◽  
Vol 543-547 ◽  
pp. 1896-1900
Author(s):  
Deng Zhou ◽  
Wen Huang He ◽  
Tao Tao Wu

Compared with traditional text classification models, the Tibetan text classification method studied here applies the N-Gram model at the word level. As a result, word segmentation is not required during classification, and feature selection and extensive pre-treatment processes are avoided. This paper examines N-Gram models in depth and discusses the selection of the parameter N in the model using a Naïve Bayes Multinomial classifier.
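
The idea of classifying without word segmentation can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes overlapping character n-grams as features, Laplace smoothing with a hypothetical `alpha` parameter, and toy Latin-alphabet strings in place of Tibetan text. The class and function names are illustrative only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (no segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramMultinomialNB:
    """Multinomial naive Bayes over n-gram counts (illustrative sketch)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram length, the parameter N discussed above
        self.alpha = alpha  # Laplace smoothing constant (assumption)

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)   # documents per class
        self.counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(cnt.values()) for c, cnt in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.class_docs:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_docs[label] / n_docs)
            denom = self.totals.get(label, 0) + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n):
                if gram not in self.vocab:
                    continue  # n-grams never seen in training are skipped
                score += math.log((self.counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data standing in for two document classes.
train_texts = ["aaaa abab", "abab aaaa", "zzzz zyzy", "zyzy zzzz"]
train_labels = ["A", "A", "Z", "Z"]
model = NGramMultinomialNB(n=2).fit(train_texts, train_labels)
```

Varying `n` in the constructor and comparing held-out accuracy is one simple way to explore the choice of N that the abstract refers to.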

2021 ◽  
Vol 25 (3) ◽  
pp. 509-525
Author(s):  
Maximiliano García ◽  
Sebastián Maldonado ◽  
Carla Vairetti

In this paper, we present a novel approach for n-gram generation in text classification. The a-priori algorithm is adapted to prune word sequences by combining three feature selection techniques. Unlike the traditional two-step approach for text classification in which feature selection is performed after the n-gram construction process, our proposal performs an embedded feature elimination during the application of the a-priori algorithm. The proposed strategy reduces the number of branches to be explored, speeding up the process and making the construction of all the word sequences tractable. Our proposal has the additional advantage of constructing a low-dimensional dataset with only the features that are relevant for classification, that can be used directly without the need for a feature selection step. Experiments on text classification datasets for sentiment analysis demonstrate that our approach yields the best predictive performance when compared with other feature selection approaches, while also facilitating a better understanding of the words and phrases that explain a given task; in our case online reviews and ratings in various domains.
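
The level-wise pruning described above can be sketched as follows. This is only an illustration of the a-priori principle, not the authors' method: the abstract does not name the three feature selection techniques, so a plain document-frequency threshold (the hypothetical `min_df`) stands in for them. A longer word sequence is counted only if both of its shorter sub-sequences survived the previous level, which is what limits the branches explored.

```python
from collections import Counter


def grow_ngrams(docs, max_n, min_df):
    """Grow word n-grams level by level, pruning as in the a-priori algorithm.

    docs   : list of token lists
    max_n  : longest n-gram to consider
    min_df : minimum document frequency (stand-in selection criterion)
    """
    kept = {}
    frontier = None  # n-grams that survived the previous level
    for n in range(1, max_n + 1):
        df = Counter()
        for tokens in docs:
            seen = set()
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # A candidate is explored only if its prefix and suffix survived.
                if frontier is None or (gram[:-1] in frontier and gram[1:] in frontier):
                    seen.add(gram)
            df.update(seen)  # each document counts at most once per n-gram
        level = {gram: count for gram, count in df.items() if count >= min_df}
        if not level:
            break  # nothing survived, so no longer sequence can survive either
        kept.update(level)
        frontier = set(level)
    return kept


docs = [
    ["not", "good", "at", "all"],
    ["not", "good", "service"],
    ["very", "good", "service"],
]
selected = grow_ngrams(docs, max_n=3, min_df=2)
```

With this toy input, `selected` contains the unigrams `not`, `good` and `service` and the bigrams `not good` and `good service`; no trigram reaches the threshold, so the search stops at that level.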


Author(s):  
Triyas Hevianto Saputro ◽  
Arief Hermawan

Sentiment analysis is a part of text mining used to extract information from a sentence or document. This study focuses on text classification for sentiment analysis of hospital reviews that customers post as criticism and suggestions on Google Maps Review. The collected texts still contain many nonstandard words, which cause problems in the preprocessing stage. The selection and combination of preprocessing techniques is therefore crucial for improving the accuracy of machine learning. However, not every preprocessing technique contributes to improving classification accuracy. The objective of this study is to improve the accuracy of a classification model for sentiment analysis of customers' hospital reviews. Combining preprocessing techniques can produce a highly accurate classification model. The study experimented with several preprocessing techniques: (1) tokenization, (2) case folding, (3) stop word removal, (4) stemming, and (5) removal of punctuation and numbers. The experiment then added two further preprocessing methods: (1) spelling correction and (2) Slang. The results show that the spelling correction and Slang methods help improve accuracy. Furthermore, selecting a suitable combination of preprocessing techniques can speed up the training process and produce a better text classification model.
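
A few of the preprocessing steps listed above can be chained as in the sketch below. It is a minimal illustration under stated assumptions, not the study's pipeline: the stop word list and the slang lookup table are tiny hypothetical examples, the sample review is in English, and stemming and spelling correction are left out because they need language-specific resources.

```python
import re

# Both tables are hypothetical and far smaller than anything usable in practice.
STOP_WORDS = {"the", "is", "a", "and", "but", "to"}
SLANG = {"gr8": "great", "thx": "thanks", "u": "you"}


def preprocess(text, use_slang=True):
    """Apply case folding, punctuation removal, tokenization,
    optional slang normalization, number removal and stop word removal."""
    text = text.lower()                        # case folding
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove punctuation
    tokens = text.split()                      # whitespace tokenization
    if use_slang:
        # Slang is normalized before number removal so "gr8" is not lost.
        tokens = [SLANG.get(tok, tok) for tok in tokens]
    tokens = [tok for tok in tokens if not tok.isdigit()]    # remove numbers
    return [tok for tok in tokens if tok not in STOP_WORDS]  # remove stop words


review = "Thx, the service is GR8 but room 12 was cold!"
cleaned = preprocess(review)
# cleaned == ["thanks", "service", "great", "room", "was", "cold"]
```

Switching `use_slang` on and off, and adding or removing individual steps, mirrors the kind of technique-combination comparison the abstract describes.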


Author(s):  
Narongsak Chayangkoon ◽  
Anongnart Srivihok

Methamphetamine addiction is a prominent problem in Southeast Asia. Drug addicts often discuss illegal activities on popular social networking services. These individuals spread messages on social media as a means of both buying and selling drugs online. This paper proposes a model, the “text classification model of methamphetamine tweets in Southeast Asia” (TMTA), to identify whether a tweet from Southeast Asia is related to methamphetamine abuse. The research addresses the weakness of bag of words (BoW) by introducing BoW and Word2Vec feature selection (BWF) techniques. A domain-based feature selection method was performed using the BoW dataset and Word2Vec. The BWF dataset provided a smaller number of features than the BoW and TF–IDF dataset. We experimented with three candidate classifiers: support vector machine (SVM), decision tree (J48) and naive Bayes (NB). We found that the J48 classifier with the BWF dataset provided the best performance for the TMTA in terms of accuracy (0.815), F-measure (0.818), Kappa (0.528), Matthews correlation coefficient (0.529) and high area under the ROC Curve (0.763). Moreover, TMTA provided the lowest runtime (3.480 seconds) using the J48 with the BWF dataset.
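
For readers unfamiliar with the reported measures, the sketch below shows how accuracy, F-measure, Cohen's kappa and the Matthews correlation coefficient follow from a binary confusion matrix. The counts are invented for illustration and are not the paper's results; the paper may also report class-weighted averages, which this sketch does not compute. The function name is hypothetical, and no guard against empty rows or columns is included.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "f_measure": f_measure, "kappa": kappa, "mcc": mcc}


# Invented counts, chosen only so the arithmetic is easy to follow.
scores = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Kappa and MCC both discount agreement that would occur by chance, which is why they can sit well below accuracy, as they do in the figures quoted above.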


2021 ◽  
Vol 906 (1) ◽  
pp. 012111
Author(s):  
Ingrid Znamenácková ◽  
Silvia Dolinská ◽  
Slavomír Hredzák ◽  
Vladimir Cablík

In mineral processing, the use of microwave radiation is important, especially in pre-treatment processes. At present, there is an acceleration of processes as well as an increase in the efficiency of metal recovery. One of the main problems in copper recovery from complex sulphide ores is the removal of impurities such as antimony, arsenic and mercury. In the hydrometallurgical processing scheme, the key step is leaching. The extraction process can be influenced by the selection of suitable leaching reagents or by suitable pre-treatment of the ore. The article describes the effect of microwave radiation on the leaching of Sb, As and Hg from tetrahedrite and tetrahedrite concentrate. The samples were irradiated at a power of 900 W for 30 seconds. The leaching of irradiated and non-irradiated samples was carried out in alkaline sodium sulphide. The positive effect of microwave radiation was confirmed by an increase in the recovery of Sb and As after only 15 min of extraction. After microwave leaching of irradiated tetrahedrite samples, the yield of Sb was 43.2 %; in irradiated tetrahedrite concentrate, the yield of Sb was 81.3 %.


2012 ◽  
Vol 57 (3) ◽  
pp. 829-835
Author(s):  
Z. Głowacz ◽  
J. Kozik

The paper describes a procedure for automatic selection of symptoms accompanying a break in the armature winding coils of a synchronous motor. This procedure, called feature selection, chooses from the full set of features describing the problem a subset that best distinguishes between healthy and damaged states. The amplitudes of the spectral components of the motor current signals were used as features. The full spectra of the current signals are treated as multidimensional feature spaces, and their subspaces are tested. Particular subspaces are chosen with the aid of a genetic algorithm, and their quality is assessed using the Mahalanobis distance measure. The algorithm searches for the subspaces for which this distance is greatest. The algorithm is very efficient and, as confirmed by research, leads to good results. The proposed technique is successfully applied in many other fields of science and technology, including medical diagnostics.
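
The subspace scoring idea can be sketched as below. Two things differ from the paper and are assumptions of this sketch: an exhaustive search over fixed-size subsets replaces the genetic algorithm (practical only for a handful of features), and the squared Mahalanobis distance between the two class means is computed with a pooled covariance matrix, which the abstract does not specify. The data are invented, with one discriminative feature and two noise features.

```python
from itertools import combinations


def mean(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_cov(a, b):
    """Pooled covariance matrix of two samples (assumed estimator)."""
    d = len(a[0])
    cov = [[0.0] * d for _ in range(d)]
    for rows in (a, b):
        m = mean(rows)
        for r in rows:
            for i in range(d):
                for j in range(d):
                    cov[i][j] += (r[i] - m[i]) * (r[j] - m[j])
    dof = len(a) + len(b) - 2
    return [[v / dof for v in row] for row in cov]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mahalanobis_sq(a, b, features):
    """Squared Mahalanobis distance between class means in a feature subspace."""
    sub_a = [[r[f] for f in features] for r in a]
    sub_b = [[r[f] for f in features] for r in b]
    diff = [x - y for x, y in zip(mean(sub_a), mean(sub_b))]
    weights = solve(pooled_cov(sub_a, sub_b), diff)  # S^-1 (mu_a - mu_b)
    return sum(d * w for d, w in zip(diff, weights))


def best_subset(a, b, k):
    """Exhaustive stand-in for the genetic search: best k-feature subspace."""
    return max(combinations(range(len(a[0])), k),
               key=lambda feats: mahalanobis_sq(a, b, feats))


# Invented spectral amplitudes: feature 0 separates the states, 1 and 2 do not.
healthy = [[1.0, 5.0, 2.0], [1.2, 4.0, 3.0], [0.8, 6.0, 2.5], [1.0, 5.5, 3.5]]
faulty = [[3.0, 5.2, 2.4], [3.2, 4.1, 3.1], [2.8, 5.9, 2.6], [3.0, 5.4, 3.3]]
chosen = best_subset(healthy, faulty, 1)
```

Here `chosen` is `(0,)`, the only feature whose class means differ by much relative to its spread. A genetic algorithm becomes worthwhile when the spectrum has too many components for every subset to be enumerated.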

