Albanian Text Classification: Bag of Words Model and Word Analogies

2019 ◽  
Vol 10 (1) ◽  
pp. 74-87
Author(s):  
Arbana Kadriu ◽  
Lejla Abazi ◽  
Hyrije Abazi

Abstract. Background: Text classification is a very important task in information retrieval. Its objective is to classify new text documents into a set of predefined classes using different supervised algorithms. Objectives: We focus on text classification for Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remainder. In the second approach, words are treated according to their semantic and syntactic similarities, assuming each word is formed from character n-grams. In this case we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag-of-words model outperforms fastText when the dataset being classified is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with the bag-of-words model, with an accuracy of 94%.
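A minimal sketch of the bag-of-words approach described above, assuming a list of news article texts and their category labels (placeholders, not the authors' Albanian dataset) and showing a few of the scikit-learn classifiers on an 80/20 split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

texts = ["..."]   # news article bodies (placeholder)
labels = ["..."]  # category per article (placeholder)

# Each word becomes an independent dimension in the vector space.
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42)

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{clf.__class__.__name__}: {acc:.3f}")
```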

2020 ◽  
Vol 2020 ◽  
pp. 1-16 ◽  
Author(s):  
Heyong Wang ◽  
Dehang Zeng

With the development of computer science and information science, text classification technology has developed greatly and its application scenarios have widened. In the traditional text classification process, existing methods lose much of the logical-relationship information of a text. The logical-relationship information of a text refers to the relationships among its different logical parts, such as the title, abstract, and body. When reading, humans take the title as an important cue to the central idea of the article, the abstract as a brief summary of its content, and the body as its detailed description. Most text classification studies concentrate on relationships among words (word frequency, semantics, etc.) and neglect the logical-relationship information of the text, losing information about the relationships among its different parts (title, body, etc.) and hurting classification performance. Therefore, we propose a text classification algorithm that fuses the logical relationship information of text in a neural network (FLRIOTINN), complementing text classification algorithms with this information. Experiments show that FLRIOTINN outperforms conventional backpropagation neural networks that do not consider the logical relationship information of text.
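The paper's exact FLRIOTINN architecture is not given in this abstract; the following is a minimal sketch of the general idea under the assumption that each logical part (title, body) is vectorized separately, encoded by its own layer, and fused before classification. Layer sizes, input dimensions, and the class count are illustrative assumptions:

```python
import tensorflow as tf

title_in = tf.keras.Input(shape=(1000,), name="title_vec")  # e.g. TF-IDF of title
body_in = tf.keras.Input(shape=(5000,), name="body_vec")    # e.g. TF-IDF of body

# Encode each logical part separately so the model can weight them differently.
title_h = tf.keras.layers.Dense(64, activation="relu")(title_in)
body_h = tf.keras.layers.Dense(128, activation="relu")(body_in)

# Fuse the parts: concatenation preserves which part each feature came from.
fused = tf.keras.layers.Concatenate()([title_h, body_h])
out = tf.keras.layers.Dense(10, activation="softmax")(fused)  # 10 classes (assumed)

model = tf.keras.Model(inputs=[title_in, body_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```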


Author(s):  
Lars Werner

Text documents stored in information systems usually contain more information than the pure concatenation of words; they also contain typographic information. Because conventional text retrieval methods evaluate only word frequency, they miss the information provided by typography, e.g., regarding the importance of certain terms. To overcome this weakness, we present an approach which uses the typographic information of text documents and show how it improves the efficiency of text retrieval methods. Our approach weights typographic information in addition to term frequencies to separate relevant information in text documents from noise. We have evaluated our approach using automated text classification algorithms. The results show that our weighting approach achieves very competitive classification results using at most 30% of the terms used by conventional approaches, which makes it significantly more efficient.
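A minimal sketch of typography-aware term weighting in the spirit of this approach, assuming tokens arrive tagged with their typographic role; the role weights below are illustrative assumptions, not the paper's calibrated values:

```python
from collections import Counter

# Hypothetical weights: typographically emphasized terms count for more.
TYPO_WEIGHT = {"heading": 3.0, "bold": 2.0, "italic": 1.5, "body": 1.0}

def weighted_term_scores(tokens):
    """tokens: iterable of (term, role) pairs, e.g. ("retrieval", "heading")."""
    scores = Counter()
    for term, role in tokens:
        scores[term] += TYPO_WEIGHT.get(role, 1.0)
    return scores

doc = [("text", "heading"), ("retrieval", "heading"),
       ("text", "body"), ("noise", "body"), ("typography", "bold")]
# Keep only the highest-scoring terms, discarding low-weight noise terms.
top_terms = [t for t, _ in weighted_term_scores(doc).most_common(3)]
print(top_terms)
```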


2019 ◽  
Vol 18 (03) ◽  
pp. 1950033
Author(s):  
Madan Lal Yadav ◽  
Basav Roychoudhury

One can use either machine learning techniques or lexicons to undertake sentiment analysis. Machine learning techniques include text classification algorithms like SVM, naive Bayes, decision tree, or logistic regression, whereas lexicon-based sentiment analysis uses either general or domain-specific lexicons. In this paper, we investigate the effectiveness of domain lexicons vis-à-vis a general lexicon, performing aspect-level sentiment analysis on data from three different domains, viz. car, guitar, and book. While it is intuitive that domain lexicons will always outperform general lexicons, actual performance may depend on the richness of the domain lexicon concerned as well as on the text analysed. We used the general lexicon SentiWordNet and the corresponding domain lexicons in the aforesaid domains to compare their relative performance. The results indicate that a domain lexicon used together with the general lexicon performs better than either used alone. They also suggest that the performance of domain lexicons depends on the text content, and on whether the language involves technical or non-technical words in the domain concerned. This paper makes a case for developing domain lexicons across various domains for improved performance, while acknowledging that they might not always perform better. It further highlights that the importance of general lexicons cannot be overstated: the best results for aspect-level sentiment analysis in this paper are obtained when the domain and general lexicons are used side by side.
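A minimal sketch of combining a domain lexicon with the general SentiWordNet lexicon, assuming a simple dict-based domain lexicon with hypothetical entries (the paper's lexicons and scoring scheme may differ). Requires nltk with the "wordnet" and "sentiwordnet" corpora downloaded:

```python
from nltk.corpus import sentiwordnet as swn

car_lexicon = {"sluggish": -0.8, "responsive": 0.7}  # hypothetical domain entries

def word_sentiment(word, domain_lexicon):
    # Prefer the domain lexicon; fall back to SentiWordNet's general scores.
    if word in domain_lexicon:
        return domain_lexicon[word]
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    s = synsets[0]  # crude: take the first (most common) sense
    return s.pos_score() - s.neg_score()

print(word_sentiment("sluggish", car_lexicon))  # scored by the domain lexicon
print(word_sentiment("good", car_lexicon))      # scored by the general lexicon
```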


Author(s):  
Sarmad Mahar ◽  
Sahar Zafar ◽  
Kamran Nishat

Headnotes are precise explanations and summaries of the legal points in an issued judgment. Law journals hire experienced lawyers to write these headnotes, which help the reader quickly determine the issues discussed in a case. Headnotes comprise two parts: the first states the topics discussed in the judgment, and the second contains a summary of that judgment. In this thesis, we design, develop, and evaluate headnote prediction using machine learning, without human involvement. We divided this task into a two-step process. In the first step, we predict the law points used in the judgment using text classification algorithms. The second step generates a summary of the judgment using text summarization techniques. To achieve this, we created a databank by extracting data from different law sources in Pakistan and labelled the training data based on Pakistani law websites. We tested different feature extraction methods on the judiciary data to improve our system and, using these methods, developed a dictionary of terminology for ease of reference and utility. Our approach achieves 65% accuracy using Linear Support Vector Classification with tri-grams and without stemming. Using active learning, the system can continuously improve its accuracy as users provide more labelled examples.
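A minimal sketch of the reported best configuration for the first step, a linear SVC over tri-gram features with no stemming. Whether the tri-grams were word- or character-level is not stated in the abstract; word n-grams up to length 3 are assumed here, and the data variables are placeholders:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

judgments = ["..."]   # judgment texts (placeholder)
law_points = ["..."]  # labelled law point per judgment (placeholder)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # uni- to tri-grams, no stemmer
    ("svc", LinearSVC()),
])
clf.fit(judgments, law_points)
```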


2011 ◽  
Vol 268-270 ◽  
pp. 697-700
Author(s):  
Rui Xue Duan ◽  
Xiao Jie Wang ◽  
Wen Feng Li

As the volume of short text documents online grows tremendously, organizing short texts well has become an urgent task. However, traditional feature selection methods are not suitable for short texts. In this paper, we propose a method to incorporate syntactic information for short texts, emphasizing features that have more dependency relations with other words. The SVM classifier and the Weka machine learning environment are used in our experiments. The results show that by incorporating syntactic information into short texts, we obtain more powerful features than with traditional feature selection methods such as DF and CHI. The precision of short text classification improved from 86.2% to 90.8%.
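A minimal sketch of the core idea, emphasizing words by the number of dependency relations they participate in. spaCy and its English model stand in here for whatever parser the authors used, and the weighting scheme is an illustrative assumption:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_weights(text):
    """Weight each word by how many dependency arcs touch it."""
    doc = nlp(text)
    weights = {}
    for tok in doc:
        if tok.is_alpha:
            # one arc to the head (unless root) plus one arc per dependent
            n_arcs = (0 if tok.dep_ == "ROOT" else 1) + len(list(tok.children))
            weights[tok.text.lower()] = weights.get(tok.text.lower(), 0) + n_arcs
    return weights

print(dependency_weights("The classifier ranks short text features well."))
```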


2019 ◽  
Vol 3 (2) ◽
pp. 579-581
Author(s):  
Nida Zafar Khan ◽  
Prof. S. R. Yadav

2021 ◽  
Vol 3 (4) ◽  
pp. 922-945
Author(s):  
Shaw-Hwa Lo ◽  
Yiqiao Yin

Text classification is a fundamental language task in Natural Language Processing. A variety of sequential models can make good predictions, yet there is a lack of connection between language semantics and prediction results. This paper proposes a novel influence score (I-score), a greedy search algorithm called the Backward Dropping Algorithm (BDA), and a novel feature engineering technique called the "dagger technique". First, the paper proposes using the I-score to detect and search for the important language semantics in text documents that are useful for making good predictions in text classification tasks. Next, the BDA, a greedy search algorithm, is proposed to handle long-term dependencies in the dataset. Moreover, the paper proposes the "dagger technique", which fully preserves the relationship between the explanatory variable and the response variable. The proposed techniques can be generalized to any feed-forward Artificial Neural Network (ANN) or Convolutional Neural Network (CNN). In a real-world application on the Internet Movie Database (IMDB), the proposed methods improve prediction performance with an 81% error reduction relative to popular peer methods that do not implement the I-score and "dagger technique".
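A minimal sketch of a partition-based influence score and a backward-dropping search in the spirit of the I-score and BDA described above. Samples are partitioned by the joint values of a discrete feature subset, and cells whose mean response deviates from the global mean contribute quadratically; the normalization shown is one common convention and the paper's exact definitions may differ:

```python
import numpy as np

def i_score(X_subset, y):
    """X_subset: (n, k) array of discrete features; y: (n,) binary response."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    # Group samples into cells by their joint feature values.
    cells = {}
    for row, resp in zip(map(tuple, np.asarray(X_subset)), y):
        cells.setdefault(row, []).append(resp)
    n = len(y)
    return sum(len(c) ** 2 * (np.mean(c) - y_bar) ** 2
               for c in cells.values()) / n

def bda(X, y, features):
    """Greedily drop the feature whose removal gives the highest I-score,
    returning the best-scoring subset seen along the dropping path."""
    features = list(features)
    best = list(features)
    while len(features) > 1:
        scores = {f: i_score(X[:, [g for g in features if g != f]], y)
                  for f in features}
        drop = max(scores, key=scores.get)
        features = [f for f in features if f != drop]
        if i_score(X[:, features], y) > i_score(X[:, best], y):
            best = list(features)
    return best
```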


2020 ◽  
pp. 3397-3407
Author(s):  
Nur Syafiqah Mohd Nafis ◽  
Suryanti Awang

Text documents are unstructured and high dimensional, so effective feature selection is required to pick the most important and significant features from the sparse feature space. This paper therefore proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. The technique can measure a feature's importance in a high-dimensional text document and aims to increase the efficiency of feature selection, thereby obtaining promising text classification accuracy. In the first stage, TF-IDF acts as a filter that measures the importance of the features in the text documents. In the second stage, SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subsets. The experiments use text documents retrieved from a benchmark repository comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features, after which the pre-processed features are divided into training and testing datasets. Feature selection is then implemented on the training dataset by calculating the TF-IDF score for each feature, and SVM-RFE is applied for feature ranking as the next selection step. Only the top-ranked features are selected for text classification with the SVM classifier. The experiments show that the proposed technique achieves 98% accuracy, outperforming other existing techniques, and is able to select the significant features in unstructured and high-dimensional text documents.
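A minimal sketch of the two-stage selection described above: TF-IDF vectorization as the filter stage, then SVM-RFE to recursively eliminate the weakest features. The corpus, labels, and feature counts are placeholders, not the paper's benchmark data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

tweets = ["..."]  # Twitter posts (placeholder)
labels = [0]      # class labels (placeholder)

X = TfidfVectorizer().fit_transform(tweets)  # stage 1: TF-IDF scores as filter
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)

# Stage 2: recursively drop the lowest-weighted features under a linear SVM.
selector = RFE(LinearSVC(), n_features_to_select=500, step=0.1)
selector.fit(X_tr, y_tr)

clf = LinearSVC().fit(selector.transform(X_tr), y_tr)
print(accuracy_score(y_te, clf.predict(selector.transform(X_te))))
```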

